Policy Proposal

Platforms Should Use Algorithms to Help Users Help Themselves

Social media platforms generally rely on human moderation to remove prohibited content. Yet what if moderation could happen before content is even posted?

by Christopher Paul and Hilary Reininger
Published on July 20, 2021

All social media platforms are based on the posting and sharing of user-generated content. The trouble starts when that content is false, harmful, or both. Each platform has specific rules against objectionable content, but enforcing these rules can be extremely difficult at scale.1

Social media users generate massive volumes of content, which then spreads at extraordinary speeds. Yet platforms generally rely on a slow process of human moderation to remove prohibited content. In most cases, moderators review content only after it has already been posted and then identified as potentially objectionable (either by other users or an algorithm).2 This post hoc process means that hateful, violent, or false material can spread wildly before it is flagged, reviewed, and finally removed.

What if moderation could happen before the content is even posted? There is a way: platforms could build systems that prompt users to self-moderate before they post objectionable content. Platforms are cautiously experimenting with this approach but have only done so in a few narrow contexts (like when users want to share content that platforms have already determined is false) and haven’t yet applied state-of-the-art technology. Platforms should prompt users regarding a wide range of problematic content—from hate speech to harassment—that would be identified using artificial intelligence. The technology already exists; major platforms only need to tailor, scale up, refine, and employ it.

The Post Hoc Moderation Process

Platforms have long relied on users to help them identify harmful content. Most platforms have buttons allowing users to report inappropriate content, which then cues platform moderators to review it for possible removal. While this approach has led to the removal of large quantities of prohibited content, the material may already have spread widely by the time it is removed. Further, users sometimes report content simply because they disagree with it; this abuse of the function can harass or silence others and clutter the review queue for platform moderators.

To supplement user reporting, platforms have algorithms that flag content for human review. Several platforms currently use image recognition tools and natural language processing classifiers to help moderators filter and prioritize potentially objectionable content for evaluation.3 Facebook, YouTube, and Twitter all use some form of algorithmic process to support post hoc content moderation.4 In a few areas, like child sex abuse material and copyright infringement, algorithmic screening tools (such as PhotoDNA and Content ID) are accurate enough to remove content automatically, without human intervention.5 But in most other areas, human review is still needed to interpret and apply complex platform rules and consider the context of individual posts.6
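
As a rough illustration of this division of labor, the sketch below combines automated screening with human review: content matching a database of hashes of known prohibited material is removed automatically (the role tools like PhotoDNA play), while content flagged by a classifier is queued for a human moderator. The hash set, keyword "classifier," and queue are placeholder stand-ins, not a description of how any platform actually implements this.

    import hashlib
    from collections import deque

    # Hypothetical stand-ins: real systems use perceptual hashes (e.g., PhotoDNA) and trained
    # classifiers; the hash set, keyword list, and queue below are placeholders for illustration.
    KNOWN_PROHIBITED_HASHES = {"5f4dcc3b5aa765d61d8327deb882cf99"}  # example digest, not real content
    FLAG_TERMS = {"attack", "threat"}        # toy substitute for an NLP classifier
    human_review_queue = deque()             # moderators work through this queue post hoc

    def screen(content_bytes: bytes, text: str, post_id: str) -> str:
        """Route a post: auto-remove on a known-hash match, queue for review on a classifier flag."""
        digest = hashlib.md5(content_bytes).hexdigest()
        if digest in KNOWN_PROHIBITED_HASHES:
            return "removed_automatically"   # exact matches to known prohibited material
        score = sum(term in text.lower() for term in FLAG_TERMS) / len(FLAG_TERMS)
        if score > 0:                        # the "classifier" flags the post
            human_review_queue.append((post_id, score))
            return "queued_for_human_review"
        return "published"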

Platforms face similar limitations in their efforts to label false or misleading content or redirect users to more accurate information. These systems, too, are mostly post hoc. Humans must first flag disputed content and assess it before it can be labeled.

The Power of Prompts

Some platforms have begun experimenting with prompts that encourage users to moderate or fact-check their own draft content before it is posted.

In June 2020, Facebook began to prompt users to confirm their desire to share content that is over ninety days old (which may be outdated or misleading). In August, it expanded the system to ask users seeking to share information related to COVID-19 to confirm their intent and view additional credible information.7 In October 2020, Twitter began asking users to view credible information on a topic before amplifying content labeled as misleading.8 Twitter also rolled out a feature that prompts users to consider reading an article before retweeting it.9 TikTok has announced plans to warn users about unsubstantiated or unverified content, with a prompt to confirm when a user seeks to share such content.10

Such prompts have at least three virtues. First, they may help users pause and engage in what Daniel Kahneman calls “system 2” thinking—higher-level cognitive reflection.11 In fact, research has shown that pop-up warnings requiring user interaction to dismiss them can positively change user behavior.12 Second, if such self-moderation occurs, it would be in advance of posting, before potentially harmful material can spread. Finally, these prompts preserve users’ freedom of expression, as they allow users to ignore the warnings and post the questionable material anyway.

Yet despite their obvious benefits, platforms use prompts in only a few narrow contexts, such as COVID-19 misinformation. Platforms also tend to rely on crude heuristics to trigger prompts, such as whether content is more than ninety days old or has already been flagged as misleading. For these reasons, current prompts apply to only a tiny fraction of material that violates platform policies. Given the scale of the content moderation challenge, platforms need more comprehensive and flexible systems. Technology can help.

Adopting Algorithmic Prompts

AI can help platforms identify a much larger set of posts that potentially violate a wide range of community standards; these posts would then trigger a prompt to the user. Using platforms’ archives of previously removed content, algorithms could be trained to assess the likelihood that new content violates community standards regarding hate speech, violence, disinformation, or other forms of harm. If a draft post met a certain threshold (for example, a 75 percent likelihood of violating platform policies), then the user would see a prompt that would say something like, “Warning: this content has been flagged by our algorithms as potentially violating our policy banning [specific violation]. Press ‘confirm’ to post, or press ‘edit’ to revise.”
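
A minimal sketch of this approach follows, in Python with scikit-learn. The tiny training set stands in for a platform’s archive of removed content, the 0.75 threshold and prompt wording follow the example above, and everything else is illustrative rather than a description of any platform’s actual system.

    from typing import Optional

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-in for a platform's archive of removed vs. allowed posts (label 1 = violated policy).
    train_texts = [
        "[slur] people like you don't belong anywhere near here",
        "everyone from that group is subhuman and should disappear",
        "great game last night, congrats to the whole team",
        "does anyone have a good pasta recipe to share?",
    ]
    train_labels = [1, 1, 0, 0]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(train_texts, train_labels)

    THRESHOLD = 0.75  # the example threshold from the text above

    def check_draft(draft: str) -> Optional[str]:
        """Return a warning prompt if the draft likely violates policy, otherwise None."""
        p_violation = model.predict_proba([draft])[0][1]  # probability of the "violation" class
        if p_violation >= THRESHOLD:
            return ("Warning: this content has been flagged by our algorithms as potentially "
                    "violating our policy banning hate speech. "
                    "Press 'confirm' to post, or press 'edit' to revise.")
        return None  # no prompt; the draft posts normally

In practice, such a model would presumably be trained on very large archives of labeled moderation decisions for each policy area, with the threshold tuned per violation type to balance erroneous prompts against missed ones.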

Of course, these algorithms would not be perfectly accurate, but that would be okay. If the user disagrees with the warning, they could simply confirm their desire to post the content. In that case, the post could be cued for eventual post hoc human review. And platforms would likely face less blowback for subsequently banning or sanctioning users who were given fair warning that their content might violate platform policies and chose to post it anyway.
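
One hypothetical way to wire up the confirm path is sketched below: the post is published immediately, but the acknowledged warning is recorded and the post is queued for post hoc review, giving the platform a documented basis for later enforcement. The names and structure here are assumptions for illustration, not an existing platform feature.

    from collections import deque
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    review_queue = deque()  # posts awaiting post hoc human review

    @dataclass
    class FlaggedPost:
        """A post published after the author confirmed past an algorithmic warning."""
        post_id: str
        text: str
        violation_probability: float
        warning_acknowledged: bool
        created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def publish_despite_warning(post_id: str, text: str, p_violation: float) -> None:
        """User pressed 'confirm': publish now, but log the acknowledgment and queue a review."""
        review_queue.append(FlaggedPost(post_id, text, p_violation, warning_acknowledged=True))
        # ...publish the post to the feed here; human review happens post hoc...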

Similar AI technology already exists and has been successfully applied to analogous contexts. Several news outlets have used Jigsaw’s Perspective API algorithm to detect potentially “toxic” user-generated comments on articles and then prompt users to reconsider before posting. This approach has resulted in reductions in toxicity and acrimony in comment threads: 40 percent of Vox readers who received a prompt about their comments chose to change them.13 Major social media platforms could use similar algorithms, training them to spot violations of their specific community standards.
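
For instance, a commenting system could query the Perspective API along the following lines. The endpoint and request format in this sketch follow Jigsaw’s public documentation, while the API key, toxicity threshold, and prompt wording are placeholders.

    import requests

    API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder; keys are issued through Google Cloud
    URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
           f"comments:analyze?key={API_KEY}")

    def toxicity_score(comment_text: str) -> float:
        """Return Perspective's TOXICITY summary score (0 to 1) for a draft comment."""
        body = {
            "comment": {"text": comment_text},
            "requestedAttributes": {"TOXICITY": {}},
        }
        response = requests.post(URL, json=body, timeout=5)
        response.raise_for_status()
        return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

    if __name__ == "__main__":
        draft = "You are an idiot and everyone here hates you."
        if toxicity_score(draft) >= 0.8:  # 0.8 is an arbitrary illustrative threshold
            print("This comment may come across as hostile. Edit before posting?")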

In fact, Facebook reports that its algorithms already detect 97 percent of the hate speech it ultimately removes, flagging it proactively before users report it. The same technology could be used to generate user prompts before the hate speech is ever posted.14

Challenges Remain

We don’t know exactly why platforms have so far declined to implement such prompts, but adoption would come with some challenges. While the algorithms do not need to be perfect, they do need to be reasonably good. Too many erroneous prompts could cause users to begin ignoring them or lose trust in the platform’s broader moderation regime. Additionally, algorithms would need to work in multiple languages and global contexts. The accuracy of various AI tools for identifying objectionable content depends heavily on the context and the type of content being detected, so these tools might not succeed for all types of moderation.15 But the systems could be gradually expanded and refined over time.

Also, the algorithms would need to operate almost instantaneously to avoid a delay between the user pressing “submit” and the determination to either show a prompt or allow the post to proceed. A delay long enough for users to notice would likely provoke an uproar. One approach would be to have the algorithm work in the background as users prepare their content, so that a determination is nearly ready when the post is finished. Jigsaw’s example suggests that prompting systems with negligible delays are feasible in at least some contexts, though platforms like Facebook operate at a much larger scale than newspaper comment sections, so their implementations might be more demanding.
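
The sketch below illustrates one such background-scoring pattern, using Python’s asyncio. The scoring function is a stand-in for a real model call, and the class structure is an assumption about how a composer might be wired, not a description of any platform’s implementation.

    import asyncio
    import random
    from typing import Optional

    async def score_draft(text: str) -> float:
        """Stand-in for the real model call; returns a probability of policy violation."""
        await asyncio.sleep(0.2)      # simulated inference latency
        return random.random()        # placeholder score

    class DraftScorer:
        """Keep a fresh score for the latest draft so pressing 'submit' rarely has to wait."""

        def __init__(self) -> None:
            self._task: Optional[asyncio.Task] = None
            self._scored_text = ""
            self._score = 0.0

        def on_pause_in_typing(self, text: str) -> None:
            """Called whenever the user pauses typing: rescore the draft in the background."""
            if self._task and not self._task.done():
                self._task.cancel()   # drop the now-stale scoring job
            self._task = asyncio.create_task(self._run(text))

        async def _run(self, text: str) -> None:
            self._score = await score_draft(text)
            self._scored_text = text

        async def on_submit(self, text: str) -> float:
            """Return a score for the submitted text, waiting only if the draft changed."""
            if text != self._scored_text:          # user edited after the last completed score
                self._score = await score_draft(text)
                self._scored_text = text
            return self._score

    async def _demo() -> None:
        scorer = DraftScorer()
        scorer.on_pause_in_typing("a draft the user is composing")
        await asyncio.sleep(0.3)                   # the user keeps writing; scoring finishes meanwhile
        print(await scorer.on_submit("a draft the user is composing"))

    if __name__ == "__main__":
        asyncio.run(_demo())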

Finally, there is a risk of exploitation by bad actors. Those who intentionally and willfully post misleading or dangerous material will not be deterred by an algorithmic warning. Instead, they could use the warnings to help them craft harmful posts that fall just below the threshold of algorithmic detection. Such posts would then escape human moderation until users manually flagged them. Yet while bad actors might be able to game the system in the short term, the platforms will be playing the game too. Any cases of exploitation, once identified, could be used to refine algorithmic thresholds.

Conclusion

Moderating user-generated content at scale may seem like an impossible task. But that is only true if platforms continue to rely mainly on post hoc moderation. Another option exists: platforms can help users help themselves. User prompts act immediately, are reasonably effective at reducing the spread of harmful content, and preserve users’ freedom of expression. The precedent for user prompts already exists, and the technology needed to expand them into new contexts is available. All that remains is for platforms to take action.

Christopher Paul is a senior social scientist at the nonprofit, nonpartisan RAND Corporation and professor at the Pardee RAND Graduate School.

Hilary Reininger is an assistant policy analyst at the RAND Corporation and a doctoral student at the Pardee RAND Graduate School.

Notes

1 For an overview of platform community standards, see Jon Bateman, Natalie Thompson, and Victoria Smith, “How Social Media Platforms’ Community Standards Address Influence Operations,” Carnegie Endowment for International Peace, April 1, 2021, https://carnegieendowment.org/2021/04/01/how-social-media-platforms-community-standards-address-influence-operations-pub-84201.

2 Spandana Singh, “An Analysis of How Internet Platforms Are Using Artificial Intelligence to Moderate User-Generated Content,” New America, July 22, 2019, https://www.newamerica.org/oti/reports/everything-moderation-analysis-how-internet-platforms-are-using-artificial-intelligence-moderate-user-generated-content/.

3 Ibid.

4 Robert Gorwa, Reuben Binns, and Christian Katzenbach, “Algorithmic Content Moderation: Technical and Political Challenges in the Automation of Platform Governance,” Big Data and Society 7, no. 1 (2020).

5 See https://www.microsoft.com/en-us/photodna and https://support.google.com/youtube/answer/2797370?hl=en&ref_topic=9282364.

6 See, for example, Ben Bradford et al., “Report of the Facebook Data Transparency Advisory Group,” https://academyhealth.org/sites/default/files/facebookdatatransparencyadvisorygroupreport52119.pdf.

7 John Hegeman, “Providing People With Additional Context About Content They Share,” Facebook, June 25, 2020, https://about.fb.com/news/2020/06/more-context-for-news-articles-and-other-content/.

8 Vijaya Gadde and Kayvon Beykpour, “Additional Steps We’re Taking Ahead of the 2020 US Election,” Twitter (blog), October 9, 2020, accessed February 11, 2021, https://blog.twitter.com/en_us/topics/company/2020/2020-election-changes.html.

9 James Vincent, “Twitter Is Bringing Its ‘Read Before You Retweet’ Prompt to All Users,” The Verge, September 25, 2020, https://www.theverge.com/2020/9/25/21455635/twitter-read-before-you-tweet-article-prompt-rolling-out-globally-soon.

10 Alex Hern, “TikTok to Introduce Warnings on Content to Help Tackle Misinformation,” Guardian, February 4, 2021, https://www.theguardian.com/technology/2021/feb/04/tiktok-to-introduce-warnings-on-content-to-help-tackle-misinformation.

11 Daniel Kahneman, Thinking, Fast and Slow (New York: Farrar, Straus and Giroux, 2013).

12 In the area of browser warnings, see Robert W. Reeder et al., “An Experience Sampling Study of User Reactions to Browser Warnings in the Field,” Proceedings of the 2018 ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), April 2018, DOI: 10.1145/3173574.3174086; specific to disinformation warnings, see Ben Kaiser et al., “Adapting Security Warnings to Counter Online Disinformation,” forthcoming, https://arxiv.org/pdf/2008.10772.pdf.

13 Jigsaw, “Perspective Is Reducing Toxicity in the Real World,” The Current, no. 003 (2020), https://jigsaw.google.com/the-current/toxicity/case-studies/.

14 Mike Schroepfer, “Update on Our Progress on AI and Hate Speech Detection,” Facebook, February 11, 2021, https://about.fb.com/news/2021/02/update-on-our-progress-on-ai-and-hate-speech-detection/.

15 Singh, “An Analysis of How Internet Platforms Are Using Artificial Intelligence to Moderate User-Generated Content,” 17–19.

Carnegie does not take institutional positions on public policy issues; the views represented herein are those of the author(s) and do not necessarily reflect the views of Carnegie, its staff, or its trustees.