Author: Mona Elswah, Project Fellow, Center for Democracy & Technology
Imagine a digital world where non-harmful content is frequently taken down while inciting, hateful messages remain untouched. This is the reality for many, particularly users in the Global South, and especially so for those who speak indigenous and “low-resource languages,” which have limited available digital data for training AI models and receive less investment than English and selected European languages. Content moderation plays a crucial role in shaping online discourse, determining which voices in the expansive digital landscape are amplified and which are silenced. However, we know very little about how content moderation systems operate for non-English languages in the Global South. To investigate this, the Center for Democracy & Technology (CDT) is leading a research project funded by the Internet Society Foundation’s Research Grant Program. The project examines content moderation in four regions and their respective languages. This blog is about our first case study, which focuses on the ways different social media companies moderate Maghrebi Arabic, a group of dialects spoken by an estimated 100 million people. This research was conducted in collaboration with the Tunisian-based nonprofit organization Digital Citizenship.
What is wrong with the moderation of Maghrebi Arabic dialects?
Based on interviews with content moderators, representatives from tech companies, and digital rights advocates, focus group sessions with influencers and frequent social media users, and an online survey of frequent social media users, we found that users in the Maghreb region (which includes Tunisia, Morocco, and Algeria, among others) are increasingly distrustful of social media companies, particularly those based in the U.S. This reflects a post-colonial perception of U.S.-based platforms as disconnected from local realities and values. In contrast, many users view TikTok, owned by Chinese company ByteDance, as a more “friendly” platform and host for their content, especially for politically sensitive material where local stances may differ from those of the West.
To navigate this distrust in platforms and platform moderation, users have adopted various strategies. First, they employ “algospeak”, rewording content to evade detection algorithms and avoid takedowns. Second, when traditional reporting mechanisms fail them, users are more likely to leverage their social capital and use “mass reporting” to remove content they view as harmful. Finally, users often rely on third-party escalation, more often than not facilitated by civil society organizations, to compensate for inadequate reporting channels, a practice that has become institutionalized by tech companies, especially in the Global South.
In addition to this distrust from users, when we investigated the errors that led to many complaints from the Maghreb region, we found two main problems: one in human content moderation and one in automated moderation. On the human side, in the Arab world there are generally at least six main vendors that contract to moderate content for social media companies. Moderators at these vendors are assigned content from across all Arab countries regardless of their own nationalities. In other words, an Egyptian moderator might be tasked with Tunisian content and vice versa, which leads to misunderstandings of cultural nuance and many linguistic inaccuracies. In the interviews we conducted, moderators told us that, “most of the videos, most of the mistakes and the errors were done, they were due to the lack of [understanding of] the language.”
Automated content moderation, too, produces errors that undermine its quality. Maghrebi Arabic is a low-resource language, meaning that there is less high-quality training data available online to train classifiers. We were able to meet with several Natural Language Processing (NLP) scientists based in the Maghreb region, as well as other NLP scientists who worked at tech companies. They identified three main challenges in developing reliable classifiers for Maghrebi Arabic dialects. First, there is the presence of Arabizi, which involves writing Arabic in Latin letters to compensate for insufficient software and hardware support, and code-switching, which is the use of more than one language in the same sentence. Second, the lack of investment from social media companies and a broader lack of political interest in this region have resulted in a resource gap that Modern Standard Arabic does not face. An NLP scientist at a social media company told us, “Due to the lack of data, a lot of the models that are trained on Moroccan Arabic or even Algerian Arabic do not perform so well compared to the ones trained on Modern Standard Arabic.”
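To make the Arabizi and code-switching challenges concrete, here is a minimal, illustrative Python sketch of the kind of heuristics a preprocessing pipeline might apply before any dialect classification. The digit substitutions and helper functions are our own simplified assumptions for illustration, not any platform’s actual implementation.

```python
import re

# Digits commonly used in Arabizi to stand in for Arabic letters that
# have no Latin equivalent (e.g. "3" for ʿayn, "7" for ḥāʾ).
# This set is a simplification; real usage varies by region and writer.
ARABIZI_DIGITS = set("23579")

def looks_like_arabizi(token: str) -> bool:
    """Heuristic: a Latin-letter token containing an Arabizi digit."""
    has_latin = any("a" <= ch.lower() <= "z" for ch in token)
    has_arabizi_digit = any(ch in ARABIZI_DIGITS for ch in token)
    return has_latin and has_arabizi_digit

def is_code_switched(text: str) -> bool:
    """Heuristic: text mixes Arabic script with Latin script,
    a common pattern in Maghrebi social media posts."""
    has_arabic = bool(re.search(r"[\u0600-\u06FF]", text))
    has_latin = bool(re.search(r"[A-Za-z]", text))
    return has_arabic and has_latin
```

Even these toy heuristics hint at the difficulty: the same user might write a word in Arabic script, in plain Latin letters, or in Arabizi within a single thread, so a classifier trained only on one written form will systematically miss the others.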
The third challenge is the lack of diversity in the NLP teams at the tech companies who develop and evaluate automated content moderation systems. Content creators in the Maghreb have noted that Instagram, for example, appears to automatically hide comments containing “Allah Akbar,” due to a perceived link to terrorism, even though it is a common phrase for prayer and joy. An NLP researcher attributed this issue to a lack of oversight during the training of the classifiers due to the lack of diversity in NLP teams.
How do we fix this? Building trust and capacity
While addressing content moderation bias is a lengthy process that requires considerable effort, willingness, and motivation, it is achievable, and there is significant potential for improvement. Tech companies have the financial resources and access to the means necessary to address many of the aforementioned issues. Additionally, the Maghreb region is home to many qualified NLP experts who have launched creative initiatives to curate datasets for their dialects, aiming to close the resource gap. Leveraging their expertise is essential for tech companies.
Furthermore, tech companies should prioritize diversity when hiring NLP researchers. They should also encourage their outsourced content moderation vendors to create better working environments for moderators and to hire a more diverse pool of individuals who can review the various Maghrebi Arabic dialects.
Tech companies, especially U.S.-based ones, are losing the trust of users in the Maghreb region. Users face significant challenges when trying to communicate with these companies about reporting violations, appealing decisions, or addressing urgent issues like hacking, leaving them feeling vulnerable. To improve this situation, tech companies should establish robust communication channels through regional offices that are fast, responsive, and culturally sensitive. The reporting and appeal processes need to be more organized and transparent, allowing users to track their reports and understand the review process.
Additionally, companies should engage more with civil society organizations that advocate for users, fostering a safer online environment. Users and civil society groups have repeatedly expressed frustration over unexplained content removals and shadowbanning, which contribute to a climate of self-censorship and hinder free expression. Lastly, tech companies must prioritize transparency in content moderation decisions, including clear communication about the reasons behind content removals and whether these actions were taken by algorithms or human moderators. Such measures would help rebuild user trust and alleviate feelings of undue censorship.
Disclaimer: Viewpoints expressed in this post are those of the authors and may or may not reflect official Internet Society Foundation positions.