
While the West debates AI ethics, millions of Arabic and Persian speakers are being left behind, and left unprotected, by the very platforms they rely on to speak, resist, and connect.
Elon Musk’s xAI now controls X. It’s the latest and most striking example of how the firewall between human expression and machine learning has collapsed. Our digital words are now data mines for machines that don’t understand our language, history, or struggle.
Social media has long been a lifeline for our communities: a space to bypass censorship, challenge state narratives, and communicate in our own languages. But now, as generative AI models are trained on this vast digital record, the same platforms are turning our voices into raw material — without context, protection, or consent.
And when those voices are already treated unequally, the AI that learns from them can’t help but replicate that bias.
Take Meta, for example. In 2020, an internal survey revealed that while Facebook detected 40% of hate speech in English, that number dropped to just 6% for Arabic-language hate speech.
Entire communities in the region who do not communicate in English are left to fend for themselves in digital spaces rife with abuse, misinformation, and coordinated harassment.
This is despite Arabic being spoken by nearly 60% of the population across the Middle East and North Africa and serving as the official language in more than 20 countries. If detection rates for Arabic are this low, one can only imagine how much worse the situation is for other regional languages, such as Farsi, which receive even less attention from platforms and AI systems alike.
Social media platforms increasingly rely on AI-powered moderation to regulate user content. But the AI tools deployed for content moderation are often not well trained for non-English, non-European languages and contexts. This makes them unable to reliably detect hate speech, disinformation, and digital abuse, especially in regions where linguistic and cultural context is essential.
In her study Does AI Understand Arabic?, Mona Elswah describes this as “inconsistent moderation.” On the same platform, a political cartoon in Arabic might be flagged and removed, while actual hate speech in another dialect goes unchecked. In Persian, the situation is even worse, because it’s barely acknowledged. AI systems are significantly undertrained on Persian-language content, and when moderation does occur, it is often inconsistent or inaccurate.
And now, the same flawed data that fuels these abuses is being used to train generative AI.
Generative AI tools, like OpenAI’s ChatGPT, Meta’s LLaMA, and Musk’s Grok, are designed to write, converse, and create. They’re trained on massive amounts of real-world user data, much of it scraped from social media. That includes our posts, our private messages, even our deleted content.
This data harvesting already raises red flags around misinformation, bias, harmful outputs, and the processing of personal data. But when that training data comes from platforms that already fail to moderate content, the result is even more harmful. These models don’t just learn language — they learn the power structures encoded in that language. They inherit the same biases, amplify the same exclusions, and reinforce the same asymmetries.
Grok, Elon Musk’s in-house AI tool on X, is a cautionary tale. Trained on unfiltered platform content after Musk gutted moderation teams, Grok quickly made headlines for generating racist and misogynistic responses. In January 2025, it produced images of Black football players picking cotton and surrounded by monkeys. A month later, it used misogynistic slurs in Hindi, dismissing the backlash with: “I was just having fun, but lost control.”
These aren’t random glitches; they’re a direct outcome of training AI on toxic, unchecked content.
Now imagine what happens when these systems are trained on Arabic and Persian-language data that was never properly moderated to begin with. The violence, disinformation, and hate that go unflagged on social media become part of the machine’s core knowledge. They’re not just reproduced; they’re normalised.
In much of the Middle East and North Africa, the stakes are even higher. State surveillance, censorship, and internet filtering already restrict online speech. In countries like Iran, where access to the global internet is limited, users face a double bind: monitored by their governments and abandoned by the platforms they use.
Because of sanctions, censorship, and a lack of legal safeguards, users from politically isolated contexts have no meaningful recourse when harmed online. Terms of service don’t protect them. Transparency reports are rarely available in their native languages.
We are caught between algorithmic indifference and authoritarian control.
For years, the worst consequence of language marginalisation might have been that your joke got mistranslated or your post was wrongly flagged. But we’ve moved beyond that.
Now, generative AI is shaping what people know, what they believe, and how they act, at scale. It’s being used to write news, filter content, and train other machines. If Arabic and Persian are underrepresented, misrepresented, or distorted in these systems, the damage is long-term.
AI risks reinforcing authoritarian narratives. It threatens already fragile civic spaces. It accelerates the spread of political disinformation — from Arabic troll farms to Persian smear campaigns. It undermines trust. And worst of all, it pushes millions of users further into the margins of the digital world.
The answer isn’t to abandon AI. It’s to demand better.
Generative AI must be trained on multilingual, culturally competent data. That means real investment in language equity. It means hiring regional experts and researchers to build, test, and evaluate these systems. It means translating safety tools and making transparency accessible to those outside the Anglophone bubble.
Most of all, we must acknowledge that the digital future is political. Languages are not just tools of communication. They’re vessels of memory, culture, and resistance. If Arabic and Persian speakers are excluded from the architecture of AI, we’re not just being ignored. We’re being erased.
And we’ve been erased enough.
Ameneh Dehshiri is a London-based Iranian cyber law expert with more than ten years of experience in digital rights, AI governance, and internet policy.
Follow her on X: @Dehshiri4Law
Have questions or comments? Email us at: editorial-english@newarab.com
Opinions expressed in this article remain those of the author and do not necessarily represent those of The New Arab, its editorial board or staff.