Data

AI Detectors Fail on Polished Arabic Text

A new study reveals that AI detection tools frequently misclassify human-written Arabic articles as AI-generated when they are slightly polished by language models, raising concerns about false accusations and fairness.

AI Research
March 26, 2026
3 min read

AI detection tools are widely used to identify content generated by artificial intelligence, but a new study shows they struggle with a common practice: human writers using AI to lightly polish their work. Researchers found that these detectors often misclassify polished Arabic articles as entirely AI-generated, leading to false accusations of plagiarism and undermining trust in the tools. This issue is particularly critical for Arabic, a language whose rich morphology and many dialects pose unique challenges for AI systems.

The key finding from the research is that even minor AI polishing of human-written Arabic text causes significant errors in detection. The study evaluated 10 large language models (LLMs) and 4 commercial AI detectors on two datasets. For pure AI-generated vs. human-authored articles, the best LLM, Claude-4 Sonnet, achieved 83.51% accuracy, while the best commercial model, Originality.AI, reached 92% accuracy. However, when human articles were slightly polished—changing only 10% of the text—performance dropped dramatically. Claude-4 Sonnet's accuracy fell to 57.63% for articles polished by LLaMA-3 70B, and Originality.AI's accuracy plummeted to 12% for articles polished by Mistral or Gemma-3.
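The accuracy and false-positive figures quoted above are standard binary-classification metrics. As a minimal illustrative sketch (not the paper's code), they can be computed from lists of ground-truth and predicted labels like this:

```python
def accuracy(truth: list[str], predicted: list[str]) -> float:
    """Fraction of articles where the detector's verdict matches the truth."""
    correct = sum(t == p for t, p in zip(truth, predicted))
    return correct / len(truth)

def false_positive_rate(truth: list[str], predicted: list[str]) -> float:
    """Share of human-written articles misclassified as 'AI'."""
    human = [(t, p) for t, p in zip(truth, predicted) if t == "Human"]
    if not human:
        return 0.0
    return sum(p == "AI" for _, p in human) / len(human)
```

On the polished datasets, only the false-positive rate moves: every polished article is still human-authored, so each "AI" verdict is a false accusation.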

The methodology involved constructing two datasets to test the detectors. The first dataset contained 800 Arabic articles, half AI-generated and half human-authored, used to evaluate 14 LLMs and the commercial detectors; the eight best-performing models were selected for further testing. The second dataset, Ar-APT, included 400 human-authored articles polished by 10 LLMs at four levels (10%, 25%, 50%, and 75% polishing), totaling 16,400 samples. This dataset assessed whether slight polishing affected detection decisions. The researchers used prompts to instruct LLMs to act as detectors, responding only with "AI" or "Human," and manually tested commercial models on subsets due to API limitations.
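The prompt-as-detector setup can be sketched roughly as follows. The exact prompt wording and the response-parsing rules here are assumptions for illustration; the paper's own prompt is not quoted in this summary, and the model API call itself is omitted:

```python
# Hypothetical detection prompt template (an assumption, not the paper's text).
DETECTION_PROMPT = (
    "You are an AI-text detector. Read the following Arabic article and "
    "respond with exactly one word: 'AI' if it was generated by a language "
    "model, or 'Human' if it was written by a person.\n\n"
    "Article:\n{article}"
)

def build_prompt(article: str) -> str:
    """Fill the detection template with one article."""
    return DETECTION_PROMPT.format(article=article)

def parse_verdict(raw_response: str) -> str:
    """Normalize a model reply to the binary label 'AI' or 'Human'."""
    answer = raw_response.strip().lower()
    if answer.startswith("ai"):
        return "AI"
    if answer.startswith("human"):
        return "Human"
    raise ValueError(f"Unexpected detector reply: {raw_response!r}")
```

Constraining the reply to a single token keeps evaluation mechanical: each article yields one binary verdict that can be scored directly against the ground-truth label.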

The analysis showed that all detectors were adversely affected by polishing. For example, under 10% polishing, Claude-4 Sonnet's false positive rate increased from 16.49% to as much as 34.97% for articles polished by Mistral. Commercial models fared worse: Originality.AI's false positive rate jumped from 8% to 88% for articles polished by Mistral or Gemma-3. The study also measured how well LLMs preserved meaning during polishing, using cosine similarity and Jaccard similarity scores. Most LLMs maintained high cosine similarity (above 92% at 10% polishing), indicating the meaning was retained, yet detectors still misclassified these texts. Figures in the paper, such as those showing detection rates for GPT-4o and Claude-4 Sonnet across polishing levels, illustrate these declines.
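The two meaning-preservation measures can be sketched in a simplified form. Note the assumption: the paper most likely computes cosine similarity over sentence embeddings, whereas this sketch uses plain bag-of-words count vectors as a self-contained stand-in; Jaccard similarity is computed over word sets in both cases:

```python
from collections import Counter
from math import sqrt

def jaccard_similarity(original: str, polished: str) -> float:
    """Jaccard similarity: shared words / all distinct words across both texts."""
    a, b = set(original.split()), set(polished.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cosine_similarity(original: str, polished: str) -> float:
    """Cosine similarity over bag-of-words count vectors
    (a stand-in for the embedding-based score the paper likely uses)."""
    va, vb = Counter(original.split()), Counter(polished.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A cosine score above 92% after 10% polishing means the polished article says essentially the same thing as the original, which is what makes the detectors' "AI" verdicts on those articles false positives rather than genuine catches.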

The implications are significant for authors, educators, and organizations relying on AI detectors. False accusations of AI plagiarism can harm credibility and trust, especially in academic and professional settings where authenticity is paramount. The study highlights the need for more robust detectors that can distinguish between AI-generated content and human writing that has been lightly enhanced. This is particularly urgent for Arabic, which has been understudied compared to English despite its linguistic complexities. The researchers call on developers to address this issue to prevent unfair penalties and maintain the integrity of AI detection tools.

Limitations of the study include the manual testing of commercial models on only 100 articles due to lack of API access, which may not capture the full dataset's variability. Additionally, the research focused on Arabic; results may differ for other languages. The study also notes that some LLMs, like Qwen-3, performed poorly at polishing Arabic text, which could skew results. Future work should explore multilingual approaches and develop specialized models that detect polished content without producing false positives. The datasets are publicly available to encourage further research in this critical area.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn