Arabic Dialect Identification (ADI) has traditionally been modeled as a single-label classification task. However, recent work argues that ADI should be framed as a multi-label classification problem, as a single utterance may simultaneously sound natural to speakers from multiple countries. Despite this recognition, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training.
By analyzing models trained on single-label ADI data, we demonstrate that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples: many sentences treated as negatives are in fact acceptable in multiple dialects. To address this issue, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi).
We then train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LAHJATBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system, representing a significant advancement in multi-label Arabic dialect identification.
• A comprehensive approach to MLADI that explicitly models dialectal overlap, recognizing that sentences can be acceptable in multiple Arabic dialects simultaneously.
• Novel dataset construction combining GPT-4o predictions with binary classifiers, guided by ALDi scores to balance precision and recall across dialectal complexity ranges.
• Two curriculum strategies based on ALDi scores and label cardinality, progressively exposing the model to increasingly ambiguous dialectal instances.
• 69.04% macro F1 on the MLADI benchmark, surpassing the previous best system by 14 percentage points and outperforming larger multilingual models.
• An analysis revealing that negative samples in single-label datasets often represent valid multi-dialectal cases, informing better pseudo-labeling strategies.
• Coverage of 18 country-level Arabic dialects, including Algeria, Egypt, Iraq, Jordan, Lebanon, Morocco, Saudi Arabia, and Syria, among others across the Arab world.
• MARBERT (Abdul-Mageed et al., 2021) fine-tuned for multi-label classification with the bottom 8 layers frozen
• Binary cross-entropy loss with curriculum learning; 3 epochs, batch size 24, sigmoid threshold 0.3
• GPT-4o for intermediate ALDi ranges (0.11–0.77); binary classifiers for the extremes (<0.11 or >0.77)
• Arabic Level of Dialectness (ALDi) scores guide both data construction and curriculum ordering
• Two curriculum strategies: ALDi-based (dialectal complexity) and cardinality-based (number of valid dialect labels)
• Combined NADI 2020, 2021, and 2023 datasets covering ~50,000 tweets across 18 Arabic dialects
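The pseudo-label routing and the inference-time decision rule described above can be sketched as follows. This is a minimal illustration: the ALDi thresholds (0.11 / 0.77) and the sigmoid threshold (0.3) come from the setup above, while all function and variable names are assumptions, not the authors' actual code.

```python
import math

# ALDi thresholds from the setup above; names are illustrative.
ALDI_LOW, ALDI_HIGH = 0.11, 0.77

def route_annotator(aldi_score):
    """Choose the pseudo-label source for a sentence by its ALDi score."""
    if aldi_score < ALDI_LOW or aldi_score > ALDI_HIGH:
        # Extremes (near-MSA or strongly dialectal): binary dialect classifiers.
        return "binary_classifiers"
    # Ambiguous intermediate range: GPT-4o multi-label predictions.
    return "gpt4o"

def predict_dialects(logits, threshold=0.3):
    """Inference: sigmoid over per-dialect logits, keep scores >= threshold."""
    return [1 if 1.0 / (1.0 + math.exp(-z)) >= threshold else 0 for z in logits]
```

For example, `route_annotator(0.05)` returns `"binary_classifiers"`, while `route_annotator(0.5)` returns `"gpt4o"`; `predict_dialects([1.2, -2.0])` returns `[1, 0]`, since sigmoid(1.2) ≈ 0.77 clears the 0.3 threshold and sigmoid(-2.0) ≈ 0.12 does not.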
MLADI Test Set Results (1,000 sentences, 11 dialects):
• LAHJATBERT + ALDi CL: 69.0% macro F1 (65.0% precision, 76.4% recall)
• LAHJATBERT + Cardinality CL: 66.6% macro F1 (59.3% precision, 81.0% recall)
• LAHJATBERT (no curriculum): 68.0% macro F1 (69.0% precision, 69.7% recall)
• Aya-32B: 54.5% macro F1
• Elyadata: 52.4% macro F1
• NADI 2024 Baseline: 47.0% macro F1
The ALDi-based curriculum achieves the best overall balance, while the cardinality-based curriculum maximizes recall. All LAHJATBERT variants significantly outperform previous approaches, demonstrating the effectiveness of our pseudo-labeling and curriculum learning strategies.
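The two curriculum orderings compared above amount to easy-to-hard sorts over the training examples. The sketch below assumes a hypothetical per-instance layout (an `aldi` score and a multi-hot `labels` vector); it illustrates the ordering principle only, not the authors' training pipeline.

```python
def aldi_curriculum(examples):
    """Order easy-to-hard by dialectal complexity: low-ALDi sentences first."""
    return sorted(examples, key=lambda ex: ex["aldi"])

def cardinality_curriculum(examples):
    """Order easy-to-hard by label cardinality: fewer valid dialects first."""
    return sorted(examples, key=lambda ex: sum(ex["labels"]))

# Toy data (assumed layout): a strongly dialectal multi-label sentence and a
# mildly dialectal single-label one.
data = [
    {"text": "...", "aldi": 0.8, "labels": [1, 1, 1, 0]},
    {"text": "...", "aldi": 0.2, "labels": [0, 1, 0, 0]},
]
# Here both curricula present the second example (easier on both axes) first.
```

Note that the two orderings coincide on this toy pair but diverge in general: a sentence can be strongly dialectal (high ALDi) yet acceptable in only one dialect, and vice versa, which is why the two curricula yield different precision/recall trade-offs.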
@misc{mekky2026curriculumlearningpseudolabelingimprove,
      title={Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models},
      author={Ali Mekky and Mohamed El Zeftawy and Lara Hassan and Amr Keleg and Preslav Nakov},
      year={2026},
      eprint={2602.12937},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.12937},
}