Curriculum Learning and Pseudo-Labeling Improve the Generalization of
Multi-Label Arabic Dialect Identification Models

Ali Mekky1*, Mohamed El Zeftawy1*, Lara Hassan1*,
Amr Keleg1, Preslav Nakov1
1Mohamed Bin Zayed University of Artificial Intelligence
* Equal contribution.
Accepted at VarDial 2026 (12th Workshop on NLP for Similar Languages, Varieties and Dialects, co-located with EACL 2026)
  • 69.04% macro F1 score
  • +14 points over the strongest previously reported system
  • 18 Arabic dialects covered
  • #1 on the MLADI leaderboard

Abstract

Arabic Dialect Identification (ADI) has traditionally been modeled as a single-label classification task. However, recent work argues that ADI should be framed as a multi-label classification problem, as a single utterance may simultaneously sound natural to speakers from multiple countries. Despite this recognition, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training.

By analyzing models trained on single-label ADI data, we demonstrate that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples: many sentences treated as negatives are in fact acceptable in multiple dialects. To address this issue, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi).

We then train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LAHJATBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system, representing a significant advancement in multi-label Arabic dialect identification.

Key Highlights

🎯 Multi-Label Framework

Comprehensive approach to MLADI that explicitly models dialectal overlap, recognizing that sentences can be acceptable in multiple Arabic dialects simultaneously.

🤖 Hybrid Pseudo-Labeling

Novel dataset construction combining GPT-4o predictions with binary classifiers, guided by ALDi scores to balance precision and recall across dialectal complexity ranges.

📚 Curriculum Learning

Two curriculum strategies based on ALDi scores and label cardinality, progressively exposing the model to increasingly ambiguous dialectal instances.

📊 State-of-the-Art Results

Achieves 69.04% macro F1 on MLADI benchmark, surpassing previous best system by 14 percentage points and outperforming larger multilingual models.

🔍 Training Dynamics Analysis

Deep analysis reveals that negative samples in single-label datasets often represent valid multi-dialectal cases, informing better pseudo-labeling strategies.

🌍 Broad Coverage

Covers 18 country-level Arabic dialects, including those of Algeria, Egypt, Iraq, Jordan, Lebanon, Morocco, Saudi Arabia, and Syria, among others across the Arab world.

Key Contributions

  • In-depth Analysis: Comprehensive examination of limitations in reusing single-label ADI datasets for multi-label dialect acceptability, demonstrating systematic issues with negative sample selection.
  • Pseudo-Labeled Dataset: Construction of a multi-label training dataset by aggregating predictions from GPT-4o and 18 binary dialect classifiers, guided by ALDi scores for optimal precision-recall balance.
  • LAHJATBERT Model Family: Introduction of BERT-based multi-label models trained with curriculum learning strategies, achieving state-of-the-art performance on the MLADI benchmark.

⚙️ Technical Approach

Base Model

MARBERT (Abdul-Mageed et al., 2021), fine-tuned for multi-label classification with the bottom 8 layers frozen

Training Strategy

Binary cross-entropy loss with curriculum learning, 3 epochs, batch size 24, sigmoid threshold 0.3
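At inference time, the multi-label head applies a sigmoid to each dialect's logit and keeps every dialect whose probability clears the 0.3 threshold. A minimal sketch of that decision rule (the argmax fallback for empty predictions is an assumption of this sketch, not stated in the paper):

```python
import math

THRESHOLD = 0.3  # sigmoid cutoff used in the paper's training setup

def predict_dialects(logits, dialects, threshold=THRESHOLD):
    """Map per-dialect logits to a multi-label prediction set."""
    probs = [1 / (1 + math.exp(-z)) for z in logits]
    labels = [d for d, p in zip(dialects, probs) if p >= threshold]
    # Fall back to the single most probable dialect if nothing clears
    # the threshold (illustrative choice, not from the paper).
    if not labels:
        labels = [dialects[max(range(len(probs)), key=probs.__getitem__)]]
    return labels

# Example: sigmoid(2.0) ≈ 0.88 and sigmoid(0.0) = 0.5 pass the 0.3 cutoff.
print(predict_dialects([2.0, -3.0, 0.0], ["EGY", "MOR", "LEV"]))
# → ['EGY', 'LEV']
```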

Pseudo-Labeling

GPT-4o for intermediate ALDi ranges (0.11-0.77), binary classifiers for extremes (<0.11 or >0.77)
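The routing rule above can be sketched as a small function: sentences in the ambiguous middle of the ALDi range get GPT-4o labels, while the extremes are handled by the 18 binary classifiers (the thresholds 0.11 and 0.77 are the ones quoted above):

```python
def pseudo_label_source(aldi_score, low=0.11, high=0.77):
    """Choose which annotator to trust for a sentence based on its
    Arabic Level of Dialectness (ALDi) score."""
    if aldi_score < low or aldi_score > high:
        # Extremes: near-MSA or strongly dialectal text, where the
        # binary acceptability classifiers are reliable.
        return "binary_classifiers"
    # Intermediate range: ambiguous sentences, labeled by GPT-4o.
    return "gpt4o"
```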

ALDi Integration

Arabic Level of Dialectness scores guide both data construction and curriculum ordering for optimal learning

Curriculum Types

Two strategies: ALDi-based (dialectal complexity) and cardinality-based (number of valid dialect labels)
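Both curricula amount to sorting the training data from less to more ambiguous before scheduling it. A sketch of the two orderings (which end of the ALDi scale counts as "easy", and the use of a plain sort rather than staged buckets, are simplifying assumptions of this sketch):

```python
def curriculum_order(examples, strategy="aldi"):
    """Order training examples from 'easy' to 'hard'.

    Each example is a dict with an 'aldi' float and a 'labels' list.
    - 'aldi': sort by ALDi score (dialectal complexity)
    - 'cardinality': sort by number of valid dialect labels,
      so unambiguous single-dialect sentences come first
    """
    if strategy == "aldi":
        key = lambda ex: ex["aldi"]
    elif strategy == "cardinality":
        key = lambda ex: len(ex["labels"])
    else:
        raise ValueError(f"unknown curriculum strategy: {strategy!r}")
    return sorted(examples, key=key)
```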

Dataset Size

Combined NADI 2020, 2021, 2023 datasets covering ~50,000 tweets across 18 Arabic dialects

Performance Comparison

MLADI Test Set Results (1,000 sentences, 11 dialects):

  • LAHJATBERT + ALDi CL: 69.0% macro F1 (65.0% precision, 76.4% recall)
  • LAHJATBERT + Cardinality CL: 66.6% macro F1 (59.3% precision, 81.0% recall)
  • LAHJATBERT (no curriculum): 68.0% macro F1 (69.0% precision, 69.7% recall)
  • Aya-32B: 54.5% macro F1
  • Elyadata: 52.4% macro F1
  • NADI 2024 Baseline: 47.0% macro F1

The ALDi-based curriculum achieves the best overall balance, while the cardinality-based curriculum maximizes recall. All LAHJATBERT variants significantly outperform previous approaches, demonstrating the effectiveness of our pseudo-labeling and curriculum learning strategies.
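For reference, the macro F1 reported above averages a per-dialect F1 over all labels, which is why precision/recall trade-offs across curricula shift the score. A minimal multi-label implementation (variable names are illustrative):

```python
def macro_f1(gold, pred, labels):
    """Macro-averaged F1 over dialect labels.

    gold, pred: lists of label sets, one per sentence.
    labels: the full dialect inventory to average over.
    """
    f1_scores = []
    for lab in labels:
        tp = sum(lab in g and lab in p for g, p in zip(gold, pred))
        fp = sum(lab not in g and lab in p for g, p in zip(gold, pred))
        fn = sum(lab in g and lab not in p for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1_scores) / len(f1_scores)
```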

Citation
@misc{mekky2026curriculumlearningpseudolabelingimprove,
      title={Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models}, 
      author={Ali Mekky and Mohamed El Zeftawy and Lara Hassan and Amr Keleg and Preslav Nakov},
      year={2026},
      eprint={2602.12937},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.12937}, 
}