This repository contains the code and resources associated with the paper *Lemmatizing Dialectal Arabic with Sequence-to-Sequence Models* (Saeed and Habash, ArabicNLP 2025).
The paper evaluates a large number of models for dialectal Arabic lemmatization.
We release only the top-performing dialectal lemmatization models, covering:
- Egyptian Arabic (EGY)
- Gulf Arabic (GLF)
- Levantine Arabic (LEV)
All other models in the paper follow the exact same pipeline and evaluation framework; the only difference is which model is loaded at runtime.
If you need access to any of the additional models reported in the paper, please contact:
Mostafa Saeed – mostafa.saeed@nyu.edu
We recommend using Conda to manage the environment.
```bash
git clone https://github.com/CAMeL-Lab/seq2seq-arabic-dialect-lemmatization.git
cd seq2seq-arabic-dialect-lemmatization
# 1. Create a new conda environment
conda create -n dialectal_lemma_env python=3.12.7
# 2. Activate the environment
conda activate dialectal_lemma_env
# 3. Install Python dependencies
pip install -r requirements.txt
```
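The data-reconstruction steps below rely on the `muddler` command-line tool. The following is a minimal, optional sanity check (not part of the repository), assuming Muddler is installed via `requirements.txt` or separately:

```python
# check_muddler.py -- optional sanity check; not part of this repository.
# Verifies that the `muddler` command used in the reconstruction steps
# below is reachable from the activated conda environment.
import shutil

muddler_path = shutil.which("muddler")
if muddler_path is None:
    print("muddler not found on PATH; install it before running the steps below.")
else:
    print(f"muddler found at: {muddler_path}")
```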
To run any of the top-performing dialectal lemmatizers for EGY, GLF, or LEV, you must first obtain the Egyptian CALIMA morphological database. This resource cannot be distributed directly due to licensing constraints, but it can be reconstructed locally using Muddler.

- The Egyptian morphological database (morph_db) is derived from the LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 (LDC2010L01). After obtaining `LDC2010L01.tgz`, run the following:
```bash
muddler unmuddle \
-s "LDC2010L01.tgz source" \
-m "data and morph DB/EGY data/arz_magold_dev.muddle" \
"data and morph DB/EGY data/arz_magold_dev.csv"- The Egyptian dataset used in this paper is derived from the licensed resource: BOLT Egyptian Arabic Treebank.
After obtaining `bolt_arz-df_LDC2018T23.tgz`, you can reconstruct the train, dev, and test sets using the encrypted `.muddle` files included in this repository by running the following:
```bash
muddler unmuddle \
-s "bolt_arz-df_LDC2018T23.tgz source" \
-m "data and morph DB/EGY data/arz_magold_train.muddle" \
"data and morph DB/EGY data/arz_magold_train.csv"muddler unmuddle \
-s "bolt_arz-df_LDC2018T23.tgz source" \
-m "data and morph DB/EGY data/arz_magold_dev.muddle" \
"data and morph DB/EGY data/arz_magold_dev.csv"muddler unmuddle \
-s "bolt_arz-df_LDC2018T23.tgz source" \
-m "data and morph DB/EGY data/arz_magold_test.muddle" \
"data and morph DB/EGY data/arz_magold_test.csv"After preparing the required datasets and morphological databases, you can run the best-performing lemmatizer using:
After preparing the required datasets and morphological databases, you can run the best-performing lemmatizer using:

```bash
python run_best_dialectal_lemmatizer.py \
--dialect glf \
--data_path data/GLF/gumar_dev.csv \
--output glf_predictions.csv
```

The script takes the following arguments:

| Argument | Description |
|---|---|
| `--dialect` | Select the target dialect. Options: `egy`, `glf`, `lev` |
| `--data_path` | Path to the input dataset (CSV file for EGY, GLF, or LEV) |
| `--output` | Path where the predicted lemmas will be saved |
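To lemmatize several dialects in one go, you can wrap the command above in a small driver script. This is only a sketch: the GLF and EGY paths are taken from this README, while the LEV path is a placeholder you should replace with your own Levantine CSV.

```python
# run_all_dialects.py -- illustrative wrapper around run_best_dialectal_lemmatizer.py.
# The GLF and EGY paths come from this README; the LEV path is a placeholder.
import subprocess

JOBS = {
    "glf": "data/GLF/gumar_dev.csv",
    "egy": "data and morph DB/EGY data/arz_magold_dev.csv",
    "lev": "path/to/your_lev_data.csv",  # placeholder -- replace with a real LEV CSV
}

for dialect, data_path in JOBS.items():
    subprocess.run(
        [
            "python", "run_best_dialectal_lemmatizer.py",
            "--dialect", dialect,
            "--data_path", data_path,
            "--output", f"{dialect}_predictions.csv",
        ],
        check=True,  # abort if any run fails
    )
```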
If you use this repository, models, or datasets in your research, please cite:
```bibtex
@inproceedings{saeed-habash-2025-lemmatizing,
title = "Lemmatizing Dialectal {A}rabic with Sequence-to-Sequence Models",
author = "Saeed, Mostafa and Habash, Nizar",
booktitle = "Proceedings of The Third Arabic Natural Language Processing Conference",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
abstract = "Lemmatization for dialectal Arabic poses many challenges due to the lack of orthographic standards and limited morphological analyzers. This work explores the effectiveness of Seq2Seq models for lemmatizing dialectal Arabic, both without analyzers and with their integration. We assess how well these models generalize across dialects and benefit from related varieties. Focusing on Egyptian, Gulf, and Levantine dialects with varying resource levels, our analysis highlights both the potential and limitations of data-driven approaches. The proposed method achieves significant gains over baselines, performing well in both low-resource and dialect-rich scenarios."
}
```