AdarEdit is a domain-specialized graph foundation model for predicting A-to-I RNA editing sites. Unlike generic foundation models that treat RNA as linear sequences, AdarEdit represents RNA segments as graphs where nucleotides are nodes connected by both sequential and base-pairing edges, enabling the model to learn biologically meaningful sequence–structure patterns.
- Dual architecture: baseline (data-driven) and bio-aware (biochemically informed) models.
- Graph-based RNA representation: Captures both sequence and secondary structure information.
- High accuracy: F1 ≈ 0.90 (bio-aware), AUROC/AUPRC ≈ 0.96 (combined tissue).
- Cross-species generalization: Works on evolutionarily distant species even without Alu elements.
- Mechanistic interpretability: Graph attention highlights influential structural motifs.
- Foundation model behavior: A single model generalizes across tissues and conditions.
Baseline Model:
Node features (8-dim): base identity, pairing status, relative position, target flag
Edges: sequential + base-pairing (uniform)
Bio-aware Model:
Node features (22-dim): baseline features + trinucleotide context, stem-loop geometry, pairing energies
Edges: typed (canonical/wobble/sequential) with learned embeddings
Additional: parallel 1D CNN (3-mer/5-mer filters), optional global attention
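For orientation, here is a minimal sketch of the RNA-to-graph conversion, assuming a PyTorch Geometric `Data` object. The feature layout shown is illustrative, not the exact 8-/22-dim encodings listed above:

```python
import torch
from torch_geometric.data import Data

BASES = "ACGU"

def dotbracket_pairs(structure: str) -> dict:
    """Map each paired position to its partner via stack parsing of dot-bracket."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    return pairs

def rna_to_graph(seq: str, structure: str, target_idx: int) -> Data:
    n = len(seq)
    pairs = dotbracket_pairs(structure)
    # Illustrative node features:
    # one-hot base (4) + paired flag (1) + relative position (1) + target flag (1)
    x = torch.zeros(n, 7)
    for i, b in enumerate(seq):
        if b in BASES:
            x[i, BASES.index(b)] = 1.0
        x[i, 4] = float(i in pairs)          # pairing status
        x[i, 5] = (i - target_idx) / n       # signed relative position
        x[i, 6] = float(i == target_idx)     # target adenosine flag
    # Edges: sequential backbone (both directions) + base-pairing edges
    src = list(range(n - 1)) + list(range(1, n))
    dst = list(range(1, n)) + list(range(n - 1))
    for i, j in pairs.items():
        src.append(i)
        dst.append(j)
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    return Data(x=x, edge_index=edge_index)

# Example: 9-nt hairpin with the target A at index 4
g = rna_to_graph("GGGAAACCC", "(((...)))", target_idx=4)
```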
AdarEdit model architecture showing RNA-to-graph conversion and Graph Attention Network processing
First, clone this repository.
You can use the file environment.yml to create an Anaconda environment with the required packages.
- Save the environment.yml file in your project directory, then run the following command:
conda env create -f environment.yml
- Activate the Environment:
conda activate rnagnn
The Script/Human_Alu/Data_preparation/Classification_Data_Creation.py script creates classification datasets for each tissue:
Process:
- Read Alu pair regions from BED file (chr1,start1,end1,chr2,start2,end2,strand)
- Extract RNA sequences for each Alu pair using genome FASTA
- Connect Alu pairs with "NNNNNNNNNN" linker sequence
- Predict secondary structure using ViennaRNA fold_compound (see the sketch after this list)
- Extract editing levels from tissue-specific editing files
- Filter sites with >100 read coverage
- Generate full context sequences with structural annotations
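The folding step can be reproduced with the ViennaRNA Python bindings. A minimal sketch; the Alu-arm sequences below are placeholders:

```python
import RNA  # ViennaRNA Python bindings

# Placeholder Alu-arm sequences; the pipeline joins each pair with an N-linker
alu_left = "GGCCGGGCGCGGUGGCUCACGCC"
alu_right = "GGCGUGAGCCACCGCGCCCGGCC"
seq = alu_left + "NNNNNNNNNN" + alu_right

fc = RNA.fold_compound(seq)
structure, mfe = fc.mfe()  # dot-bracket string, minimum free energy (kcal/mol)
print(structure, mfe)
```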
Input:
--pair_region: BED file with Alu pair coordinates
--genome: Human genome FASTA file
--editing_site_plus/minus: Editing level files
--editing_level: Minimum editing threshold (e.g., 10.0)
Output:
data_for_prepare_classification.csv
for tissue in Brain_Cerebellum Artery_Tibial Liver Muscle_Skeletal; do
python Script/Human_Alu/Data_preparation/Classification_Data_Creation.py \
--pair_region data/raw/alu_pairs.bed \
--genome data/raw/hg38.fa \
--editing_site_plus data/raw/${tissue}_editing_plus.tsv \
--editing_site_minus data/raw/${tissue}_editing_minus.tsv \
--editing_level 10.0 \
--output_dir data/data_for_model_input/tissues
done
The Script/Human_Alu/Data_Preparation/build_cross_splits.R script creates balanced train/validation splits.

Process:
- Load per-tissue CSV files from data/data_for_model_input/tissues/
- Label editing sites: "yes" (EditingLevel ≥ yes_cutoff) vs "no" (< no_cutoff)
- Create balanced datasets (equal yes/no samples)
- Generate all tissue-pair combinations for cross-validation
- Remove training examples from validation sets to prevent data leakage
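The actual implementation is the R script above; this Python sketch outlines the balancing and leakage-removal logic, using hypothetical column names such as y_n:

```python
import pandas as pd

def balanced_sample(df: pd.DataFrame, n_per_class: int, seed: int = 42) -> pd.DataFrame:
    """Draw equal numbers of 'yes' and 'no' rows (label column name is hypothetical)."""
    yes = df[df["y_n"] == "yes"].sample(n=n_per_class, random_state=seed)
    no = df[df["y_n"] == "no"].sample(n=n_per_class, random_state=seed)
    return pd.concat([yes, no]).sample(frac=1, random_state=seed)  # shuffle

def drop_leakage(valid: pd.DataFrame, train: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Drop validation rows whose site keys also occur in the training set."""
    seen = set(map(tuple, train[keys].itertuples(index=False)))
    mask = [tuple(r) not in seen for r in valid[keys].itertuples(index=False)]
    return valid[mask]
```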
Input:
--data_dir: Per-tissue CSV files
--train_size: Training samples per tissue (default: 19,200)
--valid_size: Validation samples per tissue (default: 4,800)
--yes_cutoff: Editing threshold for positive class (default: 10%)
--no_cutoff: Non-editing threshold for negative class (default: 1%)
Output:
Cross-tissue directories:
- Training files: {train_tissue}_train.csv
- Validation files: {valid_tissue}_valid.csv
- Summary report: cross_split_summary.csv
Rscript Script/Human_Alu/Data_Preparation/build_cross_splits.R \
--data_dir data/data_for_model_input/tissues/ \
--output_dir data/data_for_model_input/tissues/cross_splits/ \
--train_size 19200 \
--valid_size 4800 \
--yes_cutoff 10 \
--no_cutoff 1 \
--seed 42
Cross-species datasets are constructed using a multi-step pipeline that processes editing sites from three evolutionarily distant species lacking Alu elements.

Target Species:
- Strongylocentrotus purpuratus (Sea urchin) - Echinoderm
- Ptychodera flava (Acorn worm) - Hemichordate
- Octopus bimaculoides (Octopus) - Mollusk
The cross-species data construction follows a 6-step pipeline that processes raw editing sites into training-ready datasets:
Key Processing Steps:
- Editing Level Extraction: Parse sequencing data to calculate A-to-I editing ratios
- Spatial Clustering: Merge editing sites within 1kb distance, retain clusters with >5 sites
- Density Selection: Extract sequence regions with highest editing site density
- Structure Prediction: Predict RNA secondary structure using RNAfold
- Quality Filtering: Apply coverage (≥100 reads) and editing level (≥10%) thresholds
- Dataset Preparation: Create balanced train/validation splits with equal edited/non-edited sites
Step-by-step Pipeline:
All scripts live under: Script/Evolution/
Conda environment
Use the provided environment file:
conda env create -f Script/Evolution/environment_evolution.yml
conda activate evolution
This environment includes (or expects) the following:
- Python ≥ 3.10, pandas, pyfaidx
- ViennaRNA (with Python bindings; RNA / ViennaRNA)
- bedtools
- R + packages: data.table, dplyr, readr, tidyr, optparse, stringr, tools
- RNAstructure CLI tools: dot2ct, draw (on PATH)
- bpRNA (bpRNA.pl accessible; see below)
The scripts will try to auto-detect DATAPATH (e.g., $CONDA_PREFIX/share/rnastructure/data_tables). If needed, set it explicitly:
export DATAPATH="$CONDA_PREFIX/share/rnastructure/data_tables"
Install bpRNA from GitHub and make the Perl entrypoint reachable in your environment:
git clone https://github.com/hendrixlab/bpRNA.git
# Option A: symlink into the active conda env
ln -sf "$(pwd)/bpRNA/bpRNA.pl" "$CONDA_PREFIX/bin/bpRNA.pl"
chmod +x "$CONDA_PREFIX/bin/bpRNA.pl"
# Option B: set env var instead of symlink
export BPRNA_PL="$(pwd)/bpRNA/bpRNA.pl"
The pipeline requires either bpRNA.pl on PATH or BPRNA_PL set.
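A minimal sketch of how a wrapper might resolve the bpRNA entrypoint under this convention; the preference order (env var first, then PATH) is an assumption:

```python
import os
import shutil

def resolve_bprna() -> str:
    """Find bpRNA.pl: prefer the BPRNA_PL env var, then fall back to PATH."""
    path = os.environ.get("BPRNA_PL") or shutil.which("bpRNA.pl")
    if not path:
        raise FileNotFoundError("bpRNA.pl not found: set BPRNA_PL or add it to PATH")
    return path
```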
For each species prepare:
- Genome FASTA (an index will be created by pyfaidx on first use). Use the same reference genome/version reported in Zhang et al., Cell Reports (2023), Supplemental Information.
- Editing results table (“resTable”-like; includes base counts per replicate, strand, etc.). Use the per-species editing tables provided in Zhang et al., Cell Reports (2023), Supplemental Information.
- Output directory for intermediate and final files.
Below, replace Species and all paths accordingly (run for each species).
Script: Script/Evolution/get_editing_levels.py
- Keeps A→G only
- Aggregates replicate counts
- Computes EditingLevel and Total_Coverage
Writes:
- A2IEditingSite.csv
- A2IEditingSite.bed (BED6 with strand)
python Script/Evolution/get_editing_levels.py \
-i /path/to/Species.resTable.txt \
-d /path/to/Species/
Optional flags (see --help): minimum coverage filter, custom column names, etc.
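In outline, the editing-level computation reduces to the following; column names (Mismatch, A_count, G_count) are hypothetical, not the script's exact schema:

```python
import pandas as pd

def compute_editing_levels(res_table: pd.DataFrame) -> pd.DataFrame:
    """Aggregate replicate counts and compute per-site A-to-I editing levels."""
    a2g = res_table[res_table["Mismatch"] == "A2G"]              # keep A->G only
    agg = (a2g.groupby(["Chr", "Position", "Strand"], as_index=False)
              [["A_count", "G_count"]].sum())                     # sum replicates
    agg["Total_Coverage"] = agg["A_count"] + agg["G_count"]
    agg["EditingLevel"] = agg["G_count"] / agg["Total_Coverage"]
    return agg
```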
Script: Script/Evolution/cluster_editing_sites.py
Wraps:
bedtools sort -i A2IEditingSite.bed \
| bedtools merge -d 1000 -s -c 3,3,6 -o count,collapse,distinct
Keeps clusters with site count > 5.
python Script/Evolution/cluster_editing_sites.py \
-i /path/to/Species/A2IEditingSite.bed \
-d 1000 \
-m 5 \
-D /path/to/Species/
Output: /path/to/Species/cluster_d1000_up5editingsite.bed
Script: Script/Evolution/get_ds_with_majority_ES.py
- Extracts/folds extended windows (ViennaRNA)
- Converts and draws structures (dot2ct, draw)
- Runs bpRNA to parse segments (.st)
- Selects the segment containing the majority of editing sites
- Intersects the chosen ds regions with all editing sites (BED6)
- Collects nearby A (≤20 nt from an edited site) as negatives (sketched after this list)
- Writes structure SVGs and summary CSVs
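The negatives-selection rule can be sketched as follows; coordinates and strand handling are simplified, and the function name is hypothetical:

```python
def nearby_unedited_adenosines(seq: str, region_start: int,
                               edited_positions: set, max_dist: int = 20) -> list:
    """Negatives: unedited A's within max_dist nt of an edited site
    (plus strand, 0-based coordinates, for illustration)."""
    negatives = []
    for offset, base in enumerate(seq):
        pos = region_start + offset
        if base == "A" and pos not in edited_positions:
            if any(abs(pos - e) <= max_dist for e in edited_positions):
                negatives.append(pos)
    return negatives
```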
python Script/Evolution/get_ds_with_majority_ES.py \
-i /path/to/Species/cluster_d1000_up5editingsite.bed \
-o /path/to/Species/ \
-e /path/to/Species/A2IEditingSite.bed \
-g /path/to/Species/genome.fa \
--num_processes 8
Key outputs (in -o):
- all_data_results.csv — ds selection, coordinates, MFE, etc.
- mfe_data_results.csv — per-window MFE summary
- Per-region .dbn/.ct/.shape/.svg
Script: Script/Evolution/merge_ds_results.py
- Expands by relative_positions (edited) and A_20d (nearby adenosines)
- Joins editing metrics by (Chr, Position, Strand)
- Writes combined table
python Script/Evolution/merge_ds_results.py \
-e /path/to/Species/A2IEditingSite.csv \
-a /path/to/Species/all_data_results.csv \
-o /path/to/Species/dsRNA_structure_with_editing_sites_andA20.csv \
-w 15
Script: Script/Evolution/filter_ds_groups.R
- Collapses to one row per (Chr, Position, Strand) using boundary consistency (default tolerance 20 bp for each ds boundary)
- Keeps length_small_ds ≥ 200 and Total_Coverage ≥ 100
- Optionally saves rows above an editing threshold (default 0.1)
Rscript Script/Evolution/filter_ds_groups.R \
--input /path/to/Species/dsRNA_structure_with_editing_sites_andA20.csv \
--output /path/to/Species/dsRNA_structure_with_editing_sites_andA20_len200_cov_100_withoutDup.csv \
--tolerance 20 --min-length 200 --min-coverage 100 \
--editing-threshold 0.1
# optionally:
# --save-above-threshold /path/to/Species/above_0.1.csv
Script: Script/Evolution/prepare_balanced_ml_sets.R (comma-list interface)
- Splits small_ds_seq into L/R by Local_Position
- Labels yes if EditingLevel > 0.1, no if < 0.001 (configurable)
- Balances yes/no per species and (optionally) equalizes across species
- Splits 80/20 → train/valid (configurable)
Rscript Script/Evolution/prepare_balanced_ml_sets.R \
--inputs "Strongylocentrotus_purpuratus=/.../Strongylocentrotus_purpuratus/..._withoutDup.csv,Octopus_bimaculoides=/.../Octopus_bimaculoides/..._withoutDup.csv,Ptychodera_flava=/.../Ptychodera_flava/..._withoutDup.csv" \
--out-dir /path/to/all_data/ \
--pos-threshold 0.1 --neg-threshold 0.001 \
--train-frac 0.8 --equalize-across TRUE --seed 42
Outputs (per species, under --out-dir):
- data_for_prepare_<Species>.csv
- final_<Species>_train.csv
- final_<Species>_valid.csv
These are the files you feed into AdarEdit's training/evaluation scripts (see the training sections below).
Examples: see data_evo/examples/Evolution/ — it contains all the example inputs/outputs you need (except the per-species editing-site tables and genome FASTAs, which are not included due to their size). For the clustering demo we kept only the first 40 clusters from cluster_d1000_up5editingsite.bed.
The final data preparation step converts CSV files to JSONL format required by the models.
Converts CSV files (output from data processing pipeline) to JSONL format (input for model training).
# Convert all CSV files in a directory tree
python csv_to_jsonl.py datasets/
# Convert CSV files in current directory
python csv_to_jsonl.py .
# Convert specific directory
python csv_to_jsonl.py datasets/Liver/combine_3_2/

From the data processing pipeline, each CSV has 4 columns:
structure,L,R,y_n
(((...)))...(((...))),AUGCUAGCUAGC,GCUAGCUAGCUA,yes
.((..))....((..)),CUAGCUAGCUAG,UAGCUAGCUAGC,no
...

Columns:
- structure: dot-bracket notation of RNA secondary structure
- L: left flanking sequence (upstream of editing site)
- R: right flanking sequence (downstream of editing site)
- y_n: label, "yes" (edited) or "no" (non-edited)
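The conversion itself is mechanical; a minimal sketch, where the file names are placeholders and the prompt strings mirror the JSONL examples below:

```python
import csv
import json

SYSTEM = ("Predict if the central adenosine (A) in the given RNA sequence context "
          "within an Alu element will be edited to inosine (I) by ADAR enzymes.")

def row_to_record(row: dict) -> str:
    """Render one CSV row as a chat-style JSONL record (see the examples below)."""
    user = f"L:{row['L']}, A:A, R:{row['R']}, Alu Vienna Structure:{row['structure']}"
    return json.dumps({"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user},
        {"role": "assistant", "content": row["y_n"]},
    ]})

# File names are placeholders
with open("input.csv") as src, open("output.jsonl", "w") as dst:
    for row in csv.DictReader(src):
        dst.write(row_to_record(row) + "\n")
```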
Each CSV row becomes a JSONL entry:
{"messages": [{"role": "system", "content": "Predict if the central adenosine (A) in the given RNA sequence context within an Alu element will be edited to inosine (I) by ADAR enzymes."}, {"role": "user", "content": "L:AUGCUAGCUAGC, A:A, R:GCUAGCUAGCUA, Alu Vienna Structure:(((...)))...(((...))"}, {"role": "assistant", "content": "yes"}]}
{"messages": [{"role": "system", "content": "Predict if the central adenosine (A) in the given RNA sequence context within an Alu element will be edited to inosine (I) by ADAR enzymes."}, {"role": "user", "content": "L:CUAGCUAGCUAG, A:A, R:UAGCUAGCUAGC, Alu Vienna Structure:.((..))....((..)"}, {"role": "assistant", "content": "no"}]}Use the automated runner script to train and evaluate both models across all tissue combinations:
What it does:
- Creates a timestamped workspace copy of the project and datasets
- Runs smoke test to verify imports and paths
- Trains and evaluates both models on all train/valid pairs in datasets/**/combine_*/
- Saves checkpoints, predictions, and ROC/PR curves for each model and split
- Generates a comprehensive evaluation summary
python run_all_evals_bioaware_baseline.py \
--variants baseline,bioaware

This will:
- Train baseline and bio-aware models on all tissue combinations
- Save results to overnight_eval_YYYYMMDD_HHMMSS/
- Run 1000 epochs per model/split
- Use batch size 256
python run_all_evals_bioaware_baseline.py \
--variants baseline,bioaware # Models to train (comma-separated)
--timestamp my_experiment_name # Custom workspace name (optional)
--resume_existing # Skip splits with existing predictions
--start_after "combine_2_4/Liver->Brain" # Resume from specific split
--skip_variants baseline # Skip specific models
--clean_variants bioaware # Delete old results for specific models

Available variants:
- baseline: Baseline GAT model
- bioaware: Bio-aware model with typed edges + sequence CNN
The script expects datasets organized as:
datasets/
├── Liver/
│ ├── combine_3_2/
│ │ ├── Liver_train.jsonl
│ │ ├── Liver_valid.jsonl
│ └── combine_3_1/
│ └── ...
├── Brain_Cerebellum/
│ └── combine_4_2/
│ └── ...
└── Combined/
└── ...
After running, you'll find in overnight_eval_YYYYMMDD_HHMMSS/:
overnight_eval_20240128_153045/
├── checkpoints/
│ ├── Liver/
│ │ ├── baseline/
│ │ │ └── combine_3_2_Liver_Liver/
│ │ │ └── best.pth # Best checkpoint for this split
│ │ └── bioaware/
│ │ └── combine_3_2_Liver_Liver/
│ │ └── best.pth
│ └── Brain_Cerebellum/
│ └── ...
├── predictions/
│ ├── Liver/
│ │ ├── baseline/
│ │ │ ├── combine_3_2_Liver_Liver.jsonl # Per-sample predictions
│ │ │ └── combine_3_2_Liver_Liver/
│ │ │ └── curves.npz # ROC/PR curve data
│ │ └── bioaware/
│ │ └── ...
│ └── ...
└── comprehensive_evaluation.md # Summary table
Both models use F1-optimized threshold search:
- At each epoch, evaluate on validation set
- Test 33 thresholds (0.1 to 0.9)
- Select threshold that maximizes F1 score
- Save checkpoint if F1 improves
- Best checkpoint is automatically selected
This ensures optimal performance for each tissue and split.
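The threshold search reduces to a small helper; a sketch using scikit-learn:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true: np.ndarray, probs: np.ndarray):
    """Scan 33 thresholds in [0.1, 0.9]; return the threshold maximizing F1."""
    thresholds = np.linspace(0.1, 0.9, 33)
    scores = [f1_score(y_true, (probs >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]
```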
If you prefer to train models individually, use the standalone scripts:
python Scripts/model/gnnadar_verb_compact.py \
--train_file {tissue}_train.csv \
--val_file {tissue}_valid.csv \
--epochs 1000 \
--mode train \
--batch_size 128 \
--num_workers 6 \
--checkpoint_dir checkpoints/Liver \
--checkpoint_interval 10
- (A, B) ROC and PR curves comparing AdarEdit (graph-based) with sequence-only baselines (EditPredict, RNA-FM, ADAR-GPT) on liver tissue. Graph-based models achieve higher AUROC and AUPRC.
- (C, D) Cross-tissue F1 heatmaps for baseline and bio-aware models. Rows = training tissue, columns = validation tissue. Diagonal = within-tissue performance; off-diagonal = cross-tissue generalization.
- (E) ΔF1 heatmap showing bio-aware improvement over baseline across all tissue combinations.
- (F) Cross-species F1 scores: within-species (species→species) and human-to-other transfer (Combined→species) for sea urchin, acorn worm, and octopus.
- (G, H) ROC and PR curves for all cross-tissue and cross-species combinations, comparing baseline and bio-aware models.
Key findings: Bio-aware model outperforms baseline in 20/25 tissue settings (ΔF1 = +0.012 average) and shows robust cross-species transfer.
Note: Sequence-only baseline comparisons (EditPredict, RNA-FM, ADAR-GPT) are taken from a previously published study (see: ADAR-GPT, PNAS 2026).
AdarEdit provides comprehensive interpretability analysis through two complementary approaches:
The interpretability pipeline (Scripts/interpretability/xgboost_shap_analysis.py) performs a staged XGBoost + SHAP analysis.

Usage:
python Scripts/interpretability/xgboost_shap_analysis.py \
--attention_csv attention_data.csv
- Stage 1: Train XGBoost on all attention features (positions -600 to +600)
- Stage 2: Apply SHAP analysis to identify top 20 most important features
- Stage 3: Retrain XGBoost using only top 20 features
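A sketch of the three stages; variable names and hyperparameters are illustrative, see the script for the exact interface:

```python
import numpy as np
import shap
from xgboost import XGBClassifier

def staged_xgboost(X: np.ndarray, y: np.ndarray, k: int = 20):
    """Stage 1: fit on all attention features. Stage 2: rank by mean |SHAP|.
    Stage 3: refit on the top-k features only."""
    full = XGBClassifier(n_estimators=200).fit(X, y)
    shap_values = shap.TreeExplainer(full).shap_values(X)
    top_idx = np.argsort(np.abs(shap_values).mean(axis=0))[::-1][:k]
    retrained = XGBClassifier(n_estimators=200).fit(X[:, top_idx], y)
    return full, retrained, top_idx
```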
Outputs:
- shap_top20_XGBoost.png: SHAP feature importance plot for the full model
- shap_top20_Retrained_XGBoost.png: SHAP plot for the retrained model
- shap_data_Original_Model.pkl: saved SHAP analysis data
- shap_data_Retrained_Model.pkl: saved retrained-model SHAP data
The GNN model automatically generates attention analysis during evaluation:
- attention_data.csv: attention weights for each position (-650 to +649) for each validation sample
- attention_graphs/: directory containing detailed attention visualization plots
This section documents the scripts used to characterize the model’s learned ADAR-related sequence and structure preferences around the target adenosine. The analyses are topology-aware: when a perturbation breaks a valid pair, the script removes the corresponding structural edge in the graph.
These scripts live under Scripts/interpretability/ and reproduce the analysis workflow used in the manuscript.
All scripts operate on:
- --val_file: validation JSONL in the format used in this repository (fields include L, A, R, and Alu Vienna Structure)
- --checkpoint: baseline checkpoint .pth (either a raw state_dict or a dict containing state_dict)
- --num_samples: maximum number of samples to process (default: 4000)
To focus on high-confidence edited sites, all scripts:
- keep only edited samples (label=1)
- require original model confidence pred > 0.7 (on the unmodified graph)
Script: Scripts/interpretability/fig6ab_core_heatmaps_topology_aware.py
What it does
- Sequence preference (heatmap): mutates base identity in the window [-3, +3] (relative to the target A at position 0) while keeping structure and edges unchanged. It aggregates the mean prediction per base/position and normalizes each column by subtracting its mean (excluding the target column).
- Structure preference (heatmap; topology-aware): compares paired vs unpaired per base/position. In the unpaired condition it sets the paired flag to 0 and removes the structural edge to the partner node (when a partner exists), as sketched below.
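The unpaired perturbation can be expressed compactly, assuming a PyTorch Geometric Data object and the illustrative feature layout from the architecture sketch earlier:

```python
from torch_geometric.data import Data

def unpair_node(data: Data, i: int, partner: int | None, paired_col: int = 4) -> Data:
    """Topology-aware 'unpaired' perturbation: clear the paired flag and drop
    the structural edge to the partner node. The feature column index is
    illustrative, not the repository's exact layout."""
    out = data.clone()
    out.x[i, paired_col] = 0.0
    if partner is not None:
        src, dst = out.edge_index
        keep = ~(((src == i) & (dst == partner)) | ((src == partner) & (dst == i)))
        out.edge_index = out.edge_index[:, keep]
    return out
```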
Run
python Scripts/interpretability/fig6ab_core_heatmaps_topology_aware.py \
--val_file datasets/Liver/combine_2_4/Liver_valid.jsonl \
--checkpoint checkpoints/Liver/baseline/.../best.pth \
--num_samples 4000

Outputs
- fig6a_motif_sequence_preference_heatmap.png
- fig6b_structure_preference_heatmap.png
Script: Scripts/interpretability/fig6c_structural_impact_square.py
What it does
- For each relative position in [-40, +40], computes: impact = pred(paired) - pred(unpaired)
- The unpaired condition is topology-aware: it removes the structural partner edge if it exists.
- Aggregates the mean impact across samples and produces a square zoned curve.
Run
python Scripts/interpretability/fig6c_structural_impact_square.py \
--val_file datasets/Liver/combine_2_4/Liver_valid.jsonl \
--checkpoint checkpoints/Liver/baseline/.../best.pth \
--num_samples 4000

Output
fig6c_structural_impact_square.png
Script: Scripts/interpretability/fig6d_partner_interaction_topology.py
What it does
- For positions [-1, 0, +1], identifies the structural partner node for each sample and evaluates paired mutations across all base pairs (self × partner).
- If a pair is invalid (not Watson–Crick or wobble), the script removes the structural edge between the two nodes.
- Aggregates mean prediction per pair and outputs heatmaps.
Run
python Scripts/interpretability/fig6d_partner_interaction_topology.py \
--val_file datasets/Liver/combine_2_4/Liver_valid.jsonl \
--checkpoint checkpoints/Liver/baseline/.../best.pth \
--out adar_partner_interaction_topology.png

Output
adar_partner_interaction_topology.png (or your chosen --out path)
To complement our main results, we provide an aggregate view of training behavior for the bio-aware model across all train→validation tissue/species combinations. The goal is to show that our F1-optimized checkpoint selection operates within broad, stable plateaus rather than isolated transient peaks (i.e., training is well-behaved and checkpoint choice is robust).
Training dynamics for the bio-aware model across all train→validation settings. Each colored line corresponds to one train→validation tissue or species combination. Curves are smoothed with a moving average window w=51.
The long-format CSVs used to generate panels A–B are included:
- Tables/training_dynamics/loss_over_epochs_bioaware_plain.csv
- Tables/training_dynamics/f1_over_epochs_bioaware_plain.csv
Each row is (variant, train, val, epoch, metric).
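If you only want to re-plot from these CSVs rather than from raw logs, a sketch with pandas/matplotlib, assuming the (variant, train, val, epoch, metric) schema documented above:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Tables/training_dynamics/f1_over_epochs_bioaware_plain.csv")

fig, ax = plt.subplots()
for (train, val), grp in df.groupby(["train", "val"]):
    grp = grp.sort_values("epoch")
    # moving average, window w=51, matching the figure caption
    smoothed = grp["metric"].rolling(51, center=True, min_periods=1).mean()
    ax.plot(grp["epoch"], smoothed, alpha=0.6)
ax.set_xlabel("epoch")
ax.set_ylabel("validation F1 (smoothed, w=51)")
fig.savefig("training_f1_over_epochs_bioaware_plain.png", dpi=200)
```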
If you have the raw training log file(s), you can reproduce the plots directly from logs.
Loss over epochs (panel A)
Script: Scripts/training_dynamics/plot_loss_over_epochs.py
python Scripts/training_dynamics/plot_loss_over_epochs.py --log <TRAIN_LOG.txt> --variant bioaware_plain --smooth_window 51 --out_png Figure/training_loss_over_epochs_bioaware_plain.png --out_csv Tables/training_dynamics/loss_over_epochs_bioaware_plain.csv

F1 over epochs (panel B)
Script: Scripts/training_dynamics/plot_f1_over_epochs.py
python Scripts/training_dynamics/plot_f1_over_epochs.py --log <TRAIN_LOG.txt> --variant bioaware_plain --smooth_window 51 --out_png Figure/training_f1_over_epochs_bioaware_plain.png --out_csv Tables/training_dynamics/f1_over_epochs_bioaware_plain.csv



