A high-performance tool to annotate GWAS summary statistics with accurate rsIDs using dbSNP.
rsidImpu is designed for large-scale genome-wide association studies (GWAS), providing fast and accurate rsID matching via chromosome + position + allele comparison (with support for allele flipping and strand complement).
The tool supports QC filtering, COJO-formatted output, gzip input/output, and multithreaded execution.
- 15GB+ dbSNP reference files (tsv or gz)
- millions of GWAS variants
- parallel processing (OpenMP)
- gzip input support
- Matches variants by CHR + POS + allele information
- Supports:
  - Allele flipping (A1/A2 ↔ A2/A1)
  - Strand complement matching (A↔T, C↔G)
- OpenMP multithreading
- Efficient gzipped input/output
- Hash-based dbSNP lookup
- Capable of processing:
  - 15–30 GB dbSNP reference files
  - Millions to tens of millions of GWAS variants
All components—logging, hashing, QC, matching—are thread-safe and deterministic under OpenMP.
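The lookup and allele comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not rsidImpu's C++ implementation: the function names (`build_index`, `alleles_match`, `find_rsid`) are hypothetical, and multi-base alleles are complemented base by base without reversal, which is a simplifying assumption.

```python
# Hypothetical sketch of hash-based CHR + POS lookup with allele flip and
# strand complement matching. Not the tool's actual code.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(allele):
    """Complement each base; return None if the allele has non-ACGT characters."""
    try:
        return "".join(COMPLEMENT[base] for base in allele.upper())
    except KeyError:
        return None


def alleles_match(a1, a2, ref, alt):
    """True if (a1, a2) equals (ref, alt) directly, flipped, or on the other strand."""
    a1, a2, ref, alt = (x.upper() for x in (a1, a2, ref, alt))
    candidates = [(a1, a2), (a2, a1)]
    c1, c2 = complement(a1), complement(a2)
    if c1 is not None and c2 is not None:
        candidates += [(c1, c2), (c2, c1)]
    return (ref, alt) in candidates


def build_index(dbsnp_rows):
    """Map (chromosome, position) to the list of dbSNP records at that site."""
    index = {}
    for chrom, pos, ref, alt, rsid in dbsnp_rows:
        index.setdefault((str(chrom), int(pos)), []).append((ref, alt, rsid))
    return index


def find_rsid(index, chrom, pos, a1, a2):
    """Return the first rsID at this site whose alleles are compatible, else None."""
    for ref, alt, rsid in index.get((str(chrom), int(pos)), []):
        if alleles_match(a1, a2, ref, alt):
            return rsid
    return None
```

In this sketch, a GWAS row with alleles `G`/`A` or `T`/`C` matches a dbSNP record with `REF=A`, `ALT=G`, and the GWAS alleles themselves are only read, never rewritten.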
- Matched rows → <out>.txt
- Unmatched rows → <out>.txt.unmatched
- GWAS alleles are never modified (A1/A2 remain as-is)
- Optional COJO output format:
SNP A1 A2 freq b se p N
(Optional, user-controlled)
- Remove invalid rows (non-finite N/beta/se/freq/P)
- Filter by MAF (default: 0.01)
- Remove duplicated SNPs (keep the one with lowest P)
Requirements:
- C++11 or later
- zlib
- OpenMP (optional but recommended)
git clone https://github.com/Crazzy-Rabbit/rsidImpu.git
cd rsidImpu/src
make clean
make
The binary rsidImpu will be generated in the src/ directory.
rsidImpu \
--gwas-summary example/gwas_test_clean.txt \
--dbsnp example/dbsnp_test.txt \
--out example/gwas_rsid.txt \
--dbchr CHR --dbpos POS --dbA1 REF --dbA2 ALT --dbrsid RSID
rsidImpu \
--gwas-summary example/gwas_test_clean.txt \
--dbsnp example/dbsnp_test.txt \
--out example/gwas_rsid.cojo.gz \
--dbchr CHR --dbpos POS --dbA1 REF --dbA2 ALT --dbrsid RSID \
--format cojo \
--freq Freq --beta Beta --se SE --n N --pval P
rsidImpu \
--gwas-summary example/gwas_test_clean.txt \
--dbsnp example/dbsnp_test.txt \
--out example/gwas_rsid.cojo.gz \
--dbchr CHR --dbpos POS --dbA1 REF --dbA2 ALT --dbrsid RSID \
--format cojo \
--freq Freq --beta Beta --se SE --n N --pval P \
--remove-dup-snp \
--maf 0.01
rsidImpu \
--gwas-summary example/gwas_test_clean.txt \
--dbsnp example/dbsnp_test.txt \
--out example/gwas_rsid.txt \
--dbchr CHR --dbpos POS --dbA1 REF --dbA2 ALT --dbrsid RSID \
--threads 16
rsidImpu \
--gwas-summary example/gwas_test_clean.txt \
--dbsnp example/dbsnp_test.txt \
--out example/gwas_rsid.txt \
--dbchr CHR --dbpos POS --dbA1 REF --dbA2 ALT --dbrsid RSID \
--log rsidImpu.log
rsidImpu \
--gwas-summary example/gwas_test_clean.txt \
--dbsnp example/dbsnp_test.txt \
--out example/gwas_rsid.cojo.gz \
--dbchr CHR --dbpos POS --dbA1 REF --dbA2 ALT --dbrsid RSID \
--format cojo \
--freq Freq --beta Beta --se SE --n N --pval P \
--remove-dup-snp \
--maf 0.01 \
--threads 16 \
--log rsidImpu.log
rsidImpu --help
Required columns:
- CHR (or custom name)
- POS
- A1
- A2
- P (p-value)
Additional columns needed for COJO format:
- freq
- b
- se
- N
Two supported formats:
Required columns:
- CHR
- POS
- REF
- ALT
- RSID
Automatically parsed as:
CHR SNP CM POS A1 A2
Written to: <out> or <out>.gz
Format depends on --format:
- gwas → original GWAS columns + SNP
- cojo → SNP A1 A2 freq b se p N
Written to: <out>.unmatched or <out>.unmatched.gz
SNP A1 A2 freq b se p N
rs1000 A G 0.37 0.145 0.035 1e-5 50000
rs2000 T C 0.42 -0.080 0.025 2e-3 50000
...
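To show how user-supplied column names (as passed via `--freq`, `--beta`, `--se`, `--n`, `--pval`) could map onto the COJO header, here is a minimal Python sketch. The helper `to_cojo_line`, the `colmap` dictionary, and the tab delimiter are assumptions made for illustration; the delimiter rsidImpu actually writes is not specified here.

```python
# Hypothetical sketch: build one COJO output line from a GWAS row.
COJO_HEADER = ["SNP", "A1", "A2", "freq", "b", "se", "p", "N"]


def to_cojo_line(row, rsid, colmap, sep="\t"):
    """Order the fields as in COJO_HEADER; A1/A2 are copied unchanged."""
    fields = [rsid, row["A1"], row["A2"]]
    fields += [row[colmap[name]] for name in ("freq", "b", "se", "p", "N")]
    return sep.join(str(field) for field in fields)


# Column names as they appear in the user's GWAS file (assumed example).
colmap = {"freq": "Freq", "b": "Beta", "se": "SE", "p": "P", "N": "N"}
row = {"A1": "A", "A2": "G", "Freq": "0.37", "Beta": "0.145",
       "SE": "0.035", "P": "1e-5", "N": "50000"}

print("\t".join(COJO_HEADER))
print(to_cojo_line(row, "rs1000", colmap))
```

Values are kept as strings here so the output reproduces the input text exactly rather than a reformatted number.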
--remove-dup-snp Remove duplicated SNPs (keep lowest P)
--maf <val> MAF threshold (default 0.01)
Auto-QC Remove lines with non-finite freq/beta/se/N/P, or p outside [0,1]
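The three QC steps can be sketched as follows. This is a Python illustration under stated assumptions, not the tool's code: the function names are hypothetical, rows are assumed to be dictionaries with numeric `freq`, `b`, `se`, `N`, `p` values, and the MAF filter is assumed to keep rows where `min(freq, 1 - freq)` is at least the threshold (whether the real comparison is inclusive is not stated here).

```python
# Hypothetical sketch of the QC steps: validity check, MAF filter, de-duplication.
import math


def is_valid(row):
    """True if freq, b, se, N and p are all finite and p lies in [0, 1]."""
    values = [row[key] for key in ("freq", "b", "se", "N", "p")]
    return all(math.isfinite(v) for v in values) and 0.0 <= row["p"] <= 1.0


def passes_maf(row, maf=0.01):
    """Assumed rule: minor allele frequency must be at least the threshold."""
    return min(row["freq"], 1.0 - row["freq"]) >= maf


def dedup_lowest_p(rows):
    """Keep one row per SNP, choosing the one with the lowest p-value."""
    best = {}
    for row in rows:
        kept = best.get(row["SNP"])
        if kept is None or row["p"] < kept["p"]:
            best[row["SNP"]] = row
    return list(best.values())


def run_qc(rows, maf=0.01, remove_dup=True):
    """Apply the validity and MAF filters, then optionally drop duplicates."""
    rows = [r for r in rows if is_valid(r) and passes_maf(r, maf)]
    return dedup_lowest_p(rows) if remove_dup else rows
```

Filtering before de-duplication means an invalid duplicate can never displace a valid row for the same SNP; the order the tool itself uses is an assumption in this sketch.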
- Allele matching allows:
  - swap/flip: A1/A2 ↔ A2/A1
  - strand complement: A↔T, C↔G
- Input alleles are never modified
- Only matched rows are included in the main output
- Gzip input/output is supported automatically based on filename suffix .gz
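Suffix-based gzip handling of this kind can be sketched with Python's standard `gzip` module. The helper name `open_text` is hypothetical and the sketch only mirrors the behaviour described above; rsidImpu itself relies on zlib in C++.

```python
# Hypothetical sketch: choose plain or gzip I/O from the filename suffix.
import gzip


def open_text(path, mode="r"):
    """Open a text file, transparently using gzip when the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, mode + "t", encoding="utf-8")
    return open(path, mode, encoding="utf-8")
```

With this helper, `open_text("example/gwas_rsid.cojo.gz", "w")` would write compressed text, while the same call on `example/gwas_rsid.txt` would write plain text.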
Special thanks to ChatGPT for code assistance and architectural optimization during tool development.
If you have any questions or suggestions, feel free to reach out: