Commit 04705f9

Beluga allows 2k sequence window
1 parent a593e60 commit 04705f9

1 file changed: +46 -20 lines changed

docs/deepsea.rst

Lines changed: 46 additions & 20 deletions
@@ -11,7 +11,7 @@ The current version of DeepSEA, nicknamed '**Beluga**', can predict **2002** chr
 
 Jian Zhou, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K. Wong, and Olga G. Troyanskaya, **Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk**. Nature Genetics (2018).
 
-DeepSEA is described in the following manuscript:
+DeepSEA is originally described in the following manuscript:
 
 Jian Zhou, Olga G. Troyanskaya. **Predicting the Effects of Noncoding Variants with Deep learning-based Sequence Model.** Nature Methods (2015).
 
@@ -20,33 +20,59 @@ To determine if certain features (ie. transcription factors, marks, or cell type
 Input
 -----
 
-DeepSEA predicts genomic variant effects on a wide range of chromatin features at the variant position (Transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types). DeepSEA can also be ultilized for predicting chromatin features for any DNA sequence.
+DeepSEA predicts genomic variant effects on a wide range of chromatin features at the variant position (Transcription factors binding, DNase I hypersensitive sites, and histone marks in multiple human cell types). DeepSEA can also be utilized for predicting chromatin features for any DNA sequence.
 
 File formats
 ~~~~~~~~~~~~
 We support three types of input: vcf, fasta, bed. If you want to predict effects of noncoding variants, use vcf format input. If you want to predict chromatin feature probabilities for DNA sequences, use fasta format. If you want to specify sequences from the human reference genome (GRCh37/hg19), you can use bed format. See below for a quick introduction:
 
 **VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you want to copy cover this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele.
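Review note: the space-to-tab conversion mentioned in the VCF paragraph can be sketched in a few lines of Python. The helper name `to_vcf_line` is hypothetical and not part of DeepSEA; it only checks that the five documented columns are present and that the position is an integer.

```python
def to_vcf_line(text: str) -> str:
    """Convert a whitespace-separated variant record into a tab-delimited line.

    Hypothetical helper for illustration only. Column order follows the
    documentation: chromosome, position, name, reference allele,
    alternative allele.
    """
    fields = text.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    int(fields[1])  # raises ValueError if the position is not an integer
    return "\t".join(fields)


line = to_vcf_line("chr1 109817590 - G T")
print(repr(line))
```

The result can be written to a file as-is, one variant per line.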
 
-**Fasta format** input should include sequences of 1000bp length each. If a sequence is longer than 1000bp, only the center 1000bp will be used. A minimal example is ::
+**Fasta format** input should include sequences of 2000bp length each. If a sequence is longer than 2000bp, only the center 2000bp will be used. A minimal example is ::
 
 >TestSequence
-TATCTCTCATGTTTCTGGTATAGATGGTATATATGTTAATCTTGTTCCTGAGGTCTGTTTTTTATTTTTGTCATTAAAGT
-GGGAATTAAATAGTTTTGTAGTGCATATAAATTAAAGAAAAAGTTCACATAAGCATATTTGCCAATCATCTCAAAATGCT
-ATATTCTCCTTCACGGTTTTGAAAATAATTCAGGGTTTTCTCTTCCTCATTGCTTTCCCACCAACTGACAGTATTATTTT
-CTTAGTCATTTTACTGACCTTTGAAATTACTCCTTTGAGGTCTTCTAAAAAATTTTATGGGCTCTGCTGCTTTTTGGTGG
-CCTCCTTGTATCATTTATTCTATTACAGGACGACTTACAAAAGGAAGCACATAAATTGACCCATATACATATCCTATCAT
-TGGGGAGTTTCTGTGCAAATGTTATTTATTGGAAGCTATTACTAAGAATTGTAAGAAAAATAATTGGTATTGATGCAGCT
-AGTATGGTTCCTGTAATTATCGTACTCAGCCACGTAAATCATAGCTATATGTAGCCAAAGATCCATGAACAAAATTTCCA
-GTAACATCATTATAATTCAAAAGGCAGACTTTCAGAACCAGACAGACTTGAATTTAAATTCTAGCTTTACCACACATGAA
-TTTAACCTTGTGGAAGGTTAACCTATCTAAACTCATGTTTCTTCATTGGTAGCTGATAAAATTAAGGATCATGTATATAA
-CCACCTAGTAGAGTTGTTTAAGAAACTGTTAGAATTCCATAAATTGTTAGTATTAATGAGTTTTTGTTGGACATGTGTTA
-GGCTAGGCCACTCCTTGACCTTCATAGAGGTATGGATTATGACACAAATTCTAAACTGTAGGTAGGCATGGCTTTGTAGC
-AAGTATTAAAATAGTAAATATTTTATTTTTATAAGATAAATGTAAACCTTTTAAAAGTTTCATTACATTTGTATTTATGA
-AATATCATCCTATATCAACTATAGAGAGAAGATCGCAAGA
-
-
-**Bed format** provides another way to specify sequences in human reference genome (hg19). The bed input should specify 1000bp-length regions. A minimal example is ``chr1 109817091 109818090``. The three columns are chromosome, start position, and end position.
+TGGGATTACAGGCGTGAGCCACCGCGCCCGGCCCATTGTACCATTCTTAT
+GCCTTTGCGTCCTCATAGCTTAGCTCCCGTATATCAGTGAGAACATACTA
+TGTTTGGTTTTCCATACCCGAGTTACTTCACTTAGAATAATAGTCTCCAA
+TTTCATCCAGGTCAGTGCAAATGCGTTAATTCGTTCCTTTTATGGCTGAG
+TAGTATTCCATCATATATATATACTACAGTTTCTTTATCCACTCGTAAAT
+TGATGGGCATTTGTGTTGGAACACTTCTCCACTGCTGGTGGGAATGTAAA
+TTAGTGCAGCCACTATGGATAACAGTGTGGAGATTTGTTAAAGAACTAAA
+ACTAGAACTACCATTTGATCCAGCAATCCCACTACTGGGTATCTACCCAG
+AAGAAAAGAAGTCATTATTTGAAAAAGATACTTGCACGGGCATGTTTATA
+GCAGCACAATTCACAATTGTAGTTGTATTTCTTTAAGCGTGTCTTTTCAA
+TATCTCTCATGTTTCTGGTATAGATGGTATATATGTTAATCTTGTTCCTG
+AGGTCTGTTTTTTATTTTTGTCATTAAAGTGGGAATTAAATAGTTTTGTA
+GTGCATATAAATTAAAGAAAAAGTTCACATAAGCATATTTGCCAATCATC
+TCAAAATGCTATATTCTCCTTCACGGTTTTGAAAATAATTCAGGGTTTTC
+TCTTCCTCATTGCTTTCCCACCAACTGACAGTATTATTTTCTTAGTCATT
+TTACTGACCTTTGAAATTACTCCTTTGAGGTCTTCTAAAAAATTTTATGG
+GCTCTGCTGCTTTTTGGTGGCCTCCTTGTATCATTTATTCTATTACAGGA
+CGACTTACAAAAGGAAGCACATAAATTGACCCATATACATATCCTATCAT
+TGGGGAGTTTCTGTGCAAATGTTATTTATTGGAAGCTATTACTAAGAATT
+GTAAGAAAAATAATTGGTATTGATGCAGCTAGTATGGTTCCTGTAATTAT
+CGTACTCAGCCACGTAAATCATAGCTATATGTAGCCAAAGATCCATGAAC
+AAAATTTCCAGTAACATCATTATAATTCAAAAGGCAGACTTTCAGAACCA
+GACAGACTTGAATTTAAATTCTAGCTTTACCACACATGAATTTAACCTTG
+TGGAAGGTTAACCTATCTAAACTCATGTTTCTTCATTGGTAGCTGATAAA
+ATTAAGGATCATGTATATAACCACCTAGTAGAGTTGTTTAAGAAACTGTT
+AGAATTCCATAAATTGTTAGTATTAATGAGTTTTTGTTGGACATGTGTTA
+GGCTAGGCCACTCCTTGACCTTCATAGAGGTATGGATTATGACACAAATT
+CTAAACTGTAGGTAGGCATGGCTTTGTAGCAAGTATTAAAATAGTAAATA
+TTTTATTTTTATAAGATAAATGTAAACCTTTTAAAAGTTTCATTACATTT
+GTATTTATGAAATATCATCCTATATCAACTATAGAGAGAAGATCGCAAGA
+AGGCAGTGGCAGCAGAGGCTCCAGTTAGGAGGCTACTAGTCCAAATACAT
+TGCGATAAAAACTTGGCAAAAGGTGCTGGTAGTCTGATGAAATAAAGTAG
+ATAAATTTTAGAGGTATTTATAAAATAATTAAAGAATATTCAATAATAGG
+AGATATATTACCCAATAGAGTGGAGATTCAAAGATAACTCCGAAAGTTTT
+TTGCTAAAGCAACATTTGGCTGTGCTATCATTTACTAAGAAAGACAACAA
+GAGAGTAAAATCAAGTTTGAGGATGAAGTGAATTTATTCCTTTTTGATTG
+ATACATAATTGACATGTAATAAAACCCACAATGTTAAGAGTTCGGTTTGA
+TGTGCTTGACTATTTTAGGCACTGGTGTTATCACAACACAAGACAACAGA
+TAGGACATTCTCAGAAAATTTTTTCATGTCCCTTTCCAGTCAGTTTCAAG
+CCTTCTTTCCATGCAATAATTTTCTCACTTTGCCATTCTAGTAGGTGTGA
+
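Review note: the center-cropping rule stated in the updated Fasta paragraph ("only the center 2000bp will be used") can be illustrated with a minimal Python sketch. This is not DeepSEA's implementation. The function name `center_crop`, the rounding choice when the surplus length is odd, and the error raised for short sequences are all assumptions, since the documentation does not specify them.

```python
WINDOW = 2000  # sequence window length stated in the updated documentation


def center_crop(seq: str, window: int = WINDOW) -> str:
    """Return the central `window` bases of `seq`.

    Assumptions (not specified by the documentation):
    - when the surplus length is odd, the extra base is dropped from the
      right-hand side;
    - sequences shorter than `window` are rejected with a ValueError.
    """
    if len(seq) < window:
        raise ValueError(f"sequence has {len(seq)} bp, need at least {window}")
    start = (len(seq) - window) // 2
    return seq[start:start + window]


# A 3000 bp toy sequence: 500 flanking bases on each side of a 2000 bp core.
toy = "A" * 500 + "C" * 2000 + "G" * 500
cropped = center_crop(toy)
print(len(cropped))
```

A sequence that is already exactly 2000 bp long passes through unchanged.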
+**Bed format** provides another way to specify sequences in human reference genome (hg19). The bed input should specify 2000bp-length regions. A minimal example is ``chr1 109817091 109819090``. The three columns are chromosome, start position, and end position.
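Review note: the arithmetic behind the updated bed example is worth spelling out. The example region spans 2000 bp only if both the start and end positions are counted (109819090 - 109817091 + 1 = 2000). Under the standard BED convention of a 0-based start and an exclusive end, the same line would describe 1999 bp. The Python sketch below assumes the inclusive reading, which is an inference from the example rather than something this paragraph states; the documentation's "Genome coordinates" section is the authority. The helper name `bed_region_length` is hypothetical.

```python
def bed_region_length(line: str) -> int:
    """Length of a region given as 'chrom start end', counting both endpoints.

    Assumption: start and end are both inclusive, which is the reading under
    which the documented example is 2000 bp long.
    """
    _chrom, start, end = line.split()[:3]
    return int(end) - int(start) + 1


new_example = bed_region_length("chr1 109817091 109819090")
old_example = bed_region_length("chr1 109817091 109818090")
print(new_example, old_example)
```

The same check applied to the pre-change example gives the previous 1000 bp window.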
 
 Genome coordinates
 ~~~~~~~~~~~~~~~~~~
@@ -85,4 +111,4 @@ Perform "In silico saturated mutagenesis" (ISM) analysis to discover informative
 
 Note that ISM only accepts a sequence (FASTA file) as input.
 
-ISM outputs effects for each of three possible substitutions of all 1000 bases, across all chromatin features.
+ISM outputs effects for each of three possible substitutions of all 2000 bases, across all chromatin features.
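Review note: with the window widened to 2000 bases, ISM covers 3 x 2000 = 6000 single-base substitutions per input sequence. The Python sketch below only enumerates those substitutions to make the count concrete; it does not score them and is not DeepSEA's implementation. The function name `single_base_substitutions` is hypothetical, and the sequence is assumed to contain only uppercase A, C, G, and T.

```python
BASES = "ACGT"


def single_base_substitutions(seq: str):
    """Yield (index, reference base, alternative base, mutated sequence).

    For every position, each of the three bases that differ from the
    reference base is substituted in turn. Assumes `seq` contains only
    uppercase A, C, G, T.
    """
    for i, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield i, ref, alt, seq[:i] + alt + seq[i + 1:]


toy_sequence = "ACGT" * 500  # 2000 bases of placeholder sequence
mutations = list(single_base_substitutions(toy_sequence))
print(len(mutations))
```

Each mutated sequence would then be scored against the reference for every chromatin feature, which is the part this sketch leaves out.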
