Commit b942368

committed
Update sei docs
1 parent 5449b92 commit b942368

3 files changed

+41
-9
lines changed

docs/beluga.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
=======
2-
Beluga / DeepSEA
2+
DeepSEA (Beluga)
33
=======
44

55
Introduction

docs/index.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -45,7 +45,7 @@ Help topics
4545
tissue-networks
4646
modules
4747
netwas
48-
deepsea
48+
sei
4949
beluga
5050
expecto
5151
citations

docs/sei.rst

Lines changed: 39 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,28 +1,54 @@
11
=======
2-
Sei
2+
Sei / DeepSEA
33
=======
44

55
Introduction
66
------------
77

8-
Sei is a deep-learning-based framework for systematically predicting sequence regulatory activities and applying sequence information to understand human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes. Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone marks, and chromatin accessibility profiles across a wide range of cell types) from the underlying Sei deep learning model. You can also find the Sei code repository here (https://github.com/FunctionLab/sei-framework) or read about our manuscript here (https://www.biorxiv.org/content/10.1101/2021.07.29.454384v1).
8+
Sei is a deep-learning-based framework for systematically predicting sequence regulatory activities and applying sequence information to understand human genetics data. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes. Each sequence class integrates predictions for 21,907 chromatin profiles (transcription factor, histone marks, and chromatin accessibility profiles across a wide range of cell types) from the underlying Sei deep learning model. You can also find the Sei code repository `here <https://github.com/FunctionLab/sei-framework>`_ or read about our manuscript `here <https://www.biorxiv.org/content/10.1101/2021.07.29.454384v1>`_.
99

1010
Sequence class-level variant effects are computed by comparing the predictions for the reference and the alternative alleles. A positive score indicates an increase in sequence class activity by the alternative allele and vice versa. Sequence class-level scores are computed by projecting the 21,907 chromatin profile predictions for the sequence to the unit vector that represents each sequence class.
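The projection described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the chromatin profile predictions and the sequence class vector are plain lists of equal length. The names ``unit``, ``project`` and ``sequence_class_effect`` are hypothetical and not part of the Sei code base, the toy vectors have 3 entries rather than 21,907, and any additional normalization applied by the actual Sei pipeline is not shown.

```python
import math


def unit(vector):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vector]


def project(predictions, class_vector):
    """Scalar projection of profile predictions onto a class's unit vector."""
    if len(predictions) != len(class_vector):
        raise ValueError("predictions and class vector must have equal length")
    direction = unit(class_vector)
    return sum(p * d for p, d in zip(predictions, direction))


def sequence_class_effect(ref_predictions, alt_predictions, class_vector):
    """Variant effect for one sequence class: alternative minus reference."""
    return project(alt_predictions, class_vector) - project(ref_predictions, class_vector)


# Toy data (assumed values, for illustration only).
ref = [0.25, 0.5, 0.0]
alt = [0.75, 0.5, 0.0]
class_vec = [2.0, 0.0, 0.0]

effect = sequence_class_effect(ref, alt, class_vec)
print(effect)  # positive: the alternative allele increases this class's activity
```

A positive value here corresponds to the "increase in sequence class activity by the alternative allele" case in the text, and a negative value to the reverse.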
1111

12-
Input format
13-
------------
12+
For older DeepSEA models see:
13+
:doc:`beluga` (2019)
14+
15+
16+
Input
17+
-----
18+
19+
File formats
20+
~~~~~~~~~~~~
21+
We support three types of input: vcf, fasta, and bed. If you want to predict effects of noncoding variants, use vcf format input. If you want to predict chromatin feature probabilities for DNA sequences, use fasta format. If you want to specify sequences from the human reference genome (GRCh37/hg19), you can use bed format. See below for a quick introduction:
22+
23+
**VCF format** is used for specifying a genomic variant. A minimal example is ``chr1 109817590 - G T`` (if you copy this text as input, you will need to change the spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele. Currently, the genome position needs to be in GRCh37/hg19.
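Since the minimal example is shown with spaces but the input needs tabs, a small helper can do the conversion before submission. This is a sketch under assumptions: ``to_vcf_line`` is a hypothetical name, and the only checks it makes are the five-column layout listed above and a numeric position.

```python
def to_vcf_line(text):
    """Turn a whitespace-separated variant into a tab-separated five-column line."""
    fields = text.split()
    if len(fields) != 5:
        raise ValueError(
            "expected 5 columns (chromosome, position, name, ref, alt), got %d" % len(fields)
        )
    position = fields[1]
    if not position.isdigit():
        raise ValueError("position must be a positive integer: %r" % position)
    return "\t".join(fields)


line = to_vcf_line("chr1 109817590 - G T")
print(repr(line))
```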
24+
25+
**Fasta format** input should include sequences of 4096bp length each. If a sequence is longer than 4096bp, only the center 4096bp will be used.
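The center-cropping rule can be illustrated as follows. This is a minimal sketch, not the server's actual implementation: ``center_crop`` is a hypothetical name, the handling of an odd number of excess bases (the extra base is dropped from the right end) is an assumption, and rejecting sequences shorter than 4096bp is also an assumption, since the text does not say how they are treated.

```python
WINDOW = 4096  # sequence length stated in the documentation


def center_crop(sequence, window=WINDOW):
    """Return the central `window` bases of `sequence`."""
    if len(sequence) < window:
        # Assumption: shorter sequences are rejected rather than padded.
        raise ValueError("sequence is shorter than %d bp" % window)
    start = (len(sequence) - window) // 2
    return sequence[start:start + window]


# Toy sequence: 10 flanking bases on each side of a 4096 bp core.
long_sequence = "A" * 10 + "C" * 4096 + "G" * 10
cropped = center_crop(long_sequence)
print(len(cropped))
```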
26+
27+
**Bed format** provides another way to specify sequences in the human reference genome (hg19). The bed input should specify 4096bp-length regions. A minimal example is ``chr1 109817091 109821186``. The three columns are chromosome, start position, and end position.
28+
29+
Genome coordinates
30+
~~~~~~~~~~~~~~~~~~
31+
We support only ``GRCh37/hg19`` genome coordinates. You can use LiftOver to convert your coordinates to the correct version.
1432

15-
VCF format is used for specifying a genomic variant. A minimal example is chr1 109817590 - G T (if you want to copy cover this text as input, you will need to change spaces to tabs). The five columns are chromosome, position, name, reference allele, and alternative allele. The genome position needs to be in GRCh38/hg38.
33+
Large submissions
34+
~~~~~~~~~~~~~~~~~
35+
We recommend using the web server if you have fewer than 10,000 variants or sequences. You will experience degraded performance when submitting a larger set. In those cases, we suggest that you split the set into multiple submissions of fewer than 10,000 each, run the standalone version on your local machine, or contact our group directly.
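Splitting a large set into smaller submissions might look like the sketch below. The function name ``split_submission`` and the default of 9,999 records per chunk are assumptions chosen to stay under the 10,000 limit mentioned above.

```python
def split_submission(records, max_size=9999):
    """Split a list of variants or sequences into chunks of at most `max_size`."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [records[i:i + max_size] for i in range(0, len(records), max_size)]


# Toy input: 25,000 placeholder variant lines.
variants = ["variant_%d" % i for i in range(25000)]
chunks = split_submission(variants)
print([len(chunk) for chunk in chunks])
```

Each chunk can then be written to its own file and submitted separately.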
36+
37+
38+
Output
39+
------
1640

1741
Sequence classes
18-
------------
42+
~~~~~~~~~~~~~~~~~~~~~~~~~
1943

2044
The Sei framework predicts 40 sequence class scores, covering a wide range of regulatory activities such as cell-type-specific enhancers and promoters, as well as 21,907 chromatin profiles for any DNA sequence.
2145

2246
To help interpretation, we grouped sequence classes into groups including P (Promoter), E (Enhancer), CTCF (CTCF-cohesin binding), TF (TF binding), PC (Polycomb-repressed), HET (Heterochromatin), TN (Transcription), and L (Low Signal) sequence classes. Please refer to our manuscript for a more detailed description of the sequence classes.
2347

48+
Note: sequence class predictions are only available for vcf inputs.
49+
2450
::
25-
51+
2652
| Sequence class label | Sequence class name | Rank by size | Group |
2753
|---------------------:|----------------------------------:|-------------:|------:|
2854
| PC1 | Polycomb / Heterochromatin | 0 | PC |
@@ -65,3 +91,9 @@ To help interpretation, we grouped sequence classes into groups including P (Pro
6591
| TF5 | AR | 37 | TF |
6692
| E12 | Erythroblast-like | 38 | E |
6793
| HET6 | Centromere | 39 | HET |
94+
95+
96+
97+
Regulatory feature scores
98+
~~~~~~~~~~~~~~~~~~~~~~~~~
99+
* **diffs**: The difference between the predicted probabilities of the alternative allele and the reference allele for a regulatory feature (:math:`p_{alt} - p_{ref}`).
