FAQ

From PastDB

Latest revision as of 11:21, 17 January 2021

How do I cite PastDB?

If you use data from PastDB, please cite our paper in Genome Biology:

Martín, G., Márquez, Y., Duque, P., Irimia, M. (2021). Alternative splicing landscapes in Arabidopsis thaliana across tissues and stress conditions highlight major functional differences with animals. Genome Biol, 22:35.


How is the inclusion level (PSI) of a given AS event quantified?

AS event quantification is performed using vast-tools. vast-tools uses different modules to quantify cassette exons, microexons, alternative 5' and 3' splice sites and intron retention (reflected in the 'vast-tools module' field in the ‘VastDB Features’ section of each event). For detailed information about how the quantification works, please refer to the Supplementary Information of Irimia et al., Cell 2014. Current inclusion data in PastDB corresponds to vast-tools v2.5.1.


How is the gene expression (GE) of a given gene quantified?

GE quantification is also performed using vast-tools. vast-tools maps the first 50 nucleotides of the forward read (if longer and paired end) to a library with one reference transcript per gene. GE levels are provided using the cRPKM metric (corrected [for mappability] Reads Per Kilobasepair and Million mapped reads), as detailed in Labbé et al., Stem Cells 2012. cRPKM values can be converted to TPM by applying the following formula: TPM = 10^6 * cRPKM/sum_all(cRPKM), where sum_all(cRPKM) is the sum of cRPKM over all genes in the sample. Moreover, vast-tools can provide tables with TPMs and raw counts.
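
As an illustration of the formula above, the following minimal Python sketch converts the cRPKM values of one sample to TPM. The function name and the gene names are hypothetical and are not part of vast-tools; the sketch also assumes that every gene in the sample has a numeric cRPKM value.

```python
def crpkm_to_tpm(crpkm_by_gene):
    """Convert the per-gene cRPKM values of one sample to TPM.

    Implements TPM = 10^6 * cRPKM / sum_all(cRPKM), where the sum
    runs over all genes of the same sample.
    """
    total = sum(crpkm_by_gene.values())
    if total <= 0:
        raise ValueError("the sum of cRPKM values must be positive")
    return {gene: 1e6 * value / total for gene, value in crpkm_by_gene.items()}


# Hypothetical sample with three genes (illustrative values only).
sample = {"geneA": 10.0, "geneB": 30.0, "geneC": 60.0}
tpm = crpkm_to_tpm(sample)
```

By construction, the TPM values of a sample add up to one million, which makes them comparable across samples.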


What AS events are displayed in PastDB?

PastDB contains information for all AS events detected and quantified in vast-tools. However, only a selection of them is displayed in the UCSC track and on the Gene page. These are the events with the highest PSI variation across samples. If you are interested in an event that is not displayed, you can look for it directly using the search box on the main page.


What do the colors and block thickness in the UCSC track mean?

The colors indicate the different types of AS events, whereas the block thickness indicates the type of sequence.

  • For any individual cassette exon event (including microexons), the C1, A and C2 exons are each represented. The alternative exon (A) is thus the exon in between.
    • Blue: simple cassette exon. “Simple” cassette exons are those for which ≥95% of the reads used to quantify their PSI come from the three reference exon-exon junctions (C1A, AC2 and C1C2). Corresponds to “S” or “MIC_S” in ‘Average complexity’.
    • Purple: cassette exon event of intermediate complexity, for which ≥50% and <95% of the reads used to quantify their PSI come from the three reference exon-exon junctions. Corresponds to “C1” or “C2” in ‘Average complexity’.
    • Red: complex cassette exon event, for which <50% of the reads used to quantify their PSI come from the three reference exon-exon junctions. Corresponds to “C3”, “ME” or “MIC_M” in ‘Average complexity’.
    • Black: groups multiple neighboring cassette exon events. Black tracks are purely informative and do not link to any page in PastDB.
  • For Intron Retention events: Orange track. Thick blocks correspond to the intronic sequence, and thin blocks to the adjoining exons (C1 and C2).
  • For Alternative 3' and 5' splice site choice events: Dark Green and Light Green, respectively. In both cases, the thick block corresponds to the alternative sequence, whereas the thin blocks are the constant exonic sequences (C1 and C2). For these events, at least two tracks are shown: one for sequence exclusion (the most internal splice site; EventID-1/N) and one for sequence inclusion.
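
The correspondence between the ‘Average complexity’ codes and the cassette exon track colors listed above can be summarized as a lookup table. This is only an illustrative Python sketch; the names COMPLEXITY_TO_COLOR and cassette_track_color are hypothetical and do not belong to PastDB or vast-tools.

```python
# 'Average complexity' code -> color of the cassette exon track,
# as described in the list above.
COMPLEXITY_TO_COLOR = {
    "S": "blue",       # simple cassette exon
    "MIC_S": "blue",   # simple microexon
    "C1": "purple",    # intermediate complexity
    "C2": "purple",
    "C3": "red",       # complex event
    "ME": "red",
    "MIC_M": "red",
}


def cassette_track_color(average_complexity):
    """Return the track color for a cassette exon complexity code."""
    try:
        return COMPLEXITY_TO_COLOR[average_complexity]
    except KeyError:
        raise ValueError(f"unknown complexity code: {average_complexity!r}")
```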


How are the splice site scores calculated?

These scores were calculated using score5.pl and score3.pl from Yeo and Burge, 2004. This method uses a position weight matrix and calculates deviation from the consensus. For 5’ splice sites, three exonic and six intronic positions surrounding the exon-intron junction were analyzed, and for the 3’ splice sites, 20 intronic and 3 exonic positions were analyzed.
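
To make the analyzed windows concrete, the sketch below extracts the 9-nt window of a 5’ splice site (3 exonic + 6 intronic positions) and the 23-nt window of a 3’ splice site (20 intronic + 3 exonic positions) from a sequence. It does not reimplement the scoring performed by score5.pl and score3.pl. The function names, the 0-based coordinate convention and the toy sequence are assumptions made for illustration.

```python
def five_prime_window(seq, intron_start):
    """Return 3 exonic + 6 intronic nt around a 5' splice site.

    seq is given 5'->3' on the transcript strand; intron_start is the
    0-based index of the first intronic nucleotide (assumed convention).
    """
    start, end = intron_start - 3, intron_start + 6
    if start < 0 or end > len(seq):
        raise ValueError("window extends beyond the sequence")
    return seq[start:end]


def three_prime_window(seq, exon_start):
    """Return 20 intronic + 3 exonic nt around a 3' splice site.

    exon_start is the 0-based index of the first nucleotide of the
    downstream exon (assumed convention).
    """
    start, end = exon_start - 20, exon_start + 3
    if start < 0 or end > len(seq):
        raise ValueError("window extends beyond the sequence")
    return seq[start:end]


# Toy example (hypothetical sequence): exon, intron, exon.
upstream_exon = "ACGCAG"
intron = "GTAAGT" + "ACTGACTAAC" + "TTTCTTTTCTTTTTCCTTACAG"
downstream_exon = "GTCAAA"
seq = upstream_exon + intron + downstream_exon

donor = five_prime_window(seq, len(upstream_exon))
acceptor = three_prime_window(seq, len(upstream_exon) + len(intron))
```

In this toy example the donor window is CAGGTAAGT, and the acceptor window consists of the last 20 intronic nucleotides (ending in AG) followed by the first three exonic nucleotides.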


How is the impact on the ORF predicted?

The pipeline to predict ORF impact is described in Irimia et al., 2014. Several things must be kept in mind when using this information as is:

  • The prediction is based on the impact that the specific alternative sequence is likely to have when included or excluded from the transcript in isolation. That is, if there are other associated AS events (e.g. mutually exclusive or coordinated exons) the prediction may not be accurate.
  • We keep improving and polishing these annotations, and new versions are often released. Make sure you use the most up-to-date version.
  • Like any other prediction, our annotations may be inaccurate. Please check your results carefully and, as with any other dataset in PastDB, use at your own risk.


How should I interpret the domain information?

Domain information is currently available for cassette exons as well as for adjacent constitutive regions for INT, ALTA and ALTD events. When an exon (either C1, A or C2) overlaps a PROSITE or PFAM domain, the following information is shown:

Dom_ID = Dom_Name = Type_Overlap(%Dom_Overlap = %Exon_Overlap)


The meaning of each field is explained below:

  • Dom_ID: Domain ID in either PROSITE or PFAM databases. For PROSITE, domains with ID P0* (high frequency motifs) are excluded.
  • Dom_Name: Domain name as provided by PROSITE or PFAM databases.
  • Type_Overlap: There are four possible ways in which an exon can overlap a protein domain:
    • The whole exonic sequence fully overlaps with a domain (FE, Full Exon).
    • The whole domain is fully encoded within an exon (WD, Whole Domain).
    • The upstream (5') of the exon overlaps the domain (PU, Partial Upstream).
    • The downstream (3') of the exon overlaps the domain (PD, Partial Downstream).
  • %Dom_Overlap: percent of the domain encoded by the exon.
  • %Exon_Overlap: percent of the exon that overlaps the domain.
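
The fields described above can be read programmatically. The following Python sketch parses one annotation written in the format shown above; the function and class names are hypothetical, the pattern tolerates optional spaces around "=" and an optional "%" sign (an assumption about how the field is rendered), and the example values are invented for illustration.

```python
import re
from typing import NamedTuple


class DomainOverlap(NamedTuple):
    dom_id: str
    dom_name: str
    type_overlap: str        # FE, WD, PU or PD
    pct_dom_overlap: float   # percent of the domain encoded by the exon
    pct_exon_overlap: float  # percent of the exon overlapping the domain


# Dom_ID = Dom_Name = Type_Overlap(%Dom_Overlap = %Exon_Overlap)
_PATTERN = re.compile(
    r"^\s*(?P<dom_id>[^=]+?)\s*=\s*(?P<dom_name>[^=]+?)\s*=\s*"
    r"(?P<type>FE|WD|PU|PD)\s*\(\s*"
    r"(?P<dom>\d+(?:\.\d+)?)%?\s*=\s*(?P<exon>\d+(?:\.\d+)?)%?\s*\)\s*$"
)


def parse_domain_overlap(text):
    """Parse one domain annotation string into its five fields."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized domain annotation: {text!r}")
    return DomainOverlap(
        match["dom_id"],
        match["dom_name"],
        match["type"],
        float(match["dom"]),
        float(match["exon"]),
    )


# Hypothetical annotation (values are illustrative only).
example = parse_domain_overlap("PF00076 = RRM_1 = PD(45.2 = 80.0)")
```

In this hypothetical example, the downstream part of the exon overlaps the domain (PD), the exon encodes 45.2% of the domain, and 80.0% of the exon lies within it.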

How are the primers for RT-PCR validation designed?

Primers are designed automatically using Primer3 (optimal primer length = 21 nt; optimal Tm = 61 °C). As a general rule, primers are located in the C1 and C2 exonic sequences, so two RT-PCR products will be produced: a shorter one (from C1 to C2, skipping the A sequence) and a longer one (including the A sequence). Their sizes are provided in ‘Band lengths’. To minimize PCR amplification bias towards shorter amplicons (i.e. over-representation of the skipping form) and, at the same time, optimize the visualization in agarose gels, primers are designed based on the size relationship between the two predicted amplicons, where LE is the length of the alternative sequence. This is based on the following rules:

  • Alternative sequence LE < 15 nt => optimal skipping band size = 100 nt.
  • Alternative sequence 15 ≤ LE < 25 nt => optimal skipping band size = 110 nt.
  • Alternative sequence 25 ≤ LE < 40 nt => optimal skipping band size = 120 nt.
  • Alternative sequence 40 ≤ LE < 65 nt => optimal skipping band size = 140 nt.
  • Alternative sequence 65 ≤ LE < 100 nt => optimal skipping band size = 175 nt.
  • Alternative sequence 100 ≤ LE < 200 nt => optimal skipping band size = 250 nt.
  • Alternative sequence 200 ≤ LE < 300 nt => optimal skipping band size = 300 nt.
  • Alternative sequence 300 ≤ LE < 1000 nt => optimal skipping band size = 350 nt.
  • Alternative sequence LE > 1000 nt => primers not designed. A three-primer strategy is recommended.
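
The rules above amount to a simple lookup from the length of the alternative sequence (LE) to the optimal size of the skipping band. A minimal Python sketch is given below; the function name is hypothetical, and treating an LE of exactly 1000 nt as "primers not designed" is an assumption, since the rules above do not cover that value.

```python
# (exclusive upper bound of LE in nt, optimal skipping band size in nt)
_BAND_RULES = [
    (15, 100),
    (25, 110),
    (40, 120),
    (65, 140),
    (100, 175),
    (200, 250),
    (300, 300),
    (1000, 350),
]


def optimal_skipping_band_size(le):
    """Return the optimal skipping band size (nt) for an alternative
    sequence of length le, or None when primers are not designed."""
    if le < 0:
        raise ValueError("LE must be non-negative")
    for upper_bound, band_size in _BAND_RULES:
        if le < upper_bound:
            return band_size
    # LE >= 1000 nt (assumption for LE == 1000): no primers are
    # designed and a three-primer strategy is recommended instead.
    return None
```

For example, a 27-nt microexon falls in the 25 ≤ LE < 40 class, so the primers are placed to give a skipping band of about 120 nt and an inclusion band of about 147 nt.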


What are the quality scores (QC) in the PSI plots?

As provided by vast-tools; from the README: Quality scores, and number of corrected inclusion and exclusion reads (qual@inc,exc):

  • Score 1: Read coverage, based on actual reads (as used in Irimia et al., Cell 2014):
    • For EX: OK/LOW/VLOW: (i) ≥20/15/10 actual reads (i.e. before mappability correction) mapping to all exclusion splice junctions, OR (ii) ≥20/15/10 actual reads mapping to one of the two groups of inclusion splice junctions (upstream or downstream the alternative exon), and ≥15/10/5 to the other group of inclusion splice junctions.
    • For EX (microexon module): OK/LOW/VLOW: (i) ≥20/15/10 actual reads mapping to the sum of exclusion splice junctions, OR (ii) ≥20/15/10 actual reads mapping to the sum of inclusion splice junctions.
    • For INT: OK/LOW/VLOW: (i) ≥20/15/10 actual reads mapping to the sum of skipping splice junctions, OR (ii) ≥20/15/10 actual reads mapping to one of the two inclusion exon-intron junctions (the 5' or 3' of the intron), and ≥15/10/5 to the other inclusion splice junctions.
    • For ALTD and ALTA: OK/LOW/VLOW: (i) ≥40/20/10 actual reads mapping to the sum of all splice junctions involved in the specific event.
    • For any type of event: SOK: same thresholds as OK, but a total number of reads ≥100.
    • For any type of event: N: does not meet the minimum threshold (VLOW).
  • Score 2: Read coverage, based on corrected reads (similar values as per Score 1).
  • Score 3: Read coverage, based on uncorrected reads mapping only to the reference C1A, AC2 or C1C2 splice junctions (similar values as per Score 1). Always NA for intron retention events.
  • Score 4: Imbalance of reads mapping to inclusion splice junctions (only for exon skipping events quantified by the splice site-based or transcript-based modules; For intron retention events, numbers of reads mapping to the upstream exon-intron junction, downstream intron-exon junction, and exon-exon junction in the format A=B=C)
    • OK: the ratio between the total number of reads supporting inclusion for splice junctions upstream and downstream the alternative exon is < 2.
    • B1: the ratio between the total number of reads supporting inclusion for splice junctions upstream and downstream the alternative exon is > 2 but < 5.
    • B2: the ratio between the total number of reads supporting inclusion for splice junctions upstream and downstream the alternative exon is > 5.
    • Bl/Bn: low/no read coverage for splice junctions supporting inclusion.
  • Score 5: Complexity of the event (only for exon skipping events quantified by the splice site-based or transcript-based modules); For intron retention events, p-value of a binomial test of balance between reads mapping to the upstream and downstream exon-intron junctions, modified by reads mapping to a 200-bp window in the centre of the intron (see Braunschweig et al., 2014).
    • S: percent of complex reads (i.e. those inclusion- and exclusion-supporting reads that do not map to the reference C1A, AC2 or C1C2 splice junctions) is < 5%.
    • C1: percent of complex reads is > 5% but < 20%.
    • C2: percent of complex reads is > 20% but < 50%.
    • C3: percent of complex reads is > 50%.
    • NA: low coverage event.
  • inc,exc: total number of reads, corrected for mappability, supporting inclusion and exclusion.
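
Some of the thresholds listed above can be expressed as small classification functions. The Python sketch below covers Score 1 for ALTD and ALTA events, the read-ratio classes of Score 4 and the complexity classes of Score 5. It is not the vast-tools implementation: the function names are hypothetical, and the treatment of values that fall exactly on a boundary (e.g. a ratio of exactly 2, or exactly 5% complex reads) is an assumption, because the descriptions above leave those cases open.

```python
def classify_alt_coverage(total_reads):
    """Score 1 for ALTD/ALTA events from the number of actual reads
    mapping to all splice junctions involved in the event."""
    if total_reads >= 100:
        return "SOK"
    if total_reads >= 40:
        return "OK"
    if total_reads >= 20:
        return "LOW"
    if total_reads >= 10:
        return "VLOW"
    return "N"


def classify_balance(ratio):
    """Score 4 class from the ratio between reads supporting inclusion
    at the upstream and downstream junctions (larger over smaller)."""
    if ratio < 2:
        return "OK"
    if ratio < 5:      # assumption: a ratio of exactly 2 is reported as B1
        return "B1"
    return "B2"        # assumption: a ratio of exactly 5 is reported as B2


def classify_complexity(pct_complex_reads):
    """Score 5 class from the percent of complex reads."""
    if pct_complex_reads < 5:
        return "S"
    if pct_complex_reads < 20:   # assumption: exactly 5% is reported as C1
        return "C1"
    if pct_complex_reads < 50:   # assumption: exactly 20% is reported as C2
        return "C2"
    return "C3"                  # assumption: exactly 50% is reported as C3
```

The low-coverage classes (Bl/Bn for Score 4 and NA for Score 5) are not handled here, because the read thresholds that trigger them are not stated above.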


Where does the PastDB logo come from?

The image depicts a pair of alternative splice acceptor sites (yellow) as the bridge between a seedling and a mature plant, representing plant development. The image is an original design by Yamile Márquez.