CeNGEN splicing data portal


Note: this website uses Google Analytics to monitor total numbers of connections. The exact pages visited (gene/neurons etc) are not recorded.


Documentation

Differential splicing (local quantification)

Splicing was quantified using MAJIQ (see Vaquero-Garcia et al., 2016 for detailed explanations). Briefly, splice junctions (SJs) starting from the same exon (or arriving into the same exon) are grouped into a Local Splicing Variation (LSV). For each LSV, the relative usage of each possible SJ is quantified, and a Percent Selected Index (PSI) is estimated with a Bayesian model. A canonical exon skipping event is thus represented as two LSVs. The first, upstream, contains two SJs leaving the upstream exon: one links it to the cassette exon, the other skips the cassette and links it to the downstream exon. The second, downstream, contains two SJs arriving into the downstream exon: one from the upstream exon, the other from the cassette exon.
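To make the idea of relative junction usage concrete, here is a minimal sketch in Python. It computes a plain read-count ratio per junction, which is only an approximation: MAJIQ estimates PSI with a Bayesian model, not with this formula. The function name, junction labels, and read counts are all invented for illustration.

```python
from typing import Dict


def naive_psi(junction_reads: Dict[str, int]) -> Dict[str, float]:
    """Fraction of reads supporting each junction within one LSV.

    This is a plain ratio, NOT MAJIQ's Bayesian estimate.
    """
    total = sum(junction_reads.values())
    if total == 0:
        # No reads at all: the LSV cannot be quantified
        # (the portal leaves the violin plot blank in that case).
        return {}
    return {sj: n / total for sj, n in junction_reads.items()}


# Cassette exon skipping seen from the upstream exon (first LSV).
source_lsv = {"upstream->cassette": 30, "upstream->downstream": 10}
# The same event seen from the downstream exon (second LSV).
target_lsv = {"cassette->downstream": 28, "upstream->downstream": 10}

print(naive_psi(source_lsv))
print(naive_psi(target_lsv))
```

Note that the skipping junction appears in both LSVs, but its relative usage can differ between them because the denominators differ.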

The portal generates one page for each gene. At the top, a splice graph displays all exons and SJs. The splice graph can be displayed for individual neuron types or samples to see the number of reads spanning each SJ. Below, each LSV is represented schematically, and violin plots display the PSI of each SJ in each neuron type. In many cases, there were insufficient reads to quantify an SJ in a given neuron (e.g. when the gene is not expressed in that neuron), and the violin plot is left blank.

Applications and limitations

This visualization is most appropriate to establish differential splicing across neuron types.

This quantification only considers an LSV when there are differences in SJ. Thus, two isoforms distinguished only by the number of exons but with no branch in the splice graph will not appear differentially spliced.

Technical considerations

The MAJIQ quantifications and the VOILA server used for display are described in Vaquero-Garcia et al. (2016, 2021).

Source code and Github issues

The code used to process the data with MAJIQ and VOILA is available at https://github.com/cengenproject/majiq

The VOILA web app is described by its authors: https://majiq.biociphers.org/

References

Jorge Vaquero-Garcia, Alejandro Barrera, Matthew R Gazzara, Juan González-Vallinas, Nicholas F Lahens, John B Hogenesch, Kristen W Lynch, Yoseph Barash, A new view of transcriptome complexity and regulation through the lens of local splicing variations, eLife 2016;5:e11752 DOI:10.7554/eLife.11752

Jorge Vaquero-Garcia, Joseph K. Aicher, Paul Jewell, Matthew R. Gazzara, Caleb M. Radens, Anupama Jha, Christopher J. Green, Scott S. Norton, Nicholas F. Lahens, Gregory R. Grant, Yoseph Barash, RNA splicing analysis using heterogeneous and large RNA-seq datasets, bioRxiv 2021.11.03.467086 DOI: 10.1101/2021.11.03.467086

Differential splicing between neurons (local quantification)

The local quantification is the same as in the “Differential splicing by gene” module. However, that module provides one page per gene, and does not provide functionality to discover genes varying between neurons.

This app lets the user enter neurons of interest and returns the events and genes that appear differentially alternatively spliced (DAS).

Usage

In the “neuron pair” tab, one can input two neurons, and we use the statistics from MAJIQ to select events that are DAS.

In the “neuron sets” tab, one can input two sets of neurons. We then recover the PSI of each junction in the neurons of these sets and perform a t-test. As the t-test tends to be robust, the results will usually point to a set of DAS events for further analysis. However, depending on the exact neuron sets used as input, the test may fail to correctly control false discoveries. Thus, the resulting list of DAS events should not be considered the “true” set of DAS events and may be misleading. In particular, any conclusion based on the size of, or the overlap between, the sets of events returned here is not valid.
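As an illustration of the “neuron sets” comparison, the sketch below computes a two-sample t statistic on the PSI values of one junction. It assumes Welch's unequal-variance form, which may not be the variant used by the app, and it stops at the statistic and degrees of freedom without deriving a p-value. The PSI values are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(psi_a, psi_b):
    """Welch's t statistic and approximate degrees of freedom
    for the PSI of one junction in two sets of neurons."""
    na, nb = len(psi_a), len(psi_b)
    if na < 2 or nb < 2:
        raise ValueError("each set needs at least two neurons")
    # Squared standard error of the mean for each set.
    va = variance(psi_a) / na
    vb = variance(psi_b) / nb
    if va + vb == 0:
        raise ValueError("no variance in either set")
    t = (mean(psi_a) - mean(psi_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Invented PSI values for one junction in two sets of three neurons.
set_a = [0.90, 0.80, 0.85]
set_b = [0.20, 0.30, 0.25]
t, df = welch_t(set_a, set_b)
print(f"t = {t:.2f}, df = {df:.1f}")
```

With sets this small, a large t statistic is easy to obtain by chance across thousands of junctions, which is one reason the returned list should be treated as a starting point only.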

Event names

MAJIQ provides event IDs of the form “WBGene00006064:t:5311643-5311763”. These identifiers are unique and robust, and should be the main way to describe an event. However, they are impractical to reason with and to cross-reference between several neuron types. For convenience, we also attribute a name to each event, e.g. “Quantasha”. These names were randomly attributed to each event and may change in future updates, but they are consistent within the application. They are names that were given to at least 5 babies in the United States between 1880 and 2017, as provided by the R package {babynames}.
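When handling event IDs programmatically, it can help to split them into their parts. The sketch below assumes the three colon-separated fields are the gene ID, the LSV type (“s” for source, “t” for target) and the coordinates of the reference exon; this reading follows the usual MAJIQ convention but is not stated on this page. The helper names are hypothetical, and special coordinate values are not handled.

```python
from typing import NamedTuple


class LsvId(NamedTuple):
    gene_id: str
    lsv_type: str  # assumed: "s" (source) or "t" (target)
    start: int
    end: int


def parse_lsv_id(event_id: str) -> LsvId:
    """Split an ID such as 'WBGene00006064:t:5311643-5311763'."""
    # Split from the right so the last two fields are always
    # the LSV type and the coordinates.
    gene_id, lsv_type, coords = event_id.rsplit(":", 2)
    start, end = coords.split("-", 1)
    return LsvId(gene_id, lsv_type, int(start), int(end))


event = parse_lsv_id("WBGene00006064:t:5311643-5311763")
print(event.gene_id, event.lsv_type, event.start, event.end)
```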

Applications and limitations

This application is a convenient way to determine genes of interest for a particular question. Those genes should be investigated in more detail in the VOILA app (local quantification) and the genome browser.

Technical considerations

The MAJIQ quantifications are the same as displayed in VOILA (local quantification by gene).

Source code and Github issues

The code of the Shiny web app is available at https://github.com/cengenproject/das_by_neuron

Transcript quantification

StringTie was used in quantification mode (Pertea et al., 2016) to estimate the expression level of each transcript. This view is often easier to interpret (as transcripts directly reflect the biology we’re interested in), but is typically less reliable than local quantifications. The transcriptome used was Wormbase WS289.

The quantification is accessible by two modules.

Single-gene mode

In single-gene mode, you can input a single gene name and a combination of neurons. Use ALL for all neurons, individual neuron names (e.g. “AWA”, “ASEL”, or “OLL”), or keywords such as “ACh”, “motor”, “sensory”, … You can combine keywords and neuron names as needed. If more than one gene is entered, only the first one is taken into consideration.

Three plots will be produced. The first one displays the average TPM value for that gene’s transcripts, across all sequenced neurons. This is a convenient way to get a glimpse of the general usage of this gene in the nervous system. The second one displays, for that gene, the proportion of its expression that can be attributed to each of its transcripts. This makes identifying an isoform switch easy. However, when a gene is lowly or not expressed, this visualization can be misleading. The third plot allows direct visualization of the transcript expression levels, and makes it possible to distinguish between neurons where no transcript of that gene is expressed and neurons where some transcripts are highly expressed.
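The proportion plot, and the reason it can mislead for lowly expressed genes, can be sketched as follows. The function name, the transcript names, and the `min_gene_tpm` cutoff are hypothetical and only serve the illustration; the app itself does not necessarily apply any cutoff.

```python
def transcript_proportions(tpm_by_transcript, min_gene_tpm=1.0):
    """Share of a gene's expression attributed to each transcript
    in one neuron. Returns None when the gene total is below
    `min_gene_tpm` (a hypothetical cutoff), because proportions
    of a near-zero total are mostly noise."""
    total = sum(tpm_by_transcript.values())
    if total < min_gene_tpm:
        return None
    return {tx: tpm / total for tx, tpm in tpm_by_transcript.items()}


# Well-expressed gene: the proportions are informative.
print(transcript_proportions({"tx.a": 90.0, "tx.b": 10.0}))

# Barely detected gene: without the cutoff, tx.a would appear to
# account for 100% of the expression, which is misleading.
print(transcript_proportions({"tx.a": 0.02, "tx.b": 0.0}))
```

This is why the third plot, which shows expression levels directly, should be read alongside the proportions.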

Selecting the checkbox “Plot individual samples” will represent each sample rather than neuron-level aggregates. The choice of color scale is customizable:

  • viridis and npg are well-suited for genes that have a few transcripts.
  • d3_cat20 makes it possible to distinguish up to 20 transcripts.
  • iwanthue generates random colors, to effectively color any number of transcripts. It is probably not colorblind-safe.

Heatmap

The choice of gene and neuron is as described above, except that several genes can be input at once. In addition, there are several options for normalization:

  • None displays raw TPM values
  • Log2 displays log2(TPM+1); the shift by 1 ensures that TPM=0 gives a value of 0
  • Z-score subtracts the mean and divides by the standard deviation. If “Scale on” is set to Transcript, each transcript will be scaled that way, such that a high Z-score indicates that this transcript is higher in this neuron than in the average neuron, and a negative Z-score indicates the transcript is lower in this neuron than in the average neuron. When “Scale on” is set to Neuron, a high value indicates that the transcript is higher in this neuron than the average transcript (of the selected genes), and conversely.
  • Min-Max scales all values between 0 and 1, such that, if “Scale on” is set to Transcript, each transcript will have one neuron with an expression value of 0 and one neuron with an expression value of 1; if “Scale on” is set to Neuron, each neuron will have one transcript at 0 and one transcript at 1. The other Neurons/Transcripts will have values scaled between 0 and 1.
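The normalization options above can be sketched as follows, for a small matrix where rows are transcripts and columns are neurons. This is an illustration under assumptions, not the app's code: in particular, the Z-score here uses the sample standard deviation, which may differ from what the app uses, and all function names are hypothetical.

```python
from math import log2
from statistics import mean, stdev


def log2_shift(values):
    """log2(TPM + 1), so that TPM = 0 maps to 0."""
    return [log2(x + 1) for x in values]


def z_score(values):
    """Subtract the mean, divide by the (sample) standard deviation."""
    if len(values) < 2:
        return [0.0] * len(values)
    m, sd = mean(values), stdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(x - m) / sd for x in values]


def min_max(values):
    """Rescale so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(x - lo) / (hi - lo) for x in values]


def scale(matrix, fn, scale_on="Transcript"):
    """Apply `fn` per row (Transcript) or per column (Neuron)."""
    if scale_on == "Transcript":
        return [fn(list(row)) for row in matrix]
    # Neuron: transpose, scale each column, transpose back.
    columns = [fn(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


# Rows: two transcripts; columns: three neurons (invented TPM values).
tpm = [[0.0, 10.0, 20.0],
       [5.0, 5.0, 50.0]]
print(scale(tpm, min_max, "Transcript"))
print(scale(tpm, min_max, "Neuron"))
print(scale(tpm, z_score, "Transcript"))
```

Comparing the two Min-Max outputs shows how the choice of “Scale on” changes which contrasts are visible in the heatmap.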

The choice of color scale is customizable.

  • MetBrewer OKeefe2 and viridis magma are well-suited for positive values (dark at 0, lighter for higher values).
  • Red-White-Blue has red colors for negative values, white around 0, blue for positive values. It can be used as a white-blue scale for positive values, and is also well-suited to visualize negative Z-scores.

Finally, you have the possibility to download the underlying data as a table (tab-separated values, can be opened with Excel). The downloaded dataset will contain the mean TPM value for selected neurons and transcripts (unnormalized). The heatmap itself can be downloaded as an SVG (compatible with Inkscape, Adobe Illustrator, Affinity Designer, …) for further editing. The SVG reflects the displayed heatmap, using the same neurons and transcripts, normalizations, and color scale. The downloaded file name contains the date and time of download (time in UTC, may differ from your local time), the name of the selected genes (truncated to 20 characters), and the name of the selected neurons (truncated to 20 characters).
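The naming rule for downloaded files can be sketched as below. Only the ingredients (UTC date and time, gene names and neuron names each truncated to 20 characters) come from the description above; the exact separators, timestamp format, and function name are assumptions.

```python
from datetime import datetime, timezone


def download_name(genes, neurons, ext="tsv", now=None):
    """Build a file name from a UTC timestamp and truncated labels.
    The layout 'YYYYMMDD-HHMMSS_genes_neurons.ext' is hypothetical."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d-%H%M%S")
    gene_part = "_".join(genes)[:20]      # truncated to 20 characters
    neuron_part = "_".join(neurons)[:20]  # truncated to 20 characters
    return f"{stamp}_{gene_part}_{neuron_part}.{ext}"


print(download_name(["unc-40", "mec-2"], ["AWA", "ASEL", "OLL"]))
```

Because the labels are truncated, two downloads with long gene lists may share the same gene part and differ only by their timestamp.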

Applications and limitations

Because the data was produced with short reads, this quantification is less accurate than a local quantification.

The different visualizations available can make some aspects clearer or obscure them. It is crucial to combine several visualizations before drawing conclusions.

As the transcript level typically reflects the biology, this quantification can be easier to interpret. It can be a great way to explore one or several genes before using more detailed tools.

Technical considerations

The quantification was performed with StringTie using the “eB” option. TPM values were extracted for each gene in each sample.

Source code and Github issues

The source code for running StringTie is at https://github.com/cengenproject/stringtie_quantif

The source code of the Shiny web application is at https://github.com/cengenproject/isoform_compare/

References

Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M Transcriptome assembly from long-read RNA-seq alignments with StringTie2, Genome Biology 20, 278 (2019), DOI:10.1186/s13059-019-1910-1

Pertea M, Kim D, Pertea GM, Leek JT, Salzberg SL Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown, Nature Protocols 11, 1650-1667 (2016), DOI:10.1038/nprot.2016.095

Shumate A, Wong B, Pertea G, Pertea M, Improved Transcriptome Assembly Using a Hybrid of Long and Short Reads with StringTie, bioRxiv (2021) 2021.12.08.471868

Splicing browser (raw data)

This module allows direct visualization of the sequencing data for any genomic locus. For each sample, a density plot indicates the number of reads aligned at a particular genomic position (normalized by the total number of reads in that sample and multiplied by one million, unit: CPM). In addition, a splice junction track indicates the number of junction-spanning reads supporting each junction. For each neuron type, the samples are averaged at each genomic position to give a mean coverage for that neuron, and the junction-spanning reads are summed for each junction to give a total junction usage track for that neuron. Finally, to allow rapid examination of a genomic locus across neurons, additional “global” tracks are available:

  • “mean_cov” is the mean coverage (for each genomic position) across all neurons (as such, it is the mean of the means of sample CPM)
  • “min_cov” corresponds to the smallest value across all neuron means, for a given genomic position. If a given exon or gene is expressed in every neuron except one, this track will show the absence of expression; whereas the mean will show the average high expression in most neurons. Note that this track becomes of little use if a gene is not expressed at all in one neuron class: this single neuron class will force the min at 0 along the length of the gene.
  • “max_cov” corresponds to the highest value across all neuron means, for a given genomic position. If a given exon or gene is expressed in a single neuron type, this track will show that expression, while the mean will show the average absence of expression in all other neuron types. Conceptually, these two tracks are similar to maximum or minimum intensity projections (per pixel) as typically performed on microscopy images.
  • “sum_sj” is the total number of junction-spanning reads for each splice junction. This track only underwent a simple filtering (removing all junctions with fewer than 20 reads), thus may still contain a large amount of noise.
  • “min_sj”, “max_sj” are the minimum and maximum across all neurons, for each splice junction. As with coverage, they facilitate interrogation of a given position across all neurons. Junctions with fewer than 20 reads were filtered out of the “max_sj” track, junctions with fewer than 1 read filtered out of the “min_sj” track.
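The coverage tracks described above can be sketched with plain lists, one value per genomic position. This is a simplified illustration, not the R/Bioconductor code used to build the tracks; the function names and all numbers are invented.

```python
from statistics import mean


def to_cpm(raw_coverage, total_reads):
    """Normalize per-position read counts by sequencing depth."""
    return [x * 1_000_000 / total_reads for x in raw_coverage]


def neuron_mean(sample_tracks):
    """Per-position mean over the samples of one neuron type."""
    return [mean(position) for position in zip(*sample_tracks)]


def global_tracks(neuron_means):
    """Per-position mean, min and max across neuron means."""
    columns = list(zip(*neuron_means.values()))
    return {
        "mean_cov": [mean(c) for c in columns],
        "min_cov": [min(c) for c in columns],
        "max_cov": [max(c) for c in columns],
    }


# Two invented samples of one neuron type, two positions each.
awa = neuron_mean([to_cpm([4, 4], 2_000_000), to_cpm([6, 6], 3_000_000)])

# Invented neuron means (CPM) over three positions.
neuron_means = {
    "AWA": [2.0, 2.0, 0.0],
    "ASEL": [1.0, 3.0, 0.0],
    "OLL": [0.0, 4.0, 3.0],
}
print(awa)
print(global_tracks(neuron_means))
```

In this toy example the third position is covered in a single neuron, so it is visible in “max_cov” but absent from “min_cov”, as described for the tracks above.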

Genes that are expressed in a given sample typically have 0.2-1.5 CPM along the gene body. Coverage above 1 CPM suggests a highly expressed gene. A useful approach to examine a given genomic locus is to first visualize the “global” statistics and look for variable regions. Then, examine these regions using the means per neuron. Finally, if a region of interest appears differentially used between neuron types, displaying the individual samples gives an idea of the robustness. Junctions and coverage often provide complementary information, so it is recommended to look at both.

The “sj” tracks are displayed by default as arcs, whose thickness is proportional to the log-number of reads. Clicking on the “…” next to the track name, you can select a LinearBasicDisplay where each splice junction is represented by a rectangle, with the number of reads printed below.

In addition to these neuron-specific tracks, 5 further tracks are available, corresponding to the TRAP-Seq data from Koterniak et al. (2020). These samples target all neurons, muscle, or intestine, as well as all serotonergic and all dopaminergic neurons. They correspond to SRA samples SRR6238082-SRR6238101 (we did not include the “input” samples), and underwent similar processing to the CeNGEN sorted neuron samples.

Applications and limitations

This visualization makes no assumption about the genomic structure, and allows direct observation of constitutive and alternative splicing (whether differentially used or not), non-coding RNAs, or unannotated genes or exons. The coverage values are only normalized by sequencing depth, and the junction values are not normalized at all; this can be considered raw data.

Neither the coverage nor the junction values are normalized by gene; thus, when averaging between neurons with different expression levels, the highest expressing neuron can dominate the result. Further, the criteria of what represents “high” and “low” expression can differ between genes. Finally, this is not a proper quantification: the other tools on this website will usually give more accurate answers to splicing questions. Visualizing the read coverage is a good way to verify the predictions of the splicing quantification.

Technical considerations

The tracks were processed using R/Bioconductor, and loaded in JBrowse2. The individual tracks can be downloaded from here and loaded into a different genome browser (e.g. Wormbase’s JBrowse, or UCSC).

Source code and Github issues

The code used to generate the tracks is available at https://github.com/cengenproject/splicing_browser.


Please feel free to send feedback by email or open an issue in the corresponding Github repository.


Updates

  • 2024-09-09: in browser, additional tracks with Koterniak (2020) tissue-specific TRAP-Seq data.
  • 2024-08-26: VOILA local quantification updated with WS289 annotation, new samples (CEP, DVB, HSN, PVP, PVQ, RME, SIA), gene symbols instead of gene ID.
  • 2024-08-06: neuron DAS comparisons added on front page.
  • 2024-07-31: neuron DAS comparisons, experimental app.
  • 2024-07-31: this landing page transferred to the WordPress page at cengen.org. Correct help of StringTie to reflect update 2024-03-22. The apps are now accessed at address splicingapps.cengen.org/<appname> instead of (previously) splicing.cengen.org/<appname>
  • 2024-05-14: transcript-level performance updates, now hosted at shinyapps.io. No notable user-facing change.
  • 2024-03-22: transcript-level, update data to WS289 (no more novel transcript discovery), new samples added (CEP, DVB, HSN, PVP, PVQ, RME, SIA).
  • 2024-03-22: Genome browser, cosmetic updates to SJ display.
  • 2023-12-12: Genome browser, WS289 annotation (previously WS281), new samples added (for neurons CEP, DVB, HSN, PVP, PVQ, RME, SIA).
  • 2022-12-01: update JBrowse to 2.2.1, add strand color in gff track, change sj tracks default display to Arcs, use “novel” gff3 version 221130 with exons.
  • 2022-10-20: Google Analytics.
  • 2022-07-21: transcript-level: multiple improvements (faster, buttons, download heatmap, heatmap color scales, Wormbase browser link, average across neurons, …).
  • 2022-07-19: transcript-level quantification backend improved (starts faster).
  • 2022-04-08: transcript-level quantification app updated with a new interactive heatmap. Documentation updated.
  • 2022-03-23: Data processing update; any multimapping read was removed, new transcripts added by comparing with (unpublished) long reads data (see e.g. Y37E3.30).
  • 2022-02-09: Adding minimal app for isoform quantification.
  • 2022-01-31: button order changed.
  • 2022-01-25: Updated help. Aesthetic changes in landing page.
  • 2022-01-18: splice junctions available in the browser.
  • 2021-12-06: VOILA server available. The individual samples are again available in the browser.
  • 2021-11-30: removing two samples that failed QC. The individual samples are no longer available.
  • 2021-11-23: “lower” and “higher” now correspond to the 3rd lowest/highest sample.
  • 2021-11-18: “lower” and “higher” now correspond to the 3rd lowest and highest neurons.
  • 2021-11-12: “lower” and “higher” now correspond to the 3rd lowest and 4th highest neurons.
  • 2021-11-12: annotation updated to WS281 to match bw tracks.
  • 2021-11-10: alignments on WS281 (vs 277 previously). In addition, min and max tracks replaced by median, 10th percentile, and 90th percentile.