Tutorial alternative splicing: tools description

I – Tools description

This page details the tools available, the type of data displayed by each one, its interpretation, and some of the customization options.

Table of contents

I - Tools description
I-1. Raw data in genome browser
I-1.a. Individual Sample Tracks
I-1.b. Neuron Tracks
I-1.c. Tissue-level (Koterniak 2020) tracks
I-1.d. Global tracks
- Mean coverage, sum of junctional counts
- Maximum coverage and junctional counts
- Minimum coverage and junctional counts
I-1.e. Interface and customization of the genome browser
- Advanced customization
- Transcript-level quantification
I-2. Differential splicing by gene (local quantification)
I-2.a. Local Splicing Variation (LSV)
- LSV naming
I-2.b. Interface
I-3. Differential splicing between neurons (local quantification)
I-3.a. Pair of neurons
I-3.b. Sets of neurons
I-4. Transcript-level quantification
Summary

1. Raw data in genome browser

We provide a genome browser to visualize a minimally-processed version of the dataset. A genome browser is a software that loads a number of tracks and can display them for a chosen region of the genome. For example, two essential tracks are the “Reference sequence” and the “Canonical geneset”, represented here for the gene fipr-12:

The upper track displays the canonical annotation of fipr-12 in Wormbase WS289, the bottom track displays the genomic sequence of chromosome V in the region 9,943,291 to 9,943,649 bp. At this resolution, the genomic sequence is color-coded, to see the individual basepairs, you need to zoom in; position your cursor on the coordinates bar and drag:

You can also use the lens symbols on the right:

Similarly, you can move left/right by clicking and dragging in the middle of any track, or with the arrows on the left.

We provide a number of additional tracks to visualize our RNA-Seq data. The available tracks are listed on the right side of the browser (1), this list can be re-opened if necessary with button (2).

The available tracks fall into 5 categories:

  • Annotations (the genomic sequence)
  • Global
  • Neurons
  • Tissue (Koterniak 2020)
  • Individual Samples

1.a. Individual Sample Tracks

For each neuron class in this dataset, a worm strain was obtained in which this neuron class has a particular fluorescence signature (list on the cengen.org website). Worms were dissociated into a suspension of single-cells, sorted on fluorescence to select the neurons of interest, and sequenced (see the papers for details). Each neuron type was sequenced in typically 3-4 biological replicates. Here, we name the individual samples as NEUr123 where NEU is the target neuron name, 123 the replicate number (the r is a separator).

The RNA-Seq data was aligned to the genome with STAR, and we report here two values: the exonic coverage and the junction count.

Consider a fictional gene, covered by 4 RNA-Seq reads:

For each base-pair within an exon, we define the exonic coverage as the number of reads covering this particular base-pair:

This raw exonic coverage has a drawback: if two samples with identical mRNA composition were sequenced with different total number of reads, every expressed base-pair in the more deeply sequenced sample would have a higher coverage than in the shallow-sequenced sample. To make the values more comparable between samples, we divide the raw exonic coverage by the total number of reads and multiply by 1,000,000 yielding Counts Per Million (CPM).

This unit is still not perfectly normalized, for example if two samples have very different mRNA compositions, the CPM can not be compared between them. In practice, one should compare CPM between regions of the same gene. Even so, we can often notice strong variations of CPM within a single exon; the junction count tends to be a more robust metric.

The fictive gene in out example is composed of two exons and one intron. Read 1 does not intercept the intron, however reads 2, 3, and 4 all cover the splice junction corresponding to the intron. Thus, defining the junction coverage as the number of reads covering the splice junction, here we have a junction count of 3.

In the genome browser, for each sample, e.g. RMDr179, we created 2 tracks, RMDr179_cov containing the exonic coverage (in CPM) and RMDr179_sj containing the junction count. Here are these tracks for the gene trpp-6:

Examining the exonic coverage track, we see that the central exons are typically covered by 3-4 CPM, with a coverage that decreases to 1 CPM close to the ends of the transcript. Considering the junction-intersecting reads only, the introns of this transcript are covered by 13-43 reads in this sample.

In our experience, these numbers suggest the gene trpp-6 is expressed in RMD, whereas the gene fipr-12 doesn’t appear expressed (note the scale on the left for the CPM, and the lack of junction reads):

The CPM value of the exonic coverage can be obtained by hovering your cursor on the track.

1.b. Neuron Tracks

As noted, we have several biological replicates per neuron type. To avoid looking at each sample individually, we created combined tracks.

The combined exonic coverage gives the mean CPM across replicates for each base-pair. For example, consider the last base-pair of the first exon of trpp-6:

Hovering our cursor over each track, we see that the RMD replicates have coverage of 2.6, 5.4, 3.0, 1.0, and 3.2 CPM. Thus the average coverage is 15.2/5=3.04 CPM. Indeed, this is the value we can read directly in the RMD coverage track:

On the other hand, the combined junction count is the sum of the individual junction counts. Consider the first intron of trpp-6:

The combined RMD track displays 139 = 30+35+11+25+38, the sum of the individual samples.

This points to a difference in interpretation of CPM and junction counts: the CPM are normalized by sample and averaged by neuron class, thus all coverage tracks are on the same scale and can potentially be compared (with caveats). The junction counts are raw counts, and directly reflect the strength of the evidence: a low number indicate a junction we do not detect in our data (which may indicate lack of expression or insufficient sequencing), a high number indicates a splice junction supported by many reads.

1.c. Tissue-level (Koterniak 2020) tracks

This data was generated by Gracida and Calarco (2017) and alternative splicing in these samples was described by Koterniak et al. (2020). It consists of TRAP-Seq (expression of a tagged ribosome subunit under a specific promoter allowing the pulldown of ribosomes from a particular cell type and sequencing of the associated mRNAs) targeting neurons, muscles, intestine, dopaminergic neurons, and serotonergic neurons, with 2 replicates per strain. We downloaded the corresponding data from GEO (only the pull-downs, not the inputs) and processed it in the same way as our neuronal samples.

As this covers non-neuronal tissues, it can usefully complement our dataset.

Note however that the technical approaches used being very different, the statistical properties (e.g. coverage) of these samples can be hard to compare.

1.d. Global tracks

Since we have a large number of neurons sequenced, examining the “typical” expression of a gene in the nervous system can be challenging. To help in this common situation, we provide several “global” tracks reflecting expression across all neurons.

Mean coverage, sum of junctional counts

The tracks “mean_cov” and “sum_sj” perform across all neurons the same operations as the neuron-level tracks perform across samples. The “mean_cov” is, for each base-pair, the mean coverage across neurons (note that it is then the mean of the mean coverage across samples, thus the weight of a sample depends on how many samples come from the same neuron). The “sum_sj” is the sum of the junction counts across neurons (which is equivalent to the sum across all samples).

These tracks thus display how the “average neuron” behaves. However this can mask the variability between neuron types.

Maximum coverage and junctional counts

The tracks “max_cov” and “max_sj” represent the maximum value across neuron types.

It is conceptually similar to the Maximum Intensity Projection used in microscopy. Consider a fictional gene with an alternative exon expressed in a single neuron type out of 3:

In exon 1 and exon 3, all neurons have approximately 10 CPM, thus the average is 10*3/3 = 10 CPM. However, in the cassette exon 2, the mean track would have a value of (10+0+0)/3 = 3.3 CPM. As we increase the number of neurons that do not express exon 2, we “dilute” the signal from Neuron A, and the average expression of exon 2 gets drowned in the background noise.

On the other hand, the maximum coverage in exon 2 corresponds to the coverage of neuron A, which stays 10 CPM. Since exons 1 and 3 are expressed at 10 CPM in all neurons, their maxima will stay around 10 CPM.

The potential inconvenient of the maximum track is that it keeps the maximum value of the noise across all neurons, whereas the mean track can average out the noise.

The maximum splice junction track follows the same principle. Thus, seeing a signal in the maximum tracks indicates that at least one neuron class has expression in that region, without the need to examine each neuron type individually. Determining the individual neuron type(s) is best done with the local quantification.

Minimum coverage and junctional counts

These tracks are the counterpart of the maximum tracks: “min_cov” and “min_sj” report the smallest value across all neuron types. Thus, a lack of signal in the minimum tracks indicate that at least one neuron type has no expression in this region.

1.e. Interface and customization of the genome browser

Note the color of the gene annotations and splice junctions indicates the strand the gene is on. A gene on the + strand appears purple, a gene on the – strand is teal (following the same convention as Wormbase).

The genome browser offers additional options to customize the data display. We will discuss a few that are regularly useful.

Track labels can interfere with the track being displayed. It is possible to hide them or offset them. Click on the menu icon in the top left and select “Track labels”:

You may also explore the “Show…” menu to toggle grid lines, headers, etc.

For any track already added, you can click on the vertical “…” menu icon to access a track-specific menu:

You will get a set of options depending on the track type. For an exonic coverage track, there are many ways to customize the display (e.g. change the fill color, display lines instead of density, use log scale). For junction tracks, you have the possibility to switch between a LinearArcDisplay and a LinearBasicDisplay.

The Arc Display shows the junctions as circle arcs, which may be more intuitive, but gets busy in complex and highly expressed genes. The Basic Display allows to stack the junctions, and may be more readable in some cases. For example, consider the gene unc-2, with Arc Display:

and with Basic Display:

While both representations show the same data, the Basic Display avoid overlapping labels and ensures the numbers stay readable.

Advanced customization

There is a possibility to customize tracks even more in depth, which requires significant effort and knowledge and is not recommended in general. Open the menu for a track, and select “Copy track”. The new copy appears in the list. In its menu, select “Settings”

This gives access to many more properties of this track. Modifying them often requires an understanding of JBrowse2 (the genome browser software).

2. Differential splicing by gene (local quantification)

2.a. Local Splicing Variation (LSV)

The local quantification was performed with MAJIQ and its results displayed with VOILA. The quantification is performed within a LSV framework. Consider a (fictional) gene with the following two transcripts, along with its Splice Graph representation:

The two transcripts only differ in their first exon (alternative first exon), in the Splice Graph this is reflected by the fact that exon 3 can receive a junction either from exon 1, or from exon 2. We can thus define a LSV consisting of the left half of exon 3, along with the right halves of exons 1 and 2:

And the quantification is then performed at the level of the LSV: among the 2 junctions that can enter exon 3, which one does the data support the most?

Since exon 3 receives junctions, we can call it a “target LSV”, conversely a gene can contain a “source LSV” that emits junctions. Consider a canonical cassette exon (a gene where exons 1 and 3 are constitutively included, exon 2 may be included or skipped):

From the point of view of exon 2, there is no variation (it can only receive and emit the red junctions), but we have 2 LSVs, a source LSV from exon 1, and a target LSV into exon 3.

In all the examples above, the exons only use a single splice site. In practice, a half-exon can contain alternative splice sites. Here is an example of alternative 3′ splice site

The alternative splice site is indicated with a dashed line in the Splice Graph. In this example, we have a single source LSV from exon 1. The situation can even become more complex, with different splice sites on both sides

Finally, a further potential complication is intron retention. These are taken into account as if they were another splice junction, but represented in a different way.

This is considered a source LSV with two potential paths in the Splice Graph.

LSV naming

Each LSV is given a name such as WBGene00004364:t:10262318-10262408. First is the gene ID, then a s or t for “source” or “target”, then coordinates indicating its genomic position.

2.b. Interface

The initial screen is a table of LSVs. Use the search box in the top right to type the name of a gene of interest to automatically filter the table, then click on the gene name to open a dedicated page. For example, let’s look up ric-4:

We obtain the following gene page.

In (1), we have the name and genomic position of the gene. In (2), we have a Splice Graph showing the gene structure. In particular, the exons are numbered from left to right (in transcript coordinates, note that ric-4 is on the – strand, thus would appear from right to left in the genome browser). The splice junctions have a count on top, equivalent to the junction count in the genome browser (but computed in a different way, the values will differ). These junction counts are displayed for one particular neuron type, here we see we are displaying the “Group ADF; Experiment ADFr99”. We can click on (3) “Splice Graph” to choose which neuron to display: the Splice Graph structure is identical in all neurons, but the counts are specific to each sample.

Further, we can use the buttons in (4) to change the display of the Splice Graph, in particular the first icon changes the scale (so that the size of the exons and introns reflects their genomic coordinates). You can hover your cursor on the Splice Graph (2) to display the coordinates of individual exons or splice junctions.

In (5), the legend indicates how to read the Splice Graph. Here “DB” stands for “database”: the WS289 annotation, whereas “RNASeq” means features that were not annotated by Wormbase, but that appear in our data (potentially novel events). Specifically, an exon appearing in grey or a splice junction in red correspond to the gene annotation from Wormbase. A splice junction that is dashed was present in the Wormbase annotation, but we have no evidence in the data that it is used. An exon, splice junction, or intron retention that appears in green is a novel feature: it wasn’t present in the Wormbase annotation, but there was evidence in the data suggesting it exists.

Below this splice graph, we have several boxes for the individual LSVs. In (6) we can see the first LSV: we have its structure (two exons, connected by an intron retention and a splice junction), its name (WBGene00004729:s:3465770-3465927), and on the left two buttons “Highlight” and “Weighted” to highlight it in the Splice Graph. Note that the colors in each LSV are always the same (red, blue, green, purple, orange, …), they are not related to the colors in the Splice Graph (where green means unannotated).

Below, in (7), we have a quantification of each splice junction or intron retention in this LSV, in each neuron type. The neuron types displayed can be selected in (8). Each measure is given by a violin plot the same color as the junction in the schematic (6). The violin plot reflects a posterior distribution (see the MAJIQ paper for details), the number under each violin plot indicates the proportion of this junction’s usage within this neuron type.

3. Differential splicing between neurons (local quantification)

This app has two tabs: “Pair of neurons” and “Sets of neurons”.

3.a. Pair of neurons

In this tab, enter names for the “First neuron” and the “Second neuron”, for example OLL and OLQ. Click on the “Run this pair of neurons” button at the bottom, you will get a table and a set of plots.

The table indicates each splice junction that appears differentially used between the two neurons. The plots represent the LSVs containing these junctions.

There are 3 hierarchical levels. A gene (such as sto-2) can contain one or several LSVs (identified by their “Event ID”, such as WBGene00006064:t:5311643-5311763, or by a short name, such as “Dereth”), each LSV contains several splice junctions (identified with the short name and a number, e.g. Dereth-1 or Dereth-2). In the example above, we see that Dereth-1 and Dereth-4, two junctions of the LSV “WBGene00006064:t:5311643-5311763” appear differentially used between OLL and OLQ. So we can go to the plot section for “Dereth” and notice that indeed, OLL mostly uses “Dereth-1” (in red) while OLQ mostly uses “Dereth-4” (in blue). Hover on the plot to see the junction numbers.

3.b. Sets of neurons

Here we have two fields to fill: “First set of neuron(s)” and “Second set of neuron(s)”.

The neurons can be indicated by listing their names, separated by commas, or by using a shorthand for a group of neurons (e.g. sensory, motor, GABA, …).

The results are presented in the same way as for pairs of neurons.

4. Transcript-level quantification

This app has two tabs: “Single gene” and “Heatmap of transcript expression”.

In Single gene, enter a gene name (e.g. ric-4) in (1) and click the “Plot” button.

In (2), we get an average across all sequenced neurons. In the case of ric-4, we see an average usage of 134 TPM for transcript ric-4a and 47 TPM for transcript ric-4b (hover on the plot for details). To see the names of the transcripts annotated on the gene model, click on button (3), to open the (old) Wormbase browser including the transcript names.

In (4), we have the relative usage of these two transcripts in every neuron. For example in ADF (bottom row), we see that 94.9% of ric-4 expression is attributed to ric-4a, while for PVM, 77.9% of ric-4 expression corresponds to ric-4b.

This relative usage always adds up to 100% in a neuron. This can be misleading: if the gene is not expressed in that neuron, these proportions of expression only reflect noise. Hence plot (5) gives the absolute TPM values. Here we see for example that ASEL displays a very restricted expression of ric-4 in our samples, the relative expression of ric-4a and ric-4b should probably not be interpreted in ASEL.

For visualizing several genes, we can use the heatmap tab.

Summary

There are 4 available tools

  • The “genome browser” allows full access to the data, with additional context
  • The “local quantification” uses LSVs: single branching points in the splice graph, and computes the relative usage of the possible branches
  • The “differential splicing between neurons” helps finding LSVs that may vary between sets of neurons
  • The “transcript-level quantification” attempts to quantify whole transcripts