Discovery and analysis of the C. elegans Neuronal Gene Expression Network – CeNGEN
The CeNGEN Rshiny app provides multiple routes to explore and analyze single-cell RNA sequence expression data from the L4 C. elegans hermaphrodite nervous system. This README describes all the functionalities of the app, which are organized by tabs.
1. Gene Expression by Cell Type.
On this tab you can obtain normalized gene expression values (in transcripts per million, TPM, see Help page for the definition of TPM) for a neuron of interest or for a gene or gene set of interest. These data are averaged expression values for a given gene (or genes) across all the cells corresponding to an individual neuron type. You may select data from one of four expression thresholds (1-4, in increasing stringency), or the unfiltered data (“Unfiltered”, see Taylor et al., 2020 for thresholding method). We recommend the default threshold 2 as a good balance between false positives and false negatives. Users most interested in capturing as much as possible expression should use threshold 1. Users most concerned with eliminating false positive expression should use threshold 4. Gene expression values below the threshold have been converted to 0.
This tab offers three possibilities from left to right:
- Cell type: Choose a cell type and retrieve all genes expressed in that cell type under the chosen threshold, sorted by expression. The resulting table includes all genes, including those not detected and those below the threshold (set to 0).
- Gene: Choose a gene and retrieve all cell types expressing that gene under the chosen threshold, sorted by expression. Genes can be entered as a WormBase Gene ID (WBGene00006744), sequence name (F26C11.2) or public name (unc-4). The resulting tables list only the cell types expressing the gene.
- Multiple genes: Retrieve the expression matrix with the chosen threshold for the subset of genes introduced in the box and separated by “,”.
2. Find Markers based on Percentage of Expression.
This tab allows you to query which genes are expressed in a given percentage of the single cells within given cell types. You provide groups of cell types separated by “,” in either of the two boxes: group 1 or group 2. The first box retrieves genes expressed in a greater percentage of single cells than the provided minimum value (e.g., genes expressed in > 65% of cells in the AWC_ON and AWC_OFF clusters). The second box retrieves genes expressed in a smaller percentage of single cells than the provided maximum value (e.g. genes expressed in < 2% of the cells in the SMD and SIB clusters). The two boxes can be combined to retrieve genes over a threshold in one group of cells and below a different threshold in another group of cells. Alternatively, by inputting the same group of cells in both boxes, you can retrieve genes expressed in a defined percentage of the cells (e.g., genes expressed in > 50% (box 1) and < 75% (box 2) of the cells in a given neuron-type cluster). The result of each independent query and the combined query are displayed in columns 1-2 and 3, respectively.
Because this tab considers each individual cell’s expression, it uses unthresholded data. To calculate percentages of cells expressing each gene, we consider a gene as expressed in a given cell if there is at least 1 UMI.
3. Enriched Genes by Cell Type
This tab allows you to retrieve the list of genes enriched in a given cell type compared to all other cells. You can choose between two datasets: “All cell types” and “Neurons only”. The first dataset includes non-neuronal cell types, affecting the background of cells against which the enriched genes of each cell have been calculated.
The “marker genes” in this tab are pre-calculated using the function “FindMarkers” implemented in Seurat using default parameters on the SCT-normalized data. The following metrics are reported:
- gene_name: public gene name
- gene: WormBase Gene ID
- p_val: p-value from Wilcoxon rank sum test
- avg_logFC: log fold-change of the average expression between the two groups.
- pct.1: The percentage of cells where the gene is detected in the selected cell type of interest
- pct.2: The percentage of cells where the gene is detected in all other cells (either all other cells or all other neurons, depending on the dataset selected)
- p_val_adj: Adjusted p-value, based on Bonferroni correction using all genes in the dataset
- cluster: The selected cell type of interest.
4. Find Differential Expression between Cell Types
In this tab you can perform differential gene expression analysis between two individual cell types or groups of cell types. You can also calculate enriched genes for a given cell type or group of cell types against the rest of cells in the dataset. Select the cell type(s) in the “Select Group 1” and “Select Group 2” boxes. Clicking on “Calculate DEX” will calculate differential expression using one of the available tests.
In the “Select Group 2” box, two additional options are possible. Typing “NEURONS” into the Group 2 box will compare the cells in Group 1 to all other neurons. Typing “ALL” into the Group 2 box will compare Group 1 to all other cells in the dataset (including non-neuronal cells). Be aware that calculation against ALL can take up to 5 minutes to be computed.
Wilcoxon on single cells
This corresponds to the default behavior of Seurat’s FindMarkers()
function. It considers each single cell as a biological replicate and tests gene expression of all the single cells in the clusters from group 1 against all the single cells from the clusters in group 2 (where the single cells are pooled).
The following columns are returned:
- gene: WormBase Gene ID
- pct.1: The percentage of single cells where the gene is detected in cluster(s) 1
- pct.2: The percentage of single cells where the gene is detected in cluster(s) 2
- avg_logFC: log fold-change of the average expression between the two groups
- p_val: p-value from Wilcoxon rank sum test
- p_val_adj: p-value, adjusted by Bonferroni correction using all genes in the dataset
- gene_name: public gene name
- seqnames: sequence name.
Pseudobulk dataset
In single-cell testing, each single cell is seen as an individual replicate. However these cells come from the same sample, so the variability between cells does not truly reflect the biological variability between independent samples. This issue is known as pseudoreplication.
For pseudobulk, individual single cells are grouped by their cluster (cell type) annotation, and by the biological replicate they were produced in. For each gene, the reads in a cell type-replicate are summed, giving a pseudobulk sample (which is notionally equivalent to the result we would get using FACS and bulk sequencing). Recent research (e.g. Crowell et al., 2020; Squair et al., 2021; Zimmerman et al., 2021; Murphy & Skene, 2021) suggests that pseudobulk analysis is more robust, as it accounts correctly for biological variation. The website implements two tests on pseudobulk replicates.
Pseudobulk: Wilcoxon
The pseudobulk replicates corresponding to group 1 are compared against the pseudobulk replicates corresponding to group 2. Only genes that pass the filterByExpr()
criteria (see below) are considered.
The following columns are returned:
- gene: WormBase Gene ID
- mean_1: The mean expression in cluster(s) 1
- mean_2: The mean expression in cluster(s) 2
- log2FC: log fold-change of the average expression between the two groups
- p_val: p-value from Wilcoxon rank sum test
- p_val_adj: p-value, adjusted by Bonferroni correction
- gene_name: public gene name
- seqnames: sequence name.
Pseudobulk: edgeR pairwise exact test
All pseudobulk replicates are loaded into an edgeR object, filtered with filterByExpr()
, before undergoing calcNormFactors()
and estimateDisp()
. The pseudobulk replicates corresponding to group 1 are compared against the pseudobulk replicates corresponding to group 2 using the exactTest()
function. Note that this function only allows pairwise comparisons. To compare more than two cell types, you can use the Wilcoxon test, or download the dataset locally to run a glmQLF test, DESeq2’s tests, or another method. These functions are not implemented in the Shiny app because of their computation time.
The following columns are returned:
- gene: WormBase Gene ID
- logFC: log fold-change of the average expression between the two groups
- logCPM: log of mean expression (in Count Per Million reads) of the gene across all groups
- p_val: p-value from the test
- p_val_adj: p-value, adjusted by Bonferroni correction
- gene_name: public gene name
- seqnames: sequence name.
Legacy app
In the legacy app, a number of tests are available at the single-cell level. Please see the Seurat documentation for details. In our experience, the results of these tests are similar, we recommend using the default (wilcox) test.
The following metrics are returned:
- gene: WormBase Gene ID
- p_val: p-value from Wilcoxon rank sum test
- avg_logFC: log fold-change of the average expression between the two groups
- pct.1: The percentage of cells where the gene is detected in cluster 1
- pct.2: The percentage of cells where the gene is detected in cluster 2
- p_val_adj: Adjusted p-value, based on Bonferroni correction using all genes in the dataset
- gene_name: public gene name
- seqnames: sequence name.
Because this tab considers each individual cell’s expression, it uses unthresholded data. To calculate percentages of cells expressing each gene we consider a gene is expressed in a given cell if there is at least 1 UMI.
5. Single Cell Plot (legacy app)
This tab displays multiple plots for gene expression and metadata visualization in single cells embedded in two UMAP dimensions. On the left menu the user can choose the entire CeNGEN dataset or the neuron only dataset. The second box allows to choose the plotting of metadata (i.e. Neuron Type, Tissue Type, etc.), single gene expression or the co-expression of two genes.
- Metadata plotting options
- Detection: Highlights cells in red based on the cell-calling method that detected the cells (options: Emptydrops, default CellRanger method, or both)
- Experiment: Highlights cells in the UMAP in red based on the single-cell experiment from which they came.
- Neuron.type: Highlights cells in the UMAP based on the annotated neuron type.
- Tissue.type (All cell types dataset only): Highlights cells in the UMAP based on the annotated tissue type.
- Individual genes. Typing a gene name (WBGeneID, sequence name, or public name) and clicking the plot button displays three panels.
- Single cells colored by cell type.
- Single cells colored by the scaled expression from the Seurat SCT-normalized data.
- Ridge plot of the density of scaled expression sorted by cell type.
The first panel controls the zoom capabilities. Double click zooms in, double click resets zoom.
- Colocalization. Filling the two boxes with Gene 1 and Gene 2 the user obtains two panels.
- The bottom panel shows cell types and controls zoom. As in other plots, double click zooms in, double click resets zoom.
- The bottom panel shows cell types and controls zoom. As in other plots, double click zooms in, double click resets zoom.
6. Heatmaps of gene expression.
This tab allows you to plot heatmaps of averaged gene expression per cell type, incorporating both relative gene expression (dot color, see below) and the percent of cells (dot size) expressing the gene within each neuron cluster. This tab uses thresholded data from threshold 2, and therefore only contains data from neuronal cell types. You can input a list of genes separated by commas, spaces, or returns. Genes can be entered as public gene names, WormBase ID or sequence names. A list of genes can also be uploaded from a file (tab separated text files and csv files are supported).
Placing the cursor over a dot will display the corresponding gene name, cell type, relative gene expression (“scaled.expr”, see below), percentage of cells expressing the gene (“prop”), and neuron modality (Sensory, Motor, Interneuron, Pharyngeal).
The scaled TPM expression value is the z-score for the gene in that neuron. The z-score represents how many standard deviations the TPM expression of a given gene in a given neuron is from the mean TPM for that gene. It is calculated by subtracting the mean TPM expression for the gene across all neurons from the TPM expression of the gene in the given neuron, then dividing by the standard deviation of TPM values for that gene. See the Help page for the calculation of TPM.