Using network concepts to study regulons.

So finding out gene regulatory networks (GRN) has been an important but daunting task in biology. It would be cool to know that a phenotype is just controlled by several gene drivers, especially in differentiation which we hope to program the cell to a particular lineage. Meanwhile, it will give us a better picture of how genes organize into pathways and cascades.

Traditionally, it is done by binning genes via co-expression, as done in WGCNA. WGNCA put DE genes that have high expression correlation into the same modules; usually these genes are then sent to GO analyses to identify biologically relevant pathways.

Here, Iacono et al. (2019) reported a novel way to build the co-expression matrix and apply graph-theory metrics to single-cell data to identify GRN with high accuracy.

I think this is a fun paper to read as it teaches me about network analysis, applied in a biological context that I am familiar with.

Also, it is refreshing to rethink about some statistics.

Preprocessing using bigScale

The authors realise that computing gene-gene correlation based on expression values are highly compromised by excessive drop-out events and technical noises of single-cell data. But amidst all the noises, single-cell data have many cells; thus could render high statistical power and a distribution that we can use to model the noises, which then can be corrected.

So they use the software they developed previously, bigSCale (Iacono et al (2017)) for the preprocessing of the data and find DE genes.

In brief, bigScale creates many mini-cluster (a step referred as preclustering) of cells with highly similar expression and threat them as biological replicates. Then bigSCale computes a noise model of each gene by performing 1-to-1 comparison of the expression among cells within the mini-cluster. Using the expression difference, a p-value is derived to test whether such expression difference is rare or common for a gene; then downstream correction is performed to nomalise the p-value.

Finally, bigScale use the p-value to shortlist highly variable genes, to build a cell-cell distance matrix and Ward’s linkage to build a hierarchy. Finally, finding clusters are determined using the elbow method. Of note, each round of cluster partitioning in bigScale may use different set of HVG, based on their unique expression patterns.

Details of bigScale installation and tutorial is posted in github.

Novel correlation matrix based on normalised Z-scores

Build up upon the clusters, the authors develop a new method to characterize gene regulatory networks.

In brief, the author performed DE analysis between all pairs of clusters they identified, and translated the probability of DE of a gene to a z-score. Then they computed gene-gene correlation based on the z-score. In their report, using the z-scores circumvent the issues of drop-outs and imputation is no longer needed.

Next they retained genes with significant correlation to build the GRN, that genes are nodes and correlation are the weighted edge. Finally, they used GO to refine the network, only keeping biologically functional correlations.

Network analysis

The rest of the paper was spent discussing network properties of GRNs, which were documented in details in the tutorial.

The foremost task of such analysis was to quantify the gene’s importance in a network. To do so, the authors explore several measures of node centrality including degree, betweenness, closeness, pagerank,and eigenvalue.

Finally, by comparing GRNs of normal and type-two diabetes cells, the authors observed a change in the shape of the network and genes centrality; thus identifying key drivers of the disease phenotype.

Interestingly, the authors also found out that most GRNs follow a power law that most genes are not highly connected while a few genes are hub genes and argue such high modularity render biological systems robust to perturbations.

Concluding remark

I have to apologize that I have not post any codes in this post as I am not too keen to download and try out bigSCale nor I am particular keen on repeating the codes for the network analyses.

However, I think the both papers have taught me a lot about statistics and graph theories. I do appreciate that the authors presented a new possible usage of single-cell data, beyond cell-type characterization and novel DE genes identification. Hopefully, I will later have some data that I can apply these awesome techniques in.