Identifying TFs that drive differentiation

Over-expression screen of transcription factors

Combinations of TFs are known to program cell differentiation. To study TFs, people use observational studies (e.g. QTL mapping) and perturbation studies to relate the function of one or more TFs to a phenotype. These are indirect ways to study TF function, and the findings have not been very conclusive in the context of gene regulatory networks (GRNs). Thanks to single-cell technology, this study (Joung et al. (2023)) performs an over-expression screen on TF combinations and directly tracks the resulting changes via scRNA-seq.

More specifically, they built a barcoded TF atlas of 3,548 TF isoforms in hESCs and profiled the cells with SHARE-seq (ATAC + RNA) 7 days after infection (MOI < 0.3).
The authors ended up with 671,453 cells covering 3,266 TFs (3 to 1,000 cells per TF ORF, with an average of 206).

Inferring TF functions (protocol summary)

What makes the project very cool is the level of detail in these data, but that detail also makes the analyses very complex. So the key part of the paper is how they integrate these data and map each TF to a cell fate (and a GRN).

For the scRNA-seq part, they used Scanpy for the standard analysis and clustering.

Then, they assigned each cell its most abundant TF barcode (requiring >10 reads and >50% more reads than the second-highest TF), which links each TF to an expression profile. To match the TF over-expression profiles to natural cell states, they compared the 1,121 Scanpy-identified HVGs (highly variable genes) from human fetal cortex or brain organoid scRNA-seq datasets with the TF over-expression profiles using Pearson’s correlation.
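The barcode rule above can be sketched in a few lines. This is only an illustration under my own assumptions, not the authors' code: the function name `assign_tf`, its parameters, and the toy read counts are made up, and I read ">50% more reads than the second highest" as "top count > 1.5 × runner-up count".

```python
from typing import Optional


def assign_tf(barcode_reads: dict[str, int],
              min_reads: int = 10,
              min_excess: float = 0.5) -> Optional[str]:
    """Return the dominant TF barcode of one cell, or None if ambiguous.

    barcode_reads maps a TF barcode name to its read count in this cell.
    """
    if not barcode_reads:
        return None
    # Sort barcodes from most to least abundant.
    ranked = sorted(barcode_reads.items(), key=lambda kv: kv[1], reverse=True)
    top_tf, top_reads = ranked[0]
    second_reads = ranked[1][1] if len(ranked) > 1 else 0
    # Rule 1: the top barcode needs more than `min_reads` reads.
    if top_reads <= min_reads:
        return None
    # Rule 2 (assumed reading): top count must exceed 1.5x the runner-up.
    if top_reads <= (1 + min_excess) * second_reads:
        return None
    return top_tf


# Hypothetical cells with made-up read counts.
cells = {
    "cell_1": {"NEUROD1": 40, "ASCL1": 5},   # clear winner
    "cell_2": {"NEUROD1": 12, "ASCL1": 10},  # 12 is not >50% above 10
    "cell_3": {"GATA1": 8},                  # too few reads
}
assignments = {cell: assign_tf(reads) for cell, reads in cells.items()}
print(assignments)
```

Cells that fail either rule get `None`, which would correspond to dropping them from the TF-to-profile mapping.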

They used pseudotime analysis and scVelo (RNA velocity) to map the differentiation trajectories and determine whether a TF may be responsible for a certain branch of differentiation (this is what one would need in order to reproduce Figure 1H).

They also used NMF (non-negative matrix factorization) to identify GRNs. They ran NMF on log-normalized, centered expression data of the HVGs, after transforming negative values to 0 and multiplying positive values by -1. The optimal K (number of factors) they used was 50.
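NMF needs a non-negative input matrix, which centered data are not. One common workaround, which may or may not be exactly what the paper did, is to split the centered matrix into a positive part (negatives set to 0) and a negative part (sign flipped, positives set to 0) and place the two side by side. A minimal sketch on toy numbers, with helper names of my own choosing:

```python
def center_columns(matrix):
    """Subtract each gene's (column's) mean across cells (rows)."""
    n_rows = len(matrix)
    means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def split_signs(centered):
    """Return a non-negative matrix with twice as many columns.

    For each row, the first half keeps the positive entries (negatives -> 0)
    and the second half keeps the negative entries times -1 (positives -> 0).
    """
    pos = [[x if x > 0 else 0.0 for x in row] for row in centered]
    neg = [[-x if x < 0 else 0.0 for x in row] for row in centered]
    return [p + n for p, n in zip(pos, neg)]


# Toy log-normalized expression: 3 cells (rows) x 2 genes (columns).
log_expr = [[1.0, 4.0],
            [2.0, 0.0],
            [3.0, 2.0]]
centered = center_columns(log_expr)
nonneg = split_signs(centered)
for row in nonneg:
    print(row)
```

The resulting matrix could then be handed to any NMF implementation, for example scikit-learn's `NMF` with `n_components=50` to mirror the K reported above.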

Another exciting aspect is the joint clustering analysis. They used Seurat's FindTransferAnchors to map the human fetal cell transcriptome atlas onto the TF over-expressing cells and obtain cell type labels. “To annotate each differentiated cell, the cell type labels of the 10 nearest reference cells, as measured by Euclidean distance in the latent space, were evaluated. Differentiated cells were assigned the most common cell type if 8 (80%) or more of the reference cells shared the same cell type label.”
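The quoted voting rule is easy to sketch once the cells sit in a shared latent space. The code below does not use Seurat; the function name `transfer_label`, the 2-D coordinates, and the cell type names are all toy assumptions, meant only to show the "10 nearest neighbours, at least 8 agreeing" logic.

```python
import math
from collections import Counter


def transfer_label(query, reference, k=10, min_votes=8):
    """Label one query cell from its k nearest reference cells.

    query: latent-space coordinates of one differentiated cell.
    reference: list of (coordinates, cell_type) pairs.
    Returns the most common label if it has >= min_votes votes, else None.
    """
    # Rank reference cells by Euclidean distance to the query cell.
    nearest = sorted(reference, key=lambda ref: math.dist(query, ref[0]))[:k]
    votes = Counter(cell_type for _, cell_type in nearest)
    label, count = votes.most_common(1)[0]
    return label if count >= min_votes else None


# Toy 2-D latent space: 9 "neuron" cells near the origin,
# 9 "astrocyte" cells near (5, 5).
reference = (
    [((0.1 * i, 0.0), "neuron") for i in range(9)]
    + [((5.0 + 0.1 * i, 5.0), "astrocyte") for i in range(9)]
)

confident = transfer_label((0.2, 0.1), reference)   # sits inside the neuron cloud
ambiguous = transfer_label((2.9, 2.5), reference)   # sits between the two clouds
print(confident, ambiguous)
```

A cell between two reference populations gets no label at all, which is the conservative behaviour the 80% threshold is there to enforce.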

To integrate the ATAC-seq and scRNA-seq data, they again used Seurat v4, this time its weighted nearest neighbor (WNN) analysis, based on the HVGs from scRNA-seq and the top 250,000 most accessible regions from ATAC-seq. To assess chromatin accessibility, they used chromVAR and Presto (a fast implementation of the rank-sum test, auROC, and related summary statistics). My guess is that they then rank both the most highly expressed genes and the most accessible regions to identify the corresponding TFs?
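To make the rank-sum and auROC statistics a bit more concrete, here is a small pure-Python sketch of how an auROC follows from the Mann-Whitney U statistic. It is not Presto's API and not the authors' code; the function names and the toy accessibility scores are assumptions for illustration.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions in the run
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def auroc(in_group, out_group):
    """auROC of a score for separating in_group from out_group.

    Uses the Mann-Whitney U statistic: AUC = U / (n_in * n_out).
    """
    ranks = average_ranks(list(in_group) + list(out_group))
    n_in, n_out = len(in_group), len(out_group)
    rank_sum = sum(ranks[:n_in])
    u_stat = rank_sum - n_in * (n_in + 1) / 2
    return u_stat / (n_in * n_out)


# Toy accessibility scores: cells carrying one TF versus all other cells.
tf_cells = [3.0, 4.0, 5.0]
other_cells = [1.0, 2.0, 3.0]
score = auroc(tf_cells, other_cells)
print(round(score, 3))
```

An auROC near 1 means the feature is almost always higher in the TF's cells than elsewhere, 0.5 means no separation, and values near 0 mean it is consistently lower.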

I am truly impressed by how extensive the bioinformatics pipeline behind these analyses was; each of these tools could help us better understand the data.

This is truly an amazing read, with a lot of new packages that I shall look into!