Analysis of multi-condition single-cell data with latent embedding multivariate regression
·Projecting cells into different spaces based on experimental conditions
Eltze and Huber et al (2025) https://doi.org/10.1038/s41588-024-01996-0 found a method on surveying gene differential expressions on multi-condition scRNA-seq. Instead of integrating the cells from different experiments, this paper is proposing a different way to merge the cells and could led to meaningful DE analysis for different conditions.
“The central idea of LEMUR is to represent multi-condition single-cell RNA-seq data using a multi-condition extension of PCA.”
From my understanding, first of all, all the cells were mapped to a common integrated space (the base space). Then the subspace-spanning matrix R were calculated for each condition that could “rotate” the cell coordinates into a particular subspace. Such rotation is implicated in gene differential expressions upon a certain condition (described by the design matrix). And LEMUR then translate the rotation into DEGs by pseudobulk DE test (using EdgeR).
A second important feature that LEMUR renders is that it could project the cell coordinates from the base space back to the condition space, and thus could give us a rough estimate of how the cell would experience DE when change from one condition to another.
This method that provide a full mapping of cells across conditions are definitely useful in many settings and the way it analyses DE also sounds very comprehensive and appears to explain the residual variance very well rather than throwing them away in the integration step. Thus, I think it would worth trying to go through some of the examples and go deeper to learn how to use this package.
Glioblastoma analysis using LEMUR
The glioblastoma dataset was published by Zhao et al, 2020 GSE148842 to study the drug response in human tumour tissues using sc-RNAseq.
It treated the tumor with panobinostat or etoposide, and thus we have 3 treatment condition: ctrl, panobinostat, and etoposide. In the LEMUR tutorial for giloblastoma, we were only focusing on the comparison between ctrl and panobinostat. There were 5 patients (PW030-PW040) that had panobinostat treatments and thus, we have 5 patient conditions.
The design matrix would be like this:
condition | intercept | etoposide | panobinostat | PW030 | PW032 | PW034 | PW036 | PW040 |
---|---|---|---|---|---|---|---|---|
ctrl | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
etoposide | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
panobinostat | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Since, we are not interested in patient-specific expression, we kept all patient column as “0”. In the beginning of the tutorial, most of the efforts were used in putting the sample together as 1 big SingleCellExperiment object.
library(tidyverse)
library(lemur)
library(glue)
library(SingleCellExperiment)
library(monocle3)
library(Matrix)
set.seed(1)
setwd("../../LEMUR_sc_multi_condition/")
file_df <- tibble(file = list.files(path = "../LEMUR_sc_multi_condition/GSE148842_RAW/", full.names = TRUE, pattern= "^GSM"),
id = str_match(file, ".+/GSM\\d{7}_(PW[0-9-]*)\\.cts\\.txt\\.gz")[,2])
geo_metadata <- GEOquery::getGEO("GSE148842")
# Parse geo metadata #
patient_annotation <- bind_rows(as_tibble(geo_metadata[[1]]),
as_tibble(geo_metadata[[2]])) %>%
unite("description", starts_with("characteristic"), sep = "\n") %>%
transmute(id = title, description, origin = source_name_ch1) %>%
separate(id, into = c("patient_id", "treatment_id"), remove = FALSE) %>%
mutate(age = as.numeric(str_match(description, "age: (\\d{1,3})\\s")[,2]),
gender = str_match(description, "gender: ([MF])\\s")[,2],
location = str_match(description, "location:\\s*([^\\n]+)\\s*\\n")[,2],
diagnosis = str_match(description, "diagnosis:\\s*([^\\n]+)\\s*\\n")[,2],
treatment = str_match(description, "treatment:\\s*([^\\n]+)\\s*$")[,2]) %>%
dplyr::select(-description) %>%
mutate(patient_id = ifelse(str_starts(id, "PW05"), str_sub(id, start = 1L, end = -4L), patient_id),
treatment_id = ifelse(str_starts(id, "PW05"), str_sub(id, start = -3L, end = -1L), treatment_id)) %>%
mutate(origin = ifelse(origin == "glioma surgical biopsy", "biopsy", "tissue_slice"),
recurrent_tumor = str_ends(diagnosis, "recurrent")) %>%
mutate(condition = case_when(
treatment == "vehicle (DMSO)" ~ "ctrl",
treatment == "2.5 uM etoposide" ~ "etoposide",
treatment == "0.2 uM panobinostat" ~ "panobinostat",
.default = ~ "other",
))
### 43 samples, 10 patients, selected 3 treatment conditions for Lemur analysis ###
# Select panobinostat and one random slice for each condition
set.seed(1)
### make a file that have patient metadata and dir path to the scRNA dataset
tmp <- patient_annotation %>%
tidylog::inner_join(file_df) %>%
filter(condition != "other") %>%
filter(! recurrent_tumor) %>%
filter(origin == "tissue_slice")
### 26 files with different conditions ctrl, ectopside and panobinostat ###
# Load SingleCellExperiment from file
sces <- map(tmp$file, \(fi){
id <- str_match(fi, ".+/GSM\\d{7}_(PW[0-9-]*)\\.cts\\.txt\\.gz")[2]
df <- data.table::fread(file = fi, showProgress = FALSE)
rowdata <- df[,c("gid", "gene")]
counts <- df %>%
dplyr::select(- c(gid, gene)) %>%
as.matrix() %>%
as("dgCMatrix")
rownames(counts) <- rowdata$gid
rownames(rowdata) <- rowdata$gid
sce <- SingleCellExperiment(list(counts = counts), colData = data.frame(id = rep(id, ncol(counts))), rowData = as.data.frame(rowdata))
# Reduce to 800 cells per sample
# sce <- sce[,sample.int(ncol(sce), min(ncol(sce), 800))]
sce
}, .progress = TRUE)
### combine 26 list of patients to a single experiment ###
sce <- do.call(cbind, sces)
dim(sce)
### 60725 genes x 122548 cells ###
After rbinds all expression matrix together, we ended up with 60725 genes and 122548 cells. Each sample had cells ranging from 1696 to 15894 and here is the breakdowns.
colnames <- make.unique(colnames(sce))
### add patient annotation to sce ###
colData(sce) <- colData(sce) %>%
as.data.frame() %>%
tidylog::left_join(tmp, by = "id") %>%
dplyr::select(- c(file, origin, recurrent_tumor)) %>%
DataFrame()
sce$pat_cond <- paste0(sce$patient_id, "-", sce$condition)
table(colData(sce)$pat_cond)
#PW029-ctrl PW029-etoposide PW030-ctrl PW030-etoposide PW030-panobinostat
# 1696 4454 21147 3942 7118
#PW032-ctrl PW032-etoposide PW032-panobinostat PW034-ctrl PW034-etoposide
# 7647 2508 1401 15288 15894
#PW034-panobinostat PW036-ctrl PW036-etoposide PW036-panobinostat PW040-ctrl
# 2679 13289 6120 3096 14282
#PW040-panobinostat
# 1987
Then we proceeded with the filtering (genes have counts in >=5 cells) and transformation (log-transform). Then we removed cells based on qc_df that cells with high mitochondria portions were removed. Finally, we removed cells that have abnormally high read counts, only keeping cells that have total read counts between 800 and 1200.
logcounts(sce) <- transformGamPoi::shifted_log_transform(sce)
colnames(sce) <- colnames
# Remove useless genes
sce <- sce[rowSums(counts(sce)) >= 5, ]
## 37276 genes x 122548 cells ###
# Remove version number of ENSEMBL gene id's
rowData(sce)$gid <- stringr::str_remove(rowData(sce)$gid, "\\.\\d+")
rownames(sce) <- rowData(sce)$gid
qc_df <- scuttle::perCellQCMetrics(sce, subsets = list(Mito = ! is.na(rowData(sce)$chromosome) & rowData(sce)$chromosome == "MT",
Y_chr = ! is.na(rowData(sce)$chromosome) & rowData(sce)$chromosome == "Y"))
as_tibblssceas_tibblsceas_tibble(qc_df, rownames = "barcode") %>%
tidylog::left_join(as_tibble(colData(sce), rownames = "barcode"), by = "barcode") %>%
group_by(patient_id, gender) %>%
summarize(max(subsets_Y_chr_percent))
summary(qc_df$sum)
summary(qc_df$detected)
summary(qc_df$subsets_Mito_percent)
qc_filters <- scuttle::perCellQCFilters(qc_df, sub.fields = c("subsets_Mito_percent"))
rownames(qc_filters) <- rownames(qc_df)
### no colnames (cell names) were saved though in sce
qc_filters %>%
as_tibble() %>%
group_by(across(low_lib_size:high_subsets_Mito_percent)) %>%
summarize(n = n())
# A tibble: 2 × 4
# Groups: low_lib_size, low_n_features [2]
#low_lib_size low_n_features high_subsets_Mito_percent n
#<otlr.flt> <otlr.flt> <otlr.flt> <int>
# 1 FALSE FALSE FALSE 121545
# 2 FALSE TRUE FALSE 1003
### so 121545 pass QC and 1003 do not due to low features ###
sce <- sce[,!qc_filters$discard]
# Filter out extreme cell sizes
sce <- sce[,colSums(counts(sce)) > 800 & colSums(counts(sce)) < 12000]
qs::qsave(sce, "glioblastoma_cleaned_sce.qs")
Next, the authors utilized the fact that the tumors had a chromosome 7 duplication and a chromosome 10 deletion to define tumor cells.
### look into chromosome 7 duplication and chr 10 deletion ###
### calculate expression level of chr 7, 10 vs chr 1-5 (normal))
sel_chromosomes <- as.character(1:5)
special_chr <- c("7", "10")
w<-deframe(vctrs::vec_group_loc(rowData(sce)$chromosome.y))
w[unique(c(sel_chromosomes, special_chr))]
### identify rows that are in the respecitive chromosome based on rowData ###
### take row Means from counts(sce) ###
chr_counts <- t(lemur:::aggregate_matrix(t(counts(sce)),
group_split = deframe(vctrs::vec_group_loc(rowData(sce)$chromosome.y))[unique(c(sel_chromosomes, special_chr))], rowMeans2))
rownames(chr_counts)
dim(chr_counts)
### 7 chromosome x 94796 genes ##
ratio7 <- chr_counts["7", ] / colMeans2(chr_counts, rows = 1:5)
ratio10 <- chr_counts["10", ] / colMeans2(chr_counts, rows = 1:5)
simple_label <- ifelse(ratio10 < ratio7, "tumor", "microenvironment")
### so tumor cells have chr7 duplication and chr10 deletion ###
tibble(chr10 = ratio10, chr7 = ratio7) %>%
mutate(simple_label) %>%
ggplot(aes(x = chr10, y = chr7)) +
geom_point(aes(color = simple_label), size= 0.1, stroke = 0) +
scale_x_log10() + scale_y_log10() +
guides(color = guide_legend(override.aes = list(size = 1)))
Now, we identified HVG and build PCA and integrated map of the cells. Next, a integrated mapping was built using Harmony with the PCA results.
### HVG ###
set.seed(1)
hvg <- order(-rowVars(logcounts(sce)))
### build pca using 2000 HVG to 35 PC on 94796 cells ###
### res <- prcomp(t(Y), rank. = n, center = center_ind, scale. = FALSE) ###
### list(coordsystem = unname(res$rotation), embedding = unname(t(res$x)), offset = center) ###
subset_pca <- lemur:::pca(logcounts(sce)[hvg[1:2000], ], n = 35)
### run harmony ###
harm <- harmony::RunHarmony(t(subset_pca$embedding), meta_data = colData(sce), vars_use = c("pat_cond", "treatment_id"), lambda = c(1,1))
### harmony embedding show 94796 cells x 35 PC ###
harm_umap <- uwot::umap(harm)
### generate Clustering based on harmony mapping
graph <- bluster::makeKNNGraph(harm, k = 15, BNPARAM = BiocNeighbors::AnnoyParam())
clustering <- igraph::cluster_walktrap(graph)
set.seed(1)
clusters <- igraph::cut_at(clustering, 4)
The umap generated from harmony PC showed that there were roughly 4 clusters, reflecting the 4 cell types Myeloid cell, Tumor cells, Oligodendrocytes, and T cells. Breaking down the cell composition per cluster, we can see that the integration is successfully that the patient’s cells are distributed evenly in each cluster. And we can also see that the clustering showed mixing of drug treatment condition and tumor/microenviroment cells. In other words, the clustering reflect mainly the cell type signals but not the treatment or tumor signals.
Differential expression analysis
Here in the tutorial LEMUR github, marker genes were used to QC the clustering. After that, we run LEMUR. Using top 3000 HVG, LEMUR were run based on the design of ** ~ condition + patient_id ** to generate 60 latent embedding for the cells. Finally, we run lemur::test_de and lemur::find_de_neighborhoods to test DEG using aggregated reads from each neighborhood between the ctrl ~ panobinostat condition.
sel_genes <- c("CD14", "TYROBP", "PLP1", "MAG", "TRBC1", "TRBC2", "GAP43")
gene_ids <- deframe(as_tibble(rowData(sce)[c("gene", "gid")]))[sel_genes]
expr_mat <- counts(sce)[gene_ids, ]
cell_type_label <- case_when(
clusters == 1 ~ "Myeloid cells",
clusters == 2 ~ "Tumor cells",
clusters == 3 ~ "Oligodendrocytes",
clusters == 4 ~ "T cells"
)
colData(sce) <- colData(sce) %>%
as_tibble() %>%
mutate(cell_type = cell_type_label, chr_ratio_label = simple_label,
chr_10_ratio = ratio10, chr_7_ratio = ratio7) %>%
DataFrame()
reducedDim(sce, "PCA") <- t(subset_pca$embedding)
set.seed(1)
reducedDim(sce, "UMAP") <- uwot::umap(t(subset_pca$embedding))
reducedDim(sce, "harmony") <- harm
reducedDim(sce, "harmony_umap") <- harm_umap
qs::qsave(sce, "glioblastoma_annotated_sce.qs")
sce<-qs::qread("glioblastoma_annotated_sce.qs")
hvg <- order(-rowVars(logcounts(sce)))
#### reduce sce to only contain HVG ###
sce <- sce[hvg[1:3000],]
set.seed(1)
### run lemur to fit the latent embedding model ###
fit <- lemur::lemur(sce, design = ~ condition + patient_id, n_embedding = 60, test_fraction = 0.5)
reducedDim(fit, "fit_umap") <- uwot::umap(t(fit$embedding))
set.seed(1)
### Enforce additional alignment of cell clusters beyond the direct differential embedding ###
fit <- lemur::align_harmony(fit)
### Predict log fold changes between conditions for each cell ###
fit <- lemur::test_de(fit, contrast = cond(condition = "panobinostat") - cond(condition = "ctrl"), new_assay_name = "DE_panobinostat")
qs::qsave(fit, "glioblastoma_fit_hvg3k.qs")
nei_pan <- lemur::find_de_neighborhoods(fit, group_by = vars(patient_id, condition),
de_mat = assay(fit, "DE_panobinostat"), test_method = "edgeR")
qs::qsave(nei_pan, "glioblastoma_fit_hvg3k-nei_pan.qs")
#> nei_pan
# name n_cells sel_statistic pval adj_pval #f_statistic df1 df2
#1 ENSG00000210082 91752 -278.19109 6.122879e-02 0.143281094 4.245333375 1 12.26472
#2 ENSG00000118785 87673 -140.95765 2.133515e-02 0.070724261 6.957071366 1 12.26472
#3 ENSG00000167996 59399 -33.45591 9.125650e-02 0.188908456 3.358030447 1 12.26472
#4 ENSG00000087086 82175 -61.24831 3.492159e-01 0.488413852 0.947288731 1 12.26472
#5 ENSG00000120885 91981 153.77459 5.405079e-02 0.131085186 4.537991245 1 12.26472
#6 ENSG00000125148 83114 499.28694 1.781635e-05 0.001484696 45.815445875 1 12.26472
#7 ENSG00000251562 15481 -25.39461 5.747738e-02 0.136811871 4.392805386 1 12.26472
#8 ENSG00000109846 60067 88.53674 8.150645e-01 0.875785612 0.057113560 1 12.26472
#....
Conclusion
I am very happy to see that I could finally perform EdgeR directly from the scRNA-seq and see the DE table as an end product. I think LEMUR really streamlines the process and help us confidently connect multiple datasets.
The article also demonstrates other use of the LEMUR package, especially the zebrafish differentiation dataset that can help identify DE genes along differentiation axis.