A novel approach to remove the batch effect of single-cell data
Another way to merge noisy replicates
Here is an interesting method proposed by Zhang et al. (2019) for removing batch effects. I am currently dealing with datasets that have profound batch effects and am exploring different ways to find the closest neighbors between datasets.
According to the article, the concept is quite simple. First, the expression matrix of each batch is compressed to a single t-SNE dimension. Cells are then grouped, 10 at a time, into representative cells based on their t-SNE values. BEER then maps the closest neighbors (mutual nearest neighbors, MN) between batches using Kendall's tau-b between representative cells, one from each batch.
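As a language-agnostic sketch of that first part (toy values and a hypothetical `group_by_embedding` helper — not BEER's actual code, which as we will see below groups cells by k-means instead):

```python
# Sketch of the article's grouping idea: sort cells by a 1-D embedding,
# then average every `group_size` consecutive cells into one
# "representative cell". All values below are made up.

def group_by_embedding(embedding, expr, group_size=10):
    """embedding: one value per cell; expr: gene-by-cell matrix (list of rows)."""
    order = sorted(range(len(embedding)), key=lambda c: embedding[c])
    reps = []
    for start in range(0, len(order), group_size):
        cells = order[start:start + group_size]
        # mean expression of each gene over the cells in this chunk
        reps.append([sum(row[c] for c in cells) / len(cells) for row in expr])
    return reps  # one representative (gene vector) per chunk

# toy data: 2 genes x 4 cells, chunks of 2
embedding = [0.9, 0.1, 0.5, 0.2]
expr = [[1, 2, 3, 4],
        [5, 6, 7, 8]]
print(group_by_embedding(embedding, expr, group_size=2))  # [[3.0, 7.0], [2.0, 6.0]]
```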
In the second part, BEER simply concatenates the expression matrices of the batches, performs PCA as usual, and generates 50 PCs. BEER then identifies the PCs that correlate poorly between the MN-paired cell groups and removes them.
The idea sounds promising, but I am not sure whether removing PCs preserves biological variance better than other methods such as CCA. And what if the batch effects affect many PCs? Is the MN pairing one-to-one, and how is a one-to-many mapping handled? What about groups of novel cells that are not present in the other batch? Will that affect our downstream biomarker identification? Or is this method more about comparing cell subpopulation sizes?
And thus, I am digging into the code to find out.
Example in BEER
Step 1: load data
library(reticulate)
use_python("/path/to/python3",required=T)
py_config()
### load data ###
library(Seurat)
source("/path/to/BEER/BEER.R")
setwd("/path/to/BEER/")
D1 <- read.table(unz("demodata.zip","DATA1_MAT.txt"), sep='\t', row.names=1, header=T)
### 6695 genes x 861 cells
D2 <- read.table(unz("demodata.zip","DATA2_MAT.txt"), sep='\t', row.names=1, header=T)
### 6695 genes x 1225 cells
colnames(D1)=paste0('D1_', colnames(D1))
colnames(D2)=paste0('D2_', colnames(D2))
#### simple_combine
DATA=.simple_combine(D1,D2)$combine
### define batch info ###
BATCH=rep('D2',ncol(DATA))
BATCH[c(1:ncol(D1))]='D1'
### step 2: Detect Batch Effect ###
mybeer=BEER(DATA, BATCH, GNUM=30, PCNUM=50, ROUND=1, GN=2000, SEED=1, COMBAT=TRUE, RMG=NULL)
### step 3: rerun umap based on BEER selected PC and plot UMAP again ###
pbmc <- mybeer$seurat
PCUSE <- mybeer$select
pbmc <- RunUMAP(object = pbmc, reduction='pca',dims = PCUSE, check_duplicates=FALSE)
DimPlot(pbmc, reduction='umap', group.by='batch', pt.size=0.1)
So in this step, the cells of each dataset were prefixed with "D1" or "D2" according to batch, and the datasets were simply cbind-ed after checking that their gene lists (rownames) were identical.
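The combine step therefore boils down to a rowname check plus a column-bind; a minimal sketch with a hypothetical helper (BEER's own `.simple_combine` is the R original):

```python
def simple_combine(genes1, mat1, genes2, mat2):
    """Column-bind two gene-by-cell matrices after checking gene lists match."""
    if genes1 != genes2:
        raise ValueError("gene lists (rownames) differ between batches")
    # concatenate each gene's cells from batch 1 and batch 2
    return [row1 + row2 for row1, row2 in zip(mat1, mat2)]

genes = ["geneA", "geneB"]
d1 = [[1, 2], [3, 4]]   # 2 genes x 2 cells
d2 = [[5], [6]]         # 2 genes x 1 cell
print(simple_combine(genes, d1, genes, d2))  # [[1, 2, 5], [3, 4, 6]]
```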
Breakdown of the BEER function (step 2)
So here I looked into the wrapper function BEER to see how it cleans up the batch effects.
BATCH=2
GNUM=30
PCNUM=50
GN=2000
CPU=4
COMBAT=TRUE
print_step=10
SEED=123
N=2
ROUND=1
RMG=NULL
set.seed( SEED)
RESULT=list()
library(Seurat)
COMBAT.EXP=NULL
require(stringi)
BATCH=stri_replace_all(BATCH, '.',fixed='_')
UBATCH=unique(BATCH)
print('Group number (GNUM) is:')
print(GNUM)
print('Varible gene number (GN) of each batch is:')
print(GN)
print('ROUND is:')
print(ROUND)
VARG=c()
i=1
### find highly variable genes in each batch ###
for(this_batch in UBATCH){
print(i)
i=i+1
print(this_batch)
d<-DATA[,which(BATCH==this_batch)]
dm<-as.matrix(d)  ### as.matrix takes no nrow argument, so drop it
this_pbmc=CreateSeuratObject(counts = dm, min.cells = 0,
min.features = 0, project = this_batch)
this_pbmc <- NormalizeData(object = this_pbmc, normalization.method = "LogNormalize",
scale.factor = 10000)
this_pbmc <- FindVariableFeatures(object = this_pbmc, selection.method = "vst", nfeatures = GN)
this_varg=VariableFeatures(object = this_pbmc)
VARG=c(VARG, this_varg)
}
### find union of unique genes
VARG=unique(VARG)
###2551 HVG
print('Total varible gene number (GN) is:')
print(length(VARG))
Here, BEER first separates the batches and identifies highly variable genes (HVGs) in each batch using Seurat. The union of the per-batch HVGs is extracted and used downstream.
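The per-batch HVG-union logic can be sketched like this (hypothetical helper; plain per-gene variance stands in for Seurat's vst statistic, and all values are made up):

```python
def top_variable_genes(genes, expr, n):
    """Return the n genes with highest variance across cells
    (a crude stand-in for Seurat's vst-based FindVariableFeatures)."""
    def var(row):
        m = sum(row) / len(row)
        return sum((x - m) ** 2 for x in row) / len(row)
    ranked = sorted(genes, key=lambda g: var(expr[g]), reverse=True)
    return ranked[:n]

# two batches, expression per gene (toy values)
batch1 = {"g1": [0, 10, 0], "g2": [5, 5, 5], "g3": [1, 2, 1]}
batch2 = {"g1": [3, 3, 3], "g2": [0, 9, 1], "g3": [2, 2, 3]}
hvg = set(top_variable_genes(["g1", "g2", "g3"], batch1, 1)) | \
      set(top_variable_genes(["g1", "g2", "g3"], batch2, 1))
print(sorted(hvg))  # ['g1', 'g2'] -- the union of each batch's top gene
```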
### work with combined dataset
pbmc=CreateSeuratObject(counts = DATA, min.cells = 0, min.features = 0, project = "ALL")
pbmc@meta.data$batch=BATCH
VariableFeatures(object = pbmc)=VARG
A Seurat object is generated from the combined dataset, with the batch info and HVGs defined manually from the previous steps.
pbmc <- NormalizeData(object = pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
if(COMBAT==FALSE){
### have a look at non-normalized plot
pbmc <- ScaleData(object = pbmc, features = VariableFeatures(object = pbmc))
pbmc <- RunPCA(object = pbmc, seed.use=SEED, npcs=PCNUM, features = VariableFeatures(object = pbmc), ndims.print=1,nfeatures.print=1)
pbmc <- RunUMAP(pbmc, dims = 1:PCNUM,seed.use = SEED,n.components=N)
library(ggplot2)
library(patchwork)
DimPlot(pbmc)+labs(title="UMAPs of DATA before batch effect removal")
}else{
############## COMBAT ###
library(sva)
library(limma)
pheno = data.frame(batch=as.matrix(BATCH))
orig.data=pbmc@assays$RNA$data
used.gene.index=which(rownames(orig.data) %in% VARG)
#edata = as.matrix(orig.data)[used.gene.index,]
edata = as_matrix(orig.data)[used.gene.index,]
batch = pheno$batch
modcombat = model.matrix(~1, data=pheno)
combat_edata = ComBat(dat=edata, batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)
rownames(combat_edata)=rownames(edata)
colnames(combat_edata)=colnames(edata)
combat_edata=as.matrix(combat_edata)
combat_edata[which(combat_edata<0)]=0
combat_edata[which(is.na(combat_edata))]=0
pbmc@assays$RNA$data=combat_edata
######
pbmc <- ScaleData(object = pbmc, features = VariableFeatures(object = pbmc))
######
pbmc@assays$RNA$data=orig.data
COMBAT.EXP=combat_edata
#################
rm(edata)
rm(combat_edata)
rm(orig.data)
gc()
}
In this step, we can see that the dataset is first log-normalized with Seurat and then corrected with ComBat on the HVGs. Seurat's ScaleData is then run on the ComBat-corrected values, after which the original (uncorrected) data slot is restored, so the correction only feeds into the scaled matrix used for PCA.
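ComBat itself fits an empirical Bayes location/scale model; purely as an illustration of the idea, a location-only correction shifts each batch's per-gene mean onto the global mean (hypothetical `mean_shift_correct` helper, toy values — not ComBat's actual model):

```python
def mean_shift_correct(expr, batch):
    """Crude location-only batch correction: shift each batch's per-gene
    mean to the global mean. ComBat additionally estimates scale and
    shrinks its estimates with empirical Bayes; this is a sketch only."""
    corrected = []
    for row in expr:
        global_mean = sum(row) / len(row)
        new_row = list(row)
        for b in set(batch):
            idx = [i for i, bb in enumerate(batch) if bb == b]
            bmean = sum(row[i] for i in idx) / len(idx)
            for i in idx:
                new_row[i] = row[i] - bmean + global_mean
        corrected.append(new_row)
    return corrected

expr = [[1.0, 3.0, 5.0, 7.0]]        # one gene, four cells
batch = ["D1", "D1", "D2", "D2"]
print(mean_shift_correct(expr, batch))  # [[3.0, 5.0, 3.0, 5.0]]
```

After the shift, the two batches have identical per-gene means, which is the flavor of effect ComBat removes (in a far more principled way).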
Very interestingly, the example data are already well merged at this point, so it is not entirely clear how this example demonstrates the added value of BEER.
print('Calculating PCs ...')
pbmc <- RunPCA(object = pbmc, seed.use=SEED, npcs=PCNUM, features = VariableFeatures(object = pbmc), ndims.print=1,nfeatures.print=1)
### make 50 PC pca ###
pbmc <- RunUMAP(pbmc, dims = 1:PCNUM,seed.use = SEED,n.components=N)
### (these two objects are defined earlier in the full BEER source) ###
DR=pbmc@reductions$pca@cell.embeddings
GROUP=rep('NA', ncol(DATA))
### cluster each batch separately on its PCA embedding ###
for(this_batch in UBATCH){
this_index=which(BATCH==this_batch)
this_one=DR[this_index,]
#CNUM=max(c(5, round(length(this_index)/GNUM) ))
this_gnum=min(GNUM, (length(this_index)-1))
this_group=.getGroup(this_one,this_batch,this_gnum)
#### .getGROUP = function to perform kmeans clustering of GNUM (30 by default) cluster ###
GROUP[this_index]=this_group
}
pbmc@meta.data$group=GROUP
VP=.getVPall(pbmc, ROUND)
Using the .getGroup function, each batch is clustered into 30 clusters (GNUM) by k-means. Next, the .getVPall function performs the 1-to-1 mapping of cluster representatives between the 2 batches. This is the novel part of BEER, so let's have a closer look at the code.
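The grouping that .getGroup performs can be sketched with a minimal 1-D Lloyd's k-means (toy values; the real function runs R's kmeans on the multi-dimensional PCA embedding):

```python
def kmeans_1d(values, k, iters=10):
    """Minimal Lloyd's k-means on 1-D values -- a stand-in for the
    k-means that .getGroup runs on each batch's PCA embedding."""
    srt = sorted(values)
    # deterministic init: k evenly spaced values from the sorted data
    centers = [srt[(i * len(srt)) // k] for i in range(k)]
    for _ in range(iters):
        # assign each value to its nearest center
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # move each center to the mean of its members
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

# two tight clumps -> two clusters
vals = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9]
print(kmeans_1d(vals, 2))  # [0, 0, 0, 1, 1, 1]
```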
#.getVPall<- function(pbmc, ROUND){
print('Finding MN pairs...')
################
#REF=.generate_agg(pbmc@assays$RNA$data, pbmc@meta.data$group)
exp_sc_mat<-pbmc@assays$RNA$data
TAG<-pbmc@meta.data$group
REF=.generate_agg(pbmc@assays$RNA$data, pbmc@meta.data$group)
##### breakdown of the function .generate_agg ###
##### here, for each cluster, expression values are summed across its cells (the mean version is commented out below) ####
#### this generates one representative per cluster ###
.generate_agg <- function(exp_sc_mat, TAG, print_step=100){
print_step=100
NewRef=matrix(0,ncol=length(unique(TAG)),nrow=nrow(exp_sc_mat))
### NewRef 6695 x 60 new representative exp matrix
TAG=as.character(TAG)
refnames=unique(TAG)
total_num=length(refnames)
outnames=c()
i=1
#### take exp mean for each cluster for each gene to generate representative
while(i<=length(refnames)){
one=refnames[i]
this_col=which(TAG==one)
outnames=c(outnames,one)
if(length(this_col) >1){
#this_new_ref=apply(exp_sc_mat[,this_col],1,mean)
this_new_ref=apply(exp_sc_mat[,this_col],1,sum)
}else{
this_new_ref = exp_sc_mat[,this_col]
}
NewRef[,i]=this_new_ref
if(i%%print_step==1){print(paste0(i,' / ' ,total_num ))}
i=i+1
}
rownames(NewRef)=rownames(exp_sc_mat)
colnames(NewRef)=outnames
if(length(NewRef[1,])==1){
NewRef=cbind(NewRef[,1], NewRef[,1])
rownames(NewRef)=rownames(exp_sc_mat)
colnames(NewRef)=c(outnames,outnames)
}
return(NewRef)
}
#.getVPall cont.
VREF=NewRef
CORMETHOD='spearman'
CVREF=cor(VREF,method=CORMETHOD)
orig.CVREF=CVREF
UBATCH=unique(pbmc@meta.data$batch)
#### UBATCH - "D1" and "D2"
.get_batch<-function(x){
y=unlist(strsplit(x,'_'))[1]
return(y)
}
group_batch=apply(as.matrix(colnames(CVREF)),1,.get_batch)
.getMN <- function(this_cor_mat){
VP=c()
i=1
while(i<=nrow(this_cor_mat)){
this_p1=rownames(this_cor_mat)[i]
j=1
while(j<=ncol(this_cor_mat)){
this_p2=colnames(this_cor_mat)[j]
this_cor=this_cor_mat[i,j]
if( this_cor!= -99999 & this_cor==max(this_cor_mat[i,]) & this_cor==max(this_cor_mat[,j])){
### so not only b1 has to be the max correlated with b2, b2 also has to be the max correlated with b1
VP=cbind(VP,c(this_p1,this_p2))
}
j=j+1}
i=i+1}
return(VP)
}
VP=c()
I=1
i=1
while(i<length(UBATCH)){
j=i+1
while(j<=length(UBATCH)){
b1=UBATCH[i]
b2=UBATCH[j]
b1i=which(group_batch==b1)
b2i=which(group_batch==b2)
this_cor_mat=CVREF[b1i,b2i]
### make the correlation matrix for batch 1 vs batch 2 ###
this_vp=.getMN(this_cor_mat)
VP=cbind(VP,this_vp)
########################
########################
j=j+1}
i=i+1
}
########################
########################
print('ROUND:')
print(I)
I=I+1
}
### this closing brace ends an outer ROUND loop; its opening "while" is not shown in this excerpt ###
VP=t(VP)
VP=unique(VP)
#######################
print('Number of MN pairs:')
print(nrow(VP))
return(VP)
}
#> VP
# [,1] [,2]
#[1,] "D1_26" "D2_23"
#[2,] "D1_27" "D2_27"
#[3,] "D1_23" "D2_10"
#[4,] "D1_14" "D2_28"
#[5,] "D1_11" "D2_11"
#[6,] "D1_29" "D2_22"
The .getVPall function consists of 2 main sub-functions. .generate_agg isolates the cells (columns) assigned to the same cluster and aggregates each gene's expression (the active code sums across cells; a mean version is commented out) to create a cluster representative. .getMN then computes the correlation matrix between all cluster representatives (Spearman correlation in the code, although the article describes Kendall's tau-b) and checks whether a cluster has a 1-to-1 mapping between batches. A 1-to-1 mapping requires that the cluster is most highly correlated with its counterpart while the counterpart is also most highly correlated with that cluster; if the counterpart correlates more highly with some other cluster, no MN pair is established. In this demo dataset, we found 6 such 1-to-1 mappings between the 2 batches (each batch contains 30 clusters).
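The mutual-nearest rule of .getMN can be sketched as follows (toy correlation matrix and a hypothetical function name):

```python
def mutual_nearest_pairs(cor, rows, cols):
    """Pair row i with column j when cor[i][j] is both the max of row i
    and the max of column j -- the mutual-nearest rule used by .getMN.
    (Maxima are recomputed in the loop for clarity, not speed.)"""
    pairs = []
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            row_max = max(cor[i])
            col_max = max(cor[x][j] for x in range(len(rows)))
            if cor[i][j] == row_max and cor[i][j] == col_max:
                pairs.append((r, c))
    return pairs

# toy correlations between D1 clusters (rows) and D2 clusters (columns)
cor = [[0.9, 0.2, 0.1],
       [0.3, 0.8, 0.7],
       [0.4, 0.85, 0.6]]
print(mutual_nearest_pairs(cor, ["D1_1", "D1_2", "D1_3"],
                                ["D2_1", "D2_2", "D2_3"]))
# [('D1_1', 'D2_1'), ('D1_3', 'D2_2')]
```

Note that D1_2 and D2_3 end up unpaired: D1_2's best match (D2_2) prefers D1_3, so no MN pair is formed. This is exactly why only a fraction of clusters get MN pairs.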
Then we move on to the step that identifies and removes the PCs carrying substantial batch effects.
DR=pbmc@reductions$pca@cell.embeddings
MAP=rep('NA',length(GROUP))
MAP[which(GROUP %in% VP[,1])]='V1'
MAP[which(GROUP %in% VP[,2])]='V2'
pbmc@meta.data$map=MAP
OUT=.evaluateProBEER(DR, GROUP, VP)
###.evaluateProBEER <- function(DR, GROUP, VP){
OUT=list()
VALID_PAIR=VP
ALL_COR=c()
ALL_PV=c()
ALL_LCOR=c()
ALL_LPV=c()
ALL_LC1=c()
ALL_LC2=c()
print('Evaluating PCs ...')
print('Start')
THIS_DR=1
while(THIS_DR<=ncol(DR)){
THIS_PC = DR[,THIS_DR]
lst1_quantile=c()
lst2_quantile=c()
i=1
while(i<=nrow(VALID_PAIR)){
this_pair=VALID_PAIR[i,]
this_index1=which(GROUP %in% this_pair[1])
this_index2=which(GROUP %in% this_pair[2])
lst1_quantile=c(lst1_quantile,quantile(DR[this_index1,THIS_DR]))
lst2_quantile=c(lst2_quantile,quantile(DR[this_index2,THIS_DR]))
i=i+1}
this_test=cor.test(lst1_quantile, lst2_quantile, method=CORMETHOD)
### why are they doing cor.test on quantile values?
this_cor=this_test$estimate
this_pv=this_test$p.value
this_test2=cor.test(lst1_quantile, lst2_quantile, method='pearson')
this_cor2=this_test2$estimate
this_pv2=this_test2$p.value
olddata=data.frame(lst1=lst1_quantile, lst2=lst2_quantile)
fit=lm(lst1 ~lst2, data=olddata)
ALL_LC1=c(ALL_LC1, summary(fit)$coefficients[1,4])
ALL_LC2=c(ALL_LC2, summary(fit)$coefficients[2,4])
ALL_COR=c(ALL_COR, this_cor)
ALL_PV=c(ALL_PV, this_pv)
ALL_LCOR=c(ALL_LCOR, this_cor2)
ALL_LPV=c(ALL_LPV, this_pv2)
print(THIS_DR)
THIS_DR=THIS_DR+1}
OUT$cor=ALL_COR
OUT$pv=ALL_PV
OUT$fdr=p.adjust(ALL_PV,method='fdr')
OUT$lc1=ALL_LC1
OUT$lc2=ALL_LC2
OUT$lcor=ALL_LCOR
OUT$lpv=ALL_LPV
OUT$lfdr=p.adjust(ALL_LPV,method='fdr')
print('Finished!!!')
return(OUT)
}
Here, the .evaluateProBEER function compares the PC values between the 1-to-1 mapped groups. Briefly, for each previously identified MN pair (e.g. "D1_26" and "D2_23"), the original PCA cell embeddings (N cells x 50 PCs) are recalled and split by cluster assignment. The quantiles of the PC values are then computed for each group, and cor.test is run on the quantile values. I think quantiles are used to resolve the issue of unequal cluster sizes, so each 1-to-1 cluster pair contributes 5 pairs of points (the 0, 25, 50, 75 and 100% quantiles) to the cor.test. Here, we have 5 pairs of points x 6 MN pairs, thus 30 data points per PC for the cor.test. Finally, the cor.test result for each PC is recorded and returned.
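The quantile-then-correlate evaluation can be sketched like this (hypothetical helpers and toy values; Pearson only, whereas .evaluateProBEER reports both Spearman and Pearson):

```python
def quantiles(x):
    """0/25/50/75/100% quantiles with linear interpolation,
    matching R's default quantile() type."""
    s = sorted(x)
    out = []
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        h = (len(s) - 1) * p
        lo = int(h)
        hi = min(lo + 1, len(s) - 1)
        out.append(s[lo] + (h - lo) * (s[hi] - s[lo]))
    return out

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def pc_agreement(pc_values, group, pairs):
    """Concatenate the 5 quantiles of one PC within each side of every
    MN pair, then correlate the two quantile vectors."""
    q1, q2 = [], []
    for g1, g2 in pairs:
        q1 += quantiles([v for v, g in zip(pc_values, group) if g == g1])
        q2 += quantiles([v for v, g in zip(pc_values, group) if g == g2])
    return pearson(q1, q2)

# a PC where the paired groups differ only by a constant shift
pc = [1, 2, 3, 1.1, 2.1, 3.1]
group = ["D1_1", "D1_1", "D1_1", "D2_1", "D2_1", "D2_1"]
print(round(pc_agreement(pc, group, [("D1_1", "D2_1")]), 3))
```

A PC whose distribution merely shifts between paired clusters still correlates perfectly on the quantiles, so it would be kept; a PC that separates the pairs in an unrelated way would not.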
### BEER function cont
### writing results
RESULT=list()
RESULT$seurat=pbmc
RESULT$vp=VP
#RESULT$vpcor=VP_OUT$cor
RESULT$cor=OUT$cor
RESULT$pv=OUT$pv
RESULT$fdr=OUT$fdr
RESULT$lcor=OUT$lcor
RESULT$lpv=OUT$lpv
RESULT$lc1=OUT$lc1
RESULT$lc2=OUT$lc2
RESULT$lfdr=OUT$lfdr
################
RESULT$ROUND=ROUND
RESULT$COMBAT=COMBAT
RESULT$COMBAT.EXP=COMBAT.EXP
RESULT$RMG=RMG
RESULT$GNUM=GNUM
RESULT$GN=GN
RESULT$PCNUM=PCNUM
RESULT$SEED=SEED
RESULT$N=N
RESULT$APP='BEER'
###############
PCUSE=which( (rank(RESULT$cor)>=length(RESULT$cor)/2 | RESULT$cor>0.7 ) &
(rank(RESULT$lcor) >=length(RESULT$cor)/2 | RESULT$lcor>0.7) #&
#p.adjust(RESULT$lc1,method='fdr') >0.05
)
### so select PCs with cor > 0.7 or with a rank in the top half ###
RESULT$select=PCUSE
print('############################################################################')
print('BEER cheers !!! All main steps finished.')
print('############################################################################')
print(Sys.time())
return(RESULT)
}
In the final step of BEER, we can see that a PC is selected when its correlation exceeds 0.7 or its rank falls in the top half of all PCs, for both the Spearman and Pearson tests. A UMAP is then constructed using Seurat on the ComBat-corrected combined dataset again, based on the PCs selected by BEER.
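The selection rule can be sketched as (hypothetical helper; ties in the rank are ignored here, unlike R's rank()):

```python
def select_pcs(cors, lcors, thresh=0.7):
    """Keep PC i if its correlation is in the top half by rank OR above
    thresh, for both score vectors -- the shape of BEER's PCUSE line."""
    n = len(cors)
    def rank(xs):
        # 1-based ranks; ties ignored for this sketch
        order = sorted(range(n), key=lambda i: xs[i])
        r = [0] * n
        for pos, i in enumerate(order):
            r[i] = pos + 1
        return r
    rc, rl = rank(cors), rank(lcors)
    return [i for i in range(n)
            if (rc[i] >= n / 2 or cors[i] > thresh)
            and (rl[i] >= n / 2 or lcors[i] > thresh)]

cors  = [0.9, 0.8, 0.2, 0.1]   # toy Spearman scores per PC
lcors = [0.85, 0.75, 0.3, 0.1] # toy Pearson scores per PC
print(select_pcs(cors, lcors))  # [0, 1, 2]
```

Note that the third PC survives via the rank criterion despite its low absolute correlation: the rule always keeps roughly the top half of PCs, which is worth keeping in mind when the batch effect spreads across many PCs.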
Conclusion
So after digging into the code, I can see that:
1) BEER does use ComBat first, and then removes further batch effects by dropping PCs.
2) The cluster mapping is 1-to-1, and we have to accept that only a fraction of the cells belong to MN-paired clusters.
3) In this example the batch effect does not affect many PCs after ComBat correction, so the approach seems to work. The article does present a more drastic case in which ComBat alone did not work at all; it would be interesting to see which PCs were removed to make that merge successful.
4) Finally, there are no further instructions on DE analysis after merging. So I guess we will have to perform biomarker identification in each batch first, and then compare and contrast the biomarkers between batches for cells assigned to the same cluster in the merged dataset.