Optimal Transport for building trajectories.

This time, we are going to explore this new method to infer expression transition in cellular processes. In this paper (Schiebinger et al (2019)), the authors describe a new way to track cell trajectories from time-series scRNA-seq data using Waddington-OT.

The basic idea of the optimal transport principle is that cells move through the expression profile manifold in time; this process of movement is described as a time-varying probability distribution on gene expression space.

In that light, cells are most likely to move between the nearest probability distribution. In addition to that, the authors add in the possibility that cells undergone apoptosis during differentiation and are lost in the trajectory, preventing the artifact of inferring dead cells to other cell fates in the process. The estimate also incorporate the sampling noise and varying growth rate of the cells.

The development process is then defined as moving cells from state A to B in time through a transport map, which the path with optimal cost (measured as squared-Euclidean distance) of transport.

This method is very different from the RaceID and StemID methods and it will be interesting to find out its performance compared to StemID, which also infer developmental trajectory and identify gene regulatory networks contributing to particular cell fate captured by the trajectory.

RaceID + StemID

Briefly, StemID draws inter-cluster links projection based on 1) projects between cluster medoids and 2) nearest neighbor distances between two cells. In methods one, StemID calculates the significance of a link by assessing how many cells fall onto the link, the more cells that fit onto a particular link, the more likely the link is a true lineage. The vignettes of StemID is also very detailed and easy to follow.

WOT in practice

WOT has a very good documentation and tutorial. The tutorial can also be run interactive in Terra, that is really cool and save me from downloading all the big input dataset.

Lets have a look into the tutorial data and see what we will need for WOT as input. From tutorial 1, in addition to the gene expression matrix and the meta data specifying the number of days of differentiation when the cell was sampled, there are a few extra files:

1) Fle_corrds.txt - looks like a umap coordinates of all cells.

2) ExprMatrix.var.genes.h5ad - expression matrix of highly variable genes?

3) gene_sets.gmx - a predefine gene set file that contain gene sets relevant to a particular process. (I thought we are going to infer that from data…)

4) cell_sets.gmt - a user define sete file that group cells together.

So from the first glance, we need to process the raw data with another software to compute highly variable genes, a umap, a list of gene markers and a list of binned cells of different groups at least so as to use WOT.

After trying a few versions, it seems that python3.6 is the most compatible. I still encounter some issues though, as noted in the code blocks.

Visualise input data

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import wot

##load input data
FLE_COORDS_PATH ='data/fle_coords.txt'
FULL_DS_PATH = 'data/ExprMatrix.h5ad'
VAR_DS_PATH = 'data/ExprMatrix.var.genes.h5ad'
CELL_DAYS_PATH = 'data/cell_days.txt'
GENE_SETS_PATH = 'data/gene_sets.gmx'
GENE_SET_SCORES_PATH = 'data/gene_set_scores.csv'
CELL_SETS_PATH = 'data/cell_sets.gmt'

coord_df = pd.read_csv(FLE_COORDS_PATH, index_col='id', sep='\t')
days_df = pd.read_csv(CELL_DAYS_PATH, index_col='id', sep='\t')

# Read expression matrix, cell days, and 2-d coordinates
metadata=pd.concat([days_df,coord_df], axis=1)
adata = wot.io.read_dataset(FULL_DS_PATH, obs=metadata)
### annotated object with 236285 cells x 19089 genes
### not sure why all the metadata becomes Nan in adata.obs ###
adata.obs = metadata
unique_days = adata.obs['day'].unique()
unique_days = unique_days[np.isnan(unique_days) == False]
# plot visualization coordinates
figure = plt.figure(figsize=(10, 10))
plt.axis('off')
plt.tight_layout()
plt.scatter(adata.obs['x'], adata.obs['y'],c=adata.obs['day'],
               s=4, marker=',', edgecolors='none', alpha=0.8)
cb = plt.colorbar()
cb.ax.set_title('Day')
plt.show()

Next, the software scores the cells according to a list of provided gene markers of a signature process, (manually curated in the WOT paper, see section Creating gene signatures and cell sets in Methods).

Compute gene set scores

feature_id=adata.var.index.values.astype("str")
### in my case, the feature ID and the gene names extracted from the gmx file don't match, convert to str seems solves the problem ###
gs = wot.io.read_sets(GENE_SETS_PATH, feature_id)
#geneset_gmx=wot.io.read_gmx(GENE_SETS_PATH, adata.var.index.values)
# read in gene set path and see if they are HVG in the expression profiles).
# return 19089 gene present/absent to 32 signature #
### something is wrong as the gs read in are all Nan ###
gene_set_scores_df = pd.DataFrame(index=adata.obs.index)
for j in range(gs.shape[1]):
    gene_set_name = str(gs.var.index.values[j])
    result = wot.score_gene_sets(ds=adata, gs=gs[:, [j]], permutations=0, method='mean_z_score')
    gene_set_scores_df[gene_set_name] = result['score']
gene_set_scores_df.to_csv('data/gene_set_scores.csv', index_label='id')

Now based on the gene set scores, we would divide cells into different sets, then we can proceed to tutorial 2. As mentioned in the tutorial, transport map growth rate is estimated based on the cell signature score of proliferation and apoptosis markers, make sure we have these two gene sets if we are using our own data.

Compute growth rate for each cell

gene_set_scores = gene_set_scores_df
proliferation=gene_set_scores['Cell.cycle']
apoptosis = gene_set_scores['Apoptosis']

# apply logistic function to transform to birth rate and death rate
def logistic(x, L, k, x0=0):
    f = L / (1 + np.exp(-k * (x - x0)))
    return f

def gen_logistic(p, beta_max, beta_min, pmax, pmin, center, width):
    return beta_min + logistic(p, L=beta_max - beta_min, k=4 / width, x0=center)

def beta(p, beta_max=1.7, beta_min=0.3, pmax=1.0, pmin=-0.5, center=0.25):
    return gen_logistic(p, beta_max, beta_min, pmax, pmin, center, width=0.5)

def delta(a, delta_max=1.7, delta_min=0.3, amax=0.5, amin=-0.4, center=0.1):
    return gen_logistic(a, delta_max, delta_min, amax, amin, center, width=0.2)

birth = beta(proliferation)
death = delta(apoptosis)

# growth rate is given by 
gr = np.exp(birth-death)
growth_rates_df = pd.DataFrame(index=gene_set_scores.index, data={'cell_growth_rate':gr})
# compute growth rate for each cell.
# growth_rates_df.to_csv('data/growth_gs_init.txt')

Make transport map

SERUM_CELL_IDS_PATH = 'data/serum_cell_ids.txt'
metadata=pd.concat([adata.obs, growth_rates_df], axis=1)
# load data
adata = wot.io.read_dataset("data/ExprMatrix.var.genes.h5ad", obs=None)
adata.shape
# 175472 x 1479
### we apply a filter to select cells from the serum time course
serum_cells=pd.read_csv(SERUM_CELL_IDS_PATH, header=None, index_col=0)
p=serum_cells.index.tolist()
adata.obs=metadata.loc[p]
# create OTModel
ot_model = wot.ot.OTModel(adata,epsilon = 0.05, lambda1 = 1,lambda2 = 50) 
# Compute a single transport map from day 7 to 7.5
tmap_annotated = ot_model.compute_transport_map(7,7.5)

Here we have an annotated transport map between timepoint 7 and 7.5, with a estimated of grwoth rate at g0 and g1 timepoiny. In the ot_model, $\lambda$ ₁, $\lambda$ ₂ and $\epsilon$ all require fine tuning by iterating through a range of values.

And tutorial 3 will proceed to model long-range couplings to model expression changes over a longer time period. In tutorial 4, we will learn to build a trajectory and infer “ancestors” and “descendants” at a given time point.

User comments

Overall, the WOT software is not as straightforward to use as I expected. It runs on python 3.5 to 3.6 and I did have package compatibility problems with pandas and anndata.

It required very well integrated multi-time point datasets as input, that is not that easily obtained; and it required us to have a predefined set of gene modules for its estimations on growth rate and cell fate.

Then, users have to define cells sets via clustering (i.e. nearest neighbour) and gene signature scores, probably the hardest part of the project.

All in all, we probably cannot use WOT unless we have a very well integrated dataset with minimal technical noises and enough knowledges on defining cell sets and gene modules. Also, WOT may not be applicable to data with poor separation of less obvious phenotypic difference (i.e. data in single cell crispr screen).