PHATE - better capture cell-cell similarity across manifold

Upon constructing more realistic trajectory.

With the increasing complexity of single-cell experinments, researchers now have to handle data with expanding number of replicates, cell-types, and transitional states.

Simply using pca + t-sne for cluster definition and visualisation fail to capture the correct global structure of the cell population, thus incorrect influence of differentiation trajectories and transition cell state.

Last year, t-sne was replaced by UMAP, which corrected the between-cluster distance. But UMAP still lacks a sense of directions to reflect the cell-state transition in the data.

Today we look into this new tool, PHATE, that accommodate global structure and infer trajectories simultaneously, which produces precise clustering along a developmental axis.

The PHATE pipeline

PHATE begins with a filtered and normalized count matrix. In the tutorial, it only kept cells with total RNA counts higher than the bottom 25% and lower than the top 25%. It further removes cells with extremely high mitochondria counts (90%) and genes that expressed in fewer than 10 cells. Next it normalised the matrix by square root rather than the typical log1p().

So as usual, let started with the python tutorial of PHATE.

import pandas as pd
import numpy as np
import phate
import scprep
import os
download_path ="/home/achu/blog_posts/2020_06_PHATE"

scprep.io.download.download_and_extract_zip("https://data.mendeley.com/datasets/v6n743h5ng/1/files/7489a88f-9ef6-4dff-a8f8-1381d046afe3/scRNAseq.zip?dl=1", download_path)


sparse=True
T1 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T0_1A"), sparse=sparse, gene_labels='both')
T2 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T2_3B"), sparse=sparse, gene_labels='both')
T3 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T4_5C"), sparse=sparse, gene_labels='both')
T4 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T6_7D"), sparse=sparse, gene_labels='both')
T5 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T8_9E"), sparse=sparse, gene_labels='both')
T1.head()
sparse=True
T1 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T0_1A"), sparse=sparse, gene_labels='both')
T2 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T2_3B"), sparse=sparse, gene_labels='both')
T3 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T4_5C"), sparse=sparse, gene_labels='both')
T4 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T6_7D"), sparse=sparse, gene_labels='both')
T5 = scprep.io.load_10X(os.path.join(download_path, "scRNAseq", "T8_9E"), sparse=sparse, gene_labels='both')
T1.head()

filtered_batches = []
for batch in [T1, T2, T3, T4, T5]:
    batch = scprep.filter.filter_library_size(batch, percentile=20, keep_cells='above')
    batch = scprep.filter.filter_library_size(batch, percentile=75, keep_cells='below')
    filtered_batches.append(batch)
del T1, T2, T3, T4, T5
# Merge all datasets and create a vector representing the time point
EBT_counts, sample_labels = scprep.utils.combine_batches(
    filtered_batches, 
    ["Day 00-03", "Day 06-09", "Day 12-15", "Day 18-21", "Day 24-27"],
    append_to_cell_names=True
)

EBT_counts = scprep.filter.filter_rare_genes(EBT_counts, min_cells=10)
EBT_counts = scprep.normalize.library_size_normalize(EBT_counts)
#remove dead cells
mito_genes = scprep.select.get_gene_set(EBT_counts, starts_with="MT-") # Get all mitochondrial genes. There are 14, FYI.
scprep.plot.plot_gene_set_expression(EBT_counts, genes=mito_genes, percentile=90)
EBT_counts, sample_labels = scprep.filter.filter_gene_set_expression(
    EBT_counts, sample_labels, genes=mito_genes, 
    percentile=90, keep_cells='below')
# transform
EBT_counts = scprep.transform.sqrt(EBT_counts)

Using euclidean distance matrix for cell clustering is along the same line with SC3; the new part that PHATE brought about is modeling cell-cell similarity from the distance matrix. PHATE then built a diffusion graph after obtaining the cell-cell similarity affinity matrix. The diffusion matrix , which report the transition probability from one point to another, was computed based a KNN parameters (k) and a decay parameter (alpha).

Finally PHATE convert the diffusion distance (determined from random walk parameter t) to potential distance (Shanon information distance). And visualisation was obtained from metric MDS.

Although it sounds complicated, the PHATE command is very simple in python.

phate_operator = phate.PHATE(n_jobs=1)
Y_phate = phate_operator.fit_transform(EBT_counts)
scprep.plot.scatter2d(Y_phate, c=sample_labels, figsize=(12,8), cmap="Spectral",
                      ticks=False, label_prefix="PHATE")

I have also tried the dataset (myeloid and erythroid cells in mouse bone marrow) they used in the R tutorial. So, by simply importing the csv as a pandas object, we can perform PHATE analysis directly on the dataset.

Also, I am happy to learn this scprep plotting package specialises at plotting beautiful scatterplots.

As mentioned above, one can fine-tune the PHATE model by changing t, knn, and decay. Here reducing the knn parameter potentially enhance local structure (reduce clustering cutoff). By changing t, one can smooth the noise (as we increase t, one allow more steps taken in the the diffusion process, thus more accurate modeling).

phate_operator.set_params(knn=4, decay=15, t=12)
Y_phate = phate_operator.fit_transform(man_counts)
### this step also did imputing gene counts ###

scprep.plot.scatter2d(Y_phate, c=man_counts.loc[: , "Mpo"], figsize=(12,8), cmap="Spectral",
                      ticks=False, label_prefix="PHATE", title="BMMSC-MPO")



scprep.plot.scatter2d(Y_phate, c=man_counts.loc[: , "Ifitm1"], figsize=(12,8), cmap="Spectral",
                      ticks=False, label_prefix="PHATE", title="BMMSC-Ifitm1")

In the myeloid dataset, changing these parameters increase the sparseness of the cell distribution but did not change the overall shape of the PHATE-plot.

Unlike R, the python version seems did not mention gene count imputation.

But plotting the PCA, t-SNE and PHATE along each other, one can see that PHATE has capture the overall differentiate trajectory as well as the local clustering structure of the population.

An interesting observation is that cells with a high expression of MPO (yellow ones), do looks more sparse (PCA) or as sparse as (t-SNE) those with low expression but were compressed to a thin trajectory on PHATE. I wonder if that means more tuning required or PHATE reflected the truth that PCA and t-SNE did not manage to capture.

Next post, I will go further to an add-on of PHATE, PhMED, that evaluate population composition changes upon drug treatment.