Improved mutant function prediction via PACT: Protein Analysis and Classifier Toolkit
·Using PACT to evaluate beneficial mutations.
This is a summary of my understandings on the Protein Analysis and Classifier Toolkit (PACT). It deals with DMS data and hold great promises by allowing users to add pdb files to aid the mutation effect evaluation. There is also a naive Bayes component to filter truly beneficial mutations using sequence and structural features.
What is special about PACT?
The biggest ravel of PACT is Enrich2. It takes in multiple replicates of time-series selection data on a protein variant library that observed the frequency changes over time. Using the changes over time results, Enrich2 fit a lemma model to define a mutant fitness as the slope of regression line, evaluate the Poisson variance over replicates and generate the fitness as Z-score compared to WT performance.
Expanded utility
1. FACs score
PACT, on the contrary, mainly compares the endpoint (be it a competition assay of FACs sorted filtering) to a reference (plasmid/starting point library). So, it can provide different fitness metrics based on the experimental setting - growth based fitness and FACs based fitness. As pointed out by Enrich2 WT non-linearity is a problem in this type of normalisation. Not sure if using only the 2 time points help to bypass this issue in PACT.
One interesting point of the formulas (in Note S12 of the PACT paper) is that the variant fitness seems indispensable to that of the WT. So I guess that means the fitness is always a measure about how the variant fares compared to the WT, regardless the WT itself is being positively or negatively selected.
After outlining the importance of the WT phenotype. PACT established the null baseline using synonymous WT mutations that assumed have the same protein phenotype. The WT counts is the used to for the Poisson distribution of the null.
2. PSSM
PACT can decompose per site mutations using mutual information of a pair of mutations. It is not obvious about variants baring more mutations but I assume PACT will lump them together based on the focal mutations.
That makes good sense though, as the more detailed characterization of variants the fewer reads we can work with, thus reducing the statistical power and further complicating the calculation.
3. PDB features
One of the important developments of PACT is that it can classify mutation effects with added information from the protein structure. The added in structural features are distance to active site, contact number, and PSSM, fraction burial of a residue, and the size and chemical change of mutation based on distance to active site, contact number and fraction burial). These features do play direct role in whether the mutation is deleterious, neutral or beneficial, especially the fraction burial (Fig S3 in the PACT paper). I think I would definitely look into these scripts on structural analyses; they are definitely useful for other projects if I would like to extract these features in other proteins.
4. Bayes classifier
After binning mutations to beneficial, deleterious, and neutral, they run a naive Bayes classifier to determine which features are important (Table S6). They found that natural homology frequency (mutation that are also found in other naturally occurring homologues) seems to be the dominant predictor of high performance.
Implementation issue
This very sound package is operated via multiple protocol (.ini files). The dataset it is based on is from the earlier TEM1-1- and LGK paper published by the same author in 2015 Klesmith et al. : Deep mutational scanning enzyme solubility. Therefore, the python scripts are accustomed to this set of deep sequencing dataset, that unfortunately seems not available publicly. In Note S1, there are some but insufficient specification on how the input fastq should look like in order for successful merging, etc. It is not obvious how to run the package, such as the order of which protocol we should run first, etc. And I thought the classification_analysis.ini is an appropriate intermediate protocol I can just run it to see if it can complete the analyses. I failed.
In sum, I think it is a nice package. BUT I CANNOT USE IT AT ALL!
I would probably recycle some of the scripts, especially those on protein analyses and homologue MSA to make a PSSM. It still serve a very good guideline on how to work with FACs based MAVE dataset and classification on mutational effects. For those that are interested in mutating protein with available structural, plenty of natural variants. Maybe this is the package you would like to go to for quantifying your variants.
The PACT paper also highlights a deeper but urgent issue, the availability of useful data, clear documentation and the universal format of analyses are all lacking; without these three user friendly characteristics the package is rendered useless.
If the package can accommodate variant count matrix on top of fastq, it can be extremely useful as I can directly plug in the matrix I download from MAVE DB to test the package.
On another hand, I see the protocol idea is to standardize workflow and version control. It is not compatible to other packages expect PACT and it makes using PACT more challenging. An alternatively is to migrate to snakemake to standardize the workflow.
Finally, it is the .PACT file that make the mix and match of PACT to other external packages also tremendously difficult. It would be much better if they are .h5 or .tsv. At least I can handle the conversion with less decoding.
I think this is also an important lesson for a potential developer so that the package we write can be adopted and applied to other project.