Working on B-cell repertoires.

During the long gap between my posts, I switched to a new project and landed a new job. So long to the single-cell field: I am now in protein engineering, trying to bring together machine learning and bioinformatics to guide the synthesis of novel therapeutic proteins.

Another shift is that I now rely heavily on published, publicly available data: we learn from naturally observed immune diversity or previously published synthetic constructs, and then use that information to create new things. More specifically, we are looking into building a high-throughput pipeline for synthesizing and identifying functional antibodies against novel antigens, to meet the need for new drugs and vaccines in the Covid era.

One natural source of functional antibodies is human B-cell repertoire data. In addition to Ig-seq data, 10x Genomics VDJ technology has sped up the accumulation of immune cell profiles. However, the major downside is that raw cDNA/DNA sequences are not ideal. As the ultimate goal is to compare and investigate immunoglobulin binding and structure, raw data from these technologies require extra steps of clean-up, annotation, and assignment to germline subgroups. Even though there are well-established pipelines such as MiXCR for assembling antibody sequences from Ig-seq reads, I hesitate to invest so many resources in re-annotating a vast amount of raw data just to extract a small number of antibody sequences in the end (more precisely, CDR3 sequences from memory B cells).

The OAS database

Therefore, discovering the OAS (Observed Antibody Space) database was a godsend. OAS is a fully curated database of immune repertoires; datasets are organized by the treatment and personal information of the subject. In my case, I am interested in B-cell profiles from healthy donors, to look for typical features of a “functional” mature antibody. Other options include data sorted by cell type, tissue, and antigen (vaccination). Another amazing feature is that, in each dataset, every antibody sequence is already annotated: its germline genes are identified and IMGT numbering is assigned to the sequence. That saves a huge amount of preprocessing work!

The data is stored in the .json format. I was not familiar with this format before, but I was instantly won over by how useful and convenient it is, as each line can be read directly into Python as a dictionary. So with two readline() calls, I can obtain the metadata and IMGT-numbered sequence data from the .json file!
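As a rough sketch of those two readline() calls: this assumes the file is gzipped, that its first line is a JSON object of metadata, and that each following line is one sequence record whose 'data' field is itself a JSON-encoded string (which is what the loader further down relies on). The function name peek_oas_file is made up for illustration.

```python
import gzip
import json


def peek_oas_file(src):
    """Return (metadata, first_record) from a gzipped, line-delimited OAS file.

    Assumed layout: line 1 is metadata, every later line is one sequence record.
    """
    with gzip.open(src, "rt", encoding="utf-8") as handle:
        # First readline(): the metadata describing the study and subject.
        metadata = json.loads(handle.readline())
        # Second readline(): the first sequence record, if the file has one.
        line = handle.readline()
        record = json.loads(line) if line.strip() else None
    # Assumption: the IMGT-numbered regions are stored as a JSON string under 'data'.
    if record is not None and isinstance(record.get("data"), str):
        record["data"] = json.loads(record["data"])
    return metadata, record
```

Printing the two returned dictionaries (for example with pprint) is a quick way to see which keys a given file actually provides before writing a full loader.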

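import gzip, json

def read_in_oas_json(src):
    """Group non-redundant CDR3s from an OAS .json.gz file by CDR3 length."""
    seqs = {}      # CDR3 length -> list of CDR3 amino-acid strings
    numbered = {}  # CDR3 length -> list of IMGT-numbered CDR-H3 dictionaries
    with gzip.open(src, 'rb') as handle:
        for line in handle:
            row = json.loads(line)
            # The metadata line has no 'data' field, so it is skipped here.
            if 'data' in row and row['redundancy'] == 1:
                # 'data' is a JSON string holding the IMGT-numbered regions.
                row['data'] = json.loads(row['data'])
                cdr3 = row['cdr3']
                cdrh3 = row['data']['cdrh3']
                length = len(cdr3)
                seqs.setdefault(length, []).append(cdr3)
                numbered.setdefault(length, []).append(cdrh3)
    return seqs, numbered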

The OAS data also specifies whether a sequence is redundant or not. Using this dataset streamlines my work so much that I can plot amino-acid residue frequencies within 5 minutes, processing 10k sequences for one experimental subject. The .json.gz files are also very small, so I can download and compare many samples with little time and disk space.
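For the residue frequencies, something along these lines is enough to get numbers ready for plotting. It is only a sketch: residue_frequencies is a hypothetical helper, it works on CDR3 strings of one length (such as one of the lists returned by read_in_oas_json), and it counts by plain string position rather than by IMGT position. The toy sequences are made up.

```python
from collections import Counter


def residue_frequencies(cdr3s):
    """Per-position amino-acid frequencies for equal-length CDR3 strings.

    Returns one {residue: fraction} dictionary per position.
    """
    if not cdr3s:
        return []
    length = len(cdr3s[0])
    if any(len(seq) != length for seq in cdr3s):
        raise ValueError("all sequences must have the same length")
    total = len(cdr3s)
    frequencies = []
    # zip(*cdr3s) yields one tuple of residues per sequence position.
    for column in zip(*cdr3s):
        counts = Counter(column)
        frequencies.append({aa: n / total for aa, n in counts.items()})
    return frequencies


# Toy input, not real repertoire data.
toy = ["ARDY", "ARGY", "AKDY", "ARDW"]
for position, freq in enumerate(residue_frequencies(toy), start=1):
    print(position, freq)
```

Grouping by length first keeps the columns aligned without needing a multiple alignment; using the IMGT-numbered dictionaries instead would let sequences of different lengths share positions.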

The next step is to find some immune profiles for cancers. I am reading about pVACtools, and I wonder whether there is any repertoire data out there. Hopefully, I will find something soon.