Saptarshi Chakraborty, Ph.D.
Assistant Professor
Department of Biostatistics
School of Public Health and Health Professions
State University of New York at Buffalo

Statistical inference on the cancer-site specificities of collective ultra-rare whole-genome somatic mutations is an open problem. Traditional statistical methods cannot handle whole-genome mutation data because of its ultra-high dimensionality and extreme sparsity: for example, >30 million unique variants are observed in the dataset of ~1700 whole-genome tumors considered herein, of which >99% are encountered only once. To harness the information in these rare variants, we propose a multilevel meta-feature regression model that extracts the critical information from the mutation contexts of rare variants in a way that also permits us to extract diagnostic information from any previously unobserved variants in a new tumor sample. Our framework further leverages topic models from the field of computational linguistics to induce an interpretable dimension reduction of the mutation contexts. The proposed model is implemented using an efficient MCMC algorithm that permits rigorous fully Bayesian inference at a scale orders of magnitude beyond the capability of out-of-the-box high-dimensional multi-class regression methods and software. We apply our model to the Pan-Cancer Analysis of Whole Genomes (PCAWG) dataset, and our results reveal interesting novel insights.

 
Sponsor(s)
Population Health: Biostatistics
Audience
VCU Faculty, VCU Staff, VCU Students, School of Medicine