Abhirup Datta, Ph.D.
Assistant Professor
Department of Biostatistics
Bloomberg School of Public Health
Johns Hopkins University
Baltimore, MD

Computer-coded verbal autopsy (CCVA) algorithms predict the cause of death of a deceased individual from high-dimensional family questionnaire data (verbal autopsy); these predictions are then aggregated to generate national and regional estimates of cause-specific mortality fractions. Such prevalence estimation from the predicted labels of a classifier is known as quantification learning. Quantification methods assume that the sensitivities and specificities of the classifier are either perfect or transportable from the training to the test population, an assumption that is inappropriate under dataset shift, when the misclassification rates differ between the training and test populations. We first present a parsimonious hierarchical Bayesian model to calibrate verbal-autopsy-based cause-specific mortality fractions, using a small labeled dataset from the test population to estimate the classifier’s misclassification rates. This model works only for single-class (categorical) predictions and assumes perfect knowledge of the true labels on a small subset of the test population. To extend it to probabilistic classifiers (compositional predicted labels) and to accommodate uncertainty in the true cause-of-death diagnosis (compositional true labels), we define the notion of misclassification for compositional data. Using this, we propose generalized Bayes quantification learning (GBQL), which uses the entire vector of compositional predictions from probabilistic classifiers and allows for uncertainty in the true class labels. We use model-free Bayesian estimating equations for these compositional data that allow 0’s and 1’s in the compositions and are based only on a first-moment assumption. This will be of independent importance in Bayesian compositional data analysis. Our method yields existing quantification approaches as special cases. An extension to an ensemble GBQL using predictions from multiple classifiers is discussed. We outline a fast and efficient Gibbs sampler for GBQL.
We establish asymptotic and finite-sample guarantees for the method. The empirical performance of GBQL is demonstrated through simulations. GBQL is used to improve estimation of cause-specific mortality rates from verbal autopsies in datasets with evident dataset shift.
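
The calibration idea for categorical predictions can be illustrated with a small sketch. It is not the hierarchical Bayesian model or GBQL described in the abstract; it is a simplified, non-Bayesian stand-in that assumes hard (single-class) predictions and exact true labels on the labeled subset. The function names `estimate_misclassification` and `calibrate_prevalence`, and the toy numbers, are hypothetical. The sketch estimates the misclassification matrix M (rows are true classes, columns are predicted classes) from a small labeled set, then recovers the class prevalences p from the observed distribution of predicted labels q, using the relation q = Mᵀp and an EM-style update that keeps p on the probability simplex.

```python
def estimate_misclassification(true_labels, pred_labels, n_classes):
    """Row-normalised confusion matrix: M[i][j] estimates P(predicted j | true i)."""
    counts = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, pred_labels):
        counts[t][p] += 1.0
    M = []
    for row in counts:
        total = sum(row)
        # Assumption for this sketch: fall back to a uniform row when a class
        # does not appear in the labeled subset.
        M.append([c / total for c in row] if total > 0
                 else [1.0 / n_classes] * n_classes)
    return M


def calibrate_prevalence(q, M, n_iter=500):
    """EM-style fixed-point iteration for p in q = M^T p.

    q: observed proportions of predicted labels in the test population.
    M: misclassification matrix, M[i][j] = P(predicted j | true i).
    Returns calibrated prevalences that are non-negative and sum to 1.
    """
    k = len(M)
    p = [1.0 / k] * k  # start from uniform prevalences
    for _ in range(n_iter):
        # mix[j]: predicted-label distribution implied by the current p
        mix = [sum(p[i] * M[i][j] for i in range(k)) for j in range(k)]
        p = [
            p[i] * sum(q[j] * M[i][j] / mix[j] for j in range(k) if mix[j] > 0)
            for i in range(k)
        ]
    return p


# Toy example (made-up numbers): the classifier mislabels 20% of class 0
# and 30% of class 1, and 45% of test cases are predicted as class 0.
M = [[0.8, 0.2], [0.3, 0.7]]
q = [0.45, 0.55]
p_hat = calibrate_prevalence(q, M)  # approximately [0.3, 0.7]
```

In this toy case the uncalibrated share of class 0 (0.45) overstates the calibrated prevalence (about 0.30), which is the kind of bias that arises when misclassification rates are ignored.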

Medicine: Biostatistics
School of Medicine, VCU Faculty, VCU Staff, VCU Students