ABSTRACT: With the influx of high-dimensional data there is an immediate need for statistical methods that are able to handle situations when the number of predictors greatly exceeds the number of samples. One area of growth is in examining how environmental exposures to toxins impact the body long term. The cytokinesis-block micronucleus assay can measure the genotoxic effect of exposure as a count outcome. To investigate potential biomarkers, high-throughput assays that assess gene expression and methylation have been developed. It is of interest to identify biomarkers or molecular features that are associated with elevated micronuclei (MN) or nuclear bud (Nbud) frequency, measures of exposure to environmental toxins.

Given our desire to model a count outcome (MN and Nbud frequency) using high-throughput genomic features as predictors, novel methods that can handle over-parameterized models need development. Overdispersion, when the variance of the count outcome is larger than the mean of the count outcome, is frequently observed with count response data. For situations where overdispersion is present, the negative binomial distribution is most appropriate. Furthermore, we expand the method to the longitudinal Poisson and longitudinal negative binomial settings for modeling a longitudinal or clustered outcome both when there are equidispersion and overdispersion. The method we have chosen to expand is the Generalized Monotone Incremental Forward Stagewise (GMIFS) method. We extend the GMIFS to the negative binomial distribution so it may be used to analyze a count outcome when both a high-dimensional predictor space and overdispersion are present. Our methods were compared to glmpath. We also extend the GMIFS to the longitudinal Poisson and longitudinal negative binomial distribution for analyzing a longitudinal outcome. Our methods were compared to glmmLasso and GLMMLasso. The developed methods were used to analyze two datasets, one from the Norwegian Mother and Child Cohort study and one from the breast cancer epigenomic study conducted by researchers at Virginia Commonwealth University. In both studies, a count outcome measured exposure to potential genotoxins and either gene expression or high-throughput methylation data formed a high-dimensional predictor space. Further, the breast cancer study was longitudinal such that outcomes and high-dimensional genomic features were collected at multiple times points during the study for each patient. Our goal is to identify biomarkers that are associated with elevated MN or NBud frequency. From the development of these methods, we hope to make available more comprehensive statistical models for analyzing count outcomes with high-dimensional predictor spaces and either cross-sectional or longitudinal study designs.

Rebecca R. Lehman, Ph.D. Candidate, Public Portion of the Final Defense, VCU
All ( Open to the public )