Mixture Models for Predicting Chemical Outcomes Using Structural Features in the Context of High-throughput Toxicity Testing
Kelly Moran, Duke University
Today there are more than 80,000 chemicals registered for use, with fewer than 600 subject to long-term in vivo studies conducted by the National Toxicology Program. It's of interest to understand how the slurry of chemicals human beings are exposed to every day affects health outcomes. While in vivo studies are expensive and time-consuming, high-throughput toxicity screening programs allow for the relatively cheap and fast collection of dose-response information. In cases when neither in vivo nor in vitro information is available (e.g., when a chemical is newly developed), in silico methods provide a way to estimate the chemical's toxicity. While there is a large literature on estimating chemical toxicity using structure-activity relationship (SAR) information, the focus has primarily been on scalar responses; only two approaches have addressed the problem of estimating dose-response curves. We will focus on the Bayesian additive adaptive basis tensor product (BAATP) model (Wheeler 2018). The BAATP model is designed to be effective for high-dimensional surfaces sampled with error. One can think of dose-response curves as cross sections of a surface defined by chemical structure. The model gains speed by reducing dimensionality relative to a pure Gaussian process approach, and it is highly flexible because the tensor product bases are learned from the data. However, a high number of inactive chemicals may bias the functions relating to dose toward flatter shapes and hurt the resulting performance of the model for active chemicals (i.e., the chemicals that are likely of the most interest). The simulation study included in the original paper simulates sets of dose-response curves that are mostly active; only around 13 percent of the chemicals in each simulated data set were inactive. In most ToxCast assays, by contrast, the majority of chemicals are inactive.
We introduce a mixture model in which the probability of activity depends on chemical structure; inactive curves are assumed to be drawn from a white noise process and active curves are modeled using the BAATP. In this formulation, only curves deemed active in a given iteration of the sampler are included when sampling the parameters of the BAATP. We perform a study with simulated assays of "high," "moderate," and "low" activity. We also include a real data application using the U.S. EPA's ToxCast high-throughput toxicity testing platform. Finally, we suggest improvements to the structural dimension reduction technique used in the data example from the original paper to account for possible nonlinear structure in the feature space. Coverage and performance comparisons are made to the original BAATP model.
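As a rough illustration of the activity-indicator step in such a sampler, the sketch below computes, for each chemical, the posterior probability that its curve is active and then draws the indicators. It is a minimal sketch under stated assumptions, not the model described above: the active component is represented by a supplied mean curve (`active_mean`) standing in for the BAATP fit, the noise is Gaussian with a known common standard deviation `sigma`, and `prior_active` stands in for a structure-dependent prior probability of activity. All function and variable names are hypothetical.

```python
import numpy as np

def activity_posterior(y, active_mean, prior_active, sigma):
    """Posterior probability that each dose-response curve is active.

    y            : (n_chem, n_dose) observed responses
    active_mean  : (n_chem, n_dose) mean curve under the active component
                   (assumed stand-in for the BAATP fit)
    prior_active : (n_chem,) prior probability of activity, assumed to
                   come from a model of chemical structure
    sigma        : assumed known noise standard deviation
    """
    # Inactive component: zero-mean white noise. Active component: noise
    # around the active mean. Shared Gaussian constants cancel in the ratio.
    ll_inactive = -0.5 * np.sum(y ** 2, axis=1) / sigma ** 2
    ll_active = -0.5 * np.sum((y - active_mean) ** 2, axis=1) / sigma ** 2
    log_odds = (np.log(prior_active) - np.log1p(-prior_active)
                + ll_active - ll_inactive)
    return 1.0 / (1.0 + np.exp(-log_odds))

def sample_activity(y, active_mean, prior_active, sigma, rng):
    """Draw one set of activity indicators (True means active)."""
    p = activity_posterior(y, active_mean, prior_active, sigma)
    return rng.random(p.shape) < p
```

Within one iteration, only the rows flagged `True` would then be passed to the update of the active-curve parameters, mirroring the idea that inactive curves do not inform the dose functions.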
Abstract Author(s): Kelly Moran