Xiang Zhan
Peking University
Xiang Zhan is an Associate Professor in the Department of Biostatistics and the Beijing International Center for Mathematical Research at Peking University. He obtained his BS degree from Peking University in 2010 and his PhD degree from Penn State in 2015. Before joining Peking University, Xiang worked at Penn State as an Assistant Professor of Biostatistics. His research interests include biostatistics, high-dimensional statistics, compositional data analysis, kernel methods, and next-generation sequencing data analysis.
Abstract
Compositional data arise in many disciplines of modern data science (e.g., sequence count data in biological and biomedical research). Unfortunately, traditional statistical methods that do not address compositionality can lead to suboptimal or even misleading results.
In this talk, we first discuss measurement error issues in compositional data. Covariate measurement errors pose a major challenge for existing errors-in-variables regression methods, because a measurement error in one component affects the other components of the composition. To simultaneously address the compositional nature of, and the measurement errors in, high-dimensional compositional covariates, we propose a new method named ERror-In-Composition (Eric) Lasso for regression analysis with corrupted compositional predictors. We establish estimation error bounds for Eric Lasso as well as its asymptotic sign-consistent selection properties.
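To make the point about error propagation concrete, here is a minimal sketch in plain Python. It is not Eric Lasso; it only illustrates, with made-up counts and a hypothetical helper named closure, that mismeasuring a single raw count changes every proportion once the counts are normalized to sum to one.

```python
def closure(counts):
    """Normalize nonnegative counts to proportions that sum to one."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical example data: only the first count is mismeasured.
true_counts = [50.0, 30.0, 20.0]
observed_counts = [80.0, 30.0, 20.0]

true_comp = closure(true_counts)
observed_comp = closure(observed_counts)

# Change in each proportion caused by the single mismeasured count.
shift = [o - t for o, t in zip(observed_comp, true_comp)]
print(shift)
```

Although the second and third counts are recorded exactly, their proportions shrink because the total has grown, so the error cannot be treated component by component.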
The second part of this talk is about composition-on-composition regression. When both responses and predictors are compositional, the inventory of statistical analysis tools is surprisingly limited. To fill this gap, we propose a high-dimensional Composition-On-Composition (COC) regression analysis, which does not require log-ratio transformations and hence can handle the excess zeros in sequence count data. We first introduce a penalized estimating equation approach in COC to improve estimation accuracy in high-dimensional settings, and then establish inference procedures to quantify uncertainty in COC model estimation and prediction. The proposed methods are evaluated using both numerical simulations and real data applications to demonstrate their validity and superiority.
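The remark about zeros can be illustrated with a short sketch, again in plain Python. It uses the centered log-ratio (CLR) transformation as one common example of a log-ratio transformation; the function names and the data are hypothetical, and the code does not implement COC regression. It only shows why a single zero component makes a log-ratio transformation undefined.

```python
import math

def clr(composition):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(p) for p in composition]  # math.log(0) raises ValueError
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def clr_is_defined(composition):
    """Return True if the CLR transform can be computed for this composition."""
    try:
        clr(composition)
        return True
    except ValueError:
        return False

# Hypothetical compositions: the second contains an exact zero.
print(clr_is_defined([0.5, 0.3, 0.2]))
print(clr_is_defined([0.7, 0.3, 0.0]))
```

In practice, log-ratio pipelines usually work around this by adding a pseudocount or imputing zeros before transforming, which is an extra modeling choice that a method free of log-ratio transformations does not need.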