Speaker:
Kathryn Roeder is the UPMC Professor of Statistics and Life Sciences in the Departments of Statistics & Data Science and Computational Biology. In 1997 she received the COPSS Presidents' Award for the outstanding statistician under age 40. In 2020 she was awarded the COPSS Distinguished Achievement Award and Lectureship. In 2019 she was inducted into the National Academy of Sciences. Her research group develops statistical tools applied to genetic and genomic data to understand the workings of the human brain, and the interplay with genetic variation.
Lecture 1:
How data science and machine learning interpret genomic data and contribute to personalized medicine
Abstract: High-throughput genomics yields vast amounts of data for personalized medicine and other health-related discoveries. For instance, genome-wide association studies (GWAS), which involve tens of thousands to millions of subjects, have linked thousands of genetic changes or variants with human diseases. Accumulating these variants across a subject's entire genome can help predict their risk for various diseases, and these findings have already contributed in some instances to improved clinical treatment. However, even with the vast amount of information available, predictive power is typically weak using standard analytical techniques. Breakthroughs in the near future are anticipated using machine learning and AI techniques. On another front, CRISPR, a genetic engineering marvel, promises breathtaking potential for treatments of cancer and other genetic defects. To realize these benefits, careful study of immense amounts of data will be required. Data science and machine learning must become an integral part of genomics to fully realize the potential of CRISPR, GWAS and other genomic studies in the coming decade.
Lecture 2:
Tackling genomic testing in the presence of unmeasured confounding and missing data
Abstract: When aiming to identify differential genomic outcomes such as gene expression or protein abundance, thousands of simultaneous hypothesis tests are routinely performed. These tests can be biased by the presence of unmeasured confounders and missing data. Recent advances in scRNA-Seq and CRISPR technologies have allowed for the study of case vs. control and the characterization of experimental perturbations at single-cell resolution, further exacerbating these challenges. We develop a large-scale hypothesis testing solution for multivariate generalized linear models in the presence of confounding effects. Next, realizing that a number of advantages can be accrued by taking a causal inference approach, we expand this solution by exploring doubly robust and proximal inference options as well.
As genomic studies progress from studying transcriptomic to proteomic readouts, new challenges have arisen, most notably large numbers of missing values. A common strategy to address this issue is to rely on an imputed dataset, which often introduces systematic bias into downstream analyses. By contrast, we develop a statistical framework inspired by doubly robust estimators that offers valid and efficient inference for proteomic data. Our framework relies on powerful machine learning tools, such as variational autoencoders, to augment the imputation quality with high-dimensional peptide data.