
Towards modern datasets: laying mathematical foundations to streamline machine learning

Source: 12-16

Time: Tues., 16:00-17:00, Dec. 17, 2024

Venue: C548, Shuangqing Complex Building A

Speaker: Chen Cheng

Statistical Seminar


Organizer:

吴宇楠


Speaker:

Chen Cheng

Department of Statistics, Stanford University

Time:

Tues., 16:00-17:00, Dec. 17, 2024

Venue:

C548, Shuangqing Complex Building A

Online:

Zoom Meeting ID: 271 534 5558

Passcode: YMSC

Title:

Towards modern datasets: laying mathematical foundations to streamline machine learning

Abstract:

Datasets are central to the development of statistical learning theory and to the evolution of models. The burgeoning success of modern machine learning on sophisticated tasks crucially relies on the growth of massive datasets (cf. Donoho), such as ImageNet, SuperGLUE, and LAION-5B. However, this evolution breaks standard statistical learning assumptions and tools.

In this talk, I will present two stories that tackle challenges posed by modern datasets, leveraging statistical theory to shed light on how we should streamline modern machine learning.

In the first part, we study multi-labeling, a curious aspect of modern human-labeled datasets that is often missing from the statistical machine learning literature. We develop a stylized theoretical model to capture uncertainties in the labeling process, allowing us to understand the contrasts, limitations, and possible improvements of using aggregated versus non-aggregated data in a statistical learning pipeline.

In the second part, I will present theoretical tools that go beyond what the classical literature readily provides, such as random matrix theory in the proportional regime, where the number of covariates grows in proportion to the sample size. Tools for the proportional regime are crucial for understanding "benign overfitting" and "memorization", yet this is not always the most natural setting in statistics, where columns correspond to covariates and rows to samples. With the objective of moving beyond proportional asymptotics, we revisit ridge regression (ℓ2-penalized least squares) on i.i.d. data X ∈ R^{n×d}, y ∈ R^n. We allow the feature vector to be infinite-dimensional (d = ∞), in which case it belongs to a separable Hilbert space.
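For readers unfamiliar with the estimator the abstract revisits: in the finite-dimensional case (d < ∞), ridge regression has the closed form b̂ = (XᵀX + λI)⁻¹Xᵀy, where λ > 0 is the penalty strength. A minimal NumPy sketch (the function name `ridge` and the simulated data are illustrative, not from the talk):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimator: argmin_b ||y - X b||^2 + lam * ||b||^2."""
    n, d = X.shape
    # Solve the regularized normal equations (X'X + lam I) b = X'y.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Simulated i.i.d. data: n samples (rows), d covariates (columns).
rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d)
y = X @ beta + 0.1 * rng.standard_normal(n)

b_small = ridge(X, y, 1e-6)   # near-OLS: tiny penalty
b_large = ridge(X, y, 1e3)    # heavy penalty shrinks the estimate
assert np.linalg.norm(b_large) < np.linalg.norm(b_small)
assert np.linalg.norm(b_small - beta) < 1.0
```

The proportional regime mentioned above corresponds to letting d grow with n (d/n → constant), where the eigenvalue behavior of XᵀX/n departs from the classical fixed-d asymptotics.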
