Towards modern datasets: laying mathematical foundations to streamline machine learning

Statistical Seminar


Organizer:

Yunan Wu


Speaker:

Chen Cheng

Department of Statistics, Stanford University

Time:

Tues., 16:00-17:00, Dec. 17, 2024

Venue:

C548, Shuangqing Complex Building A

Online:

Zoom Meeting ID: 271 534 5558

Passcode: YMSC

Title:

Towards modern datasets: laying mathematical foundations to streamline machine learning

Abstract:

Datasets are central to the development of statistical learning theory and the evolution of models. The burgeoning success of modern machine learning on sophisticated tasks relies crucially on the vast growth of massive datasets (cf. Donoho), such as ImageNet, SuperGLUE, and LAION-5B. However, this evolution breaks standard statistical learning assumptions and tools.

In this talk, I will present two stories that tackle challenges posed by modern datasets, leveraging statistical theory to shed light on how we should streamline modern machine learning.

In the first part, we study multilabeling, a curious aspect of modern human-labeled datasets that is often missing from the statistical machine learning literature. We develop a stylized theoretical model that captures uncertainties in the labeling process, allowing us to understand the contrasts, limitations, and possible improvements of using aggregated versus non-aggregated data in a statistical learning pipeline.

In the second part, I will present novel theoretical tools that go beyond what can simply be borrowed from the classical literature, such as random matrix theory in the proportional regime. Tools for the proportional regime have been crucially helpful in understanding “benign overfitting” and “memorization”, yet this regime is not always the most natural setting in statistics, where the columns of the data matrix correspond to covariates and the rows to samples. With the objective of moving beyond proportional asymptotics, we revisit ridge regression (ℓ2-penalized least squares) on i.i.d. data X ∈ ℝ^{n×d}, y ∈ ℝ^n. We allow the feature vectors to be infinite-dimensional (d = ∞), in which case they belong to a separable Hilbert space.
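
To make the aggregated-versus-non-aggregated distinction concrete, here is a minimal sketch, not the speaker's stylized model: it assumes simulated annotators whose labels are flipped with a fixed probability, and compares fitting a logistic model to majority-vote (aggregated) labels against fitting it to the raw annotator vote frequencies (non-aggregated soft labels). All names and parameters below are illustrative assumptions.

```python
# Hypothetical simulation (not from the talk): aggregated majority-vote
# labels vs. non-aggregated soft labels from multiple noisy annotators.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_annotators = 500, 5, 7  # illustrative sizes

# Ground truth: logistic model p(y=1|x) = sigmoid(x @ w_true)
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
p = 1.0 / (1.0 + np.exp(-X @ w_true))

# Each annotator draws a label from p, then flips it with probability 0.2,
# modeling uncertainty in the human labeling process.
flip = 0.2
votes = (rng.random((n, n_annotators)) < p[:, None]).astype(float)
noisy = rng.random((n, n_annotators)) < flip
votes = np.where(noisy, 1.0 - votes, votes)

y_majority = (votes.mean(axis=1) > 0.5).astype(float)  # aggregated (hard) labels
y_soft = votes.mean(axis=1)                            # non-aggregated (soft) labels

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Gradient descent on the cross-entropy loss; y may be soft in [0, 1]."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        preds = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (preds - y) / len(y)
    return w

for name, y in [("majority-vote", y_majority), ("soft-label", y_soft)]:
    w_hat = fit_logistic(X, y)
    err = np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true)
    print(f"{name}: relative estimation error {err:.3f}")
```

Soft labels retain the annotators' disagreement as an estimate of the underlying label probability, which is exactly the information that hard aggregation discards.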
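For the second part, the following sketch shows why ridge regression remains well-defined as d grows or becomes infinite: the primal solution (X^T X + λI_d)^{-1} X^T y, a d×d solve, coincides with the dual form X^T (X X^T + λI_n)^{-1} y, an n×n solve that only touches inner products between samples and therefore extends to feature vectors in a separable Hilbert space. The dimensions and data below are illustrative assumptions, not the talk's analysis.

```python
# Ridge regression (l2-penalized least squares) in primal and dual form,
# in an overparameterized regime d >> n; illustrative sketch only.
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 100, 2000, 1.0

X = rng.normal(size=(n, d)) / np.sqrt(d)
beta_true = rng.normal(size=d)
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Primal: beta = (X^T X + lam * I_d)^{-1} X^T y  -- a d x d linear solve.
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual: beta = X^T (X X^T + lam * I_n)^{-1} y  -- an n x n linear solve.
# Only the Gram matrix X X^T of pairwise inner products appears, so this
# formulation still makes sense when d = infinity (features in a Hilbert space).
beta_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

print("primal and dual solutions agree:",
      np.allclose(beta_primal, beta_dual, atol=1e-6))
```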
