
Topology of large language models data representations

Posted: 12-03


Speaker: Serguei Barannikov (BIMSA, IMJ-PRG)

Time: 14:00 - 16:00, 2024-12-05

Venue: A3-1-301

ZOOM: 230 432 7880

PW: BIMSA

Organizers: Mingming Sun, Yaqing Wang

Abstract

The rapid advancement of large language models (LLMs) has made distinguishing between human-written and AI-generated text increasingly challenging. The talk examines the topological structure of LLM data representations, focusing on its application to artificial text detection. We explore two primary methodologies: 1) Intrinsic dimensionality estimation: for alphabet-based languages, human-written texts exhibit an average intrinsic dimension of around 9 in RoBERTa representations, whereas AI-generated texts display values approximately 1.5 units lower. This gap has enabled robust detectors that generalize across domains and generation models. 2) Topological data analysis (TDA) of attention maps: by extracting interpretable topological features from the attention maps of transformer models, we capture structural nuances of texts. Similarly, TDA applied to speech attention maps and to embeddings from models such as HuBERT improves classification performance on several tasks.
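To make the first methodology concrete, here is a minimal sketch of one standard intrinsic-dimension estimator, the TwoNN method of Facco et al., which works from the ratio of each point's two nearest-neighbour distances. This is an illustration of the general technique applied to an embedding point cloud, not the exact estimator or preprocessing pipeline used in the NeurIPS 2023 paper; the function name is ours.

```python
import numpy as np

def two_nn_id(points: np.ndarray) -> float:
    """TwoNN intrinsic-dimension estimate: d ~ N / sum_i log(r2_i / r1_i),
    where r1_i, r2_i are the distances from point i to its first and
    second nearest neighbours."""
    sq = np.sum(points ** 2, axis=1)
    # Squared pairwise distances via the |x|^2 + |y|^2 - 2<x,y> identity.
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    d2 = np.maximum(d2, 0.0)          # guard against negative round-off
    np.fill_diagonal(d2, np.inf)      # exclude self-distances
    nearest = np.sort(d2, axis=1)[:, :2]
    r1 = np.sqrt(nearest[:, 0])
    r2 = np.sqrt(nearest[:, 1])
    return len(points) / np.sum(np.log(r2 / r1))

rng = np.random.default_rng(0)
# Sanity check on synthetic data of known dimension: uniform samples
# from a 5-dimensional cube should yield an estimate near 5.
estimate = two_nn_id(rng.random((2000, 5)))
```

Applied per text to token embeddings from a model such as RoBERTa, such an estimator produces the scalar dimension whose systematic gap between human and machine text a detector can threshold.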

These topological approaches provide a mathematical methodology for studying the geometric and structural properties of LLM data representations and their role in detecting AI-generated texts. The talk is based on the following works, carried out in collaboration with my PhD students E. Tulchinsky and K. Kuznetsov, and other colleagues:

Intrinsic Dimension Estimation for Robust Detection of AI-Generated Texts, NeurIPS 2023;

Topological Data Analysis for Speech Processing, InterSpeech 2023;

Artificial Text Detection via Examining the Topology of Attention Maps, EMNLP 2021.

Speaker Intro

Prof. Serguei Barannikov earned his Ph.D. from UC Berkeley and has made contributions to algebraic topology, algebraic geometry, mathematical physics, and machine learning. In work predating his Ph.D., he introduced the canonical forms of filtered complexes now known as persistence barcodes, which have become fundamental in topological data analysis. More recently, he has applied topological methods to machine learning, particularly to the study of large language models, with results published in leading ML conferences such as NeurIPS, ICML, and ICLR, effectively bridging pure mathematics and advanced AI research.
