Statistical Frontiers in Foundation Models and LLMs
STAT 992, UW–Madison, Department of Statistics, 2025
In this graduate seminar we will explore statistical frontiers in foundation models and large language models (LLMs).
Course Description
Statistical Frontiers in Foundation Models and LLMs explores the emerging intersection between modern statistical thinking and the capabilities of large-scale foundation models such as large language models (LLMs), vision-language models (VLMs), and diffusion models. While foundation models have achieved remarkable empirical success, they raise foundational questions about uncertainty, reliability, generalization, and inference, which are core concerns of the statistical sciences. This course examines how statistical tools and perspectives can help us rigorously evaluate, understand, and extend the capabilities of these models, and how, in turn, foundation models may offer new tools for statisticians.
We begin by introducing foundation models from a statistical viewpoint, emphasizing why concepts such as calibration, entropy, and generalization remain central in the age of large-scale deep learning. Students will critically examine techniques for evaluating model reliability, including calibration accuracy, posterior consistency, and other statistical quantities. The seminar then turns to modern approaches for uncertainty quantification in generative models and LLMs, such as conformal prediction and deep ensembles, highlighting both the theoretical underpinnings and practical challenges of deploying these systems in safety-critical domains. Finally, we explore how foundation models can be used to perform statistical tasks themselves, such as assisting in Bayesian inference, simulation-based inference, and prediction-powered inference, marking a shift from models as objects of analysis to models as computational subroutines in statistical workflows.
Throughout the course, students will engage deeply with contemporary research papers at the frontier of statistics and machine learning, with an emphasis on developing the tools to evaluate and innovate in this rapidly evolving landscape.
Topics
Topic 1: What are Foundation Models and Why Should Statisticians Care?
Title: Should Statisticians Care About Foundation Models?
We set the stage for the seminar by defining what foundation models are (e.g., LLMs, VLMs, diffusion models), how they differ from traditional statistical models, and why their empirical power raises new theoretical and practical questions. We also discuss the role of statistical thinking in evaluating and repurposing these models for scientific inference, decision-making, and societal impact.
Potential Readings:
- An Overview of Large Language Models for Statisticians
- Do Large Language Models (Really) Need Statistical Foundations?
- Position: Bayesian Deep Learning is Needed in the Age of Large-Scale AI
Topic 2: Statistical Evaluation of Foundation Models
Title: Measuring What Matters: Statistical Lenses on Model Behavior
Foundation models are often evaluated in terms of benchmarks or task accuracy—but what does it mean for them to be statistically reliable or trustworthy? This module explores evaluation through the lens of calibration, entropy, consistency, and statistical diagnostics. We treat models as black boxes and use classical statistical tools to probe the quality of their predictions and generations.
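To make the black-box evaluation idea concrete, here is a minimal sketch of a binned expected calibration error (ECE) computed from model-reported confidences and observed correctness. The function name, the bin count, and the toy numbers are placeholders for illustration, not taken from any paper on the reading list.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: average |accuracy - confidence| gap, weighted by bin mass.

    confidences: model-reported probability of the predicted answer, in [0, 1]
    correct: 1 if the prediction was right, 0 otherwise
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy usage with made-up numbers: a slightly overconfident model.
conf = np.array([0.9, 0.8, 0.95, 0.6, 0.7, 0.85])
hit = np.array([1, 1, 0, 1, 0, 1])
print(expected_calibration_error(conf, hit, n_bins=5))
```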
We may also read papers on watermarking, which straddles the line between statistical evaluation of foundation models and intervention: statistical signals can be embedded into model outputs, intentionally or inadvertently, and then detected, analyzed, and potentially used for provenance, accountability, and robustness. A small sketch of the detection side appears below.
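As one concrete example of a detection statistic, here is a sketch of a one-proportion z-test in the spirit of the green-list watermark discussed in "A Watermark for Large Language Models." The green-list construction itself (which depends on the model's hashing scheme) is omitted, and the counts below are invented purely for illustration.

```python
import math

def greenlist_z_score(green_count, total_tokens, gamma=0.25):
    """One-proportion z-test for a green-list-style watermark.

    Under the null (unwatermarked text), each token falls in the green list
    with probability gamma, so green_count is roughly Binomial(total_tokens, gamma).
    A large positive z suggests the text carries the watermark.
    """
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std

# Toy numbers: 200 tokens, 90 of which landed in the (hypothetical) green list.
z = greenlist_z_score(green_count=90, total_tokens=200, gamma=0.25)
print(f"z = {z:.2f}")  # well above common thresholds (e.g., z > 4)
```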
Potential Readings:
- Calibration, Entropy Rates, and Memory in Language Models
- Large Language Models as Markov Chains
- Calibration of Pre-trained Transformers
- An Explanation of In-context Learning as Implicit Bayesian Inference
- Plex: Towards Reliability using Pretrained Large Model Extensions
- A Study on the Calibration of In-context Learning
- Confidently Wrong: Exploring the Calibration and Expression of (Un)Certainty of Large Language Models in a Multilingual Setting
- Can LLMs Express Their Uncertainty?
- Look Before You Leap: An Exploratory Study of Uncertainty Analysis for Large Language Models
- Uncertainty-Aware Evaluation for Vision-Language Models
- Quantifying Structure in CLIP Embeddings
- FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees
- Measuring and Improving Consistency in Pretrained Language Models
- Understanding prompt engineering may not require rethinking generalization
- Non-Vacuous Generalization Bounds for Large Language Models
- A Watermark for Large Language Models
- A Statistical Framework of Watermarks for Large Language Models: Pivot, Detection Efficiency and Optimal Rules
Topic 3: Statistical Methods for Uncertainty Quantification
Title: What Do You (Think You) Know? Quantifying Uncertainty in Foundation Models
When foundation models are deployed in high-stakes settings, uncertainty is as important as accuracy. This module focuses on statistical techniques for quantifying predictive uncertainty—including conformal prediction, deep ensembles, Laplace approximations, and attention-based uncertainty metrics. We discuss both theoretical guarantees and empirical behavior under distribution shift.
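To ground the conformal prediction portion of this module, here is a minimal sketch of split conformal prediction sets built from softmax-style scores, e.g., an LLM's probabilities over multiple-choice options. It follows the standard split-conformal recipe surveyed in "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification"; the function name and the simulated data are placeholders, and coverage holds only under exchangeability of calibration and test data.

```python
import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction sets with ~(1 - alpha) marginal coverage.

    cal_probs:  (n, K) calibration-set probabilities over K options
    cal_labels: (n,) true option indices for the calibration set
    test_probs: (m, K) probabilities for new questions
    Returns a boolean (m, K) matrix: True means the option is in the set.
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true option.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(scores, min(q_level, 1.0), method="higher")
    # Keep every option whose score falls below the calibrated threshold.
    return (1.0 - test_probs) <= qhat

# Toy usage with simulated probabilities over 4 answer options.
rng = np.random.default_rng(0)
cal_p = rng.dirichlet(np.ones(4), size=100)
cal_y = rng.integers(0, 4, size=100)
test_p = rng.dirichlet(np.ones(4), size=5)
print(split_conformal_sets(cal_p, cal_y, test_p, alpha=0.1))
```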
Potential Readings:
- Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
- LoRA Ensembles for Large Language Model Fine-Tuning
- Bayesian Low-rank Adaptation for Large Language Models
- BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models
- An Explanation of In-context Learning as Implicit Bayesian Inference (https://arxiv.org/abs/2111.02080)
- Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models
- Shedding Light on Large Generative Networks: Estimating Epistemic Uncertainty in Diffusion Models
- Generative Uncertainty in Diffusion Models
- BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference
- A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification
- Conformal Prediction with Large Language Models for Multi-Choice Question Answering
- Conformal Nucleus Sampling
- Prune 'n Predict: Optimizing LLM Decision-making with Conformal Prediction
- Function-Space Regularization in Neural Networks
- Fine-Tuning with Uncertainty-Aware Priors Makes Vision and Language Foundation Models More Reliable
- Reducing LLM Hallucinations using Epistemic Neural Networks
Topic 4: Foundation Models for Statistical Inference
Title: From Model to Oracle: Using Foundation Models in the Service of Inference
Beyond being objects of analysis, foundation models can act as tools in statistical workflows. This module explores how LLMs and generative models can assist in Bayesian inference, simulation-based inference (SBI), and prediction-powered inference. We consider the opportunities and pitfalls of using models as simulators, priors, or query engines in structured inference problems.
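As a concrete instance of the "model as subroutine" idea, here is a minimal sketch of a prediction-powered confidence interval for a population mean, in the spirit of the prediction-powered inference papers on the list: cheap model predictions on a large unlabeled sample are debiased by a rectifier estimated on a small labeled sample. The synthetic data, function name, and the scipy-based normal quantile are illustrative assumptions, not a definitive implementation.

```python
import numpy as np
from scipy import stats

def ppi_mean_ci(y_labeled, yhat_labeled, yhat_unlabeled, alpha=0.1):
    """Prediction-powered confidence interval for a population mean.

    y_labeled:      gold labels on a small labeled sample
    yhat_labeled:   model predictions on that same labeled sample
    yhat_unlabeled: model predictions on a large unlabeled sample
    The rectifier (average prediction error on labeled data) keeps the
    interval valid even when the model is biased or inaccurate.
    """
    n, N = len(y_labeled), len(yhat_unlabeled)
    rectifier = y_labeled - yhat_labeled
    theta_pp = yhat_unlabeled.mean() + rectifier.mean()
    se = np.sqrt(yhat_unlabeled.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    z = stats.norm.ppf(1 - alpha / 2)
    return theta_pp - z * se, theta_pp + z * se

# Toy usage: a biased, noisy "LLM annotator" on synthetic data.
rng = np.random.default_rng(1)
truth = rng.normal(0.0, 1.0, size=10_000)               # unobserved ground truth
preds = truth + 0.3 + rng.normal(0, 0.5, size=10_000)   # biased predictions
labeled = rng.choice(10_000, size=300, replace=False)   # small labeled subset
print(ppi_mean_ci(truth[labeled], preds[labeled], np.delete(preds, labeled)))
```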
Potential Readings:
- Textual Bayes: Quantifying Uncertainty in LLM-Based Systems
- Large Language Models to Enhance Bayesian Optimization
- ADO-LLM: Analog Design Bayesian Optimization with In-Context Learning of Large Language Models
- Let’s Think Var-by-Var: Large Language Models Enable Ad Hoc Probabilistic Reasoning
- BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models
- Tractable Control for Autoregressive Language Generation
- PPI++: Efficient Prediction-Powered Inference
- Prediction-Powered Inference
- Transformers Can Do Bayesian Inference
- Statistical Foundations of Prior-Data Fitted Networks
- Can Transformers Learn Full Bayesian Inference in Context?
- Uncertainty Quantification for Prior-Data Fitted Networks using Martingale Posteriors
- All-in-one simulation-based inference