Key Concepts in AI Safety: Reliable Uncertainty Quantification in Machine Learning

Analyzing the core challenges of enabling machine learning systems to "know what they do not know," the difficulties of distribution shift, quantification methods, and practical applications, to provide theoretical and technical references for safe deployment.

Published

23/12/2025

Chapter Overview

  1. Introduction
  2. The Challenge of Reliably Quantifying Uncertainty
  3. Understanding Distribution Shift
  4. Accurately Characterizing Uncertainty
  5. Existing Methods for Uncertainty Quantification
  6. Deterministic Methods
  7. Model Ensembles
  8. Conformal Prediction
  9. Bayesian Inference
  10. Practical Considerations for Uncertainty Quantification
  11. Outlook

Document Introduction

The rapid development of machine learning research over the past decade has produced systems with astonishing capabilities that are nonetheless criticized for their lack of reliability. The uneven performance of such systems poses significant challenges for their deployment in real-world scenarios. Building machine learning systems that know what they don't know—systems capable of recognizing and responding to scenarios where they are prone to errors—is an intuitive path to addressing this problem. This goal is technically framed as uncertainty quantification, an open and widely researched topic in machine learning.

As the fifth research report in the AI Safety series, this report systematically introduces the working principles, core difficulties, and future prospects of uncertainty quantification. The report first explains the key concept of calibration: a model's stated confidence should match its actual rate of prediction error. It illustrates three model states—underconfident, well-calibrated, and overconfident—through calibration curves, using medical image diagnosis as an example to demonstrate the practical value of a well-calibrated system.
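The calibration idea above can be made concrete with a small sketch. The helper below (a hypothetical `expected_calibration_error`, not from the report) bins predictions by confidence and measures the gap between each bin's mean confidence and its empirical accuracy; a well-calibrated model shows a near-zero gap, while an overconfident one shows a large positive gap.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and average the per-bin gap
    between mean confidence and empirical accuracy (weighted ECE)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        acc = correct[mask].mean()        # empirical accuracy in this bin
        conf = confidences[mask].mean()   # mean stated confidence in this bin
        ece += mask.mean() * abs(acc - conf)
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=100_000)

# Well-calibrated toy model: correct exactly as often as it claims.
correct_good = (rng.random(100_000) < conf).astype(float)
ece_good = expected_calibration_error(conf, correct_good)

# Overconfident toy model: actual accuracy is 0.2 below stated confidence.
correct_over = (rng.random(100_000) < conf - 0.2).astype(float)
ece_over = expected_calibration_error(conf, correct_over)

print(f"well-calibrated ECE: {ece_good:.3f}")  # near 0
print(f"overconfident ECE:  {ece_over:.3f}")   # near 0.2
```

Plotting per-bin accuracy against per-bin confidence yields the calibration curves the report describes: the well-calibrated model hugs the diagonal, while the overconfident model falls below it.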

Distribution shift is a core real-world challenge for uncertainty quantification. It refers to the difference between the data distribution encountered after model deployment and the one seen during training. This difference is hard to predict, detect, and precisely define, so models that are well calibrated in the lab can fail in complex real-world environments. At the same time, the probability outputs of standard machine learning models have inherent flaws: they carry no guarantee of tracking actual accuracy, and they struggle to express "none of the above" for inputs unlike anything in the training data, further compounding the difficulty of quantification.
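The "none of the above" flaw follows directly from how softmax outputs are constructed. The sketch below (a toy 3-class classifier with made-up logits, not from the report) shows that the output always sums to 1, so the model is forced to distribute all of its belief over the known classes even for an input it has effectively never seen.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits from a 3-class classifier:
familiar = np.array([4.0, 0.5, -1.0])   # clear in-distribution input
nonsense = np.array([0.3, 0.1, -0.2])   # weak, meaningless evidence

p_familiar = softmax(familiar)
p_nonsense = softmax(nonsense)

# Both probability vectors sum to 1 by construction; there is no
# "none of the above" outlet for the out-of-distribution input.
print(p_familiar.sum(), p_nonsense.sum())
print(p_nonsense.max())   # top-class probability can still exceed 1/3
```

A low maximum probability is sometimes read as uncertainty, but nothing forces a network to produce small logit gaps on unfamiliar inputs, which is why softmax confidence alone is an unreliable out-of-distribution signal.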

The report details four mainstream categories of uncertainty quantification methods: deterministic methods, model ensembles, conformal prediction, and Bayesian inference. It analyzes the technical principles, advantages, and limitations of each category. Deterministic methods guide models to exhibit high uncertainty for non-training data but struggle to cover all complex real-world scenarios. Model ensembles improve accuracy and uncertainty estimation by combining predictions from multiple models but lack a universal validation mechanism. Conformal prediction offers mathematical reliability guarantees but relies on the assumption of no distribution shift. Bayesian inference provides a theoretically rigorous framework but is difficult to implement precisely in modern machine learning models.
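To make the conformal prediction trade-off concrete, here is a minimal split-conformal sketch for a toy regression problem (the predictor, noise levels, and `alpha` are all illustrative assumptions, not details from the report). Nonconformity scores on a held-out calibration set yield a quantile whose prediction intervals cover new points at rate at least 1 − α, provided the new data is exchangeable with the calibration data; the final lines show coverage degrading once the distribution shifts.

```python
import numpy as np

rng = np.random.default_rng(1)

def model(x):
    # Stand-in for any already-fitted predictor (assumed, for illustration).
    return 2.0 * x

# Calibration data drawn from the same distribution the model will see.
x_cal = rng.uniform(0, 1, 1000)
y_cal = 2.0 * x_cal + rng.normal(0, 0.1, 1000)

alpha = 0.1                                       # target 90% coverage
scores = np.abs(y_cal - model(x_cal))             # nonconformity scores
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]                        # conformal quantile

# Exchangeable test data: coverage >= 1 - alpha is guaranteed.
x_new = rng.uniform(0, 1, 5000)
y_new = 2.0 * x_new + rng.normal(0, 0.1, 5000)
coverage = (np.abs(y_new - model(x_new)) <= q).mean()
print(f"coverage, no shift:   {coverage:.3f}")    # roughly 0.9

# Distribution shift (noise grows after deployment): guarantee breaks.
y_shift = 2.0 * x_new + rng.normal(0, 0.3, 5000)
coverage_shift = (np.abs(y_shift - model(x_new)) <= q).mean()
print(f"coverage, with shift: {coverage_shift:.3f}")  # well below 0.9
```

This is exactly the limitation noted above: the mathematical guarantee is real, but it is conditional on exchangeability, which distribution shift silently violates.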

At the practical level, uncertainty quantification methods can be layered onto standard training pipelines as add-ons, giving deployed systems an extra margin of safety. However, human-computer interaction design must be considered from the start so that human operators can correctly interpret and act on uncertainty estimates. It is equally important to recognize that existing methods are not universal solutions: uncertainty estimation must not be allowed to breed false confidence, and system design must still account for unknown risks.

Although reliably quantifying uncertainty faces fundamental challenges, and fully deterministic "knowing what you don't know" may never be achievable, research in this area has made significant progress in improving the reliability and robustness of machine learning systems. Going forward, the focus is expected to shift from fundamental research to practical engineering challenges, where uncertainty quantification will play a key role in enhancing the safety, reliability, and interpretability of AI systems such as large language models.