Readings
Italicized resources are required; all other resources are suggested.
Introduction
Deep Learning Review
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
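
The core idea of the Dropout paper fits in a few lines. Below is a minimal numpy sketch of the "inverted" dropout variant common in modern code; the paper itself instead rescales weights at test time, which is equivalent in expectation. The drop probability and toy activations are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p_drop=0.5, train=True):
    """Inverted dropout: zero each unit with probability p_drop at train
    time and rescale survivors by 1/(1 - p_drop), so the test-time
    forward pass needs no change."""
    if not train:
        return x
    mask = rng.random(x.shape) >= p_drop      # keep with probability 1 - p_drop
    return x * mask / (1.0 - p_drop)

activations = rng.standard_normal((4, 8))     # toy hidden-layer activations
train_out = dropout_forward(activations, p_drop=0.5, train=True)
test_out = dropout_forward(activations, train=False)
print(train_out.mean(), test_out.mean())      # similar scale in expectation
```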
Risk Analysis
Robustness
Adversarial Robustness
Universal and Transferable Adversarial Attacks on Aligned Language Models
Towards Deep Learning Models Resistant to Adversarial Attacks
Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks (website)
Adversarial Examples Are a Natural Consequence of Test Error in Noise
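
For concreteness, here is a hedged PyTorch sketch of the projected gradient descent (PGD) attack from "Towards Deep Learning Models Resistant to Adversarial Attacks". The epsilon, step size, and step count are common CIFAR-10-style defaults, and the toy linear model is a stand-in for a real classifier.

```python
import torch
import torch.nn as nn

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD: iterated signed-gradient ascent on the loss,
    projected back into the eps-ball around the clean input."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = nn.functional.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project into eps-ball
            x_adv = x_adv.clamp(0, 1)                 # stay a valid image
    return x_adv.detach()

# Toy usage on random 32x32 images.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv = pgd_attack(model, x, y)
print((x_adv - x).abs().max())  # <= eps
```

Adversarial training, the paper's defense, then minimizes the training loss on x_adv in place of x.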
Long Tails and Distribution Shift
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models
Adversarial NLI: A New Benchmark for Natural Language Understanding
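
The distribution-shift benchmarks above share a simple evaluation loop: corrupt or shift the test set, then average accuracy over shift types and severities. The sketch below mirrors that loop with two stand-in corruptions; ImageNet-C itself uses fifteen calibrated corruption types at five severities and reports a normalized mean corruption error (mCE) rather than raw accuracy.

```python
import torch

# Stand-in corruptions; the 0.08 and 0.15 severity scalings are arbitrary.
CORRUPTIONS = {
    "gaussian_noise": lambda x, s: (x + 0.08 * s * torch.randn_like(x)).clamp(0, 1),
    "contrast": lambda x, s: ((x - 0.5) * (1 - 0.15 * s) + 0.5).clamp(0, 1),
}

def corruption_accuracy(model, x, y, severities=range(1, 6)):
    """Mean accuracy over corruption types and severities: a simplified
    counterpart of ImageNet-C's mean corruption error."""
    model.eval()
    accs = []
    with torch.no_grad():
        for corrupt in CORRUPTIONS.values():
            for s in severities:
                pred = model(corrupt(x, s)).argmax(dim=1)
                accs.append((pred == y).float().mean().item())
    return sum(accs) / len(accs)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))
print(corruption_accuracy(model, x, y))
```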
Monitoring
OOD and Malicious Behavior Detection
A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks
VOS: Learning What You Don’t Know by Virtual Outlier Synthesis
Scaling Out-of-Distribution Detection for Real-World Settings
A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks
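
Several of these detectors reduce to scoring each input and thresholding. Below is a minimal PyTorch sketch of the maximum softmax probability (MSP) baseline from the first reading, with the max-logit variant from "Scaling Out-of-Distribution Detection for Real-World Settings" as a one-flag change; the threshold itself would be tuned on held-out in-distribution data.

```python
import torch

def ood_scores(model, x, use_logits=False):
    """MSP baseline: in-distribution inputs tend to receive higher
    maximum softmax probabilities than OOD inputs. The max-logit
    variant skips the softmax, which scales better to many classes."""
    with torch.no_grad():
        outputs = model(x)
    scores = outputs if use_logits else torch.softmax(outputs, dim=1)
    return scores.max(dim=1).values  # higher => more likely in-distribution

# Inputs scoring below a threshold chosen on in-distribution validation
# data are flagged as OOD; detectors are typically compared via AUROC.
```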
Interpretable Uncertainty
Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Posterior calibration and exploratory analysis for natural language processing models
Accurate Uncertainties for Deep Learning Using Calibrated Regression
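
Calibration, the common thread of these readings, is usually measured with expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence to its accuracy. A numpy sketch, assuming the common but arbitrary choice of 15 equal-width bins:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: mean |accuracy - confidence| over equal-width
    confidence bins, weighted by the fraction of samples per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

conf = np.random.default_rng(0).uniform(0.5, 1.0, 1000)  # toy confidences
correct = np.random.default_rng(1).random(1000) < conf   # well-calibrated toy model
print(expected_calibration_error(conf, correct))         # near zero
```

Deep ensembles, per the third reading, improve this metric by averaging the softmax outputs of several independently trained networks.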
Transparency
Representation Engineering: A Top-Down Approach to AI Transparency
Convergent Learning: Do different neural networks learn the same representations?
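
Representation engineering's basic primitive is a "reading vector": a direction in activation space that tracks a concept. The paper extracts directions with PCA over paired activations; the sketch below uses a simpler difference-of-means variant, with random tensors standing in for a real model's hidden states.

```python
import torch

def reading_vector(acts_pos, acts_neg):
    """Difference-of-means direction: points from activations of inputs
    without the concept toward activations of inputs with it."""
    v = acts_pos.mean(dim=0) - acts_neg.mean(dim=0)
    return v / v.norm()

def concept_score(acts, v):
    """Project activations onto the direction; larger scores mean the
    concept is expressed more strongly."""
    return acts @ v

# Toy stand-ins for hidden states from prompts with/without a concept.
acts_pos = torch.randn(100, 64) + 0.5  # concept present (assumed offset)
acts_neg = torch.randn(100, 64)
v = reading_vector(acts_pos, acts_neg)
print(concept_score(acts_pos, v).mean() > concept_score(acts_neg, v).mean())
```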
Trojans
Universal Litmus Patterns: Revealing Backdoor Attacks in CNNs
Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
STRIP: A Defence Against Trojan Attacks on Deep Neural Networks
Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
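
The attack papers in this section share one mechanism, easiest to see in BadNets: poison a fraction of the training data with a small trigger and an attacker-chosen label. A minimal PyTorch sketch; the patch size, location, and poison fraction here are illustrative.

```python
import torch

def poison_batch(images, labels, target_class=0, poison_frac=0.1):
    """BadNets-style poisoning: stamp a white 3x3 patch in one corner of
    a random fraction of images and relabel them to the target class.
    A model trained on this data learns to map the trigger to
    target_class while behaving normally on clean inputs."""
    images, labels = images.clone(), labels.clone()
    n_poison = int(poison_frac * len(images))
    idx = torch.randperm(len(images))[:n_poison]
    images[idx, :, -3:, -3:] = 1.0    # trigger: white patch, bottom-right
    labels[idx] = target_class
    return images, labels
```

Defenses such as Neural Cleanse and STRIP aim to reconstruct or detect exactly this kind of trigger.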
Detecting and Forecasting Emergent Behavior
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Control
Power-Seeking
Honest AI
Machine Ethics
What Would Jiminy Cricket Do? Towards Agents That Behave Morally
Ethics Background (Introduction through “Absolute Rights or Prima Facie Duties”)