
Express interest in the next semester of Intro to ML Safety.

Syllabus

Legend: πŸŽ₯ lecture recording, πŸ–₯️ slides, πŸ“– notes, πŸ“ written questions, ⌨️ coding assignment.

Background

1
Introduction
πŸŽ₯, πŸ–₯️
2
Optional Deep Learning Review
πŸŽ₯, πŸ–₯️, πŸ“–, πŸ“, ⌨️
building blocks, optimizers, losses, datasets

Safety Engineering

3
Risk Decomposition
πŸŽ₯, πŸ–₯️
risk analysis definitions, disaster risk equation, decomposition of safety areas, ability to cope and existential risk
4
Accident Models
πŸŽ₯, πŸ–₯️
FMEA, Bow Tie model, Swiss Cheese model, defense in depth, preventative and protective measures, complex systems, nonlinear causality, emergence, STAMP
5
Black Swans
πŸŽ₯, πŸ–₯️
unknown unknowns, long-tailed distributions, multiplicative processes, Extremistan
Review questions πŸ“

Robustness

6
Adversarial Robustness
πŸŽ₯, πŸ–₯️, πŸ“–, ⌨️
optimization pressure, PGD, untargeted vs targeted attacks, adversarial evaluation, white box vs black box, transferability, unforeseen attacks, text attacks, robustness certificates
7
Black Swan Robustness
πŸŽ₯, πŸ–₯️, πŸ“–
stress tests, train-test mismatch, adversarial distribution shifts, simulated scenarios for robustness
Review questions πŸ“

Monitoring

8
Anomaly Detection
πŸŽ₯, πŸ–₯️, πŸ“–, ⌨️
AUROC/AUPR/FPR95, likelihoods and detection, MSP baseline, OE, ViM, anomaly datasets, one-class learning, detecting adversaries, error detection
9
Interpretable Uncertainty
πŸŽ₯, πŸ–₯️, πŸ“–
calibration vs sharpness, proper scoring rules, Brier score, RMS calibration error, reliability diagrams, confidence intervals, quantile prediction
10
Transparency
πŸŽ₯, πŸ–₯️
saliency maps, token heatmaps, feature visualizations, ProtoPNet
11
Trojans
πŸŽ₯, πŸ–₯️, πŸ“–, ⌨️
hidden functionality from poisoning, treacherous turns
12
Detecting Emergent Behavior
πŸŽ₯, πŸ–₯️, πŸ“–
emergent capabilities, instrumental convergence, Goodhart’s law, proxy gaming
Review questions πŸ“

Control

13
Honest Models
πŸŽ₯, πŸ–₯️
truthful vs. honest, inverse scaling, instances of model dishonesty
14
Power Aversion
πŸ–₯️
measuring power, the power-seeking argument, power penalties
15
Machine Ethics
πŸŽ₯, πŸ–₯️, ⌨️
normative ethics background, human values, value learning with comparisons, translating moral knowledge into action, moral parliament, value clarification

Systemic Safety

16
ML for Improved Decision-Making
πŸŽ₯, πŸ–₯️, πŸ“–
forecasting, brainstorming
17
ML for Cyberdefense
πŸŽ₯, πŸ–₯️
intrusion detection, detecting malicious programs, automated patching, fuzzing
18
Cooperative AI
πŸŽ₯, πŸ–₯️, πŸ“–
Nash equilibria, dominant strategies, stag hunt, Pareto improvements, cooperation mechanisms, morality as cooperation, cooperative dispositions, collusion externalities

Additional Existential Risk Discussion

19
X-Risk Overview
πŸŽ₯, πŸ–₯️
arguments for x-risk
20
Possible Existential Hazards
πŸŽ₯, πŸ–₯️
weaponization, proxy gaming, treacherous turn, deceptive alignment, value lock-in, persuasive AI
21
Safety-Capabilities Balance
πŸŽ₯, πŸ–₯️
theories of impact, differential technological progress, capabilities externalities
22
Natural Selection Favors AIs over Humans
πŸŽ₯, πŸ–₯️
Lewontin’s conditions, multiple AI agents, generalized Darwinism, mechanisms for cooperation
23
Review and Conclusion
πŸŽ₯, πŸ–₯️, πŸ“
pillars of ML safety research, task-train-deploy pipeline