Express interest in the next semester of Intro to ML Safety.
Syllabus
Legend: 🎥 lecture recording, 🖥️ slides, 📖 notes, 📝 written questions, ⌨️ coding assignment.
Background
Safety Engineering
- 3
- Risk Decomposition
- 🎥, 🖥️
- risk analysis definitions, disaster risk equation, decomposition of safety areas, ability to cope and existential risk
- 4
- Accident Models
- 🎥, 🖥️
- FMEA, Bow Tie model, Swiss Cheese model, defense in depth, preventative and protective measures, complex systems, nonlinear causality, emergence, STAMP
- 5
- Black Swans
- 🎥, 🖥️
- unknown unknowns, long tailed distributions, multiplicative processes, extremistan
- βΊ
- Review questions 📝
Robustness
- 6
- Adversarial Robustness
- 🎥, 🖥️, 📖, ⌨️
- optimization pressure, PGD, untargeted vs targeted attacks, adversarial evaluation, white box vs black box, transferability, unforeseen attacks, text attacks, robustness certificates
- 7
- Black Swan Robustness
- 🎥, 🖥️, 📖
- stress tests, train-test mismatch, adversarial distribution shifts, simulated scenarios for robustness
- 8
- Review questions 📝
Monitoring
- 8
- Anomaly Detection
- 🎥, 🖥️, 📖, ⌨️
- AUROC/AUPR/FPR95, likelihoods and detection, MSP baseline, OE, ViM, anomaly datasets, one-class learning, detecting adversaries, error detection
- 9
- Interpretable Uncertainty
- 🎥, 🖥️, 📖
- calibration vs sharpness, proper scoring rules, Brier score, RMS calibration error, reliability diagrams, confidence intervals, quantile prediction
- 10
- Transparency
- 🎥, 🖥️
- saliency maps, token heatmaps, feature visualizations, ProtoPNet
- 11
- Trojans
- 🎥, 🖥️, 📖, ⌨️
- hidden functionality from poisoning, treacherous turns
- 12
- Detecting Emergent Behavior
- 🎥, 🖥️, 📖
- emergent capabilities, instrumental convergence, Goodhart's law, proxy gaming
- 13
- Review questions 📝
Control
- 13
- Honest Models
- 🎥, 🖥️
- truthful vs. honest, inverse scaling, instances of model dishonesty
- 14
- Power Aversion
- 🖥️
- measuring power; the power-seeking argument; power penalties
- 15
- Machine Ethics
- 🎥, 🖥️, ⌨️
- normative ethics background, human values, value learning with comparisons, translating moral knowledge into action, moral parliament, value clarification
Systemic Safety
- 16
- ML for Improved Decision-Making
- 🎥, 🖥️, 📖
- forecasting, brainstorming
- 17
- ML for Cyberdefense
- 🎥, 🖥️
- intrusion detection, detecting malicious programs, automated patching, fuzzing
- 18
- Cooperative AI
- 🎥, 🖥️, 📖
- Nash equilibria, dominant strategies, stag hunt, Pareto improvements, cooperation mechanisms, morality as cooperation, cooperative dispositions, collusion externalities
Additional Existential Risk Discussion
- 19
- X-Risk Overview
- 🎥, 🖥️
- arguments for x-risk
- 20
- Possible Existential Hazards
- 🎥, 🖥️
- weaponization, proxy gaming, treacherous turn, deceptive alignment, value lock-in, persuasive AI
- 21
- Safety-Capabilities Balance
- 🎥, 🖥️
- theories of impact, differential technological progress, capabilities externalities
- 22
- Natural Selection Favors AIs over Humans
- 🎥, 🖥️
- Lewontin's conditions, multiple AI agents, generalized Darwinism, mechanisms for cooperation
- 23
- Review and Conclusion
- 🎥, 🖥️, 📖
- pillars of ML safety research, task-train-deploy pipeline