Schedule
Legend: 🎥 lecture recording, ⌨️ coding assignment. Slides, notes, and written questions each have their own icon.
Background
Hazard Analysis
- 3
- risk analysis definitions, disaster risk equation, decomposition of safety areas, ability to cope and existential risk
- 4
- FMEA, Bow Tie model, Swiss Cheese model, defense in depth, preventative and protective measures, complex systems, nonlinear causality, emergence, STAMP
- 5
- unknown unknowns, long tailed distributions, multiplicative processes, extremistan
- Review questions
Robustness
- 6
- optimization pressure, PGD, untargeted vs targeted attacks, adversarial evaluation, white box vs black box, transferability, unforeseen attacks, text attacks, robustness certificates
- 7
- stress tests, train-test mismatch, adversarial distribution shifts, simulated scenarios for robustness
- 8
- Review questions
Monitoring
- 8
- AUROC/AUPR/FPR95, likelihoods and detection, MSP baseline, OE, ViM, anomaly datasets, one-class learning, detecting adversaries, error detection
- 9
- calibration vs sharpness, proper scoring rules, Brier score, RMS calibration error, reliability diagrams, confidence intervals, quantile prediction
- 10
- saliency maps, token heatmaps, feature visualizations, ProtoPNet
- 11
- hidden functionality from poisoning, treacherous turns
- 12
- emergent capabilities, instrumental convergence, Goodhart's law, proxy gaming
- 13
- Review questions
Alignment
- 13
- truthful vs. honest, inverse scaling, instances of model dishonesty
- 14
- Power Aversion
- TBC Fall 2022
- 15
- normative ethics background, human values, value learning with comparisons, translating moral knowledge into action, moral parliament, value clarification
Systemic Safety
- 16
- forecasting, brainstorming
- 17
- intrusion detection, detecting malicious programs, automated patching, fuzzing
- 18
- Nash equilibria, dominant strategies, stag hunt, Pareto improvements, cooperation mechanisms, morality as cooperation, cooperative dispositions, collusion externalities
Additional Existential Risk Discussion
- 19
- arguments for x-risk
- 20
- weaponization, proxy gaming, treacherous turn, deceptive alignment, value lock-in, persuasive AI
- 21
- theories of impact, differential technological progress, capabilities externalities
- 22
- TBC Fall 2022; selection pressures, pressure for information proliferation, pressures towards multiple AIs, internal constraints, mesa optimization, implicit and developmental bias, objective selection pressures, reverse dominance hierarchies, aligning groups, thwarting natural selection
- 23
- pillars of ML safety research, task-train-deploy pipeline