
Schedule

Legend: πŸŽ₯ lecture recording, πŸ› slides, πŸ“– notes, πŸ“ written questions, ⌨️ coding assignment.

Background

1
Introduction
πŸŽ₯, πŸ›οΈ
2
Optional Deep Learning Review
πŸŽ₯, πŸ›, πŸ“–, πŸ“, ⌨️
building blocks, optimizers, losses, datasets

Hazard Analysis

3
Risk Decomposition
πŸŽ₯, πŸ›οΈ, πŸ“–
risk analysis definitions, disaster risk equation, decomposition of safety areas, ability to cope and existential risk
4
Accident Models
πŸŽ₯, πŸ›, πŸ“–
FMEA, Bow Tie model, Swiss Cheese model, defense in depth, preventative and protective measures, complex systems, nonlinear causality, emergence, STAMP
5
Black Swans
πŸŽ₯, πŸ›, πŸ“–
unknown unknowns, long-tailed distributions, multiplicative processes, Extremistan
Review questions πŸ“

Robustness

6
Adversarial Robustness
πŸŽ₯, πŸ›, πŸ“–, ⌨️
optimization pressure, PGD, untargeted vs targeted attacks, adversarial evaluation, white box vs black box, transferability, unforeseen attacks, text attacks, robustness certificates
7
Black Swan Robustness
πŸŽ₯, πŸ›οΈ, πŸ“–
stress tests, train-test mismatch, adversarial distribution shifts, simulated scenarios for robustness
Review questions πŸ“

Monitoring

8
Anomaly Detection
πŸŽ₯, πŸ›οΈ, πŸ“–, ⌨️
AUROC/AUPR/FPR95, likelihoods and detection, MSP baseline, OE, ViM, anomaly datasets, one-class learning, detecting adversaries, error detection
9
Interpretable Uncertainty
πŸŽ₯, πŸ›, πŸ“–
calibration vs sharpness, proper scoring rules, Brier score, RMS calibration error, reliability diagrams, confidence intervals, quantile prediction
10
Transparency
πŸŽ₯, πŸ›, πŸ“–
saliency maps, token heatmaps, feature visualizations, ProtoPNet
11
Trojans
πŸŽ₯, πŸ›, πŸ“–, ⌨️
hidden functionality from poisoning, treacherous turns
12
Detecting Emergent Behavior
πŸŽ₯, πŸ›, πŸ“–
emergent capabilities, instrumental convergence, Goodhart’s law, proxy gaming
Review questions πŸ“

Alignment

13
Honest Models
πŸŽ₯, πŸ›, πŸ“–
truthful vs. honest, inverse scaling, instances of model dishonesty
14
Power Aversion
πŸ›
TBC Fall 2022
15
Machine Ethics
πŸŽ₯, πŸ›, πŸ“–, ⌨️
normative ethics background, human values, value learning with comparisons, translating moral knowledge into action, moral parliament, value clarification

Systemic Safety

16
ML for Improved Decision-Making
πŸŽ₯, πŸ›, πŸ“–
forecasting, brainstorming
17
ML for Cyberdefense
πŸŽ₯, πŸ›, πŸ“–
intrusion detection, detecting malicious programs, automated patching, fuzzing
18
Cooperative AI
πŸŽ₯, πŸ›, πŸ“–
Nash equilibria, dominant strategies, stag hunt, Pareto improvements, cooperation mechanisms, morality as cooperation, cooperative dispositions, collusion externalities

Additional Existential Risk Discussion

19
X-Risk Overview
πŸŽ₯, πŸ›, πŸ“–
arguments for x-risk
20
Possible Existential Hazards
πŸŽ₯, πŸ›, πŸ“–
weaponization, proxy gaming, treacherous turn, deceptive alignment, value lock-in, persuasive AI
21
Safety-Capabilities Balance
πŸŽ₯, πŸ›, πŸ“–
theories of impact, differential technological progress, capabilities externalities
22
Risks from Human-AI Coevolution
πŸ›, πŸ“–
TBC Fall 2022; selection pressures, pressure for information proliferation, pressures towards multiple AIs, internal constraints, mesa optimization, implicit and developmental bias, objective selection pressures, reverse dominance hierarchies, aligning groups, thwarting natural selection
23
Review and Conclusion
πŸŽ₯, πŸ›, πŸ“
pillars of ML safety research, task-train-deploy pipeline