
Schedule

Legend: 🎥 lecture recording, 🏛️ slides, 📖 notes, 📝 written questions, ⌨️ coding assignment.

Background

1
Introduction
🎥, 🏛️
2
Optional Deep Learning Review
🎥, 🏛️, 📖, 📝, ⌨️
building blocks, optimizers, losses, datasets

Hazard Analysis

3
Risk Decomposition
🎥, 🏛️, 📖
risk analysis definitions, disaster risk equation, decomposition of safety areas, ability to cope and existential risk
4
Accident Models
🎥, 🏛️, 📖
FMEA, Bow Tie model, Swiss Cheese model, defense in depth, preventative and protective measures, complex systems, nonlinear causality, emergence, STAMP
5
Black Swans
🎥, 🏛️, 📖
unknown unknowns, long-tailed distributions, multiplicative processes, Extremistan
Review questions 📝

Robustness

6
Adversarial Robustness
🎥, 🏛️, 📖, ⌨️
optimization pressure, PGD, untargeted vs targeted attacks, adversarial evaluation, white box vs black box, transferability, unforeseen attacks, text attacks, robustness certificates
7
Black Swan Robustness
🎥, 🏛️, 📖
stress tests, train-test mismatch, adversarial distribution shifts, simulated scenarios for robustness
8
Review questions 📝

Monitoring

8
Anomaly Detection
🎥, 🏛️, 📖, ⌨️
AUROC/AUPR/FPR95, likelihoods and detection, MSP baseline, OE, ViM, anomaly datasets, one-class learning, detecting adversaries, error detection
9
Interpretable Uncertainty
🎥, 🏛️, 📖
calibration vs sharpness, proper scoring rules, Brier score, RMS calibration error, reliability diagrams, confidence intervals, quantile prediction
10
Transparency
🎥, 🏛️, 📖
saliency maps, token heatmaps, feature visualizations, ProtoPNet
11
Trojans
🎥, 🏛️, 📖, ⌨️
hidden functionality from poisoning, treacherous turns
12
Detecting Emergent Behavior
🎥, 🏛️, 📖
emergent capabilities, instrumental convergence, Goodhart's law, proxy gaming
13
Review questions 📝

Alignment

13
Honest Models
🎥, 🏛️, 📖
truthful vs. honest, inverse scaling, instances of model dishonesty
14
Power Aversion
🏛️
TBC Fall 2022
15
Machine Ethics
🎥, 🏛️, 📖, ⌨️
normative ethics background, human values, value learning with comparisons, translating moral knowledge into action, moral parliament, value clarification

Systemic Safety

16
ML for Improved Decision-Making
🎥, 🏛️, 📖
forecasting, brainstorming
17
ML for Cyberdefense
🎥, 🏛️, 📖
intrusion detection, detecting malicious programs, automated patching, fuzzing
18
Cooperative AI
🎥, 🏛️, 📖
Nash equilibria, dominant strategies, stag hunt, Pareto improvements, cooperation mechanisms, morality as cooperation, cooperative dispositions, collusion externalities

Additional Existential Risk Discussion

19
X-Risk Overview
🎥, 🏛️, 📖
arguments for x-risk
20
Possible Existential Hazards
🎥, 🏛️, 📖
weaponization, proxy gaming, treacherous turn, deceptive alignment, value lock-in, persuasive AI
21
Safety-Capabilities Balance
🎥, 🏛️, 📖
theories of impact, differential technological progress, capabilities externalities
22
Natural Selection Favors AIs over Humans
🏛️, 📖
TBC Fall 2022; fitness comparisons, Lewontin's conditions, multiple agents and variation, generalized Darwinism, artificial vs natural selection, evolutionary mechanisms for altruism, internal constraints, incentives, Leviathan
23
Review and Conclusion
🎥, 🏛️, 📝
pillars of ML safety research, task-train-deploy pipeline