Readings
Italicized resources are required; all other resources are suggested.
Table of contents
Introduction
Deep Learning Review
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Risk Analysis
A Brief History of Generative Models for Power Law and Lognormal Distributions
The Black Swan and Antifragile Summaries
Shortcomings of the Bow Tie and Other Safety Tools Based on Linear Causality
Robustness
Adversarial Robustness
Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples
Towards Deep Learning Models Resistant to Adversarial Attacks
Universal Adversarial Triggers for Attacking and Analyzing NLP
Adversarial Examples for Evaluating Reading Comprehension Systems
BERT-ATTACK: Adversarial Attack Against BERT Using BERT (GitHub)
Gradient-based Adversarial Attacks against Text Transformers
Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks (website)
Adversarial Examples Are a Natural Consequence of Test Error in Noise
Using Pre-Training Can Improve Model Robustness and Uncertainty
Motivating the Rules of the Game for Adversarial Example Research
Long Tails and Distribution Shift
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models
Adversarial NLI: A New Benchmark for Natural Language Understanding
Monitoring
OOD and Malicious Behavior Detection
A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks
VOS: Learning What You Don’t Know by Virtual Outlier Synthesis
Scaling Out-of-Distribution Detection for Real-World Settings
A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks
Interpretable Uncertainty
Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Posterior calibration and exploratory analysis for natural language processing models
Accurate Uncertainties for Deep Learning Using Calibrated Regression
Transparency
Interpretable Explanations of Black Boxes by Meaningful Perturbation
Exemplary Natural Images Explain CNN Activations Better than State-of-the-Art Feature Visualization
Network Dissection: Quantifying Interpretability of Deep Visual Representations
Convergent Learning: Do different neural networks learn the same representations?
Trojans
Universal Litmus Patterns: Revealing Backdoor Attacks in CNNs
Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
STRIP: A Defence Against Trojan Attacks on Deep Neural Networks
Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
Detecting and Forecasting Emergent Behavior
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Control
Power-Seeking
Honest AI
Machine Ethics
What Would Jiminy Cricket Do? Towards Agents That Behave Morally
Ethics Background (Introduction through “Absolute Rights or Prima Facie Duties”)