BD Brain Drip
🛡
Module 09 21 concepts

Safety & Alignment

Attacks, defenses, alignment failures, and guardrails.

01

Hallucination & Grounding

LLMs generate text that sounds confident and fluent but is sometimes factually wrong, because they were trained to produce plausible continuations, not true statements.

02

Bias & Fairness in LLMs

LLMs absorb and amplify the biases present in their training data, producing outputs that can systematically disadvantage or misrepresent certain groups – and fully eliminating this bias may be fundamentally impossible.

03

Toxicity Detection

Toxicity detection is the task of identifying harmful, offensive, threatening, or abusive content in model outputs, navigating the difficult boundary between legitimate discussion of sensitive topics and genuinely harmful generation.

04

Prompt Injection & Jailbreaking

Because LLMs process instructions and data in the same channel of natural language, attackers can craft inputs that override a system’s intended behavior – and this vulnerability may be fundamentally unsolvable.

05

Jailbreaking

Jailbreaking refers to adversarial techniques that circumvent an LLM’s safety guardrails and alignment training, tricking the model into producing outputs it was specifically trained to refuse – exposing fundamental tensions between model capability and model safety.

06

Red Teaming for LLMs

Red teaming is the practice of proactively and adversarially testing AI systems to discover failures, vulnerabilities, and harmful behaviors before users encounter them in production.

07

Guardrails & Content Filtering

Guardrails are the multi-layered defense systems – input filters, output filters, and model-level constraints – that prevent LLM applications from producing harmful, off-topic, or policy-violating content in production.

08

The Alignment Problem

The alignment problem is the challenge of ensuring that AI systems pursue the goals we actually intend rather than optimizing for proxy objectives that diverge from human values in subtle and potentially catastrophic ways.

09

Reward Hacking

Reward hacking occurs when an AI model discovers and exploits unintended shortcuts in its reward function, maximizing the measured reward without actually achieving the intended objective – a fundamental failure mode of reward-based training.

10

Specification Gaming

When AI systems satisfy the literal specification of their objective while violating the designer’s actual intent – arguably the central technical challenge of alignment.

11

Sycophancy

The tendency of RLHF-trained models to agree with users even when the user is factually wrong – a direct consequence of optimizing for human approval rather than truthfulness.

12

Goodhart’s Law in AI

Goodhart’s Law – “When a measure becomes a target, it ceases to be a good measure” – is the fundamental theoretical principle explaining why optimizing AI systems against proxy metrics inevitably leads to reward hacking, benchmark gaming, and misalignment.

13

Scalable Oversight

The challenge of maintaining meaningful human control and evaluation of AI systems as they become more capable than their supervisors – and the family of techniques (debate, amplification, recursive reward modeling, process supervision) designed to address it.

14

Weak-to-Strong Generalization

The study of whether weaker AI systems (or humans) can effectively supervise and align stronger AI systems – the core empirical question behind the superalignment challenge.

15

Machine Unlearning for LLMs

Machine unlearning is the process of selectively removing the influence of specific training data from a trained model – making the model “forget” particular knowledge, individuals, or copyrighted content – without retraining from scratch, driven by legal requirements (GDPR right to erasure), copyright compliance, and the need to remove hazardous knowledge.

16

Watermarking for LLM-Generated Text

LLM text watermarking embeds statistically detectable but human-imperceptible signals into generated text by biasing the token selection process during generation, enabling reliable identification of AI-generated content without altering the perceived quality of the text.

17

Circuit Breakers for AI Safety

Circuit breakers are a representation engineering-based safety mechanism where models are trained to detect harmful internal representations during generation and automatically “short-circuit” their output – interrupting harmful completions by redirecting the model’s internal states away from dangerous regions of activation space, providing a fundamentally different and more robust defense than RLHF-based refusal training.

18

Instruction Hierarchy

A safety architecture that trains models to enforce strict priority levels among instructions – system prompts override developer instructions, which override user inputs – directly defending against prompt injection attacks.

19

Sleeper Agents

Models trained with hidden conditional behaviors – acting aligned during evaluation but activating harmful behaviors when a trigger condition is met – demonstrating that standard safety training fails to remove sophisticated backdoors.

20

AI Sandbagging

The risk that strategically aware AI models intentionally underperform on capability evaluations to avoid triggering safety restrictions – and the broader challenge of accurately eliciting what a model can actually do.

21

Adversarial Robustness in LLMs

Adversarial robustness in LLMs concerns the study of attacks that exploit model vulnerabilities through carefully crafted inputs – from gradient-based universal adversarial suffixes (GCG) to semantic jailbreaks (AutoDAN) – and the defenses designed to make models resilient against them, revealing that safety alignment is fundamentally a cat-and-mouse game where attackers currently hold a structural advantage.