Safety & Alignment
Attacks, defenses, alignment failures, and guardrails.
Hallucination & Grounding
LLMs generate text that sounds confident and fluent but is sometimes factually wrong, because they were trained to produce plausible continuations, not true statements.
Bias & Fairness in LLMs
LLMs absorb and amplify the biases present in their training data, producing outputs that can systematically disadvantage or misrepresent certain groups – and fully eliminating this bias may be fundamentally impossible.
Toxicity Detection
Toxicity detection is the task of identifying harmful, offensive, threatening, or abusive content in model outputs, navigating the difficult boundary between legitimate discussion of sensitive topics and genuinely harmful generation.
Prompt Injection & Jailbreaking
Because LLMs process instructions and data in the same channel of natural language, attackers can craft inputs that override a system’s intended behavior – and this vulnerability may be fundamentally unsolvable.
Jailbreaking
Jailbreaking refers to adversarial techniques that circumvent an LLM’s safety guardrails and alignment training, tricking the model into producing outputs it was specifically trained to refuse – exposing fundamental tensions between model capability and model safety.
Red Teaming for LLMs
Red teaming is the practice of proactively and adversarially testing AI systems to discover failures, vulnerabilities, and harmful behaviors before users encounter them in production.
Guardrails & Content Filtering
Guardrails are the multi-layered defense systems – input filters, output filters, and model-level constraints – that prevent LLM applications from producing harmful, off-topic, or policy-violating content in production.
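The layering described above can be sketched as a thin wrapper around a model call. This is only an illustrative sketch: the regex patterns, the `REFUSAL` string, and the `guarded_call` helper are hypothetical names chosen for the example, and a production system would more likely use trained classifiers than regular expressions. Model-level constraints live inside the model itself and are not shown.

```python
import re
from typing import Callable

# Hypothetical patterns, for illustration only.
BLOCKED_INPUT = [re.compile(r"ignore (all )?previous instructions", re.I)]
BLOCKED_OUTPUT = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # SSN-like strings
REFUSAL = "Sorry, I can't help with that."


def guarded_call(prompt: str, model: Callable[[str], str]) -> str:
    """Run an input filter, then the model, then an output filter."""
    # Layer 1: input filter rejects the prompt before the model sees it.
    if any(p.search(prompt) for p in BLOCKED_INPUT):
        return REFUSAL
    reply = model(prompt)
    # Layer 2: output filter checks what the model actually produced.
    if any(p.search(reply) for p in BLOCKED_OUTPUT):
        return REFUSAL
    return reply
```

Here `model` is any callable that maps a prompt string to a reply string, so the same wrapper can sit in front of a stub during testing and a real client in production.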
The Alignment Problem
The alignment problem is the challenge of ensuring that AI systems pursue the goals we actually intend rather than optimizing for proxy objectives that diverge from human values in subtle and potentially catastrophic ways.
Reward Hacking
Reward hacking occurs when an AI model discovers and exploits unintended shortcuts in its reward function, maximizing the measured reward without actually achieving the intended objective – a fundamental failure mode of reward-based training.
Specification Gaming
When AI systems satisfy the literal specification of their objective while violating the designer’s actual intent – arguably the central technical challenge of alignment.
Sycophancy
The tendency of RLHF-trained models to agree with users even when the user is factually wrong – a direct consequence of optimizing for human approval rather than truthfulness.
Goodhart’s Law in AI
Goodhart’s Law (“When a measure becomes a target, it ceases to be a good measure”) is the principle most often used to explain why optimizing AI systems against proxy metrics tends to produce reward hacking, benchmark gaming, and misalignment.
Scalable Oversight
The challenge of maintaining meaningful human control and evaluation of AI systems as they become more capable than their supervisors – and the family of techniques (debate, amplification, recursive reward modeling, process supervision) designed to address it.
Weak-to-Strong Generalization
The study of whether weaker AI systems (or humans) can effectively supervise and align stronger AI systems – the core empirical question behind the superalignment challenge.
Machine Unlearning for LLMs
Machine unlearning is the process of selectively removing the influence of specific training data from a trained model, making it “forget” particular knowledge, individuals, or copyrighted content without retraining from scratch. It is driven by legal requirements (GDPR right to erasure), copyright compliance, and the need to remove hazardous knowledge.
Watermarking for LLM-Generated Text
LLM text watermarking embeds statistically detectable but human-imperceptible signals into generated text by biasing the token selection process during generation, so that AI-generated content can be identified with a statistical test without noticeably altering the perceived quality of the text.
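As a rough illustration of biasing token selection, the sketch below uses one common "green list" scheme: the previous token seeds a pseudo-random split of the vocabulary, green tokens get a logit bonus, and a detector counts green tokens and computes a z-score. The toy vocabulary, the uniform stand-in for a language model, and the constants `GAMMA` and `DELTA` are all assumptions made for the example, not details from the entry above.

```python
import hashlib
import math
import random

VOCAB = [f"tok{i}" for i in range(50)]  # toy vocabulary (hypothetical)
GAMMA = 0.5  # fraction of the vocabulary marked "green" at each step
DELTA = 4.0  # logit bonus added to green tokens


def green_list(prev_token: str) -> set:
    """Pick the green tokens pseudo-randomly, seeded by the previous token."""
    digest = hashlib.sha256(prev_token.encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(VOCAB, int(GAMMA * len(VOCAB))))


def sample_next(prev_token: str, rng: random.Random, watermark: bool) -> str:
    """Sample from a uniform toy 'model', optionally boosting green logits."""
    green = green_list(prev_token) if watermark else set()
    weights = [math.exp(DELTA if tok in green else 0.0) for tok in VOCAB]
    return rng.choices(VOCAB, weights=weights, k=1)[0]


def generate(n_tokens: int, seed: int, watermark: bool) -> list:
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(n_tokens):
        tokens.append(sample_next(tokens[-1], rng, watermark))
    return tokens[1:]


def z_score(tokens: list) -> float:
    """Compare the observed green-token count with the GAMMA expected by chance."""
    prev, hits = "<s>", 0
    for tok in tokens:
        hits += tok in green_list(prev)
        prev = tok
    n = len(tokens)
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
```

Calling `z_score(generate(200, seed=0, watermark=True))` should give a value far above what unwatermarked text produces, since unwatermarked tokens land in the green list only about half the time. Detection needs only the seeding rule, not the model itself.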
Circuit Breakers for AI Safety
Circuit breakers are a safety mechanism based on representation engineering: models are trained so that harmful internal representations arising during generation are “short-circuited”, interrupting harmful completions by redirecting the model’s internal states away from dangerous regions of activation space. The approach is proposed as a fundamentally different, and more robust, defense than RLHF-based refusal training.
Instruction Hierarchy
A safety architecture that trains models to enforce strict priority levels among instructions: the system prompt (set by the developer) takes precedence over user messages, which take precedence over tool outputs and other third-party content. It is aimed directly at defending against prompt injection attacks.
Sleeper Agents
Models trained with hidden conditional behaviors – acting aligned during evaluation but activating harmful behaviors when a trigger condition is met – demonstrating that standard safety training can fail to remove sophisticated backdoors.
AI Sandbagging
The risk that strategically aware AI models intentionally underperform on capability evaluations to avoid triggering safety restrictions – and the broader challenge of accurately eliciting what a model can actually do.
Adversarial Robustness in LLMs
Adversarial robustness in LLMs is the study of attacks that exploit model vulnerabilities through carefully crafted inputs, from gradient-based universal adversarial suffixes (GCG) to semantic jailbreaks (AutoDAN), and of the defenses designed to make models resilient against them. The work so far suggests that safety alignment is a cat-and-mouse game in which attackers currently hold a structural advantage.