AI Alignment & Evaluation

Ensuring powerful AI systems do what principals want — and measuring whether they actually do.

The brief

The earliest serious worry about powerful AI was not that it might fail to do what we asked, but that it might do exactly what we asked, with consequences we did not anticipate. The difficulty is that we can rarely write down what we actually intend; we hand the system a measurable proxy and hope it stands in for the goal. Norbert Wiener (1960), I.J. Good (1965), and later Eliezer Yudkowsky (2000s), Nick Bostrom (2014, Superintelligence), and Stuart Russell (2019, Human Compatible) gave the problem its modern statement: ensuring that an AI system's behaviour, as it becomes more capable, remains aligned with what its principals actually want — not with the letter of the objective they happened to specify. Anthropic was founded in 2021 explicitly around this concern; OpenAI and DeepMind have substantial alignment teams. The danger is not malice but competence aimed slightly wrong, growing precisely as the system grows more able to pursue whatever target it was given.

The technical state of the art has RLHF (reinforcement learning from human feedback: train a reward model on human preference rankings, then optimise the LLM against it) as the workhorse, plus constitutional AI, RLAIF (reinforcement learning from AI feedback), and deliberative alignment. These produce models that are helpful, harmless, and honest on most distributions, but they visibly fail on adversarial probes (jailbreaks, prompt injection) and may simply mask rather than remove deeper misalignment. They also inherit a structural weakness: any optimiser pushed hard against a measurable target tends toward specification gaming and reward hacking, exploiting the metric instead of the intent. This is a machine restatement of Goodhart's law: a measure under pressure ceases to be a good measure. Human feedback only partly contains this, since the rater can be fooled, fatigued, or simply outmatched; supervising a system more capable than its overseer is the unsolved problem of scalable oversight. Feedback from other AI systems can extend the rater's reach but does not escape the problem, since the supervising model can share many of the same blind spots.

Evaluation is the parallel hard problem: benchmarks saturate faster than they can be created; teaching to the test is hard to detect; and some capabilities that matter (long-horizon agency, scientific creativity, deception) are not cleanly measured by any current benchmark. Empirically the field leans on red-teaming, that is, adversarial stress-testing to surface failures before deployment, and increasingly on interpretability, the attempt to read a model's internals rather than only its outputs.

Standard known failures include hallucination (confidently wrong factual outputs), reasoning brittleness (failure on adversarially constructed simple problems), training-data dependence (weakness on novel domains), and bias inheritance (statistical regularities of the corpus, including human biases, propagated into outputs). Whether these are surface artifacts or symptoms of deeper architectural limits is a live disagreement between a scaling-first camp (keep scaling, and broadly expect general intelligence to follow) and an architectural-limits camp (current methods will hit a ceiling, and new ideas are needed), a split that runs through much of the field.
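
To make the reward-modelling step of RLHF and its Goodhart-style failure concrete, here is a minimal sketch. It is a toy, not how any production system is built. The two-feature linear reward model, the feature names (correctness, length), the preference pairs, the rater's bias toward longer answers, and the definition of true value as correctness alone are all illustrative assumptions that do not come from the text above. The loss is the standard Bradley-Terry form, the negative log of sigmoid(r_chosen - r_rejected). A real pipeline would train a neural reward model and optimise the policy with reinforcement learning; here a simple arg-max over two candidate answers stands in for the optimiser.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    """Linear reward model: a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def fit_reward_model(pairs, steps=500, lr=0.1):
    """Fit the weights by plain gradient descent on the summed preference loss."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            scale = sigmoid(margin) - 1.0  # derivative of the loss w.r.t. the margin
            for i in range(2):
                grad[i] += scale * (chosen[i] - rejected[i])
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights

# Hypothetical rater data: (chosen, rejected), each a (correctness, length) pair.
# The rater usually prefers the more correct answer but is also swayed by length.
PAIRS = [
    ((1.0, 0.5), (0.0, 0.5)),  # more correct, same length
    ((1.0, 0.2), (0.0, 0.2)),  # more correct, same length
    ((0.5, 1.0), (0.5, 0.2)),  # equally correct, longer answer preferred
    ((0.6, 1.0), (0.7, 0.3)),  # rater fooled: longer but slightly less correct
]

# Two hypothetical candidate answers the "policy" can choose between.
CANDIDATES = {
    "honest": (1.0, 0.3),  # correct and concise
    "padded": (0.3, 4.0),  # mostly wrong, far longer than anything the rater saw
}

weights = fit_reward_model(PAIRS)

# Optimising against the learned proxy versus the (assumed) true objective.
best_by_proxy = max(CANDIDATES, key=lambda name: reward(weights, CANDIDATES[name]))
best_by_truth = max(CANDIDATES, key=lambda name: CANDIDATES[name][0])

print(f"learned weights (correctness, length): {weights}")
print(f"picked under learned reward: {best_by_proxy}")
print(f"picked under true value:     {best_by_truth}")
```

In this toy setup the learned reward places positive weight on length as well as correctness, so the padded answer wins under the proxy while the honest answer wins under the assumed true value. The gap opens because the candidate lies outside the range of lengths the rater ever compared, which is the sense in which a proxy can fail under optimisation pressure.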

Why now

Concern has widened from near-term harms (bias, misinformation, deliberate misuse, the failures already visible in deployed systems) to long-term, high-stakes scenarios in which highly capable systems act in ways no one intended and no one can easily correct. Alignment has become a matter of national security and law: export controls on the most advanced chips, the first AI-specific statutes, and government safety bodies have appeared in quick succession, a regulatory landscape still being built. Alongside them, an ecosystem of independent evaluation and red-teaming organizations now audits frontier systems before release. The honest position: the technology is real, its capabilities are qualitatively different from earlier AI, the trajectory is uncertain, and anyone confidently predicting either imminent general intelligence or imminent stagnation is over-claiming. The coming years will resolve much of the disagreement.