Computer Science & AI

AI Alignment & Evaluation

Ensuring powerful AI systems do what principals want — and measuring whether they actually do.

The earliest serious worry about powerful AI was not that it might fail to do what we asked, but that it might do exactly what we asked, with consequences we did not anticipate. Norbert Wiener (1960), I.J. Good (1965), and later Eliezer Yudkowsky (2000s), Nick Bostrom (2014, Superintelligence), and Stuart Russell (2019, Human Compatible) gave the problem its modern statement: ensuring that an AI system's behaviour, as it becomes more capable, remains aligned with what its principals actually want. Anthropic was founded in 2021 explicitly around this concern; OpenAI and DeepMind have substantial alignment teams.

The technical workhorse is RLHF (reinforcement learning from human feedback): train a reward model on human preference rankings, then optimise the LLM against it. It is supplemented by constitutional AI, RLAIF (reinforcement learning from AI feedback), and deliberative alignment. These methods produce models that are helpful, harmless, and honest on most input distributions, but the models visibly fail on adversarial probes (jailbreaks, prompt injection), and the training may mask deeper misalignment rather than remove it. Evaluation is the parallel hard problem: benchmarks saturate faster than new ones can be created; teaching to the test is hard to detect; and some capabilities that matter (long-horizon agency, scientific creativity, deception) are not cleanly measured by any current benchmark. Standard known failures include hallucination (confidently wrong factual outputs), reasoning brittleness (failure on adversarially constructed simple problems), training-data dependence (weakness on novel domains), and bias inheritance (statistical regularities of the corpus, including human biases, propagated into outputs). Whether these are surface artifacts or symptoms of deeper architectural limits is a live disagreement between the scaling-pilled camp (Sutton, Sutskever, broadly: keep scaling; AGI follows) and the architectural-limits camp (LeCun, Marcus, much of academic AI: current methods will hit a ceiling).
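
The reward-model step of RLHF can be made concrete with a toy sketch. It assumes the common Bradley-Terry formulation, in which the probability that a human prefers one response over another is the sigmoid of the difference in their rewards, and it fits a linear reward model by gradient descent. The two features (a helpfulness score and a rudeness score), the toy preference pairs, and every function name are hypothetical, chosen only for illustration. Real reward models are neural networks over text, and the later step of optimising the LLM against the reward is not shown.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(chosen - rejected))."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin), written so that exp() never overflows
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def reward(weights, features) -> float:
    """Hypothetical linear reward model: a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))


def mean_loss(weights, pairs) -> float:
    """Average preference loss over (chosen, rejected) feature pairs."""
    losses = [
        pairwise_preference_loss(reward(weights, c), reward(weights, r))
        for c, r in pairs
    ]
    return sum(losses) / len(losses)


def train_reward_model(pairs, n_features, lr=0.1, epochs=200):
    """Fit the weights by stochastic gradient descent on the preference loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            scale = sigmoid(-margin)  # size of the gradient for this pair
            for i in range(n_features):
                # Descent step: raise the reward of the chosen response
                # relative to the rejected one.
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights


# Illustrative data only. Each response is [helpfulness, rudeness];
# the first item of each pair is the one the annotator preferred.
TOY_PAIRS = [
    ([0.9, 0.1], [0.4, 0.2]),
    ([0.7, 0.0], [0.8, 0.9]),
    ([0.6, 0.2], [0.3, 0.7]),
]

if __name__ == "__main__":
    weights = train_reward_model(TOY_PAIRS, n_features=2)
    print("learned weights:", weights)
    print("mean loss:", mean_loss(weights, TOY_PAIRS))
```

On this toy data the learned weights reward helpfulness and penalise rudeness. The sketch also shows in miniature why reward models can be gamed: the model only knows the features and preferences it was trained on, so anything the annotators did not rank is unconstrained.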

Why it matters now

AI alignment has become a national-security concern, with US export controls on advanced chips, Chinese domestic-AI investment, the EU AI Act (2024), US executive orders, the UK AI Safety Institute, and similar bodies elsewhere — a regulatory landscape being constructed in real time. Frontier model evaluations (METR's autonomous-task evaluations, Apollo Research's deception evaluations, the AI Safety Institutes' pre-deployment audits) are an emerging institutional layer. The honest polymath position: the technology is real, the capabilities are qualitatively different from earlier AI, the trajectory is uncertain, and anyone confidently predicting either imminent AGI or imminent stagnation is over-claiming. The next five to ten years will resolve much of the disagreement.

Polymathic — a curated catalogue of the ideas worth keeping across twelve disciplines. polymathic.app