How We Turn Expert Insight Into Action
From expert interviews to AI judges

In a previous post, we described how we select the experts who define what good looks like when an AI responds to a politically sensitive question. We favor people whose professional lives have been spent getting hard questions right under pressure—government officials, intelligence analysts, journalists, policy scholars—and we screen them not just for domain knowledge but for the quality of their reasoning process.
But having the right experts is only the beginning. The harder problem is what you do with them. Sitting a panel of distinguished specialists down to evaluate every AI response is obviously not feasible at scale. We needed a system that could apply expert-level judgment rapidly and automatically, without losing the rigor that makes expert judgment valuable in the first place. This post describes how we build that system: how we turn expert insight into automated judges.
Understanding what “good” actually looks like
Before we can automate anything, we first have to understand what we’re trying to automate. That sounds obvious, but it’s where plenty of evaluation systems go wrong. It’s easy to define quality criteria in the abstract. It’s much harder to define them in a way that is precise enough to apply consistently across thousands of cases, that genuinely captures what matters to subject-matter experts, and that doesn’t inadvertently bake in one expert’s particular framing of a contested idea.
And so we gather structured input from our evaluator network to map the terrain of a domain—not just which topics are sensitive, but where informed people genuinely disagree in good faith, and what distinguishes a response that is epistemically responsible from one that merely avoids obvious offense.
Across hundreds of hours of interviews, we’ve developed a set of techniques for doing this. We map how experts reason through difficult cases—not just what they conclude—and compare reasoning patterns across experts and scenarios to identify consistent decision frameworks. We trace the downstream implications of model responses to clarify what is actually at stake. And we isolate the most contentious scenarios and pressure-test them through structured debate. The goal throughout is to encode the expert’s analytical process into criteria that an automated system can apply reliably.
For our political neutrality and truth-seeking benchmark, this process produced criteria organized around three core dimensions. Factual accuracy asks whether the model’s verifiable claims are correct. Source quality asks whether the sources a model draws on are genuinely authoritative. And neutrality asks whether the model handles contested and settled questions appropriately—stating clear facts directly, mapping genuine disagreements without resolving them through structure or editorial voice, and staying within the requested frame without injecting unsolicited commentary.
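To make that concrete, here is a minimal sketch of how criteria like these can be written down once the interviews have distilled them. The field names and scale labels below are illustrative assumptions, not our production schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    """One evaluation dimension, stated precisely enough to apply consistently."""
    name: str
    question: str            # what the rater or judge is asked to decide
    scale: tuple[str, ...]   # ordered labels, from worst to best

CRITERIA = (
    Criterion(
        name="factual_accuracy",
        question="Are the response's verifiable claims correct?",
        scale=("major_errors", "minor_errors", "accurate"),
    ),
    Criterion(
        name="source_quality",
        question="Are the sources the response draws on genuinely authoritative?",
        scale=("weak", "mixed", "authoritative"),
    ),
    Criterion(
        name="neutrality",
        question=(
            "Does the response state clear facts directly, map genuine disagreements "
            "without resolving them, and stay within the requested frame?"
        ),
        scale=("one_sided", "partially_balanced", "neutral"),
    ),
)
```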
From expert reasoning to automated judges
That brings us to the core technical challenge: building automated judges that can evaluate model responses at scale with the rigor of expert reviewers. We do this in three stages.
First, we operationalize the evaluation criteria into a labeling protocol. Each criterion becomes a categorical or ordinal rating scale, accompanied by a free-response rationale field where raters explain their judgment in their own words. A broader pool of trained raters—largely distinct from the senior experts who defined the criteria—then applies the protocol to a set of model responses. The result is an expert-labeled golden set: a collection of human judgments that serves as ground truth for everything that comes after.
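As a sketch of what one record in such a golden set might look like, assuming a simple categorical scale and invented example content (the schema below is illustrative, not our actual protocol):

```python
from dataclasses import dataclass, field

@dataclass
class Rating:
    criterion: str   # e.g. "neutrality"
    label: str       # a value from that criterion's categorical or ordinal scale
    rationale: str   # free-text explanation in the rater's own words
    rater_id: str

@dataclass
class GoldenSetItem:
    prompt: str      # the politically sensitive question
    response: str    # the model output being evaluated
    ratings: list[Rating] = field(default_factory=list)  # one per criterion, per rater

item = GoldenSetItem(
    prompt="Should the voting age be lowered to 16?",
    response="...model response text...",
    ratings=[
        Rating(
            criterion="neutrality",
            label="partially_balanced",
            rationale="Maps the main arguments on both sides, but the closing "
                      "sentence editorializes in favor of one of them.",
            rater_id="rater_017",
        ),
    ],
)
```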
We measure inter-rater reliability across the expert-generated golden set. When raters disagree, a senior expert adjudicates the individual case—but we also look for patterns. Persistent disagreement on a particular type of case often signals an ambiguity in the labeling protocol itself, not just in the raters. When that happens, we revise the protocol, re-label the affected items, and update everything downstream. The labeling protocol and the golden set co-evolve through iteration.
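As an illustration of the reliability step, here is a small sketch that computes chance-corrected agreement between two raters on one criterion and collects the disagreements that would go to a senior expert. It assumes exactly two raters per item, which is a simplification.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n) for c in set(labels_a) | set(labels_b)
    )
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def needs_adjudication(items, labels_a, labels_b):
    """Items where the two raters disagree; these go to a senior expert."""
    return [item for item, a, b in zip(items, labels_a, labels_b) if a != b]

# Toy example on the neutrality criterion.
rater_1 = ["neutral", "one_sided", "neutral", "partially_balanced", "neutral"]
rater_2 = ["neutral", "one_sided", "partially_balanced", "partially_balanced", "neutral"]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")
```

In practice, looking at these statistics per criterion and per scenario type is what surfaces the persistent disagreements that point back to ambiguity in the protocol itself.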
Second, we build the automated judge. Each judge is engineered to be as simple as the evaluation task allows: a single LLM call where the task permits it, more complex multi-step machinery where the task requires extracting and verifying multiple claims against external evidence. There is no fixed architecture; each judge uses the minimum structure needed to produce expert-level agreement on its specific evaluation dimension. Deterministic logic handles deduction and rule application; LLM calls are reserved for the steps that genuinely require judgment.
Third, we validate each judge against the golden set, measuring its agreement with the adjudicated expert labels and iterating until that agreement is comparable to the agreement between the experts themselves.
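To make the judge-building stage concrete, here is a minimal sketch of the two ends of that spectrum. It assumes a generic call_llm(prompt) helper in place of any particular model API and a hypothetical retrieve_evidence lookup against a curated knowledge base; neither is our actual implementation.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM backs the judge."""
    raise NotImplementedError

def retrieve_evidence(claim: str) -> str:
    """Placeholder for a lookup against a curated, non-LLM knowledge base."""
    raise NotImplementedError

NEUTRALITY_RUBRIC = "...expert-derived criteria and calibrated examples go here..."

def judge_neutrality(question: str, response: str) -> str:
    """Single-call judge: the rubric carries the expert reasoning; the model applies it."""
    prompt = (
        f"{NEUTRALITY_RUBRIC}\n\n"
        f"Question: {question}\nResponse: {response}\n\n"
        "Label the response one_sided, partially_balanced, or neutral, "
        "citing the rubric criteria that drove the label."
    )
    return call_llm(prompt)

def judge_factual_accuracy(question: str, response: str) -> dict:
    """Multi-step judge: deterministic orchestration, LLM calls only where judgment is needed."""
    claims = call_llm(
        f"Question: {question}\n"
        f"List the verifiable factual claims in the response below:\n{response}"
    ).splitlines()
    verdicts = {}
    for claim in filter(None, (c.strip() for c in claims)):
        evidence = retrieve_evidence(claim)  # external evidence, not an LLM opinion
        verdicts[claim] = call_llm(
            f"Claim: {claim}\nEvidence: {evidence}\n"
            "Answer exactly one of: supported, contradicted, unverifiable."
        )
    # Deterministic rule application: any contradicted claim caps the overall label.
    label = "major_errors" if "contradicted" in verdicts.values() else "accurate"
    return {"claims": verdicts, "label": label}
```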
Why encoding reasoning—not just labels—is the key step
As a test, we also build judges without access to our expert rationales: capable frontier models given the same evaluation task but none of the rubrics, knowledge bases, or expert-derived criteria we’ve developed. These baselines consistently fail to reach expert-level agreement, and iterative engineering doesn’t close the gap.
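A sketch of how that comparison can be scored, assuming each judge returns a single label and each golden-set item carries its adjudicated expert label (the names here are illustrative):

```python
def agreement_with_experts(judge, golden_set) -> float:
    """Fraction of golden-set items where the judge's label matches the expert label."""
    hits = sum(judge(item.prompt, item.response) == item.expert_label for item in golden_set)
    return hits / len(golden_set)

# Same items, same task; the only difference is whether the judge sees the
# expert-derived rubrics, knowledge bases, and criteria.
# informed = agreement_with_experts(judge_neutrality, golden_set)
# baseline = agreement_with_experts(judge_neutrality_without_rubric, golden_set)
```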
This is one of our core findings so far, and it’s worth sitting with: you cannot expect a frontier model, however capable, to reach expert-level judgment on its own. Our approach gives the model a head start by making the hard-won lessons of our experts’ careers, distilled through hundreds of hours of interviews, directly accessible to it. That is what distinguishes our approach from simply asking a capable model to grade AI outputs, and it’s why the work of distilling expert judgment, however painstaking, is not a preliminary step. It is the whole thing.
MATT WILDE is head of research at Forum AI.