We're Teaching AI to Think Like World-Class Experts
Here's how

In my last post, I called for “an independent, expert-driven infrastructure that both sides of the political spectrum, and the foundation model companies themselves, can trust”—infrastructure that would help AI companies build models that are truth-seeking, politically neutral, and capable of providing accurate, trusted information on the hardest political topics and breaking news events. I said this is what we’re building at Forum AI, and promised more details. This is the first in a series of posts about what we’re building and how. Here, I’ll lay out the key building blocks of our approach; future posts will go deeper on each one in turn.
The Requirements
Any serious effort to evaluate how AI handles politically sensitive content has to satisfy four constraints simultaneously.
- First, independence. AI companies cannot reliably and objectively grade their own homework. They have enormous commercial and reputational incentives that shape how their models handle contested political topics, and even the most well-intentioned internal evaluation teams carry the ideological composition of the organizations they work for. Credible evaluation requires institutional separation both from the companies whose models are being evaluated and from the government agencies that have their own political stakes in the outcome.
- Second, expertise. The questions at the heart of political AI evaluation—whether a model is accurately representing the state of knowledge about an ongoing conflict, whether it is fairly presenting the strongest versions of competing interpretations of a contested policy, whether it is calibrating its confidence appropriately given the available evidence—are not questions that can be answered by general-purpose annotators or crowdsourced labelers. They require people who have spent their careers studying these kinds of problems and developing intellectually honest and fair analytical frameworks that are insulated from personal political commitments.
- Third, scalability. The leading AI models field millions of queries on politically sensitive topics every day, across hundreds of issue areas and in dozens of languages, and any evaluation infrastructure that relies exclusively on human judgment will be overwhelmed before it begins. The system has to be capable of applying rigorous, expert-informed evaluation at a pace and scale that matches the volume of content the models actually produce.
- Fourth, responsiveness. Political events do not wait for evaluation cycles to conclude. When a crisis breaks—strikes on Iran, a contested election, a major policy reversal—millions of users turn to AI for real-time information within hours, and the models’ handling of those early, high-stakes queries is often where the most consequential errors occur. The evaluation infrastructure has to be capable of operating on the timescale of breaking news, not on the timescale of quarterly reports.
The Core Challenge
These four requirements create an immediate tension. Expert humans don’t scale and aren’t available around the clock; automated systems scale beautifully but lack the nuanced judgment that adjudicating contested political questions demands. Meanwhile, the existing landscape of LLM-as-judge approaches for political bias is thin, with little published evidence of validity against the kinds of hard epistemic questions that matter most—questions about how to weigh competing sources in a rapidly evolving conflict, or how much confidence to place in a claim when the underlying intelligence is ambiguous and the political stakes are high.
What we need, in other words, is a way to take the deep, hard-won intellectual processes of world-class experts and render them in a form that can be applied automatically, rapidly, and at scale—without losing the substance that makes expert judgment valuable in the first place.
How We’re Building It
At Forum AI, we have experts who have devoted their careers to tackling exactly these kinds of problems—people with distinguished records in national intelligence, journalism, academic research, and foreign policy who know how to answer hard questions in truth-seeking ways. And we have engineers who are experts in developing scaled AI systems that replicate human judgment with high fidelity. We’ve brought them together to create what we believe is the first LLM judge system that is deeply informed by the experts’ own intellectual processes, rather than simply trained on their output labels.

The approach proceeds in several stages.
We begin by defining the range of political topics and questions that our system will evaluate, drawing on a wide variety of data harvested from online sources to ensure comprehensive coverage of the issue areas and question types that users actually bring to AI models.
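As a purely illustrative sketch, the snippet below shows the kind of coverage check one could run over a harvested prompt set to flag under-covered issue areas. The topic labels, prompt fields, and counts here are hypothetical, not our actual taxonomy or data.

```python
from collections import Counter

# Hypothetical topic labels and prompt records; the real taxonomy,
# sources, and volumes are not described in this post.
TOPICS = ["immigration", "elections", "middle_east", "climate", "guns"]

def coverage_report(prompts: list[dict]) -> dict[str, int]:
    """Count harvested prompts per topic to flag under-covered issue areas."""
    counts = Counter(p["topic"] for p in prompts)
    return {topic: counts.get(topic, 0) for topic in TOPICS}

# Two example prompts; three topics come back uncovered.
prompts = [
    {"topic": "immigration", "text": "What is the current asylum backlog?"},
    {"topic": "elections", "text": "Were the 2020 election results audited?"},
]
print(coverage_report(prompts))
# {'immigration': 1, 'elections': 1, 'middle_east': 0, 'climate': 0, 'guns': 0}
```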
We then sit down for extended sessions with our experts to understand, in granular detail, how they approach difficult political questions. What does a good answer actually look like to them? How would they go about breaking down an LLM’s response to a politically sensitive prompt, systematically investigating different components of truth-seeking, neutrality, evidential reasoning, source diversity, appropriate confidence calibration, and so on? We translate these deliberations into a structured set of evaluative criteria that together represent the crucial analytical dimensions of a thoughtful, intellectually honest response to a hard political question.
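To make that concrete, here is a minimal sketch of how a structured set of evaluative criteria could be represented in code. The dimension names echo the ones listed above; the scale, grader questions, and rubric anchors are illustrative assumptions, not our actual criteria.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One evaluative dimension, scored on an illustrative 1-5 scale."""
    name: str
    question: str            # what the grader asks about the response
    anchors: dict[int, str]  # rubric anchors for the lowest and highest scores

# Dimension names come from the post; the wording of the questions and
# anchors is hypothetical, not Forum AI's actual rubric.
RUBRIC = [
    Criterion(
        name="truth_seeking",
        question="Does the response accurately represent the current state of knowledge?",
        anchors={1: "Asserts contested claims as settled fact.",
                 5: "Distinguishes established facts from open questions."},
    ),
    Criterion(
        name="neutrality",
        question="Are the strongest versions of competing interpretations presented fairly?",
        anchors={1: "Presents only one side's framing.",
                 5: "Steel-mans each major position without editorializing."},
    ),
    Criterion(
        name="confidence_calibration",
        question="Is the stated confidence proportionate to the available evidence?",
        anchors={1: "Expresses certainty despite ambiguous evidence.",
                 5: "Hedges in proportion to the strength of the evidence."},
    ),
]
```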
Next, we take the criteria we have developed and give them back to our experts as an evaluation exercise, paired with real responses from real models on real prompts. Again, we sit with them—observing where the criteria work as intended, where the edge cases expose ambiguities or gaps, where reasonable experts applying the same criteria in good faith arrive at different conclusions. We iterate back and forth: redefining criteria, collecting additional expert feedback, testing for interrater reliability, refining the granularity and specificity of the rubrics until the evaluation framework produces judgments that our experts recognize as capturing what they would have concluded on their own.
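As an illustration of the reliability checks in that loop, here is a minimal sketch that computes mean pairwise exact agreement among a few hypothetical expert raters. In practice a chance-corrected statistic such as Cohen's kappa or Krippendorff's alpha would be the natural choice; the scores shown are invented.

```python
from itertools import combinations

def pairwise_agreement(ratings: dict[str, list[int]]) -> float:
    """Mean exact-agreement rate across all pairs of raters.

    `ratings` maps a rater ID to their scores for the same ordered items.
    Exact agreement keeps the sketch dependency-free; a chance-corrected
    statistic would be used in a real interrater-reliability analysis.
    """
    pairs = list(combinations(ratings.values(), 2))
    agree = [
        sum(a == b for a, b in zip(r1, r2)) / len(r1)
        for r1, r2 in pairs
    ]
    return sum(agree) / len(agree)

# Three hypothetical experts scoring the same five responses on one criterion.
scores = {
    "expert_a": [5, 4, 2, 5, 3],
    "expert_b": [5, 4, 3, 5, 3],
    "expert_c": [4, 4, 2, 5, 3],
}
print(round(pairwise_agreement(scores), 2))  # 0.73
```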
Finally, once we have developed a set of criteria and evaluation procedures that yield strong, consistent human judgment, we bring in our LLM judges. Critically, these judges are calibrated not only on the expert evaluations of real model responses but also on the entire reasoning processes that produced those labels. The judges are imbued with the experts’ stated analytical frameworks, their criteria, their ways of thinking through hard cases, and the specific kinds of considerations they bring to bear when a question is genuinely difficult. We iterate on this stage as well, looking for evidence of strong correlation between the expert evaluations and the evaluations produced by our automated judges, and refining the system until the alignment is robust.
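As a sketch of what checking for that alignment could look like, the snippet below compares hypothetical LLM-judge scores against a hypothetical expert consensus on one criterion. The metrics and the data are illustrative assumptions, not our actual calibration pipeline or thresholds.

```python
# statistics.correlation requires Python 3.10+
from statistics import correlation, mean

def judge_alignment(expert: list[float], judge: list[float]) -> dict[str, float]:
    """Compare LLM-judge scores with the expert-panel consensus on the same
    responses: Pearson correlation plus mean absolute score difference.
    What counts as "robust" alignment is a design choice, not reported here."""
    return {
        "pearson_r": correlation(expert, judge),
        "mean_abs_diff": mean(abs(e - j) for e, j in zip(expert, judge)),
    }

# Hypothetical scores for six responses on one criterion (1-5 scale).
expert_consensus = [4.3, 2.0, 5.0, 3.7, 1.3, 4.0]
llm_judge = [4.0, 2.5, 5.0, 3.5, 1.5, 4.5]
print(judge_alignment(expert_consensus, llm_judge))
```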
The final result is an automated, rapid, highly scalable LLM judge for each difficult political area that fairly represents what a group of human experts would conclude if they evaluated the content themselves.
Caveats
No automated system—no matter how carefully calibrated—will perfectly replicate expert human judgment in every case, and we do not claim otherwise. There will be edge cases where reasonable experts disagree with each other and where our judges’ outputs reflect one defensible position among several. The system is designed to capture the central tendency of expert judgment, not to eliminate the irreducible disagreements that characterize genuinely contested questions. For this reason, we have a system in place to escalate the hardest judgment calls to our panel of distinguished bipartisan experts and to our leadership team.
Why This Matters
The political stakes around AI bias are not going to diminish. The volume of users turning to AI for information about politics and breaking news is growing rapidly, the policy infrastructure for evaluating AI is being built right now, and the consequences of getting this wrong—whether through genuine ideological distortion or through politically motivated accusations of bias that lack empirical grounding—are severe. What is needed is an approach that is rigorous enough to be credible, independent enough to be trusted, and scalable enough to be useful. That is what we are building, and in the following posts in this series, I will go deeper on each of the building blocks I’ve described here: our expert elicitation process, our criteria development methodology, our judge calibration pipeline, and the results we’re seeing so far.
ANDY HALL is the Davies Family Professor of Political Economy at Stanford GSB and a Senior Fellow at the Hoover Institution, as well as an advisor to Forum AI.