Ranjit Bhatia

How We Pick the Right Experts to Evaluate AI

Four principles that guide our work

Photo by Andy Tyler on Unsplash, stylized by Gemini

As usage of AI continues to explode, there’s a growing consensus that AI models need high-quality evaluation of their performance on politically sensitive topics and breaking news. These areas are ripe for misinformation and ideological echo chambers, so responses need to be balanced, reasoned, and factually accurate.

Until now, nearly all the conversations around this topic have centered on methodology—benchmarks, rubrics, and metrics. These give us a sense of precision because they are quantifiable. But metrics are only as good as the people developing and applying them. That’s why we at Forum AI focus so much on who the evaluators are, how they bring their judgment to bear when applying these methodologies, and why we should trust their judgment.

The deepest risk in AI evaluation is not just measuring the wrong things but asking the wrong people to do the measuring.

Failure modes in the current landscape

The selection of experts is itself a design choice, and arguably the most consequential one. We would even go so far as to say that the composition, experience, and domain expertise of your expert panel determines whether your evaluations are credible in the first place.

This is the glaring gap that we see today, and it manifests in three primary failure modes:

1. Self-evaluation by AI companies

Many of the most prominent foundation model companies conduct their own evaluations. There is no doubt that self-assessments are useful up to a point. They are often the fastest way to catch regressions, run basic sense checks, and monitor whether a model is improving on narrow internal metrics. For product iteration, that kind of feedback loop is essential. The problem is that internal evals are best suited to optimization, not independent quality control.

Further, inherent in this process is an inescapable conflict of interest: The same organization that built the system is also deciding how to test it, which failures matter, what thresholds count as acceptable, and which results get disclosed. No company even needs to be acting in bad faith for self-evaluation to become a problem. Even well-intentioned teams face institutional pressure to define success in ways that are legible internally, defensible externally, and favorable to the company.

There will always be pressure, whether overt or subtle, to frame evaluations in ways that are sympathetic to the product and to publicize results selectively when they reinforce the desired story. That is why self-evaluation, while useful, cannot be the final word on credibility. Or, to put it simply, it’s why AI companies can’t grade their own homework.

2. Crowd-sourced / mass-rater evaluation

A large pool of reviewers is invaluable for making evaluations happen at scale and making them more readily accessible. However, it’s important to keep in mind the kinds of situations in which mass raters are typically useful, and those in which they aren’t.

Surface-level quality and attributes that are generally comprehensible (e.g., “is this response coherent?”) work well in this setup. However, these types of processes fail on the hard questions, which is arguably where running rigorous evals is most critical. The median rater lacks the background to effectively assess, for example, whether a response to a question on Iranian nuclear negotiations fairly represents the strongest versions of competing arguments. Or to determine whether a response about civilian casualty figures in an active conflict clearly distinguishes between verified facts, provisional reporting, and politically interested claims—and is honest about what remains unknown. Making such calls is not simply a matter of comprehension. It is a matter of judgment.

Ultimately, volume-based approaches tend to flatten evaluation toward what the median rater can comfortably recognize. But the hardest political and breaking-news questions require precisely the opposite: domain fluency, contextual judgment, and sensitivity to nuance.

3. Automated benchmarks

Automated benchmarks can be useful within a narrow domain. Broadly speaking, they work by scoring model outputs against predefined answers or grading rules across a standardized set of tasks, which makes them efficient and repeatable when the ground truth is clear. If you want to know whether a model extracted the right date, identified the correct person, or answered a settled factual question, automated benchmarks can work extremely well. But they weaken rapidly as the task shifts from retrieval to judgment.
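
To make the contrast concrete, here is a minimal Python sketch of the kind of exact-match scoring loop an automated benchmark runs. The task data and normalization rule are hypothetical, not any specific benchmark’s implementation: when a single settled answer exists, this works; when the question is contested, there is no predefined target for the rule to compare against.

```python
# Minimal sketch of an exact-match benchmark scorer.
# Task contents and the normalization rule are illustrative, not any real benchmark.

from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    target: str  # the predefined "correct" answer


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count as errors."""
    return " ".join(text.lower().split())


def score(tasks: list[Task], outputs: list[str]) -> float:
    """Return the fraction of outputs that exactly match the predefined target."""
    hits = sum(normalize(out) == normalize(task.target) for task, out in zip(tasks, outputs))
    return hits / len(tasks)


# Works well for settled facts with a single right answer...
factual = [Task("In what year did the Berlin Wall fall?", "1989")]
print(score(factual, ["1989"]))  # 1.0

# ...but a contested, judgment-heavy question has no predefined target,
# so exact matching (or any fixed grading rule) has nothing meaningful to compare against.
```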

On contested or high-stakes questions, the central issues are usually more complex—whether an answer frames uncertainty honestly, weighs competing interpretations fairly, resists false precision, and avoids smuggling in unwarranted assumptions. Those are context-dependent judgments. A benchmark can tell you whether a model matched a target answer; it is much worse at telling you whether the reasoning was sound and intellectually honest.

The common thread in all of these failure modes? None—not self-evaluation, not crowd-sourcing, not automation—brings genuine domain expertise to bear on questions that fundamentally require it.

Forum AI’s design principles for expert selection

At Forum AI, expert selection is one of our most important focus areas. We have distilled our approach to four salient principles:

1. Our experts are trained to reason dispassionately about contested issues

Many would say that no human can be truly “neutral,” and we would agree. We don’t expect our experts always to be non-partisan. What we do require is that they understand the various perspectives and schools of thought within a topic and can frame the relevant arguments fairly and dispassionately.

These are professionals who have built careers analyzing contested questions and arriving at independent judgments. Their skill lies in setting aside their own priors and reasoning carefully from first principles. This is the discipline we would expect a federal judge to bring to a case, or a seasoned foreign correspondent to bring to a war zone.

Bipartisan representation certainly matters, and we will always strive for a diverse and representative group of experts. But ultimately that is not an end in and of itself. Indeed, striving for a perfectly balanced nose count can often vitiate what really matters: having dispassionate, independent thinkers who can “cross the aisle” to see and understand different perspectives.

2. Our experts have high-profile reputations as a structural safeguard

Our expert pool does not have any anonymous raters. They are experts with public reputations that they have spent decades building, and which they are keen to protect. This creates a powerful structural incentive. Their public standing makes quiet capture, pressure, or expedient compromise materially harder. Their names are on the line, which means their independence is self-reinforcing. This sets them apart from anonymous or lower-profile evaluator pools, where this protective mechanism is absent.

We see reputation as part of our institutional design. Not only is reputation a strong signal of an expert’s domain expertise, but the prominence and visibility of the panel is itself a governance mechanism.

3. Combined, our experts can be smarter than AI on the hardest questions

This is probably our most ambitious claim: Our experts reason better than current AI systems about complex political issues. With AI approaching human levels of expertise across multiple domains, you might wonder how that could be. Let us explain.

On contested political terrain—where information is incomplete, stakes are high, and reasonable people disagree—domain experts with decades of experience can outperform even the best models, not because they know more raw facts, but because they exercise better judgment. They know which sources are genuinely authoritative in context, which claims are load-bearing, which uncertainties are substantive rather than cosmetic, and when a neat, confident answer is actually misleading.

Models are trained to predict plausible language from patterns in prior text. Experts bring something completely different: tacit knowledge, source discrimination, institutional memory, and lived experience of reasoning under ambiguity. That is exactly what the hardest political and breaking-news evaluations demand.

This means that rather than merely “catching errors,” the expert panel is generating training signal that can make AI smarter, more nuanced, and more intellectually honest on exactly the topics on which it currently falls short. This is why having our own LLM judges trained by experts is so powerful. The goal is ultimately to close the gap between how AI handles hard political questions and how the world’s best minds handle them.
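
As an illustration of what “generating training signal” can mean in practice, here is a minimal sketch of one common pattern: aggregating panel ratings and converting them into pairwise preferences that a judge model can be trained on. The data shapes, field names, and aggregation rule are assumptions made for the sake of example, not a description of Forum AI’s actual pipeline.

```python
# Illustrative sketch: turning expert panel ratings into pairwise preference data,
# a common format for training an LLM judge. Field names, the 1-7 rubric, and the
# aggregation rule are assumptions for this example, not Forum AI's actual pipeline.

from dataclasses import dataclass
from statistics import mean


@dataclass
class ExpertRating:
    expert_id: str
    response_id: str
    score: float  # e.g., a 1-7 rubric score for fairness, honesty about uncertainty, etc.


def aggregate(ratings: list[ExpertRating]) -> dict[str, float]:
    """Average each response's scores across the expert panel."""
    by_response: dict[str, list[float]] = {}
    for r in ratings:
        by_response.setdefault(r.response_id, []).append(r.score)
    return {rid: mean(scores) for rid, scores in by_response.items()}


def preference_pairs(avg_scores: dict[str, float], margin: float = 0.5) -> list[tuple[str, str]]:
    """Emit (preferred, rejected) pairs only where the panel clearly favors one response."""
    pairs = []
    items = list(avg_scores.items())
    for i, (a, score_a) in enumerate(items):
        for b, score_b in items[i + 1:]:
            if score_a - score_b >= margin:
                pairs.append((a, b))
            elif score_b - score_a >= margin:
                pairs.append((b, a))
    return pairs
```

In practice the rubric dimensions, aggregation rule, and handling of expert disagreement require far more care; the point is simply that expert judgments can be distilled into a signal a judge model can learn from.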

4. We select for diversity of expert type, not just viewpoint

Our experts span a diverse range of backgrounds, which leads to a whole that is greater than the sum of its parts. For example, in evaluating a fast-moving story about a disputed military strike, a former policymaker may be best positioned to assess strategic incentives and official signaling; an investigative journalist may be better at distinguishing firsthand reporting from narrative laundering; and an academic may be strongest at placing the event in its deeper historical and geopolitical context. We want people with those specialties working together, because each catches distortions the others might miss.

Their differences are complementary as well as mutually correcting, and our panel is designed to harness those advantages. Political and social questions aren’t homogeneous; evaluating them effectively requires the collective expertise of many different kinds of experts.

Why this matters most on breaking news

Breaking news creates a perfect storm of conditions under which models can break, making it the most important use case for expert-built evaluations. When a crisis hits, AI systems are mediating the experience for millions of people in real time, but it’s a very hard thing for them to do well. Information is incomplete and often opaque, narratives are contested or obscured, political stakes are sky-high, and the cost of getting it wrong can be massive.

It is precisely at this moment that the composition of your expert panel matters most. Automated benchmarks don’t have a framework that can flexibly adapt to events that have never happened before, and crowdsourced raters without domain knowledge are often as confused as everyone else. But experts with deep knowledge have the ability to assess whether AI is handling uncertainty honestly, representing the range of credible interpretations, and giving users the right frameworks to think through stories and issues themselves.

Conclusion

In AI evaluation, methodology matters, but the people creating and applying it matter just as much. A panel built on independence, domain expertise, public accountability, and diversity of knowledge can provide a form of scrutiny that other approaches struggle to match. That is the standard we think high-stakes AI evaluation should be held to, and the standard we are building toward at Forum AI.

RANJIT BHATIA is Head of Product Operations at Forum AI.