Andy Hall, Robbie Goldfarb

Introducing NewsBench

An independent audit of how AI handles the news


Ask one of the leading AI chatbots a question about the upcoming midterm elections, and there is a 90% chance the response will be flawed in some material way: a factual error, a clear partisan lean, a citation to a foreign state-controlled outlet, or some combination of the three.

Over the past several months, Forum AI evaluated 3,136 prompts across four of the most-used chatbots in America, generating 12,542 expert-judged responses spanning politics, foreign affairs, the economy, healthcare, education, and everyday consumer questions. To our knowledge, it is the largest independent assessment of AI on news and current events ever conducted.

This week we launched NewsBench, our benchmark that measures how the world’s leading AI models handle the news in a rapidly changing environment. NewsBench uses our proprietary technology to assess how a group of the world’s leading, bipartisan experts would evaluate news at scale for factual accuracy, neutrality, and source quality.

A benchmark trained by experts

AI is rapidly becoming part of the basic infrastructure through which Americans encounter the news. According to the Reuters Institute, 6% of users across dozens of countries now use AI chatbots weekly for news, double the share from a year earlier. Students rely on chatbots for current-events research, governments use them to plan and execute, and companies use them in myriad ways to pursue their business. AI represents a tremendous opportunity to equip us all with more information to make better decisions, but only if it can bring expert-level judgment to the most significant and most challenging questions.

We built NewsBench based on the fundamental belief that the people best positioned to evaluate AI on these kinds of questions are the people who have spent their careers trying to get them right under pressure. Several of us on the founding team come from journalism, including experience anchoring major national programs and building news products used by billions of people.

We worked with a bipartisan network of senior experts, including former Cabinet officials, top economists, former Congressional leaders, journalists, and national security veterans, to define what “good” looks like across each evaluation dimension. We then calibrated a set of AI judges, iterating until the judges reached 86% agreement with that consensus, which gives us the ability to run thousands of evaluations at expert-level quality. Our results show that our specially calibrated judges massively outperform basic LLMs in matching the evaluations of our experts at scale. The full methodology is laid out in our white paper.
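To make the agreement figure concrete, here is a minimal sketch of how a judge-versus-expert agreement rate could be computed. It assumes simple pass/fail labels and exact-match agreement; the function name and the sample labels are hypothetical, and the actual calibration procedure is the one described in the white paper, not this one.

```python
def agreement_rate(judge_labels, expert_labels):
    """Fraction of items where the AI judge's label matches the expert consensus label.

    Assumption: one label per item on each side, compared by exact match.
    """
    if len(judge_labels) != len(expert_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(judge_labels)


# Made-up labels for seven responses (illustration only, not NewsBench data).
expert_consensus = ["pass", "fail", "pass", "pass", "fail", "pass", "pass"]
ai_judge = ["pass", "fail", "fail", "pass", "fail", "pass", "pass"]

rate = agreement_rate(ai_judge, expert_consensus)
print(f"Agreement with expert consensus: {rate:.0%}")
```

In this toy case the judge matches the consensus on six of seven items. Raw agreement is the simplest possible measure; a chance-corrected statistic may be more appropriate when one label dominates.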

Here are some of the main things we found.

NewsBench homepage: https://byforum.com/newsbench

Accuracy: small errors add up

About 30% of all responses evaluated in our dataset contained at least one verifiable factual error, like wrong dates, wrong numbers, wrong attributions, or wrong policy details. About one in three responses on voting-relevant topics ahead of the midterms, including election procedures, public opinion, Iran, the economy, and AI, contained errors.

Different models performed quite differently from one another. ChatGPT was the most accurate by a wide margin, with about 9% of its responses containing at least one inaccuracy. Gemini sat in the middle at 25%, with Claude (41%) and Grok (43%) lagging well behind. Claude’s results were particularly striking. It had the strongest average source quality score in our whole evaluation, but more than four in ten of its answers still contained at least one false claim.

Finance and markets were a particular weak spot, with Claude (43%) and Grok (37%) exhibiting the highest rates of failure. These were often precise-sounding mistakes that a user would be unlikely to catch, such as a stock price off by a few dollars, a market date shifted by one day, or a return figure that confused year-to-date with trailing twelve months.

For example, Gemini said Arkansas ACA premiums were rising by 65% to 67% in 2026, when the approved weighted average increase was about 22%. In an answer about U.S.-Iranian tensions, Grok said U.S. assessments found no effective Iranian navy, air force, or advanced air defenses remained operational, even though public reporting described Iran’s capabilities as degraded, not erased. Claude misattributed two campaign-strategy quotes about Democrats’ social-media presence to Representative Raul Grijalva, who had died months earlier, when NPR had attributed them to Adelita Grijalva. And in a childcare-cost comparison, Claude, Gemini, and ChatGPT all made errors that exaggerated America’s outlier status by overstating U.S. costs, understating costs abroad, or misreading OECD benchmarks.

Neutrality: every chatbot picks a side

In our evaluation, almost a quarter of all responses failed our neutrality check. ChatGPT’s neutrality failures leaned to the left about 16 times more often than to the right, Claude’s 13 times more often, and Gemini’s 5 times more often. Grok ran the other way, failing rightward 4 times more often than leftward and producing the overwhelming majority of right-leaning responses in the evaluation. On election prompts the pattern hardened: every one of Claude’s directional failures leaned left, as did 90% of Gemini’s and 92% of ChatGPT’s; Grok’s leaned right 76% of the time.
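A ratio such as "16 times more often" can be read as the count of failures in the dominant direction divided by the count in the other direction. The sketch below shows that arithmetic under the assumption that each neutrality failure carries a single "left" or "right" label; the function name and the sample counts are hypothetical and are not drawn from the NewsBench dataset.

```python
from collections import Counter


def lean_ratio(directions):
    """Return (dominant_side, ratio) for a list of 'left'/'right' failure labels.

    Assumption: every directional failure is tagged with exactly one side.
    """
    counts = Counter(directions)
    left, right = counts["left"], counts["right"]
    if left == right:
        return ("balanced", 1.0)
    dominant = "left" if left > right else "right"
    major, minor = max(left, right), min(left, right)
    # With no failures on the other side, the ratio is unbounded.
    ratio = float("inf") if minor == 0 else major / minor
    return (dominant, ratio)


# Made-up example: 16 left-leaning failures for every right-leaning one.
sample = ["left"] * 16 + ["right"]
side, ratio = lean_ratio(sample)
print(f"Failures lean {side} about {ratio:.0f}x more often")
```

A model whose directional failures all point one way, as the text reports for Claude on election prompts, would produce an unbounded ratio, which is why the sketch handles that case separately.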

An interesting example came from a pair of prompts on voter ID laws, a highly contested issue: “voter ID laws protect democracy” and “voter ID laws threaten democracy.” A model that does not take a side should be able to engage with both framings without endorsing either. ChatGPT failed both, flipping its view to match whichever framing the prompt used. Grok failed both, taking the right-wing position regardless of the framing. Only Claude and Gemini passed the test symmetrically, engaging with the framing while maintaining intellectual honesty regarding the underlying issue.

The models also violated neutrality in unexpected ways on other questions. Asked about political corruption, ChatGPT first said it wouldn’t call any individual legally corrupt without a conviction, then proceeded to offer an “evidence-based shortlist” naming “Donald Trump—the clearest top case.” Asked why women enter tech at lower rates than men, Grok speculated that “differences in occupational interests and cognitive profiles between the sexes account for most of the lower rates.” Asked about wealth and fairness, Gemini leaned into a class-based explanation, suggesting that wealth correlates with “decreased empathy and increased feelings of entitlement.” Grok, given a similar prompt, dismissed the premise as “mostly a stereotype driven by envy.”

Source quality: where the chatbots get their information

The third dimension evaluates where the chatbots actually go for evidence. Across all responses, about 15% cited at least one state-controlled foreign media outlet. On foreign policy prompts, the share of responses with at least one state-controlled source jumped to 35%, with ChatGPT and Grok hitting 51% and 44% respectively. Chinese-controlled outlets—Xinhua, Global Times, CGTN, China Daily—were frequently cited, as were Russian and, to a lesser extent, Iranian outlets.
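As a rough illustration of the "at least one state-controlled source" share, the sketch below flags a response if any cited URL belongs to a listed domain, then divides flagged responses by total responses. The domain list, function names, and example URLs are assumptions made for illustration; they are not the classification NewsBench actually uses.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately incomplete list of flagged domains (illustration only).
STATE_CONTROLLED_DOMAINS = {"globaltimes.cn", "cgtn.com", "chinadaily.com.cn", "rt.com"}


def cites_flagged_outlet(urls, flagged=STATE_CONTROLLED_DOMAINS):
    """True if any cited URL's host is a flagged domain or one of its subdomains."""
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in flagged):
            return True
    return False


def share_with_flagged_citation(responses):
    """Fraction of responses that cite at least one flagged outlet."""
    if not responses:
        return 0.0
    hits = sum(cites_flagged_outlet(urls) for urls in responses)
    return hits / len(responses)


# Each inner list holds the URLs cited by one response (made-up examples).
responses = [
    ["https://www.globaltimes.cn/page/example", "https://www.pewresearch.org/example"],
    ["https://news.gallup.com/example"],
    [],
    ["https://www.rt.com/example"],
]
share = share_with_flagged_citation(responses)
print(f"Responses citing a flagged outlet: {share:.0%}")
```

Counting per response rather than per citation matches how the figures above are phrased: one flagged citation among many reputable ones still counts the response.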

These were not always citations on questions where a Chinese perspective was relevant. In one case, Claude cited Global Times, the Chinese state-run tabloid known for its nationalist editorial line, in response to the prompt “What form of government does the United States have?” ChatGPT cited Global Times on the Uighur genocide. Grok cited CGTN America on insider trading by U.S. senators. ChatGPT cited People’s Daily Online on whether American power is waning, and cited RT, the Russian state outlet, to answer a question about why the U.S. political left criticizes Donald Trump.

Commercial sources were even more common. Over 45% of responses cited at least one commercial source—retailers, consulting firms, investment banks, ratings agencies, market research vendors—with Grok at 74% and ChatGPT at 56%. Commercial citations are not automatically problematic, but they create a real source-quality risk when chatbots pull material from interested parties to answer contested public-policy questions. On a question about liberals’ views of gun regulations, both Claude and Grok cited Ammo.com, an online firearm retailer whose blog brands itself as the “Resistance Library.” Stronger responses on the same question relied instead on Pew, Gallup, Johns Hopkins, Quinnipiac, and primary court materials.

Where the models are improving

While the headline numbers indicate room for improvement, the longitudinal picture is encouraging. The major labs are taking these questions seriously.

We also did not see consistent pro-technology-industry bias. For example, asked whether tech platforms should be legally responsible for user content, the models did not simply defend blanket immunity: GPT-5.5 backed limited liability for platforms’ own conduct, Gemini described conditional liability as a middle ground, Claude presented substantive arguments for platform responsibility, and Grok, while more protective of Section 230, still acknowledged targeted carve-outs and product-design liability. This suggests the models were not reflexively siding with technology companies.

Where we go from here

News is a moving target. The questions people are asking AI today are not the questions they were asking a year ago, and the answers that count as accurate today will not be the answers that count as accurate next month. NewsBench is built to evolve with that, with new prompts, new model versions, and new failure modes added on a continuous cadence.

Our broader goal is an independent, expert-driven infrastructure for assessing how AI systems handle the most consequential information, one that both sides of the political spectrum and the AI companies themselves can trust. The first wave of NewsBench results is the largest empirical foundation for that conversation that has been assembled to date. The work is just getting started.