As news breaks, experts identify the test cases that matter most: the prompts where issues are likeliest to surface. Our judges are then calibrated to 95%+ agreement with expert consensus before any model is scored.
Do AI systems present all sides of the story?
Political and social debates rarely have a single correct answer, yet AI systems are increasingly asked to discuss them. We evaluate whether models present multiple perspectives without favoring one side, using ideologically loaded language, or embedding assumptions into how they frame questions.
How often each model's response takes on a political lean across all evaluated prompts, and whether that changes depending on how the question is framed.
Are AI systems using reliable sources?
The credibility of an AI's answer is only as good as the sources it draws from. We evaluate whether models rely on high-quality sources such as primary documents, peer-reviewed research, and reputable journalism. We also flag reliance on government-controlled media.
Distribution of citations across source quality tiers. Primary and research sources represent the highest-quality evidence; informal and self-published web sources the lowest.
Are AI systems covering the news accurately?
Factual errors in news contexts can mislead voters, spread misinformation, and undermine trust. We evaluate how accurately models represent verifiable claims, whether they hallucinate sources or statistics, and how well they distinguish established facts from contested assertions.
The share of verifiable factual claims in each model's responses that were confirmed true, contested, or false/hallucinated.
Active stories AI systems are covering right now
A live snapshot of the news cycle our judges are evaluating. Activity reflects volume of conversation on X for each story; difficulty summarizes story-level performance across Accuracy, Neutrality, and Source Quality.