Skip to main content
Back to Blog
Sarah Young

Not All Queries Are Created Equal

Engineering a classification system for LLM evaluation

Photo by Bruno Miguel on Unsplash, Stylized with Gemini
Stylized water-color image of a statue of three hands raised in the air to ask a question.

Consider two questions: “Summarize the Republican argument against student-loan forgiveness” and “Should student-loan forgiveness be expanded?”

Both deal with student loans. Both are politically charged. However, the kind of answer each demands could not be more different. The first asks for a faithful summary of one perspective — balance is not only unnecessary, it would undermine the response. The second requests a measured overview of the full debate — omitting a major viewpoint would be a failure.

Same topic. Completely different standards.

This distinction may seem glaring when you pose the questions side by side. But when you are trying to evaluate AI responses at scale — particularly with regard to neutrality and balance on current events and breaking news — it becomes a real problem. Apply the same rubric to both queries and you will either penalize a model for correctly summarizing a single viewpoint as requested, or reward it for presenting a one-sided take on a genuine debate. We needed to solve this before we could evaluate anything else, so we built a classification system.

Different questions, different standards

The core idea is straightforward: before a model’s response reaches an evaluator, the original query must be classified into a tier. A question like “What is the boiling point of water?“ has a single factual answer. Neutrality just means getting it right, without adding unnecessary context. A question like “What were the civilian casualty numbers in a recent conflict?“ is rooted in fact, but, given that this type of data is often disputed, a one-line answer would be incomplete without elaboration. A request to explain the stated rationale for a specific administration’s policy demands accurate representation of that one viewpoint — no counterarguments that the user never asked for. And “How should governments regulate artificial intelligence?“ requires the multi-perspective treatment most people associate with “balance.”

In our system, each classification tier feeds into a distinct evaluation pipeline. The evaluators — LLM agents crafted hand in hand with domain experts — use standards purpose-built for that specific tier. What counts as a strong response at one tier might constitute a failure at another, which is why getting the classification right is not a preliminary step; it is the foundation that everything else depends on.

Take a single criterion (implicit bias) and watch how it plays out at different tiers. For the ‘Republican student-loan question,’ it means checking whether the model subtly undermines the viewpoint it was asked to represent: loaded word choices, a dismissive tone, disproportionate emphasis on the argument’s weakest parts. For the ‘AI regulation question,’ it means assessing whether the model gave all major viewpoints comparable depth, or quietly favored one camp through warmer language or stronger examples. Despite the same criterion, implicit bias, the evaluation standards are entirely different.

The iteration process: Where the real work lives

The first working version of our classifier handled the obvious cases well enough. Straightforward factual questions went where they belonged, and opinion-driven prompts landed in the right place. We felt cautiously optimistic.

Then the edge cases started arriving.

One of the earliest surprises was what we started thinking of as the “political ≠ contested problem.” A question like “Which countries recognized Palestine in 2024?“ is undeniably politically sensitive. But it has a definite, enumerable answer. The early classifier kept pushing queries like this into more complicated tiers because of their politicized status. The fix required a specific distinction: political salience alone does not make a fact disputed. Get this wrong and the consequences cascade: the evaluation pipeline starts expecting “both sides” of a settled fact, penalizing a model for not debating something that is not actually debatable.

We had also, without quite realizing it, built a different problem into the system’s core logic. Our initial approach was essentially “when in doubt, classify higher.” Our assumption was that more rigorous neutrality standards were always the safer bet. In practice, this caused widespread over-classification. A query like “Summarize the NRA’s argument against an assault weapons ban“ asks for one specific organization’s reasoning, but because it touches gun control — a high-salience partisan fault line — and names a politically polarizing organization, the classifier read it as a debate prompt rather than a summarization request. The evaluator now expects the response to contain balanced competing arguments, and the model gets marked down for omitting counterarguments the user never requested.

The fix turned out to be a single conceptual shift: classify to the level required by the user’s actual intent, not the highest level that could theoretically apply to any element of the query. But how does that work?

Most queries contain elements that could plausibly touch multiple tiers, so the question becomes: is the higher-level element the point of the query, or is it incidental context? We found the most reliable signal was whether ignoring the higher-level component would leave the user’s question meaningfully unanswered. If not, it should not drive the classification.

Compare “What is the financial impact of illegal immigration on Texas border towns?” with “Should the U.S. build a border wall?” The first query touches immigration policy, but the point is an economic analysis. The politicized dimension of the query serves as the context, not the question. The second query is the debate itself. Even though both derive from a similar contextual background, classifying the first as a spectrum debate would push the evaluator to check the answer for things the user never asked for.

It is a subtle distinction — really just the difference between reading the question and reading into the question. The distinction emerged directly from our calibration sessions with domain experts. When we asked evaluators to explain why they classified these “edge-case” queries differently from one another, this was the intuition they kept circling back to. They just hadn’t had reason to formalize it before. And once we started applying this logic consistently, an entire category of errors disappeared.

What surprised us most was how finely-tuned the whole system turned out to be. Adding a single clarifying command to address one edge case could quietly impact the classification of a dozen others. Sometimes the fix was adding specificity; other times it was removing a word that was being interpreted more broadly than intended. After dozens of iterations — always tested against real-world queries, never synthetic ones — we arrived at a 96% classification accuracy rate as measured against expert human graders on a sample of over 500 queries. Put differently: in all but 20 cases, our classifier and our domain experts agreed on the intent of the question, and thus how it should be evaluated.

That number did not come from getting the easy cases right, it came from grinding through the hard ones. Much of the work involved asking domain experts to articulate the reasoning they apply intuitively — judgment that is easy to exercise but remarkably difficult to encode.

What We Learned and What Comes Next

If there is one takeaway from this process, it is that the value of a classifier lives entirely in the gray areas, and that there are far more of those than we would have guessed. The obvious cases take care of themselves. The hard cases require the kind of specificity that risks creating new ambiguities every time you resolve an old one, and navigating that felt more like art than engineering. It is also not something engineers can do alone. The distinctions between tiers were shaped by domain experts who understand what a good response actually looks like.

The classification methodology was built for neutrality and balance, but the approach is extensible. We have already started to see this in early work on tone and language, where the classification priorities shift entirely; the question is no longer which perspectives are represented (as it was for neutrality), but whether the model is calibrating its register and framing to what the query calls for. Each new dimension will require its own classification logic, built with its own expert-informed distinctions.

We are still early. But the foundation is in place, and the methodology works.

The questions people ask AI models are not all created equal, and the way we evaluate the answers should not be, either.

SARAH YOUNG is a content engineer at Forum AI.