Kathryn Salam

AI Companies Can’t Grade Their Own Homework

We're heading toward our own meatpacking moment

Meatpackers in the sausage department of Armour's packing house in Chicago, Illinois, 1893. Library of Congress, Stylized by Gemini

In the late 1800s, American meatpackers operated under a simple premise: just don’t think about it too much. Companies promised their meats were safe, their practices sanitary.

Then, in 1906, Upton Sinclair published The Jungle.

The book exposed the meatpacking industry in graphic detail—rats everywhere, workers with tuberculosis hacking over racks of meat, spoiled products relabeled and sold. The public response was immediate. President Theodore Roosevelt sent investigators to Chicago’s packinghouses.

They confirmed everything, and that same year Congress passed the Pure Food and Drug Act. The industry’s objections—that regulation would crush innovation, that companies had reputational incentives to be careful, that the market would sort out the bad actors—had proved hollow.

Tainted meat had already reached American households, even the army. Several years before, during the Spanish-American War, the U.S. side is said to have seen more deaths from food-borne illnesses than from combat. A particularly bad batch of meat (over 330 tons of refrigerated beef and almost 200,000 pounds of canned beef) became a scandal after soldiers described it as smelling like an embalmed dead body. (Google “Embalmed Beef Scandal” if you have a strong stomach.) The meat caused widespread dysentery and food poisoning.

In other words, the market couldn’t sort anything out. Consumers couldn’t tell safe food from dangerous food just by looking at it, and therefore could not hold meatpackers accountable with their pocketbooks. The problem was information asymmetry. Packages might be labeled “Grade A chicken” when they actually contained pork scraps, bleach, and formaldehyde. Cans looked identical whether they held beef or sludge. By the time symptoms appeared—the fever, the vomiting, and worse—the connection to a specific product was impossible to trace.

It took an actual act of Congress to get consumers what they really needed: accurate labeling, mandated disclosure of dangerous ingredients, outside inspections, and more. No more would soldiers eat embalmed cow; no more would parents feed their families rat meat labeled as pork.

This pattern would repeat itself across industries for the next century. Each time, the lesson was the same: industries cannot usually grade themselves when public safety is at stake. Today, AI companies find themselves in a similar position to the meatpackers (hopefully minus the rats and tuberculosis). They release models with their own internal safety evaluations. They publish their own bias assessments. They set their own standards and then grade themselves on how well they meet them. Users, meanwhile, face the same information problem consumers had with canned meat. AI systems are black boxes. Bias might be subtle, surfacing only after dozens of interactions, or only for certain types of queries, or only for users with particular viewpoints. For the average user, it’s like trying to judge meat quality through a sealed can.

From cows to planes

If the analogy to meatpacking feels too far afield, consider some of the other (hard) ways in which Americans have had to learn their lesson.

Also in the 1800s, patent medicines promised miracle cures. Instead, they often delivered addiction, poisoning, and death. Companies made whatever claims they wanted, and there was no requirement to prove a drug actually worked, or even that it wouldn’t kill you. In 1937, over 100 people (many of them children) died in the Elixir Sulfanilamide disaster. The medication contained diethylene glycol—essentially antifreeze. The company had tested it for appearance, fragrance, and flavor. Not for safety.

The 1938 Federal Food, Drug, and Cosmetic Act followed, requiring companies to prove safety before marketing drugs. The standards tightened further after thalidomide caused thousands of birth defects in Europe in the late 1950s and early 1960s. Today, pharmaceutical companies don’t approve their own drugs. The FDA does.

Building codes likewise emerged from catastrophe. The Great Chicago Fire of 1871 killed hundreds and destroyed much of that city. The 1911 Triangle Shirtwaist Factory fire in New York killed 146 workers, most of them young immigrant women, who were trapped behind locked doors in a building with inadequate fire escapes.

Before these disasters, builders largely self-certified that their structures were safe. Cutting corners on safety measures was profitable. Fires and collapses revealed the problems too late. Modern building codes changed that. Independent inspectors verify that structures are habitable. Green Business Certification Inc. offers LEED certification to attest that buildings meet environmental standards.

Or fast forward to the 2010s. Boeing’s 737 MAX disasters showed what happens when manufacturers effectively certify their own products. The FAA had delegated much of its safety oversight to Boeing itself through a program that allowed the company’s employees to act as FAA representatives.

But after two crashes killed 346 people, investigations found that Boeing had withheld critical information from regulators (and pilots) about issues with its new flight control software. The company had prioritized speed to market over thorough safety review. Boeing later paid over $2.5 billion in settlements, and the 737 MAX was grounded for 20 months. The crashes exposed a simple truth: even in industries with mature regulatory frameworks, self-certification fails when economic pressures overwhelm safety incentives.

Why self-regulation still doesn’t work

These examples share some common threads. Industries claimed self-regulation would work. It didn’t. They claimed outside oversight would stifle innovation. It didn’t. The FDA now regulates about 80% of the U.S. food supply and has approved over 20,000 prescription drug products for marketing. Clear standards enabled the pharmaceutical industry to become one of America’s most valuable sectors.

Another theme is that the harms of self-regulation were often invisible to consumers until they became catastrophic. You couldn’t tell if food was contaminated by looking at it. You couldn’t tell if a building would collapse in a fire by walking through it. Travelers didn’t know from their seat in row 34 that their airplane now had software that would automatically push its nose down. By the time the problems became obvious, people had already been hurt.

AI companies today operate in much the same environment. Speed to market creates revenue. When companies evaluate themselves, the pressure to ship wins more often than it loses. And AI makes the information problem even harder: the technology is more complex than packing meat or constructing buildings, the models can be difficult even for experts to understand, and the harms can be diffuse and cumulative.

Consider political bias. A model might subtly favor certain viewpoints in ways that compound over millions of interactions. No single interaction might be enough to poison a user, but over time the toxins add up. The federal government recognizes these problems. The Office of Management and Budget’s M-26-04 memo requires agencies using AI to ensure that their models are ideologically neutral, truth-seeking, and accurate. But if agencies rely on vendors’ self-evaluations, they’re just pushing the self-certification problem down the chain.

What AI needs is what other industries developed: clear evaluation criteria, standardized testing protocols, and independent assessors with no financial stake in the outcome. For political bias, a major area of focus for Forum AI, this means testing models across thousands of queries on contested topics; grading them on clear, expert-guided rubrics; and measuring whether outputs systematically favor particular perspectives. And for government applications, it means verifying that AI systems meet accuracy and fairness thresholds before deployment, not after people are harmed.
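To make that last step concrete, here is a minimal sketch of how one might check whether rubric scores add up to a systematic tilt. It assumes a hypothetical rubric that grades each answer on a lean scale from -1 to +1; the scale, the threshold, and the example scores are illustrative, not Forum AI's actual methodology.

```python
import statistics

def systematic_lean(scores: list[float], practical_threshold: float = 0.05) -> dict:
    """Summarize whether per-query lean scores add up to a systematic tilt.

    Each score is a hypothetical rubric grade for one contested-topic query,
    from -1.0 (favors one perspective) to +1.0 (favors the opposing
    perspective), with 0.0 meaning balanced. Assumes at least two scores.
    """
    mean = statistics.mean(scores)
    std_error = statistics.stdev(scores) / len(scores) ** 0.5
    # Flag the model only if the average lean is both statistically
    # distinguishable from zero and large enough to matter in practice.
    return {
        "mean_lean": mean,
        "std_error": std_error,
        "flagged": abs(mean) > 2 * std_error and abs(mean) > practical_threshold,
    }

# Illustrative scores from grading one model's answers to eight queries.
example_scores = [0.10, -0.05, 0.20, 0.15, 0.00, 0.12, 0.08, -0.02]
print(systematic_lean(example_scores))
```

The point is not these particular statistics but the shape of the process: many queries, a consistent rubric, and an aggregate measure that an independent assessor can reproduce.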

The solution exists and has been proven over history and in plenty of other industries: independent third-party certification based on transparent standards developed by experts who know what they are talking about.

To be clear, that third party is not necessarily the government. The FDA approves drugs, but private laboratories conduct much of the actual testing. LEED certification is granted by a private organization, not a government agency. In AI—a field evolving faster than the government can typically move—private certification bodies with domain expertise are best positioned to develop and enforce rigorous standards. The key is independence from the companies being evaluated, transparency in methodology, and accountability for the certifications granted.

And independent certification doesn’t have to be just about preventing harms. It can create positive incentives, too. For responsible AI companies, certification offers competitive differentiation (like LEED certification). It provides a credible signal to customers and partners. It reduces legal liability by demonstrating due diligence. It gives internal teams clear standards to build toward.

Some steps in this direction already exist. Open benchmarks and leaderboards seek to apply pressure by publicly comparing model performance. Academic researchers are probing systems for flaws. These efforts matter, but they are fragmented. There are dozens of different benchmarks with varying degrees of rigor and no consensus on which to trust. Because AI is already so broadly and deeply used—and will only become more so over time—the evaluation industry needs a reputable certifier that brings domain-specific expertise to bear.

The alternative, of course, is waiting for disaster. Some AI system will eventually cause significant harm. When that happens, the regulatory response will be swift, broad, and possibly clumsy. That’s why it makes sense to establish thoughtful frameworks now rather than clean up a catastrophe later.

History isn’t subtle about the direction it favors.

KATHRYN SALAM is head of content at Forum AI.