Speed-Running Content Moderation
What fifteen years of social media safety teaches about evaluating AI

A friend recently asked me for a primer on AI trust and safety. My answer surprised me a little as I was giving it. Working on AI safety at Meta—on the GenAI team—had felt, I said, a bit like speed-running the last fifteen years of content moderation decisions. Every tension the social media industry had spent a decade and a half learning to navigate—the hard way—seemed to arrive all at once when it came to AI.
The more I thought about it, the more that framing seemed worth unpacking. Those fifteen years of content moderation history aren’t just a story about how hard the problem is. They’re a story about an industry that spent its formative years asking the wrong question—Should this stay up, or should it come down?—when the question it should have been asking all along was: What do we actually want this information environment to produce? Generative AI, almost by accident, is forcing that better question into the open. But to see why, you need to understand what came before.
How content moderation actually works
Content moderation on a platform like Facebook or Instagram is not a single decision. It’s a set of interventions distributed across a layered architecture: inventory (the content itself), ranking (what gets promoted), recommendations (what the platform actively surfaces to you), and advertising (what gets commercial amplification). Each layer operates under a different standard. The intuition is that you intervene more aggressively when the platform is doing the driving—when the company recommends something new, rather than a user choosing to receive more of something they love—and when the potential audience is larger.
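A rough sketch of that layering, with hypothetical surface names and threshold values chosen purely to illustrate the idea: the more actively the platform amplifies content, the lower the risk score at which it intervenes.

```python
from dataclasses import dataclass

@dataclass
class Surface:
    name: str
    platform_driven: bool     # is the platform doing the driving on this surface?
    removal_threshold: float  # risk score at which content is pulled from this surface

# Stricter (lower) thresholds where the platform amplifies; looser where the
# user chose the content themselves. All names and numbers are illustrative.
SURFACES = [
    Surface("inventory",       platform_driven=False, removal_threshold=0.95),
    Surface("ranking",         platform_driven=True,  removal_threshold=0.80),
    Surface("recommendations", platform_driven=True,  removal_threshold=0.60),
    Surface("advertising",     platform_driven=True,  removal_threshold=0.40),
]

def eligible_surfaces(risk_score: float) -> list[str]:
    """Surfaces on which a piece of content may still appear, given its risk score."""
    return [s.name for s in SURFACES if risk_score < s.removal_threshold]
```

Under these made-up numbers, a post scoring 0.7 on some risk classifier would stay in inventory and ranking but be excluded from recommendations and ads.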
The natural first instinct for any new safety team is what practitioners call “guarding all writes,” that is, intervening at the moment content is created or posted, before it reaches anyone. This is transparent to the poster and it’s easy to reason about. The question being asked at the write step is: Are we willing to host this content for public consumption?
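In code, that instinct amounts to a single check at the post endpoint. This is a toy sketch: is_violating and handle_post are hypothetical names, and the classifier behind the check is the hard part the rest of this piece is about.

```python
# A toy write-time gate: the check runs when content is posted, before it
# can reach anyone. Everything here is illustrative.
def is_violating(content: str) -> bool:
    banned_terms = {"example-banned-term"}   # placeholder policy
    return any(term in content.lower() for term in banned_terms)

def handle_post(author_id: str, content: str, store: dict) -> bool:
    """The write-step question: are we willing to host this at all?"""
    if is_violating(content):
        return False                          # rejected before anyone sees it
    store.setdefault(author_id, []).append(content)
    return True
```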
The earliest version of that question was almost entirely about the content itself. Does this post contain nudity? Hate speech? Graphic violence? Teams built classifiers to look at words and images directly. But adversaries adapted fast. They misspelled slurs, embedded text in images, used coded language that shifted weekly. Content-level signals turned out to be brittle—easy to game, expensive to maintain, and perpetually behind.
So content moderation teams learned to lean on behavioral signals instead, which turned out to be far more reliable. A misinformation classifier, for example, might barely look at the content of posts at all. Instead, it could look at the speed and pattern of resharing—the behavioral footprint of coordinated amplification. Before large language models made content classification tractable, this was often how you found the bad stuff. It also made moderation choices more defensible by keeping things framed in terms of how you said something rather than what you said.
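To make that shift concrete, here is a hedged sketch of a behavior-only signal. It never reads the text of a post; it scores the reshare pattern alone. The function name, features, and constants are hypothetical, and a production classifier would learn its weights from labeled data rather than hand-tuning them.

```python
from datetime import datetime, timedelta

def amplification_score(reshare_times: list[datetime]) -> float:
    """Crude coordinated-amplification score from reshare timestamps alone."""
    if len(reshare_times) < 2:
        return 0.0
    times = sorted(reshare_times)
    window = timedelta(hours=1)
    # Peak number of reshares falling in any one-hour window (burstiness).
    peak = max(sum(1 for u in times if t <= u < t + window) for t in times)
    # How concentrated the total reshare activity is within that burst.
    concentration = peak / len(times)
    # High volume packed into a narrow window is the footprint of coordination.
    return min(1.0, concentration * peak / 100)
```

A thousand reshares spread evenly over a week scores near zero; the same thousand packed into a single hour scores near one.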
But notice what was still true at every stage: the fundamental operation was always binary. Content was either violating or not. It stayed up or came down. The entire apparatus—the layers, the classifiers, the behavioral signals—was in service of that single yes/no decision.
What GenAI changes
Generative AI scrambles most of this architecture. The careful layering—inventory, ranking, recommendations, ads, each with its own intervention logic and thresholds—collapses into a single unified blob. The inventory is the model. The ranking is the model. The suggestions (“I could help you with X next—would you like me to?”) are the model. It is all one mess of weights emitting one monolithic response.
And the behavioral toolkit is severely diminished. Most GenAI interactions are effectively one-to-one conversations—one user, one AI—which means the network-level signals that made social media moderation tractable are largely absent. There is no reshare velocity, no coordination pattern to detect. Companies like OpenAI and Anthropic have found ways to use behavioral signals at the account and platform level—detecting coordinated inauthentic behavior across users, incorporating feedback mechanisms—but the rich ecosystem of network-level signals that social media teams relied on has no direct analogue. You are left, far more than before, with the content itself.
The legal landscape is also genuinely new. Social media platforms spent years under a relatively settled, if contested, liability regime. The question of whether a chatbot giving medical advice creates legal liability for its developer is still an open one. The most significant case so far is Garcia v. Character Technologies, in which the mother of a fourteen-year-old who died by suicide sued Character.AI after her son developed an intense relationship with one of the platform’s chatbots. A federal judge ruled that the chatbot’s output was a product, not protected speech—rejecting Character.AI’s argument that the First Amendment shielded its model’s responses. The case settled in early 2026, but the underlying question—whether AI-generated output is “speech” at all, and whether companies bear product liability for what their models say—remains unresolved and will almost certainly reach higher courts. Meanwhile, Air Canada lost a civil tribunal case after arguing its chatbot was a “separate legal entity,” and legislation introduced in the New York State Senate would begin to establish statutory frameworks in exactly this area. Not only is AI safety less tractable than its social media predecessor; the legal ground beneath it is also shifting in real time.
Rules were always the wrong frame
The content moderation industry spent fifteen years locked in a debate that, in hindsight, was the wrong one. The central question was always binary: Should this stay up, or should it come down? Every policy document, every escalation framework, every transparency report was organized around that axis. And because the question was binary, every answer was contested. Takedowns became culture-war flashpoints. “Censorship” and “safety” calcified into opposing camps, each convinced the other was acting in bad faith.
A binary keep-up/take-down decision is a terrible surface on which to negotiate what a society actually wants its information environment to look like. It reduces every piece of content to a single bit—permitted or not—and then forces everyone to fight over that bit as though it were the whole question.
Generative AI, almost by accident, dissolves this frame. When the model is the inventory, the ranking, and the recommendation layer all at once, there is no piece of content sitting on a server waiting to be adjudicated. There is no post to take down. Instead there is a response being constructed in real time, and the only meaningful question is: What should it say?
That turns out to be a much more productive question. “What content do we actually want to see?” is a conversation about values, context, and purpose. It invites domain expertise. It is the question content moderation was always trying to answer but could never quite reach, because the infrastructure forced everything through a binary gate first.
My colleague Robbie Goldfarb named this shift precisely in a recent piece. Most of the industry still governs AI by defining rules—model specs, policy guidelines, constitutional frameworks. These are necessary starting points, but they share the same weakness as the old takedown regime: they ask “is this permitted?” rather than “what happens if I do this, and who is affected?” His proposed alternative is what he calls a “window of reason”—caring less about whether an output is policy-compliant and more about whether it falls within the range of responses that qualified domain experts would consider reasonable for that context.
Defense in depth, and its limits
If the right question is “what do we actually want to see,” the next question is how you build a system that reliably produces it. What has emerged in practice is an approach I’d call “defense in depth”: safety layered across three stages. But the weight usually falls in the wrong places.
The first layer is at the input: classifying whether a given query should be answered at all. This is the most familiar layer, because it is the closest descendant of the old binary gate. And for narrow applications it makes sense—the Chipotle support bot really should not help you write a screenplay. But for general-purpose models, heavy-handed input filtering just recreates the takedown debate in a new venue. It is still a binary decision, still context-free, still contested.
The better use of the input layer is not as a gate but as a sensor. What is this person actually asking? What is the context—conversational, emotional, professional—in which they are asking it? The input layer’s job should be to enrich the model’s understanding of the situation, not to decide whether the situation is permitted.
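A minimal sketch of the sensor framing, assuming a toy keyword classifier in place of a real lightweight model (all names here are hypothetical): the input layer attaches context to the request instead of returning allow or deny.

```python
from dataclasses import dataclass

@dataclass
class RequestContext:
    topic: str               # e.g. "medical", "general"
    emotional_register: str  # e.g. "distressed", "neutral"
    setting: str             # e.g. "professional", "personal"

def annotate_request(user_message: str) -> RequestContext:
    """Toy stand-in for a real context classifier: enrich, don't gate."""
    text = user_message.lower()
    topic = "medical" if any(w in text for w in ("dose", "symptom", "mg")) else "general"
    register = "distressed" if any(w in text for w in ("scared", "urgent", "help me")) else "neutral"
    setting = "professional" if "my patient" in text else "personal"
    return RequestContext(topic, register, setting)

def build_prompt(user_message: str) -> str:
    ctx = annotate_request(user_message)
    # The context travels with the request so the model can exercise judgment;
    # nothing here refuses the request outright.
    return (f"[context: topic={ctx.topic}; register={ctx.emotional_register}; "
            f"setting={ctx.setting}]\n{user_message}")
```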
The second layer is training itself: filtering pretraining data, encoding limits during fine-tuning, and shaping behavior through reinforcement learning. This is where the real leverage lives, and it deserves the lion’s share of investment. If the goal is a model that tends toward good responses by default—one that exercises judgment the way a thoughtful domain expert would—then that judgment has to be baked into the weights, not bolted on at the edges. You are not trying to build a reckless system and then restrain it. You are trying to build a wise one.
The third layer is at the output: checking what the model actually produced. Most teams handle this optimistically—surfacing the response while the check is still running, on the bet that most responses are fine. That bet is probably right, and it should be right, because if you have done the first two layers well, output intervention becomes what it ought to be: a last resort, a safety net for the rare cases where the model’s judgment fails despite good context and good training. The moment output filtering is doing heavy lifting, it is a signal that something upstream is broken.
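Here is a sketch of that optimistic pattern, with generate, safety_check, deliver, and retract as hypothetical stand-ins for real infrastructure: the response reaches the user immediately, and the concurrent check pulls it back only in the rare case it fails.

```python
# Optimistic output checking: serve the response right away, run the safety
# check concurrently, and retract only on the rare failure.
import asyncio

async def generate(prompt: str) -> str:
    return f"(model response to: {prompt})"       # placeholder model call

async def safety_check(response: str) -> bool:
    await asyncio.sleep(0.2)                      # placeholder for a slower check
    return "do-not-serve" not in response

def deliver(response: str) -> None:
    print(response)                               # user sees it immediately

def retract(response: str) -> None:
    print("[response retracted]")                 # e.g. replace it in the UI

async def answer(prompt: str) -> str:
    response = await generate(prompt)
    check = asyncio.create_task(safety_check(response))
    deliver(response)                             # optimistic: don't wait for the check
    if not await check:                           # the safety net for rare failures
        retract(response)
    return response

if __name__ == "__main__":
    asyncio.run(answer("How do I season a cast-iron pan?"))
```

The design choice worth noticing is the order of operations: deliver first, await the check second. If that ordering makes you nervous, the upstream layers are not doing their job.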
This is the real reframe. The old content moderation architecture put almost all of its energy into boundary enforcement—what stays up, what comes down. Defense in depth risks inheriting that instinct, spending its budget on input gates and output filters while underinvesting in the thing that actually determines quality: teaching the model to make good choices in the first place. The input layer should provide context. The training layer should build judgment. The output layer should catch the rare failure. In that order, with that emphasis.
Where that leaves us
The work Forum AI does—evaluating LLM responses for ideological neutrality and truth-seeking at scale, using expert-defined rubrics rather than internal benchmarks—sits mostly in the middle layer. The point is to provide something the internal safety team cannot: a reading of model behavior by people with no stake in the outcome, measured against standards that domain experts, not product engineers, helped define. We’re helping train the model to do the right thing by default, rather than guarding the read step the way the old regime guarded writes.
The AI industry is, right now, roughly where social media was circa 2010. The internal safety teams exist and are working hard—I know, because I was on one of them. The external accountability infrastructure—independent evaluators, agreed standards, regulatory benchmarks—is still being built.
The next fifteen years will teach the industry a great deal. The question is how much of it has to be learned the hard way.
MATT WILDE is head of research at Forum AI.