
The Hubris of Social Scientists

Sherry Glied

March 27, 2024

Researchers need to look beyond randomized trials.


Most randomized controlled trial (RCT) evaluations of new criminal justice interventions find that they don’t do much at all, and even interventions initially identified as promising rarely succeed when they’re tested again by different investigators or in other settings.

That’s what Megan Stevenson asserts in her essay “Cause, Effect, and the Structure of the Social World,” nicely articulating a skepticism many have long harbored about supposedly “evidence-based” public policy.

That core claim is correct. Stevenson goes on to point out — also correctly — that the problem is not unique to criminal justice. RCTs in other social science and behavioral fields, she explains, are similarly disappointing. She attributes these failures to “the structure of the social world,” which, in her telling, is so complex and change-resistant that most problems come rushing back no matter what supposed “fixes” we have applied to them.

There, I think, she is wrong. The problem is not the social world. It is the hubris of social scientists.

Most new ideas fail. When tested, they show null results, and when replicated, apparent findings disappear. This truth is in no way limited to social policy. Social science RCTs are modeled on medical research — but fewer than 2% of all drugs that are investigated by academics in preclinical trials are ultimately approved for sale. A recent study found that just 1 in 5 drugs that were successful after Phase 1 trials made it through the FDA approval process.


Even after drugs are approved for sale at the completion of the complex FDA process (involving multiple RCTs), new evidence often emerges casting those initial results in doubt. There’s a 1 in 3 chance that an approved drug is assigned a black-box warning or similar caution post-approval. And in most cases, the effectiveness of a drug in real-world settings, where it is prescribed by harried physicians and taken by distracted patients, is much lower than its effectiveness in trial settings, where the investigative team is singularly focused on ensuring that the trial adheres to the sponsor’s conditions — or where an academic investigator is focused on publishing a first-class paper. Most of the time, new ideas and products don’t work in the physical world either — and a darned good thing that is, or we’d be changing up everything all the time.

Drug trials are a best-case scenario for successful innovation because they generally build on brand-new science. In contrast, most social science problems are very, very old, and the ideas we have to address them generally employ technologies that have existed for a long time. Our forebears were not all fools — if these strategies were successful, they’d almost certainly have been implemented already (and many have been, so we take them for granted).


Ancient Romans recognized that street lighting might reduce crime. Epictetus, the first-century Greek philosopher, would have found cognitive behavioral therapy very familiar. It would be worrisome if there were big, effective criminal justice interventions out there that we had missed for centuries. Perhaps we should start our analysis by recognizing that we stand on the shoulders of centuries of social reformers and are operating fairly close to the feasible margin.

Operating near the feasible margin means recognizing that, even when they work, interventions are likely to have very modest effects. Modest effects are not unimportant — the power of incrementalist policy is in the accumulation of increments. But they are a challenge to test through RCTs. It is inherently difficult and costly to build an RCT with sufficient statistical power to detect modest effects anywhere, and it’s much harder to do that for social policy trials than for drug trials. The smaller the effect one expects to see in a successful trial, the more people one must enroll to be able to distinguish that effect from zero (or, conversely, convincingly argue that there is likely no effect). But social interventions are typically intensive and costly, so sample sizes are often too small to detect meaningful effects.
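To make that power problem concrete, here is a minimal sketch, using the standard two-proportion sample-size approximation, of how required enrollment grows as the expected effect shrinks. The baseline and treatment rates below are hypothetical round numbers chosen purely for illustration; they are not figures from the essay.

```python
# A minimal sketch (not from the essay) of the standard two-proportion
# sample-size approximation. All rates are hypothetical round numbers
# chosen only to show how enrollment requirements grow as the expected
# effect shrinks.
from statistics import NormalDist

def n_per_arm(p_control: float, p_treated: float,
              alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate participants per arm for a two-sided test of two proportions."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the significance level
    z_beta = z.inv_cdf(power)            # critical value for the desired power
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    effect = p_control - p_treated
    return (z_alpha + z_beta) ** 2 * variance / effect ** 2

# Halving a 30% baseline rate vs. trimming it by two percentage points:
print(round(n_per_arm(0.30, 0.15)))  # roughly 120 per arm
print(round(n_per_arm(0.30, 0.28)))  # roughly 8,000 per arm
```

On these hypothetical numbers, detecting the two-percentage-point improvement requires roughly seventy times as many participants per arm as detecting the fifteen-point one, which is the arithmetic behind the costliness described above.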

Recent developments in the literature on the mortality reductions achieved through health insurance demonstrate this problem. While, as Stevenson points out, pre-Affordable Care Act studies, including RCTs, generally found no effects of health insurance on mortality, two rigorous studies conducted after the insurance expansions (using different methodologies) — one examining Medicaid expansions and the other focusing on private insurance expansions — found convincing evidence of mortality reductions. They looked at samples of over 4 million people and nearly 9 million people, respectively. The enormous samples allowed them to identify effects that were relatively small in percentage terms (although they translated into many thousands of lives saved annually at the national level) and to home in on the subpopulation over 45, where baseline mortality from conditions amenable to health care is higher than among younger adults. But that kind of opportunity for identifying policy effects of modest size doesn’t happen very often.


Identifying small effects through studies conducted before implementation of a large-scale policy is made even more difficult by other challenges inherent to social science RCTs. For example, the effects of a specific social or behavioral intervention are likely to vary across contexts (even street lights have different effects on cold winter nights and balmy summer evenings). This contextual variability weakens the ability to build statistical power by pooling data across sites or over time, as in meta-analyses.

It also makes it hard to interpret the meaning of a replication failure — a study that doesn’t replicate might still be successful in its original context.
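A toy simulation can illustrate both points at once: if the same intervention genuinely works in one setting and not in another, a replication can "fail" and a pooled estimate can shrink even though the original finding was real. Everything in this sketch, including the effect sizes and sample sizes, is hypothetical and chosen only for illustration.

```python
# A toy simulation (not from the essay) of context-dependent treatment
# effects: the intervention has a real effect at the original site and
# none at the replication site, so the replication looks like a failure
# and naive pooling dilutes the estimate. All numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 500  # participants per arm at each site

def site_estimate(true_effect: float) -> float:
    """Difference in mean outcomes between treated and control arms at one site."""
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_effect, 1.0, n)
    return treated.mean() - control.mean()

original = site_estimate(0.30)     # context where the intervention works
replication = site_estimate(0.00)  # context where it does not
pooled = (original + replication) / 2

print(f"original site:    {original:.2f}")
print(f"replication site: {replication:.2f}")
print(f"naive pooled:     {pooled:.2f}")  # roughly half the original effect
```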

What’s more, the dedication, charisma, and intelligence of an investigator will surely bias the results of an RCT more in the case of a behavioral intervention than a drug trial. The literature on teacher quality documents enormous differences in student success based on the individual teacher to whom students are assigned — and there’s a long history of similar narratives in criminal justice (e.g., Spencer Tracy in “Boys Town”).

What is to be done? Criminal justice policy differs from education or economic development or health policy because there has been so much variability in crime rates across jurisdictions over decades and centuries. These changes swamp the incremental effects we might expect from the kinds of interventions tested in RCTs. But economists’ (understandable) fixation on causally identified RCTs has dampened our interest in studying these methodologically less satisfying problems. The literature identifying what policies have led crime to rise or fall, and where, is peculiarly anemic.

Perhaps if researchers recognized the futility of searching for no-better-than-modest effects of various public policy interventions in RCTs, we might divert a little of our research enthusiasm and energy toward understanding what really drives variation in criminal justice outcomes. A first step would be to be much more systematic in collecting and making available data across jurisdictions and over time, so that researchers can take more advantage of naturally occurring variation in policymaking and evaluate how policy ideas work as they are replicated in varied contexts. While RCTs offer “cleaner” results, the most important questions are about how actual policies work when and where they are implemented.