In the last decade, behavioral scientists concluded that their field had taken a wrong turn. Efforts to root out false findings and bad practices spurred a crisis now poised to transform the landscape of psychology. Meet four scientists who are leading the charge.
At the turn of the millennium, behavioral scientists sketched a picture of the mind as capricious—and often quite malleable. Exposure to words associated with the elderly could make people walk more slowly, NYU researchers reported in the 1990s. More recently, a Harvard psychologist rose to prominence by arguing that so-called power poses could elevate a person’s propensity to take risks. Washing one’s hands, a U.K. lab posited, could affect moral judgment by reducing feelings of disgust. All of these studies passed peer review, and their findings were reported widely. In 2011, a respected Cornell researcher even published what seemed to be evidence of precognition—the ability to perceive future events. For some behavioral scientists, that claim drove home the sense that something was deeply wrong. Indeed, each of these findings has proven difficult, if not impossible, to consistently repeat.
Today, psychological science is engulfed in an unremitting storm. Diligent scrutiny has exposed the methodological weaknesses baked into many high-profile studies. These revelations are reshaping the way researchers investigate a subject no less grand than human nature itself.
Anyone who has read a book or an article promising a simple, evidence-based life hack or a counterintuitive take on behavior has potentially absorbed an idea that is now discredited. In the early 2010s, scientists banded together with increasing urgency to corroborate published findings by repeating the studies—and often came up empty-handed. Those efforts prompted what is commonly called the “replication crisis,” the outcome of which has implications not only for the scientific field of psychology but also for real-world interventions in mental health care and education.
In 2015, a landmark paper laid bare the depth of the problem: Of 97 attempts to replicate previous research findings, fewer than 40 percent were deemed successful. Another large-scale replication project, published in November 2018, tested 28 findings dating from the 1970s through 2014. It found evidence for about half. Cancer research, economics, and other fields have struggled with replication to some extent, but psychology’s challenges are perhaps the best publicized.
Despite pushback—including the message of some influential psychologists that the crisis is overblown and the proposed fixes too draconian—a major reckoning is underway. The scientists challenging received wisdom are diverse in background and in demeanor: They can be cynical, argumentative, diplomatic, cheerful, hopeful. The field’s crisis of confidence, and the resolve of the four researchers profiled here, illuminate a potential paradigm shift in behavioral science.
Michael Inzlicht, social psychologist at the University of Toronto
More than a decade into his career as a psychologist, Michael Inzlicht had a dark epiphany. An eight-page paper he had read, titled “False-Positive Psychology,” made something painfully clear: Common research practices could be used to dredge up evidence for the impossible. In a section called “How Bad Can It Be? A Demonstration of Chronological Rejuvenation,” the authors used actual data to support the claim that listening to the Beatles song “When I’m Sixty-Four” could literally make participants younger. The authors knew, of course, that it was a ridiculous hypothesis. That was the point.
“Their message wasn’t unique,” Inzlicht says of the 2011 paper by Joseph Simmons, Leif Nelson, and Uri Simonsohn. Some researchers had been pointing to the misuse of statistical methods for decades. “What was brilliant about their paper was that they detailed specifically how we abuse our tools.” He calls the paper a “mind-bomb,” akin to an optical illusion: Once you see the hidden image, you can’t unsee it. Inzlicht and other psychologists apprehended that methodology they had long employed could easily produce false positives—results that seem to show real effects or meaningful correlations but, in fact, do not.
Things grew worse in the years that followed as the evidence for core concepts that Inzlicht explored was called into question. Inzlicht, who earned his Ph.D. in 2001 and is now a professor at the University of Toronto, had expended much effort investigating stereotype threat and ego depletion. The first describes a detrimental effect on the performance of people in oft-stereotyped groups, such as racial minorities or women, if they perceive a risk that they will confirm negative stereotypes. The latter is the idea that when people draw on self-control for one task, they will perform worse on a subsequent task that requires self-control. Both concepts are still being explored, but in light of more stringent tests, Inzlicht and many others no longer take their validity for granted.
Witnessing the rapid erosion of confidence in some of social psychology’s most popular findings—and in his own life’s work—Inzlicht opened up about his distress and the need for change. In 2016, he wrote an online article, “The Replication Crisis Is My Crisis,” for Undark Magazine. “Sometimes,” he begins, “I wonder if I should be fixing myself more to drink.” There is an insistence in his voice today as he sits on his couch and recalls that time. “I’m a scientist,” he says. “I care about truth. And it makes me sad to have to think that I’ve worked for 20 years trying to approach truth, and it’s for naught.”
Also troubling to Inzlicht was the refusal of others to confront the issue. Some older, widely published psychologists have not only challenged the import of replication failures, they’ve also questioned the competence and motives of those behind the attempts. “Famous people were saying, ‘Nothing to see here, folks,’” Inzlicht says. “I just got fed up. At one point I looked at myself in the mirror and said, ‘Maybe I can be a person who speaks up.’”
Among the most urgent questions Inzlicht raised: How had psychologists canonized findings that might not be real? And what could be done to stop the problem? The answers to both are complex. Scientists who believe there is a need for methodological reform contend that the risk of false positives is embedded in long-accepted practices. Like researchers in other fields, psychologists have powerful incentives to see their results published so as to advance their careers. Flexibility in the ways they are allowed to analyze and report data or define a hypothesis makes it easier to present results as though they clearly support one’s predictions, even if the truth isn’t so cut-and-dried.
Say a psychologist breaks his sample of participants into three groups and expects that each will show certain differences. But the expected outcome doesn’t materialize for one group—undercutting the hypothesis—so he leaves that group out of the paper, creating the perception that the findings are more straightforward than they really are. Or perhaps, after seeing the initial results, he decides to omit certain variables or outliers from the analysis. In psychology and other fields, a p-value is the probability of obtaining results at least as extreme as those observed if no real effect exists; by convention, results with a p-value below .05 are deemed statistically significant. Today, it’s clear that tweaks like these can render results seriously misleading and turn out artificially “significant” p-values.
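A short simulation makes the danger concrete. The sketch below is a hypothetical illustration, not a reconstruction of any actual study: it generates data containing no real effect, then compares a single pre-planned test against a “flexible” analysis that tries several defensible-looking variants (trimming outliers, an early peek at partial data) and keeps whichever p-value looks best. A simple z-test approximation stands in for the t-tests psychologists typically run.

```python
import random
from statistics import NormalDist, mean, stdev

def p_value(a, b):
    """Approximate two-sample z-test p-value (reasonable for n >= 30)."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def run_null_study(flexible, n=40):
    """Both groups come from the same distribution: any 'effect' is noise."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    if not flexible:
        return p_value(a, b)  # one pre-planned analysis
    # Flexible analysis: try several defensible-looking variants and
    # (implicitly) report whichever produces the smallest p-value.
    trim = lambda xs: [x for x in xs if abs(x) < 2]  # "remove outliers"
    peek = p_value(a[:25], b[:25])                   # "early look at the data"
    return min(p_value(a, b), p_value(trim(a), trim(b)), peek)

random.seed(11)
sims = 2000
strict = sum(run_null_study(False) < 0.05 for _ in range(sims)) / sims
flex = sum(run_null_study(True) < 0.05 for _ in range(sims)) / sims
print(f"false-positive rate, single planned analysis: {strict:.3f}")
print(f"false-positive rate, best of three analyses:  {flex:.3f}")
```

The strict analysis produces false positives at roughly the nominal 5 percent rate; the flexible one produces noticeably more, even though every individual analytic choice looks reasonable on its own.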
“I engaged in some of those practices,” Inzlicht admits. “I’m not proud to say it, but it’s true. I know that other people engaged in those practices. I saw with my own eyes instructors at my Ivy League university instructing us to do this. At first, I thought, Oh, you’re allowed to do that? That’s standard operating procedure?”
Another problem for psychology is that the record of published research largely shows studies that went according to plan and leaves out the ones that didn’t. “Publication bias” occurs when scientists do not report results that fall short of statistical significance—or journals decline to publish them.
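The distortion that publication bias creates can be illustrated with a rough simulation. Assuming a weak true effect and a long run of small studies (all numbers here are invented for illustration), filtering the record down to “significant” results systematically inflates the effect size that readers of the literature would see:

```python
import random
from statistics import NormalDist, mean, stdev

def run_study(n=30, effect=0.15):
    """One small study of a weak true effect; returns (estimate, p-value)."""
    a = [random.gauss(0, 1) for _ in range(n)]       # control group
    b = [random.gauss(effect, 1) for _ in range(n)]  # treated group
    est = mean(b) - mean(a)
    se = (stdev(a) ** 2 / n + stdev(b) ** 2 / n) ** 0.5
    p = 2 * (1 - NormalDist().cdf(abs(est / se)))
    return est, p

random.seed(16)
all_estimates, published = [], []
for _ in range(2000):
    est, p = run_study()
    all_estimates.append(est)
    if p < 0.05:  # only 'significant' results make it into journals
        published.append(est)

print("true effect:                        0.15")
print(f"average estimate, all studies:      {mean(all_estimates):.2f}")
print(f"average estimate, published record: {mean(published):.2f}")
```

The average across all studies hovers near the true effect, but the published subset, being the studies that happened to overshoot, reports a much larger one.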
To steer clear of such distortions, one of the tools Inzlicht and others have embraced is preregistration, which involves creating an archived record of what one plans to do in a study—before knowing how it will turn out. In a related format called Registered Reports, a journal agrees in advance to publish the results of a transparently mapped-out study, even if they come back negative.
When the goal is to confirm the existence of a finding, tying one’s own hands in this way can help ensure that a “significant” result is genuine, not an illusion produced by overly flexible analysis. “That’s really gratifying. It’s frustrating, too, because you realize how hard science is,” Inzlicht says. “There are a lot of null results out there. I’ve got a lot of bad ideas, it turns out. But because I want the results I publish to lean toward the truth, I’m happy with that.”
As a scientist in his mid-40s, Inzlicht is conscious that he has decades ahead of him to conduct more rigorous research. He now has more confidence in his own results. And he is more optimistic about the future, he says, than he was a few years ago. “I have little doubt that in 10 years, psychology will look different than it does today.”
Simine Vazire, personality psychologist at the University of California, Davis
It wasn’t hard for Simine Vazire to accept that findings she had read about in psychology textbooks might actually be false. “I don’t think I fell in love with psychology or social science because of the answers,” she says. “I fell in love because of the questions.”
She entered the field of personality psychology with many questions of her own: How well do people know themselves? What are the consequences of self-knowledge? Where does it come from? How can you increase it? “I started off thinking, I’m going to answer Questions A, B, C, and D,” says the soft-spoken Vazire, now a professor of psychology at the University of California, Davis. The failure of key findings to hold up under scrutiny reframed her ambitions: It exposed how much still needed to be tested and confirmed. Today, she realizes, “If I even make a little bit of progress on Question A in my lifetime, that would be amazing.” The shift in perspective, she says, was like thinking one could advance astrophysics with a backyard telescope, only to realize that the endeavor requires a phenomenally expensive laboratory.
The upgrades that many psychologists seek are no less foundational. They include increased numbers of study participants—larger sample sizes can help researchers to better discern whether a difference between two groups of participants, for example, is meaningful or just a product of chance. Vazire also supports the use of more thoroughly tested psychological measures. A valid questionnaire (she points to the most recent version of the Big Five Inventory, used to measure personality traits) will, for example, produce similar assessments when taken more than once, and its results will correlate with those of other relevant measures. One thing such changes have in common? They require additional time—and patience.
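The payoff from larger samples can be sketched in a few lines of simulation. Assuming a modest true effect (a standardized difference of 0.3, a value chosen purely for illustration) and the same z-test approximation throughout, the chance of detecting the effect, what statisticians call power, rises sharply with sample size:

```python
import random
from statistics import NormalDist, mean, stdev

def p_value(a, b):
    """Approximate two-sample z-test p-value."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def power(n, effect=0.3, sims=1000):
    """Fraction of simulated studies that detect a true but modest effect."""
    hits = 0
    for _ in range(sims):
        control = [random.gauss(0, 1) for _ in range(n)]
        treated = [random.gauss(effect, 1) for _ in range(n)]
        hits += p_value(control, treated) < 0.05
    return hits / sims

random.seed(23)
results = {n: power(n) for n in (20, 80, 200)}
for n, pw in results.items():
    print(f"n = {n:3d} per group: chance of detecting the effect = {pw:.2f}")
```

With 20 participants per group, most studies of a real but modest effect come back empty-handed; the same study run with hundreds of participants detects it most of the time.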
Since the replication crisis erupted, Vazire has emerged as one of the leading voices trying to convince colleagues that the added effort is worth it. In 2015, the year University of Virginia psychologist Brian Nosek and the Open Science Collaboration reported attempts to replicate a collection of psychology studies with troublingly mixed results, Vazire conferred with Nosek about launching a new organization. In 2016, the Society for the Improvement of Psychological Science (SIPS) held its inaugural conference.
SIPS has spawned or supported a variety of projects aimed at making psychological research more robust. A SIPS conference helped spark the Psychological Science Accelerator, a crowdsourced initiative, led by psychologist Chris Chartier, to bring greater statistical firepower to hypothesis testing by involving dozens of researchers across the world at the same time. SIPS has an official journal, Collabra: Psychology, that strongly encourages authors to publicly share the data that support their claims. And together with the influential Center for Open Science, headed by Nosek, it cocreated PsyArXiv, a website where psychologists can post papers that have yet to be—or may never be—printed in conventional, peer-reviewed journals, making their results accessible to other scientists.
At the same time, Vazire has served as the top editor of one of those traditional journals, Social Psychological and Personality Science. Being a gatekeeper put her scientific ideals to the test. As editor, Vazire moved for a change in policy so that attempts to replicate prior findings—which have not historically been favored by journals—would be explicitly described as an accepted type of submission. The journal also began asking submitters to confirm that they had reported important details, such as all of the measurements they had taken and whether they had excluded any data from their analyses. “Openness is important for getting to the right answer faster,” she explains, “which means being open so that if you’re wrong, everyone can help fix that.”
When she talks about her own work, Vazire is cautious. A project she started nearly a decade ago recently yielded a paper with her Ph.D. student Jessie Sun. “Do people know when they’re being jerks, basically,” is how she describes one of the research questions. Their tentative answer: Not so much. “We would need 10 or 20 more studies” with different samples and approaches, she says, to be sure.
“I think the competitor to actual progress, the decoy, is the feeling of having made a discovery that you sincerely believe, but you did it in such a way that it has a high chance of being a false positive,” she says. Such would-be discoveries may capture popular attention and the interest of colleagues. But if they turn out to be false, what is any of it besides a distraction? “If we raise the standards and those discoveries no longer turn up, then they were never there to begin with,” Vazire says. “That feels like a loss, but actually it’s a gain.”
Andrew Gelman, statistician and political scientist at Columbia University
When the line between true and false is clouded, blunt talk can be clarifying. For years, Andrew Gelman has served up withering criticism of a parade of “sloppy” and “silly” studies—assigning them scathing nicknames and deeming some so poorly designed that he pronounces them “dead on arrival.”
“Himmicanes” is the term the Columbia University professor used for a 2014 paper, published in the Proceedings of the National Academy of Sciences, that suggested hurricanes with more masculine names caused fewer deaths because they were taken more seriously. Gelman and other critics soon undercut that conclusion by highlighting shortcomings in the analysis. Another target was “power posing”—the idea, articulated by Harvard researcher Amy Cuddy and popularized by her TED Talk, that adopting a dominant physical stance can embolden individuals on both psychological and physiological levels. Efforts to replicate the effects of power posing do not support the most surprising claims—that the poses influence hormone levels and encourage risk-taking behavior. One of the original paper’s authors has disowned it.
“If prominent work is in error,” Gelman explains, “it suggests systemic problems.” As the scale of the problems became apparent, he took to his blog—and penned articles in journals and media outlets—to hold up errors in methods and interpretation as teachable examples. These academic autopsies offer a jarring but rich education in the pitfalls of faulty research. He commonly evokes, for example, the “garden of forking paths,” a concept he developed with psychologist Eric Loken (with a nod to Borges), which illustrates how well-meaning researchers can arrive at “statistically significant” but spurious results when they have many potential ways to analyze their data.
While Gelman’s own research topics are eclectic—they have included the death penalty and cancer risk from exposure to radioactive gas—trends in voting and polling are among his main interests. So when, in 2013, he encountered the claim that ovulation could substantially influence women’s voting intentions, he hit the brakes. “Given that surveys find very few people switching their vote preferences during the campaign for any reason,” he wrote on his blog after the paper came out, “I just don’t buy it.”
In a paper on “forking paths,” Gelman and Loken explained that in the ovulation and voting research, the researchers had plenty of variables to work with. They ended up finding an association between ovulation and support for Barack Obama in a combined group of nondating and dating women. Among women in several categories of more committed relationships, ovulation was linked with support for Mitt Romney.
Importantly, however, the original team had the opportunity to inspect many other links between variables. If, for example, only the nondating participants who were ovulating happened to show more support for Obama, the researchers might have devised a plausible explanation for that pattern. Focusing on that analytical pathway may have seemed reasonable in hindsight. But results that meet the conventional bar for statistical significance can emerge simply by chance; given enough options for testing the data, seemingly “significant” results will likely turn up. When researchers have so many ways to “declare victory,” as Gelman puts it, results can be less meaningful than they appear.
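The arithmetic behind the forking-paths worry is easy to demonstrate. In the hypothetical simulation below there is no real effect at all, yet testing the same comparison separately in eight arbitrary subgroups (as one might by relationship status, age bracket, survey wave, and so on) turns up at least one “significant” result roughly a third of the time:

```python
import random
from statistics import NormalDist, mean, stdev

def p_value(a, b):
    """Approximate two-sample z-test p-value."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def any_fork_significant(n=50, forks=8):
    """Run the same null comparison in several arbitrary subgroups;
    report whether any one of them crosses the p < .05 threshold."""
    for _ in range(forks):
        a = [random.gauss(0, 1) for _ in range(n)]
        b = [random.gauss(0, 1) for _ in range(n)]
        if p_value(a, b) < 0.05:
            return True
    return False

random.seed(35)
sims = 1000
rate = sum(any_fork_significant() for _ in range(sims)) / sims
print(f"chance of at least one 'significant' fork: {rate:.2f}")
print(f"expected if the eight tests were independent: {1 - 0.95 ** 8:.2f}")
```

Each individual test keeps its 5 percent false-positive rate; it is the freedom to pick among eight of them after the fact that makes a spurious “discovery” likely.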
Gelman derided the ovulation study as “headline bait” but notes that misleading conclusions don’t require willful mischief on the part of scientists.
A mix of indignation and fair-mindedness may help account for the impact of Gelman’s criticism. Though he regularly points out the ways psychologists are mistaken, he readily acknowledges that he himself has published work that he later retracted. “I don’t want to go around saying people are doing things that are immoral or whatever, because what do I know? It’s not like I have moral authority in some way,” he says during an interview in a park near Columbia, where he is coaching seventh-graders in Frisbee, intermittently calling out pointers and praise. “I don’t think it’s about the people; it’s really about the work.” Some psychologists have protested Gelman’s acerbic treatment of their studies, including a tendency to repeatedly call out errant researchers—one that he appears to have scaled back. (“It seems that using names has detracted from the message,” he says.)
Beyond breaking down avoidable statistical errors, Gelman is also pulling for changes to practices as fundamental as how psychologists measure what they observe. “‘Good measurements’ sounds like ‘apple pie’—something everyone wants,” he says. “But people often take bad measurements because they are cheaper and take less effort and because a lot of people’s statistical theory doesn’t include it.” One method he advocates is within-person comparison, which means measuring the same individuals more than once—after receiving two different types of treatment, say—rather than comparing different treatment groups.
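A toy simulation suggests why within-person comparison can be so much more sensitive. Assuming each participant has a stable personal baseline that is larger than the treatment effect (all numbers here are hypothetical), the paired design cancels that baseline out of each person’s difference score, while the between-group design must overcome it as noise:

```python
import random
from statistics import NormalDist, mean, stdev

def one_sample_p(xs):
    """Approximate one-sample z-test of the mean against zero."""
    z = mean(xs) / (stdev(xs) / len(xs) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def two_sample_p(a, b):
    """Approximate two-sample z-test p-value."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def compare_designs(n=40, effect=0.3, sims=1000):
    within = between = 0
    for _ in range(sims):
        # Each person has a stable baseline that dwarfs the treatment effect.
        base = [random.gauss(0, 1) for _ in range(n)]
        control_scores = [b + random.gauss(0, 0.5) for b in base]
        treated_scores = [b + effect + random.gauss(0, 0.5) for b in base]
        # Within-person: the baseline cancels in each person's difference.
        diffs = [t - c for c, t in zip(control_scores, treated_scores)]
        within += one_sample_p(diffs) < 0.05
        # Between-person: a separate group receives the treatment,
        # so baseline variation remains in the comparison as noise.
        other = [random.gauss(0, 1) + effect + random.gauss(0, 0.5)
                 for _ in range(n)]
        between += two_sample_p(control_scores, other) < 0.05
    return within / sims, between / sims

random.seed(38)
w, b = compare_designs()
print(f"within-person design detects the effect:  {w:.2f} of the time")
print(f"between-person design detects the effect: {b:.2f} of the time")
```

With the same number of participants, the within-person design detects the effect far more often, which is the kind of cheap gain in rigor Gelman argues measurement-minded researchers leave on the table.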
“Andrew is at heart a mathematician—he is brilliant at seeing patterns in data,” offers Susan Gelman, his sister and a professor of psychology and linguistics at the University of Michigan. She points out that in a 2016 New York Times experiment in which various analysts used the same data to estimate which presidential candidate was leading in Florida, Gelman’s team was alone in judging that the results favored Donald Trump. “He goes where the numbers lead, regardless of what he might personally want to see,” she says.
With his distaste for false confidence, Gelman has helped to pick apart a considerable trail of misbegotten findings. Does he think psychologists more often think twice about cutting methodological corners—lest they fall into the sights of an outspoken statistician? “There are so many papers out there,” he counters. “Being afraid of me is not very rational.”
The Next Generation
Julia Rohrer, Ph.D. candidate at the International Max Planck Research School on the Life Course and the University of Leipzig
Had she entered the ranks of psychologists a decade ago, “I can imagine that it would have been a bit easier on the emotional side,” says Julia Rohrer, a German graduate student at the International Max Planck Research School on the Life Course and the University of Leipzig who plans to earn her Ph.D. in June. “You wouldn’t be constantly questioning everything you’re trying to build on.” Still, she notes, there have always been people seeking to carry out more rigorous and open science. When Rohrer was an undergraduate, one of those people, a faculty member, began to send her papers that exposed weaknesses in psychology’s research methods—alerting her to a growing tension in the field.
“I decided for myself that these things have to be tackled before we can progress,” says Rohrer, one of many young researchers approaching psychology’s past from a place of skepticism.
Her work has already helped to illuminate matters such as whether birth order—being an older or younger sibling—affects personality. Before she had earned a master’s degree, Rohrer had co-authored a published paper on the subject. “We went in thinking that people have tried to answer this question, but the research literature is a huge mess. Now we have better data, so let’s look at it again.” Reading through previous studies, she says, she noticed that “findings were all over the place, but people also used all types of different methods, and hypotheses were constantly shifting, even though they are supposed to test the same theory.” She and two other researchers analyzed data from thousands of people in three Western countries, taking steps to address the limitations of earlier studies. They confirmed that first-born children in those places tended to score slightly higher on tests of intelligence, but they found no evidence of previously suggested effects on personality traits.
Rohrer has since co-authored multiple papers exploring personality, well-being, and scientific methods. “I’m trying to do my research the best possible way I can according to my understanding,” she says. “I’ve adjusted my scientific practices to, for example, take into account that things can be calculated in different ways and to do all sorts of robustness checks to make sure I’m not fooling myself.”
She has also helped to encourage openness among scientists as a collaborator on the Loss-of-Confidence Project, which she joined in 2017. Its aim has been to encourage psychologists who no longer believe in particular studies they themselves conducted to come out and say so in writing. In a recent paper, “Putting the Self in Self-Correction,” the authors of six papers that had been published between 2005 and 2015 provided statements explaining why they had lost confidence in their findings.
“People are willing to collectively say, ‘A lot of sh*t happened—what do we do now?’ But people are still very reluctant to admit that some things might have gone wrong in their own research,” Rohrer says. “Psychologists now talk a lot about incentives and how we have to give people badges to acknowledge that they are doing things right and so on.” She’s alluding to the Open Science Badges, created by the Center for Open Science and now used by more than 50 journals to certify that the authors of a paper have, for example, preregistered one of their studies or made their data accessible. At present, Rohrer explains, the prospect of losing respect likely discourages psychologists from proactively owning their mistakes; destigmatizing and demonstrating the value of such acts of self-correction could change the incentives.
At multiple points, Rohrer says, she has thought about leaving the field—even, at one point, starting an undergraduate degree in computer science. After a past president of the Association for Psychological Science, Princeton psychologist Susan Fiske, wrote an article denouncing “vigilante critiques” and “methodological terrorism,” Rohrer and other reform-minded, early-career researchers banded together on Twitter and other online platforms, seeking a way forward. “It was really just supporting one another, discussing issues, and maybe not feeling that isolated,” she says.
Her decision to remain in psychology was informed in part by a paper from Paul E. Smaldino and Richard McElreath. In “The Natural Selection of Bad Science,” they made a case that, given the demand in science for frequent publication, research labs that were highly productive but used problematic methods were bound to be more influential over time than more careful labs. “If the skeptical people leave,” Rohrer says, “then we have the issue that only the people who do not see the problem stay.”
That is an outcome she and others hope to prevent. As she looks at the field today, Rohrer sees a foundation of insights to build on, but plenty for a new generation of scientists—one with updated norms and fresh ideas—to scrutinize and improve. “There are things we thought we knew, but now we’re not sure,” she says. “We can try to figure them out properly this time.”