Just how prevalent is this kind of HARKing? Norbert Kerr’s survey of 156 psychologists in 1998 suggested that about 40 percent of respondents had observed HARKing by other researchers; strikingly, the surveyed psychologists also suspected that HARKing was about 20 percent more prevalent than the classic H-D method.33 A more recent survey of 2,155 psychologists by Leslie John and colleagues estimated the true prevalence rate to be as high as 90 percent despite a self-admission rate of just 35 percent.34
Remarkably, not all psychologists agree that HARKing is a problem. Nearly 25 years before suggesting the existence of precognition, Daryl Bem claimed that if data are “strong enough” then researchers are justified in “subordinating or even ignoring [their] original hypotheses.”35 In other words, Bem argued that it is legitimate to subvert the H-D method, and to do so covertly, in order to preserve the narrative structure of a scientific paper.
Norbert Kerr and others have objected to this point of view, as well they might. First and foremost, because HARKing relies on deception, it violates the fundamental ethical principle that research should be reported honestly and completely. Deliberate HARKing may therefore lie on the same continuum of malpractice as research fraud. Secondly, the act of deception in HARKing leads the reader to believe that an obtained finding was more expected, and hence more reliable, than it truly is—this, in turn, risks distorting the scientific record to place undue certainty in particular findings and theories. Finally, in cases where a post hoc hypothesis is pitted against an alternative account that the author already knows was unsupported, HARKing creates the illusion of competitive hypothesis testing. Since a HARKed hypothesis can, by definition, never be disconfirmed, this contrived scenario further exacerbates confirmation bias.
The Battle against Bias
If confirmation bias is so much a part of human nature then what hope can we have of defeating it in science? In an academic culture that prizes novel results that confirm our expectations, is there any real chance of reform? We have known about the various manifestations of bias in psychology since the 1950s—and have done little to counteract them—so it is easy to see why many psychologists are cynical about the prospect of change. However, the tide is turning. Chapter 8 will address the set of changes we must make—and are already launching—to protect psychological science against bias and the other “deadly sins” that have become part of our academic landscape. Some of these reforms are already bearing fruit.
Our starting point for any program of reform must be the acceptance that we can never completely eliminate confirmation bias—in Nietzsche’s words we are human, all too human. Decades of psychological research shows how bias is woven into the fabric of cognition and, in many situations, operates unconsciously. So, rather than waging a fruitless war on our own nature, we would do better to accept imperfection and implement measures that protect the outcome of science as much as possible from our inherent flaws as human practitioners.
One such protection against bias is study preregistration. We will return to the details of preregistration in chapter 8, but for now it is useful to consider how publicly registering our research intentions before we collect data can help neutralize bias. Consider the three main manifestations of confirmation bias in psychology: publication bias, conceptual replication, and HARKing. In each case, a strong motivation for engaging in these practices is not to generate high-quality, replicable science, but to produce results that are publishable and perceived to be of interest to other scientists. Journals enforce publication bias because they believe that novel, positive results are more likely to indicate discoveries that their readers will want to see; by comparison, replications and negative findings are considered boring and relatively lacking in intellectual merit. To fit with the demands of journals, psychologists have thus replaced direct replication with conceptual replication, maintaining the comfortable but futile delusion that our science values replication while still satisfying the demands of novelty and originality. Finally, as we have seen, many researchers engage in HARKing because they realize that failing to confirm their own hypothesis is regarded as a form of intellectual failure.
Study preregistration helps overcome these problems by changing the incentive structure to value “good science” over and above “good results.” The essence of preregistration is that the study rationale, hypotheses, experimental methods, and analysis plan are stated publicly in advance of collecting data. When this process is undertaken through a peer-reviewed journal, it forces journal editors to make publishing decisions before results exist. This, in turn, prevents publication bias by ensuring that whether results are positive or negative, novel or familiar, groundbreaking or incremental, is irrelevant to whether the science will be published. Similarly, since authors will have stated their hypotheses in advance, preregistration prevents HARKing and ensures adherence to the H-D model of the scientific method. As we will see in chapter 2, preregistration also prevents researchers from cherry-picking results that they believe generate a desirable narrative.
In addition to study preregistration, bias can be reduced by reforming statistical practice. As discussed earlier, one reason negative findings are regarded as less interesting is our cultural reliance on null hypothesis significance testing (NHST). NHST can only ever tell us whether the null hypothesis is rejected, and never whether it is supported. Our reliance on this one-sided statistical approach inherently places greater weight on positive findings. However, by shifting to alternative Bayesian statistical methods, we can test all potential hypotheses (H0, H1 … Hn) fairly as legitimate possible outcomes. We will explore this alternative method in more detail in chapter 3.
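To give a flavor of how this works in practice, the short sketch below applies a standard BIC-based approximation to the Bayes factor for a simple one-sample test of "mean = 0" versus "mean free to vary." The data and numbers are invented purely for illustration, and a real analysis would normally use dedicated Bayesian software, but the example shows how the same framework can express evidence for H0 just as readily as evidence against it.

```python
# Illustrative sketch: a BIC-based approximation to the Bayes factor comparing
# H0 (population mean = 0) with H1 (population mean free to vary).
# All data and numbers are invented for demonstration.
import numpy as np

def bic_gaussian(residual_ss, n, n_free_params):
    """BIC for a Gaussian model fitted by maximum likelihood."""
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(residual_ss / n) + 1)
    return -2 * log_lik + n_free_params * np.log(n)

def approximate_bf01(x):
    """Approximate Bayes factor in favor of H0 (mean = 0) over H1 (mean free)."""
    n = len(x)
    bic_h0 = bic_gaussian(np.sum(x ** 2), n, 1)               # free parameter: variance
    bic_h1 = bic_gaussian(np.sum((x - x.mean()) ** 2), n, 2)  # free: mean and variance
    return np.exp((bic_h1 - bic_h0) / 2)

rng = np.random.default_rng(1)
null_sample = rng.normal(0.0, 1.0, size=50)     # data generated with no true effect
effect_sample = rng.normal(0.5, 1.0, size=50)   # data generated with a genuine effect

# BF01 > 1 indicates evidence favoring H0; BF01 < 1 indicates evidence favoring H1.
print(approximate_bf01(null_sample))
print(approximate_bf01(effect_sample))
```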
As we take this journey it is crucial that individual scientists from every level feel empowered to promote reform without damaging their careers. Confirmation bias is closely allied with “groupthink”—a pernicious social phenomenon in which a consensus of behavior is mistaken for a convergence of informed evidence. The herd doesn’t always make the most rational or intelligent decisions, and groupthink can stifle innovation and critical reflection. To ensure the future of psychological science, it is incumbent on us as psychologists to recognize and challenge our own biases.
CHAPTER 2
The Sin of Hidden Flexibility
Torture numbers and they will confess to anything.
—Gregg Easterbrook, 1999
In 2008, British illusionist Derren Brown presented a TV program called The System in which he claimed he could predict, with certainty, which horse would win at the racetrack. The show follows Khadisha, a member of the public, as Brown provides her with tips on upcoming races. In each case the tips pay off, and after succeeding five times in a row Khadisha decides to bet as much money as possible on a sixth and final race. The twist in the program is that Brown has no system—Khadisha is benefiting from nothing more than chance. Unknown to her until after placing her final bet, Brown initially recruited 7,776 members of the public and provided each of them with a unique combination of potential winners. Participants with a losing horse were successively eliminated at each of six stages, eventually leaving just one participant who had won every time—and that person just happened to be Khadisha. By presenting the story from Khadisha’s perspective, Brown created the illusion that her winning streak was too unlikely to be random—and so must be caused by The System—when in fact it was explained entirely by chance.
Unfortunately for science, the hidden flexibility that Brown used to generate false belief in The System is the same mechanism that psychologists exploit to produce results that are attractive and easy to publish. Faced with the career pressure to publish positive findings in the most prestigious and selective journals, it is now standard practice for researchers to analyze complex data in many different ways and report only the most interesting and statistically significant outcomes. Doing so deceives the audience into believing that such outcomes are credible, rather than existing within an ocean of unreported negative or inconclusive findings. Any conclusions drawn from such tests will, at best, overestimate the size of any real effect. At worst they could be entirely false.
By torturing numbers until they produce publishable outcomes, psychology commits our second mortal sin: that of exploiting hidden analytic flexibility. Formally, hidden flexibility is one manifestation of the “fallacy of incomplete evidence,” which arises when we frame an argument without taking into account the full set of information available. Although hidden flexibility is itself a form of research bias, its deep and ubiquitous nature in psychology earns it a dedicated place in our hall of shame.
p-Hacking
As we saw earlier, the dominant approach for statistical analysis in psychological science is a set of techniques called null hypothesis significance testing (NHST). NHST estimates the probability of an obtained positive effect, or one greater, being observed in a set of data if the null hypothesis (H0) is true and no effect truly exists. Importantly, the p value doesn’t tell us the probability of H0 itself being true, and it doesn’t indicate the size or reliability of the obtained effect—instead what it tells us is how surprised we should be to obtain the current effect, or one more extreme, if H0 were to be true.1 The smaller the p value, the greater our surprise would be and the more confidently we can reject H0.
Since the 1920s, the convention in psychology has been to require a p value of less than .05 in order to categorically reject H0. This significance threshold is known as α—the probability of falsely declaring a positive effect when, in fact, there isn’t one. Under NHST, a false positive or Type I error occurs when we incorrectly reject a true H0. The α threshold thus indicates the maximum allowable probability of a Type I error in order to reject H0 and conclude that a statistically significant effect is present.
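To see what the α threshold means in practice, the brief simulation below (a sketch with invented numbers, not taken from any study discussed here) runs a two-group experiment thousands of times with H0 true and counts how often a t test crosses the .05 line.

```python
# Illustrative sketch: with H0 true, roughly 5 percent of significance tests
# will still come out "significant" purely by chance (the Type I error rate).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n_per_group, n_simulations = 0.05, 30, 10_000

false_positives = 0
for _ in range(n_simulations):
    group_a = rng.normal(0.0, 1.0, n_per_group)   # both groups are drawn from the
    group_b = rng.normal(0.0, 1.0, n_per_group)   # same population, so H0 is true
    _, p = stats.ttest_ind(group_a, group_b)
    if p < alpha:
        false_positives += 1

print(false_positives / n_simulations)   # close to 0.05, as the alpha level dictates
```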
Why is α set to .05, you might ask? The .05 convention is arbitrary, as noted by Ronald Fisher—one of the architects of NHST—nearly a century ago:
If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level.2
Setting the α threshold to .05 theoretically allows up to 1 in 20 false rejections of H0 across a set of independent significance tests. Some have argued that this threshold is too liberal and leads to a scientific literature built on weak findings that are unlikely to replicate.3 Furthermore, even if we believe that it is acceptable for 5 percent of statistically significant results to be false positives, the truth is that exploiting analytic flexibility pushes the actual rate of false positives well beyond this nominal level.
This flexibility arises because researchers make analytic decisions after inspecting their data and are faced with many analysis options that can be considered defensible yet produce slightly different p values. For instance, given a distribution of reaction time values, authors have the option of excluding statistical outliers (such as very slow responses) within each participant. They also have the option of excluding entire participants on the same basis. If they decide to adopt either or both of these approaches, there are then many available methods they could use, each of which could produce slightly different results. As well as being flexible, a key feature of such decisions is that they are hidden and never published. The rules of engagement do not require authors to specify which analytic decisions were a priori (confirmatory) and which were post hoc (exploratory)—in fact, such transparency is likely to penalize authors competing for publication in the most prestigious journals. This combination of culture and incentives inevitably leads to all analyses being portrayed as confirmatory and hypothesis driven even where many were exploratory. In this way, authors can generate a product that is attractive to journals while also maintaining the illusion (and possibly delusion) that they have adhered to the hypothetico-deductive model of the scientific method.
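The reaction time example can be made concrete. In the sketch below the data and exclusion rules are invented for illustration, but they show how a handful of equally "defensible" outlier policies produces a menu of slightly different p values from the very same null data, of which only the most flattering need ever be reported.

```python
# Illustrative sketch: the same null reaction-time data analyzed under several
# plausible outlier-exclusion rules, each yielding a different p value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Log-normal reaction times (in ms) for two conditions with no true difference.
cond_a = rng.lognormal(mean=6.3, sigma=0.4, size=40)
cond_b = rng.lognormal(mean=6.3, sigma=0.4, size=40)

rules = {
    "no exclusions": lambda x: x,
    "drop RTs above 1500 ms": lambda x: x[x < 1500],
    "drop RTs beyond 2.5 SD": lambda x: x[np.abs(x - x.mean()) < 2.5 * x.std()],
    "drop RTs beyond 2 SD": lambda x: x[np.abs(x - x.mean()) < 2 * x.std()],
}

for name, rule in rules.items():
    p = stats.ttest_ind(rule(cond_a), rule(cond_b)).pvalue
    print(f"{name}: p = {p:.3f}")
```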
The decision space in which these exploratory analyses reside is referred to as “researcher degrees of freedom.” Beyond the exclusion of outliers, it can include decisions such as which conditions to enter into a wider analysis of multiple factors, which covariates or regressors to take into account, whether or not to collect additional participants, and even how to define the dependent measure itself. In even the simplest experimental design, these pathways quickly branch out to form a complex decision tree that a researcher can navigate either deliberately or unconsciously in order to generate statistically significant effects. By selecting the most desirable outcomes, it is possible to reject H0 in almost any set of data—and by combining selective reporting with HARKing (as described in chapter 1) it is possible to do so in favor of almost any alternative hypothesis.
Exploiting researcher degrees of freedom to generate statistical significance is known as “p-hacking” and was brought to prominence in 2011 by Joe Simmons, Leif Nelson, and Uri Simonsohn from the University of Pennsylvania and University of California Berkeley.4 Through a combination of real experiments and simulations, Simmons and colleagues showed how selective reporting of exploratory analyses can generate meaningless p values. In one such demonstration, the authors simulated a simple experiment involving one independent variable (the intervention), two dependent variables (behaviors being measured), and a single covariate (gender of the participant). The simulation was configured so that there was no effect of the manipulation, that is, H0 was predetermined to be true. They then simulated the outcome of the experiment 15,000 times and asked how often at least one statistically significant effect was observed (p < .05). Given that H0 was true in this scenario, a nominal rate of 5 percent false positives can be assumed at α = .05. The central question posed by Simmons and colleagues was what would happen if they embedded hidden flexibility within the analysis decisions. In particular, they tested the effect of analyzing either of the two dependent variables (with a positive result obtained on either one), including gender as a covariate or not, increasing the number of participants after inspecting the results, and dropping one or more conditions. Allowing maximal combinatorial flexibility between these four options increased the false positive rate from 5 percent to an alarming 60.7 percent.
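A simplified simulation in the same spirit is sketched below. The design and numbers are illustrative assumptions rather than the parameters used by Simmons and colleagues, but the pattern is the same: give the analyst a free choice among dependent variables and among subsets of conditions, and the chance of finding at least one p < .05 under a true H0 climbs far beyond the nominal 5 percent.

```python
# Illustrative sketch of combinatorial analytic flexibility under a true H0:
# two correlated dependent variables, three conditions, and an analyst who is
# free to report whichever comparison happens to reach p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_cell, n_simulations = 20, 5_000

def simulate_condition(n):
    """Two correlated DVs for one condition; no true effect anywhere."""
    dv1 = rng.normal(0, 1, n)
    dv2 = 0.5 * dv1 + np.sqrt(1 - 0.5 ** 2) * rng.normal(0, 1, n)
    return dv1, dv2

def any_significant(control, treatments):
    """Try every combination of DV choice and condition choice."""
    dv_choices = (lambda d: d[0], lambda d: d[1], lambda d: (d[0] + d[1]) / 2)
    for treatment in treatments:
        for pick in dv_choices:
            if stats.ttest_ind(pick(control), pick(treatment)).pvalue < 0.05:
                return True
    return False

hits = 0
for _ in range(n_simulations):
    control = simulate_condition(n_per_cell)
    low, high = simulate_condition(n_per_cell), simulate_condition(n_per_cell)
    pooled = tuple(np.concatenate(pair) for pair in zip(low, high))
    # "Flexible" comparisons: control versus low, versus high, or versus both combined.
    if any_significant(control, [low, high, pooled]):
        hits += 1

print(hits / n_simulations)   # well above the nominal 0.05
```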
As striking as this is, 60.7 percent is probably still an underestimate of the true rate of false positives in many psychology experiments. Simmons and colleagues didn’t even include other common forms of hidden flexibility, such as variable criteria for excluding outliers or conducting exploratory analyses within subgroups (e.g., males only or females only). Their simulated experimental design was also relatively simple and produced only a limited range of researcher degrees of freedom. In contrast, many designs in psychology are more complicated and will include many more options. Using a standard research design with four independent variables and one dependent variable, psychologist Dorothy Bishop has shown that at least one statistically significant main effect or interaction can be expected by chance in more than 50 percent of analyses—an order of magnitude higher than the conventional α threshold.5 Crucially, this rate of false positives occurs even without exploiting the researcher degrees of freedom illustrated by Simmons and colleagues. Thus, where p-hacking occurs in more complex designs it is likely to render the obtained p values completely worthless.
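A back-of-envelope calculation shows where a figure above 50 percent comes from, if we assume the individual tests in such an analysis behave roughly independently at α = .05.

```python
# A four-way ANOVA tests 4 main effects plus 11 interactions: 15 terms in all.
n_terms = 2 ** 4 - 1
p_any_false_positive = 1 - 0.95 ** n_terms
print(n_terms, round(p_any_false_positive, 3))   # 15 terms, roughly a 0.54 chance
```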
One key source of hidden flexibility in the simulations by Simmons and colleagues was the option to add participants after inspecting the results. There are all kinds of reasons why researchers peek at data before data collection is complete, but one central motivation is efficiency: in an environment with limited resources it can often seem sensible to stop data collection as soon as all-important statistical significance is either obtained or seems out of reach. This temptation to peek and chase p < .05 is of course motivated by the fact that psychology journals typically require the main conclusions of a paper to be underpinned by statistically significant results. If a critical statistical test returns p = .07, the researcher knows that reviewers and editors will regard the result as weak and unconvincing, and that the paper has little chance of being published in a competitive journal. Many researchers will therefore add participants in an attempt to nudge the p value over the line, without reporting in the published paper that they did so.
This kind of behavior may seem rational within a publishing system where what is best for science conflicts with the incentives that drive individual scientists. After all, if scientists are engaging in these practices then it is surely because they believe they have no other choice in the race for jobs, grant funding, and professional esteem. Unfortunately, however, chasing statistical significance by peeking and adding data completely undermines the philosophy of NHST. A central but often overlooked requirement of NHST is that researchers prespecify a stopping rule, which is the final sample size at which data collection must cease. Peeking at data prior to this end point in order to decide whether to continue or to stop increases the chances of a false positive. This is because NHST estimates the probability of the observed data (or more extreme) under the null hypothesis, and since the aggregate data pattern can vary randomly as individual data points are added, repeated hypothesis testing increases the odds that the p value will, by chance alone, fall below the nominated α level (see figure 2.1). This increase in false positives is similar to that obtained when testing multiple hypotheses simultaneously.
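The inflation caused by peeking is easy to demonstrate. In the sketch below the sample sizes and peeking schedule are arbitrary choices for illustration: the same null data are re-tested after every batch of ten added participants, and data collection "stops" the moment significance appears.

```python
# Illustrative sketch: repeatedly testing the same growing dataset under a true H0
# and stopping as soon as p < .05 inflates the false positive rate well above 5 percent.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
alpha, start_n, step, max_n, n_simulations = 0.05, 10, 10, 100, 5_000

false_positives = 0
for _ in range(n_simulations):
    group_a = rng.normal(0, 1, max_n)   # H0 is true: both groups come
    group_b = rng.normal(0, 1, max_n)   # from identical populations
    for n in range(start_n, max_n + 1, step):
        # Peek at the data using only the first n participants per group.
        if stats.ttest_ind(group_a[:n], group_b[:n]).pvalue < alpha:
            false_positives += 1
            break   # stop collecting as soon as "significance" appears

print(false_positives / n_simulations)   # markedly higher than 0.05
```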