The researchers think carefully about their results. After doing some additional reading they learn that error rates can sometimes provide more sensitive measures of attention than reaction times, which would explain why classical music influenced only error rates. They also know that the “target absent” condition is more difficult for participants and therefore perhaps a more sensitive measure of attentional capacity—that would also explain why classical music boosted performance on trials without targets. Finally, they are pleased to see that the benefit of classical music on error rates with 16 distractors goes in the direction predicted by their hypothesis. Therefore, despite the fact that the main effect of classical music is not statistically significant on either of the measures (reaction times or error rates), the researchers conclude that the results support the hypothesis: classical music improves visual attention, particularly under conditions of heightened task difficulty. In the introduction of their article they phrase their hypothesis as, “We predicted that classical music would improve visual search performance. Since misidentification of visual stimuli can provide a more sensitive measure of attention than reaction time (e.g., Smith 2000), such effects may be expected to occur most clearly in error rates, especially under conditions of heightened attentional load or task difficulty.” In the discussion section of the paper, the authors note that their hypothesis was supported and argue that their results conceptually replicate a previous study, which showed that classical music can improve the ability to detect typographic errors in printed text.
What, if anything, did the researchers do wrong in this scenario? Many psychologists would argue that their behavior is impeccable. After all, they didn’t engage in questionable practices such as adding participants until statistical significance was obtained, selectively removing outliers, or testing the effect of various covariates on statistical significance. Moreover, they had an a priori hypothesis, which they tested, and they employed a manipulation check to confirm that their visual search task measured attention. However, as Gelman and Loken point out, the situation isn’t so simple—researcher degrees of freedom have still crept insidiously into their conclusions.
The first problem is the lack of precision in the researchers’ a priori hypothesis, which doesn’t specify the dependent variable that should show the effect of classical music (either reaction time or error rates, or both) and doesn’t state under what conditions that hypothesis would or would not be supported. By proposing such a vague hypothesis, the researchers have allowed any one of many different outcomes to support their expectations, ignoring the fact that doing so inflates the Type I error rate (α) and invites confirmation bias. Put differently, although they have an a priori scientific hypothesis, it is consistent with multiple statistical hypotheses.
The second problem is that the researchers ignore the lack of a statistically significant main effect of the intervention on either of the measures; instead they find a significant interaction between the main manipulation (classical music vs. control music) and the number of distractors (4, 8, 16)—but for error rates only. Since the researchers had no specific hypothesis that the effect of classical music would increase at greater distractor set sizes for error rates only, this result was unexpected. Yet by framing this analysis as though it was a priori, the researchers tested more null hypotheses than were implied in their original (vague) hypothesis, which in turn increases the chances of a false positive.
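To see how quickly this implicit multiplicity inflates α, here is a minimal sketch (not drawn from the study above; it assumes, for simplicity, that the candidate tests are independent, whereas tests on the same data are typically correlated, so the true inflation would be somewhat smaller):

# Minimal sketch: how many "supportive" outcomes a vague hypothesis admits.
# Assumes k candidate tests (e.g., two dependent measures x several effects
# and interactions), all conducted under a true null and treated as
# independent for simplicity.
import numpy as np

rng = np.random.default_rng(1)
alpha, k, n_sims = 0.05, 6, 100_000

# Under H0 every p value is uniform on [0, 1], so each test is
# "significant" with probability alpha.
p_values = rng.uniform(size=(n_sims, k))
any_significant = (p_values < alpha).any(axis=1).mean()

print(f"Analytic familywise rate:  {1 - (1 - alpha) ** k:.3f}")  # ~0.265
print(f"Simulated familywise rate: {any_significant:.3f}")

With six candidate tests, roughly one study in four yields at least one "significant" result even when every null hypothesis is true.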
Third, the researchers engage in HARKing. Even though their general hypothesis was decided before conducting the study, it was then refined based on the data and is presented in the introduction in its refined state. This behavior is a subtle form of HARKing that conflates hypothesis testing with post hoc explanation of unexpected results. Since the researchers did have an a priori hypothesis (albeit vague) they would no doubt deny that they engaged in HARKing. Yet even if blurred in their own recollections, the fact is that they adjusted and refined their hypothesis to appear consistent with unexpected results.
Finally, despite the fact that their findings are less definitive than advertised, the researchers treat them as a conceptual replication of previous work—a corroboration of the general idea that exposure to classical music improves attention. Interestingly, it is in this final stage of the process where the lack of precision in their original hypothesis is most clearly apparent. This practice also highlights the inherent weakness of conceptual replication, which risks constructing bodies of knowledge on weak units of evidence.
Is this scenario dishonest? No. Is it fraudulent? No. But does it reflect questionable research practices? Yes. Unconscious as it may be, the fact is that the researchers in this scenario allowed imprecision and confirmation bias to distort the scientific record. Even among honest scientists, researcher degrees of freedom pose a serious threat to discovery.
Biased Debugging
Sometimes hidden flexibility can be so enmeshed with confirmation bias that it becomes virtually invisible. In 2013, Mark Stokes from the University of Oxford highlighted a situation where an analysis strategy that seems completely sensible can lead to publication of false discoveries.20 Suppose a researcher completes two experiments. Each experiment involves a different method to provide convergent tests for an overarching theory. In each case the data analysis is complicated and requires the researcher to write an automated script. After checking through each of the two scripts for obvious mistakes, the researcher runs the analyses. In one of the experiments the data support the hypothesis indicated by the theory. In the second experiment, however, the results are inconsistent with the theory. Puzzled, the researcher studies the analysis script for the second experiment and discovers a subtle but serious error. Upon correcting the error, the results of the second experiment become consistent with those of the first experiment. The researcher is delighted and concludes that the results of both experiments provide convergent support for the theory in question.
What’s wrong with this scenario? Shouldn’t we applaud the researcher for being careful? Yes and no. On the one hand, the researcher has caught a genuine error and prevented publication of a false conclusion. But note that the researcher didn’t bother to double-check the analysis of the first experiment because the results in that case turned out as expected and—perhaps more importantly—as desired. Only the second experiment attracted scrutiny because it ran counter to expectations, and running counter to expectations was considered sufficient grounds to believe it was erroneous. Stokes argues that this kind of results-led debugging—also termed “selective scrutiny”—threatens to magnify false discoveries substantially, especially if it occurs across an entire field of research.21 And since researchers never report which parts of their code were and were not debugged (and rarely publish the code itself), biased debugging represents a particularly insidious form of hidden analytic flexibility.
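A toy simulation shows how quickly selective scrutiny can distort the record. The sketch below is hypothetical rather than Stokes’s own analysis: the 20 percent bug rate and the assumption that a buggy script returns an essentially random verdict are illustrative choices, but the qualitative conclusion does not depend on them.

# Hypothetical sketch of results-led debugging. Assumptions: the theory is
# false (H0 true), so a correct script is "significant" with probability
# alpha; a scripting bug occurs with probability bug_rate and, when present,
# makes the reported verdict essentially random.
import numpy as np

rng = np.random.default_rng(2)
alpha, bug_rate, n_sims = 0.05, 0.2, 100_000

def run_analysis(buggy):
    # Correct scripts are significant with probability alpha; buggy scripts
    # with probability 0.5 (an illustrative assumption).
    p_sig = np.where(buggy, 0.5, alpha)
    return rng.random(buggy.shape) < p_sig

buggy = rng.random(n_sims) < bug_rate
first_pass = run_analysis(buggy)

# Unbiased debugging: every script is checked, so all bugs get fixed.
unbiased = run_analysis(np.zeros(n_sims, dtype=bool))

# Results-led debugging: only disappointing (nonsignificant) results are
# re-examined; significant results are accepted, bug or no bug.
rechecked = ~first_pass & buggy
fixed_rerun = run_analysis(np.zeros(n_sims, dtype=bool))
final = np.where(rechecked, fixed_rerun, first_pass)

print(f"False-positive rate, unbiased debugging:    {unbiased.mean():.3f}")
print(f"False-positive rate, results-led debugging: {final.mean():.3f}")

Under these assumptions, fixing only the disappointing analyses roughly triples the false-positive rate relative to checking every script.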
Are Research Psychologists Just Poorly Paid Lawyers?
The specter of bias and hidden analytic flexibility inevitably prompts us to ask: what is the job of a scientist? Is it to accumulate evidence as dispassionately as possible and decide, on the weight of that evidence, what conclusion to draw? Or is it to advocate a particular viewpoint, seeking out evidence to support that view? One is the job of a scientist; the other is the job of a lawyer. As psychologist John Johnson from Pennsylvania State University said in a 2013 blog post at Psychology Today:
Scientists are not supposed to begin with the goal of convincing others that a particular idea is true and then assemble as much evidence as possible in favor of that idea. Scientists are supposed to be more like detectives looking for clues to get to the bottom of what is actually going on. They are supposed to be willing to follow the trail of clues, wherever that may lead them. They are supposed to be interested in searching for the critical data that will help decide what is actually true, not just for data that supports a preconceived idea. Science is supposed to be more like detective work than lawyering.22
Unfortunately, as we have seen so far, psychology falls short of meeting this standard. Whether conscious or unconscious, the psychological community tortures data until the numbers say what we want them to say—indeed what many psychologists, deep down, would admit we need them to say in the competition for high-impact publications, jobs, and grant funding. This situation exposes a widening gulf between the needs of the scientists and the needs of science. Until these needs are aligned in favor of science and the public who fund it, the needs of scientists will always win—to our own detriment and to that of future generations.
Solutions to Hidden Flexibility
Hidden flexibility is a problem for any science where the act of discovery involves the accumulation of evidence. This process is less certain in some sciences than in others. If your evidence clearly points one way or the other, such as the discovery of a new fossil or galaxy, then inferences from statistics may be unnecessary. In such cases no analytic flexibility is required, whether hidden or disclosed—so none takes place. It is perhaps this association between statistical analysis and noisy evidence that reputedly prompted physicist Ernest Rutherford to remark: “If your experiment needs statistics, you ought to have done a better experiment.”
In many life sciences, including psychology, discovery isn’t a black-and-white issue; it is a matter of determining, from one experiment to the next, the theoretical contribution made by various shades of gray. When psychologists set an arbitrary criterion (p < .05) on the precise shade of gray required to achieve publication, and hence career success, we also incentivize a host of conscious and unconscious strategies to cross that threshold. And in the battle between science and storytelling, there is simply no competition: storytelling wins every time.
How can we get out of this mess? Chapter 8 will outline a manifesto for reform, many aspects of which are already being adopted. The remainder of this chapter will summarize some of the methods we can use to counter hidden flexibility.
Preregistration. The most thorough solution to p-hacking and other forms of hidden flexibility (including HARKing) is to prespecify our hypotheses and primary analysis strategies before examining data. Preregistration ensures that readers can distinguish the strategies that were independent of the data from those that were (or might have been) data led. This is not to suggest that data-led strategies are necessarily incorrect or misleading. Some of the most remarkable advances in science have emerged from exploration, and there is nothing inherently wrong with analytic flexibility. The problems arise when that flexibility is hidden from the reader, and possibly from the researcher’s own awareness. By unmasking this process, preregistration protects the outcome of science from our own biases as human practitioners.
Study preregistration has now been standard practice in clinical medicine for decades, driven by concerns over the effects of hidden flexibility and publication bias on public health. For basic science, these risks may be less immediate but they are no less serious. Hidden flexibility distorts the scientific record, and, since basic research influences and feeds into more applied areas (including clinical science), corruption of basic literature necessarily threatens any forward applications of discovery.
Recent years have witnessed a concerted push to highlight the benefits of preregistration in psychology. An increasing number of journals are now offering preregistered article formats in which part of the peer review process happens before researchers conduct experiments. This type of review ensures adherence to the hypothetico-deductive model of the scientific method, and it also prevents publication bias. Resources such as the Open Science Framework also provide the means for researchers to preregister their study protocols.
p-curve. The p-curve tool developed by Simonsohn and colleagues is useful for estimating the prevalence of p-hacking in published literature. It achieves this by assuming that collections of studies dominated by p-hacking will exhibit a concentration of p values that peaks just below .05. In contrast, an evidence base containing positive results in which p-hacking is rare or absent will produce a positively skewed distribution where smaller p values are more common than larger ones. Although this tool cannot diagnose p-hacking within individual studies, it can tell us which fields within psychology suffer more from hidden flexibility. Having identified them, the community can then take appropriate corrective action such as an intensive program of direct replication.
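The logic is easy to demonstrate with a short simulation (a sketch of the idea rather than the authors’ own p-curve software; the sample size and effect size below are arbitrary illustrative choices):

# Minimal sketch of the logic behind p-curve. We simulate two-sample
# t-tests, keep only the significant results, and look at how the
# surviving p values are distributed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_per_group, n_studies = 30, 20_000

def significant_p_values(effect_size):
    a = rng.normal(effect_size, 1, (n_studies, n_per_group))
    b = rng.normal(0, 1, (n_studies, n_per_group))
    p = stats.ttest_ind(a, b, axis=1).pvalue
    return p[p < 0.05]

for d, label in [(0.5, "true effect, d = 0.5"), (0.0, "null effect")]:
    p = significant_p_values(d)
    print(f"{label}: {(p < 0.025).mean():.0%} of significant p values fall below .025")
# A true effect yields a right-skewed curve (most significant p values are
# tiny); under the null the curve is flat; heavy p-hacking tends to push the
# mass up toward .05, producing the telltale peak just below the threshold.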
Disclosure statements. In 2012, Joe Simmons and colleagues proposed a straightforward way to combat hidden flexibility: simply ask the researchers.23 This approach assumes that most researchers (a) are inherently honest and will not deliberately lie, and (b) recognize that exploiting researcher degrees of freedom reduces the credibility of their own research. Therefore, requiring researchers to state whether or not they engaged in questionable practices should act as a disincentive to p-hacking, except for researchers who are either willing to admit to doing it (potentially suffering loss of academic reputation) or are willing to lie (active fraud).
The disclosure statements suggested by the Simmons team would require authors to state in the methods section of submitted manuscripts how they made certain decisions about the study design and analysis. These include whether the sample size was determined before the study began and whether any experimental conditions or data were excluded. Their “21 word solution” (as they call it) is:
We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.
Although elegant in their simplicity, disclosure statements have several limitations. They can’t catch forms of p-hacking that exploit defensible ambiguities in analysis decisions, such as testing the effect of adding a covariate to the design or focusing the analysis on a particular subgroup (e.g., males only) following exploration. Furthermore, the greater transparency of methods, while laudable in its own right, doesn’t stop the practice of HARKing; note that researchers are not asked whether their a priori hypotheses were altered after inspecting the data. Finally, disclosure statements cannot catch unconscious exploitation of researcher degrees of freedom, such as forgetting the full range of analyses undertaken or more subtle forms of HARKing (as proposed by Gelman and Loken). Notwithstanding these limitations, disclosure statements are a worthy addition to the range of tools for counteracting hidden analytic flexibility.
Data sharing. The general lack of data transparency is a major concern in psychology and will be discussed in detail in chapter 4. For now, it is sufficient to note that data sharing provides a natural antidote to some forms of p-hacking—particularly those where statistical significance is based on post hoc analytic decisions such as different methods for excluding outliers or focusing on specific subgroups. Publishing the raw data allows independent scientists with no vested interest to test how robust the outcome is to alternative analysis pathways. If such examinations were to reveal that the authors’ published approach was the only one out of a much larger set of defensible analyses to produce p < .05, the community would be justifiably skeptical of the study’s conclusions. Even though relatively few scientists scrutinize each other’s raw data, the mere possibility that this could happen should act as a natural deterrent to deliberate p-hacking.
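As a sketch of what such an independent re-analysis might look like, the toy example below loops over a small set of defensible analysis choices (the data are synthetic, and the particular outlier cutoffs and subgroup splits are illustrative assumptions) and counts how many of them reach significance:

# Toy sketch of a robustness check an independent analyst might run on
# shared raw data: vary the outlier cutoff and subgroup restriction, then
# see how many analysis paths produce p < .05. Synthetic data, no true
# group effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 120
score = rng.normal(0, 1, n)                  # outcome measure
group = rng.integers(0, 2, n)                # 0 = control, 1 = treatment
sex = rng.choice(["m", "f"], n)

results = []
for cutoff in (2.0, 2.5, 3.0, None):         # outlier exclusion in SD units
    for subgroup in ("all", "m", "f"):       # subgroup restriction
        keep = np.ones(n, dtype=bool)
        if cutoff is not None:
            keep &= np.abs(stats.zscore(score)) < cutoff
        if subgroup != "all":
            keep &= sex == subgroup
        a = score[keep & (group == 1)]
        b = score[keep & (group == 0)]
        results.append(stats.ttest_ind(a, b).pvalue)

print(f"{sum(p < 0.05 for p in results)} of {len(results)} analysis paths "
      f"reach p < .05")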
Solutions to allow “optional stopping.” Leslie John’s survey showed how a major source of p-hacking is the violation of stopping rules; that is, continuously adding participants to an experiment until the p value drops below the significance threshold. When H0 is true, p values between 0 and 1 are all equally likely to occur; therefore, with the addition of enough participants, a p value below .05 will eventually be found by chance. Psychologists often neglect stopping rules because, in most studies, there is no strong motivation for selecting a particular sample size in the first place.
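The consequences of ignoring stopping rules are easy to demonstrate. In the minimal sketch below (the peeking schedule and maximum sample size are arbitrary illustrative choices), a one-sample t-test is recomputed after every ten participants and data collection stops as soon as p dips below .05, even though the null hypothesis is true:

# Minimal sketch of optional stopping under a true null: the realized
# false-positive rate ends up several times the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_sims, n_max, step, alpha = 5_000, 200, 10, 0.05

false_positives = 0
for _ in range(n_sims):
    data = rng.normal(0, 1, n_max)          # H0 is true: the mean really is 0
    for n in range(step, n_max + 1, step):  # peek after every `step` subjects
        if stats.ttest_1samp(data[:n], 0).pvalue < alpha:
            false_positives += 1
            break

print(f"False-positive rate with peeking: {false_positives / n_sims:.3f}")
# compared with the nominal 0.05 had the sample size been fixed in advance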
Fixed stopping rules present a problem for psychology because the size of the effect being investigated is often small and poorly defined. Fortunately, there are two solutions that psychologists can use to avoid violating stopping rules. The first, highlighted by Michael Strube from Washington University (and more recently by Daniël Lakens), allows researchers to use a variable stopping rule with NHST by lowering the α level in accordance with how regularly the researcher peeks at the results.24 This correction is similar to more conventional corrections for multiple comparisons. The second approach is to adopt Bayesian hypothesis testing in place of NHST.25 Unlike NHST, which estimates the probability of observed (or more extreme) data under the null hypothesis, Bayesian tests estimate the relative probabilities of the data under competing hypotheses. In chapter 3 we will see how this approach has numerous advantages over more conventional NHST methods, including the ability to directly estimate the probability of H0 relative to H1. Moreover, Bayesian tests allow researchers to sequentially add participants to an experiment until the weight of evidence supports either H0 or H1. According to Bayesian statistical philosophy “there is nothing wrong with gathering more data, examining these data, and then deciding whether or not to stop collecting new data—no special corrections are needed.”26 This flexibility is afforded by the likelihood principle, which holds that “[w]ithin the framework of a statistical model, all of the information which the data provide concerning the relative merits of two hypotheses is contained in the likelihood ratio of those hypotheses.”27 In other words, the more data you have, the more accurately you can conclude support for one hypothesis over another.
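To make the sequential Bayesian idea concrete, the sketch below tests a binomial rate using the Savage-Dickey density ratio under a uniform prior; the prior, the stopping thresholds of 10 and 1/10, and the data-generating rate of .65 are conventional or illustrative choices rather than requirements of the approach.

# Sketch of sequential Bayesian testing for a binomial rate theta.
# H0: theta = 0.5 versus H1: theta unrestricted, with a uniform Beta(1,1)
# prior under H1. Data are added one observation at a time until the Bayes
# factor exceeds 10 (or drops below 1/10).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
true_theta = 0.65            # the (unknown) rate generating the data

successes = trials = 0
bf10 = 1.0
while 1 / 10 <= bf10 <= 10 and trials < 10_000:
    successes += rng.random() < true_theta   # collect one more observation
    trials += 1
    # Posterior under H1 is Beta(1 + successes, 1 + failures). Savage-Dickey:
    # BF01 = posterior density at theta = 0.5 / prior density at theta = 0.5.
    posterior_at_null = stats.beta.pdf(0.5, 1 + successes, 1 + trials - successes)
    prior_at_null = stats.beta.pdf(0.5, 1, 1)    # = 1 for the uniform prior
    bf10 = prior_at_null / posterior_at_null

verdict = ("evidence for H1" if bf10 > 10
           else "evidence for H0" if bf10 < 1 / 10 else "inconclusive")
print(f"Stopped after {trials} trials: BF10 = {bf10:.1f} ({verdict})")

Because the Bayes factor summarizes the data only through the relative likelihoods of the two hypotheses, inspecting it after every new observation does not corrupt the evidence in the way that repeatedly peeking at p values does.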