Just how common is p-hacking? The lack of transparency in the research community makes a definitive answer impossible to glean, but important clues can be found by returning to the 2012 study by Leslie John and colleagues from chapter 1. Based on a survey of more than 2,000 American psychologists, they estimated that 100 percent have, on at least one occasion, selectively excluded data after looking at the impact of doing so, and that 100 percent have collected more data for an experiment after seeing whether results were statistically significant. They also estimated that 75 percent of psychologists have failed to report all conditions in an experiment, and that more than 50 percent have stopped data collection after achieving a “desired result.” These results indicate that, far from being a rare practice, p-hacking in psychology may be the norm.
FIGURE 2.1. The peril of adopting a flexible stopping rule in null hypothesis significance testing. In the upper panel, an experiment is simulated in which the null hypothesis (H0) is true and a statistical test is conducted after each new participant up to a maximum sample size of 50. A researcher who strategically p-hacks would stop as soon as p drops below .05 (dotted line). In this simulation, p crosses the significance threshold after collecting data for 19 participants (red symbols), despite the fact that there is no real effect to be discovered. In the lower panel we see how the frequency of interim analyses influences the false positive rate, defined here as the probability of obtaining a p value below .05 at least once during data collection when H0 is true.
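To see how quickly this kind of optional stopping inflates errors, here is a minimal simulation sketch of the scenario figure 2.1 describes (an illustration only, not the code or data behind the book's figure): data are generated with no true effect, yet a researcher who tests after every participant and stops at the first p below .05 "finds" an effect far more often than 5 percent of the time.

```python
# A minimal sketch of the optional-stopping scenario in figure 2.1
# (an illustration only, not the book's own simulation).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def peeking_experiment(max_n=50, min_n=5, alpha=0.05):
    """Add one participant at a time and stop as soon as p < alpha."""
    data = rng.normal(loc=0.0, scale=1.0, size=max_n)  # H0 is true: no effect
    for n in range(min_n, max_n + 1):
        p = stats.ttest_1samp(data[:n], popmean=0.0).pvalue
        if p < alpha:
            return True  # a "significant" result despite no real effect
    return False

n_sims = 2000
false_positives = sum(peeking_experiment() for _ in range(n_sims))
print(f"False positive rate with peeking: {false_positives / n_sims:.2f}")
# Far above the nominal .05, even though nothing real is being detected.
```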
Peculiar Patterns of p
The survey by John and colleagues suggests that p-hacking is common in psychology, but can we detect more objective evidence of its existence? One possible clue lies in the way p values are reported. If p-hacking is as common as claimed, then it should distort the distribution of p values in published work. To illustrate why, consider the following scenarios. In one, a team of researchers collect data by adding one participant at a time and reanalyze the results after each participant until statistical significance is obtained. Given this strategy, what p value would you expect to see at the end of the experiment? In another scenario, the team obtain a p value of .10 but don’t have the option to collect additional data. Instead, they try ten different methods for excluding statistical outliers. Most of these produce p values higher than .05, but one reduces the p value to .049. They therefore select that option in the declared analysis and don’t report the other “failed” attempts. Finally, consider a situation where p = .08 and the researchers have several degrees of freedom available in terms of characterizing the dependent variable—specifically, they are able to report the results in terms of response times, performance accuracy, or an integrated measure of both. After analyzing all these different measures they find that the integrated measure “works best,” revealing a p value of .037, whereas the individual measures alone reveal only nonsignificant effects (p > .05).
Although each of these scenarios is different, they all share one thing: in each case the researcher is attempting to push the p value just over the line. If this is your goal then it would make sense to stop p-hacking as soon as the p value drops below .05—after all, why spend additional resources only to risk that a currently publishable effect might “disappear” with the addition of more participants or by looking at the data in a different way? When researchers focus on merely crossing the significance threshold, the result should be a cluster of published p values just below .05.
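The second scenario above can be made concrete with a short simulation sketch (an illustration under simplified assumptions, not a reconstruction of any actual study): with no true effect, a researcher tries progressively stricter outlier-exclusion rules and reports the first analysis that reaches significance, and the reported p values pile up just below .05.

```python
# A hedged sketch of the outlier-exclusion scenario: under a true null,
# a researcher stops hacking at the first analysis with p < .05, so the
# reported "significant" p values cluster just under the threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
reported = []

for _ in range(20000):
    x = rng.normal(0.0, 1.0, size=30)   # control group, no true effect
    y = rng.normal(0.0, 1.0, size=30)   # "treatment" group, no true effect
    # Try progressively stricter exclusion cutoffs (in SD units), stopping
    # at the first analysis that reaches significance.
    for cutoff in (np.inf, 3.0, 2.5, 2.0, 1.5):
        xk = x[np.abs(x - x.mean()) <= cutoff * x.std()]
        yk = y[np.abs(y - y.mean()) <= cutoff * y.std()]
        p = stats.ttest_ind(xk, yk).pvalue
        if p < 0.05:
            reported.append(p)
            break

counts = np.histogram(reported, bins=[0, .01, .02, .03, .04, .05])[0]
print("Reported significant p values per .01-wide bin:", counts)
# An unbiased test would spread these evenly; hacking skews them toward .05.
```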
A number of individual cases of such behavior have been alleged. In a Science paper published in 2012, researchers presented evidence that religious beliefs could be reduced by instructing people to complete a series of tasks that require rational, analytic thinking. Despite sample sizes across four experiments ranging from 57 to 179, each experiment returned p values within a range of p = .03 to p = .04. Critics have argued either that the authors knew precisely how many participants would be required in each experiment to attain statistical significance or that they p-hacked, consciously or unconsciously, in order to always narrowly reach it.6 There is, however, an entirely innocent explanation. Through no fault of the authors, their paper could be one of many unbiased studies considered by Science, with the journal selectively publishing the one that “struck gold” in finding a sequence of four statistically significant effects. Where it is impossible to distinguish biased practices by researchers from publication bias by journals, the authors naturally deserve the benefit of the doubt.
Because of the difficulty in differentiating publication bias from researcher bias, individual cases of p-hacking are difficult, if not impossible, to prove. However, if p-hacking is the norm then such cases may accrue across the literature to produce an overall preponderance of p values just below the significance threshold. In 2012, psychologists E. J. Masicampo and Daniel Lalande asked this question for the first time by examining the distribution of 3,627 p values sampled from three of the most prestigious psychology journals. Overall they found that smaller p values were more likely to be published than larger ones, but they also discovered that the number of p values just below .05 was about five times higher than expected.7
Masicampo and Lalande’s findings have since been replicated by Nathan Leggett and colleagues from the University of Adelaide. Not only did they find the same spike in p values just below .05, but they also showed that the spike increased between 1965 and 2005.8 The reason for growing numbers of “just-significant” results is not known for certain (and has itself been robustly challenged),9 but if it is a genuine phenomenon then one possible explanation is the huge advancement in computing technology and statistical software. Undertaking NHST in 1965 was cumbersome and laborious (and often done by hand), which acted as a natural disincentive toward p-hacking. In contrast, modern software packages such as SPSS and R can reanalyze data in many different ways within seconds.
It has been suggested that the studies of the Masicampo team and the Leggett team reveal evidence of p-hacking on a massive scale across thousands of studies, but is it possible to show such effects within more specific fields? A tool developed by Simonsohn, Nelson, and Simmons called “p-curve” analysis promises to do just this.10 The logic of p-curve is that the distribution of statistically significant p values within a set of studies reveals their evidential value (see figure 2.2). For unbiased (non-p-hacked) results where H0 is false, we should see more p values clustered toward the lower end of the spectrum (e.g., p < .01) than immediately below the significance threshold (p values between .04 and .05); this in turn should produce a distribution of p values that is positively skewed. In contrast, when researchers engage in p-hacking we should see the greatest clustering just below .05, with fewer instances at lower p values, producing a distribution that is negatively skewed. Although p-curve has attracted some controversy, it is a promising addition to the existing array of tools for detecting hidden analytic flexibility.11
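A toy sketch of this logic (not the official p-curve tool, which applies formal statistical tests) is shown below: binning the statistically significant p values from many simulated two-group comparisons yields the right-skewed curve expected from a genuine effect and the flat curve expected from unbiased null results, while p-hacked nulls, as in the earlier sketch, would instead pile up just below .05.

```python
# A toy illustration of the p-curve logic in figure 2.2 (not the official
# p-curve software): only the shape of the significant p values is examined.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
bins = [0, .01, .02, .03, .04, .05]

def significant_p_counts(effect_size, n=30, n_studies=10000):
    """Collect p values below .05 from simple two-group t-tests and bin them."""
    ps = []
    for _ in range(n_studies):
        x = rng.normal(0.0, 1.0, size=n)
        y = rng.normal(effect_size, 1.0, size=n)
        p = stats.ttest_ind(x, y).pvalue
        if p < 0.05:
            ps.append(p)
    return np.histogram(ps, bins=bins)[0]

print("True effect (d = 0.5), right-skewed curve:", significant_p_counts(0.5))
print("No effect (d = 0.0), flat curve:          ", significant_p_counts(0.0))
# p-hacked null results would instead concentrate in the .04-.05 bin,
# producing the left-skewed curve shown in the lower panel of figure 2.2.
```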
The problem of p-hacking is not unique to psychology. Compared with typical behavioral experiments, functional brain imaging includes far more researcher degrees of freedom. As the blogger Neuroskeptic has pointed out, the decision space of even the simplest fMRI study can include hundreds of analysis options, providing ample room for p-hacking.12 At the time of writing, no studies of the distribution of p values have yet been undertaken for fMRI or electroencephalography (EEG), but indirect evidence suggests that p-hacking may be just as common in these fields as in psychological science. Josh Carp of the University of Michigan has reported that out of 241 randomly selected fMRI studies, 207 employed unique analysis pipelines; this implies that fMRI researchers have numerous defensible options at their disposal and are making those analysis decisions after inspecting the data. As would be expected under a culture of p-hacking, earlier work has shown that the test-retest reliability of fMRI is moderate to low, with an estimated rate of false positives within the range of 10–40 percent.13 We will return to problems with unreliability in chapter 3, but for now it is sufficient to note that p-hacking presents a serious risk to the validity of both psychology and cognitive neuroscience.
FIGURE 2.2. The logic of the p-curve tool developed by Uri Simonsohn and colleagues. Each plot shows a hypothetical distribution of p values between 0 and .05. For example, the x-value of .05 corresponds to all p values between .04 and .05, while .01 corresponds to all p values between 0 and .01. In the upper panel, the null hypothesis (H0) is true, and there is no p-hacking; therefore p values in this plot are uniformly distributed. In the middle panel, H0 is false, leading to a greater number of smaller p values than larger ones. The positive (rightward) skew in this plot doesn’t rule out the presence of p-hacking but does suggest that the sample of p values is likely to contain evidential value. In the lower panel, H0 is true and more p values are observed closer to .05. The negative (leftward) skew in this plot suggests the presence of p-hacking.
Institutionalized p-hacking damages the integrity of science and may be on the rise. If virtually all psychologists engage in p-hacking (even unconsciously) at least some of the time, and if p-hacking increases the rate of false positives to 50 percent or higher, then much of the psychological literature will be false to some degree. This situation is both harmful and, crucially, preventable. It prompts us to consider how the scientific record would change if p-hacking wasn’t the norm and challenges us to reflect on how such practices can be tolerated in any community of scientists. Unfortunately, p-hacking appears to have crept up on the psychological community and become perceived as a necessary evil—the price we must pay to publish in the most competitive journals. Frustration with the practice is, however, growing. As Uri Simonsohn said in 2012 during a debate with fellow psychologist Norbert Schwarz:
I don’t know of anybody who runs a study, conducts one test, and publishes it no matter what the p-value is.… We are all p hackers, those of us who realize it want change.14
Ghost Hunting
Could a greater emphasis on replication help solve the problems created by p-hacking?15 In particular, if we take a case where a p-hacked finding is later replicated, does the fact that the replication succeeded mean we don’t need to worry whether the original finding exploited researcher degrees of freedom?
If we assume that true discoveries will replicate more often than false discoveries then a coordinated program of replication certainly has the potential to weed out p-hacked findings, provided that the procedures and analyses of the original study and the replication are identical. However, the argument that replication neutralizes p-hacking has two major shortcomings. The first is that while direct (close) replication is vital, even a widespread and systematic replication initiative wouldn’t solve the problem that p-hacked studies already waste resources by creating blind alleys. Second, as we saw earlier and will see again later, such a direct replication program simply doesn’t exist in psychology. Instead, the psychological community has come to rely on the more loosely defined “conceptual replication” to validate previous findings, which satisfies the need for journals to publish novel and original results. Within a system that depends on conceptual replication, researcher degrees of freedom can be just as easily exploited to “replicate” a false discovery as to create one from scratch. A p-hacked conceptual replication of a p-hacked study tells us very little about reality apart from our ability to deceive ourselves.
John Bargh’s elderly priming effect provides an interesting case where researcher degrees of freedom may have led to so-called phantom replication. Recall that at least two attempts to exactly replicate the original elderly priming effect have failed.16 By way of rebuttal, Bargh has argued that two other studies did successfully replicate the effect.17 However, if we look closely at those studies, we find that in neither case was there an overall effect of elderly priming—in one study the effect was statistically significant only once participants were divided into subgroups of low or high self-consciousness; and in the other study the effect was only significant when dividing participants into those with positive or negative attitudes toward elderly people. Furthermore, each of these replications used different methods for handling statistical outliers, and each analysis included covariates that were not part of the original elderly priming experiments. These differences hint at a potential phantom replication. Driven by the confirmation bias to replicate the original (high-profile) elderly priming effect, the researchers in these subsequent studies may have consciously or unconsciously exploited researcher degrees of freedom to produce a successful replication and, in turn, a more easily marketable publication.
Whether these conceptual replications were contaminated by p-hacking cannot be known for certain, but we have good reason to be suspicious. Despite one-time prevalence estimates approaching 100 percent, researchers usually deny that they p-hack.18 In Leslie John’s survey in 2012, only about 60 percent of psychologists admitted to “Collecting more data after seeing whether results were significant,” whereas the prevalence estimate derived from this admission rate was 100 percent. Similarly, while ~30 percent admitted to “Failing to report all conditions” and ~40 percent admitted to “Excluding data after looking at the impact of doing so,” the estimated prevalence rates in each case were ~70 percent and ~100 percent, respectively. These figures needn’t imply dishonesty. Researchers may sincerely deny p-hacking yet still do it unconsciously by failing to remember and document all the analysis decisions made after inspecting data. Some psychologists may even do it consciously but believe that such practices are acceptable in the interests of data exploration and narrative exposition. Yet, regardless of whether researchers are p-hacking consciously or unconsciously, the solution is the same. The only way to verify that studies are not p-hacked is to show that the authors planned their methods and analysis before they analyzed the data—and the only way to prove that is through study preregistration.
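John and colleagues derived their prevalence estimates by combining self-admission rates with respondents' judgments of how likely those who had engaged in a practice would be to admit it. As a rough sketch of that logic (a deliberate simplification with a hypothetical admission likelihood, not their exact estimation procedure):

```python
# A rough sketch of deriving prevalence from an admission rate
# (a simplification of John et al.'s approach, not their exact model).
admission_rate = 0.60        # proportion who say "yes, I have done this"
admission_likelihood = 0.60  # hypothetical: how many of those who did it would admit it
implied_prevalence = min(admission_rate / admission_likelihood, 1.0)
print(f"Implied prevalence: {implied_prevalence:.0%}")  # -> 100%
```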
Unconscious Analytic “Tuning”
What do we mean exactly when we say p-hacking can happen unconsciously? Statisticians Andrew Gelman and Eric Loken have suggested that subtle forms of p-hacking and HARKing can join forces to produce false discoveries.19 In many cases, they argue, researchers may behave completely honestly, believing they are following best practice while still exploiting researcher degrees of freedom.
To illustrate how, consider a scenario where a team of researchers design an experiment to test the a priori hypothesis that listening to classical music improves attention span. After consulting the literature they decide that a visual search task provides an ideal way of measuring attention. The researchers choose a version of this task in which participants view a screen of objects and must search for a specific target object among distractors, such as the letter “O” among many “Q”s. On each trial of the task, the participant judges as quickly as possible whether the “O” is present or absent by pressing a button—half the time the “O” is present and half the time it is absent. To vary the need for attention, the researchers also manipulate the number of distractors (Qs) across three conditions: 4 distractors (low difficulty, i.e., the “O” pops out when it is present), 8 distractors (medium difficulty), and 16 distractors (high difficulty). The key dependent variables are the reaction times and error rates in judging whether the letter “O” is present or absent. Most studies report reaction times as the measure for this task and find that reaction times increase with the number of distractors. Many studies also report error rates. The researchers decide to adopt a repeated measures design in which each participant performs this task twice, once while listening to classical music and once while listening to nonclassical music (their control condition). They decide to test 20 participants.
So far the researchers feel they have done everything right: they have a prespecified hypothesis, a task selected with a clear rationale, and a sample size on a par with previous studies of visual search. Once the study is complete, the first results of their analysis are encouraging: they successfully replicate two effects that are typically observed in visual search tasks, namely that participants are significantly slower and more error prone under conditions with more distractors (16 is more difficult than 8, which, in turn, is more difficult than 4), and that they are significantly slower to judge when a target is absent compared to when it is present. So far so good. On this basis, the authors judge that the task successfully measured attention.
But then it gets trickier. The researchers find no statistically significant main effect of classical music on either reaction times or error rates, which does not allow them to reject the null hypothesis. However, they do find a significant interaction between the type of music (classical, nonclassical) and the number of distractors (4, 8, 16) for error rates (p = .01) but not for reaction times (p = .7). What this interaction means is that, for error rates, the effect of classical music differed significantly between the different distractor conditions. Post hoc comparisons show that error rates were significantly reduced when participants were exposed to classical music compared with control music in displays with 16 distractors (p = .01) but not in displays with 8 or 4 distractors (both p > .05).
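To see why a pattern like this is easy to obtain even if the music does nothing at all, consider a minimal simulation sketch of the hidden multiplicity in this design (a simplification of the hypothetical study, not a reconstruction of any real analysis): two dependent variables, an overall comparison, and three per-difficulty comparisons give a null effect many chances to cross p < .05 somewhere.

```python
# A minimal sketch of the hidden multiplicity in the visual search example
# (a simplification: the family of paired comparisons stands in for the
# full ANOVA with interaction and post hoc tests).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_subj, n_sims, alpha = 20, 5000, 0.05
hits = 0

for _ in range(n_sims):
    any_sig = False
    for _dv in range(2):                      # reaction times, error rates
        # Per-subject scores in the three distractor conditions; the music
        # manipulation has no true effect, so both matrices share one distribution.
        music = rng.normal(0.0, 1.0, size=(n_subj, 3))
        control = rng.normal(0.0, 1.0, size=(n_subj, 3))
        tests = [stats.ttest_rel(music.mean(axis=1), control.mean(axis=1)).pvalue]
        tests += [stats.ttest_rel(music[:, k], control[:, k]).pvalue for k in range(3)]
        if min(tests) < alpha:
            any_sig = True
    hits += any_sig

print(f"Chance of at least one 'significant' music effect: {hits / n_sims:.2f}")
# Well above .05, even though music has no effect in this simulation.
```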