As we saw earlier, concealing details about the range of attempted and statistically nonsignificant analyses can dramatically elevate the rate of false positives. At the same time, failing to disclose details of experimental procedures obstructs other scientists from replicating the work and, in cases where such replications are attempted, sparks additional contention. A case in point returns us to the elderly priming controversy of 2012 between Bargh and Doyen. Recall that Doyen and colleagues failed to replicate Bargh’s influential finding that priming people with concepts related to elderly people can lead them to walk more slowly. Doyen claimed that the original result was due to the experimenter in Bargh’s original experiments being unblinded to the experimental conditions and inadvertently “priming” the subjects. When Doyen and colleagues repeated the experiment in such a way as to ensure effective blinding, the effect disappeared.
In response, Bargh claimed that the experimenter in his original study was blinded to the experimental condition. Unfortunately, it is impossible to tell either way because the method section in Bargh’s original study was too vague. Because of this ambiguity, the debate between Bargh and Doyen focused on what the original study did rather than the scientific validity of the elderly priming effect itself—and this dispute spread among the wider psychological community. Such arguments lead nowhere and generate ill feeling: where methodological imprecision is allowed to sully the scientific record, any failed replication can be attacked for not following some unpublished—but apparently critical—detail of the original methodology. And where the original researchers seek to clarify their methodology after the fact, such attempts can be perceived, rightly or wrongly, as face-saving and dishonest.
If lack of methodological disclosure is such a problem, why don’t journals simply require more detail? There are no good answers to this question but plenty of bad ones. Space limitations in printed journals require authors to cut details from manuscripts that may be seen by editors as extraneous or unnecessary. Strangely, these policies have persisted even in online (un-printed) journals where space is no concern. Even worse, because direct replication is so rare, much of the detail necessary for replication is regarded as unnecessary by authors, reviewers, and editors. The absence of this detail, in turn, feeds a vicious cycle that makes direct replication of such work difficult or impossible. Methods sections have thus faded from their original purpose of providing the recipe for replication, instead acting as a way for reviewers and editors to check that the authors followed procedural norms and avoided obvious methodological flaws. While this is a necessary role of a method section, it is far from sufficient.
Reason 4: Statistical Fallacies
In chapter 2 we saw how undisclosed analytic flexibility can lead to a range of questionable practices in psychology that undermine reliability, such as p-hacking and HARKing. Beneath these errors in the application of statistics, however, lie even deeper and more fundamental misunderstandings of null hypothesis significance testing (NHST).
The most frequently misunderstood aspect of NHST is the p value itself.36 To illustrate just how confusing p values can be, consider a scenario where you conduct an experiment testing the effectiveness of a new kind of intervention on the number of people who successfully quit smoking. You give one group of participants the new intervention while the other group is given a control (baseline) intervention. You find that significantly more participants quit following the new intervention compared with the control intervention. The statistical test returns p = .04, which at α = .05 allows you to reject the null hypothesis.
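To make the scenario concrete, the sketch below runs the kind of test that could produce such a result. It is only an illustration: the quit counts are invented, chosen so that the test lands near p = .04, and are not taken from any real study.

```python
# Minimal sketch of the smoking-cessation comparison described above.
# The counts are hypothetical, chosen so that the test lands near p = .04.
from scipy.stats import chi2_contingency

quit_new, stay_new = 48, 52    # assumed: 48 of 100 participants quit with the new intervention
quit_ctrl, stay_ctrl = 33, 67  # assumed: 33 of 100 quit with the control intervention

chi2, p, dof, expected = chi2_contingency([[quit_new, stay_new],
                                           [quit_ctrl, stay_ctrl]])
print(f"p = {p:.3f}")  # below alpha = .05, so the null hypothesis is rejected
```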
Now consider which of the following statements is correct:
A. The obtained p value of .04 indicates a 4 percent probability that the new intervention was not effective (the null hypothesis).
B. The obtained p value of .04 indicates a 4 percent probability that the results are due to random chance rather than the new intervention.
C. The fact that there is a statistically significant difference means that your intervention works and is clinically significant.
D. The obtained p value indicates that the observed data would occur only 4 percent of the time if the null hypothesis were true.
E. Previous research found that a similar intervention failed to produce a significant improvement compared with the control intervention, returning a significance level of p = .17. This indicates that your new intervention is more promising than this previous intervention.
F. After your study is published, another research group publishes a paper reporting that a different intervention also leads to a statistically significant improvement compared to the same control intervention (p = .001). Because their p value is smaller than your p value (.04), they conclude that their treatment is more effective than your treatment.
Which of these statements is correct? In fact, all of them are wrong in different ways. Statement A is the most prevalent misconception of NHST: that a p value indicates the probability of the null hypothesis (or any hypothesis) being true. Another common variant of this fallacy is that p = .04 reflects a 96 percent chance that the effect is “real.” These statements confuse the p value with the posterior probability of the null hypothesis. Recall that NHST estimates the probability of the observed data given the hypothesis, p(data|hypothesis), rather than the probability of the hypothesis given the data, p(hypothesis|data). To determine the posterior probability of any hypothesis being true we need to use Bayes’ theorem, which we will turn to later in this chapter. Statement B is a variant of statement A and equally wrong. For the same reason that a p value can’t tell us the probability that there is no effect, it can’t tell us the probability that the effect is due to chance.37
What about statement C? Can we conclude that the new treatment was clinically significant because it was statistically more effective than the control intervention? No, because the p value tells us nothing about the size of the effect. As statistical power increases, so does the chance of detecting increasingly small effects. Therefore, depending on the statistical power of your experiment, p = .04 could indicate an effect so small as to be trivial in terms of clinical significance. Statistical significance and “real world” significance are independent concepts and should not be confused, even though they frequently are.
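To see how far statistical and clinical significance can come apart, consider a minimal sketch in which an assumed improvement of a single percentage point in quit rates becomes highly significant simply because the sample is enormous. The rates and group size below are assumptions for illustration only.

```python
# Sketch: a trivially small effect reaches "significance" at a large enough sample size.
# The 21% vs. 20% quit rates and the group size are assumptions for illustration only.
from scipy.stats import chi2_contingency

n = 200_000                 # hypothetical participants per group
quit_new = int(n * 0.21)    # 21 percent quit with the new intervention
quit_ctrl = int(n * 0.20)   # 20 percent quit with the control

chi2, p, dof, _ = chi2_contingency([[quit_new, n - quit_new],
                                    [quit_ctrl, n - quit_ctrl]])
print(f"p = {p:.1e}")  # p is minuscule, yet a 1-point gain may be clinically trivial
```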
Statement D moves us closer to the true definition of a p value but is still crucially wrong. The p value doesn’t tell us the probability of observing your data if the null hypothesis is true—it tells us the probability of observing your data or more extreme data if the null hypothesis is true.
In statement E, we’re asked whether your intervention, which yielded a statistically significant effect (p = .04), can be said to be more effective than an alternative intervention that produced a statistically nonsignificant effect (p = .17). Although this interpretation is tempting—and extremely common—it is incorrect. To conclude that two effects are different requires a statistical test of their difference: in other words, you would need to show that your intervention produced a significantly larger improvement in smoking cessation compared with the alternative intervention. In most cases this requires a test of the statistical interaction, and only if that interaction was statistically significant could you conclude a quantitative dissociation between effects. In 2011, psychologists Sander Nieuwenhuis, Birte Forstmann, and E. J. Wagenmakers explored the prevalence of this fallacy in 157 articles published in four of the most prestigious journals in neuroscience. Remarkably, they found that 50 percent of papers assumed a meaningful difference between a statistically significant effect and a statistically nonsignificant effect, without testing for the critical interaction that would justify such a conclusion.38
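A minimal numerical sketch of this fallacy is given below. The effect estimates and standard errors are invented so that the two studies return p ≈ .04 and p ≈ .17, yet a direct test of the difference between the two effects is nowhere near significant.

```python
# Sketch of the "significant vs. nonsignificant" fallacy using invented numbers.
import numpy as np
from scipy.stats import norm

effect_new, se_new = 2.05, 1.0   # hypothetical effect and standard error: z = 2.05, p ~ .04
effect_alt, se_alt = 1.37, 1.0   # hypothetical effect and standard error: z = 1.37, p ~ .17

p_new = 2 * norm.sf(effect_new / se_new)   # two-sided p for each effect on its own
p_alt = 2 * norm.sf(effect_alt / se_alt)

# The test that licenses a comparison: is the *difference* between the effects significant?
z_diff = (effect_new - effect_alt) / np.sqrt(se_new**2 + se_alt**2)
p_diff = 2 * norm.sf(abs(z_diff))

print(f"p_new = {p_new:.2f}, p_alt = {p_alt:.2f}, p for their difference = {p_diff:.2f}")
```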
Finally, let’s consider statement F. This scenario is similar to statement E except that both p values are statistically significant. Again, the statement is false without performing a test to compare the magnitude of the improvement between your intervention and the newer intervention. Remember that a p value on its own tells us nothing about the
size of an observed effect—it merely tells us how surprised we should be to observe that effect, or larger, if the null hypothesis were true. It is entirely possible that a large effect that yields p = .04 is more important for theory or applications than a smaller effect from a much larger sample that produces p = .001. This tendency for scientists to draw conclusions about effect sizes based on p values reflects the same statistical fallacy as statement A: that we tend to interpret p values as telling us the likelihood of our hypothesis being true, with smaller p values making us feel more confident that we are “right.” This thinking is not allowed under NHST because a p value can never tell us the probability of truth. Many psychologists (and scientists in related fields) have fallen into the trap of confusing p values with a measure that does tell us the relative likelihood of one hypothesis over another: the Bayes factor. We will return to Bayes factors later in this chapter.
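The sketch below mirrors statement F with invented numbers: the other group’s much smaller p value comes from a much larger sample, even though the improvement they observe is smaller than yours. All counts and rates are assumptions for illustration.

```python
# Sketch: a smaller p value does not imply a larger effect. All counts are invented.
from scipy.stats import chi2_contingency

def quit_study(n_per_group, rate_new, rate_ctrl):
    """Return the difference in quit rates and the p value for a hypothetical study."""
    q1, q2 = round(n_per_group * rate_new), round(n_per_group * rate_ctrl)
    _, p, _, _ = chi2_contingency([[q1, n_per_group - q1],
                                   [q2, n_per_group - q2]])
    return rate_new - rate_ctrl, p

effect_yours, p_yours = quit_study(100, 0.48, 0.33)      # assumed 15-point improvement
effect_theirs, p_theirs = quit_study(4_000, 0.23, 0.20)  # assumed 3-point improvement

print(f"yours:  effect = {effect_yours:.2f}, p = {p_yours:.3f}")    # larger effect, larger p
print(f"theirs: effect = {effect_theirs:.2f}, p = {p_theirs:.4f}")  # smaller effect, smaller p
```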
Reason 5: Failure to Retract
Scientific publishing is largely a matter of trust, which means that when major errors are identified it is important that untrustworthy papers are retracted promptly from the literature. Articles can be retracted for many reasons, ranging from honest mistakes such as technical errors or failure to reproduce findings, to research misconduct and fraud.39 If replication is the immune system of science, then retraction can be thought of as the last line of defense—a surgical excision.
Putting aside fraud, in many sciences the failure to replicate a previous result because of technical error or unknown reasons is sufficient grounds for retracting the original paper.40 In physics, chemistry, and some areas of biology, results are often so clearly positive or negative that failure to replicate indicates a critical mistake with either the original study or the replication attempt. Psychology, however, rarely retracts articles because of replication failures alone. In 2013, researchers Minhua Zhang and Michael Grieneisen found that the social sciences, including psychology, are several times less likely than other sciences to retract articles because of “distrust of data or interpretations.”41 In a previous study they showed that overall retraction rates in psychology are just 27 percent of the average calculated across more than 200 other research areas.42
The resistance to retraction in psychology is so hardened that it can lead to farcical interactions between researchers. One recent case highlighted by Dan Simons relates again to the work of Yale psychologist John Bargh.43 In 2012, Bargh and colleague Idit Shalev published a study claiming that lonelier people prefer warmer baths and showers, thereby compensating for a lack of “social warmth” through physical warmth.44 In 2014, psychologist Brent Donnellan and colleagues reported a failure to replicate this finding—and not just in a single experiment but across nine experiments and more than 3,000 participants, over 30 times the sample size of the original study.45 Despite this failure to replicate, as well as the presence of unexplained anomalies in the original data, Bargh and Shalev refused to retract their original paper. In many other sciences, a false discovery of this magnitude would automatically trigger excision of the original work from the scientific record. In psychology, unreliability is business as usual.
What does the rate of retractions in a particular scientific field say about the quality of science in that field? On the one hand, you might think (optimistically) that fewer retractions is a good sign, indicating that fewer studies are in need of retraction. More realistically, fewer retractions reflect the lower confidence that scientists place in their own methods and a lower bar for publication in the first place. If we take a science such as experimental physics, where studies tend to have high statistical power and methods are well defined and de facto preregistered, then the failure to reproduce a previous result is considered a major cause for concern. But in a weaker science, where lax statistical standards and questionable research practices are the norm, attempts to reproduce prior work will often fail, and it should therefore come as no surprise that retraction is rare. By lowering the bar for publication, we necessarily raise it for retraction. A strict policy requiring retraction of unreliable results could cause the thread of psychological research to unravel, consigning vast swathes of the literature to the scrap heap.
Solutions to Unreliability
What is the point of science if we can’t trust what it tells us? Lack of reliability is unquestionably one of the gravest challenges facing psychological science. However, there are reasons to be hopeful. In the remainder of this chapter we will consider some solutions to the five main problems outlined above: lack of replication, low statistical power, lack of disclosure about methods, statistical misunderstandings, and failure to retract flawed studies.
Replication and power. If the problem is a lack of replication and statistical power, then the obvious solution is to produce more high-powered research and instill a culture where, as in other sciences, replication is viewed as a foundation of discovery. Our challenge is how to achieve this reform within an academic culture that places little to no value on direct replication, and where jobs and grant funding are awarded based on the quantity of high-impact publications rather than the reliability of the underlying science. Placing a premium on work that can be independently replicated is one major part of the solution. Psychologist Will Gervais from the University of Kentucky has shown how a policy of replication automatically rewards researchers who conduct high-powered studies.46 Under the current system, a researcher who publishes many underpowered studies (“Dr. Mayfly”; N = 40 participants per experiment) will have a substantial career advantage over a researcher who publishes a smaller number of high-powered studies (“Dr. Elephant”; N = 300 participants per experiment). Using simple statistical modeling, Gervais found that after a six-year period, Dr. Mayfly will have published nearly twice as many papers as Dr. Elephant. However, this advantage is reversed when publication requires every experiment to be directly replicated just once. Both researchers now publish fewer papers, but because Dr. Mayfly wastes so much time and resources chasing down false leads, Dr. Elephant publishes more than twice as many papers as Dr. Mayfly. By aligning the incentive to publish more papers with the incentive to publish replicable science, the careful scientist is rewarded over the cowboy.
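The sketch below is a highly simplified simulation in the spirit of Gervais’s argument rather than his actual model; every parameter (the assumed true effect size, the per-group sample sizes, the participant budget, and the number of simulated careers) is an assumption chosen only to illustrate how a replication requirement can reverse the incentive.

```python
# Simplified illustration (not Gervais's model): publication counts for a low-powered
# researcher ("Dr. Mayfly") versus a high-powered one ("Dr. Elephant"), with and
# without a rule that every experiment must be directly replicated once.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def significant_study(n_per_group, effect=0.3):
    """Simulate one two-group experiment on an assumed modest true effect (d = 0.3)."""
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(effect, 1.0, n_per_group)
    return stats.ttest_ind(a, b).pvalue < 0.05

def papers(n_per_group, participant_budget, require_replication):
    """Count publishable findings given a fixed budget of participants."""
    published, budget = 0, participant_budget
    while budget >= 2 * n_per_group:
        budget -= 2 * n_per_group
        if not significant_study(n_per_group):
            continue                              # nonsignificant result: no paper
        if require_replication:
            if budget < 2 * n_per_group:
                break
            budget -= 2 * n_per_group
            if not significant_study(n_per_group):
                continue                          # replication failed: no paper
        published += 1
    return published

budget = 12_000  # assumed participants available to each researcher over six years
for rule in (False, True):
    mayfly = np.mean([papers(20, budget, rule) for _ in range(50)])
    elephant = np.mean([papers(150, budget, rule) for _ in range(50)])
    print(f"replication required = {rule}: Mayfly ~ {mayfly:.0f} papers, Elephant ~ {elephant:.0f}")
```

Even in this toy version, the low-powered researcher comes out ahead when single significant studies are publishable, and falls behind once each finding must survive one direct replication.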
So much for theory, but how can we incentivize replications in practice? E. J. Wagenmakers and Birte Forstmann have proposed an innovative solution in which journal editors issue public calls for replication attempts of studies that hold particular interest or weight in the literature. Teams of scientists would then compete for the bid, with the winning team guaranteed a publication in the journal that issues the call.47 The journal Perspectives on Psychological Science recently launched a similar initiative called Registered Replication Reports in which groups of scientists work collaboratively to directly replicate findings of particular importance, with publication of preregistered protocols guaranteed regardless of the outcome. Sanjay Srivastava has gone even further and called for a “Pottery Barn rule” (you break it, you buy it) in which the journal that publishes any original finding is required to publish direct replications of the study, regardless of the outcome.48 Srivastava’s idea is challenging to implement but would incentivize journals to publish only the work they see as credible. It would also capitalize on evidence that replication studies have the most impact when they are published in the same journal as the original paper.49
Journal policies must also play a key role in encouraging, rewarding, and normalizing replication. The journal BMC Psychology, launched in 2013, is one of a handful of journals to explicitly take up the gauntlet, noting in its guide to authors that it will “publish work deemed by peer reviewers to be a coherent and sound addition to scientific knowledge and to put less emphasis on interest levels, provided that the research constitutes a useful contribution to the field.” In a provocative article for BMC Psychology, editor Keith Laws has noted that this policy explicitly welcomes negative findings and replication studies.50
Bayesian hypothesis testing. One of the strangest aspects of conventional statistical tests is that they don’t actually tell us what we need to know. When we run an experiment we want to know the probability that our hypothesis is correct, or the probability that the obtained results were due to a genuine effect rather than the play of chance. But the doctrine of NHST doesn’t allow such interpretations—the p value only tells us how surprised we should be to obtain results at least as extreme if the null hypothesis were true. This counterintuitive reasoning of p(data|hypothesis) is why many scientists still harbor basic misunderstandings about the definition of a p value.
A more intuitive approach to statistical testing is to consider the opposite logic to NHST and estimate p(hypothesis|data)—that is, the probability of the hypothesis in question being true given the data in hand. But to achieve this we need to look beyond NHST—a relatively new invention—and return to an eighteenth-century mathematical law called Bayes’ theorem.
Bayes’ theorem is a simple rule that allows us to calculate the conditional probability of a particular belief being true (hypothesis) given a particular set of evidence (data) and our prior belief. Specifically, it estimates the posterior probability of a proposition by integrating the observed data with the strength of our prior expectations. Bayes’ theorem is expressed formally as:

$$P(H|D) = \frac{P(D|H) \times P(H)}{P(D)}$$
where P(H|D) is the posterior probability of the hypothesis given the data, P(H) is the prior probability of the hypothesis, P(D|H) is the probability of the data given the hypothesis, and P(D) is the probability of the data itself.
One of the simplest applications of Bayes’ theorem is in medical diagnosis. Suppose you are a doctor who suspects your patient may be suffering from Alzheimer’s disease. To examine this possibility you give your patient a 15-minute behavioral test for mild cognitive impairment. The test reveals a positive result, potentially indicative of the disease. So what is the probability that your patient actually has Alzheimer’s disease? We can represent this as P(H|D): the posterior probability of the hypothesis that the patient has Alzheimer’s disease given that the test was positive. Using Bayes’ theorem, we can estimate this probability if we have three other pieces of information: the sensitivity of the test, which is the probability that the test would yield a positive result in a patient who actually has Alzheimer’s disease, P(D|H); the prior probability of the patient having the disease in the first place, P(H); and the overall probability of a positive test result regardless of whether or not the patient has the disease, P(D). Now, let’s further suppose that the test has a sensitivity of 80 percent, which is to say that it has an 80 percent chance of detecting Alzheimer’s disease where the disease is present; thus P(D|H) = 0.80. Let’s also assume that the chance of the patient having Alzheimer’s disease within the population is 1 percent (P(H) = 0.01). To complete the picture, all we need to know now is the chance of the test returning a positive result regardless of whether or not the patient has the disease. To calculate this we can expand P(D) according to the “law of total probability,” which states that the overall probability of a particular occurrence (in this case testing positive for Alzheimer’s disease) is the sum of its constituent conditional probabilities:
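In symbols, this expansion splits P(D) into the two ways a positive result can arise, a true positive in a patient who has the disease and a false positive in a patient who does not:

$$P(D) = P(D|H) \times P(H) + P(D|\neg H) \times P(\neg H)$$

where P(¬H) = 1 − P(H) is the probability that the patient does not have Alzheimer’s disease and P(D|¬H) is the test’s false positive rate; once a value for that false positive rate is specified, the posterior P(H|D) follows directly from Bayes’ theorem.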