One thing seems clear—the culture of replication in the physical sciences is a world apart from the one that prevails in psychology. Katie Mack, astrophysicist at the University of Melbourne, told me that in her field there are many situations where reproducing a result is considered essential for moving the area forward. “A major result produced by only one group, or with only one instrument, is rarely taken as definitive.” Mack points out that even findings that have been replicated many times over are valued, such as the Hubble constant, which describes the rate of expansion of the universe. “Many groups have measured it with many different methods (or in some cases the same method), and each new result is considered noteworthy and definitely publishable.”18
New rules and constraints on replications—especially when issued by heavyweights such as Kahneman—are likely to discourage academic journals from publishing them, and in this respect journals hardly need further discouragement. With rare exceptions, the most prominent journals in psychology and neuroscience do not regularly publish direct replications, and even the megajournal PLOS ONE—one of the few outlets to explicitly disavow the traditional focus on novelty—warns authors that submissions that “replicate or are derivative of existing work will likely be rejected if authors do not provide adequate justification.”19 Populating the literature with false or unconfirmed discoveries is thus deemed acceptable, while verifying the replicability of those discoveries demands special explanation.
In contrast to physicists, many senior psychologists seem content with a lack of direct replication. In 2014, Jason Mitchell from Harvard University argued that “hand-wringing over failed replications in social psychology is largely pointless, because unsuccessful experiments have no meaningful scientific value.”20 Mitchell’s central thesis was that a failed replication is most likely the result of human error on the part of the researcher attempting the replication rather than the unreliability of the phenomenon originally reported. But in a lapse of logic, Mitchell ignored the possibility that the very same human error can give rise to false discoveries.21
Psychologists Wolfgang Stroebe and Fritz Strack have also published a provocative defense of the status quo.22 Like Kahneman, they argue that in at least some areas of psychology, direct replications are not possible because the original results depend on a host of hidden methodological variables that are just as invisible to the original researchers as to those attempting the replication. Therefore, they argue, when a direct replication fails it is most likely due to the dependence of the effect on hidden moderators rather than the unreliability of the original finding. Instead of defending direct replication, Stroebe and Strack take the view that conceptual replications are more informative because they test the “underlying theoretical construct.” They conclude that direct replications are uninteresting regardless of the outcome because they bring us no closer to knowing whether the original study was “a good test of the theory.”
Stroebe and Strack’s article triggered a trenchant rebuke from psychologist Dan Simons. Simons argues that this characterization of psychology renders it blatantly unscientific because if the failure of a direct replication can always be explained by hidden variables (or moderators) then the original result, by definition, can never be falsified. In such a world there can be no false discoveries. The task of ensuring reproducibility thus becomes impossible because, as Simons puts it:
the number of possible moderators is infinite: perhaps the effect depends on the phases of the moon, perhaps it only works at a particular longitude and latitude, perhaps it requires subjects who ate a lot of corn as children.… We cannot accumulate evidence for the reliability of any effect. Instead, all findings, both positive and negative, can be attributed to moderators unless proven otherwise. And we can never prove otherwise.23
In addition, Stroebe and Strack’s attack on direct replication can be leveled equally at their preferred alternative of conceptual replication. Whether it succeeds or fails, the outcome of a conceptual replication could also be driven by hidden moderators or random causes; indeed the fact that a conceptual replication adopts a different methodology can only increase its vulnerability to such confounds. Hidden moderators or (as yet) unexplained causal factors are no doubt a reality in much psychological research, as in other sciences. The rational solution to such complexity is a program of comprehensive direct replication followed by cautious, incremental advances. But discounting direct replication because of the hypothetical actions of unobserved (and unobservable) moderators is tantamount to an argument for magic.
Reason 2: Lack of Power
In chapter 2 we saw how hidden analytic flexibility corrupts science by elevating the rate of false positives. As shown by Joe Simmons and colleagues, exploiting researcher degrees of freedom can have a profound effect on the α level—the probability of erroneously rejecting a true null hypothesis—increasing it from a nominal .05 to a startling .60 or higher. These questionable practices are certainly a major source of unreliable findings in psychology. However, an additional and frequently overlooked source is the prospect of experiments missing true discoveries. Under the logic of NHST, this is referred to as β: the probability of failing to reject a false null hypothesis. To draw a courtroom analogy, α can be seen as the probability of convicting an innocent defendant, while β is the probability of acquitting a guilty one. The probability of correctly rejecting a false null hypothesis—that is, of correctly convicting a guilty perpetrator—is thus calculated as 1–β. This value is referred to as statistical power and tells us the probability that a statistical test will detect an effect of a given size, where that effect truly exists.
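To make these quantities concrete, the following is a minimal simulation sketch (the sample size of 30 per group and the true effect size of 0.5 are illustrative choices, not values drawn from any particular study):

```python
# Illustrative simulation of alpha, beta, and power for a two-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, true_effect, n_sims = 30, 0.5, 10_000  # assumed, illustrative values

def rejection_rate(effect):
    """Proportion of simulated experiments rejecting the null at p < .05."""
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(effect, 1.0, n_per_group)
        if stats.ttest_ind(control, treatment).pvalue < 0.05:
            rejections += 1
    return rejections / n_sims

alpha = rejection_rate(0.0)          # convicting the innocent: ~.05
power = rejection_rate(true_effect)  # convicting the guilty: ~.47 with these values
beta = 1 - power                     # acquitting the guilty: ~.53
print(alpha, power, beta)
```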
Since the 1960s, researchers have known that psychology suffers from low statistical power. Psychologist Jacob Cohen was among the first to draw the issue of power and β into prominence. Cohen surveyed all articles that appeared in 1960 across three issues of the Journal of Abnormal and Social Psychology. He found that, on average, they had only a 48 percent chance of detecting effects of medium size and only an 18 percent chance of detecting even smaller effects.24 Even for large effect sizes, Cohen found that the studies in his cohort bore a 17 percent chance of missing them.
Cohen’s analysis of statistical power was groundbreaking and led to a chain of investigations into power throughout the 1960s and 1970s. Even so, these studies had little influence on research practices. When psychologists Peter Sedlmeier and Gerd Gigerenzer returned to the question in 1989 they found that the average statistical power in psychology had scarcely changed between 1960 and 1984.25 In 2001, Scott Bezeau and Roger Graves took yet another look, repeating the analysis on 66 studies published in the field of clinical neuropsychology between 1998 and 1999. Again, they confirmed an overall power to detect medium effect sizes of about 0.50—virtually identical to previous investigations.26 For four decades, psychological research has remained steadfastly underpowered, despite widespread awareness of the problem. Most recently, in 2013, Kate Button and colleagues from the University of Bristol extended these findings to neuroscience, uncovering an even lower median power of 0.21.27
Why do psychologists neglect power? Psychologist Klaus Fiedler and colleagues have argued that the prevalence of underpowered studies follows from a cultural fixation on α at the expense of β. Doing so, they argue, leads to high rates of false negatives—that is, missed true discoveries—that can cause lasting damage to theory generation. While false positives at least have the potential to be disconfirmed by additional research (albeit minimally, owing to low rates of replication), the unrelenting pressure for researchers to produce publishable positive results means that failed hypotheses are likely to be swiftly discarded and forgotten. Fiedler and colleagues warn that when a correct hypothesis is wrongly rejected and abandoned because of a false negative then “even the strictest tests of the remaining hypotheses can only create an illusion of validity.”28
Not surprisingly, a common finding from the studies of Cohen onward is that very few published studies in psychology determine sample sizes through a priori (prospective) power analysis. Instead, sample sizes tend to be decided by a rule of thumb in which the sample size is roughly matched to previous experiments that succeeded in obtaining statistically significant effects. The problem with this approach is that it ignores the statistical properties of replication: when a study detects a true positive with a p value just below .05 (e.g., p = .049) then an exact replication with the same sample size (assuming the same effect size) has only about a 50 percent chance of successfully repeating the discovery.29 Increasing the power to 0.8 or higher generally requires researchers to at least double the original sample size. Given the prevalence of “just significant” results in psychology (see chapter 2), when researchers determine sample sizes based on “what worked last time” then the overall statistical power of a field will converge on 0.5 or less. So it is no coincidence that Cohen and others have consistently estimated the power of psychological research to be no better than the flip of a coin. Valuing our instincts in sample size selection over and above formal a priori power analysis places a glass ceiling on the statistical sensitivity of any scientific endeavor.
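To illustrate the arithmetic behind that 50 percent figure, here is a rough sketch (the two-sample design and the sample of 30 per group are illustrative assumptions, not taken from any particular study): if the true effect size is exactly the one that scraped past p = .05 in the original study, a same-sized replication detects it only about half the time, and reaching 0.8 power requires roughly twice the sample.

```python
# Illustrative sketch: replication power when the original result was "just significant."
from scipy import stats
from statsmodels.stats.power import TTestIndPower

n_per_group = 30                                   # hypothetical original sample size
analysis = TTestIndPower()

# Effect size (Cohen's d) that sits exactly on the p = .05 threshold for this design
t_crit = stats.t.ppf(0.975, df=2 * n_per_group - 2)
d_threshold = t_crit * (2 / n_per_group) ** 0.5

# Power of an exact replication with the same n, assuming d_threshold is the true effect
power = analysis.solve_power(effect_size=d_threshold, nobs1=n_per_group, alpha=0.05)
print(round(power, 2))                             # ~0.50

# Sample size per group needed for 0.8 power at that same effect size
n_needed = analysis.solve_power(effect_size=d_threshold, power=0.8, alpha=0.05)
print(round(n_needed))                             # ~60, roughly double the original
```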
Another reason for the neglect of statistical power in psychology is the almost complete absence of direct replication. Because no two studies employ exactly the same design, this allows researchers to shrug off concerns about power on the grounds that nothing exactly the same has ever been done before, so there is nothing comparable in the literature on which to base a power analysis. This argument is, of course, fallacious for a number of reasons. First, even conceptual replications often apply only small tweaks to previous experimental methods; therefore an estimate of the expected effect size (and hence power) can be gleaned, especially when looking over a body of previous similar work. Second, even in rare cases where power cannot be estimated from previous work, researchers can at least justify a sample size sufficient to detect an effect of a particular size (e.g., small, medium, or large effects, as defined by Cohen).30
It may seem obvious that low power reduces the reliability of psychological research by increasing the rate of false negatives (β), but what about the risk of false positives (α)? Because underpowered studies lack sensitivity to detect true effects, any effects they do reveal must be large in order to achieve statistical significance. Therefore, you might think that a positive result (p < .05) from a low-powered study would actually be more convincing than the equivalent finding from a high-powered study. After all, isn't the bar for detecting a true discovery necessarily higher?
While it is the case that the effect size required to achieve statistical significance is greater for low-powered experiments, this doesn’t mean that the probability of those significant results being true is necessarily higher. Here we must be careful not to confuse the probability of the data under the null hypothesis (p) with the probability that an obtained positive result is a true positive. The probability of a true positive cannot be inferred directly from the p value of an experiment; it must instead be estimated through the positive predictive value (PPV). If you imagine traversing a desert in search of water, the PPV can be thought of as the chance that a shimmer on the horizon is an oasis rather than a mirage. Mathematically it is calculated as the number of true positives (oases) divided by the total number of all positive observations (shimmers), both true (oases) and false (mirages):
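PPV = true positives / (true positives + false positives)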
To see how the PPV relates to power, we need to substitute these quantities for probabilities. In the numerator, the number of true positives can be replaced by the statistical power of the experiment (1–β) multiplied by the prior probability (R) that the effect being studied is truly positive:
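PPV = [(1–β) × R] / (true positives + false positives)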
Then, since the total number of obtained positive results is the sum of all true positives and false positives, we can replace the denominator with the probability of obtaining a true positive, (1–β) × R, added to the probability of obtaining a false positive (α):
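PPV = [(1–β) × R] / [(1–β) × R + α]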
FIGURE 3.2. The statistical power of an experiment tells us its positive predictive value (PPV): the probability that a statistically significant effect is a true positive. This curve shows the relationship between power and PPV for a hypothetical experiment where the prior probability of the null hypothesis being false is 0.2 (i.e., a reasonably unlikely hypothesis) and α = .05. When power is just 0.15 (lower-left dotted line), the PPV is just 0.375. But when power is increased to 0.95 (upper-right dotted line), the PPV rises to 0.79. The curve plateaus at 0.8; to achieve a higher PPV, the researcher would need either to lower α or to test a more plausible H1.
As we can see in figure 3.2, the PPV is therefore directly related to statistical power: as power increases, so does the chance that a statistically significant effect is a true positive. For example, if we assume a relatively low statistical power of 0.15, combined with a prior probability that the null hypothesis is false of 0.20 (R), then the probability of a positive result being a true positive is only 0.375 (PPV). Increasing the statistical power from 0.15 to 0.95 increases the PPV from 0.375 to 0.79. Therefore, high power not only helps us limit the rate of false negatives, but it also limits the proportion of significant results that are false positives—just as a powerful experiment increases the chances of finding an oasis, it also reduces the chances of leading us toward a mirage.
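Plugging the values used in figure 3.2 into the final formula above reproduces these numbers; a minimal sketch:

```python
# Positive predictive value as a function of power, using the formula above.
def ppv(power, prior_r, alpha=0.05):
    """Probability that a statistically significant result is a true positive."""
    return (power * prior_r) / (power * prior_r + alpha)

print(round(ppv(0.15, 0.20), 3))  # 0.375: low power, unlikely hypothesis
print(round(ppv(0.95, 0.20), 2))  # 0.79: same hypothesis, high power
print(round(ppv(1.00, 0.20), 2))  # 0.8: the plateau noted in figure 3.2
```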
As well as failing to recognize the importance of statistical power, psychologists routinely underestimate how complexity in experimental designs reduces power. Psychologist Scott Maxwell from the University of Notre Dame has argued that failure to appreciate the price of complexity could be a major cause of the historically low rates of power in psychology. Even in simple factorial designs, power losses quickly accumulate. Suppose, for instance, you wanted to know the effect of a new cognitive training intervention on weight loss in men and women. To address this question you set up a 2 × 2 factorial design. One factor is the type of training: whether participants receive the new intervention or a control condition (i.e., a placebo). The other factor is the gender of the participants (male or female). Each male and female participant is randomly assigned to one of the treatment groups. Let’s further suppose you have key research questions about each of the factors. First, you want to know whether your new intervention works better than the control intervention. This will be the main effect of training type, collapsed across gender. Second, you are interested in whether men or women are generally more successful at losing weight. This will be the main effect of gender, collapsed across training type. And finally, you want to know whether the training intervention is more effective in one gender than the other. This will be the interaction of gender × training type. Based on similar studies in this area, you settle on a sample size of 120 participants, with 30 participants in each of the four groups. What would your power be to detect a medium-sized effect for each of the main effects and the interaction? Maxwell calculated it to be just 0.47, meaning that in more than 50 percent of such experiments at least one of your tests would return a false negative.31
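To see roughly where a figure like 0.47 can come from, here is a back-of-envelope sketch of my own (not Maxwell's actual calculation), assuming a medium effect of d = 0.5 on each of the three tests:

```python
# Back-of-envelope sketch of power in the 2 x 2 example (not Maxwell's own calculation).
from statsmodels.stats.power import TTestIndPower

d_medium, n_per_cell = 0.5, 30

# Each main effect collapses across the other factor, so it compares 60 vs. 60 participants;
# the interaction test has roughly the same power for a medium-sized interaction effect.
power_single = TTestIndPower().solve_power(effect_size=d_medium,
                                           nobs1=2 * n_per_cell, alpha=0.05)
print(round(power_single, 2))        # ~0.78 for any one test

# Treating the two main effects and the interaction as roughly independent tests,
# the chance that all three detect their medium-sized effects is much smaller.
print(round(power_single ** 3, 2))   # ~0.47
```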
In 2002, Rachel Smith and colleagues from the University of Michigan reported a series of computer simulations to test the chances of obtaining β errors in experimental designs that have even more factors and explore more complex interactions. What they found was disturbing—at effect sizes and sample sizes commonly observed in psychological research, the rate of false negative conclusions was as high as 84 percent.32 Their message was clear: researchers ignore power at their peril.
Reason 3: Failure to Disclose Methods
The main purpose of including method sections in published research articles is to provide readers with enough information about the design and analysis to be able to replicate the experiment. A critical reader of any method section should be asking not only whether the reported procedure is sound but also whether it provides sufficient details to be repeatable. Unfortunately, an additional source of unreliability in psychology lies in the systematic failure of studies to disclose sufficient methodological detail to allow exact replication.
Evidence of a systematic lack of disclosure was first uncovered in the survey of 2,000 psychologists by Leslie John and colleagues reported in chapter 2. Based on self-reports, John and colleagues estimated that more than 70 percent of psychologists have failed to report all experimental conditions in a paper, and that virtually all psychologists have failed to fully disclose the experimental measures in a study. In 2013, a team of psychologists led by Etienne LeBel from the University of Western Ontario sought to determine the prevalence of such practices. LeBel and colleagues contacted a random 50 percent of authors who had published in four of the most prestigious psychology journals throughout 2012 and asked whether they had fully disclosed the exclusion of observations, experimental conditions, experimental measures, and the method used to determine final sample size.33 Of 347 authors contacted, 161 replied—and of these, 11 percent reported that they had failed to report the exclusion of observations or experimental conditions. But even more startling was that 45 percent of respondents admitted failing to disclose all the experimental measures they acquired. The most popular reason for concealing these details was that the excluded measure was “unrelated to the research question.”34 LeBel and colleagues conclude that the time has come for disclosure statements to become a mandatory part of the manuscript submission process at all psychology journals. Disclosure statements could be as simple as the “21 word” solution proposed in the same year by Simmons and colleagues (see chapter 2), which simply requires authors to say: “We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.”35