Large-scale replication failures have shaken confidence in the social sciences, psychology in particular. Most researchers acknowledge the problem, yet there is widespread debate about the causes and solutions. Using “big data,” the current project demonstrates that unintended consequences of three common questionable research practices (retaining pilot data, adding data after checking for significance, and not publishing null findings) can explain the lion’s share of the replication failures. A massive dataset was randomized to create a true null effect between two conditions, and then these three questionable research practices were applied. They produced false discovery rates far greater than 5% (the generally accepted rate), and were strong enough to obscure, or even reverse, the direction of real effects. These demonstrations suggest that much of the replication crisis might be explained by simple, misguided experimental choices. This approach also produces empirically-based statistical corrections to account for these practices when they are unavoidable, providing a viable path forward.
Keywords: Replication; Questionable Research Practices; False Discovery Rates; Pilot Participants; Optional Stopping; Filedrawer Problem
DOI: https://doi.org/10.36850/jrn.2023.e44
Fundamentally, any science must be built upon a literature of replicable results. When this reproducibility is questioned, as has happened recently in psychology [e.g., 1], it not only undermines the theories that produced those experiments, it hinders the synthesis of the literature required to develop new theories and experiments. For psychology, the “replication crisis” has created a credibility crisis, eroding general confidence in published work, but it has also provided an opportunity for psychologists to better understand and reform their own methods to improve the scientific rigor of the field [see 2, 3 for further discussion and proposals].
The goal of the current study was to buttress these calls for reform by (a) precisely quantifying the impact of commonplace, but incorrect, practices on the integrity of scientific inferences in actual data, and (b) deriving a potential set of statistical corrections for these practices. The scope of the replication failures (e.g., 53%) suggests that the problem extends beyond individual bad actors, and must result from some combination of the general difficulty in finding results in certain domains [e.g., 4], a lack of transparency in and fidelity to the already published methods [e.g., 5, 6], and systemic questionable research practices. Here, we demonstrate that a small set of questionable research practices—retaining pilot data, adding data to achieve significance, and not publishing null results—are sufficient to produce false discovery rates (FDRs) in excess of 40% across thousands of independent experiments and obscure, and even flip, the direction of real effects.
The three specific practices discussed here come from a broader class that are known to be incorrect, but are nevertheless commonplace in psychology research [7, 8, 9, 10, 11]. Each is an example of a post-hoc decision being placed between data collection and publication. In an idealized implementation of the scientific method, researchers develop a hypothesis and an experiment to test it, collect and analyze the data, interpret the results, and publish (Figure 1; left boxes). To avoid bias, experimentation can only ever run in one direction—researchers can never go backwards based on the outcome of a later step. Most pointedly, if researchers modify their methods by making post-hoc decisions based upon results within the same experiment, they have violated core assumptions of the scientific process and statistical tests.
First, researchers regularly run “pilot” studies wherein they test a small number of participants to check the efficacy of their design and the likelihood of finding significant results. This important step in the scientific process does not represent a problem, unless the pilot data are included in final analyses. Retaining pilot data creates a circularity [e.g., 12, 13] wherein some proportion of the reported dataset was selected for inclusion because it agreed with the hypothesis (Figure 1). This practice creates an unfair head start that inflates FDRs and can even flip real effects.
Second, adding data to achieve significance (or “optional stopping” [e.g., 3]) constitutes a backwards flow (Figure 1) where the outcome of an experiment (i.e., a non-significant result) causes an adjustment in methods (e.g., adding more participants to “increase power”). This process systematically alters data collection, creating a ‘heads I win, tails we flip again’ sequence that inflates false discoveries [14]. Finally, even assuming that there was no backward flow during data collection and analysis, researchers still do not regularly publish null results. Not publishing null results, known as the “Filedrawer problem” [15], artificially keeps null results out of the literature (Figure 1), inflating the chances of false discoveries and estimates of effect size.
The above explanations may seem intuitive, and some of these questionable practices have been modeled before [e.g., 3, 12, 16], but they remain pervasive, perhaps because they are difficult to avoid or because their impact has not been fully quantified in real data. Here, we directly demonstrate the potential impact of these practices. As will be described below, we used a dataset of over 50,000 individuals performing a common cognitive task: visual search. Note that the dataset was gleaned from performance in a mobile app in uncontrolled settings, but these data have previously been found to be broadly similar to performance in laboratory psychophysics studies [e.g., 17]. From this dataset, we generated 1 million standard-sized psychology experiments and then applied these three questionable practices.
Study 1 shows in data known to contain only null findings (we systematically randomized real data) that these three questionable practices, even under conservative estimates, can produce FDRs in excess of 40%. To help in visualizing these results we have made a webpage (www.bigcogsci.com/post-hoc.html) that allows users to interactively explore FDRs resulting from the combination of these three practices across a broad range of designs. Importantly, these practices by no means represent the only instances of this class of post-hoc decisions, making this outcome all the more alarming.
Study 2 demonstrates that these three practices are powerful enough to take a fundamental finding from the visual search literature (that accuracy is higher with fewer distractions present) and completely distort it. These practices make hypotheses self-fulfilling, completely obscuring effects that do not agree with an investigator’s hypothesis, and even reversing real effects. Put another way (and as shown below), it is possible to apply these questionable research practices to this dataset to produce thousands of independent replications that erroneously disprove every theory of visual search.
Studies 1 and 2 below highlight the problems questionable practices can cause, and our first goal is to encourage researchers to avoid them whenever possible. However, such questionable practices are commonplace, and they are so for a reason—they are sometimes a natural part of the research process. As such, below we also provide corrections [see also 36] that allow for resorting to these practices when necessary (e.g., retaining pilot data in fields where data collection is prohibitively expensive), at least within the context of cognitive psychology experiments.
Visual search is one of the most heavily studied tasks in cognitive psychology [see 18, 19 for reviews], and as such its effects and mechanisms are relatively well known. In a typical visual search trial, participants are asked to find a target item (e.g., a simple object like a letter), which may or may not be present, amongst distractors that vary in number and/or similarity to the target. One of the most replicated findings in this literature is that search performance is worse with more distractors in terms of both accuracy (saying whether the target is present) and response time (the time spent searching the display).
Here we took advantage of perhaps the largest single visual search dataset available for research purposes, Airport Scanner (Kedlin Co.), a game available on Android and Apple devices wherein players assume the role of airport security screeners and search bags (i.e., trials) for prohibited items (i.e., targets) amongst allowed items (i.e., distractors). The dataset is proprietary and used with permission from Kedlin Co. The task mirrors the structure and performance of standard visual search laboratory-based paradigms [17, 20, 21, 22], making it an excellent dataset for comparison to standard cognitive psychology experiments.
As of March 2022, the Airport Scanner dataset contained data from over 3.8 billion trials across more than 15.5 million distinct installations (i.e., participants). For the current purposes, data from 886,338 participants were used. To analyze the data, a set of in-house Python scripts was developed, using the numpy and multiprocessing libraries combined with the George Washington University supercomputing cluster to rapidly process the data. The output of these Python scripts was input into a set of in-house Matlab scripts to produce the plots presented here.
First, the full set of 886,338 participants was filtered to remove a variety of potential confounds (e.g., trials wherein the participant repeated a level), limiting the final dataset to only those participants (N=56,081) who completed at least 48 trials (24 target present and 24 target absent) in each of two conditions: displays containing exactly 7 or exactly 9 distractor items. Aggregated across this dataset, there was an expected and reliable visual search result—participants were more accurate with 7 compared to 9 distractor items present (1.1% accuracy difference across all participants). The small magnitude of this difference is sensible given that having 7 versus 9 distractors is a subtle experimental change.
In Study 1, we randomized the data (see details below) to create a null dataset in which there was no difference between the conditions. The large dataset was then divided into independent experiments, each with a typical number of participants for a visual search experiment [e.g., 23, 24]. We then applied the research practices described above (see below for further details) to see what proportion of experiments led to a spurious significant difference. This FDR should be pinned to the criterion of the statistical test (e.g., .05 for p < .05) unless an unfair bias has been introduced by the questionable practices.
In contrast to modeling approaches, this technique left in place all of the natural variability across participants and trials and made no assumptions about the distribution of effects across participants or experiments. This technique results in thousands of independent replications in which the effect of these practices can be quantified. The proportion of significant experiments establishes the FDR resulting from the practices, and the distribution of the resulting p-values allows for the derivation of corrected thresholds to compensate for the introduced bias. These corrected thresholds have the advantage of being derived from real data, thus requiring fewer assumptions than those produced by modeling approaches [e.g., 25, 26]. However, one must still assume some level of generalization between the task under consideration and the variability in the Airport Scanner dataset.
In Study 2, we did not randomize the data, leaving in place the real difference in performance between trials with 7 vs. 9 distractors. We then show that the direction of the effect can be weakened or even reversed when the three questionable practices discussed here were applied and pilot data were selected for inclusion based on showing the opposite effect.
Study 1 investigated how retaining pilot data, adding participants, and not publishing nulls can undermine a field by producing unacceptably high FDRs. To do so, we turned the Airport Scanner dataset into a null-effect dataset so that it was an absolute fact that the only difference between the conditions was pure noise.
To generate a null-effect dataset, we took the ordered lists of the accuracy of each participant’s first 24 target-present trials that contained exactly 7 distractors and 24 target-present trials that contained exactly 9 distractors. Half of the 7-distractor trials were then randomly flipped with half of the 9-distractor trials. This procedure maintained trial order, within-, and between-subject variability, and allowed for an independent set of trials to be flipped for each participant. In the resulting null effect dataset the true difference between the distractor conditions (i.e., effect size) and the variance explained by those conditions was known to be zero. Further, the rate at which any experiment should find a significant difference should match the FDR given by the threshold of the statistical test (.05). Starting from these known values we could then evaluate the distortions introduced by any particular questionable research practice.
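The trial-swapping step can be illustrated with a short sketch (in Python, in keeping with the analysis scripts described above). This is not the authors' in-house code; it assumes a hypothetical array data of shape (n_participants, 2, 24) holding each participant's trial-level accuracy (0/1) in the 7-distractor (index 0) and 9-distractor (index 1) conditions, and it swaps the two conditions at a random half of the trial positions for each participant.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def make_null_dataset(data):
    """Swap a random half of each participant's 7-distractor trials (index 0)
    with the 9-distractor trials (index 1) at the same positions, preserving
    trial order and within-/between-subject variability while forcing the
    true condition difference to zero."""
    null_data = data.copy()
    n_participants, _, n_trials = data.shape
    for p in range(n_participants):
        # an independent random half of the 24 trial positions per participant
        idx = rng.choice(n_trials, size=n_trials // 2, replace=False)
        swapped = null_data[p, 0, idx].copy()
        null_data[p, 0, idx] = null_data[p, 1, idx]
        null_data[p, 1, idx] = swapped
    return null_data
```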
To begin, we applied each practice to the null effect dataset individually (see details below). To calculate the FDR—the likelihood of finding a significant effect despite the dataset containing a true null—we counted the number of significant experiments (as defined by a two-tailed t-test that produced a p-value < .05) and divided by the total number of experiments. This yielded estimates of the FDR caused by retaining pilot data and adding data.
We also derived corrected alpha-values—the threshold required to reset the FDR to .05. We considered the distribution of p-values observed across the experiments with each practice applied and took the p-value threshold beneath which only 5% of the values fell as the corrected threshold (i.e., the p-value that separates the full null dataset into 95% of the experiments being not significant and 5% being significant).
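Both calculations are simple; a minimal sketch, assuming a one-dimensional array p_values holding the p-value from each simulated null experiment after a given practice has been applied:

```python
import numpy as np

def false_discovery_rate(p_values, alpha=0.05):
    """Proportion of null experiments that nonetheless reached significance."""
    return np.mean(p_values < alpha)

def corrected_alpha(p_values, target_fdr=0.05):
    """The p-value threshold beneath which only target_fdr of the null
    experiments fall, i.e., the threshold that resets the FDR to .05."""
    return np.quantile(p_values, target_fdr)
```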
The results are presented in four sections—(1) retaining pilot data, (2) adding data, (3) not publishing null results, and (4) the combination of these three practices.
Most experiments go through a repeated process of piloting, wherein some small amount of data is collected under a design, the results checked, and adjustments made for the next pilot. This piloting narrows in on a version that shows some hint of the hypothesized effect. This process is often necessary to avoid expending resources on experiments doomed to failure, and it is also statistically benign, unless the pilot data from the final design are included in the reported dataset. There is strong pressure to include these data as it conserves resources, particularly in subfields that rely on expensive and/or difficult to collect data (e.g., fMRI, neuropsychology, infant research).
To simulate the practice, we first drew 2, 4, or 8 random participants from the full, null dataset and checked whether these pilot participants showed, on average, a minimum absolute accuracy difference of 4% between the conditions. If they showed the required difference, we drew enough participants to constitute a full experiment (16-32), recorded the results, and then proceeded with the next pilot draw. If they did not, we simply proceeded to the next draw up to a maximum of 4 times. Even in this null effect dataset, it was generally possible to observe this large a difference in 4 or fewer pilot draws (2 participants: 97.3%, 4 participants: 88%, 8 participants: 72.2%). This process was repeated until 100,000 experiments had been simulated.
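In outline, each simulated experiment followed the logic sketched below. The helper draw_participants(n), the paired two-tailed t-test, and the column layout (7-distractor accuracy in column 0, 9-distractor accuracy in column 1) are illustrative assumptions rather than the authors' actual implementation.

```python
import numpy as np
from scipy import stats

def simulate_pilot_retention(draw_participants, n_pilot=8, n_total=16,
                             min_diff=0.04, max_pilots=4):
    """One simulated experiment: re-pilot until a pilot shows at least a
    min_diff absolute accuracy difference, then top up to n_total
    participants and test. Returns the p-value, or None if no pilot
    met the criterion within max_pilots draws."""
    for _ in range(max_pilots):
        pilot = draw_participants(n_pilot)                # shape (n_pilot, 2)
        if abs(np.mean(pilot[:, 0] - pilot[:, 1])) >= min_diff:
            sample = np.vstack([pilot, draw_participants(n_total - n_pilot)])
            return stats.ttest_rel(sample[:, 0], sample[:, 1]).pvalue
    return None  # no usable pilot; no experiment is recorded
```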
Note that beyond the pilot participants, every additional participant reduces the proportion of the total sample that was biased. As a result, the inflation of the FDR was most extreme with smaller numbers of total participants (Figure 2A). Retaining 8 pilot participants increased the FDR by more than 3.5 times with 16 participants in the full design and by more than 2.5 times with 32 participants. Retaining only 2 pilot participants still caused an appreciable inflation of the FDR, even with 32 total participants (Figure 2A; green line).
Quantifying the impact of retaining pilot data on the FDR using this method also enables corrections. When researchers find themselves in a situation where they cannot avoid this practice, they can adjust their alpha level for significance based on what practices they willfully engaged in to determine a new threshold that accounts for the inflated FDR. However, retaining pilot data creates sufficiently large distortions that it engenders large corrections to reset the FDR to the threshold value (.05; Figure 2B). These corrections are somewhat attenuated by larger numbers of total participants, but even so, the corrected p-value thresholds remain very low, and may be very difficult to achieve if other corrections (e.g., multiple comparisons) are also needed.
Retaining pilot data is a direct selection of data for inclusion based on whether they agree with the hypothesis. While correcting for this practice is possible, retaining pilot data should be avoided as it distorts significance estimates. While adding more participants free from the selection bias attenuates these distortions, they are never truly extinguished (see web applet for a larger range of total participants; www.bigcogsci.com/post-hoc.html).
When an experiment produces results that are suggestive of a significant difference but above threshold (e.g., p=.07 when using an alpha threshold of .05), researchers sometimes add data in hopes of achieving significance. Given the publication pressure against null results (see below), and the large investment required for data collection, it is unsurprising that researchers often engage in this practice [11]. The pressure to add participants to a marginal effect is even stronger in fields where data are harder and/or more expensive to collect and analyze. Underpowered designs are made more common by the lack of good estimates of a priori experimental power, leading many aspects of experimental design to be chosen by tradition, lab chief fiat, or simple intuition. Finally, in psychological research, “add a few more participants” is sometimes a request of reviewers and/or editors during the publication process. As the below analyses will show, this is a request likely to make a reviewer feel justified as even adding a single participant inflates the FDR.
To simulate this practice, 16 participants were drawn from the null dataset and checked for a significant difference between the conditions (p<.05). If significant, we stopped and began the next experiment. If not significant, we added 1, 2, 4, or 8 new participants and checked again. We performed this process until we either achieved significance or reached the maximum number of participants for a given experiment (32; see web applet for a larger range). This process was repeated until 100,000 experiments had been simulated.
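Each simulated experiment follows roughly the loop below (a sketch using the same hypothetical draw_participants(n) helper and paired t-test assumed in the pilot example above):

```python
import numpy as np
from scipy import stats

def simulate_optional_stopping(draw_participants, n_start=16, n_max=32,
                               step=1, alpha=0.05):
    """Test at n_start participants, then keep adding step participants and
    re-testing until p < alpha or n_max is reached. Returns the final
    p-value and the sample size at which data collection stopped."""
    sample = draw_participants(n_start)
    while True:
        p = stats.ttest_rel(sample[:, 0], sample[:, 1]).pvalue
        if p < alpha or len(sample) >= n_max:
            return p, len(sample)
        sample = np.vstack([sample, draw_participants(step)])
```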
To calculate the FDR for any particular total number of participants we aggregated the false discoveries for all experiments up to that size. This approach effectively captured the FDR for the maximum number of participants the experimenter was willing to add, with data collection stopping as soon as significance was reached. This optional stopping rule caused a substantial inflation of the FDR as more participants were added (Figure 3A). For example, adding one participant at a time and checking for significance more than doubled the FDR after only 8 additional participants. This effect was somewhat attenuated by adding larger numbers of participants before checking for significance, but even when adding 8 participants at a time the FDR still nearly doubled after just two such sets were added.
Because the participants added were independent draws from a null distribution, the observed p-values were not particularly extreme. In fact, they tended to cluster, as has been observed in meta-analyses of the literature [27], just below the threshold value (.05). Thus the required corrections were not draconian (Figure 3B). Interestingly, the corrections required for adding larger numbers of participants were slightly larger as the smaller and more select set of significant experiments tended to have slightly more extreme p-values.
Adding data directly impinges on the internal validity of the significance estimates, inflating FDR even when large numbers of participants were added at once. Luckily, this practice did not produce p-values as extreme as those observed for retaining pilot data, leading to milder corrections. Nonetheless, this practice should be avoided if possible but reported directly in the methods section if used. The relative mildness of the impact of this questionable research practice on the FDR should not be taken as license to ignore the problem or to forgo the a priori power analyses that have been recommended, even for large N designs. These power analyses can help avoid the problem entirely and can help optimize experimental design to minimize costs in budget and schedule.
Experiments which produce null results are far less likely to be published than studies with a significant outcome [28]. This occurs for a wide range of reasons. For example, researchers typically have low confidence in interpreting a null result, as it could arise from experimental errors or a lack of power. Further, it is only recently that outlets for publishing null results have begun to gain traction [e.g., 14]. Previously, there were very few, if any, prestigious journals that would publish null results. So, given the ambiguities in interpretation and the lack of credit and support for doing so, it is understandable that most researchers will not invest the significant resources needed to publish a null result [29].
The FDR is the number of significant replications divided by the number of replications run. The effect of not publishing nulls is to artificially increase the FDR by reducing the apparent number of studies run. That is, when studies do not enter the collective knowledge of the research field, the denominator is artificially shrunk when dividing the number of significant experiments (numerator) by the total number of known experiments run (denominator). For example, imagine a literature where only 1 out of every 5 null results is ultimately published (a conservative estimate). To derive the effective FDR across this literature, we simply count the number of significant experiments from our null dataset and divide by the total number of experiments run, but assume that only 20% of the null experiments are reported. Concretely, with 10,000 simulated experiments and 500 significant experiments that are all reported, the FDR = 500 / (500 + 9,500 × 0.2) ≈ 0.208. To model this practice we simply assumed a varying rate of reporting of hidden null replications.
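The arithmetic of this example reduces to a one-line function (a sketch of the calculation described above, not of the analysis scripts themselves):

```python
def filedrawer_fdr(n_significant, n_null, report_rate):
    """Effective FDR when only report_rate of the null experiments are published."""
    return n_significant / (n_significant + n_null * report_rate)

print(filedrawer_fdr(500, 9500, 0.2))  # ~0.208, as in the example above
print(filedrawer_fdr(500, 9500, 1.0))  # 0.05 when every null result is reported
```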
When there were no hidden replications, the FDR sat at the threshold level (.05). As the denominator in the FDR calculation was scaled down, the FDR rapidly inflated, reaching a break-even point of 50% when only ~5% of null replications were reported (Figure 4A). The commensurate correction required to adjust for the distortion was equally extreme (Figure 4B).
This practice is perhaps the most problematic for a number of reasons. First, it leads directly to very severe inflation of the FDR across a literature. Second, the precise size of the “shadow” literature is entirely unknown, making it hard to guess at how much of an impact the practice is having. Third, not reporting null results also leads to an inflation of the reported effect sizes across studies, an effect that has already been shown in the literature [e.g., 30, 31; see also Study 2]. In the absence of this publication bias, reproducibility increases [e.g., 32].
Up until this point, we have explored the inflation of the FDR caused by particular questionable research practices in isolation. However, these practices likely occur together. To demonstrate their combined impact, we began with the same procedure for assessing the retention of pilot data and then pulled additional participants to reach a minimum experiment size of 16. We then checked for significance as we added each participant (as described in the adding data section above) up to a maximum of 32. We repeated this process until we had created 1,000,000 experiments.
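Putting the pieces together, the combined pipeline looks roughly like the sketch below (reusing the hypothetical helpers from the earlier sketches); the file-drawer step is then applied to the resulting p-values exactly as in the filedrawer_fdr example above, by discounting the unreported null experiments.

```python
import numpy as np
from scipy import stats

def simulate_combined(draw_participants, n_pilot=4, min_diff=0.04,
                      max_pilots=4, n_start=16, n_max=32, alpha=0.05):
    """Retain a 'successful' pilot, top up to n_start participants, then add
    one participant at a time (re-testing each time) up to n_max."""
    for _ in range(max_pilots):
        pilot = draw_participants(n_pilot)
        if abs(np.mean(pilot[:, 0] - pilot[:, 1])) >= min_diff:
            sample = np.vstack([pilot, draw_participants(n_start - n_pilot)])
            while True:
                p = stats.ttest_rel(sample[:, 0], sample[:, 1]).pvalue
                if p < alpha or len(sample) >= n_max:
                    return p
                sample = np.vstack([sample, draw_participants(1)])
    return None  # no usable pilot within max_pilots draws
```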
Combining the three practices together causes a massive inflation of FDRs (Figure 5A) and concomitantly severe p-value corrections (Figure 5B). Even with no hidden replications (left panels) the combination of adding participants and retaining pilot participants was already sufficient to inflate the FDR to 3 times the standard (.15 vs. .05) for even moderately sized pilots (Figure 5A; left panel; center column). Correcting this FDR back to .05 required p-value thresholds between .03 and .025 (Figure 5B; left panel; center column). Failing to report just 33% of null results increased the FDR to 5 times the standard for the same size pilots (Figure 5A; middle panel; center column) and resulted in far lower corrected alpha values (.02 to .015; Figure 5B; middle panel; center column). With 50% reporting of null results the FDR for those same sized pilots increases to 6 times the standard (Figure 5A; right panel; center column) and the corrected alpha values to between .015 and .01 (Figure 5B; right panel; center column).
It is clear that these three questionable research practices—retaining pilot data, adding participants, and not publishing null results—alone are sufficient to explain a large portion of the “replication crisis” (e.g., a large percentage of failed replications in psychology studies, [1]). Perhaps most concerning is the fact that the true extent of the unpublished literature remains unclear. At more extreme values than those shown here, the FDR inflation becomes untenable (see www.bigcogsci.com/post-hoc.html).
In Study 2 we examined an extreme example of how questionable research practices can produce dramatic consequences, distorting and even flipping the direction of real effects.
We extracted data from the Airport Scanner dataset without the data shuffling process that produced the null dataset used in Study 1. Similar to Study 1, Result #1, a piloting procedure was implemented for either 2, 4, or 8 participants such that the pilot was retained when the participants collectively showed at least a 4% effect. However, here the effect was selected for accuracy being higher for displays containing 9 versus 7 distractors (i.e., retaining pilot data for the opposite effect than what should be found). If the effect was not over 4%, another draw was taken, up to a maximum of 4 draws. Because we were selecting against the true effect in the real dataset, 4 pilot studies were sufficient to find the desired effect in only 21.8% of simulations with 8 pilot participants, 45.4% for 4 participants, and 66.7% for 2 participants. In those simulations that succeeded in finding the desired reversed effect, we added enough additional participants to the pilots to reach 16 and then added more participants in groups of 1, 2, 4, or 8 (ultimately averaging over these conditions) up to a maximum of 32, checking for significance with each addition and stopping data collection if significance was achieved. Finally, when deriving effects across the replications we only considered those experiments that produced a significant difference between the conditions (i.e., not publishing nulls).
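Relative to the Study 1 pipeline, the only substantive change is the pilot criterion, which now selects for the reversed effect and draws from the unshuffled data; a minimal sketch of that check (same hypothetical column layout as above):

```python
import numpy as np

def pilot_shows_reversed_effect(pilot, min_diff=0.04):
    """Retain the pilot only if accuracy is at least min_diff higher with
    9 distractors (column 1) than with 7 (column 0), i.e., opposite to the
    true effect in the unshuffled dataset."""
    return np.mean(pilot[:, 1] - pilot[:, 0]) >= min_diff
```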
To maintain independence amongst the experiments (i.e., no reuse of participants) we only created experiments until the full dataset of 56,081 participants had been exhausted (in this case, it yielded ~1500 experiments total in each of the designs). Then we randomized the full dataset again and repeated this procedure until we had made 100 sets of ~1500 independent experiments.
To establish the detection rate, we took the total number of experiments in each set that showed a significant difference in each direction as measured by a two-tailed t-test and divided by the total number of experiments in the set. To calculate the reported effect size we averaged the effect sizes observed in all of the experiments without regard to the direction of the effect. We then calculated a standard error across those experiments. Finally, all three of these measures were calculated in each of the 100 sets and averaged across them.
Amongst all of the replications, Figure 6A shows the proportion that showed the real or opposite effect (detection rate) as a function of the amount of bias introduced by retaining pilot data in a direction opposite to the true effect. Figure 6B shows the reported proportion correct for trials with 7 and 9 distractors amongst only the significant experiments—which reflects not publishing null results.
In experiments where no participants were added and no pilot data were retained (unbiased experiments), detection rates for this small effect were low but clearly greater than the standard FDR (Figure 6A: dashed line). These experiments were ~13 times more likely to find an effect in the true than opposite direction. Because the significant experiments tend towards larger effect sizes, the reported difference between the conditions was inflated from the true size of 1.1% to 3.6% (Figure 6B; black line). In sum, not publishing null results more than tripled estimates of the effect size.
Next, applying even a small amount of bias by retaining 2 pilot participants that demonstrated an effect in the opposite direction (i.e., greater accuracy with more distractors present) obscured the true effect. Specifically, this reduced detection rates for the true effect, inflated detection rates for effects in the opposite direction (Figure 6A), and attenuated the reported difference between the conditions (Figure 6B; green line).
With only 4 retained pilot participants, the detection rates (Figure 6A) and the average direction of the difference between the conditions reversed (Figure 6B; cyan line). Retaining 8 pilot participants almost completely eliminated detection of the true effect and massively inflated detection of an effect in the opposite direction, making effects in the opposite direction 47 times more likely to be detected (Figure 6A). Further, the reported difference between the conditions was now even more strongly in the wrong direction (~-4.5%) than the unbiased reports were in the right direction (~3.6%; Figure 6B; purple line).
In combination, these commonplace questionable practices made the hypothesized effect much easier to detect and the true effect almost impossible to find. These results constitute ~1500 independent replications in real data showing that there is either no cost, or even a benefit, of having more distracting information (a “result” at odds with the existing literature and common sense). Further, the pairwise differences between accuracy at each level of retaining pilot data were all significantly different from each other (Figure 6B; all p’s<.05), despite the fact that they are based on the same data.
It is important to note that the effect being reversed here is quite small, given that it reflects a subtle difference in performance based on there being either 7 or 9 distractors present in the display. A larger effect would likely be less influenced by these manipulations; nonetheless, it is striking that, using only common practices that are all but ignored in peer review and rarely (if ever) included in methods sections, we have reproducibly invalidated a core finding in visual search. Given that this invalidation drew on more data than has been collected in the history of the field, the result could have been published as a highly significant and reliable effect, despite being entirely wrong.
Considering these outcomes in light of typical academic debates is also sobering; when one group argues for one theory and another argues for the opposite theory, they “battle” with data, but those data are collected in the context of their a priori hypotheses. These commonplace practices may unintentionally lead to a never-ending tug-of-war where each group finds data supporting their own hypotheses and never finds evidence in support of their opposition’s.
The above analyses demonstrated that it is possible to generate erroneously significant experiments (Study 1) at a rate comparable to the rate of reported replication failures [e.g., 1] and even reverse a real effect (Study 2) from three commonplace, questionable research practices. Further, we have not even applied the full range of practices that introduce these sorts of distortions; anytime a filter or a backwards flow of information is introduced (See Figure 1), an opportunity is created for the hypothesis to color the results that are published in the literature. For example, researchers often try different versions of their analyses (e.g., means vs. medians, different outlier exclusion criteria), all of which constitute multiple bites at the apple and could exacerbate the distortions quantified here.
The current findings are both heartening and disheartening. On one hand, they suggest that questionable practices might be to blame for much of psychology’s current replication crisis—not nefarious actions. On the other hand, these findings are concerning since these practices are common and clearly damaging. There is a robust literature about the impact of some of these kinds of design choices [e.g., 3], but this has yet to fully penetrate the collective consciousness of the field. Researchers must appreciate how damaging these practices actually are. For example, most researchers probably know it is not good practice to retain pilot data, but not how big a distortion it can create. We hope that the current report increases awareness that these practices are likely reducing reproducibility and may even be causing reversed effects to be reported. Note that these problems can only be alleviated, not solved, by simply running large numbers of participants (e.g., via Mechanical Turk); regardless of whether these practices happen in experiments with 10 or 100 participants, biases will be introduced and FDRs inflated. Experiments should be as large as is feasible, and these practices should still be avoided, to produce the strongest possible literature.
Throughout this paper we have focused on three specific questionable research practices in the specific context of hypothesis-driven research. There are many ways to expand upon these efforts, and it is noteworthy that many aspects of psychological research are exploratory and undertaken without a clear hypothesis as to the nature or direction of the effects that will result from the experiment. Such exploratory studies are worth considering in their own right as there are some practices (e.g., selecting pilot data for showing an effect) that are less likely and others (e.g., adding participants, not publishing nulls) whose effects are likely to be exacerbated. In general, exploratory studies will be less susceptible to a confirmation bias for a particular effect but more impacted by practices that clean up the data to find significance. Researchers should explicitly report when studies are exploratory, and, when possible, establish a paradigm and analysis chain that is held constant in subsequent independent replications. Certainly neither hypothesis-driven nor exploratory research can claim any immunity to the impact of the practices discussed here.
Here, we have modeled the impact of several questionable research practices on FDRs when considering research done as a single experiment. However, many published papers include several experiments that collectively support a single point, though often with very different operationalizations. Further, the literature as a whole will often contain multiple demonstrations of related significant effects. These forms of converging evidence will clearly reduce the effective inflation of FDRs that result from most questionable research practices, with the exception of the filedrawer problem, which they may actually exacerbate by encouraging the non-publication of failed replications. A reasonable estimate is that one can simply raise the FDRs reported here to the power of the number of convergent results. However, this presumes wholly independent datasets and independently derived analytical approaches, neither of which is necessarily true for any particular result. Ultimately, the extant literature must be carefully and, when possible, quantitatively examined when converging evidence is key. In general, the best strategy going forward is to avoid these practices whenever possible and to report them transparently and completely when not. This strategy requires better initial experimental design and, therefore, a more accurate way of estimating the power of an experiment a priori. The corrections detailed here might offer a way to account for these questionable practices when they are unavoidable, strengthening the case for preregistration by allowing flexibility without sacrificing integrity, at least within cognitive psychology. This system would encourage reporting both the results of the initial (preregistered) design and the results subsequent to any post-hoc decisions and corrections. Already there have been increasing calls for this sort of transparency and standardization in research practices [3, 33, 34, 35, 36, 37].
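As a rough numerical illustration of the convergent-evidence estimate above (under the strong, and usually optimistic, assumption of fully independent datasets and analyses):

```python
def convergent_fdr(single_fdr, n_convergent_results):
    """Joint FDR if each convergent result were an independent test."""
    return single_fdr ** n_convergent_results

print(convergent_fdr(0.40, 2))  # 0.16: two independent convergent results
print(convergent_fdr(0.40, 3))  # 0.064: three bring the joint rate near .05
```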
More generally, these practices arise from the particular incentive structure built into the publishing process [29, 38, 39] that encourages researchers to pursue significant and novel results rather than thorough reporting of well-designed and conducted experiments. Chasing significance encourages the development of lab specific analytical approaches [40], p-hacking [35], misreporting of p-values [41], and not publishing null results [15, 28, 42]. All of these practices introduce biases, reduce the ability to quantitatively aggregate data across studies, and create methodological vagueness in fields dependent on convergence and replication. It is perhaps time to reevaluate the centrality of significance to our publication system [see 29 for further discussion, 43].
We thank Ben Sharpe and Kedlin Co. for access to the Airport Scanner dataset, Thomas Liljetoft and Alex Hartley for help with the web-based applet, and The George Washington University Cognitive Neuroscience community, especially Justin Ericson, for helpful comments, as well as the Reviewers whose comments significantly improved the organization and clarity of the manuscript.
Conceptualization: DJK, SRM
Data curation: DJK, SRM
Formal Analysis: DJK
Funding acquisition: DJK, SRM
Investigation: DJK, SRM
Methodology: DJK, SRM
Project administration: DJK, SRM
Resources: DJK, SRM
Software: DJK, SRM
Supervision: DJK, SRM
Validation: DJK, SRM
Visualization: DJK
Writing – original draft: DJK, SRM
Writing – review & editing: DJK, SRM
The authors declare no conflicts of interest.
This work was partially supported by the Army Research Office (grants W911NF-09-1-0092, W911NF-14-1-0361, and W911NF-16-1-0274 to SRM) and through US Army Research Laboratory Cooperative Agreements #W911NF-19-2-0260 & #W911NF-21-2-0179 to DJK and SRM and the National Science Foundation (2022572 to DJK).
Kravitz, D.J. & Mitroff, S.R. (2017). Estimate of a priori power and false discovery rates from thousands of independent replications. [Abstract and Talk]. In Vision Sciences Society, 2017, May.
Data are available at https://www.bigcogsci.com/post-hoc.html.
Received: 2022-03-12
Revisions Requested: 2022-05-06
Revisions Received: 2022-06-06
Accepted: 2022-08-08
Published: 2023-06-22
Plagiarism: Editorial review of the iThenticate reports found no evidence of plagiarism.
References: A citation manager did not identify any references in the RetractionWatch database.
This paper followed a standard single-blind review process. Peer review was managed by Episteme Health Inc in 2022 prior to transfer of the journal to the Center of Trial and Error.
For the benefit of readers, reviewers are asked to write a public summary of their review to highlight the key strengths and weaknesses of the paper. Signing of reviews is optional.
The manuscript uses randomized real data to demonstrate the effects of three questionable research practices (QRPs; retaining pilot data, adding data to achieve significance, and not publishing null results) on false discovery rates. The manuscript also refers to an online tool that helps to comprehend the effects of the three QRPs.
The paper explores an important topic, as a more accurate estimation of the effects of QRPs can provide further evidence to promote transparent research practices. A strength of this paper is that a large real dataset provided the foundation of the synthetic datasets, therefore the researchers did not have to rely on assumptions, such as distributions and distribution parameters.
The manuscript describes a timely and important topic and is well written. The description of the methods and results are thorough, and the visualizations work well to communicate the findings.
No public comments.
The authors empirically demonstrate how several questionable research practices may inflate false discovery rates and thus hinder reproducibility in psychological science. Specifically, they show how FDR can be substantially increased by: retaining pilot data that agrees with a desired effect, stopping data collection when the data agree with the desired effect, not reporting null effects, and combining the three previous practices. They further show that a real effect can be reversed through some of these practices. While many researchers are aware of these questionable practices, it can be difficult to appreciate the practical effect on false discovery rates. The present work raises awareness about the severity of these practices and suggests that a simple method may be helpful for correcting p-values in very simple situations where some of these practices may be difficult to avoid (i.e., due to funding limitations).
Except where otherwise noted, the content of this article is licensed under Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
1. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716. https://doi.org/10.1126/science.aac4716
2. Ioannidis JPA. Why most published research findings are false. PLoS Medicine. 2005;2(8). https://doi.org/10.1371/journal.pmed.0020124
3. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology. Psychological Science. 2011;22(11):1359-1366. https://doi.org/10.1177/0956797611417632
4. Ulrich R, Miller J. Questionable research practices may have little effect on replicability. eLife. 2020;9. https://doi.org/10.7554/eLife.58237
5. Bryan CJ, Yeager DS, O’Brien JM. Replicator degrees of freedom allow publication of misleading failures to replicate. Proceedings of the National Academy of Sciences. 2019;116(51):25535-25545. https://doi.org/10.1073/pnas.1910951116
6. Ellefson MR, Oppenheimer DM. Is replication possible without fidelity? Psychological Methods. Published online 2022. https://doi.org/10.1037/met0000473
7. Fox N, Honeycutt N, Jussim L. Better understanding the population size and stigmatization of psychologists using questionable research practices. Meta-Psychology. 2022;6. https://doi.org/10.15626/mp.2020.2601
8. Andrade C. HARKing, cherry-picking, p-hacking, fishing expeditions, and data dredging and mining as questionable research practices. The Journal of Clinical Psychiatry. 2021;82(1). https://doi.org/10.4088/JCP.20f13804
9. George BJ, Beasley TM, Brown AW, et al. Common scientific and statistical errors in obesity research. Obesity. 2016;24(4):781-790. https://doi.org/10.1002/oby.21449
10. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of p-hacking in science. PLOS Biology. 2015;13(3). https://doi.org/10.1371/journal.pbio.1002106
11. John LK, Loewenstein G, Prelec D. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science. 2012;23(5):524-532. https://doi.org/10.1177/0956797611430953
12. Kriegeskorte N, Simmons WK, Bellgowan PSF, Baker CI. Circular analysis in systems neuroscience: The dangers of double dipping. Nature Neuroscience. 2009;12(5):535-540. https://doi.org/10.1038/nn.2303
13. Vul E, Harris C, Winkielman P, Pashler H. Puzzlingly high correlations in fmri studies of emotion, personality, and social cognition. Perspectives on Psychological Science. 2009;4(3):274-290. https://doi.org/10.1111/j.1745-6924.2009.01125.x
14. Registered reports and replications in attention, perception, & psychophysics. Attention, Perception, & Psychophysics. 2013;75(5):781-783. https://doi.org/10.3758/s13414-013-0502-5
15. Rosenthal R. The file drawer problem and tolerance for null results. Psychological Bulletin. 1979;86(3):638-641. https://doi.org/10.1037/0033-2909.86.3.638
16. Hosseini M, Powell M, Collins J, et al. I tried a bunch of things: The dangers of unexpected overfitting in classification of brain data. Neuroscience & Biobehavioral Reviews. 2020;119:456-467. https://doi.org/10.1016/j.neubiorev.2020.09.036
17. Mitroff SR, Biggs AT, Adamo SH, Dowd EW, Winkle J, Clark K. What can 1 billion trials tell us about visual search? Journal of Experimental Psychology: Human Perception and Performance. 2015;41(1):1-5. https://doi.org/10.1037/xhp0000012
18. Eckstein MP. Visual search: A retrospective. Journal of Vision. 2011;11(5):14-14. https://doi.org/10.1167/11.5.14
19. Nakayama K, Martini P. Situating visual search. Vision Research. 2011;51(13):1526-1537. https://doi.org/10.1016/j.visres.2010.09.003
20. Biggs AT, Adamo SH, Mitroff SR. Rare, but obviously there: Effects of target frequency and salience on visual search accuracy. Acta Psychologica. 2014;152:158-165. https://doi.org/10.1016/j.actpsy.2014.08.005
21. Biggs AT, Adamo SH, Dowd EW, Mitroff SR. Examining perceptual and conceptual set biases in multiple-target visual search. Attention, Perception, & Psychophysics. 2015;77(3):844-855. https://doi.org/10.3758/s13414-014-0822-0
22. Mitroff SR, Biggs AT. The ultra-rare-item effect. Psychological Science. 2014;25(1):284-289. https://doi.org/10.1177/0956797613504221
23. Cain MS, Biggs AT, Darling EF, Mitroff SR. A little bit of history repeating: Splitting up multiple-target visual searches decreases second-target miss errors. Journal of Experimental Psychology: Applied. 2014;20(2):112-125. https://doi.org/10.1037/xap0000014
24. Wolfe JM. What can 1 million trials tell us about visual search? Psychological Science. 1998;9(1):33-39. https://doi.org/10.1111/1467-9280.00006
25. Lee EC, Whitehead AL, Jacques RM, Julious SA. The statistical interpretation of pilot trials: Should significance thresholds be reconsidered? BMC Medical Research Methodology. 2014;14(1). https://doi.org/10.1186/1471-2288-14-41
26. Wagenmakers EJ, Gronau QF, Vandekerckhove J. Five bayesian intuitions for the stopping rule principle. PsyArXiv. Published online 2019. https://doi.org/10.31234/osf.io/5ntkd
27. Kühberger A, Fritz A, Scherndl T. Publication bias in psychology: A diagnosis based on the correlation between effect size and sample size. PLoS ONE. 2014;9(9). https://doi.org/10.1371/journal.pone.0105825
28. Franco A, Malhotra N, Simonovits G. Publication bias in the social sciences: Unlocking the file drawer. Science. 2014;345(6203):1502-1505. https://doi.org/10.1126/science.1255484
29. Kravitz DJ, Baker CI. Toward a new model of scientific publishing: Discussion and a proposal. Frontiers in Computational Neuroscience. 2011;5. https://doi.org/10.3389/fncom.2011.00055
30. Duval S, Tweedie R. Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics. 2000;56(2):455-463. https://doi.org/10.1111/j.0006-341X.2000.00455.x
31. Friese M, Frankenbach J. P-hacking and publication bias interact to distort meta-analytic effect size estimates. Psychological Methods. 2020;25(4):456-471. https://doi.org/10.1037/met0000246
32. Kong X, ENIGMA Laterality Working Group, Francks C. Reproducibility in the absence of selective reporting: An illustration from largescale brain asymmetry research. Human Brain Mapping. 2020;43(1):244-254. https://doi.org/10.1002/hbm.25154
33. Nelson LD, Simmons J, Simonsohn U. Psychology’s renaissance. Annual Review of Psychology. 2018;69(1):511-534. https://doi.org/10.1146/annurev-psych-122216-011836
34. Curtis MJ, Bond RA, Spina D, et al. Experimental design and analysis and their reporting: New guidance for publication in bjp. British Journal of Pharmacology. 2015;172(14):3461-3471. https://doi.org/10.1111/bph.12856
35. Holman L, Head ML, Lanfear R, Jennions MD. Evidence of experimental bias in the life sciences: Why we need blind data recording. PLOS Biology. 2015;13(7). https://doi.org/10.1371/journal.pbio.1002190
36. Sagarin BJ, Ambler JK, Lee EM. An ethical approach to peeking at data. Perspectives on Psychological Science. 2014;9(3):293-304. https://doi.org/10.1177/1745691614528214
37. Simmons JP, Nelson LD, Simonsohn U. A 21 word solution. SSRN Electronic Journal. Published online 2012. https://doi.org/10.2139/ssrn.2160588
38. Dijk D, Manor O, Carey LB. Publication metrics and success on the academic job market. Current Biology. 2014;24(11):R516-R517. https://doi.org/10.1016/j.cub.2014.04.039
39. Ware JJ, Munafò MR. Significance chasing in research practice: Causes, consequences and possible solutions. Addiction. 2015;110(1):4-8. https://doi.org/10.1111/add.12673
40. Smaldino PE, McElreath R. The natural selection of bad science. Royal Society Open Science. 2016;3(9). https://doi.org/10.1098/rsos.160384
41. Szucs D, Ioannidis JPA. Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLOS Biology. 2017;15(3). https://doi.org/10.1371/journal.pbio.2000797
42. Ferguson CJ, Heene M. A vast graveyard of undead theories. Perspectives on Psychological Science. 2012;7(6):555-561. https://doi.org/10.1177/1745691612459059
43. Ioannidis JPA, Munafò MR, Fusar-Poli P, Nosek BA, David SP. Publication and other reporting biases in cognitive sciences: Detection, prevalence, and prevention. Trends in Cognitive Sciences. 2014;18(5):235-241. https://doi.org/10.1016/j.tics.2014.02.010