Erroneous analyses of interactions in neuroscience: a problem of significance

Sander Nieuwenhuis1,2, Birte U Forstmann3 & Eric-Jan Wagenmakers3
1Department of Psychology, Leiden University, Leiden, The Netherlands. 2Leiden Institute for Brain and Cognition, Leiden, The Netherlands. 3Cognitive Science Center Amsterdam, University of Amsterdam, Amsterdam, The Netherlands. Correspondence should be addressed to S.N.

In theory, a comparison of two experimental effects requires a statistical test on their difference. In practice, this comparison is often based on an incorrect procedure involving two separate tests in which researchers conclude that effects differ when one effect is significant (P < 0.05) but the other is not (P > 0.05). We reviewed 513 behavioral, systems and cognitive neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience) and found that 78 used the correct procedure and 79 used the incorrect procedure. An additional analysis suggests that incorrect analyses of interactions are even more common in cellular and molecular neuroscience. We discuss scenarios in which the erroneous procedure is particularly beguiling.

“The percentage of neurons showing cue-related activity increased with training in the mutant mice (P < 0.05), but not in the control mice (P > 0.05).” “Animals receiving vehicle (control) infusions into the amygdala showed increased freezing to the conditioned stimulus compared with a control stimulus (P < 0.01); in animals receiving muscimol infusions into the amygdala, this difference […].”

These two fictive, but representative, statements illustrate a statistical error that is common in the neuroscience literature. The researchers who made these statements wanted to claim that one effect (for example, the training effect on neuronal activity in mutant mice) was larger or smaller than the other effect (the training effect in control mice). To support this claim, they needed to report a statistically significant interaction (between amount of training and type of mice), but instead they reported that one effect was statistically significant, whereas the other effect was not. Although superficially compelling, the latter type of statistical reasoning is erroneous because the difference between significant and not significant need not itself be statistically significant. Consider an extreme scenario in which training-induced activity barely reaches significance in mutant mice (for example, P = 0.049) and barely fails to reach significance for control mice (for example, P = 0.051). Despite the fact that these two P values lie on opposite sides of 0.05, one cannot conclude that the training effect for mutant mice differs statistically from that for control mice. That is, as famously noted by Rosnow and Rosenthal, “surely, God loves the 0.06 nearly as much as the 0.05”. Thus, when making a comparison between two effects, researchers should report the statistical significance of their difference rather than the difference between their significance levels.

Our impression was that this error of comparing significance levels is widespread in the neuroscience literature, but until now there were no aggregate data to support this impression. We therefore examined all of the behavioral, systems and cognitive neuroscience studies published in four prestigious journals (Nature, Science, Nature Neuroscience and Neuron) in 2009 and 2010 and in every fourth issue of the 2009 and 2010 volumes of The Journal of Neuroscience. In 157 of these 513 articles (31%), the authors describe at least one situation in which they might be tempted to make the error. In 50% of these cases (78 articles; Table 1), the authors used the correct approach: they reported a significant interaction. This may be followed by the report of the simple main effects (that is, separate analyses for the main effect of training in the mutant mice and control mice). In the other 50% of the cases (79 articles), the authors made at least one error of the type discussed here: they reported no interaction effect, but only the simple main effects, pointing out the qualitative difference between their significance values (for example, vehicle infusions were associated with a statistically significant increase in freezing behavior; muscimol infusions were not associated with a reliable increase in freezing behavior).

Are all these articles wrong about their main conclusions? We do not think so. First, we counted any paper containing at least one erroneous analysis of an interaction. For a given paper, the main conclusions may not depend on the erroneous analysis. Second, in roughly one third of the error cases, we were convinced that the critical, but missing, interaction effect would have been statistically significant (consistent with the researchers’ claim), either because there was an enormous difference between the two effect sizes or because the reported methodological information allowed us to determine the approximate significance level. Nonetheless, in roughly two thirds of the error cases, the error may have had serious consequences. In all of these cases, the nonsignificant difference, although smaller in size, was in the same direction as the significant difference. In addition, the methodological information did not allow us to determine the significance level of the missing interaction test. We have no way of assessing the severity of these cases. Most of the errors may not have severe implications. In some cases, however, the error may contribute substantially to the article’s main conclusions.
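The extreme scenario described above (P = 0.049 in one group, P = 0.051 in the other) can be made concrete with a small numerical sketch. It is an illustration only, not the procedure used in our literature analysis. It assumes two independent groups, a normal approximation for each effect estimate, and equal standard errors in both groups. The function names and the standard error of 1.0 are hypothetical choices made for this example.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def two_sided_p(z: float) -> float:
    """Two-sided P value for a z statistic under the normal approximation."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def z_from_two_sided_p(p: float) -> float:
    """Positive z statistic that corresponds to a given two-sided P value."""
    return STD_NORMAL.inv_cdf(1.0 - p / 2.0)


def p_for_difference(effect_a: float, se_a: float,
                     effect_b: float, se_b: float) -> float:
    """Test the difference between two independent effect estimates.

    The standard error of the difference is sqrt(se_a^2 + se_b^2).
    """
    z_diff = (effect_a - effect_b) / sqrt(se_a ** 2 + se_b ** 2)
    return two_sided_p(z_diff)


# Assumption for illustration: both effects have the same standard error.
se = 1.0
effect_mutant = z_from_two_sided_p(0.049) * se   # "significant" effect
effect_control = z_from_two_sided_p(0.051) * se  # "nonsignificant" effect

p_interaction = p_for_difference(effect_mutant, se, effect_control, se)

print(f"P (mutant effect)  = {two_sided_p(effect_mutant / se):.3f}")
print(f"P (control effect) = {two_sided_p(effect_control / se):.3f}")
print(f"P (difference)     = {p_interaction:.3f}")
```

Under these assumptions the two effects are almost identical in size, so the P value for their difference is far above 0.05, even though the individual P values fall on opposite sides of that threshold. For designs with repeated measurements on the same subjects, the independence assumption does not hold and the interaction should be tested within the appropriate analysis of variance instead.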
Because of our background expertise, our main analysis focused on behavioral, systems and cognitive neuroscience. However, it is likely that the incorrect analysis of interactions is not just limited to these disciplines. To confirm this intuition, we reviewed an additional 120 cellular and molecular neuroscience articles published in Nature Neuroscience in 2009 and 2010 (the first five Articles in each issue). We did not find a single study that used the correct statistical procedure to compare effect sizes. In contrast, we found at least 25 studies that used the erroneous procedure and explicitly or implicitly compared significance levels. In general, data collected in these cellular and molecular neuroscience studies were analyzed mostly with t tests (possibly corrected for multiple comparisons or unequal variances) and occasionally with one-way ANOVAs, even when the experimental design was multifactorial and required a more sophisticated analysis.

Table 1 Outcome of the main literature analysis
For this analysis, we included every article of which the abstract referred to behavior, cognitive function or […]

Our literature analyses showed that the error occurs in many different situations: when researchers compared the effects of a pharmacological agent versus placebo; patients versus controls; one versus another task condition, brain area or time point; genetically modified versus wild-type animals; younger versus older participants; etc. We describe three general types of situations in which the error occurs and illustrate each with a prototypical (fictive) example.

First, most of the errors that we encountered in our analysis occurred when comparing effect sizes in an experimental group/condition and a control group/condition (for example, sham-TMS, vehicle infusion, placebo pill, wild-type mice). The two examples at the start of this article belong to this type. Another example would be “Optogenetic photoinhibition of the locus coeruleus decreased the amplitude of the target-evoked P3 potential in virally transduced animals (P = 0.012), but not in control animals (P = 0.3)” (Fig. 1a). The researchers contrast the significance levels of the two effect sizes instead of reporting the significance level of a direct statistical comparison between the effect sizes. The claim that the effect of the optogenetic manipulation on P3 amplitude is larger in the virally transduced animals than in the control animals requires a significant interaction between the manipulation (photoinhibition versus baseline) and group (virally transduced versus control mice). Because the plotted results reflect the group averages of individual averages that we generated ourselves (for ten mice in each group), we know that the interaction in this example is not significant (P > 0.05). Thus, the claim that the researchers intend to make is not […]

Second, comparing effect sizes during a pre-test and a post-test can be seen as a special case of the situation described above, in which the pre-test (before the experimental manipulation) is the control condition and the post-test (after the manipulation) is the experimental condition. An example is “Acute fluoxetine treatment increased social approach behavior (as indexed by sniff time) in our mouse model of depression (P < 0.01)” (Fig. 1b). Errors of this type are less common and often less explicit. In this example, the researchers contrast only the post-test scores of the two groups, on the tacit assumption that they need not take into account the corresponding pre-test scores, perhaps because the pre-test scores do not reliably differ between groups. Thus, the researchers implicitly base their claim on the difference between the significant post-test difference and the nonsignificant pre-test difference, when instead they should have directly compared the effect sizes, for example, by examining the time × group interaction in a repeated-measures analysis of variance.

The third type of error occurs when comparing several brain areas and claiming that a particular effect (property) is specific for one of these brain areas. In this type of situation, researchers do not compare a designated region of interest with a control area, but instead compare a number of brain areas with more or less equal ‘a priori status’. An example would be “Escape latency in the Morris water maze was affected by lesions of the entorhinal cortex (P < 0.05), but was spared by lesions of the perirhinal and postrhinal cortices (both P values > 0.1), pointing to a specific role for the entorhinal cortex in spatial memory” (Fig. 1c). Although this type of conclusion is less salient than the explicit claim of a difference between brain areas, the specificity claim nevertheless requires a direct statistical comparison. That is, at the very least, spatial memory should be more impaired in animals with entorhinal lesions than in animals with lesions in other areas. Thus, the specificity claim requires that the researchers report a significant time × lesion type interaction, followed by significant pair-wise comparisons between the specific brain area and the other brain areas.

Figure 1 Graphs illustrating the various types of situations in which the error of comparing significance levels occurs. (a) Comparing effect sizes in an experimental group/condition and a control group/condition. (b) Comparing effect sizes during a pre-test and a post-test. (c) Comparing several brain areas and claiming that a particular effect (property) is specific for one of these brain areas. (d) Data presented in a, after taking the difference of the two repeated measures (photoinhibition and baseline). Error bars indicate s.e.m.; ns, nonsignificant (P > 0.05), *P < 0.05, **P < 0.01.

These three examples involve errors that we would classify as being potentially serious, as the nonsignificant effect is in the same direction as the significant effect (except for the perirhinal cortex), and because the information in Figure 1a–c is not sufficient to estimate the significance of the missing interaction test. The reason is that each of these three graphs contains repeated measurements (for example, before and after treatment). In the case of repeated measurements on the same group(s) of subjects, the standard-error bars do not give the information needed to assess the significance of the differences between the repeated measurements, as they are not sensitive to the correlations between these measurements. Standard-error bars can only be used to assess the significance of between-group differences. Thus, the reader can only judge whether an interaction would be significant if the means and standard errors reflect the difference between repeated measurements (as in Fig. 1d, which is based on the same data as Fig. 1a). Thus, unlike Fig. 1a, we can use Fig. 1d to estimate the significance of the interaction by comparing the size of the gap (or in other situations the degree of overlap) between the […]

We have discussed errors that occur when researchers compare experimental effects. However, in our analysis, we found that the error also occurs when researchers compare correlations. A fictive example would be “Hippocampal firing synchrony correlated with memory performance in the placebo condition (r = 0.43, P = 0.01), but not in the drug condition (r = 0.19, P = 0.21)”. When making a comparison between two correlations, researchers should directly contrast the two correlations using an appropriate statistical method.

As noted by others, the error of comparing significance levels is especially common in the neuroimaging literature, in which results are typically presented in color-coded statistical maps indicating the significance level of a particular contrast for each (visible) voxel. A visual comparison between maps for two groups might tempt the researcher to state, for example, that “the hippocampus was significantly activated in younger adults, but not in older adults”. However, the implied claim is that the hippocampus is activated more strongly in younger adults than in older adults, and such a claim requires a direct statistical comparison of the effects. Similarly, claims about differences in activation across brain regions must be supported by a significant interaction between brain region and the factor underlying the contrast of interest. For example, “Compared to non-moral dilemmas, the moral dilemmas activated only the insular cortex, suggesting that this area is uniquely involved in making moral judgments”. Identification of the significant response in the insular cortex does not imply that this region is uniquely or more strongly involved in making moral judgments than other regions. It merely implies that, although the null hypothesis has been rejected in this region, it has […]

It is interesting that this statistical error occurs so often, even in journals of the highest standard. Space constraints and the need for simplicity may be the reasons why the error occurs in journals such as Nature and Science. Reporting interactions in an analysis of variance design may seem overly complex when one is writing for a general readership. Perhaps, in some cases, researchers choose to report the difference between significance levels because the corresponding interaction effect is not significant. Peer reviewers should help authors avoid such mistakes. The statistical error may also be a manifestation of the cliff effect, the phenomenon that many people’s confidence in a result drops abruptly when a P value increases just beyond the 0.05 level. Indeed, people are generally tempted to attribute too much meaning to the difference between significant and not significant. For this reason, the use of confidence intervals may help prevent researchers from making this statistical error. Whatever the reasons for the error, its ubiquity and potential effect suggest that researchers and reviewers should be more aware that the difference between significant and not significant is not itself necessarily significant.

AUTHOR CONTRIBUTIONS
S.N. conceived the project and made the figure. S.N., B.U.F. and E.-J.W. conducted the literature analyses and wrote the paper.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

1. Gelman, A. & Stern, H. The difference between “significant” and “not significant” is not itself statistically significant. Am. Stat. 60, 328–331 (2006).
2. Rosnow, R.J. & Rosenthal, R. Statistical procedures and the justification of knowledge in psychological science. Am. Psychol. 44, 1276–1284 (1989).
3. Loftus, G.R. & Masson, M.E.J. Using confidence intervals in within-subject designs. Psychon. Bull. Rev. 1, 476–490 (1994).
4. Cumming, G., Fidler, F. & Vaux, D.L. Error bars in experimental biology. J. Cell Biol. 177, […]
5. Henson, R. What can functional neuroimaging tell the experimental psychologist? Q. J. Exp. Psychol. A 58, 193–233 (2005).
6. Poldrack, R.A. et al. Guidelines for reporting an fMRI study. Neuroimage 40, […]
7. Rosenthal, R. & Gaito, J. The interpretation of levels of significance by psychological researchers. J. Psychol. 55, 33–38 (1963).
nature neuroscience VOLUME 14 | NUMBER 9 | SEPTEMBER 2011

