Concept: Hypothesis testing
This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.
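The kind of consistency check described above can be sketched in a few lines. This is only an illustration, not statcheck's implementation: it handles a two-sided z test only, whereas statcheck also covers t, F, chi-square, and correlation tests, and the function names, the rounding tolerance, and the 0.05 threshold are assumptions made for the example.

```python
from statistics import NormalDist


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def check_reported_p(z: float, reported_p: float, decimals: int = 2,
                     alpha: float = 0.05) -> dict:
    """Compare a reported p-value with the one recomputed from z.

    Assumption: the reported value counts as consistent when the recomputed
    p-value would round to it at `decimals` decimal places. An inconsistency
    is "gross" when it flips the significance decision at `alpha`.
    """
    recomputed = two_sided_p_from_z(z)
    tolerance = 0.5 * 10 ** (-decimals) + 1e-12
    inconsistent = abs(recomputed - reported_p) > tolerance
    gross = inconsistent and ((recomputed < alpha) != (reported_p < alpha))
    return {"recomputed_p": recomputed, "inconsistent": inconsistent, "gross": gross}


if __name__ == "__main__":
    # Hypothetical reported results: (z statistic, reported p-value)
    for z, p in [(1.96, 0.05), (2.50, 0.03), (1.50, 0.04)]:
        print(z, p, check_reported_p(z, p))
```

In this sketch the third hypothetical result is flagged as grossly inconsistent, because the reported p-value is below .05 while the recomputed one is not.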
We have empirically assessed the distribution of published effect sizes and estimated power by analyzing 26,841 statistical records from 3,801 cognitive neuroscience and psychology papers published recently. The reported median effect size was D = 0.93 (interquartile range: 0.64-1.46) for nominally statistically significant results and D = 0.24 (0.11-0.42) for nonsignificant results. Median power to detect small, medium, and large effects was 0.12, 0.44, and 0.73, reflecting no improvement through the past half-century. This is so because sample sizes have remained small. Assuming similar true effect sizes in both disciplines, power was lower in cognitive neuroscience than in psychology. Journal impact factors negatively correlated with power. Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.
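To see how low power and a high prior probability of the null hypothesis combine, the simplest textbook form of the false report probability can be computed directly. This is a sketch under stated assumptions: it ignores bias, the significance level of 0.05 and the prior values are illustrative choices rather than figures from the abstract, and the study itself may use a more elaborate definition. Only the three power values are taken from the text above.

```python
def false_report_probability(alpha: float, power: float, prior_null: float) -> float:
    """P(null is true | result is significant), ignoring bias.

    alpha      : significance level (false positive rate under the null)
    power      : probability of a significant result when the effect is real
    prior_null : assumed prior probability that the null hypothesis is true
    """
    false_positives = alpha * prior_null
    true_positives = power * (1.0 - prior_null)
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # Power values are the medians quoted above; the priors are assumptions.
    for power in (0.12, 0.44, 0.73):
        for prior_null in (0.5, 0.8, 0.9):
            frp = false_report_probability(0.05, power, prior_null)
            print(f"power={power:.2f} prior_null={prior_null:.1f} FRP={frp:.3f}")
```

With power 0.44 and an assumed prior of 0.9 for the null, this simplified formula already gives a value just above one half.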
A typical rule that has been used for the endorsement of new medications by the Food and Drug Administration is to have two trials, each convincing on its own, demonstrating effectiveness. “Convincing” may be subjectively interpreted, but the use of p-values and the focus on statistical significance (in particular with p < .05 being coined significant) is pervasive in clinical research. Therefore, in this paper, we calculate with simulations what it means to have exactly two trials, each with p < .05, in terms of the actual strength of evidence quantified by Bayes factors. Our results show that different cases where two trials have a p-value below .05 have wildly differing Bayes factors. Bayes factors of at least 20 in favor of the alternative hypothesis are not necessarily achieved and they fail to be reached in a large proportion of cases, in particular when the true effect size is small (0.2 standard deviations) or zero. In a non-trivial number of cases, evidence actually points to the null hypothesis, in particular when the true effect size is zero, when the number of trials is large, and when the number of participants in both groups is low. We recommend use of Bayes factors as a routine tool to assess endorsement of new medications, because Bayes factors consistently quantify strength of evidence. Use of p-values may lead to paradoxical and spurious decision-making regarding the use of new medications.
The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p ≤ 0.05) is hardly replicable: at a good statistical power of 80%, two studies will be ‘conflicting’, meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. 
Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
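The statement above that two studies at 80% power will conflict in about one third of cases follows from simple arithmetic, sketched here. The assumptions are that the two studies are independent, that both have the same power, and that the effect is real; the function name is hypothetical.

```python
def prob_conflicting(power: float) -> float:
    """Probability that exactly one of two independent studies is significant,
    given a true effect and the same power in both studies."""
    return 2.0 * power * (1.0 - power)


if __name__ == "__main__":
    power = 0.80
    print("both significant:", round(power * power, 2))
    print("conflicting     :", round(prob_conflicting(power), 2))
    print("both nonsignif. :", round((1.0 - power) ** 2, 2))
```

At 80% power the conflicting case has probability 2 × 0.8 × 0.2 = 0.32, which is the "one third" quoted in the text.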
Multivariate experiments are often analyzed by multistage multiple-comparison procedures (MCPs) that prohibit univariate testing on individual dependent variables if an overall multivariate analysis of variance (MANOVA) test fails to reject the relevant overall null hypothesis. Although the sole function of the MANOVA test in such analyses is to control the overall Type I error rate, it is known that the most popular MANOVA-protected MCPs do not control the maximum familywise error rate (MFWER). In this article, we show that the MFWER associated with standard MANOVA-protected MCPs can be so large that the protection provided by the initial MANOVA test is illusory. We show that the MFWER can be controlled nonconservatively with modified protected MCPs and with single-stage MCPs that allow for the construction of simultaneous confidence intervals on effect sizes. We argue that, given the ease with which these MCPs can be implemented, there is no justification for continued use of the standard procedures. (PsycINFO Database Record © 2013 APA, all rights reserved).
In comparing multiple treatments, 2 error rates that have been studied extensively are the familywise and false discovery rates. Different methods are used to control each of these rates. Yet, it is rare to find studies that compare the same methods on both of these rates, and also on the per-family error rate, the expected number of false rejections. Although the per-family error rate and the familywise error rate are similar in most applications when the latter is controlled at a conventional low level (e.g., .05), the 2 measures can diverge considerably with methods that control the false discovery rate at that same level. Furthermore, we shall consider both rejections of true hypotheses (Type I errors) and rejections of false hypotheses where the observed outcomes are in the incorrect direction (Type III errors). We point out that power estimates based on the number of correct rejections do not consider the pattern of those rejections, which is important in interpreting the total outcome. The present study introduces measures of interpretability based on the pattern of separation of treatments into nonoverlapping sets and compares methods on these measures. In general, range-based (configural) methods are more likely to obtain interpretable patterns based on treatment separation than individual p-value-based measures. Recommendations for practice based on these results are given in the article. Although the article is complex, these recommendations can be understood without the necessity for detailed perusal of the supporting material.
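The difference between controlling the familywise error rate and controlling the false discovery rate can be illustrated with two standard p-value-based procedures. The abstract does not name the methods it compares, so Holm's step-down procedure and the Benjamini-Hochberg step-up procedure are used here only as familiar representatives, and the p-values are made up for the example.

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down procedure (controls the familywise error rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank = 0 for the smallest p-value
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first p-value that fails its threshold
    return reject


def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_passing_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            largest_passing_rank = rank
    reject = [False] * m
    for i in order[:largest_passing_rank]:
        reject[i] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values from eight comparisons
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
    print("Holm rejections:", sum(holm_reject(pvals)))
    print("BH rejections  :", sum(benjamini_hochberg_reject(pvals)))
```

On these made-up values the false discovery rate procedure rejects more hypotheses than the familywise procedure at the same nominal level, which is the kind of divergence the abstract refers to.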
We present a suite of Bayes factor hypothesis tests that allow researchers to grade the decisiveness of the evidence that the data provide for the presence versus the absence of a correlation between two variables. For concreteness, we apply our methods to the recent work of Donnellan et al. (in press) who conducted nine replication studies with over 3,000 participants and failed to replicate the phenomenon that lonely people compensate for a lack of social warmth by taking warmer baths or showers. We show how the Bayes factor hypothesis test can quantify evidence in favor of the null hypothesis, and how the prior specification for the correlation coefficient can be used to define a broad range of tests that address complementary questions. Specifically, we show how the prior specification can be adjusted to create a two-sided test, a one-sided test, a sensitivity analysis, and a replication test.
Measuring a change in the existence of disease symptoms before and after a treatment is examined for statistical significance by means of the McNemar test. When two treatment groups of patients are compared, Feuer and Kessler (1989) proposed a two-sample McNemar test. In this paper, we show that this test usually inflates the type I error in the hypothesis testing, and propose a new two-sample McNemar test that is superior in terms of preserving type I error. We also make the connection between the two-sample McNemar test and the test statistic for the equal residual effects in a 2 × 2 crossover design. The limitations of the two-sample McNemar test are also discussed.
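For orientation, the classical one-sample McNemar test that the two-sample versions build on can be sketched as follows. This is not the two-sample test of Feuer and Kessler or the new test proposed in the paper; it is the standard chi-square form without continuity correction, and the counts in the example are hypothetical.

```python
import math


def mcnemar_test(b: int, c: int):
    """Classical McNemar test on the two discordant counts of a paired 2 x 2 table.

    b : pairs that changed from symptom present to symptom absent
    c : pairs that changed from symptom absent to symptom present
    Returns the chi-square statistic (1 degree of freedom, no continuity
    correction) and its p-value.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    stat, p = mcnemar_test(b=15, c=5)  # hypothetical discordant counts
    print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```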
The p value obtained from a significance test provides no information about the magnitude or importance of the underlying phenomenon. Therefore, additional reporting of effect size is often recommended. Effect sizes are theoretically independent from sample size. Yet this may not hold true empirically: non-independence could indicate publication bias.
We introduce a new Empirical Bayes approach for large-scale hypothesis testing, including estimating false discovery rates (FDRs) and effect sizes. This approach has two key differences from existing approaches to FDR analysis. First, it assumes that the distribution of the actual (unobserved) effects is unimodal, with a mode at 0. This “unimodal assumption” (UA), although natural in many contexts, is not usually incorporated into standard FDR analysis, and we demonstrate how incorporating it brings many benefits. Specifically, the UA facilitates efficient and robust computation (estimating the unimodal distribution involves solving a simple convex optimization problem) and enables more accurate inferences provided that it holds. Second, the method takes as its input two numbers for each test (an effect size estimate and corresponding standard error), rather than the one number usually used (p value or z score). When available, using two numbers instead of one helps account for variation in measurement precision across tests. It also facilitates estimation of effects, and unlike standard FDR methods, our approach provides interval estimates (credible regions) for each effect in addition to measures of significance. To provide a bridge between interval estimates and significance measures, we introduce the term “local false sign rate” to refer to the probability of getting the sign of an effect wrong and argue that it is a better measure of significance than the local FDR because it is both more generally applicable and can be more robustly estimated. Our methods are implemented in an R package ashr available from http://github.com/stephens999/ashr.
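The idea of a local false sign rate can be illustrated with a deliberately simplified model. This is not the ashr method: ashr estimates a unimodal prior from the data, whereas the sketch below assumes a fixed normal prior centred at 0 with a hypothetical standard deviation, chosen only because the posterior then has a closed form. The function name and all numbers are assumptions.

```python
import math
from statistics import NormalDist


def lfsr_normal_prior(betahat: float, se: float, prior_sd: float) -> float:
    """Local false sign rate under a N(0, prior_sd^2) prior on the effect.

    betahat : effect size estimate
    se      : its standard error
    Returns min(P(effect <= 0 | data), P(effect >= 0 | data)).
    """
    post_var = 1.0 / (1.0 / prior_sd ** 2 + 1.0 / se ** 2)
    post_mean = post_var * betahat / se ** 2
    posterior = NormalDist(post_mean, math.sqrt(post_var))
    prob_nonpositive = posterior.cdf(0.0)
    return min(prob_nonpositive, 1.0 - prob_nonpositive)


if __name__ == "__main__":
    # Two hypothetical tests with the same z score (2.0) but different precision
    print(lfsr_normal_prior(betahat=2.0, se=1.0, prior_sd=1.0))
    print(lfsr_normal_prior(betahat=0.2, se=0.1, prior_sd=1.0))
```

The two hypothetical tests share the same z score but receive different local false sign rates, which shows why supplying the estimate and its standard error separately carries more information than a single z score.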