This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.
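The kind of consistency check described above can be illustrated with a minimal Python sketch. It is not statcheck's implementation (statcheck is an R package). The function names, the restriction to a two-sided z statistic, and the simple rounding rule are all assumptions made for illustration. In particular, the sketch ignores rounding of the reported test statistic.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal (z) test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def check_reported_p(z, reported_p, decimals=2, alpha=0.05):
    """Compare a reported p-value with the one recomputed from z.

    Hypothetical helper: returns (recomputed_p, inconsistent, gross).
    'inconsistent' means the recomputed value does not round to the
    reported value; 'gross' means the inconsistency also changes
    which side of alpha the p-value falls on.
    """
    recomputed = two_sided_p_from_z(z)
    inconsistent = round(recomputed, decimals) != round(reported_p, decimals)
    gross = inconsistent and ((recomputed < alpha) != (reported_p < alpha))
    return recomputed, inconsistent, gross

# Reported "z = 1.50, p = .03": the recomputed p is about .13,
# so the result is flagged as a gross inconsistency.
print(check_reported_p(1.50, 0.03))
```

A result such as "z = 2.50, p = .03" would be flagged as inconsistent but not grossly so, because both the reported and the recomputed p-value fall below .05.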
In recent years, researchers have attempted to provide an indication of the prevalence of inflated Type 1 error rates by analyzing the distribution of p-values in the published literature. De Winter & Dodou (2015) analyzed the distribution (and its change over time) of a large number of p-values automatically extracted from abstracts in the scientific literature. They concluded there is a ‘surge of p-values between 0.041-0.049 in recent decades’ which ‘suggests (but does not prove) questionable research practices have increased over the past 25 years.’ I show the changes in the ratio of fractions of p-values between 0.041-0.049 over the years are better explained by assuming the average power has decreased over time. Furthermore, I propose that their observation that p-values just below 0.05 increase more strongly than p-values above 0.05 can be explained by an increase in publication bias (or the file drawer effect) over the years (cf. Fanelli, 2012; Pautasso, 2010), which has led to a relative decrease of ‘marginally significant’ p-values in abstracts in the literature (instead of an increase in p-values just below 0.05). I explain why researchers analyzing large numbers of p-values need to relate their assumptions to a model of p-value distributions that takes into account the average power of the performed studies, the ratio of true positives to false positives in the literature, the effects of publication bias, and the Type 1 error rate (and possible mechanisms through which it has inflated). Finally, I discuss why publication bias and underpowered studies might be a bigger problem for science than inflated Type 1 error rates, and explain the challenges when attempting to draw conclusions about inflated Type 1 error rates from a large heterogeneous set of p-values.
The Friedman rank sum test is a widely-used nonparametric method in computational biology. In addition to examining the overall null hypothesis of no significant difference among any of the rank sums, it is typically of interest to conduct pairwise comparison tests. Current approaches to such tests rely on large-sample approximations, due to the numerical complexity of computing the exact distribution. These approximate methods lead to inaccurate estimates in the tail of the distribution, which is most relevant for p-value calculation.
Hypothesis weighting improves the power of large-scale multiple testing. We describe independent hypothesis weighting (IHW), a method that assigns weights using covariates independent of the P-values under the null hypothesis but informative of each test’s power or prior probability of the null hypothesis (http://www.bioconductor.org/packages/IHW). IHW increases power while controlling the false discovery rate and is a practical approach to discovering associations in genomics, high-throughput biology and other large data sets.
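To make the idea of hypothesis weighting concrete, the sketch below applies a weighted Benjamini-Hochberg procedure to p-values with weights that are supplied by hand. It is not the IHW method itself, which derives the weights from covariates. The function name `weighted_bh` and the example weights are assumptions for illustration, and the weights are taken to have been fixed without reference to the p-values.

```python
def weighted_bh(p_values, weights, alpha=0.05):
    """Weighted Benjamini-Hochberg: run BH on p_i / w_i.

    Weights are rescaled to average 1. Returns a list of booleans
    (True = rejected) in the original order.
    """
    m = len(p_values)
    mean_w = sum(weights) / m
    w = [wi / mean_w for wi in weights]
    # A weight of zero means the hypothesis can never be rejected.
    q = [min(1.0, p / wi) if wi > 0 else 1.0 for p, wi in zip(p_values, w)]
    order = sorted(range(m), key=lambda i: q[i])
    k = 0  # largest rank whose weighted p-value passes the BH threshold
    for rank, i in enumerate(order, start=1):
        if q[i] <= alpha * rank / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

p = [0.03, 0.03, 0.6, 0.7]
print(weighted_bh(p, [1, 1, 1, 1]))          # equal weights: plain BH
print(weighted_bh(p, [1.8, 1.8, 0.2, 0.2]))  # up-weight the first two tests
```

In this toy example plain BH rejects nothing, while shifting weight toward the first two hypotheses allows both to be rejected at the same nominal level.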
We introduce a publication policy that incorporates “conditional equivalence testing” (CET), a two-stage testing scheme in which standard NHST is followed conditionally by testing for equivalence. The idea of CET is carefully considered as it has the potential to address recent concerns about reproducibility and the limited publication of null results. In this paper we detail the implementation of CET, investigate similarities with a Bayesian testing scheme, and outline the basis for how a scientific journal could proceed to reduce publication bias while remaining relevant.
Testing many null hypotheses in a single study results in an increased probability of detecting a significant finding just by chance (the problem of multiplicity). Debates have raged over many years with regard to whether to correct for multiplicity and, if so, how it should be done. This article first discusses how multiple tests lead to an inflation of the α level, then explores the following different contexts in which multiplicity arises: testing for baseline differences in various types of studies, having >1 outcome variable, conducting statistical tests that produce >1 P value, taking multiple “peeks” at the data, and unplanned, post hoc analyses (i.e., “data dredging,” “fishing expeditions,” or “P-hacking”). It then discusses some of the methods that have been proposed for correcting for multiplicity, including single-step procedures (e.g., Bonferroni); multistep procedures, such as those of Holm, Hochberg, and Šidák; false discovery rate control; and resampling approaches. Note that these various approaches describe different aspects and are not necessarily mutually exclusive. For example, resampling methods could be used to control the false discovery rate or the family-wise error rate (as defined later in this article). However, the use of one of these approaches presupposes that we should correct for multiplicity, which is not universally accepted, and the article presents the arguments for and against such “correction.” The final section brings together these threads and presents suggestions with regard to when it makes sense to apply the corrections and how to do so.
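The inflation of the α level and two of the corrections named above can be sketched in a few lines of Python. The function names are illustrative, and the family-wise error rate formula assumes that the tests are independent.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive in m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni(p_values, alpha=0.05):
    """Single-step correction: compare every p-value with alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down correction.

    Sort the p-values, compare the smallest with alpha / m, the next
    with alpha / (m - 1), and so on; stop at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

p = [0.01, 0.04, 0.02, 0.005]
print(round(familywise_error_rate(0.05, 10), 3))  # about 0.401
print(bonferroni(p))  # [True, False, False, True]
print(holm(p))        # [True, True, True, True]
```

With ten independent tests at α = .05, the chance of at least one false positive is roughly 40%. In the example, Holm rejects all four hypotheses while Bonferroni rejects only two, which reflects the fact that the step-down procedure is never less powerful than the single-step one.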
The theory has been put forward that if a null hypothesis is true, P-values should follow a Uniform distribution. This can be used to check the validity of randomisation.
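A small simulation can illustrate this claim. The sketch below is an assumption-laden example, not a method taken from the source. It compares a baseline variable between two randomised arms using a two-sided z-test with known unit variance, repeats this many times under a true null hypothesis, and measures how far the resulting p-values are from the Uniform(0, 1) distribution using a Kolmogorov-Smirnov distance. All names and settings are hypothetical.

```python
import math
import random

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_null_p_values(n_trials=2000, n_per_arm=30, seed=1):
    """Baseline comparisons in trials where randomisation is valid."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_trials):
        arm_a = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        arm_b = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        diff = sum(arm_a) / n_per_arm - sum(arm_b) / n_per_arm
        z = diff / math.sqrt(2 / n_per_arm)  # known variance of 1 per arm
        p_values.append(two_sided_p_from_z(z))
    return p_values

def ks_distance_from_uniform(p_values):
    """Largest gap between the empirical CDF and the Uniform(0, 1) CDF."""
    xs = sorted(p_values)
    n = len(xs)
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)

p_values = simulate_null_p_values()
print("KS distance from uniform:", ks_distance_from_uniform(p_values))
print("share below .05:", sum(p < 0.05 for p in p_values) / len(p_values))
```

Under a true null the distance should be small and roughly 5% of the p-values should fall below .05. A marked departure from uniformity in a set of baseline p-values would be a reason to look more closely at the randomisation.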
The p-value is currently one of the key elements for testing statistical hypotheses, despite its critics. Bayesian statistics and Bayes Factors have been proposed as alternatives to improve scientific decision making when testing a hypothesis. This study compares the performance of two Bayes Factor estimations (the BIC-based Bayes Factor and the Vovk-Sellke p-value calibration) with the p-value when the null hypothesis holds.
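For orientation, the two quantities named above are commonly computed as in the following Python sketch. The formulas are the standard textbook forms: the Vovk-Sellke bound 1 / (-e p ln p) for p < 1/e, and the BIC approximation BF01 ≈ exp((BIC_alt - BIC_null) / 2). Whether the study used exactly these forms is an assumption, and the function names are illustrative.

```python
import math

def vovk_sellke_max_ratio(p):
    """Upper bound on the odds in favour of H1 implied by a p-value."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if p >= 1 / math.e:
        return 1.0  # the bound gives no evidence against H0 here
    return 1.0 / (-math.e * p * math.log(p))

def bic_bayes_factor_01(bic_null, bic_alt):
    """Approximate Bayes factor in favour of H0 from two BIC values."""
    return math.exp((bic_alt - bic_null) / 2)

print(round(vovk_sellke_max_ratio(0.05), 2))     # about 2.46
print(round(bic_bayes_factor_01(100.0, 104.0), 2))  # about 7.39
```

The first value shows that p = .05 corresponds to odds of at most about 2.5 to 1 in favour of the alternative, which is weaker evidence than the p-value alone may suggest.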
The medial femoral condyle (MFC) flap has become a popular choice for treatment of small bony defects. We aim to describe outcomes after MFC flap treatment of upper and lower extremity osseous defects and test the null hypothesis that no factors influence risks for nonunion, increased time to union, and complications.
Zhan et al. presented a kernel RV coefficient (KRV) test to evaluate the overall association between host gene expression and microbiome composition, and showed its competitive performance compared to existing methods. In this article, we clarify the close relation of KRV to the existing generalized RV (GRV) coefficient, and show that KRV and GRV have very similar performance. Although the KRV test could control the type I error rate well at the 1% and 5% levels, we show that it could largely underestimate p-values at small significance levels, leading to significantly inflated type I errors. As a partial remedy, we propose an alternative p-value calculation, which is efficient and more accurate than the KRV p-value at small significance levels. We recommend that small KRV test p-values should always be accompanied and verified by the permutation p-value in practice. In addition, we analytically show that KRV can be written as a form of correlation coefficient, which can dramatically expedite its computation and make permutation p-value calculation more efficient.
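The recommendation to verify small p-values by permutation can be illustrated with a generic sketch. It does not compute the KRV or GRV statistic. A Pearson correlation between two vectors stands in for the association measure, and all names and defaults are assumptions for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, statistic, n_perm=999, seed=0):
    """One-sided permutation p-value for an association statistic.

    The labels of y are shuffled to break any association with x.
    Adding 1 to numerator and denominator keeps the p-value above zero.
    """
    rng = random.Random(seed)
    observed = statistic(x, y)
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if statistic(x, y_perm) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

x = list(range(10))
y = [2 * v + 1 for v in x]  # perfectly associated toy data
print(permutation_p_value(x, y, pearson))
```

The smallest p-value this procedure can return is 1 / (n_perm + 1), so checking a result at a very small significance level requires a correspondingly large number of permutations. This is one reason a fast way of evaluating the statistic matters in practice.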