This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.
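The consistency check described above (recompute the p-value from the reported test statistic and degrees of freedom, then compare it with the reported p-value) can be illustrated with a minimal Python sketch. This is not the statcheck implementation: it handles only two-sided t-tests, integrates the t density numerically instead of calling a statistics library, and uses a fixed tolerance `tol` in place of statcheck's rounding rules. The function names, `tol`, and `alpha` are assumptions made for illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=20000):
    """Two-sided p-value, using Simpson's rule on the density over [0, |t|]."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central)

def check_reported_p(t, df, reported_p, tol=0.005, alpha=0.05):
    """Return (recomputed p, inconsistent?, grossly inconsistent?).

    Assumption: a result is 'inconsistent' when the recomputed and reported
    p-values differ by more than tol, and 'grossly inconsistent' when they
    also fall on opposite sides of alpha.
    """
    p = two_sided_p(t, df)
    inconsistent = abs(p - reported_p) > tol
    gross = inconsistent and ((p < alpha) != (reported_p < alpha))
    return p, inconsistent, gross

# Hypothetical reported result: t(28) = 2.00, p = .04
p, inconsistent, gross = check_reported_p(2.0, 28, 0.04)
print(f"recomputed p = {p:.4f}, inconsistent = {inconsistent}, gross = {gross}")
```

For this hypothetical result the recomputed p-value is roughly .055, so a reported p = .04 would be flagged as inconsistent, and as grossly inconsistent because it changes the conclusion at the .05 level.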
Much has been written regarding p-values below certain thresholds (most notably 0.05) denoting statistical significance and the tendency of such p-values to be more readily publishable in peer-reviewed journals. Intuition suggests that there may be a tendency to manipulate statistical analyses to push a “near significant p-value” to a level that is considered significant. This article presents a method for detecting the presence of such manipulation (herein called “fiddling”) in a distribution of p-values from independent studies. Simulations are used to illustrate the properties of the method. The results suggest that the method has low type I error and that power approaches acceptable levels as the number of p-values being studied approaches 1000.
BACKGROUND: Nutritional epidemiology is a highly prolific field. Debates on associations of nutrients with disease risk are common in the literature and attract attention in public media. OBJECTIVE: We aimed to examine the conclusions, statistical significance, and reproducibility in the literature on associations between specific foods and cancer risk. DESIGN: We selected 50 common ingredients from random recipes in a cookbook. PubMed queries identified recent studies that evaluated the relation of each ingredient to cancer risk. Information regarding author conclusions and relevant effect estimates was extracted. When >10 articles were found, we focused on the 10 most recent articles. RESULTS: Forty ingredients (80%) had articles reporting on their cancer risk. Of 264 single-study assessments, 191 (72%) concluded that the tested food was associated with an increased (n = 103) or a decreased (n = 88) risk; 75% of the risk estimates had weak (0.05 > P ≥ 0.001) or no statistical (P > 0.05) significance. Statistically significant results were more likely than nonsignificant findings to be published in the study abstract than in only the full text (P < 0.0001). Meta-analyses (n = 36) presented more conservative results; only 13 (26%) reported an increased (n = 4) or a decreased (n = 9) risk (6 had more than weak statistical support). The median RRs (IQRs) for studies that concluded an increased or a decreased risk were 2.20 (1.60, 3.44) and 0.52 (0.39, 0.66), respectively. The RRs from the meta-analyses were on average null (median: 0.96; IQR: 0.85, 1.10). CONCLUSIONS: Associations with cancer risk or benefits have been claimed for most food ingredients. Many single studies highlight implausibly large effects, even though evidence is weak. Effect sizes shrink in meta-analyses.
This review identifies 10 common errors and problems in the statistical analysis, design, interpretation, and reporting of obesity research and discusses how they can be avoided. The 10 topics are: 1) misinterpretation of statistical significance, 2) inappropriate testing against baseline values, 3) excessive and undisclosed multiple testing and “P-value hacking,” 4) mishandling of clustering in cluster randomized trials, 5) misconceptions about nonparametric tests, 6) mishandling of missing data, 7) miscalculation of effect sizes, 8) ignoring regression to the mean, 9) ignoring confirmation bias, and 10) insufficient statistical reporting. It is hoped that discussion of these errors can improve the quality of obesity research by helping researchers to implement proper statistical practice and to know when to seek the help of a statistician.
A typical rule that has been used for the endorsement of new medications by the Food and Drug Administration is to have two trials, each convincing on its own, demonstrating effectiveness. “Convincing” may be subjectively interpreted, but the use of p-values and the focus on statistical significance (in particular with p < .05 being coined significant) is pervasive in clinical research. Therefore, in this paper, we calculate with simulations what it means to have exactly two trials, each with p < .05, in terms of the actual strength of evidence quantified by Bayes factors. Our results show that different cases where two trials have a p-value below .05 have wildly differing Bayes factors. Bayes factors of at least 20 in favor of the alternative hypothesis are not necessarily achieved and they fail to be reached in a large proportion of cases, in particular when the true effect size is small (0.2 standard deviations) or zero. In a non-trivial number of cases, evidence actually points to the null hypothesis, in particular when the true effect size is zero, when the number of trials is large, and when the number of participants in both groups is low. We recommend use of Bayes factors as a routine tool to assess endorsement of new medications, because Bayes factors consistently quantify strength of evidence. Use of p-values may lead to paradoxical and spurious decision-making regarding the use of new medications.
In this commentary we consider the validity of tobacco industry-funded research on the effects of standardised packaging in Australia. As the first country to introduce standardised packs, Australia is closely watched, and Philip Morris International has recently funded two studies into the impact of the measure on smoking prevalence. Both of these papers are flawed in conception as well as design but have nonetheless been widely publicised as cautionary tales against standardised pack legislation. Specifically, we focus on the low statistical significance of the analytical methods used and the assumption that standardised packaging should have an immediate large impact on smoking prevalence.
It is known that statistically significant (positive) results are more likely to be published than non-significant (negative) results. However, it has been unclear whether any increasing prevalence of positive results is stronger in the “softer” disciplines (social sciences) than in the “harder” disciplines (physical sciences), and whether the prevalence of negative results is decreasing over time. Using Scopus, we searched the abstracts of papers published between 1990 and 2013, and measured longitudinal trends of multiple expressions of positive versus negative results, including p-values between 0.041 and 0.049 versus p-values between 0.051 and 0.059, textual reporting of “significant difference” versus “no significant difference,” and the reporting of p < 0.05 versus p > 0.05. We found no support for a “hierarchy of sciences” with physical sciences at the top and social sciences at the bottom. However, we found large differences in reporting practices between disciplines, with p-values between 0.041 and 0.049 over 1990-2013 being 65.7 times more prevalent in the biological sciences than in the physical sciences. The p-values near the significance threshold of 0.05 on either side have both increased but with those p-values between 0.041 and 0.049 having increased to a greater extent (2013-to-1990 ratio of the percentage of papers = 10.3) than those between 0.051 and 0.059 (ratio = 3.6). Contradictorily, p < 0.05 has increased more slowly than p > 0.05 (ratios = 1.4 and 4.8, respectively), while the use of “significant difference” has shown only a modest increase compared to “no significant difference” (ratios = 1.5 and 1.1, respectively). We also compared reporting of significance in the United States, Asia, and Europe and found that the results are too inconsistent to draw conclusions on cross-cultural differences in significance reporting. 
We argue that the observed longitudinal trends are caused by negative factors, such as an increase of questionable research practices, but also by positive factors, such as an increase of quantitative research and structured reporting.
Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license (https://github.com/marbl/mash).
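The sketch-then-estimate idea can be illustrated with a small Python example: reduce each sequence to the s smallest hash values of its k-mers (a bottom-s MinHash sketch), estimate the Jaccard index j from two sketches, and convert j to a mutation distance with D = -(1/k) ln(2j / (1 + j)), the formula given in the Mash publication. This is a toy sketch under stated assumptions, not the Mash implementation: SHA-1 from the standard library stands in for the hash function Mash uses, reverse-complement (canonical) k-mers and the P value test are omitted, and all function names are made up for illustration.

```python
import hashlib
import math

def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash64(kmer):
    """64-bit integer hash of a k-mer (SHA-1 prefix; a stand-in hash)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest distinct hash values of the k-mer set."""
    return sorted({hash64(m) for m in kmers(seq, k)})[:s]

def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from two bottom-s sketches.

    Take the s smallest hashes of the merged sketches and count how many
    of them occur in both inputs.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)

def mash_distance(j, k):
    """Mutation distance D = -(1/k) * ln(2j / (1 + j)); 1.0 when j == 0."""
    if j <= 0:
        return 1.0
    return -math.log(2 * j / (1 + j)) / k

# Toy sequences (hypothetical); real inputs are genomes or read sets.
a = "ACGTTGCAAGGCTTAACCGGTTAGCATCGA"
b = "ACGTTGCAAGGCTTAACCGGTTAGCTTCGA"
k, s = 5, 10
j = jaccard_estimate(sketch(a, k, s), sketch(b, k, s), s)
print("Jaccard estimate:", j, "distance:", mash_distance(j, k))
```

Because only the s smallest hashes are kept per sequence, comparing two sketches costs time proportional to s and not to genome length, which is what makes clustering very large collections feasible.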
We studied sexually dimorphic differences in the ilium using geometric morphometric analysis of 10 osteometric landmarks recorded by multislice computed tomography, based on three-dimensional reconstructions of 188 children (95 boys, 93 girls) of mixed origins living in the area of Toulouse, southern France, and ranging in age from 1 to 18 years. We used geometric morphometrics methodology first to test sexual dimorphism in size (centroid size) and shape (Procrustes residuals) and second to examine patterns of shape change with age (development) and size change with age (growth). On the basis of statistical significance testing, the ilium shape became sexually dimorphic at 11 years of age, although visible shape differences were observed as early as 1 year of age. There was no statistically significant difference in size between sexes. Trajectories of shape (development) and size (growth) differed throughout ontogeny and between sexes.
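For readers unfamiliar with the size measure named above, centroid size is the square root of the summed squared distances of all landmarks from their centroid. The Python sketch below computes it for a landmark configuration and rescales the configuration to unit centroid size, which is the translation and scaling part of a Procrustes superimposition. The rotation step is omitted, the landmark coordinates are made up, and the function names are illustrative assumptions; it is not the authors' analysis pipeline.

```python
import math

def centroid(landmarks):
    """Mean position of the landmarks (any number of dimensions)."""
    n = len(landmarks)
    dims = len(landmarks[0])
    return tuple(sum(p[d] for p in landmarks) / n for d in range(dims))

def centroid_size(landmarks):
    """Square root of the summed squared distances from the centroid."""
    c = centroid(landmarks)
    return math.sqrt(sum((p[d] - c[d]) ** 2
                         for p in landmarks for d in range(len(c))))

def center_and_scale(landmarks):
    """Translate to the origin and scale to unit centroid size.

    This removes position and size; removing orientation (rotation)
    would be the remaining Procrustes step and is not done here.
    """
    c = centroid(landmarks)
    size = centroid_size(landmarks)
    return [tuple((p[d] - c[d]) / size for d in range(len(c)))
            for p in landmarks]

# Hypothetical 3D landmarks: corners of a unit square in the z = 0 plane.
square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
print("centroid size:", centroid_size(square))
print("after scaling:", centroid_size(center_and_scale(square)))
```

Separating size (centroid size) from shape (the coordinates left after superimposition) is what allows the two to be tested for sexual dimorphism independently.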
The present study evaluates the cytotoxic and genotoxic potential of pyracarbolid using both the micronucleus (MN) assay in human lymphocytes and the Allium cepa assay in root meristem cells. In the Allium test, the EC50 value was determined in order to select the test concentrations, and the root tips were treated with 25 ppm (EC50/2), 50 ppm (EC50) and 100 ppm (EC50 × 2) concentrations of pyracarbolid. One percent dimethyl sulphoxide (DMSO) and methyl methane sulfonate (MMS) were used as negative and positive controls, respectively. In the micronucleus assay, the cultures were treated with four concentrations (250, 500, 750 and 1000 µg/ml) of pyracarbolid for 24 and 48 h; negative and positive controls were run in parallel. The results showed that the mitotic index (MI) decreased significantly with increasing pyracarbolid concentration at each exposure time. The prophase and metaphase indices also decreased significantly at all concentrations and at each exposure time. The anaphase index decreased as well, and the decrease was statistically significant except at 24 h. A significant increase in MN frequency was observed at all concentrations and in both treatment periods compared with the controls. Pyracarbolid also caused a significant reduction in the cytokinesis block proliferation index (CBPI) at all concentrations and both exposure times.
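The two cytotoxicity indices named above are simple ratios of cell counts. The Python sketch below uses the commonly cited definitions, which the abstract itself does not state and which are therefore assumptions here: MI is the percentage of scored cells that are dividing, and CBPI is (mononucleate + 2 × binucleate + 3 × multinucleate cells) divided by the total number of cells scored. The counts are invented for illustration and are not data from the study.

```python
def mitotic_index(dividing_cells, total_cells):
    """Percentage of scored cells that are in mitosis (assumed definition)."""
    if total_cells <= 0:
        raise ValueError("total_cells must be positive")
    return 100.0 * dividing_cells / total_cells

def cbpi(mononucleate, binucleate, multinucleate):
    """Cytokinesis block proliferation index (assumed standard definition).

    Ranges from 1 (no cell has divided) upward; lower values after
    treatment indicate reduced proliferation.
    """
    total = mononucleate + binucleate + multinucleate
    if total <= 0:
        raise ValueError("at least one cell must be scored")
    return (mononucleate + 2 * binucleate + 3 * multinucleate) / total

# Hypothetical counts, not taken from the study.
print("MI (%):", mitotic_index(120, 1000))
print("CBPI control:", cbpi(300, 600, 100))
print("CBPI treated:", cbpi(700, 280, 20))
```

With these invented counts the treated culture has the lower CBPI, which is the direction of effect the abstract reports for pyracarbolid.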