Concept: Statistical hypothesis testing
- Proceedings of the National Academy of Sciences of the United States of America
- Published over 4 years ago
Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggest that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25-50:1, and to 100-200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.
This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.
What are the statistical practices of articles published in journals with a high impact factor? Are there differences compared with articles published in journals with a somewhat lower impact factor that have adopted editorial policies to reduce the impact of limitations of Null Hypothesis Significance Testing? To investigate these questions, the current study analyzed all articles related to psychological, neuropsychological and medical issues, published in 2011 in four journals with high impact factors: Science, Nature, The New England Journal of Medicine and The Lancet, and three journals with relatively lower impact factors: Neuropsychology, Journal of Experimental Psychology-Applied and the American Journal of Public Health. Results show that Null Hypothesis Significance Testing without any use of confidence intervals, effect size, prospective power and model estimation, is the prevalent statistical practice used in articles published in Nature, 89%, followed by articles published in Science, 42%. By contrast, in all other journals, both with high and lower impact factors, most articles report confidence intervals and/or effect size measures. We interpreted these differences as consequences of the editorial policies adopted by the journal editors, which are probably the most effective means to improve the statistical practices in journals with high or low impact factors.
Much has been written regarding p-values below certain thresholds (most notably 0.05) denoting statistical significance and the tendency of such p-values to be more readily publishable in peer-reviewed journals. Intuition suggests that there may be a tendency to manipulate statistical analyses to push a “near significant p-value” to a level that is considered significant. This article presents a method for detecting the presence of such manipulation (herein called “fiddling”) in a distribution of p-values from independent studies. Simulations are used to illustrate the properties of the method. The results suggest that the method has low type I error and that power approaches acceptable levels as the number of p-values being studied approaches 1000.
Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
We have empirically assessed the distribution of published effect sizes and estimated power by analyzing 26,841 statistical records from 3,801 cognitive neuroscience and psychology papers published recently. The reported median effect size was D = 0.93 (interquartile range: 0.64-1.46) for nominally statistically significant results and D = 0.24 (0.11-0.42) for nonsignificant results. Median power to detect small, medium, and large effects was 0.12, 0.44, and 0.73, reflecting no improvement through the past half-century. This is so because sample sizes have remained small. Assuming similar true effect sizes in both disciplines, power was lower in cognitive neuroscience than in psychology. Journal impact factors negatively correlated with power. Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.
Data analysis is used to test the hypothesis that “hitting is contagious”. A statistical model is described to study the effect of a hot hitter upon his teammates' batting during a consecutive game hitting streak. Box score data for entire seasons comprising [Formula: see text] streaks of length [Formula: see text] games, including a total [Formula: see text] observations were compiled. Treatment and control sample groups ([Formula: see text]) were constructed from core lineups of players on the streaking batter’s team. The percentile method bootstrap was used to calculate [Formula: see text] confidence intervals for statistics representing differences in the mean distributions of two batting statistics between groups. Batters in the treatment group (hot streak active) showed statistically significant improvements in hitting performance, as compared against the control. Mean [Formula: see text] for the treatment group was found to be [Formula: see text] to [Formula: see text] percentage points higher during hot streaks (mean difference increased [Formula: see text] points), while the batting heat index [Formula: see text] introduced here was observed to increase by [Formula: see text] points. For each performance statistic, the null hypothesis was rejected at the [Formula: see text] significance level. We conclude that the evidence suggests the potential existence of a “statistical contagion effect”. Psychological mechanisms essential to the empirical results are suggested, as several studies from the scientific literature lend credence to contagious phenomena in sports. Causal inference from these results is difficult, but we suggest and discuss several latent variables that may contribute to the observed results, and offer possible directions for future research.
BACKGROUND: Better knowledge of the suprascapular notch anatomy may help to prevent and to assess more accurately suprascapular nerve entrapment syndrome. Our purposes were to verify the reliability of the existing data, to assess the differences between the two genders, to verify the correlation between the dimensions of the scapula and the suprascapular notch, and to investigate the relationship between the suprascapular notch and the postero-superior limit of the safe zone for the suprascapular nerve. METHODS: We examined 500 dried scapulae, measuring seven distances related to the scapular body and suprascapular notch; they were also catalogued according to gender, age and side. Suprascapular notch was classified in accordance with Rengachary’s method. For each class, we also took into consideration the width/depth ratio. Furthermore, Pearson’s correlation was calculated. RESULTS: The frequencies were: Type I 12.4%, Type II 19.8%, Type III 22.8%, Type IV 31.1%, Type V 10.2%, Type VI 3.6%. Width and depth did not demonstrate a statistical significant difference when analyzed according to gender and side; however, a significant difference was found between the depth means elaborated according to median age (73 y.o.). Correlation indexes were weak or not statistically significant. The differences among the postero-superior limits of the safe zone in the six types of notches was not statistically significant. CONCLUSIONS: Patient’s characteristics (gender, age and scapular dimensions) are not related to the characteristics of the suprascapular notch (dimensions and Type); our data suggest that the entrapment syndrome is more likely to be associated with a Type III notch because of its specific features.
BACKGROUND: Nutritional epidemiology is a highly prolific field. Debates on associations of nutrients with disease risk are common in the literature and attract attention in public media. OBJECTIVE: We aimed to examine the conclusions, statistical significance, and reproducibility in the literature on associations between specific foods and cancer risk. DESIGN: We selected 50 common ingredients from random recipes in a cookbook. PubMed queries identified recent studies that evaluated the relation of each ingredient to cancer risk. Information regarding author conclusions and relevant effect estimates were extracted. When >10 articles were found, we focused on the 10 most recent articles. RESULTS: Forty ingredients (80%) had articles reporting on their cancer risk. Of 264 single-study assessments, 191 (72%) concluded that the tested food was associated with an increased (n = 103) or a decreased (n = 88) risk; 75% of the risk estimates had weak (0.05 > P ≥ 0.001) or no statistical (P > 0.05) significance. Statistically significant results were more likely than nonsignificant findings to be published in the study abstract than in only the full text (P < 0.0001). Meta-analyses (n = 36) presented more conservative results; only 13 (26%) reported an increased (n = 4) or a decreased (n = 9) risk (6 had more than weak statistical support). The median RRs (IQRs) for studies that concluded an increased or a decreased risk were 2.20 (1.60, 3.44) and 0.52 (0.39, 0.66), respectively. The RRs from the meta-analyses were on average null (median: 0.96; IQR: 0.85, 1.10). CONCLUSIONS: Associations with cancer risk or benefits have been claimed for most food ingredients. Many single studies highlight implausibly large effects, even though evidence is weak. Effect sizes shrink in meta-analyses.
P values and hypothesis testing methods are frequently misused in clinical research. Much of this misuse appears to be owing to the widespread, mistaken belief that they provide simple, reliable, and objective triage tools for separating the true and important from the untrue or unimportant. The primary focus in interpreting therapeutic clinical research data should be on the treatment (“oomph”) effect, a metaphorical force that moves patients given an effective treatment to a different clinical state relative to their control counterparts. This effect is assessed using 2 complementary types of statistical measures calculated from the data, namely, effect magnitude or size and precision of the effect size. In a randomized trial, effect size is often summarized using constructs, such as odds ratios, hazard ratios, relative risks, or adverse event rate differences. How large a treatment effect has to be to be consequential is a matter for clinical judgment. The precision of the effect size (conceptually related to the amount of spread in the data) is usually addressed with confidence intervals. P values (significance tests) were first proposed as an informal heuristic to help assess how “unexpected” the observed effect size was if the true state of nature was no effect or no difference. Hypothesis testing was a modification of the significance test approach that envisioned controlling the false-positive rate of study results over many (hypothetical) repetitions of the experiment of interest. Both can be helpful but, by themselves, provide only a tunnel vision perspective on study results that ignores the clinical effects the study was conducted to measure.