Concept: Statistical power
A focus on novel, confirmatory, and statistically significant results leads to substantial bias in the scientific literature. One type of bias, known as “p-hacking,” occurs when researchers collect or select data or statistical analyses until nonsignificant results become significant. Here, we use text-mining to demonstrate that p-hacking is widespread throughout science. We then illustrate how one can test for p-hacking when performing a meta-analysis and show that, while p-hacking is probably common, its effect seems to be weak relative to the real effect sizes being measured. This result suggests that p-hacking probably does not drastically alter scientific consensuses drawn from meta-analyses.
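One way to probe a set of reported p-values for p-hacking is to compare how many fall just below 0.05 with how many fall in the adjacent lower bin; an excess in the upper bin is suggestive. The sketch below is a minimal illustration under assumptions the abstract does not state: the bin edges (0.04, 0.045, 0.05), the function names, and the sample p-values are all hypothetical.

```python
from math import comb

def binomial_upper_tail(k, n, p=0.5):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def p_hacking_test(p_values, lower=0.04, middle=0.045, upper=0.05):
    """One-sided binomial test for an excess of p-values just below `upper`.

    The bin edges are assumptions of this sketch, not values taken from
    the abstract. Under the null hypothesis, a p-value landing in either
    bin is treated as equally likely to land in the upper or lower one.
    """
    low_bin = sum(1 for p in p_values if lower < p <= middle)
    high_bin = sum(1 for p in p_values if middle < p < upper)
    n = low_bin + high_bin
    if n == 0:
        return low_bin, high_bin, 1.0  # no p-values in range: no evidence
    return low_bin, high_bin, binomial_upper_tail(high_bin, n)

# Made-up p-values, for illustration only.
low, high, p = p_hacking_test([0.041, 0.046, 0.047, 0.049, 0.2])
print(f"lower bin: {low}, upper bin: {high}, one-sided p = {p:.4f}")
```

A small one-sided p-value here would indicate more results just under 0.05 than the adjacent bin would lead one to expect.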
- Proceedings of the National Academy of Sciences of the United States of America
- Published over 4 years ago
Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggests that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25-50:1, and to 100-200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.
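For the simplest case, a one-sided z-test with known variance, the correspondence between significance level and evidence threshold can be sketched numerically. The relation used here (reject when z exceeds sqrt(2 ln gamma), where gamma is the evidence threshold) is an assumption of this sketch for that special case; it is not stated in the abstract, and the function name is hypothetical.

```python
from math import exp
from statistics import NormalDist

def evidence_threshold(alpha):
    """Evidence threshold gamma matched to a one-sided z-test of size alpha.

    Assumes the rejection region z > sqrt(2 * ln(gamma)), so that
    gamma = exp(z_alpha**2 / 2). This is an illustrative special case.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return exp(z_alpha ** 2 / 2)

for alpha in (0.05, 0.005, 0.001):
    print(f"alpha = {alpha}: gamma is roughly {evidence_threshold(alpha):.1f}")
```

Under this assumption, the 0.005 and 0.001 levels land inside the 25-50:1 and 100-200:1 ranges quoted in the abstract, while the 0.05 level corresponds to a much weaker threshold.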
This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.
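The kind of consistency check described above can be sketched as follows for a t-test: recompute the two-tailed p-value from the reported statistic and degrees of freedom, then compare it with the reported p-value. This is not statcheck's code or API (statcheck is an R package); the function names, the rounding rule, and the example numbers are assumptions made for illustration, and the real tool also handles other test types and rounding of the test statistic.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p_from_t(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central_mass)

def check_reported(t, df, reported_p, decimals=2, alpha=0.05):
    """Return (recomputed p, inconsistent?, grossly inconsistent?).

    "Inconsistent": recomputed and reported p differ after rounding.
    "Grossly inconsistent": they also fall on opposite sides of alpha.
    """
    recomputed = two_tailed_p_from_t(t, df)
    inconsistent = round(recomputed, decimals) != round(reported_p, decimals)
    gross = inconsistent and ((recomputed < alpha) != (reported_p < alpha))
    return recomputed, inconsistent, gross

# Hypothetical reported result: t(28) = 1.50, p = .03
print(check_reported(1.50, 28, 0.03))
```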
What are the statistical practices of articles published in journals with a high impact factor? Are there differences compared with articles published in journals with a somewhat lower impact factor that have adopted editorial policies to reduce the impact of limitations of Null Hypothesis Significance Testing? To investigate these questions, the current study analyzed all articles related to psychological, neuropsychological and medical issues, published in 2011 in four journals with high impact factors: Science, Nature, The New England Journal of Medicine and The Lancet, and three journals with relatively lower impact factors: Neuropsychology, Journal of Experimental Psychology-Applied and the American Journal of Public Health. Results show that Null Hypothesis Significance Testing without any use of confidence intervals, effect size, prospective power and model estimation, is the prevalent statistical practice used in articles published in Nature, 89%, followed by articles published in Science, 42%. By contrast, in all other journals, both with high and lower impact factors, most articles report confidence intervals and/or effect size measures. We interpreted these differences as consequences of the editorial policies adopted by the journal editors, which are probably the most effective means to improve the statistical practices in journals with high or low impact factors.
Reproducibility is a defining feature of science, but the extent to which it characterizes current research is unknown. We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects. Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
We have empirically assessed the distribution of published effect sizes and estimated power by analyzing 26,841 statistical records from 3,801 cognitive neuroscience and psychology papers published recently. The reported median effect size was D = 0.93 (interquartile range: 0.64-1.46) for nominally statistically significant results and D = 0.24 (0.11-0.42) for nonsignificant results. Median power to detect small, medium, and large effects was 0.12, 0.44, and 0.73, reflecting no improvement through the past half-century. This is so because sample sizes have remained small. Assuming similar true effect sizes in both disciplines, power was lower in cognitive neuroscience than in psychology. Journal impact factors negatively correlated with power. Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature. In light of our findings, the recently reported low replication success in psychology is realistic, and worse performance may be expected for cognitive neuroscience.
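To see why small samples imply low power, the power of a two-sided, two-sample comparison can be approximated from the standardized effect size and the per-group sample size. The sketch below uses a normal approximation rather than the exact noncentral t distribution, and it is not the paper's procedure; the benchmarks d = 0.2, 0.5, 0.8 are Cohen's conventional small, medium, and large values, and the sample size of 20 per group is a made-up example.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal approximation).

    d is the standardized mean difference (Cohen's d); both groups are
    assumed to have n_per_group observations and equal variances.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected value of the test statistic
    # Probability of landing in either rejection tail.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Hypothetical sample size of 20 per group.
for label, d in (("small", 0.2), ("medium", 0.5), ("large", 0.8)):
    print(f"{label:6s} d = {d}: power is about {approx_power(d, 20):.2f}")
```

With no true effect (d = 0) the function returns the significance level itself, which is a quick sanity check on the formula.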
Do interventions to promote walking in groups increase physical activity? A systematic literature review with meta-analysis
- The international journal of behavioral nutrition and physical activity
- Published about 5 years ago
OBJECTIVE: Walking groups are increasingly being set up but little is known about their efficacy in promoting physical activity. The present study aims to assess the efficacy of interventions to promote walking in groups in promoting physical activity among adults, and to explore potential moderators of this efficacy. METHOD: Systematic literature review searches were conducted using multiple databases. A random-effects model was used for the meta-analysis, with sensitivity analysis. RESULTS: The effect of the interventions (19 studies, 4 572 participants) on physical activity was of medium size (d = 0.52), statistically significant (95%CI 0.32 to 0.71, p < 0.0001), and with a large fail-safe N of 753. Moderator analyses showed that lower quality studies had larger effect sizes than higher quality studies, studies reporting outcomes over six months had larger effect sizes than studies reporting outcomes up to six months, studies that targeted both genders had higher effect sizes than studies that targeted only women, and studies that targeted older adults had larger effect sizes than studies that targeted younger adults. No significant differences were found between studies delivered by professionals and those delivered by lay people. CONCLUSION: Interventions to promote walking in groups are efficacious at increasing physical activity. Despite low homogeneity of results, and limitations (e.g. small number of studies using objective measures of physical activity, publication bias), which might have influenced the findings, the large fail-safe N suggests these findings are robust. Possible explanations for heterogeneity between studies are discussed, and the need for more investigation of this is highlighted.
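A random-effects pooled estimate of the kind reported above can be sketched as follows. The abstract does not say which between-study variance estimator was used; this sketch assumes the DerSimonian-Laird estimator, and the function name and the three input studies are hypothetical.

```python
from math import sqrt

def dersimonian_laird(effects, variances, z=1.96):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator.

    effects:   per-study effect sizes (e.g. standardized mean differences)
    variances: per-study sampling variances of those effects
    """
    k = len(effects)
    w = [1.0 / v for v in variances]          # fixed-effect weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each study's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = sqrt(1.0 / sum(w_star))
    return {"pooled": pooled, "tau2": tau2, "q": q,
            "ci": (pooled - z * se, pooled + z * se)}

# Three made-up studies, for illustration only.
result = dersimonian_laird([0.2, 0.5, 0.8], [0.04, 0.04, 0.04])
print(result)
```

When the studies disagree more than their sampling variances would predict, tau2 becomes positive and the confidence interval widens relative to a fixed-effect analysis.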
Data analysis is used to test the hypothesis that “hitting is contagious”. A statistical model is described to study the effect of a hot hitter upon his teammates' batting during a consecutive game hitting streak. Box score data for entire seasons comprising [Formula: see text] streaks of length [Formula: see text] games, including a total [Formula: see text] observations were compiled. Treatment and control sample groups ([Formula: see text]) were constructed from core lineups of players on the streaking batter’s team. The percentile method bootstrap was used to calculate [Formula: see text] confidence intervals for statistics representing differences in the mean distributions of two batting statistics between groups. Batters in the treatment group (hot streak active) showed statistically significant improvements in hitting performance, as compared against the control. Mean [Formula: see text] for the treatment group was found to be [Formula: see text] to [Formula: see text] percentage points higher during hot streaks (mean difference increased [Formula: see text] points), while the batting heat index [Formula: see text] introduced here was observed to increase by [Formula: see text] points. For each performance statistic, the null hypothesis was rejected at the [Formula: see text] significance level. We conclude that the evidence suggests the potential existence of a “statistical contagion effect”. Psychological mechanisms essential to the empirical results are suggested, as several studies from the scientific literature lend credence to contagious phenomena in sports. Causal inference from these results is difficult, but we suggest and discuss several latent variables that may contribute to the observed results, and offer possible directions for future research.
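The percentile bootstrap mentioned above can be sketched for a difference in group means: resample each group with replacement, recompute the difference, and read the interval off the sorted resampled differences. The confidence level used in the study is elided in this text, so the default alpha below is only a placeholder; the function name and the batting values are likewise made up for illustration.

```python
import math
import random

def percentile_bootstrap_ci(treatment, control, alpha=0.05, n_boot=5000, seed=0):
    """Percentile-method bootstrap CI for mean(treatment) - mean(control).

    alpha is a placeholder default, not the level used in the study.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each group independently, with replacement.
        t_sample = rng.choices(treatment, k=len(treatment))
        c_sample = rng.choices(control, k=len(control))
        diffs.append(sum(t_sample) / len(t_sample)
                     - sum(c_sample) / len(c_sample))
    diffs.sort()
    lo_idx = max(0, math.floor(alpha / 2 * n_boot))
    hi_idx = min(n_boot - 1, math.ceil((1 - alpha / 2) * n_boot) - 1)
    return diffs[lo_idx], diffs[hi_idx]

# Hypothetical batting averages for the two groups.
treatment = [0.30, 0.31, 0.33, 0.35, 0.32, 0.34]
control = [0.24, 0.26, 0.25, 0.28, 0.27, 0.23]
print(percentile_bootstrap_ci(treatment, control))
```

An interval that excludes zero corresponds to rejecting the null hypothesis of no difference at the chosen level.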
The hypothesis that the S allele of the 5-HTTLPR serotonin transporter promoter region is associated with increased risk of depression, but only in individuals exposed to stressful situations, has generated much interest, research and controversy since first proposed in 2003. Multiple meta-analyses combining results from heterogeneous analyses have not settled the issue. To determine the magnitude of the interaction and the conditions under which it might be observed, we performed new analyses on 31 data sets containing 38 802 European ancestry subjects genotyped for 5-HTTLPR and assessed for depression and childhood maltreatment or other stressful life events, and meta-analysed the results. Analyses targeted two stressors (narrow, broad) and two depression outcomes (current, lifetime). All groups that published on this topic prior to the initiation of our study and met the assessment and sample size criteria were invited to participate. Additional groups, identified by consortium members or self-identified in response to our protocol (published prior to the start of analysis) with qualifying unpublished data, were also invited to participate. A uniform data analysis script implementing the protocol was executed by each of the consortium members. Our findings do not support the interaction hypothesis. We found no subgroups or variable definitions for which an interaction between stress and 5-HTTLPR genotype was statistically significant. In contrast, our findings for the main effects of life stressors (strong risk factor) and 5-HTTLPR genotype (no impact on risk) are strikingly consistent across our contributing studies, the original study reporting the interaction and subsequent meta-analyses. 
Our conclusion is that if an interaction exists in which the S allele of 5-HTTLPR increases risk of depression only in stressed individuals, then it is not broadly generalisable, but must be of modest effect size and only observable in limited situations. Molecular Psychiatry advance online publication, 4 April 2017; doi:10.1038/mp.2017.44.
P values and hypothesis testing methods are frequently misused in clinical research. Much of this misuse appears to be owing to the widespread, mistaken belief that they provide simple, reliable, and objective triage tools for separating the true and important from the untrue or unimportant. The primary focus in interpreting therapeutic clinical research data should be on the treatment (“oomph”) effect, a metaphorical force that moves patients given an effective treatment to a different clinical state relative to their control counterparts. This effect is assessed using 2 complementary types of statistical measures calculated from the data, namely, effect magnitude or size and precision of the effect size. In a randomized trial, effect size is often summarized using constructs, such as odds ratios, hazard ratios, relative risks, or adverse event rate differences. How large a treatment effect has to be to be consequential is a matter for clinical judgment. The precision of the effect size (conceptually related to the amount of spread in the data) is usually addressed with confidence intervals. P values (significance tests) were first proposed as an informal heuristic to help assess how “unexpected” the observed effect size was if the true state of nature was no effect or no difference. Hypothesis testing was a modification of the significance test approach that envisioned controlling the false-positive rate of study results over many (hypothetical) repetitions of the experiment of interest. Both can be helpful but, by themselves, provide only a tunnel vision perspective on study results that ignores the clinical effects the study was conducted to measure.
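The two complementary measures described above, effect size and its precision, can be illustrated with an odds ratio and its confidence interval from a 2x2 table. This is a minimal sketch using the standard log-odds (Woolf) interval; the function name and the trial counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio and Woolf (log-scale) confidence interval.

    a, b: events and non-events in the treated group
    c, d: events and non-events in the control group
    All four counts must be positive for this formula.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Made-up trial: 20/100 events in treated, 10/100 in control.
estimate, lower, upper = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {estimate:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Reporting the estimate together with its interval shows both how large the effect appears and how precisely it was measured, which a p-value alone does not convey.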