Concept: Ronald Fisher
This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.
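The kind of consistency check described above can be sketched in a few lines. This is not statcheck's code or API (statcheck is an R package and covers several test types); it is a minimal Python illustration restricted to two-sided z-tests, and the function names, the rounding tolerance, and the 0.05 threshold used for "gross" inconsistencies are assumptions made for the example.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def check_reported_p(z, reported_p, decimals=2, alpha=0.05):
    """Compare a reported p-value with the one recomputed from z.

    Hypothetical helper: a result is flagged as inconsistent when the
    recomputed p-value differs from the reported one by more than
    rounding to `decimals` places can explain, and as grossly
    inconsistent when the two values also fall on opposite sides
    of `alpha`.
    """
    recomputed = two_sided_p_from_z(z)
    tolerance = 0.5 * 10 ** (-decimals)
    inconsistent = abs(recomputed - reported_p) > tolerance + 1e-12
    gross = inconsistent and ((reported_p < alpha) != (recomputed < alpha))
    return {"recomputed_p": recomputed, "inconsistent": inconsistent, "gross": gross}


if __name__ == "__main__":
    # Illustrative values only, not taken from any article.
    print(check_reported_p(1.50, 0.04))  # reported as significant, recomputed is not
    print(check_reported_p(2.50, 0.03))  # inconsistent, but the conclusion is unchanged
```

A fuller check would also allow for rounding of the reported test statistic and would handle t, F, r and chi-square results, which this sketch does not attempt.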
What are the statistical practices of articles published in journals with a high impact factor? Are there differences compared with articles published in journals with a somewhat lower impact factor that have adopted editorial policies to reduce the impact of limitations of Null Hypothesis Significance Testing? To investigate these questions, the current study analyzed all articles related to psychological, neuropsychological and medical issues, published in 2011 in four journals with high impact factors: Science, Nature, The New England Journal of Medicine and The Lancet, and three journals with relatively lower impact factors: Neuropsychology, Journal of Experimental Psychology-Applied and the American Journal of Public Health. Results show that Null Hypothesis Significance Testing without any use of confidence intervals, effect size, prospective power and model estimation, is the prevalent statistical practice used in articles published in Nature, 89%, followed by articles published in Science, 42%. By contrast, in all other journals, both with high and lower impact factors, most articles report confidence intervals and/or effect size measures. We interpreted these differences as consequences of the editorial policies adopted by the journal editors, which are probably the most effective means to improve the statistical practices in journals with high or low impact factors.
Questions over the clinical significance of cannabis withdrawal have hindered its inclusion as a discrete cannabis-induced psychiatric condition in the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV). This study aims to quantify functional impairment to normal daily activities from cannabis withdrawal and examines the factors that predict functional impairment. In addition, the study tests the influence of functional impairment from cannabis withdrawal on cannabis use during and after an abstinence attempt.
Much has been written regarding p-values below certain thresholds (most notably 0.05) denoting statistical significance and the tendency of such p-values to be more readily publishable in peer-reviewed journals. Intuition suggests that there may be a tendency to manipulate statistical analyses to push a “near significant p-value” to a level that is considered significant. This article presents a method for detecting the presence of such manipulation (herein called “fiddling”) in a distribution of p-values from independent studies. Simulations are used to illustrate the properties of the method. The results suggest that the method has low type I error and that power approaches acceptable levels as the number of p-values being studied approaches 1000.
Data analysis is used to test the hypothesis that “hitting is contagious”. A statistical model is described to study the effect of a hot hitter upon his teammates' batting during a consecutive game hitting streak. Box score data for entire seasons comprising [Formula: see text] streaks of length [Formula: see text] games, including a total of [Formula: see text] observations, were compiled. Treatment and control sample groups ([Formula: see text]) were constructed from core lineups of players on the streaking batter’s team. The percentile method bootstrap was used to calculate [Formula: see text] confidence intervals for statistics representing differences in the mean distributions of two batting statistics between groups. Batters in the treatment group (hot streak active) showed statistically significant improvements in hitting performance, as compared against the control. Mean [Formula: see text] for the treatment group was found to be [Formula: see text] to [Formula: see text] percentage points higher during hot streaks (mean difference increased [Formula: see text] points), while the batting heat index [Formula: see text] introduced here was observed to increase by [Formula: see text] points. For each performance statistic, the null hypothesis was rejected at the [Formula: see text] significance level. We conclude that the evidence suggests the potential existence of a “statistical contagion effect”. Psychological mechanisms essential to the empirical results are suggested, as several studies from the scientific literature lend credence to contagious phenomena in sports. Causal inference from these results is difficult, but we suggest and discuss several latent variables that may contribute to the observed results, and offer possible directions for future research.
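The percentile bootstrap mentioned above can be illustrated with a short sketch for a difference in group means. The data, the function name, the number of resamples, and the 95% level below are all assumptions chosen for illustration; the confidence level and statistics used in the study itself are not recoverable from the text and are not reproduced here.

```python
import random
from statistics import mean


def percentile_bootstrap_ci(treatment, control, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap interval for mean(treatment) - mean(control).

    Each group is resampled with replacement, the difference in means is
    recorded, and the interval endpoints are read off the empirical
    quantiles of the resampled differences.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        t = rng.choices(treatment, k=len(treatment))
        c = rng.choices(control, k=len(control))
        diffs.append(mean(t) - mean(c))
    diffs.sort()
    tail = (1 - level) / 2
    lower = diffs[int(tail * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - tail) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    # Made-up batting averages, purely illustrative.
    treatment = [0.30, 0.28, 0.33, 0.31, 0.29, 0.32, 0.30, 0.34]
    control = [0.25, 0.27, 0.24, 0.26, 0.28, 0.25, 0.23, 0.26]
    print(percentile_bootstrap_ci(treatment, control))
```

If the resulting interval excludes zero, the null hypothesis of no difference in means would be rejected at the corresponding significance level.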
Conclusive evidence for sexual dimorphism in non-avian dinosaurs has been elusive. Here it is shown that dimorphism in the shape of the dermal plates of Stegosaurus mjosi (Upper Jurassic, western USA) does not result from non-sex-related individual, interspecific, or ontogenetic variation and is most likely a sexually dimorphic feature. One morph possessed wide, oval plates 45% larger in surface area than the tall, narrow plates of the other morph. Intermediate morphologies are lacking as principal component analysis supports marked size- and shape-based dimorphism. In contrast, many non-sex-related individual variations are expected to show intermediate morphologies. Taphonomy of a new quarry in Montana (JRDI 5ES Quarry) shows that at least five individuals were buried in a single horizon and were not brought together by water or scavenger transportation. This new site demonstrates co-existence, and possibly suggests sociality, between two morphs that only show dimorphism in their plates. Without evidence for niche partitioning, it is unlikely that the two morphs represent different species. Histology of the new specimens in combination with studies on previous specimens indicates that both morphs occur in fully-grown individuals. Therefore, the dimorphism is not a result of ontogenetic change. Furthermore, the two morphs of plates do not simply come from different positions on the back of a single individual. Plates from all positions on the body can be classified as one of the two morphs, and previously discovered, isolated specimens possess only one morph of plates. Based on the seemingly display-oriented morphology of plates, female mate choice was likely the driving evolutionary mechanism rather than male-male competition. Dinosaur ornamentation possibly served similar functions to the ornamentation of modern species. 
Comparisons to ornamentation involved in sexual selection of extant species, such as the horns of bovids, may be appropriate in predicting the function of some dinosaur ornamentation.
BACKGROUND: Better knowledge of the suprascapular notch anatomy may help to prevent and to assess more accurately suprascapular nerve entrapment syndrome. Our purposes were to verify the reliability of the existing data, to assess the differences between the two genders, to verify the correlation between the dimensions of the scapula and the suprascapular notch, and to investigate the relationship between the suprascapular notch and the postero-superior limit of the safe zone for the suprascapular nerve. METHODS: We examined 500 dried scapulae, measuring seven distances related to the scapular body and suprascapular notch; they were also catalogued according to gender, age and side. The suprascapular notch was classified in accordance with Rengachary’s method. For each class, we also took into consideration the width/depth ratio. Furthermore, Pearson’s correlation was calculated. RESULTS: The frequencies were: Type I 12.4%, Type II 19.8%, Type III 22.8%, Type IV 31.1%, Type V 10.2%, Type VI 3.6%. Width and depth did not show a statistically significant difference when analyzed according to gender and side; however, a significant difference was found between the mean depths when the sample was split at the median age (73 y.o.). Correlation indexes were weak or not statistically significant. The differences among the postero-superior limits of the safe zone in the six types of notches were not statistically significant. CONCLUSIONS: Patient characteristics (gender, age and scapular dimensions) are not related to the characteristics of the suprascapular notch (dimensions and Type); our data suggest that the entrapment syndrome is more likely to be associated with a Type III notch because of its specific features.
BACKGROUND: Nutritional epidemiology is a highly prolific field. Debates on associations of nutrients with disease risk are common in the literature and attract attention in public media. OBJECTIVE: We aimed to examine the conclusions, statistical significance, and reproducibility in the literature on associations between specific foods and cancer risk. DESIGN: We selected 50 common ingredients from random recipes in a cookbook. PubMed queries identified recent studies that evaluated the relation of each ingredient to cancer risk. Information regarding author conclusions and relevant effect estimates were extracted. When >10 articles were found, we focused on the 10 most recent articles. RESULTS: Forty ingredients (80%) had articles reporting on their cancer risk. Of 264 single-study assessments, 191 (72%) concluded that the tested food was associated with an increased (n = 103) or a decreased (n = 88) risk; 75% of the risk estimates had weak (0.05 > P ≥ 0.001) or no statistical (P > 0.05) significance. Statistically significant results were more likely than nonsignificant findings to be published in the study abstract than in only the full text (P < 0.0001). Meta-analyses (n = 36) presented more conservative results; only 13 (26%) reported an increased (n = 4) or a decreased (n = 9) risk (6 had more than weak statistical support). The median RRs (IQRs) for studies that concluded an increased or a decreased risk were 2.20 (1.60, 3.44) and 0.52 (0.39, 0.66), respectively. The RRs from the meta-analyses were on average null (median: 0.96; IQR: 0.85, 1.10). CONCLUSIONS: Associations with cancer risk or benefits have been claimed for most food ingredients. Many single studies highlight implausibly large effects, even though evidence is weak. Effect sizes shrink in meta-analyses.
A typical rule that has been used for the endorsement of new medications by the Food and Drug Administration is to have two trials, each convincing on its own, demonstrating effectiveness. “Convincing” may be subjectively interpreted, but the use of p-values and the focus on statistical significance (in particular with p < .05 being coined significant) is pervasive in clinical research. Therefore, in this paper, we calculate with simulations what it means to have exactly two trials, each with p < .05, in terms of the actual strength of evidence quantified by Bayes factors. Our results show that different cases where two trials have a p-value below .05 have wildly differing Bayes factors. Bayes factors of at least 20 in favor of the alternative hypothesis are not necessarily achieved and they fail to be reached in a large proportion of cases, in particular when the true effect size is small (0.2 standard deviations) or zero. In a non-trivial number of cases, evidence actually points to the null hypothesis, in particular when the true effect size is zero, when the number of trials is large, and when the number of participants in both groups is low. We recommend use of Bayes factors as a routine tool to assess endorsement of new medications, because Bayes factors consistently quantify strength of evidence. Use of p-values may lead to paradoxical and spurious decision-making regarding the use of new medications.
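To give a feel for why two trials with p < .05 need not amount to strong evidence, here is a deliberately simplified sketch. It is not the simulation design or the Bayes factor used in the paper: it assumes a one-sample z-test with known unit variance, a normal prior on the standardized effect under the alternative (the prior standard deviation of 1 is an arbitrary choice), and equal sample sizes in both trials. All function and parameter names are hypothetical.

```python
import math


def bf10_normal(z, n, prior_sd=1.0):
    """Bayes factor for H1: delta ~ N(0, prior_sd^2) against H0: delta = 0.

    Assumes z = sqrt(n) * sample_mean with known unit variance, so that
    z ~ N(0, 1) under H0 and z ~ N(0, 1 + n * prior_sd^2) under H1.
    The Bayes factor is the ratio of those two densities at the observed z.
    """
    v = 1.0 + n * prior_sd ** 2
    return math.exp(0.5 * z * z * (v - 1.0) / v) / math.sqrt(v)


def pooled_bf10(z1, z2, n_per_trial, prior_sd=1.0):
    """Bayes factor for two equally sized trials of the same effect.

    With known variance, the two trials can be pooled into one z statistic
    based on 2 * n_per_trial observations.
    """
    z_pooled = (z1 + z2) / math.sqrt(2)
    return bf10_normal(z_pooled, 2 * n_per_trial, prior_sd)


if __name__ == "__main__":
    # Two trials that each only just reach p < .05 (z close to 1.96).
    print(pooled_bf10(1.96, 1.96, n_per_trial=20))
    # Two trials with clearly larger test statistics.
    print(pooled_bf10(3.0, 3.0, n_per_trial=20))
```

Under these assumptions, two barely significant trials of 20 observations each give a pooled Bayes factor well below 20, whereas two trials with z = 3 exceed it by a wide margin, even though both pairs satisfy the same "two trials with p < .05" rule.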
Post-copulatory sexual selection (PSS), fuelled by female promiscuity, is credited with the rapid evolution of sperm quality traits across diverse taxa. Yet, our understanding of the adaptive significance of sperm ornaments and the cryptic female preferences driving their evolution is extremely limited. Here we review the evolutionary allometry of exaggerated sexual traits (for example, antlers, horns, tail feathers, mandibles and dewlaps), show that the giant sperm of some Drosophila species are possibly the most extreme ornaments in all of nature and demonstrate how their existence challenges theories explaining the intensity of sexual selection, mating-system evolution and the fundamental nature of sex differences. We also combine quantitative genetic analyses of interacting sex-specific traits in D. melanogaster with comparative analyses of the condition dependence of male and female reproductive potential across species with varying ornament size to reveal complex dynamics that may underlie sperm-length evolution. Our results suggest that producing few gigantic sperm evolved by (1) Fisherian runaway selection mediated by genetic correlations between sperm length, the female preference for long sperm and female mating frequency, and (2) longer sperm increasing the indirect benefits to females. Our results also suggest that the developmental integration of sperm quality and quantity renders post-copulatory sexual selection on ejaculates unlikely to treat male-male competition and female choice as discrete processes.