- Proceedings of the National Academy of Sciences of the United States of America
Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggests that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25-50:1, and to 100-200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.
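The claimed correspondence can be illustrated for the one-sided normal-mean case, where the uniformly most powerful Bayesian test rejects when z > √(2 ln γ), so a classical test of size α maps to an evidence threshold γ = exp(z_α²/2). A minimal sketch of that mapping (illustrative only, not the paper's code):

```python
from math import exp
from statistics import NormalDist

def evidence_threshold(alpha: float) -> float:
    """Bayes-factor evidence threshold gamma whose UMPBT rejection
    region coincides with a one-sided z-test of size alpha.
    The UMPBT rejects when z > sqrt(2*ln(gamma)), so equating the
    boundaries gives gamma = exp(z_alpha**2 / 2)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return exp(z_alpha ** 2 / 2)

for alpha in (0.05, 0.005, 0.001):
    print(f"alpha = {alpha}: gamma ~ {evidence_threshold(alpha):.1f}")
```

Running this gives γ ≈ 3.9 at α = 0.05, γ ≈ 28 at α = 0.005, and γ ≈ 118 at α = 0.001, which lands the stricter significance levels inside the 25-50:1 and 100-200:1 evidence ranges recommended above.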
A paper from the Open Science Collaboration (Research Articles, 28 August 2015, aac4716) attempting to replicate 100 published studies suggests that the reproducibility of psychological science is surprisingly low. We show that this article contains three statistical errors and provides no support for such a conclusion. Indeed, the data are consistent with the opposite conclusion, namely, that the reproducibility of psychological science is quite high.
Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation benefit”. Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data were not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third-party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers.
The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
Background Acetaminophen is a common therapy for fever in patients in the intensive care unit (ICU) who have probable infection, but its effects are unknown. Methods We randomly assigned 700 ICU patients with fever (body temperature, ≥38°C) and known or suspected infection to receive either 1 g of intravenous acetaminophen or placebo every 6 hours until ICU discharge, resolution of fever, cessation of antimicrobial therapy, or death. The primary outcome was ICU-free days (days alive and free from the need for intensive care) from randomization to day 28. Results The number of ICU-free days to day 28 did not differ significantly between the acetaminophen group and the placebo group: 23 days (interquartile range, 13 to 25) among patients assigned to acetaminophen and 22 days (interquartile range, 12 to 25) among patients assigned to placebo (Hodges-Lehmann estimate of absolute difference, 0 days; 96.2% confidence interval [CI], 0 to 1; P=0.07). A total of 55 of 345 patients in the acetaminophen group (15.9%) and 57 of 344 patients in the placebo group (16.6%) had died by day 90 (relative risk, 0.96; 95% CI, 0.66 to 1.39; P=0.84). Conclusions Early administration of acetaminophen to treat fever due to probable infection did not affect the number of ICU-free days. (Funded by the Health Research Council of New Zealand and others; HEAT Australian New Zealand Clinical Trials Registry number, ACTRN12612000513819.)
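The two-sample Hodges-Lehmann estimate reported for the primary outcome is simply the median of all pairwise differences between the two groups. A small sketch with hypothetical ICU-free-day counts (not the trial's data or analysis code):

```python
from itertools import product
from statistics import median

def hodges_lehmann(treatment, control):
    """Two-sample Hodges-Lehmann estimate: the median of all
    pairwise differences treatment_i - control_j."""
    return median(t - c for t, c in product(treatment, control))

# Hypothetical ICU-free-day counts for illustration only.
acetaminophen = [23, 25, 13, 24, 22]
placebo = [22, 25, 12, 21, 20]
print(hodges_lehmann(acetaminophen, placebo))  # prints 2
```

Because it is a median of differences rather than a difference of medians, the estimator is robust to the skewed, bounded distributions typical of ICU-free-day outcomes.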
Objective. To estimate how far changes in the prevalence of electronic cigarette (e-cigarette) use in England have been associated with changes in quit success, quit attempts, and use of licensed medication and behavioural support in quit attempts.
Over the past ten years, unconventional gas and oil drilling (UGOD) has markedly expanded in the United States. Despite substantial increases in well drilling, the health consequences of UGOD toxicant exposure remain unclear. This study examines the association between wells and healthcare use by zip code from 2007 to 2011 in Pennsylvania. Inpatient discharge databases from the Pennsylvania Healthcare Cost Containment Council were correlated with active wells by zip code in three counties in Pennsylvania. For overall inpatient prevalence rates and 25 specific medical categories, the associations of inpatient prevalence rates with the number of wells per zip code and, separately, with wells per km2 (separated into quantiles and defined as well density) were estimated using fixed-effects Poisson models. To account for multiple comparisons, a Bonferroni correction was applied, with associations of p<0.00096 considered statistically significant. Cardiology inpatient prevalence rates were significantly associated with the number of wells per zip code (p<0.00096) and wells per km2 (p<0.00096), while neurology inpatient prevalence rates were significantly associated with wells per km2 (p<0.00096). Furthermore, the evidence also supported an association between well density and inpatient prevalence rates for the medical categories of dermatology, neurology, oncology, and urology. These data suggest that UGOD wells, which dramatically increased in number over the past decade, were associated with increased inpatient prevalence rates within specific medical categories in Pennsylvania. Further studies are necessary to address healthcare costs of UGOD and to determine whether specific toxicants or combinations are associated with organ-specific responses.
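The p<0.00096 cutoff is a standard Bonferroni correction, dividing the family-wise α by the number of comparisons. The abstract does not state that number; 0.05/52 ≈ 0.00096, consistent with 26 outcomes (overall plus 25 categories) times 2 exposure measures, but that count is our assumption:

```python
def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance threshold after a Bonferroni
    correction for m comparisons."""
    return alpha / m

# m = 52 is an assumption (26 outcomes x 2 exposure measures);
# the abstract only reports the resulting threshold.
print(round(bonferroni_threshold(0.05, 52), 5))  # prints 0.00096
```

Any single test must then clear this far stricter per-comparison threshold for the family-wise error rate to stay at 0.05.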
Endurance exercise training studies frequently show modest changes in VO2max with training and very limited responses in some subjects. By contrast, studies using interval training (IT) or combined IT and continuous training (CT) have reported mean increases in VO2max of up to ∼1.0 L·min⁻¹. This raises questions about the role of exercise intensity and the trainability of VO2max. To address this topic we analyzed IT and IT/CT studies published in English from 1965-2012. Inclusion criteria were: 1) ≥3 healthy sedentary/recreationally active humans <45 yrs old, 2) training duration 6-13 weeks, 3) ≥3 days/week, 4) ≥10 minutes of high intensity work, 5) ≥1:1 work/rest ratio, and 6) results reported as mean ± SD or SE, ranges of change, or individual data. Due to heterogeneity (I² = 70), statistical synthesis of the data used a random effects model. The summary statistic of interest was the change in VO2max. A total of 334 subjects (120 women) from 37 studies were identified. Participants were grouped into 40 distinct training groups, so the unit of analysis was 40 rather than 37. An increase in VO2max of 0.51 L·min⁻¹ (95% CI: 0.43 to 0.60 L·min⁻¹) was observed. A subset of 9 studies, with 72 subjects, that featured longer intervals showed even larger (∼0.8-0.9 L·min⁻¹) changes in VO2max with evidence of a marked response in all subjects. These results suggest that ideas about trainability and VO2max should be further evaluated with standardized IT or IT/CT training programs.
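With I² = 70%, a random-effects model such as DerSimonian-Laird is the usual choice: it estimates the between-study variance τ² from Cochran's Q and down-weights precise studies accordingly. A sketch with hypothetical per-study effects (the abstract does not publish study-level data, so the numbers below are invented for illustration):

```python
def dersimonian_laird(effects, variances):
    """DerSimonian-Laird random-effects meta-analysis.
    Returns (pooled effect, I^2 heterogeneity percentage)."""
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)          # between-study variance
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, i2

# Hypothetical per-study changes in VO2max (L/min) and variances.
effects = [0.35, 0.55, 0.60, 0.45, 0.80]
variances = [0.010, 0.008, 0.012, 0.009, 0.015]
pooled, i2 = dersimonian_laird(effects, variances)
print(f"pooled change = {pooled:.2f} L/min, I^2 = {i2:.0f}%")
```

When τ² is large relative to the within-study variances, the random-effects weights flatten toward equality, so no single large study dominates the pooled change.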
This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.
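The core of such a check is mechanical: recompute the p-value from the reported test statistic and degrees of freedom, allow for rounding at the reported precision, and flag the result as grossly inconsistent when the recomputed value crosses the significance threshold. A simplified sketch for a two-sided z test (statcheck itself is an R package covering t, F, χ², r, and z results; the logic below is our own reduction):

```python
from statistics import NormalDist

def check_p(z: float, reported_p: float, alpha: float = 0.05) -> str:
    """statcheck-style consistency check for a two-sided z test:
    recompute p from the statistic and compare with the reported
    value, allowing for rounding at the reported precision."""
    recomputed = 2 * (1 - NormalDist().cdf(abs(z)))
    decimals = len(str(reported_p).split(".")[-1])
    tol = 0.5 * 10 ** -decimals            # rounding tolerance
    if abs(recomputed - reported_p) <= tol:
        return "consistent"
    if (recomputed < alpha) != (reported_p < alpha):
        return "gross inconsistency"       # significance flips
    return "inconsistent"

print(check_p(2.20, 0.028))  # recomputed p ~ .0278 -> consistent
print(check_p(1.85, 0.032))  # recomputed p ~ .064 -> gross inconsistency
```

The distinction mirrors the abstract's terminology: an "inconsistency" is a mismatch beyond rounding, while a "gross inconsistency" is one that changes the statistical conclusion at α = 0.05.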
De Winter and Happee examined whether science based on selective publishing of significant results may be effective in accurate estimation of population effects, and whether this is even more effective than a science in which all results are published (i.e., a science without publication bias). Based on their simulation study they concluded that “selective publishing yields a more accurate meta-analytic estimation of the true effect than publishing everything, (and that) publishing nonreplicable results while placing null results in the file drawer can be beneficial for the scientific collective” (p. 4).
In the USA, the relationship between the legal availability of guns and the firearm-related homicide rate has been debated. It has been argued that unrestricted gun availability promotes the occurrence of firearm-induced homicides. It has also been pointed out that gun possession can protect potential victims when attacked. This paper provides a first mathematical analysis of this tradeoff, with the goal of steering the debate towards arguing about assumptions, statistics, and scientific methods. The model is based on a set of clearly defined assumptions, which are supported by available statistical data, and is formulated axiomatically such that results do not depend on arbitrary mathematical expressions. According to this framework, two alternative scenarios can minimize the gun-related homicide rate: a ban of private firearm possession, or a policy allowing the general population to carry guns. Importantly, the model identifies the crucial parameters that determine which policy minimizes the death rate, and thus serves as a guide for the design of future epidemiological studies. The parameters that need to be measured include the fraction of offenders that illegally possess a gun, the degree of protection provided by gun ownership, and the fraction of the population who take up their right to own a gun and carry it when attacked. Limited data available in the literature were used to demonstrate how the model can be parameterized, and this preliminary analysis suggests that a ban of private firearm possession, or possibly a partial reduction in gun availability, might lower the rate of firearm-induced homicides. This, however, should not be seen as a policy recommendation, due to the limited data available to inform and parameterize the model. Nevertheless, the model clearly defines what needs to be measured, and provides a basis for a scientific discussion about assumptions and data.