
Concept: Statistical significance


Background Concerns persist regarding the effect of current surgical resident duty-hour policies on patient outcomes, resident education, and resident well-being. Methods We conducted a national, cluster-randomized, pragmatic, noninferiority trial involving 117 general surgery residency programs in the United States (2014-2015 academic year). Programs were randomly assigned to current Accreditation Council for Graduate Medical Education (ACGME) duty-hour policies (standard-policy group) or more flexible policies that waived rules on maximum shift lengths and time off between shifts (flexible-policy group). Outcomes included the 30-day rate of postoperative death or serious complications (primary outcome), other postoperative complications, and resident perceptions and satisfaction regarding their well-being, education, and patient care. Results In an analysis of data from 138,691 patients, flexible, less-restrictive duty-hour policies were not associated with an increased rate of death or serious complications (9.1% in the flexible-policy group and 9.0% in the standard-policy group, P=0.92; unadjusted odds ratio for the flexible-policy group, 0.96; 92% confidence interval, 0.87 to 1.06; P=0.44; noninferiority criteria satisfied) or of any secondary postoperative outcomes studied. Among 4330 residents, those in programs assigned to flexible policies did not report significantly greater dissatisfaction with overall education quality (11.0% in the flexible-policy group and 10.7% in the standard-policy group, P=0.86) or well-being (14.9% and 12.0%, respectively; P=0.10). Residents under flexible policies were less likely than those under standard policies to perceive negative effects of duty-hour policies on multiple aspects of patient safety, continuity of care, professionalism, and resident education but were more likely to perceive negative effects on personal activities. 
There were no significant differences between study groups in resident-reported perception of the effect of fatigue on personal or patient safety. Residents in the flexible-policy group were less likely than those in the standard-policy group to report leaving during an operation (7.0% vs. 13.2%, P<0.001) or handing off active patient issues (32.0% vs. 46.3%, P<0.001). Conclusions As compared with standard duty-hour policies, flexible, less-restrictive duty-hour policies for surgical residents were associated with noninferior patient outcomes and no significant difference in residents' satisfaction with overall well-being and education quality. (FIRST ClinicalTrials.gov number, NCT02050789.)

Concepts: Psychology, Patient, Hospital, Surgery, Statistical significance, Physician, Perception, Resident


A focus on novel, confirmatory, and statistically significant results leads to substantial bias in the scientific literature. One type of bias, known as “p-hacking,” occurs when researchers collect or select data or statistical analyses until nonsignificant results become significant. Here, we use text-mining to demonstrate that p-hacking is widespread throughout science. We then illustrate how one can test for p-hacking when performing a meta-analysis and show that, while p-hacking is probably common, its effect seems to be weak relative to the real effect sizes being measured. This result suggests that p-hacking probably does not drastically alter scientific consensuses drawn from meta-analyses.

Concepts: Scientific method, Statistics, Mathematics, Statistical significance, Science, Effect size, Meta-analysis, Statistical power
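
One common way to test for p-hacking in a set of reported p-values is a one-sided binomial test on two adjacent bins just below 0.05: if p-hacking pushes results over the threshold, the upper bin should hold more values than the lower one. The Python sketch below illustrates that idea only. The bin boundaries, the function names, and the example p-values are assumptions made for illustration and are not taken from the abstract above.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def p_hacking_test(p_values, lower=(0.04, 0.045), upper=(0.045, 0.05)):
    """One-sided binomial test for an excess of p-values in the upper bin.

    The bin edges are assumed defaults for this sketch. Returns the
    probability of seeing at least this many upper-bin values if both
    bins were equally likely, or None when both bins are empty.
    """
    n_lower = sum(lower[0] < p <= lower[1] for p in p_values)
    n_upper = sum(upper[0] < p < upper[1] for p in p_values)
    n = n_lower + n_upper
    if n == 0:
        return None
    return binomial_upper_tail(n_upper, n)


# Hypothetical p-values, invented purely to exercise the function.
example = [0.041, 0.044, 0.046, 0.0462, 0.047, 0.048, 0.049, 0.0495, 0.012, 0.2]
result = p_hacking_test(example)
print(result)
```

A small result would point to an excess of p-values just under 0.05. With only a handful of values, as in this invented example, the test has very little power.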


This randomized controlled trial was performed to investigate whether placebo effects in chronic low back pain could be harnessed ethically by adding open-label placebo (OLP) treatment to treatment as usual (TAU) for 3 weeks. Pain severity was assessed on three 0- to 10-point Numeric Rating Scales, scoring maximum pain, minimum pain, and usual pain, and a composite, primary outcome, total pain score. Our other primary outcome was back-related dysfunction, assessed on the Roland-Morris Disability Questionnaire. In an exploratory follow-up, participants on TAU received placebo pills for 3 additional weeks. We randomized 97 adults reporting persistent low back pain for more than 3 months' duration and diagnosed by a board-certified pain specialist. Eighty-three adults completed the trial. Compared to TAU, OLP elicited greater pain reduction on each of the three 0- to 10-point Numeric Rating Scales and on the 0- to 10-point composite pain scale (P < 0.001), with moderate to large effect sizes. Pain reduction on the composite Numeric Rating Scales was 1.5 (95% confidence interval: 1.0-2.0) in the OLP group and 0.2 (-0.3 to 0.8) in the TAU group. Open-label placebo treatment also reduced disability compared to TAU (P < 0.001), with a large effect size. Improvement in disability scores was 2.9 (1.7-4.0) in the OLP group and 0.0 (-1.1 to 1.2) in the TAU group. After being switched to OLP, the TAU group showed significant reductions in both pain (1.5, 0.8-2.3) and disability (3.4, 2.2-4.5). Our findings suggest that OLP pills presented in a positive context may be helpful in chronic low back pain. This is an open-access article distributed under the terms of the Creative Commons Attribution-Non Commercial-No Derivatives License 4.0 (CC BY-NC-ND), where it is permissible to download and share the work provided it is properly cited. The work cannot be changed in any way or used commercially without permission from the journal.

Concepts: Low back pain, Randomized controlled trial, Statistical significance, Pharmaceutical industry, Clinical research, Placebo, Acupuncture, Effect size


Recent advances in Bayesian hypothesis testing have led to the development of uniformly most powerful Bayesian tests, which represent an objective, default class of Bayesian hypothesis tests that have the same rejection regions as classical significance tests. Based on the correspondence between these two classes of tests, it is possible to equate the size of classical hypothesis tests with evidence thresholds in Bayesian tests, and to equate P values with Bayes factors. An examination of these connections suggests that recent concerns over the lack of reproducibility of scientific studies can be attributed largely to the conduct of significance tests at unjustifiably high levels of significance. To correct this problem, evidence thresholds required for the declaration of a significant finding should be increased to 25-50:1, and to 100-200:1 for the declaration of a highly significant finding. In terms of classical hypothesis tests, these evidence standards mandate the conduct of tests at the 0.005 or 0.001 level of significance.

Concepts: Scientific method, Statistics, Statistical significance, Statistical hypothesis testing, Falsifiability, Bayesian inference, Statistical power, United States Declaration of Independence
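
The link between evidence thresholds and significance levels described above can be checked numerically. The sketch below assumes the one-sided z-test case, in which an evidence threshold of γ:1 corresponds to rejecting when z exceeds sqrt(2 ln γ). That specific relation and the function name are assumptions of this illustration, since the abstract itself does not spell out the formula.

```python
from math import erfc, log, sqrt


def one_sided_p_from_evidence(gamma: float) -> float:
    """Significance level matched to an evidence threshold of gamma:1.

    Assumes a one-sided z-test with rejection region z > sqrt(2 ln gamma);
    this mapping is an assumption of the sketch, not quoted from the abstract.
    """
    z = sqrt(2.0 * log(gamma))
    # Upper-tail probability of the standard normal distribution at z.
    return 0.5 * erfc(z / sqrt(2.0))


for gamma in (25, 50, 100, 200):
    z = sqrt(2.0 * log(gamma))
    p = one_sided_p_from_evidence(gamma)
    print(f"{gamma}:1  z = {z:.3f}  one-sided p = {p:.4f}")
```

Under this assumption, the levels for 25:1 and 50:1 bracket 0.005 and those for 100:1 and 200:1 bracket 0.001, which agrees with the significance levels quoted in the abstract.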


Background Experimental and clinical data suggest that reducing inflammation without affecting lipid levels may reduce the risk of cardiovascular disease. Yet, the inflammatory hypothesis of atherothrombosis has remained unproved. Methods We conducted a randomized, double-blind trial of canakinumab, a therapeutic monoclonal antibody targeting interleukin-1β, involving 10,061 patients with previous myocardial infarction and a high-sensitivity C-reactive protein level of 2 mg or more per liter. The trial compared three doses of canakinumab (50 mg, 150 mg, and 300 mg, administered subcutaneously every 3 months) with placebo. The primary efficacy end point was nonfatal myocardial infarction, nonfatal stroke, or cardiovascular death. Results At 48 months, the median reduction from baseline in the high-sensitivity C-reactive protein level was 26 percentage points greater in the group that received the 50-mg dose of canakinumab, 37 percentage points greater in the 150-mg group, and 41 percentage points greater in the 300-mg group than in the placebo group. Canakinumab did not reduce lipid levels from baseline. At a median follow-up of 3.7 years, the incidence rate for the primary end point was 4.50 events per 100 person-years in the placebo group, 4.11 events per 100 person-years in the 50-mg group, 3.86 events per 100 person-years in the 150-mg group, and 3.90 events per 100 person-years in the 300-mg group. The hazard ratios as compared with placebo were as follows: in the 50-mg group, 0.93 (95% confidence interval [CI], 0.80 to 1.07; P=0.30); in the 150-mg group, 0.85 (95% CI, 0.74 to 0.98; P=0.021); and in the 300-mg group, 0.86 (95% CI, 0.75 to 0.99; P=0.031). The 150-mg dose, but not the other doses, met the prespecified multiplicity-adjusted threshold for statistical significance for the primary end point and the secondary end point that additionally included hospitalization for unstable angina that led to urgent revascularization (hazard ratio vs. placebo, 0.83; 95% CI, 0.73 to 0.95; P=0.005). Canakinumab was associated with a higher incidence of fatal infection than was placebo. There was no significant difference in all-cause mortality (hazard ratio for all canakinumab doses vs. placebo, 0.94; 95% CI, 0.83 to 1.06; P=0.31). Conclusions Antiinflammatory therapy targeting the interleukin-1β innate immunity pathway with canakinumab at a dose of 150 mg every 3 months led to a significantly lower rate of recurrent cardiovascular events than placebo, independent of lipid-level lowering. (Funded by Novartis; CANTOS ClinicalTrials.gov number, NCT01327846.)

Concepts: Immune system, Inflammation, Epidemiology, Myocardial infarction, Atherosclerosis, Cardiovascular disease, Statistical significance, C-reactive protein
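
A reported hazard ratio and its 95% confidence interval are enough to recover an approximate P value, which gives a quick consistency check on figures like those above. The sketch below uses the standard normal approximation on the log scale. The function name is hypothetical, and because the published ratios are rounded to two decimals, the recovered P values are only approximate.

```python
from math import erfc, log, sqrt


def p_from_ratio_ci(estimate: float, lower: float, upper: float,
                    z_crit: float = 1.959964) -> float:
    """Approximate two-sided P value from a ratio and its 95% CI.

    Assumes the log ratio is normally distributed, so the CI width on
    the log scale equals 2 * z_crit * standard error.
    """
    se = (log(upper) - log(lower)) / (2.0 * z_crit)
    z = log(estimate) / se
    return erfc(abs(z) / sqrt(2.0))


# Hazard ratios, CIs, and P values as reported in the abstract above.
reported = {
    "50 mg": (0.93, 0.80, 1.07, 0.30),
    "150 mg": (0.85, 0.74, 0.98, 0.021),
    "300 mg": (0.86, 0.75, 0.99, 0.031),
}
for dose, (hr, lo, hi, p_reported) in reported.items():
    p = p_from_ratio_ci(hr, lo, hi)
    print(f"{dose}: recovered P = {p:.3f}, reported P = {p_reported}")
```

The recovered values should land close to the reported ones. Small gaps are expected from rounding and do not by themselves indicate a reporting error.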


Background We evaluated whether rivaroxaban alone or in combination with aspirin would be more effective than aspirin alone for secondary cardiovascular prevention. Methods In this double-blind trial, we randomly assigned 27,395 participants with stable atherosclerotic vascular disease to receive rivaroxaban (2.5 mg twice daily) plus aspirin (100 mg once daily), rivaroxaban (5 mg twice daily), or aspirin (100 mg once daily). The primary outcome was a composite of cardiovascular death, stroke, or myocardial infarction. The study was stopped for superiority of the rivaroxaban-plus-aspirin group after a mean follow-up of 23 months. Results The primary outcome occurred in fewer patients in the rivaroxaban-plus-aspirin group than in the aspirin-alone group (379 patients [4.1%] vs. 496 patients [5.4%]; hazard ratio, 0.76; 95% confidence interval [CI], 0.66 to 0.86; P<0.001; z=-4.126), but major bleeding events occurred in more patients in the rivaroxaban-plus-aspirin group (288 patients [3.1%] vs. 170 patients [1.9%]; hazard ratio, 1.70; 95% CI, 1.40 to 2.05; P<0.001). There was no significant difference in intracranial or fatal bleeding between these two groups. There were 313 deaths (3.4%) in the rivaroxaban-plus-aspirin group as compared with 378 (4.1%) in the aspirin-alone group (hazard ratio, 0.82; 95% CI, 0.71 to 0.96; P=0.01; threshold P value for significance, 0.0025). The primary outcome did not occur in significantly fewer patients in the rivaroxaban-alone group than in the aspirin-alone group, but major bleeding events occurred in more patients in the rivaroxaban-alone group. Conclusions Among patients with stable atherosclerotic vascular disease, those assigned to rivaroxaban (2.5 mg twice daily) plus aspirin had better cardiovascular outcomes and more major bleeding events than those assigned to aspirin alone. Rivaroxaban (5 mg twice daily) alone did not result in better cardiovascular outcomes than aspirin alone and resulted in more major bleeding events. 
(Funded by Bayer; COMPASS ClinicalTrials.gov number, NCT01776424.)

Concepts: Myocardial infarction, Atherosclerosis, Cardiovascular disease, Stroke, Low-density lipoprotein, Statistical significance, Infarction


Over the past ten years, unconventional gas and oil drilling (UGOD) has markedly expanded in the United States. Despite substantial increases in well drilling, the health consequences of UGOD toxicant exposure remain unclear. This study examines an association between wells and healthcare use by zip code from 2007 to 2011 in Pennsylvania. Inpatient discharge databases from the Pennsylvania Healthcare Cost Containment Council were correlated with active wells by zip code in three counties in Pennsylvania. For overall inpatient prevalence rates and 25 specific medical categories, the associations of inpatient prevalence rates with number of wells per zip code and, separately, with wells per km² (separated into quantiles and defined as well density) were estimated using fixed-effects Poisson models. To account for multiple comparisons, a Bonferroni correction was applied, and associations with p<0.00096 were considered statistically significant. Cardiology inpatient prevalence rates were significantly associated with number of wells per zip code (p<0.00096) and wells per km² (p<0.00096) while neurology inpatient prevalence rates were significantly associated with wells per km² (p<0.00096). Furthermore, evidence supported an association between well density and inpatient prevalence rates for the medical categories of dermatology, neurology, oncology, and urology. These data suggest that UGOD wells, which dramatically increased in the past decade, were associated with increased inpatient prevalence rates within specific medical categories in Pennsylvania. Further studies are necessary to address healthcare costs of UGOD and determine whether specific toxicants or combinations are associated with organ-specific responses.

Concepts: Medicine, Statistics, Petroleum, Statistical significance, The Association, Multiple comparisons, Natural gas, Bonferroni correction
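
The Bonferroni threshold quoted above can be reproduced with simple arithmetic. The sketch below assumes 52 comparisons: 26 outcomes (the overall rate plus 25 medical categories), each tested against two exposure measures. That count is inferred from the abstract's description and is not stated in it, so it is labeled as an assumption in the code.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Assumption: 26 outcomes (overall + 25 categories) x 2 exposure measures.
N_TESTS = 26 * 2
threshold = bonferroni_threshold(0.05, N_TESTS)
print(f"{threshold:.5f}")
```

Under that assumed count, a family-wise alpha of 0.05 yields a per-test threshold matching the p<0.00096 cutoff in the abstract.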


This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.

Concepts: Statistics, Statistical significance, Ronald Fisher, Statistical hypothesis testing, P-value, Statistical power, Hypothesis testing, Counternull
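
The consistency check described above amounts to recomputing a p-value from the reported test statistic and comparing it with the reported p-value. The Python sketch below illustrates the idea for z statistics only, which keeps it free of external dependencies. It is not the statcheck implementation (statcheck is an R package), and the function names and rounding rule are assumptions of this illustration.

```python
from math import erfc, sqrt


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return erfc(abs(z) / sqrt(2.0))


def is_consistent(z: float, reported_p: float, relation: str = "=",
                  decimals: int = 3) -> bool:
    """Check a reported p-value against the one recomputed from z.

    relation is "<", ">", or "=". For "=", the recomputed value must
    match the reported one to within rounding at `decimals` places
    (an assumed tolerance for this sketch).
    """
    p = two_sided_p_from_z(z)
    if relation == "<":
        return p < reported_p
    if relation == ">":
        return p > reported_p
    return abs(p - reported_p) <= 0.5 * 10 ** (-decimals)


def is_gross_inconsistency(z: float, reported_p: float,
                           alpha: float = 0.05) -> bool:
    """True when reported and recomputed p fall on opposite sides of alpha."""
    return (two_sided_p_from_z(z) < alpha) != (reported_p < alpha)


# Illustrative inputs: a result reported as "z = -4.126, P < 0.001".
print(is_consistent(-4.126, 0.001, "<"))
print(is_gross_inconsistency(1.0, 0.03))
```

A fuller check would also cover t, F, and chi-square statistics with their degrees of freedom, which requires distribution functions beyond the Python standard library.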


What are the statistical practices of articles published in journals with a high impact factor? Are there differences compared with articles published in journals with a somewhat lower impact factor that have adopted editorial policies to reduce the impact of limitations of Null Hypothesis Significance Testing? To investigate these questions, the current study analyzed all articles related to psychological, neuropsychological and medical issues, published in 2011 in four journals with high impact factors: Science, Nature, The New England Journal of Medicine and The Lancet, and three journals with relatively lower impact factors: Neuropsychology, Journal of Experimental Psychology-Applied and the American Journal of Public Health. Results show that Null Hypothesis Significance Testing without any use of confidence intervals, effect size, prospective power and model estimation is the prevalent statistical practice in articles published in Nature (89%), followed by articles published in Science (42%). By contrast, in all other journals, both with high and lower impact factors, most articles report confidence intervals and/or effect size measures. We interpreted these differences as consequences of the editorial policies adopted by the journal editors, which are probably the most effective means to improve the statistical practices in journals with high or low impact factors.

Concepts: Statistics, Statistical significance, Ronald Fisher, Statistical hypothesis testing, Effect size, Impact factor, Statistical power, The Lancet


Objective To examine the effect of surgeon sex on postoperative outcomes of patients undergoing common surgical procedures. Design Population based, retrospective, matched cohort study from 2007 to 2015. Setting Population based cohort of all patients treated in Ontario, Canada. Participants Patients undergoing one of 25 surgical procedures performed by a female surgeon were matched by patient age, patient sex, comorbidity, surgeon volume, surgeon age, and hospital to patients undergoing the same operation by a male surgeon. Interventions Sex of treating surgeon. Main outcome measure The primary outcome was a composite of death, readmission, and complications. We compared outcomes between groups using generalised estimating equations. Results 104 630 patients were treated by 3314 surgeons, 774 female and 2540 male. Before matching, patients treated by female doctors were more likely to be female and younger but had similar comorbidity, income, rurality, and year of surgery. After matching, the groups were comparable. Fewer patients treated by female surgeons died, were readmitted to hospital, or had complications within 30 days (5810 of 52 315, 11.1%, 95% confidence interval 10.9% to 11.4%) than those treated by male surgeons (6046 of 52 315, 11.6%, 11.3% to 11.8%; adjusted odds ratio 0.96, 0.92 to 0.99, P=0.02). Patients treated by female surgeons were less likely to die within 30 days (adjusted odds ratio 0.88; 0.79 to 0.99, P=0.04), but there was no significant difference in readmissions or complications. Stratified analyses by patient, physician, and hospital characteristics did not significantly modify the effect of surgeon sex on outcome. A retrospective analysis showed no difference in outcomes by surgeon sex in patients who had emergency surgery, where patients do not usually choose their surgeon. Conclusions After accounting for patient, surgeon, and hospital characteristics, patients treated by female surgeons had a small but statistically significant decrease in 30 day mortality and similar surgical outcomes (length of stay, complications, and readmission), compared with those treated by male surgeons. These findings support the need for further examination of the surgical outcomes and mechanisms related to physicians and the underlying processes and patterns of care to improve mortality, complications, and readmissions for all patients.

Concepts: Male, Statistics, Hospital, Surgery, Statistical significance, Physician, Surgeon, American College of Surgeons