### Concept: Statistical inference

#### 666

##### Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates

- OPEN
- Proceedings of the National Academy of Sciences of the United States of America
- Published about 3 years ago

The most widely used task functional magnetic resonance imaging (fMRI) analyses use parametric statistical methods that depend on a variety of assumptions. In this work, we use real resting-state data and a total of 3 million random task group analyses to compute empirical familywise error rates for the fMRI software packages SPM, FSL, and AFNI, as well as a nonparametric permutation method. For a nominal familywise error rate of 5%, the parametric statistical methods are shown to be conservative for voxelwise inference and invalid for clusterwise inference. Our results suggest that the principal cause of the invalid cluster inferences is spatial autocorrelation functions that do not follow the assumed Gaussian shape. By comparison, the nonparametric permutation test is found to produce nominal results for voxelwise as well as clusterwise inference. These findings speak to the need of validating the statistical methods being used in the field of neuroimaging.
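The nonparametric approach mentioned in the abstract can be illustrated with a small max-statistic permutation test. This is only a sketch under simplifying assumptions: it compares two groups voxel by voxel using the absolute mean difference, and it does not implement clusterwise inference or the paper's actual pipeline. The function name `max_stat_permutation_test` and the toy data are hypothetical.

```python
import random


def max_stat_permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Familywise-error-corrected p-values for a two-group comparison.

    group_a, group_b: lists of subjects, each subject a list of per-voxel values.
    The null distribution is built from the maximum absolute mean difference
    across voxels under random relabelling of subjects.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = group_a + group_b
    n_vox = len(pooled[0])

    def abs_mean_diffs(subjects):
        a, b = subjects[:n_a], subjects[n_a:]
        return [
            abs(sum(s[v] for s in a) / len(a) - sum(s[v] for s in b) / len(b))
            for v in range(n_vox)
        ]

    observed = abs_mean_diffs(pooled)

    # Shuffle group labels and record the largest statistic over all voxels.
    max_null = []
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        max_null.append(max(abs_mean_diffs(shuffled)))

    # Corrected p-value: share of permutations whose maximum reaches the observed value.
    return [
        (1 + sum(m >= obs for m in max_null)) / (1 + n_perm)
        for obs in observed
    ]


# Hypothetical toy data: voxel 0 carries a large group effect, voxel 1 none.
toy_a = [[5.0 + 0.1 * i, 0.1 * i] for i in range(8)]
toy_b = [[0.1 * i, 0.1 * i] for i in range(8)]
p_values = max_stat_permutation_test(toy_a, toy_b)
```

Because every voxel is compared against the distribution of the maximum statistic, the test makes no assumption about the shape of the spatial autocorrelation function, which the abstract identifies as the weak point of the parametric cluster methods.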

#### 174

Data analysis is used to test the hypothesis that “hitting is contagious”. A statistical model is described to study the effect of a hot hitter upon his teammates' batting during a consecutive game hitting streak. Box score data for entire seasons comprising [Formula: see text] streaks of length [Formula: see text] games, including a total of [Formula: see text] observations, were compiled. Treatment and control sample groups ([Formula: see text]) were constructed from core lineups of players on the streaking batter’s team. The percentile method bootstrap was used to calculate [Formula: see text] confidence intervals for statistics representing differences in the mean distributions of two batting statistics between groups. Batters in the treatment group (hot streak active) showed statistically significant improvements in hitting performance, as compared against the control. Mean [Formula: see text] for the treatment group was found to be [Formula: see text] to [Formula: see text] percentage points higher during hot streaks (mean difference increased [Formula: see text] points), while the batting heat index [Formula: see text] introduced here was observed to increase by [Formula: see text] points. For each performance statistic, the null hypothesis was rejected at the [Formula: see text] significance level. We conclude that the evidence suggests the potential existence of a “statistical contagion effect”. Psychological mechanisms essential to the empirical results are suggested, as several studies from the scientific literature lend credence to contagious phenomena in sports. Causal inference from these results is difficult, but we suggest and discuss several latent variables that may contribute to the observed results, and offer possible directions for future research.
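The percentile bootstrap named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `percentile_bootstrap_ci`, the 95% level, the number of resamples, and the batting figures are all assumptions, since the abstract's actual values are elided.

```python
import random
import statistics


def percentile_bootstrap_ci(treatment, control, n_boot=5000, level=0.95, seed=0):
    """Percentile-method bootstrap CI for the difference in group means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        t = rng.choices(treatment, k=len(treatment))
        c = rng.choices(control, k=len(control))
        diffs.append(statistics.fmean(t) - statistics.fmean(c))
    diffs.sort()
    alpha = 1 - level
    # Read the interval endpoints directly off the sorted bootstrap distribution.
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical batting averages for the two groups (not data from the study).
treatment_avgs = [0.280 + 0.001 * i for i in range(50)]
control_avgs = [0.250 + 0.001 * i for i in range(50)]
ci_low, ci_high = percentile_bootstrap_ci(treatment_avgs, control_avgs)
```

If the resulting interval excludes zero, the difference would be called significant at the corresponding level, which matches the style of inference the abstract describes.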

#### 169

This systematic review and meta-analysis assessed the relationship between surgical delay and mortality in elderly patients with hip fracture, drawing on retrospective and prospective studies published from 1948 to 2011. The sources searched were Medline (from 1948), Embase (from 1974), CINAHL (from 1982), and the Cochrane Library. Odds ratios (OR) and 95% confidence intervals for each study were extracted and pooled with a random effects model. Heterogeneity and publication bias were assessed, and Bayesian and meta-regression analyses were performed. Criteria for inclusion were retrospective or prospective studies of elderly populations, patients with operated hip fractures, and an indication of timing of surgery and survival status.
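Pooling odds ratios with a random effects model can be sketched with the DerSimonian-Laird estimator. The abstract does not say which estimator the review used, so this choice is an assumption, as are the function name `pool_random_effects` and the example study values.

```python
import math


def pool_random_effects(studies, z=1.96):
    """DerSimonian-Laird pooling of odds ratios.

    studies: list of (odds_ratio, ci_low, ci_high) with 95% confidence limits.
    Returns (pooled_or, pooled_ci_low, pooled_ci_high, tau_squared).
    """
    # Work on the log scale; recover each standard error from the CI width.
    y = [math.log(or_) for or_, _, _ in studies]
    se = [(math.log(hi) - math.log(lo)) / (2 * z) for _, lo, hi in studies]
    w = [1 / s ** 2 for s in se]

    # Fixed-effect estimate and Cochran's Q heterogeneity statistic.
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    df = len(studies) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add the between-study variance to each study.
    w_star = [1 / (s ** 2 + tau2) for s in se]
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    pooled_se = math.sqrt(1 / sum(w_star))
    return (
        math.exp(pooled),
        math.exp(pooled - z * pooled_se),
        math.exp(pooled + z * pooled_se),
        tau2,
    )


# Hypothetical study results (not taken from the review).
example_studies = [(1.5, 1.0, 2.25), (1.5, 1.0, 2.25), (1.5, 1.0, 2.25)]
pooled_or, pooled_lo, pooled_hi, tau2 = pool_random_effects(example_studies)
```

With identical studies the between-study variance is zero and the pooled estimate equals the common odds ratio with a narrower interval. Heterogeneous inputs widen the interval through `tau2`.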

#### 164

##### Pulmonary Auscultation in the Operating Room: A Prospective Randomized Blinded Trial Comparing Electronic and Conventional Stethoscopes

- OPEN
- Anesthesia and analgesia
- Published almost 6 years ago

BACKGROUND: We compared the subjective quality of pulmonary auscultation between 2 acoustic stethoscopes (Holtex Ideal® and Littmann Cardiology III®) and an electronic stethoscope (Littmann 3200®) in the operating room. METHODS: A prospective double-blind randomized study with an evaluation during mechanical ventilation was performed in 100 patients. After each examination, the listeners using a numeric scale (0-10) rated the quality of auscultation. Auscultation quality was compared in patients among stethoscopes with a multilevel mixed-effects linear regression with random intercept (operator effect), adjusted on significant factors in univariate analysis. A significant difference was defined as P < 0.05. RESULTS: One hundred comparative evaluations of pulmonary auscultation were performed. The quality of auscultation was rated 8.2 ± 1.6 for the electronic stethoscope, 7.4 ± 1.8 for the Littmann Cardiology III, and 4.6 ± 1.8 for the Holtex Ideal. Compared with Holtex Ideal, auscultation quality was significantly higher with other stethoscopes (P < 0.0001). Compared with Littmann Cardiology III, auscultation quality was significantly higher with Littmann 3200 electronic stethoscope (β = 0.9 [95% confidence interval, 0.5-1.3]). CONCLUSIONS: An electronic stethoscope can provide a better quality of pulmonary auscultation than acoustic stethoscopes in the operating room, yet with a magnitude of improvement marginally higher than that provided with a high performance acoustic stethoscope. Whether this can translate into a clinically relevant benefit requires further studies.

#### 78

##### What exactly is ‘N’ in cell culture and animal experiments?

- OPEN
- PLoS biology
- Published over 1 year ago

Biologists determine experimental effects by perturbing biological entities or units. When done appropriately, independent replication of the entity-intervention pair contributes to the sample size (N) and forms the basis of statistical inference. If the wrong entity-intervention pair is chosen, an experiment cannot address the question of interest. We surveyed a random sample of published animal experiments from 2011 to 2016 where interventions were applied to parents and effects examined in the offspring, as regulatory authorities provide clear guidelines on replication with such designs. We found that only 22% of studies (95% CI = 17%-29%) replicated the correct entity-intervention pair and thus made valid statistical inferences. Nearly half of the studies (46%, 95% CI = 38%-53%) had pseudoreplication while 32% (95% CI = 26%-39%) provided insufficient information to make a judgement. Pseudoreplication artificially inflates the sample size, and thus the evidence for a scientific claim, resulting in false positives. We argue that distinguishing between biological units, experimental units, and observational units clarifies where replication should occur, describe the criteria for genuine replication, and provide concrete examples of in vitro, ex vivo, and in vivo experimental designs.
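The effect of pseudoreplication can be made concrete with a small sketch. It assumes a design like the one surveyed, where the intervention is applied to the parent and measurements are taken on offspring, so the litter is the experimental unit. The data, the helper names, and the use of a Welch t statistic are illustrative assumptions, not taken from the paper.

```python
import math
import statistics


def welch_t(x, y):
    """Welch's t statistic for two independent samples."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(
        vx / len(x) + vy / len(y)
    )


def litter_means(litters):
    """Collapse offspring measurements to one value per treated parent."""
    return [statistics.fmean(pups) for pups in litters]


# Hypothetical measurements: 3 litters per group, 3 pups per litter.
treated = [[10.1, 10.3, 10.2], [11.0, 11.2, 11.1], [9.0, 9.1, 9.2]]
control = [[9.8, 9.9, 10.0], [9.0, 9.2, 9.1], [10.5, 10.4, 10.6]]

# Pseudoreplicated analysis: every pup counted as an independent unit (N = 9 per group).
pooled_t = welch_t(
    [p for litter in treated for p in litter],
    [p for litter in control for p in litter],
)

# Analysis at the level of the experimental unit (N = 3 per group).
unit_t = welch_t(litter_means(treated), litter_means(control))
```

In this toy data set the pup-level statistic is larger in magnitude than the litter-level one, even though both use the same measurements. That is the inflation of evidence the abstract warns about.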

#### 76

##### Inferences About Sexual Orientation: The Roles of Stereotypes, Faces, and The Gaydar Myth

- OPEN
- Journal of sex research
- Published almost 4 years ago

In the present work, we investigated the pop cultural idea that people have a sixth sense, called “gaydar,” to detect who is gay. We propose that “gaydar” is an alternate label for using stereotypes to infer orientation (e.g., inferring that fashionable men are gay). Another account, however, argues that people possess a facial perception process that enables them to identify sexual orientation from facial structure. We report five experiments testing these accounts. Participants made gay-or-straight judgments about fictional targets that were constructed using experimentally manipulated stereotypic cues and real gay/straight people’s face cues. These studies revealed that orientation is not visible from the face: purportedly “face-based” gaydar arises from a third-variable confound. People do, however, readily infer orientation from stereotypic attributes (e.g., fashion, career). Furthermore, the folk concept of gaydar serves as a legitimizing myth: Compared to a control group, people stereotyped more often when led to believe in gaydar, whereas people stereotyped less when told gaydar is an alternate label for stereotyping. Discussion focuses on the implications of the gaydar myth and why, contrary to some prior claims, stereotyping is highly unlikely to result in accurate judgments about orientation.

#### 72

Can behavior be unconsciously primed via the activation of attitudes, stereotypes, or other concepts? A number of studies have suggested that such priming effects can occur, and a prominent illustration is the claim that individuals' accuracy in answering general knowledge questions can be influenced by activating intelligence-related concepts such as professor or soccer hooligan. In 9 experiments with 475 participants we employed the procedures used in these studies, as well as a number of variants of those procedures, in an attempt to obtain this intelligence priming effect. None of the experiments obtained the effect, although financial incentives did boost performance. A Bayesian analysis reveals considerable evidential support for the null hypothesis. The results conform to the pattern typically obtained in word priming experiments in which priming is very narrow in its generalization and unconscious (subliminal) influences, if they occur at all, are extremely short-lived. We encourage others to explore the circumstances in which this phenomenon might be obtained.

#### 71

##### New evidence on the impact of sustained exposure to air pollution on life expectancy from China’s Huai River Policy

- OPEN
- Proceedings of the National Academy of Sciences of the United States of America
- Published almost 2 years ago

This paper finds that a 10-μg/m³ increase in airborne particulate matter [particulate matter smaller than 10 μm (PM10)] reduces life expectancy by 0.64 years (95% confidence interval = 0.21-1.07). This estimate is derived from quasiexperimental variation in PM10 generated by China’s Huai River Policy, which provides free or heavily subsidized coal for indoor heating during the winter to cities north of the Huai River but not to those to the south. The findings are derived from a regression discontinuity design based on distance from the Huai River, and they are robust to using parametric and nonparametric estimation methods, different kernel types and bandwidth sizes, and adjustment for a rich set of demographic and behavioral covariates. Furthermore, the shorter lifespans are almost entirely caused by elevated rates of cardiorespiratory mortality, suggesting that PM10 is the causal factor. The estimates imply that bringing all of China into compliance with its Class I standards for PM10 would save 3.7 billion life-years.
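The idea behind a regression discontinuity design can be sketched in a few lines: fit a separate line on each side of the cutoff and compare the two fitted values at the cutoff itself. This is a deliberately simplified illustration with a uniform kernel and no covariates, not the paper's estimator. The function names, the bandwidth, and the synthetic data are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def rd_estimate(distance, outcome, bandwidth):
    """Jump in the outcome at distance = 0, using points within the bandwidth.

    distance: signed distance from the cutoff (positive = treated side).
    """
    left = [(d, y) for d, y in zip(distance, outcome) if -bandwidth <= d < 0]
    right = [(d, y) for d, y in zip(distance, outcome) if 0 <= d <= bandwidth]
    intercept_left, _ = fit_line(*zip(*left))
    intercept_right, _ = fit_line(*zip(*right))
    # The intercepts are the fitted outcomes at the cutoff from each side.
    return intercept_right - intercept_left


# Synthetic data: a smooth trend in distance plus a drop of 3 units at the cutoff.
distance = [i / 2 for i in range(-10, 11)]
life_expectancy = [75 - 0.1 * d - (3.0 if d >= 0 else 0.0) for d in distance]
jump = rd_estimate(distance, life_expectancy, bandwidth=4.0)
```

On this noise-free synthetic data the estimated jump recovers the built-in drop of 3 units. Real applications would vary the bandwidth and kernel, as the abstract notes when describing its robustness checks.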

#### 63

The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p ≤ 0.05) is hardly replicable: at a good statistical power of 80%, two studies will be ‘conflicting’, meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. 
Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
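The "one third of the cases" figure in the abstract follows from a short calculation: if each of two independent studies detects a true effect with probability equal to the power, the chance that exactly one of them is significant is 2 × power × (1 − power). The sketch below assumes independence and equal power for both studies. The function name `prob_conflicting` is illustrative.

```python
def prob_conflicting(power):
    """Probability that exactly one of two independent studies is significant,
    given a true effect and the same statistical power in both."""
    return 2 * power * (1 - power)


# At 80% power this comes to about 0.32, roughly one third.
conflict_at_80 = prob_conflicting(0.8)

# The chance of disagreement peaks at 50% power.
conflict_at_50 = prob_conflicting(0.5)
```

Even well-powered pairs of studies therefore disagree about significance quite often, which is why the abstract cautions against reading a nonsignificant replication as a failure.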

#### 61

A fundamental challenge in robotics today is building robots that can learn new skills by observing humans and imitating human actions. We propose a new Bayesian approach to robotic learning by imitation inspired by the developmental hypothesis that children use self-experience to bootstrap the process of intention recognition and goal-based imitation. Our approach allows an autonomous agent to: (i) learn probabilistic models of actions through self-discovery and experience, (ii) utilize these learned models for inferring the goals of human actions, and (iii) perform goal-based imitation for robotic learning and human-robot collaboration. Such an approach allows a robot to leverage its increasing repertoire of learned behaviors to interpret increasingly complex human actions and use the inferred goals for imitation, even when the robot has very different actuators from humans. We demonstrate our approach using two different scenarios: (i) a simulated robot that learns human-like gaze following behavior, and (ii) a robot that learns to imitate human actions in a tabletop organization task. In both cases, the agent learns a probabilistic model of its own actions, and uses this model for goal inference and goal-based imitation. We also show that the robotic agent can use its probabilistic model to seek human assistance when it recognizes that its inferred actions are too uncertain, risky, or impossible to perform, thereby opening the door to human-robot collaboration.
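The goal-inference step described in the abstract can be illustrated with a direct application of Bayes' rule over a small discrete set of goals. This is a toy sketch and not the authors' model: the goal and action names, the probability tables, and the confidence threshold used to decide when to ask for help are all hypothetical.

```python
def infer_goal(prior, likelihood, observed_action, confidence_threshold=0.7):
    """Posterior over goals given one observed action.

    prior: {goal: P(goal)}
    likelihood: {goal: {action: P(action | goal)}}, e.g. learned from self-experience
    Returns (posterior, most_likely_goal, ask_for_help).
    """
    # Bayes' rule: posterior is proportional to prior times likelihood.
    unnormalized = {
        goal: prior[goal] * likelihood[goal].get(observed_action, 0.0)
        for goal in prior
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observed action has zero probability under every goal")
    posterior = {goal: value / total for goal, value in unnormalized.items()}

    best_goal = max(posterior, key=posterior.get)
    # Assumed rule: request human assistance when the best guess is too uncertain.
    ask_for_help = posterior[best_goal] < confidence_threshold
    return posterior, best_goal, ask_for_help


# Hypothetical tabletop example with two goals and two actions.
goal_prior = {"stack": 0.5, "sort": 0.5}
action_model = {
    "stack": {"lift": 0.9, "slide": 0.1},
    "sort": {"lift": 0.3, "slide": 0.7},
}
posterior, best_goal, ask_for_help = infer_goal(goal_prior, action_model, "lift")
```

Here observing a "lift" shifts the posterior toward the "stack" goal. Lowering the likelihood gap between goals would push the best posterior below the threshold and trigger the request for assistance.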