SciCombinator


Concept: Outlier

170

BACKGROUND: To evaluate institutional nursing care performance in the context of national comparative statistics (benchmarks), approximately one in every three major healthcare institutions (over 1,800 hospitals) across the United States have joined the National Database for Nursing Quality Indicators® (NDNQI®). With over 18,000 hospital units contributing data for nearly 200 quantitative measures at present, reliable and efficient screening of input data for all quantitative measures is critical to the integrity, validity, and on-time delivery of NDNQI reports. METHODS: With Monte Carlo simulation and quantitative NDNQI indicator examples, we compared two ad-hoc methods using robust scale estimators, Inter Quartile Range (IQR) and Median Absolute Deviation from the Median (MAD), to the classic, theoretically-based Minimum Covariance Determinant (FAST-MCD) approach, for initial univariate outlier detection. RESULTS: While the theoretically based FAST-MCD used in one dimension can be sensitive and is better suited for identifying groups of outliers because of its high breakdown point, the ad-hoc IQR and MAD approaches are fast, easy to implement, and could be more robust and efficient, depending on the distributional property of the underlying measure of interest. CONCLUSION: With highly skewed distributions for most NDNQI indicators within a short data screen window, the FAST-MCD approach, when used on one-dimensional raw data, could produce higher false alarm rates for potential outliers than the IQR and MAD approaches at the same preset critical value, and thus overburden data quality control at both the data entry and administrative ends in our setting.

Concepts: Median, Dimension, Absolute deviation, Normal distribution, Standard deviation, Robust statistics, Outlier, Median absolute deviation
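The IQR and MAD screens named in the abstract above can be illustrated with a minimal sketch. The function names, the fence multiplier `k=1.5`, the cutoff of 3, and the sample data are assumptions for illustration and are not taken from the NDNQI study; the factor 1.4826 is the usual constant that makes the MAD comparable to a standard deviation under normality.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def mad_outliers(values, cutoff=3.0):
    """Return values whose robust z-score |x - median| / (1.4826 * MAD) exceeds cutoff."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values are identical; the robust score is undefined.
        return []
    scale = 1.4826 * mad
    return [v for v in values if abs(v - med) / scale > cutoff]

# Hypothetical unit-level measurements with one suspicious entry.
data = [10, 12, 11, 13, 12, 11, 10, 95]
iqr_flags = iqr_outliers(data)
mad_flags = mad_outliers(data)
print(iqr_flags, mad_flags)
```

Both rules depend only on order statistics, which is why they stay cheap to run across many measures; how they behave on skewed indicators is the question the abstract addresses.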

168

BACKGROUND: Cancer outlier profile analysis (COPA) has proven to be an effective approach to analyzing cancer expression data, leading to the discovery of the TMPRSS2 and ETS family gene fusion events in prostate cancer. However, the original COPA algorithm did not identify down-regulated outliers, and the currently available R package implementing the method is similarly restricted to the analysis of over-expressed outliers. Here we present a modified outlier detection method, mCOPA, which contains refinements to the outlier-detection algorithm, identifies both over- and under-expressed outliers, is freely available, and can be applied to any expression dataset. RESULTS: We compare our method to other feature-selection approaches, and demonstrate that mCOPA frequently selects more-informative features than do differential expression or variance-based feature selection approaches, and is able to recover observed clinical subtypes more consistently. We demonstrate the application of mCOPA to prostate cancer expression data, and explore the use of outliers in clustering, pathway analysis, and the identification of tumour suppressors. We analyse the under-expressed outliers to identify known and novel prostate cancer tumour suppressor genes, validating these against data in Oncomine and the Cancer Gene Index. We also demonstrate how a combination of outlier analysis and pathway analysis can identify molecular mechanisms disrupted in individual tumours. CONCLUSIONS: We demonstrate that mCOPA offers advantages, compared to differential expression or variance, in selecting outlier features, and that the features so selected are better able to assign samples to clinically annotated subtypes. Further, we show that the biology explored by outlier analysis differs from that uncovered in differential expression or variance analysis. 
mCOPA is an important new tool for the exploration of cancer datasets and the discovery of new cancer subtypes, and can be combined with pathway and functional analysis approaches to discover mechanisms underpinning heterogeneity in cancers.

Concepts: Gene expression, Cancer, Oncology, Prostate cancer, Tumor, Tumor suppressor gene, Normal distribution, Outlier
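As a rough illustration of the kind of transformation COPA-style methods rely on, the sketch below centers one gene's expression values on the median, scales by the MAD, and reports samples in either tail. It is a generic sketch, not the mCOPA implementation; the function names, the cutoff of 5, and the expression values are assumptions.

```python
import statistics

def copa_transform(expr):
    """Median-center and MAD-scale one gene's expression values across samples."""
    med = statistics.median(expr)
    mad = statistics.median([abs(x - med) for x in expr])
    if mad == 0:
        raise ValueError("MAD is zero; the gene shows no robust spread")
    return [(x - med) / mad for x in expr]

def outlier_samples(expr, cutoff=5.0):
    """Return (over, under): sample indices beyond +cutoff and below -cutoff."""
    scores = copa_transform(expr)
    over = [i for i, s in enumerate(scores) if s > cutoff]
    under = [i for i, s in enumerate(scores) if s < -cutoff]
    return over, under

# Hypothetical expression of one gene across eight tumour samples.
gene = [5.0, 5.2, 4.9, 5.1, 5.0, 12.0, 4.8, 0.5]
over, under = outlier_samples(gene)
print("over-expressed samples:", over, "under-expressed samples:", under)
```

Checking the lower tail as well as the upper one reflects the extension the abstract describes, namely reporting under-expressed outliers in addition to over-expressed ones.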

28

Water quality controls involve a large number of variables and observations, often subject to some outliers. An outlier is an observation that is numerically distant from the rest of the data or that appears to deviate markedly from other members of the sample in which it occurs. An interesting analysis is to find those observations that produce measurements that are different from the pattern established in the sample. Therefore, identification of atypical observations is an important concern in water quality monitoring and a difficult task because of the multivariate nature of water quality data. Our study provides a new method for detecting outliers in water quality monitoring parameters, using oxygen and turbidity as indicator variables. Until now, methods were based on considering the different parameters as a vector whose components were their concentration values. Our approach lies in considering water quality monitoring through time as curves instead of vectors, that is to say, the data set of the problem is considered as a time-dependent function and not as a set of discrete values at different time instants. The methodology, which is based on the concept of functional depth, was applied to the detection of outliers in water quality monitoring samples in San Esteban estuary. Results were discussed in terms of origin, causes, etc., and compared with those obtained using the conventional method based on vector comparison. Finally, the advantages of the functional method are presented.

Concepts: Scientific method, Water, Water pollution, Data, Observation, Data analysis, Water quality, Outlier
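To make the idea of functional depth concrete, here is a minimal sketch using the Fraiman-Muniz integrated depth, one of several depth notions for curves. The abstract does not say which depth the study used, so this choice, the function name, and the sample curves are all assumptions. Each curve is treated as a whole: its pointwise depth is averaged over time, and the curves with the lowest depth are the outlier candidates.

```python
def fraiman_muniz_depth(curves):
    """Average over time of the pointwise depth 1 - |1/2 - F_t(x(t))|.

    `curves` is a list of equal-length lists sampled on a common time grid;
    F_t is the empirical distribution of all curve values at time t.
    """
    n = len(curves)
    n_times = len(curves[0])
    depths = []
    for curve in curves:
        total = 0.0
        for t in range(n_times):
            ecdf = sum(other[t] <= curve[t] for other in curves) / n
            total += 1.0 - abs(0.5 - ecdf)
        depths.append(total / n_times)
    return depths

# Hypothetical daily turbidity curves; the last one sits far above the rest.
curves = [
    [1.00, 2.00, 3.00],
    [1.10, 2.10, 3.10],
    [0.90, 1.90, 2.90],
    [1.05, 2.05, 3.05],
    [5.00, 6.00, 7.00],
]
depths = fraiman_muniz_depth(curves)
least_deep = depths.index(min(depths))
print(depths, least_deep)
```

A practical screen would also need a cutoff on depth below which a curve is declared atypical; that step is left out here.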

7

Controlling for background demographic effects is important for accurately identifying loci that have recently undergone positive selection. To date, the effects of demography have not yet been explicitly considered when identifying loci under selection during dog domestication. To investigate positive selection on the dog lineage early in domestication, we examined patterns of polymorphism in six canid genomes that were previously used to infer a demographic model of dog domestication. Using an inferred demographic model, we computed false discovery rates (FDR) and identified 349 outlier regions consistent with positive selection at a low FDR. The signals in the top 100 regions were frequently centered on candidate genes related to brain function and behavior, including LHFPL3, CADM2, GRIK3, SH3GL2, MBP, PDE7B, NTAN1, and GLRA1. These regions contained significant enrichments in behavioral ontology categories. The 3rd top hit, CCRN4L, plays a major role in lipid metabolism, a role supported by additional metabolism-related candidates revealed in our scan, including SCP2D1 and PDXC1. Comparing our method to an empirical outlier approach that does not directly account for demography, we found only modest overlaps between the two methods, with 60% of empirical outliers having no overlap with our demography-based outlier detection approach. Demography-aware approaches have lower rates of false discovery. Our top candidates for selection, in addition to expanding the set of neurobehavioral candidate genes, include genes related to lipid metabolism, suggesting a dietary target of selection that was important during the period when proto-dogs hunted and fed alongside hunter-gatherers.

Concepts: Protein, Genome, Logic, Reasoning, Dog, Inference, Balancing selection, Outlier
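One common way to attach a false discovery rate to an outlier threshold is to compare the share of observed windows exceeding it with the share expected under a null model, for example from simulations under a demographic model. The sketch below shows that ratio only; it is not the procedure from the study above, and the function name, statistics, and thresholds are hypothetical.

```python
def empirical_fdr(observed, simulated_null, threshold):
    """Estimate FDR at `threshold` as (null exceedance rate) / (observed exceedance rate).

    Returns None when no observed statistic reaches the threshold.
    """
    obs_rate = sum(s >= threshold for s in observed) / len(observed)
    null_rate = sum(s >= threshold for s in simulated_null) / len(simulated_null)
    if obs_rate == 0:
        return None
    return min(1.0, null_rate / obs_rate)

# Hypothetical per-window selection statistics.
observed = [1, 2, 3, 8, 9, 10, 2, 1, 3, 2]
simulated_null = [1, 2, 3, 2, 1, 3, 2, 1, 2, 8,
                  1, 2, 3, 2, 1, 3, 2, 1, 2, 3]
fdr = empirical_fdr(observed, simulated_null, threshold=8)
print(fdr)
```

A purely empirical outlier approach skips the null simulations and simply takes the top tail of the observed statistics, which is the contrast the abstract draws.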

4

Multiple sequence alignments (MSA) are widely used in sequence analysis for a variety of tasks. Outlier sequences can make downstream analyses unreliable or make the alignments less accurate while they are being constructed. This paper describes a simple method for automatically detecting outliers and accompanying software called OD-seq. It is based on finding sequences whose average distance to the rest of the sequences in a dataset is anomalous.

Concepts: Series, Mathematical analysis, Sequence, Outlier
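The idea of flagging sequences by their average distance to all others can be sketched in a few lines. This is not the OD-seq code: the normalized Hamming distance, the IQR-based cutoff, the function names, and the toy alignment are assumptions chosen to keep the example short.

```python
import statistics

def hamming(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def mean_distances(seqs):
    """Average distance from each sequence to every other sequence."""
    n = len(seqs)
    return [
        sum(hamming(seqs[i], seqs[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def outlier_sequences(seqs, k=1.5):
    """Indices whose mean distance exceeds Q3 + k*IQR of all mean distances."""
    dists = mean_distances(seqs)
    q1, _, q3 = statistics.quantiles(dists, n=4, method="inclusive")
    upper = q3 + k * (q3 - q1)
    return [i for i, d in enumerate(dists) if d > upper]

# Hypothetical aligned sequences; the last one is unrelated to the others.
alignment = [
    "ACGTACGT",
    "ACGTACGA",
    "ACGTACGT",
    "ACGAACGT",
    "ACGTACGT",
    "TTTTGGGG",
]
flagged = outlier_sequences(alignment)
print(flagged)
```

Real alignments contain gaps and need a distance that handles them, so the Hamming distance here stands in for whatever metric a production tool would use.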

2

The discrete data structure and large sequencing depth of RNA sequencing (RNA-seq) experiments can often generate outlier read counts in one or more RNA samples within a homogeneous group. Thus, how to identify and manage outlier observations in RNA-seq data is an emerging topic of interest. One of the main objectives in these research efforts is to develop statistical methodology that effectively balances the impact of outlier observations and achieves maximal power for statistical testing. To reach that goal, strengthening the accuracy of outlier detection is an important precursor. Current outlier detection algorithms for RNA-seq data are executed within a testing framework and may be sensitive to sparse data and heavy-tailed distributions. Therefore, we propose a univariate algorithm that utilizes a probabilistic approach to measure the deviation between an observation and the distribution generating the remaining data and implement it within an iterative leave-one-out design strategy. Analyses of real and simulated RNA-seq data show that the proposed methodology has higher outlier detection rates for both non-normalized and normalized negative binomial distributed data.

Concepts: Scientific method, Algorithm, Statistics, Molecular biology, Data, Probability theory, Distribution, Outlier
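A leave-one-out screen of the general kind described above might look like the following sketch. It is not the published algorithm: it makes a single pass rather than iterating, fits a negative binomial to the remaining counts by the method of moments (falling back to a Poisson when they are not overdispersed), and tests only the upper tail. The function names, `alpha`, and the counts are assumptions.

```python
import math
import statistics

def upper_tail(x, mean, var):
    """P(X >= x) under a count model fitted by moments to (mean, var)."""
    if mean <= 0:
        return 1.0 if x <= 0 else 0.0
    if var <= mean:
        # Not overdispersed: use a Poisson with the same mean.
        log_pmf = lambda k: -mean + k * math.log(mean) - math.lgamma(k + 1)
    else:
        p = mean / var                    # success probability
        r = mean * mean / (var - mean)    # dispersion (size) parameter
        log_pmf = lambda k: (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
                             + r * math.log(p) + k * math.log1p(-p))
    cdf_below = sum(math.exp(log_pmf(k)) for k in range(x))
    return max(0.0, 1.0 - cdf_below)

def leave_one_out_outliers(counts, alpha=0.01):
    """Flag index i when counts[i] is improbably large given the other counts."""
    flagged = []
    for i, x in enumerate(counts):
        rest = counts[:i] + counts[i + 1:]
        mean, var = statistics.mean(rest), statistics.variance(rest)
        if upper_tail(x, mean, var) < alpha:
            flagged.append(i)
    return flagged

# Hypothetical read counts for one gene across eight replicates.
counts = [48, 52, 50, 47, 53, 400, 51, 49]
flagged = leave_one_out_outliers(counts)
print(flagged)
```

Leaving the tested observation out of the fit matters here: when the count of 400 is included, it inflates the estimated variance so much that nothing looks unusual.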

1

As Bayesian methods become more popular among behavioral scientists, they will inevitably be applied in situations that violate the assumptions underpinning typical models used to guide statistical inference. With this in mind, it is important to know something about how robust Bayesian methods are to the violation of those assumptions. In this paper, we focus on the problem of contaminated data (such as data with outliers or conflicts present), with specific application to the problem of estimating a credible interval for the population mean. We evaluate five Bayesian methods for constructing a credible interval, using toy examples to illustrate the qualitative behavior of different approaches in the presence of contaminants, and an extensive simulation study to quantify the robustness of each method. We find that the “default” normal model used in most Bayesian data analyses is not robust, and that approaches based on the Bayesian bootstrap are only robust in limited circumstances. A simple parametric model based on Tukey’s “contaminated normal model” and a model based on the t-distribution were markedly more robust. However, the contaminated normal model had the added benefit of estimating which data points were discounted as outliers and which were not.

Concepts: Scientific method, Statistics, Median, Data analysis, Confidence interval, Normal distribution, Statistical inference, Outlier
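The way a contaminated normal model can label individual points as outliers may be easier to see in a small sketch. Here the mixture parameters are fixed by hand rather than estimated, so this shows only the final classification step and not the Bayesian inference the paper evaluates. The function name, the contamination share `eps`, and the scale inflation `k` are assumptions.

```python
from statistics import NormalDist

def contaminant_probability(x, mu=0.0, sigma=1.0, eps=0.1, k=10.0):
    """Posterior probability that x came from the wide 'contaminant' component.

    Model: with probability 1 - eps, x ~ Normal(mu, sigma);
           with probability eps,     x ~ Normal(mu, k * sigma).
    """
    clean = (1.0 - eps) * NormalDist(mu, sigma).pdf(x)
    wide = eps * NormalDist(mu, k * sigma).pdf(x)
    return wide / (clean + wide)

# Points near the centre are attributed to the clean component,
# distant points to the contaminant component.
for x in (0.0, 2.0, 6.0):
    print(x, round(contaminant_probability(x), 4))
```

In a full analysis these probabilities would be averaged over the posterior of the model parameters rather than computed at fixed values.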

1

Designing a good scatterplot can be difficult for non-experts in visualization, because they need to decide on many parameters, such as marker size and opacity, aspect ratio, color, and rendering order. This paper contributes to research exploring the use of perceptual models and quality metrics to set such parameters automatically for enhanced visual quality of a scatterplot. A key consideration in this paper is the construction of a cost function to capture several relevant aspects of the human visual system, examining a scatterplot design for some data analysis task. We show how the cost function can be used in an optimizer to search for the optimal visual design for a user’s dataset and task objectives (e.g., “reliable linear correlation estimation is more important than class separation”). The approach is extensible to different analysis tasks. To test its performance in a realistic setting, we pre-calibrated it for correlation estimation, class separation, and outlier detection. The optimizer was able to produce designs that achieved a level of speed and success comparable to that of those using human-designed presets (e.g., in R or MATLAB). Case studies demonstrate that the approach can adapt a design to the data, to reveal patterns without user intervention.

Concepts: Mathematics, Data, Visual system, Operations research, Design, Optimization, Graphic design, Outlier

1

Antimicrobial stewardship programs (ASP) are a key national initiative to promote appropriate use of antibiotics, and to reduce the burden of resistance. The dilemma of managing the outlier physician is especially complex. We outline strategies to establish a successful ASP that reviews appropriate efforts to achieve the goal of modifying the outlier physician's behavior. One must try to differentiate deviation from ASP norms from all other issues of outliers. Essential elements include identifying and understanding the local problems, planning, and achieving hospital administration and medical staff support. A successful ASP includes effective communication and acceptance of evidence-based recommendations, so that patient clinical outcomes will be optimized.

Concepts: Medicine, Patient, Hospital, Physician, Management, Problem solving, Project management, Outlier

1

The development of modern crops typically involves both selection and hybridization, but to date most studies have focused on the former. In the present study, we explore how both processes, and their interactions, have molded the genome of the cultivated sunflower (Helianthus annuus), a globally important oilseed. To identify genes targeted by selection during the domestication and improvement of sunflower, and to detect post-domestication hybridization with wild species, we analyzed transcriptome sequences of 80 genotypes, including wild, landrace, and modern lines of H. annuus, as well as two cross-compatible wild relatives, Helianthus argophyllus and Helianthus petiolaris. Outlier analyses identified 122 and 15 candidate genes associated with domestication and improvement, respectively. As in several previous studies, genes putatively involved in oil biosynthesis were the most extreme outliers. Additionally, several promising associations were observed with previously mapped quantitative trait loci (QTLs), such as branching. Admixture analyses revealed that all the modern cultivar genomes we examined contained one or more introgressions from wild populations, with every chromosome having evidence of introgression in at least one modern line. Cumulatively, introgressions cover c. 10% of the cultivated sunflower genome. Surprisingly, introgressions do not avoid candidate domestication genes, probably because of the reintroduction of branching.

Concepts: DNA, Gene, Genetics, Genome, RNA, Quantitative trait locus, Sunflower, Outlier