SciCombinator

Discover the most talked about and latest scientific content & concepts.

Concept: Data set

29

The location and persistence of surface water (inland and coastal) are both affected by climate and human activity and affect climate, biological diversity and human wellbeing. Global data sets documenting surface water location and seasonality have been produced from inventories and national descriptions, statistical extrapolation of regional data and satellite imagery, but measuring long-term changes at high resolution remains a challenge. Here, using three million Landsat satellite images, we quantify changes in global surface water over the past 32 years at 30-metre resolution. We record the months and years when water was present, where occurrence changed and what form changes took in terms of seasonality and persistence. Between 1984 and 2015 permanent surface water has disappeared from an area of almost 90,000 square kilometres, roughly equivalent to that of Lake Superior, though new permanent bodies of surface water covering 184,000 square kilometres have formed elsewhere. All continental regions show a net increase in permanent water, except Oceania, which has a fractional (one per cent) net loss. Much of the increase is from reservoir filling, although climate change is also implicated. Loss is more geographically concentrated than gain. Over 70 per cent of global net permanent water loss occurred in the Middle East and Central Asia, linked to drought and human actions including river diversion or damming and unregulated withdrawal. Losses in Australia and the USA linked to long-term droughts are also evident. This globally consistent, validated data set shows that impacts of climate change and climate oscillations on surface water occurrence can be measured and that evidence can be gathered to show how surface water is altered by human activities. We anticipate that this freely available data will improve the modelling of surface forcing, provide evidence of state and change in wetland ecotones (the transition areas between biomes), and inform water-management decision-making.

Concepts: Precipitation, Climate, Hydrology, Data set, Middle East, Change, Human behavior, Remote sensing
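
A minimal sketch (not the authors' Landsat processing chain) of how per-pixel occurrence and seasonality statistics of the kind described above could be derived from a stack of monthly binary water masks. The array layout and the `water_statistics` helper are hypothetical assumptions for illustration.

```python
# Minimal sketch (not the authors' Global Surface Water processing chain):
# derive per-pixel occurrence and seasonality statistics from a stack of
# monthly binary water masks. `masks` is assumed to be a (months, rows, cols)
# uint8 array where 1 = water detected and 0 = no water.
import numpy as np

def water_statistics(masks: np.ndarray, months_per_year: int = 12):
    n_months = masks.shape[0]
    # Occurrence: fraction of observed months in which water was present.
    occurrence = masks.sum(axis=0) / n_months

    # Seasonality: per year, count months with water; a pixel is "permanent"
    # in a year if water is present in all months, "seasonal" if in some.
    years = masks.reshape(-1, months_per_year, *masks.shape[1:])
    months_with_water = years.sum(axis=1)                  # (years, rows, cols)
    permanent_years = (months_with_water == months_per_year).sum(axis=0)
    seasonal_years = ((months_with_water > 0) &
                      (months_with_water < months_per_year)).sum(axis=0)
    return occurrence, permanent_years, seasonal_years

# Example with synthetic data for a 32-year record on a tiny grid.
masks = (np.random.rand(32 * 12, 4, 4) > 0.5).astype(np.uint8)
occ, perm, seas = water_statistics(masks)
print(occ.round(2), perm, seas, sep="\n")
```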

29

The use of computational modeling algorithms to guide the design of novel enzyme catalysts is a rapidly growing field. Force-field-based methods have now been used to engineer both enzyme specificity and activity. However, the proportion of designed mutants with the intended function is often less than ten percent. One potential reason for this is that current force-field-based approaches are trained on indirect measures of function rather than on direct correlation to experimentally determined functional effects of mutations. We hypothesize that this is partially due to the lack of data sets for which a large panel of enzyme variants has been produced, purified, and kinetically characterized. Here we report the kcat and KM values of 100 purified mutants of a glycoside hydrolase enzyme. We demonstrate the utility of this data set by using machine learning to train a new algorithm that enables prediction of each kinetic parameter based on readily modeled structural features. The generated dataset and analyses carried out in this study not only provide insight into how this enzyme functions but also provide a clear path forward for the improvement of computational enzyme redesign algorithms.

Concepts: Scientific method, Algorithm, Enzyme, Data, Data set, Correlation and dependence, Machine learning
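
The abstract does not specify the learning algorithm, so the following is only a generic sketch of the idea: fit a regression model that predicts a log-scaled kinetic parameter from modeled structural features of each variant. The random-forest choice, the feature set and the synthetic data are assumptions, not the authors' method.

```python
# Generic sketch (not the authors' trained model): predict a log-transformed
# kinetic parameter (e.g. kcat or KM) from structural features of enzyme
# variants using a scikit-learn regressor. Features here are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical feature matrix: one row per purified mutant, columns such as
# modeled active-site volume change, interface energy, distance to the
# catalytic residue, etc. Filled with random values for illustration only.
n_mutants, n_features = 100, 6
X = rng.normal(size=(n_mutants, n_features))
log_kcat = X @ rng.normal(size=n_features) + rng.normal(scale=0.3, size=n_mutants)

model = RandomForestRegressor(n_estimators=300, random_state=0)
scores = cross_val_score(model, X, log_kcat, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean().round(3))
```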

28

We present SHARE, a new system for statistical health information release with differential privacy. Two case studies evaluating the software on real medical datasets demonstrate the feasibility and utility of applying the differential privacy framework to biomedical data.

Concepts: Health care, Scientific method, Mathematics, Evaluation methods, Data, Data set, Computer program, System software
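
SHARE itself is not described in detail in the abstract; the sketch below only illustrates the basic building block of differentially private count release, the Laplace mechanism, applied to a hypothetical histogram of patient counts.

```python
# Minimal sketch of the Laplace mechanism, a standard building block of
# differentially private count release (not the SHARE system itself):
# each histogram count has sensitivity 1, so adding Laplace(1/epsilon)
# noise yields an epsilon-differentially private release.
import numpy as np

def dp_histogram(counts, epsilon: float, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(counts))
    return np.asarray(counts, dtype=float) + noise

# Example: true counts of patients per diagnosis code, released with eps = 0.5.
true_counts = [120, 45, 8, 310]
print(dp_histogram(true_counts, epsilon=0.5, rng=np.random.default_rng(1)))
```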

28

Motivation: Image non-uniformity (NU) refers to systematic, slowly varying spatial gradients in images that result in a bias that can affect all downstream image processing, quantification and statistical analysis steps. Image NU is poorly modeled in the field of high-content screening (HCS), however, and as a result current correction algorithms may be inappropriate for HCS or may fail to take advantage of the information available in HCS image data. Results: A novel image NU bias correction algorithm, termed intensity quantile estimation and mapping (IQEM), is described. The algorithm estimates the full non-linear form of the image NU bias by mapping pixel intensities to a reference intensity quantile function. IQEM accounts for the variation in NU bias over broad cell intensity ranges and data acquisition times, both of which are characteristic of HCS image datasets. Validation of the method, using simulated and HCS microtubule polymerization screen images, is presented. Two requirements of IQEM are that the dataset consists of large numbers of images acquired under identical conditions and that cells are distributed with no within-image spatial preference. Availability and implementation: MATLAB function files are available at http://nadon-mugqic.mcgill.ca/. Contact: robert.nadon@mcgill.ca. Supplementary information: Supplementary data are available at Bioinformatics online.

Concepts: Statistics, Mathematics, Optics, Data, Data set, Computer graphics
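
The published IQEM implementation is the MATLAB code linked above; the following is only a small numpy sketch of the core idea of quantile mapping, i.e. re-expressing pixel intensities through a reference intensity quantile function. The `quantile_map` helper and the synthetic bias are illustrative assumptions.

```python
# Minimal sketch of quantile mapping (not the published IQEM MATLAB code):
# pixel intensities from a biased image region are mapped onto a reference
# intensity quantile function, removing a smooth bias while preserving ranks.
import numpy as np

def quantile_map(values: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Map `values` so that their empirical quantiles match `reference`."""
    ranks = values.argsort().argsort()               # 0..n-1 rank of each value
    quantiles = (ranks + 0.5) / values.size          # empirical quantile in (0, 1)
    grid = np.linspace(0, 1, 1001)
    ref_quantile_fn = np.quantile(reference, grid)   # reference quantile function
    return np.interp(quantiles, grid, ref_quantile_fn)

# Example: a biased region (scaled and shifted intensities) mapped back onto
# the intensity distribution of an unbiased reference region.
rng = np.random.default_rng(2)
reference = rng.gamma(shape=2.0, scale=100.0, size=5000)
biased = 0.7 * rng.gamma(shape=2.0, scale=100.0, size=5000) + 30.0
corrected = quantile_map(biased, reference)
print(np.median(biased).round(1), np.median(corrected).round(1),
      np.median(reference).round(1))
```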

28

Many case-control tests of rare variation are implemented in statistical frameworks that make correction for confounders like population stratification difficult. Simple permutation of disease status is unacceptable for resolving this issue because the replicate data sets do not have the same confounding as the original data set. These limitations make it difficult to apply rare-variant tests to samples in which confounding most likely exists, e.g., samples collected from admixed populations. To enable the use of such rare-variant methods in structured samples, as well as to facilitate permutation tests for any situation in which case-control tests require adjustment for confounding covariates, we propose to establish the significance of a rare-variant test via a modified permutation procedure. Our procedure uses Fisher’s noncentral hypergeometric distribution to generate permuted data sets with the same structure present in the actual data set such that inference is valid in the presence of confounding factors. We use simulated sequence data based on coalescent models to show that our permutation strategy corrects for confounding due to population stratification that, if ignored, would otherwise inflate the size of a rare-variant test. We further illustrate the approach by using sequence data from the Dallas Heart Study of energy metabolism traits. Researchers can implement our permutation approach by using the R package BiasedUrn.

Concepts: Experimental design, Statistics, Data set, Confounding, Case-control study, Covariate, Permutation, Hypergeometric distribution
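
The authors implement their procedure with the multivariate Fisher noncentral hypergeometric sampler in the R package BiasedUrn, using per-individual odds. As a simplified illustration of the same idea, the sketch below uses SciPy's univariate `nchypergeom_fisher` (SciPy >= 1.6) to permute case labels across two subpopulations while preserving a specified case/ancestry odds ratio; the two-group setting and the helper function are assumptions, not the paper's code.

```python
# Simplified illustration of structure-preserving permutation: draw how many
# permuted "case" labels fall in subpopulation A from Fisher's noncentral
# hypergeometric distribution, so replicate data sets keep the same
# case/ancestry association as the observed data.
import numpy as np
from scipy.stats import nchypergeom_fisher   # requires SciPy >= 1.6

def permute_case_labels(n_A, n_B, n_cases, odds, rng):
    """Return a boolean case indicator of length n_A + n_B (A first, then B)."""
    M = n_A + n_B                                    # total sample size
    # Number of cases assigned to subpopulation A under Fisher's model.
    cases_in_A = nchypergeom_fisher.rvs(M, n_A, n_cases, odds, random_state=rng)
    labels = np.zeros(M, dtype=bool)
    labels[rng.choice(n_A, size=cases_in_A, replace=False)] = True
    labels[n_A + rng.choice(n_B, size=n_cases - cases_in_A, replace=False)] = True
    return labels

rng = np.random.default_rng(3)
perm = permute_case_labels(n_A=400, n_B=600, n_cases=300, odds=1.8, rng=rng)
print(perm.sum(), perm[:400].sum())   # total cases, cases in subpopulation A
```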

27

Discretization of a continuous-valued symptom (attribute) in a medical data set is a crucial preprocessing step for the medical classification task. This paper proposes a supportive attribute-assisted discretization (SAAD) model for medical diagnostic problems. The intent of this approach is to discover the best supportive symptom, the one that correlates most closely with the continuous-valued symptom being discretized, and to conduct the discretization using the information that this supportive symptom provides, because we hypothesize that a good discretization scheme should rely heavily on the interaction between a continuous-valued attribute, its supportive attribute and the class attribute. SAAD considers each continuous-valued symptom differently and intelligently, which allows it to minimize information loss and data uncertainty and hence yields higher classification accuracy. Empirical experiments using ten real-life datasets from the UCI repository were conducted to compare the classification accuracy achieved by several widely used classifiers with SAAD and with other state-of-the-art discretization approaches. The experimental results demonstrate the effectiveness and usefulness of the proposed approach in enhancing diagnostic accuracy.

Concepts: Mathematics, Economics, Data set, Empiricism, Experiment, Proposal, Conducting, Discretization
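
The abstract does not give the SAAD algorithm itself, so the sketch below shows only a conventional baseline it builds on: supervised discretization of a continuous symptom by maximizing information gain against the class attribute. The supportive-attribute interaction that distinguishes SAAD is deliberately omitted, and all names and data are illustrative.

```python
# Baseline sketch of supervised discretization by class entropy (information
# gain). This is NOT the SAAD algorithm: SAAD additionally conditions the cut
# points on a correlated "supportive" symptom, which is omitted here.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_cut(values, labels):
    """Threshold on a continuous symptom that maximizes information gain
    with respect to the class attribute."""
    order = np.argsort(values)
    v, y = values[order], labels[order]
    base = entropy(y)
    best_gain, best_threshold = 0.0, None
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue
        left, right = y[:i], y[i:]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if base - cond > best_gain:
            best_gain, best_threshold = base - cond, (v[i] + v[i - 1]) / 2
    return best_threshold, best_gain

# Example: a synthetic continuous symptom whose high values indicate disease.
rng = np.random.default_rng(4)
symptom = np.concatenate([rng.normal(5, 1, 200), rng.normal(8, 1, 200)])
diagnosis = np.array([0] * 200 + [1] * 200)
print(best_cut(symptom, diagnosis))
```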

27

Mass spectrometry imaging (MSI) generates large volumetric data sets consisting of mass-to-charge ratio (m/z), ion current, and x,y coordinate location. These datasets usually serve limited purposes centered on measuring the distribution of a small set of ions with known m/z. Such earmarked queries consider only a fraction of the full mass spectrum captured, and there are few tools to assist the exploration of the remaining volume of unknown data in terms of demonstrating similarity or discordance in tissue compartment distribution patterns. Here we present a novel, interactive approach to extract information from MSI data that relies on pre-calculated data structures to perform queries of large data sets with a typical laptop. We have devised methods to query the full volume to find new m/z values of potential interest based on similarity to biological structures, or to the spatial distribution of known ions. We describe these query methods in detail and provide examples demonstrating the power of the methods to “discover” m/z values of ions that have such potentially interesting correlations. The “discovered” ions may be further correlated with either positional locations or the coincident distribution of other ions, using successive queries. Finally, we show it is possible to gain insight into the fragmentation pattern of the parent molecule from such correlations. The ability to discover new ions of interest in the unknown bulk of an MSI dataset offers the potential to further our understanding of biological and physiological processes related to health and disease.

Concepts: Mass spectrometry, Atom, Data set, Correlation and dependence, Ion source, Mass-to-charge ratio, Mass spectrum
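
The pre-calculated data structures that make the authors' queries interactive are not described here; the sketch below shows only the underlying brute-force idea of ranking m/z channels by the spatial correlation of their ion images with a reference pattern. The datacube layout and the `rank_by_similarity` helper are assumptions.

```python
# Conceptual sketch of a similarity query over an MSI datacube (brute force;
# the paper relies on pre-calculated data structures for interactivity).
# `cube` is assumed to be a (n_mz, rows, cols) array of ion images.
import numpy as np

def rank_by_similarity(cube, reference_image, top_k=5):
    """Rank m/z channels by Pearson correlation of their ion image with a
    reference spatial pattern (e.g. a known ion's image or a tissue mask)."""
    flat = cube.reshape(cube.shape[0], -1).astype(float)
    ref = reference_image.ravel().astype(float)
    flat -= flat.mean(axis=1, keepdims=True)
    ref -= ref.mean()
    corr = (flat @ ref) / (np.linalg.norm(flat, axis=1) * np.linalg.norm(ref) + 1e-12)
    order = np.argsort(corr)[::-1]
    return order[:top_k], corr[order[:top_k]]

# Example with a synthetic cube of 500 m/z channels on a 64x64 grid.
rng = np.random.default_rng(5)
cube = rng.random((500, 64, 64))
mask = np.zeros((64, 64))
mask[20:40, 20:40] = 1.0          # region of interest (e.g. a tissue compartment)
cube[123] += 3 * mask             # plant one ion correlated with the region
idx, scores = rank_by_similarity(cube, mask)
print(idx, scores.round(2))
```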

27

A memory-efficient algorithm for the computation of Principal Component Analysis (PCA) of large mass spectrometry imaging data sets is presented. Mass spectrometry imaging (MSI) enables two- and three-dimensional overviews of hundreds of unlabeled molecular species in complex samples such as intact tissue. PCA, in combination with data binning or other reduction algorithms, has been widely used in the unsupervised processing of MSI data and as a dimensionality reduction method prior to clustering and spatial segmentation. Standard implementations of PCA require the data to be stored in random access memory. This imposes an upper limit on the amount of data that can be processed, necessitating a compromise between the number of pixels and the number of peaks to include. With increasing interest in multivariate analysis of large 3D multi-slice datasets and ongoing improvements in instrumentation, the ability to retain all pixels and many more peaks is increasingly important. We present a new method which has no limitation on the number of pixels and allows an increased number of peaks to be retained. The new technique was validated against the MATLAB (The MathWorks Inc., Natick, Massachusetts) implementation of PCA (princomp) and then used to reduce, without discarding peaks or pixels, multiple serial sections acquired from a single mouse brain which was too large to be analysed with princomp. k-means clustering was then performed on the reduced dataset. We further demonstrate with simulated data of 83 slices, comprising 20535 pixels per slice and equalling 44 GB of data, that the new method can be used in combination with existing tools to process an entire organ. MATLAB code implementing the memory-efficient PCA algorithm is provided.

Concepts: Multivariate statistics, Data set, Principal component analysis, Machine learning, The MathWorks, MATLAB, K-means clustering, Natick, Massachusetts
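
The authors provide MATLAB code for their algorithm; the sketch below is a generic Python illustration of the same memory-limited strategy, accumulating the mean and Gram matrix chunk by chunk and eigendecomposing the covariance so the full pixel-by-peak matrix never has to be held in memory. It is not the published implementation.

```python
# Generic sketch of memory-limited PCA (not the authors' MATLAB code):
# accumulate sufficient statistics over chunks of pixels, then diagonalise
# the resulting peak-by-peak covariance matrix.
import numpy as np

def streaming_pca(chunks, n_components=10):
    """`chunks` is an iterable of (n_pixels_in_chunk, n_peaks) arrays."""
    n_total, sum_x, gram = 0, None, None
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        if sum_x is None:
            sum_x = np.zeros(chunk.shape[1])
            gram = np.zeros((chunk.shape[1], chunk.shape[1]))
        n_total += chunk.shape[0]
        sum_x += chunk.sum(axis=0)
        gram += chunk.T @ chunk
    mean = sum_x / n_total
    cov = gram / n_total - np.outer(mean, mean)     # covariance of the peaks
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order], mean

# Example: 200,000 pixels x 200 peaks streamed in chunks of 20,000 pixels.
rng = np.random.default_rng(6)
chunks = (rng.random((20_000, 200)) for _ in range(10))
variances, loadings, mean = streaming_pca(chunks, n_components=5)
print(variances.round(4))
```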

26

Within this work a methodological extension of the matched molecular pair analysis is presented. The method is based on a pharmacophore retyping of the molecular graph and a consecutive matched molecular pair analysis. The features of the new methodology are exemplified using a large dataset on CYP inhibition. We show that fuzzy matched pairs can be used to extract activity and selectivity determining pharmacophoric features. Based on the fuzzy pharmacophore description the method clusters molecular transfers and offers new opportunities for the combination of data from different sources, namely public and industry datasets.

Concepts: Scientific method, Function, Data set, Methodology, Power of a method
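
A heavily simplified toy illustration of the "fuzzy" matched-pair idea: the published method retypes the whole molecular graph with pharmacophore features before matched molecular pair analysis, whereas the sketch below assumes molecules are already fragmented into (core, substituent) pairs and merely retypes the substituents, so that chemically different but pharmacophorically equivalent transformations are pooled. All structures, feature types and activities are made up for illustration.

```python
# Toy illustration of fuzzy matched molecular pairs (not the published method):
# substituents are retyped to coarse pharmacophore classes, and pairs sharing
# a core but differing in pharmacophore type are pooled into one "transfer".
from collections import defaultdict
from itertools import combinations

# Hypothetical pharmacophore retyping table for a few substituents.
PHARMACOPHORE = {
    "OH": "H-bond donor/acceptor",
    "NH2": "H-bond donor/acceptor",
    "Cl": "hydrophobe",
    "CH3": "hydrophobe",
}

# Hypothetical dataset: (compound id, shared core, substituent, pIC50).
compounds = [
    ("c1", "coreA", "OH", 6.8), ("c2", "coreA", "Cl", 5.9),
    ("c3", "coreB", "NH2", 7.1), ("c4", "coreB", "CH3", 6.0),
]

# Group compounds by core, then record fuzzy transformations between pairs.
by_core = defaultdict(list)
for cid, core, sub, activity in compounds:
    by_core[core].append((cid, PHARMACOPHORE[sub], activity))

fuzzy_pairs = defaultdict(list)
for core, members in by_core.items():
    for (id1, t1, a1), (id2, t2, a2) in combinations(members, 2):
        if t1 != t2:                                  # a pharmacophore change
            fuzzy_pairs[(t1, t2)].append((id1, id2, round(a2 - a1, 2)))

# Both coreA and coreB pairs fall into the same fuzzy transfer, pooling data
# from structurally different matched pairs.
for transfer, pairs in fuzzy_pairs.items():
    print(transfer, pairs)
```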

25

Existing computational pipelines for quantitative analysis of high-content microscopy data rely on traditional machine learning approaches that fail to accurately classify more than a single dataset without substantial tuning and training, requiring extensive analysis. Here, we demonstrate that the application of deep learning to biological image data can overcome the pitfalls associated with conventional machine learning classifiers. Using a deep convolutional neural network (DeepLoc) to analyze yeast cell images, we show improved performance over traditional approaches in the automated classification of protein subcellular localization. We also demonstrate the ability of DeepLoc to classify highly divergent image sets, including images of pheromone-arrested cells with abnormal cellular morphology, as well as images generated in different genetic backgrounds and in different laboratories. We offer an open-source implementation that enables updating DeepLoc on new microscopy datasets. This study highlights deep learning as an important tool for the expedited analysis of high-content microscopy data.

Concepts: Protein, Cell nucleus, Bioinformatics, Organism, Cell biology, Yeast, Data set, Protein subcellular localization prediction
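
The authors provide an open-source DeepLoc implementation; the sketch below is not that architecture, only a minimal PyTorch convolutional classifier showing the general shape of the task: mapping single-cell image crops to subcellular localization classes. Channel counts, class counts and layer sizes are arbitrary assumptions.

```python
# Minimal convolutional classifier sketch (not the published DeepLoc network):
# maps single-cell image crops (e.g. GFP + marker channels) to one of several
# subcellular localization classes.
import torch
import torch.nn as nn

class TinyLocNet(nn.Module):
    def __init__(self, n_channels=2, n_classes=15):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example forward pass on a batch of 8 two-channel 64x64 cell crops.
model = TinyLocNet()
crops = torch.randn(8, 2, 64, 64)
logits = model(crops)
print(logits.shape)          # torch.Size([8, 15])
```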