SciCombinator

Discover the most talked about and latest scientific content & concepts.

Concept: Data set

683

We study fifteen months of human mobility data for one and a half million individuals and find that human mobility traces are highly unique. In fact, in a dataset where the location of an individual is specified hourly, and with a spatial resolution equal to that given by the carrier’s antennas, four spatio-temporal points are enough to uniquely identify 95% of the individuals. We coarsen the data spatially and temporally to find a formula for the uniqueness of human mobility traces given their resolution and the available outside information. This formula shows that the uniqueness of mobility traces decays approximately as the 1/10 power of their resolution. Hence, even coarse datasets provide little anonymity. These findings represent fundamental constraints to an individual’s privacy and have important implications for the design of frameworks and institutions dedicated to protect the privacy of individuals.
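The scaling behaviour described in this abstract can be sketched numerically. The ~1/10-power decay is taken from the text; the baseline uniqueness at full resolution and the concrete coarsening factors are hypothetical illustration values, not the study's fitted parameters:

```python
def trace_uniqueness(coarsening, baseline=0.95, exponent=0.1):
    """Approximate fraction of mobility traces that remain unique after
    coarsening the spatio-temporal resolution by the given factor.

    The ~1/10-power decay comes from the abstract; `baseline` (the
    uniqueness at full antenna/hourly resolution) is a hypothetical value.
    """
    return baseline * coarsening ** -exponent

# Even a 1000-fold coarsening leaves roughly half of all traces unique,
# illustrating why even coarse datasets provide little anonymity.
for factor in (1, 10, 100, 1000):
    print(factor, round(trace_uniqueness(factor), 2))
```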

Concepts: Optics, Data set, Law, First-order logic

515

Sea surface temperature (SST) records are subject to potential biases due to changing instrumentation and measurement practices. Significant differences exist between commonly used composite SST reconstructions from the National Oceanic and Atmospheric Administration’s Extended Reconstruction Sea Surface Temperature (ERSST), the Hadley Centre SST data set (HadSST3), and the Japanese Meteorological Agency’s Centennial Observation-Based Estimates of SSTs (COBE-SST) from 2003 to the present. The update from ERSST version 3b to version 4 resulted in an increase in the operational SST trend estimate during the last 19 years from 0.07°C to 0.12°C per decade, indicating a higher rate of warming in recent years. We show that ERSST version 4 trends generally agree with largely independent, near-global, and instrumentally homogeneous SST measurements from floating buoys, Argo floats, and radiometer-based satellite measurements that have been developed and deployed during the past two decades. We find a large cooling bias in ERSST version 3b and smaller but significant cooling biases in HadSST3 and COBE-SST from 2003 to the present, with respect to most series examined. These results suggest that reported rates of SST warming in recent years have been underestimated in these three data sets.
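As an illustration of how such decadal trend estimates are obtained, a least-squares fit to a monthly anomaly series can be sketched as follows. The series here is synthetic, constructed to mimic the ~0.12 °C-per-decade ERSSTv4 figure; it is not real SST data:

```python
import numpy as np

# Hypothetical monthly SST anomalies over 19 years; in practice these
# would come from ERSSTv4, buoy, Argo, or satellite records.
years = np.arange(0, 19, 1 / 12)
rng = np.random.default_rng(0)
anomalies = 0.012 * years + rng.normal(0, 0.05, years.size)  # 0.12 °C/decade + noise

# Fit a straight line and convert the slope to a per-decade trend.
slope_per_year = np.polyfit(years, anomalies, 1)[0]
trend_per_decade = 10 * slope_per_year
print(f"trend: {trend_per_decade:.2f} °C per decade")
```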

Concepts: Present, Time, Measurement, Oceanography, Data set, Gas, Sea surface temperature, Decade

329

Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation benefit”. Furthermore, little is known about patterns in data reuse over time and across datasets. Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. 
The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties. Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
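The multivariate regression behind the citation-benefit estimate can be sketched schematically. The data below are simulated with a built-in 9% open-data benefit and a single stand-in covariate; they are not the study's 10,555-paper dataset:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
open_data = rng.integers(0, 2, n)   # 1 = dataset publicly available
covariate = rng.normal(0, 1, n)     # stand-in for impact factor, author history, etc.

# Simulate log-citation counts with a 9% open-data benefit (log(1.09) ≈ 0.086).
log_cites = 2.0 + np.log(1.09) * open_data + 0.5 * covariate + rng.normal(0, 0.3, n)

# Ordinary least squares on the log scale; exp(coef) - 1 recovers the
# percentage citation benefit while controlling for the covariate.
X = np.column_stack([np.ones(n), open_data, covariate])
beta, *_ = np.linalg.lstsq(X, log_cites, rcond=None)
print(f"estimated citation benefit: {np.exp(beta[1]) - 1:.1%}")
```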

Concepts: Statistics, Academic publishing, Data, Data set, DNA microarray, Reuse, Recycling, Remanufacturing

172

Individual participant data (IPD) meta-analyses that obtain “raw” data from studies rather than summary data typically adopt a “two-stage” approach to analysis whereby IPD within trials generate summary measures, which are combined using standard meta-analytical methods. Recently, a range of “one-stage” approaches which combine all individual participant data in a single meta-analysis have been suggested as providing a more powerful and flexible approach. However, they are more complex to implement and require statistical support. This study uses a dataset to compare “two-stage” and “one-stage” models of varying complexity, to ascertain whether results obtained from the approaches differ in a clinically meaningful way.
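The second stage of the "two-stage" approach — combining per-trial summary measures by standard meta-analytical methods — can be sketched with inverse-variance (fixed-effect) weighting. The trial estimates and variances below are hypothetical; stage one (fitting a model to the IPD within each trial) is assumed to have produced them:

```python
import numpy as np

def two_stage_meta(estimates, variances):
    """Stage 2 of a two-stage IPD meta-analysis: pool per-trial summary
    estimates by inverse-variance (fixed-effect) weighting."""
    est = np.asarray(estimates, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)  # weight = 1 / variance
    pooled = np.sum(w * est) / np.sum(w)
    pooled_se = np.sqrt(1.0 / np.sum(w))
    return pooled, pooled_se

# Hypothetical treatment-effect estimates and variances from three trials.
pooled, se = two_stage_meta([0.40, 0.55, 0.30], [0.04, 0.09, 0.02])
```

One-stage models instead fit a single (typically mixed-effects) model to all participants at once, which is why they need more statistical support to implement.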

Concepts: Epidemiology, Statistics, Chaos theory, Actuarial science, Evaluation methods, Data, Data set, Publication bias

170

BACKGROUND: Experimental datasets are becoming larger and increasingly complex, spanning different data domains, thereby expanding the requirements for corresponding tool support for their analysis. Networks provide a basis for the integration, analysis and visualization of multi-omics experimental datasets. RESULTS: Here we present VANTED (version 2), a framework for systems biology applications, which comprises a comprehensive set of seven main tasks. These range from network reconstruction, data visualization, integration of various data types, and network simulation to data exploration, combined with broad support for systems biology standards for visualization and data exchange. The offered functionalities are instantiated by combining several tasks in order to enable users to view and explore a comprehensive dataset from different perspectives. We describe the system as well as an exemplary workflow. CONCLUSIONS: VANTED is a stand-alone framework which supports scientists during the data analysis and interpretation phase. It is available as a Java open source tool from http://www.vanted.org.

Concepts: Statistics, Mathematics, Data, Data set, Data analysis, Data mining, Open source, Real analysis

169

BACKGROUND: Validation of administrative data is important to assess potential sources of bias in outcome evaluation and to prevent dissemination of misleading or inaccurate information. The purpose of the study was to determine the completeness and accuracy of endoscopy data in several administrative data sources in the year prior to colorectal cancer diagnosis, as part of a larger project focused on evaluating the quality of pre-diagnostic care. Methods: Primary and secondary data sources for endoscopy were collected from the Alberta Cancer Registry, cancer medical charts and three different administrative data sources. 1672 randomly sampled patients diagnosed with invasive colorectal cancer in the years 2000-2005 in Alberta, Canada were included. A retrospective validation study of administrative data for endoscopy in the year prior to colorectal cancer diagnosis was conducted. A gold standard dataset was created by combining all the datasets. The number and percent identified, agreement, and percent unique to a given data source were calculated and compared across each dataset and against the gold standard with respect to identifying all patients who underwent endoscopy and all endoscopies received by those patients. Results: The combined administrative data and the physician billing data identified as high or a higher percentage of patients who had one or more endoscopies (84% and 78%, respectively) and of total endoscopy procedures (89% and 81%, respectively) than the chart review (78% for both). Conclusions: Endoscopy data are highly complete and accurate in physician billing data alone; combined with hospital inpatient/outpatient data, they are more complete than chart review alone.
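The "percent identified" comparison against the gold standard can be sketched as follows; the patient identifiers below are toy values, not study data:

```python
def completeness(source_ids, gold_ids):
    """Percent of gold-standard endoscopy patients identified by a single
    data source (the 'percent identified' measure used in such studies).
    IDs are hypothetical patient identifiers."""
    gold = set(gold_ids)
    return 100.0 * len(set(source_ids) & gold) / len(gold)

# Gold standard: all patients with an endoscopy across the combined sources.
gold = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
billing = {1, 2, 3, 4, 5, 6, 7, 8}  # physician billing finds 8 of 10
print(completeness(billing, gold))
```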

Concepts: Evaluation, Colorectal cancer, Physician, Data, Data set, Endoscopy, Sigmoidoscopy, Percentage point

169

For several immune-mediated diseases, immunological analysis will become more complex in the future with datasets in which cytokine and gene expression data play a major role. These data have certain characteristics that require sophisticated statistical analysis such as strategies for non-normal distribution and censoring. Additionally, complex and multiple immunological relationships need to be adjusted for potential confounding and interaction effects.

Concepts: Gene, Gene expression, Transcription, Statistics, Sociology, Data, Data set, Analysis of variance

169

Many important questions in biology are, fundamentally, comparative, and this extends to our analysis of a growing number of sequenced genomes. Existing genomic analysis tools are often organized around literal views of genomes as linear strings. Even when information is highly condensed, these views grow cumbersome as larger numbers of genomes are added. Data aggregation and summarization methods from the field of visual analytics can provide abstracted comparative views, suitable for sifting large multi-genome datasets to identify critical similarities and differences. We introduce a software system for visual analysis of comparative genomics data. The system automates the process of data integration, and provides the analysis platform to identify and explore features of interest within these large datasets. GenoSets borrows techniques from business intelligence and visual analytics to provide a rich interface of interactive visualizations supported by a multi-dimensional data warehouse. In GenoSets, visual analytic approaches are used to enable querying based on orthology, functional assignment, and taxonomic or user-defined groupings of genomes. GenoSets links this information together with coordinated, interactive visualizations for both detailed and high-level categorical analysis of summarized data. GenoSets has been designed to simplify the exploration of multiple genome datasets and to facilitate reasoning about genomic comparisons. Case examples are included showing the use of this system in the analysis of 12 Brucella genomes. GenoSets software and the case study dataset are freely available at http://genosets.uncc.edu. We demonstrate that the integration of genomic data using a coordinated multiple view approach can simplify the exploration of large comparative genomic data sets, and facilitate reasoning about comparisons and features of interest.

Concepts: Gene, Genetics, Genome, Genomics, Data set, Logic, Data management, Business intelligence

166

In genome-wide association studies, results have been improved through imputation of a denser marker set based on reference haplotypes and phasing of the genotype data. To better handle very large sets of reference haplotypes, pre-phasing with only study individuals has been suggested. We present a possible problem which is aggravated when pre-phasing strategies are used, and suggest a modification that avoids the resulting issues, with application to the MaCH tool, although the underlying problem is not specific to that tool. We evaluate the effectiveness of our remedy on a subset of HapMap data, comparing the original version of MaCH and our modified approach. Improvements are demonstrated on the original data (phase switch error rate decreasing by 10%), but the differences are more pronounced when the data are augmented to represent the presence of closely related individuals, especially when siblings are present (30% reduction in switch error rate in the presence of children, 47% reduction in the presence of siblings). The main conclusion of this investigation is that existing statistical methods for phasing and imputation of unrelated individuals might give results of sub-par quality if a subset of study individuals are nonetheless related. As the populations collected for general genome-wide association studies grow in size, including relatives might become more common. If a general GWAS framework for unrelated individuals were employed on datasets with some related individuals, such as familial data or material from domesticated animals, caution should also be taken regarding the quality of haplotypes. Our modification to MaCH is available on request and straightforward to implement. We hope that this mode, if found to be of use, could be integrated as an option in future standard distributions of MaCH.
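The switch error rate used above to quantify phasing quality can be sketched as follows; the haplotype encodings are toy values (0/1 per heterozygous site), not HapMap data:

```python
def switch_error_rate(true_phase, inferred_phase):
    """Fraction of intervals between adjacent heterozygous sites where
    the inferred haplotype phase flips relative to the truth."""
    # A switch occurs whenever agreement with the true phase changes
    # between consecutive heterozygous sites.
    agree = [t == p for t, p in zip(true_phase, inferred_phase)]
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    return switches / (len(agree) - 1)

# One phase flip midway through 6 het sites -> 1 switch over 5 intervals.
rate = switch_error_rate([0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1])
```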

Concepts: Better, Genetics, Statistics, Improve, Mathematics, Data, Data set, Genome-wide association study

166

²³Na magnetic resonance imaging is a promising technique for the noninvasive imaging of renal function. Past investigations of the renal corticomedullary [²³Na] gradient have relied on imaging only in the coronal plane and on cumbersome calculations of [²³Na], which require the use of external phantoms. The aim of this study is therefore two-fold: to use an isotropic three-dimensional data set to compare coronal measurements of renal [²³Na] relative to measurements obtained in planes along the corticomedullary gradients, and to investigate the cerebrospinal fluid (CSF) ²³Na signal as an internal reference standard, obviating the need for time-intensive [²³Na] calculations.

Concepts: Brain, Statistics, Nuclear magnetic resonance, Magnetic resonance imaging, Multiple sclerosis, Data set, Cerebrospinal fluid, Level set