Journal: Scientific Data


How were cities distributed globally in the past? How many people lived in these cities? How did cities influence their local and regional environments? In order to understand the current era of urbanization, we must understand long-term historical urbanization trends and patterns. However, to date there is no comprehensive record of spatially explicit, historic, city-level population data at the global scale. Here, we developed the first spatially explicit dataset of urban settlements from 3700 BC to AD 2000, by digitizing, transcribing, and geocoding historical, archaeological, and census-based urban population data previously published in tabular form by Chandler and Modelski. The dataset creation process also required data cleaning and harmonization procedures to make the data internally consistent. Additionally, we created a reliability ranking for each geocoded location to assess the geographic uncertainty of each data point. The dataset provides the first spatially explicit archive of the location and size of urban populations over the last 6,000 years and can contribute to an improved understanding of contemporary and historical urbanization trends.

Concepts: Statistics, Chronology, Demography, Geographic information system, City, Urban area, Urbanization, Anno Domini


Reproducible climate reconstructions of the Common Era (1 CE to present) are key to placing industrial-era warming into the context of natural climatic variability. Here we present a community-sourced database of temperature-sensitive proxy records from the PAGES2k initiative. The database gathers 692 records from 648 locations, including all continental regions and major ocean basins. The records are from trees, ice, sediment, corals, speleothems, documentary evidence, and other archives. They range in length from 50 to 2000 years, with a median of 547 years, while temporal resolution ranges from biweekly to centennial. Nearly half of the proxy time series are significantly correlated with HadCRUT4.2 surface temperature over the period 1850-2014. Global temperature composites show a remarkable degree of coherence between high- and low-resolution archives, with broadly similar patterns across archive types, terrestrial versus marine locations, and screening criteria. The database is suited to investigations of global and regional temperature variability over the Common Era, and is shared in the Linked Paleo Data (LiPD) format, including serializations in Matlab, R and Python.

Concepts: Ice, Climate, Weather, Climate change, Degrees of freedom, Ocean, Global warming, Latitude


Soccer analytics is attracting increasing interest in academia and industry, thanks to the availability of sensing technologies that provide high-fidelity data streams for every match. Unfortunately, these detailed data are owned by specialized companies and hence are rarely publicly available for scientific research. To fill this gap, this paper describes the largest open collection of soccer-logs ever released, containing all the spatio-temporal events (passes, shots, fouls, etc.) that occurred during each match for an entire season of seven prominent soccer competitions. Each match event contains information about its position, time, outcome, player and characteristics. The nature of team sports like soccer, halfway between the abstraction of a game and the reality of complex social systems, combined with the unique size and composition of this dataset, provides an ideal ground for tackling a wide range of data science problems, including the measurement and evaluation of performance, both at individual and at collective level, and the determinants of success and failure.


Tardigrades are ubiquitous microscopic animals that play an important role in the study of metazoan phylogeny. Most terrestrial tardigrades can withstand extreme environments by entering an ametabolic desiccated state termed anhydrobiosis. Due to their small size and the non-axenic nature of laboratory cultures, molecular studies of tardigrades are prone to contamination. To minimize the possibility of microbial contamination and to obtain high-quality genomic information, we have developed an ultra-low input library sequencing protocol to enable the genome sequencing of a single tardigrade Hypsibius dujardini individual. Here, we describe the details of our sequencing data and the ultra-low input library preparation methodologies.

Concepts: DNA, Genetics, Molecular biology, Organism, Animal, Tardigrade, Trehalose, Cryptobiosis


High-resolution, easily accessible paleoclimate data are essential for environmental, evolutionary, and ecological studies. The availability of bioclimatic layers derived from climatic simulations representing conditions of the Late Pleistocene and Holocene has revolutionized the study of species responses to Late Quaternary climate change. Yet, integrative studies of the impacts of climate change in the Early Pleistocene and Pliocene - periods in which recent speciation events are known to concentrate - have been hindered by the limited availability of downloadable, user-friendly climatic descriptors. Here we present PaleoClim, a free database of downscaled paleoclimate outputs at 2.5-minute resolution (~5 km at equator) that includes surface temperature and precipitation estimates from snapshot-style climate model simulations using HadCM3, a version of the UK Met Office Hadley Centre General Circulation Model. As of now, the database contains climatic data for three key time periods spanning from 3.3 to 0.787 million years ago: the Marine Isotope Stage 19 (MIS19) in the Pleistocene (~787 ka), the mid-Pliocene Warm Period (~3.264-3.025 Ma), and MIS M2 in the Late Pliocene (~3.3 Ma).
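As a quick check of the "2.5-minute resolution (~5 km at equator)" figure quoted above, the following minimal sketch converts an angular cell size to an east-west ground distance. It assumes a spherical Earth with the WGS84 equatorial radius; the function name `cell_width_km` is illustrative and is not part of PaleoClim.

```python
import math

# Assumption: spherical Earth using the WGS84 equatorial radius.
EARTH_EQUATORIAL_RADIUS_KM = 6378.137


def cell_width_km(arc_minutes, latitude_deg=0.0):
    """Approximate east-west width of a grid cell of the given angular size.

    The width shrinks with the cosine of latitude because meridians converge
    towards the poles.
    """
    angle_rad = math.radians(arc_minutes / 60.0)
    return EARTH_EQUATORIAL_RADIUS_KM * angle_rad * math.cos(math.radians(latitude_deg))


width_equator = cell_width_km(2.5)
width_60n = cell_width_km(2.5, latitude_deg=60.0)
print(f"{width_equator:.2f} km at the equator, {width_60n:.2f} km at 60 degrees latitude")
```

Under these assumptions a 2.5-minute cell is about 4.6 km wide at the equator, consistent with the rounded ~5 km figure, and roughly half that at 60 degrees latitude.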


The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use both in relating electronic structure calculations to experimental observations through the generation of calibration schemes, and in creating new semi-empirical methods and benchmarking current and future model chemistries for organic electronic applications.

Concepts: Electron, Chemistry, Computational chemistry, Density functional theory, Quantum chemistry, Molecular orbital, Standard Model, Theoretical chemistry


This paper presents the first global map of food systems sustainability based on a rigorous protocol. The choice of the metric dimensions, as well as the individual indicators included in the metric, were initially identified from a thorough review of the existing literature. A rigorous inclusion/exclusion protocol was then used to refine the list and shorten it to a sub-set of 27 indicators. An aggregate sustainability score was then computed based on those 27 indicators organized into four dimensions: environment, social, food security & nutrition and economic. The paper shows how the availability of data (or lack thereof) results in an unavoidable trade-off between number of indicators and number of countries, and highlights how optimization can be used to present the most robust metric possible given the existence of these trade-offs in the data space. The process results in the computation of a global sustainability map covering 97 countries and 20 indicators. The sustainability scores obtained for each country are made available over the entire range of indicators.


Wilderness areas, defined as areas free of industrial scale activities and other human pressures which result in significant biophysical disturbance, are important for biodiversity conservation and sustaining the key ecological processes underpinning planetary life-support systems. Despite their importance, wilderness areas are being rapidly eroded in extent and fragmented. Here we present the most up-to-date temporally inter-comparable maps of global terrestrial wilderness areas, which are essential for monitoring changes in their extent, and for proactively planning conservation interventions to ensure their preservation. Using maps of human pressure on the natural environment for 1993 and 2009, we identified wilderness as all ‘pressure free’ lands with a contiguous area >10,000 km². These places are likely operating in a natural state and represent the most intact habitats globally. We then created a regionally representative map of wilderness following the well-established ‘Last of the Wild’ methodology, which identifies the 10% area with the lowest human pressure within each of Earth’s 60 biogeographic realms, and identifies the ten largest contiguous areas, along with all contiguous areas >10,000 km².

Concepts: Biodiversity, Conservation biology, Ecology, Geography, Natural environment, Nature, Conservation movement, Wilderness
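The contiguity criterion in the wilderness abstract above (pressure-free land forming a contiguous area >10,000 km²) can be illustrated with a connected-component search on a raster. This is only a sketch under stated assumptions, not the authors' workflow: the boolean grid, the equal-area cells, the 4-neighbour connectivity, and the function name `wilderness_patches` are all hypothetical.

```python
from collections import deque


def wilderness_patches(pressure_free, cell_area_km2, min_area_km2=10_000):
    """Return contiguous pressure-free patches larger than min_area_km2.

    Assumptions (not from the source): equal-area cells and 4-neighbour
    connectivity. Each patch is a list of (row, col) cell coordinates.
    """
    rows, cols = len(pressure_free), len(pressure_free[0])
    seen = [[False] * cols for _ in range(rows)]
    patches = []
    for r in range(rows):
        for c in range(cols):
            if not pressure_free[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill collects one contiguous patch.
            queue, patch = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                patch.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and pressure_free[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Keep the patch only if its total area exceeds the threshold.
            if len(patch) * cell_area_km2 > min_area_km2:
                patches.append(patch)
    return patches


# Toy grid: True marks a pressure-free cell; each cell is taken to be 3,000 km².
toy = [
    [True, True, False, True],
    [True, True, False, False],
    [False, False, False, True],
]
large = wilderness_patches(toy, cell_area_km2=3_000)
print(len(large), "patch(es) above the threshold")
```

In the toy grid only the four-cell block in the upper left (12,000 km²) passes the threshold; the two isolated cells do not. Real global rasters have cells whose area varies with latitude, so a per-cell area grid would be needed in place of the single `cell_area_km2` value.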


Health facilities form a central component of health systems, providing curative and preventative services and structured to allow referral through a pyramid of increasingly complex service provision. Access to health care is a complex and multidimensional concept; however, in its narrowest sense, it refers to geographic availability. Linking health facilities to populations has been a traditional per capita index of health care coverage; however, with locations of health facilities and higher-resolution population data, Geographic Information Systems allow for a more refined metric of health access, define geographic inequalities in service provision and inform planning. Maximizing the value of spatial health access requires a complete census of providers and their locations. To date, there has not been a single, geo-referenced and comprehensive public health facility database for sub-Saharan Africa. We have assembled national master health facility lists from a variety of government and non-government sources from 50 countries and islands in sub-Saharan Africa and used multiple geocoding methods to provide a comprehensive spatial inventory of 98,745 public health facilities.


Interactions between species, particularly where one is likely to be a pathogen of the other, as well as the geographical distribution of species, have been systematically extracted from various web-based, free-access sources, and assembled with the accompanying evidence into a single database. The database attempts to answer questions such as what are all the pathogens of a host, what are all the hosts of a pathogen, what are all the countries where a pathogen was found, and what are all the pathogens found in a country. Two datasets were extracted from the database, focussing on species interactions and species distribution, based on evidence published between 1950 and 2012. The quality of their evidence was checked and verified against well-known alternative datasets of pathogens infecting humans, domestic animals and wild mammals. The presented datasets provide a valuable resource for researchers of infectious diseases of humans and animals, including zoonoses.

Concepts: Immune system, Disease, Infectious disease, Bacteria, Microbiology, Malaria, Infection, Mammal