Concept: Computer data


Figures in scientific publications are critically important because they often show the data supporting key findings. Our systematic review of research articles published in top physiology journals (n = 703) suggests that, as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies. Papers rarely included scatterplots, box plots, and histograms that allow readers to critically evaluate continuous data. Most papers presented continuous data in bar and line graphs. This is problematic, as many different data distributions can lead to the same bar or line graph. The full data may suggest different conclusions from the summary statistics. We recommend training investigators in data presentation, encouraging a more complete presentation of data, and changing journal editorial policies. Investigators can quickly make univariate scatterplots for small sample size studies using our Excel templates.
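
The point that different distributions can yield the same bar graph can be illustrated with a minimal sketch. The two samples below are hypothetical values invented for illustration, and the helper names `bar_summary` and `dot_strip` are assumptions of this sketch; they are not taken from the article or its Excel templates.

```python
from statistics import mean, stdev

# Hypothetical measurements, invented for illustration (n = 6 per group).
group_a = [10, 11, 11, 12, 12, 22]   # one high outlier
group_b = [16, 15, 15, 14, 14, 4]    # group_a mirrored about 13: one low outlier


def bar_summary(values):
    """Return the (mean, SD) pair that a bar graph with error bars would display."""
    return mean(values), stdev(values)


def dot_strip(values, lo=0, hi=25):
    """Render a crude one-line text scatterplot over the integer range lo..hi.

    '.' marks an empty position, '*' a single point, and a digit a tie count.
    """
    counts = [0] * (hi - lo + 1)
    for v in values:
        counts[v - lo] += 1
    return "".join("." if c == 0 else ("*" if c == 1 else str(c)) for c in counts)


if __name__ == "__main__":
    for name, group in (("A", group_a), ("B", group_b)):
        m, sd = bar_summary(group)
        print(f"group {name}: mean={m}, SD={sd:.3f}  {dot_strip(group)}")
```

Both groups share the same mean and standard deviation, so a bar graph would draw them identically, while the per-point strip shows that their outliers lie on opposite sides.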

Concepts: Sample size, Science, Bar chart, Computer data


Background: The risk of cancer with hypercalcaemia in primary care is unknown. Methods: This was a cohort study using calcium results in patients aged ⩾40 years in a primary care electronic data set. Diagnoses of cancer in the following year were identified. Results: Participants (54 267) had calcium results: 1674 (3%) were ⩾2.6 mmol l⁻¹. Hypercalcaemia was strongly associated with cancer, especially in males: OR 2.92, 95% CI 2.17-3.93, P<0.001; positive predictive value (PPV) 11.5%; females: OR 1.86, 95% CI 1.39-2.50, P<0.001; PPV 4.1%. Conclusions: Hypercalcaemia is strongly associated with cancer in primary care, with men at most risk, despite hypercalcaemia being more common in women. British Journal of Cancer advance online publication, 5 August 2014; doi:10.1038/bjc.2014.433
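
For readers unfamiliar with the two measures reported above, the following minimal sketch shows how a positive predictive value and a crude odds ratio are computed from a 2×2 table. The counts are hypothetical and are not the study's data; the reported odds ratios may also be adjusted estimates, which this crude calculation would not reproduce.

```python
def ppv_and_odds_ratio(exposed_cases, exposed_noncases,
                       unexposed_cases, unexposed_noncases):
    """Compute PPV and the crude odds ratio from a 2x2 table.

    'Exposed' here stands for a raised calcium result and 'case' for a
    cancer diagnosis in the following year.
    """
    # PPV: proportion of exposed patients who turn out to be cases.
    ppv = exposed_cases / (exposed_cases + exposed_noncases)
    # Crude OR: odds of being a case if exposed divided by odds if unexposed.
    odds_ratio = (exposed_cases * unexposed_noncases) / (
        exposed_noncases * unexposed_cases)
    return ppv, odds_ratio


if __name__ == "__main__":
    # Hypothetical counts, invented for illustration only.
    ppv, oratio = ppv_and_odds_ratio(10, 90, 40, 960)
    print(f"PPV = {ppv:.1%}, crude OR = {oratio:.2f}")
```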

Concepts: Cohort study, Vitamin D, Epidemiology, Male, Following, Gender, Primary care, Computer data


Physical activity is widely known to be one of the key elements of a healthy life. The many benefits of physical activity described in the medical literature include weight loss and reductions in the risk factors for chronic diseases. With the recent advances in wearable devices, such as smartwatches or physical activity wristbands, motion tracking sensors are becoming pervasive, which has led to an impressive growth in the amount of physical activity data available and an increasing interest in recognizing which specific activity a user is performing. Moreover, big data and machine learning are now cross-fertilizing each other in an approach called “deep learning”, which consists of massive artificial neural networks able to detect complicated patterns from enormous amounts of input data to learn classification models. This work compares various state-of-the-art classification techniques for automatic cross-person activity recognition under different scenarios that vary widely in how much information is available for analysis. We have incorporated deep learning by using Google’s TensorFlow framework. The data used in this study were acquired from PAMAP2 (Physical Activity Monitoring in the Ageing Population), a publicly available dataset containing physical activity data. To perform cross-person prediction, we used the leave-one-subject-out (LOSO) cross-validation technique. When working with large training sets, the best classifiers obtain very high average accuracies (e.g., 96% using extra randomized trees). However, when the data volume is drastically reduced (where available data are only 0.001% of the continuous data), deep neural networks performed the best, achieving 60% in overall prediction accuracy. We found that even when working with only approximately 22.67% of the full dataset, we can statistically obtain the same results as when working with the full dataset. 
This finding enables the design of more energy-efficient devices and facilitates cold starts and big data processing of physical activity records.
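
The leave-one-subject-out scheme mentioned above can be sketched as follows. This is a generic illustration under the assumption that every sample carries a subject label; the function name and the toy labels are hypothetical, and this is not the authors' code.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject.

    All samples of the held-out subject form the test set, so the classifier
    is always evaluated on a person it has never seen during training.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    # Hypothetical subject label for each of five sensor windows.
    subjects = ["s1", "s1", "s2", "s3", "s3"]
    for held_out, train, test in leave_one_subject_out(subjects):
        print(held_out, "train:", train, "test:", test)
```

scikit-learn offers the same splitting strategy as `LeaveOneGroupOut` in `sklearn.model_selection`; the pure-Python version is shown only to make the fold construction explicit.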

Concepts: Statistics, Data, Artificial intelligence, Machine learning, Neural network, Artificial neural network, Unsupervised learning, Computer data


BACKGROUND: Due to the growing number of biomedical entries in data repositories of the National Center for Biotechnology Information (NCBI), it is difficult to collect, manage and process all of these entries in one place by third-party software developers without significant investment in hardware and software infrastructure, its maintenance and administration. Web services allow development of software applications that integrate in one place the functionality and processing logic of distributed software components, without integrating the components themselves and without integrating the resources to which they have access. This is achieved by appropriate orchestration or choreography of available Web services and their shared functions. After the successful application of Web services in the business sector, this technology can now be used to build composite software tools that are oriented towards biomedical data processing. RESULTS: We have developed a new tool for efficient and dynamic data exploration in GenBank and other NCBI databases. A dedicated search GenBank system makes use of NCBI Web services and a package of Entrez Programming Utilities (eUtils) in order to provide extended searching capabilities in NCBI data repositories. In search GenBank users can use one of the three exploration paths: simple data searching based on the specified user’s query, advanced data searching based on the specified user’s query, and advanced data exploration with the use of macros. search GenBank orchestrates calls of particular tools available through the NCBI Web service providing requested functionality, while users interactively browse selected records in search GenBank and traverse between NCBI databases using available links. On the other hand, by building macros in the advanced data exploration mode, users create choreographies of eUtils calls, which can lead to the automatic discovery of related data in the specified databases. 
CONCLUSIONS: search GenBank extends the standard capabilities of the NCBI Entrez search engine in querying biomedical databases. The ability to create and save macros in search GenBank is a unique feature with great potential, which will grow as the networks of relationships between data stored in particular databases become denser. search GenBank is available for public use at
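
To make the idea of chaining eUtils calls concrete, the sketch below builds request URLs for two E-utilities endpoints, ESearch and ELink. It only constructs the URLs and sends nothing over the network. The helper names, the example query, and the record IDs are hypothetical, and the sketch does not represent search GenBank's own macro format, which the abstract does not describe.

```python
from urllib.parse import urlencode

# Base URL of the NCBI Entrez Programming Utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch request URL: find record IDs in `db` matching `term`."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def elink_url(dbfrom, db, ids):
    """Build an ELink request URL: follow links from records in `dbfrom` to `db`."""
    query = urlencode({"dbfrom": dbfrom, "db": db, "id": ",".join(ids)})
    return f"{EUTILS_BASE}/elink.fcgi?{query}"


if __name__ == "__main__":
    # Step 1: search the nucleotide database (hypothetical example query).
    print(esearch_url("nucleotide", "BRCA1[Gene Name]", retmax=5))
    # Step 2: traverse from the returned nucleotide records to linked proteins.
    # The IDs below are placeholders standing in for the ESearch results.
    print(elink_url("nucleotide", "protein", ["1", "2"]))
```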

Concepts: Computer program, Search engine optimization, Internet, Searching, Computer software, Application software, Web service, Computer data


Due to their wide occurrence in water resources and their toxicity, pharmaceuticals and personal care products are becoming an emerging concern throughout the world. Application of residual/waste materials for water remediation can be a good strategy in waste management as well as in waste valorization. This dataset provides information on biochar application for the removal of an emerging contaminant, diclofenac, from water matrices. The data presented here extend the research article explaining the mechanisms of diclofenac adsorption on biochars (Lonappan et al., 2017 [1]). This data article provides general information on the surface features of pine wood and pig manure biochars with the help of SEM and FTIR data, as well as XRD profiles of both biochars. In addition, different amounts of biochar were used to study the removal of a fixed concentration of diclofenac, and these data are also provided.

Concepts: Statistics, Data, Data set, Materials science, Recycling, Pinophyta, Waste, Computer data


Recently released large-scale neuron morphological data have greatly facilitated research in neuroinformatics. However, the sheer volume and complexity of these data pose significant challenges for efficient and accurate neuron exploration. In this paper, we propose an effective retrieval framework to address these problems, based on frontier techniques of deep learning and binary coding. For the first time, we develop a deep learning based feature representation method for neuron morphological data, in which the 3D neurons are first projected into binary images and features are then learned using an unsupervised deep neural network, i.e., stacked convolutional autoencoders (SCAEs). The deep features are subsequently fused with hand-crafted features for a more accurate representation. Because exhaustive search is usually very time-consuming in large-scale databases, we employ a novel binary coding method to compress feature vectors into short binary codes. Our framework is validated on a public data set including 58,000 neurons, showing promising retrieval precision and efficiency compared with state-of-the-art methods. In addition, we develop a novel neuron visualization program based on augmented reality (AR) techniques, which can help users explore neuron morphologies in depth in an interactive and immersive manner.
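
Why short binary codes speed up retrieval can be seen in a minimal sketch: once each feature vector is compressed to a bit string, similarity search reduces to Hamming distance, which is an XOR followed by a bit count. The per-dimension thresholding used here is a simple stand-in chosen for illustration and is not the paper's binary coding method; all names and values are hypothetical.

```python
def binarize(features, thresholds):
    """Pack a feature vector into an integer bit code.

    Bit i is set when features[i] exceeds thresholds[i]. This thresholding
    is an illustrative stand-in, not the coding method used in the paper.
    """
    code = 0
    for i, (x, t) in enumerate(zip(features, thresholds)):
        if x > t:
            code |= 1 << i
    return code


def hamming(a, b):
    """Count the bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def search(query_code, database_codes, k=3):
    """Return the indices of the k database codes nearest to the query."""
    ranked = sorted(range(len(database_codes)),
                    key=lambda i: hamming(query_code, database_codes[i]))
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical 3-dimensional features and thresholds.
    query = binarize([0.9, 0.1, 0.8], [0.5, 0.5, 0.5])
    database = [0b010, 0b101, 0b100]
    print(search(query, database, k=2))
```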

Concepts: Nervous system, Neuron, Computer data


High-throughput sequencing makes it possible to evaluate thousands of genetic markers across genomes and populations. Reduced-representation sequencing approaches, like ddRADseq (double digest restriction site associated DNA sequencing), are frequently applied to screen for genetic variation. Particularly in non-model organisms, where whole-genome sequencing is not yet feasible, ddRADseq has become popular as it allows genome-wide assessment of variation patterns even in the absence of other genomic resources. However, while many tools are available for the analysis of ddRADseq data, few options exist to simulate ddRADseq data in order to evaluate the accuracy of downstream tools. The available tools either focus on the optimization of ddRAD experiment design or do not provide the information necessary for a detailed evaluation of different ddRAD analysis tools. This task requires a ground truth, i.e. the underlying information of all effects in the data set. Therefore, we present ddRAGE, the ddRAD Dataset Generator, which allows both developers and users to evaluate their ddRAD analysis software. ddRAGE allows the user to adjust many parameters, such as coverage and rates of mutations, sequencing errors or allelic dropouts, in order to generate a realistic simulated ddRADseq dataset for given experimental scenarios and organisms. The simulated reads can be easily processed with available analysis software such as STACKS or pyRAD and evaluated against the underlying parameters used to generate the data to gauge the impact of different parameter values used during downstream data processing.
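
The role of a ground truth can be illustrated with a toy simulator that generates reads from one locus, injects substitution errors at a chosen rate, and records exactly where each error was placed. This is a deliberately reduced sketch: it is not ddRAGE's implementation or interface, it models only sequencing errors (no mutations or allelic dropout), and the function and parameter names are hypothetical.

```python
import random


def simulate_reads(locus, coverage, error_rate, seed=0):
    """Simulate `coverage` reads of `locus` with random substitution errors.

    Returns (reads, truth), where truth[i] lists the positions at which
    read i was deliberately altered. An analysis tool's error calls can
    then be scored against this record.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    reads, truth = [], []
    for _ in range(coverage):
        bases = list(locus)
        errors = []
        for pos, base in enumerate(locus):
            if rng.random() < error_rate:
                # Substitute with one of the three other nucleotides.
                bases[pos] = rng.choice([b for b in "ACGT" if b != base])
                errors.append(pos)
        reads.append("".join(bases))
        truth.append(errors)
    return reads, truth


if __name__ == "__main__":
    reads, truth = simulate_reads("ACGTACGTAC", coverage=5, error_rate=0.2, seed=1)
    for read, errs in zip(reads, truth):
        print(read, "errors at", errs)
```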

Concepts: DNA, Gene, Genetics, Statistics, Molecular biology, Genome, Data, Computer data


Big data, cloud computing, and high-performance computing (HPC) are on the verge of convergence. Cloud computing is already playing an active part in big data processing with the help of big data frameworks like Hadoop and Spark. The recent upsurge of high-performance computing in China provides extra possibilities and capacity to address the challenges associated with big data. In this paper, we propose Orion, a big data interface on the Tianhe-2 supercomputer, to enable big data applications to run on Tianhe-2 via a single command or a shell script. Orion supports multiple users, and each user can launch multiple tasks. It minimizes the effort needed to initiate big data applications on the Tianhe-2 supercomputer via automated configuration. Orion follows the “allocate-when-needed” paradigm, and it avoids the idle occupation of computational resources. We tested the utility and performance of Orion using a big genomic dataset and achieved satisfactory performance on Tianhe-2 with very few modifications to existing applications that were implemented in Hadoop/Spark. In summary, Orion provides a practical and economical interface for big data processing on Tianhe-2.

Concepts: Data, Computer, Cloud computing, Utility computing, Grid computing, Computational resource, High-performance computing, Computer data


Here, we briefly describe the real-time fMRI data that is provided for testing the functionality of the open-source Python/Matlab framework for neurofeedback, termed Open NeuroFeedback Training (OpenNFT, Koush et al. [1]). The data set contains real-time fMRI runs from three anonymized participants (i.e., one neurofeedback run per participant), their structural scans and pre-selected ROIs/masks/weights. The data allows for simulating the neurofeedback experiment without an MR scanner, exploring the software functionality, and measuring data processing times on the local hardware. In accordance with the descriptions in our main article, we provide data of (1) periodically displayed (intermittent) activation-based feedback; (2) intermittent effective connectivity feedback, based on dynamic causal modeling (DCM) estimations; and (3) continuous classification-based feedback based on support-vector-machine (SVM) estimations. The data is available on our public GitHub repository:

Concepts: Statistics, Participation, E-participation, Data, Electroencephalography, Neurofeedback, Computer data


Benefiting from global rank constraints, the low-rank representation (LRR) method has been shown to be an effective solution to subspace learning. However, the global mechanism also means that the LRR model is not suitable for handling large-scale data or dynamic data. For large-scale data, the LRR method suffers from high time complexity, and for dynamic data, it has to recompute a complex rank minimization for the entire data set whenever new samples are dynamically added, making it prohibitively expensive. Existing attempts at online LRR either take a stochastic approach or build the representation purely based on a small sample set and treat new input as out-of-sample data. The former often requires multiple runs for good performance and thus takes longer to run, and the latter formulates online LRR as an out-of-sample classification problem and is less robust to noise. In this paper, a novel online low-rank representation subspace learning method is proposed for both large-scale and dynamic data. The proposed algorithm is composed of two stages: static learning and dynamic updating. In the first stage, the subspace structure is learned from a small number of data samples. In the second stage, the intrinsic principal components of the entire data set are computed incrementally by utilizing the learned subspace structure, and the low-rank representation matrix can also be incrementally solved by an efficient online singular value decomposition (SVD) algorithm. The time complexity is reduced dramatically for large-scale data, and repeated computation is avoided for dynamic problems. We further perform theoretical analysis comparing the proposed online algorithm with the batch LRR method. Finally, experimental results on typical tasks of subspace recovery and subspace clustering show that the proposed algorithm performs comparably to or better than batch methods including the batch LRR, and significantly outperforms state-of-the-art online methods.
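
For orientation, the batch LRR problem that the abstract refers to is commonly written in the LRR literature as the program below. The abstract itself does not state it, so the notation is an assumption of this note: X is the data matrix with one sample per column, Z is the representation matrix, E is the error term, and λ > 0 is a trade-off parameter.

```latex
\min_{Z,\,E} \; \|Z\|_{*} + \lambda \, \|E\|_{2,1}
\quad \text{subject to} \quad X = XZ + E
```

Here the nuclear norm of Z (the sum of its singular values) is the convex surrogate for rank, which is where the "global rank constraint" enters. Because the constraint involves every column of X, adding samples changes the whole problem, which is consistent with the recomputation cost described above.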

Concepts: Principal component analysis, Machine learning, Computer, Singular value decomposition, Computational complexity theory, Matrix, Singular value, Computer data