SciCombinator

Discover the latest and most talked-about scientific content & concepts.

Concept: C

176

This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type, with tens of thousands of downloads each year and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine: first, a genome-wide association study pipeline that contributed to the discovery of a head and neck cancer mutation association; second, medical records analysis that has significantly increased the statistical power of treatment/outcome models in the UK’s largest psychiatric patient cohort; and third, richer constructs for drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made well defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing into a comprehensive ecosystem that brings together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances made by many people (the majority outside the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online [1] under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.
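
The GATE data model can be pictured without the toolkit itself: the source text stays immutable and analysis results are layered over it as "stand-off" annotations, each carrying character offsets and a feature map. The short Python sketch below illustrates that idea only; it does not use GATE's actual API, and the example entities are invented.

    # Conceptual sketch of stand-off annotation (plain Python, not GATE's API):
    # the text is never modified; components add annotations with offsets and
    # features, which later components can consume or enrich.
    from dataclasses import dataclass, field

    @dataclass
    class Annotation:
        start: int
        end: int
        ann_type: str
        features: dict = field(default_factory=dict)

    text = "Mutations in TP53 are frequent in head and neck cancer."
    annotations = [
        Annotation(13, 17, "Gene", {"symbol": "TP53"}),
        Annotation(34, 54, "Disease", {"preferred": "head and neck cancer"}),
    ]

    for ann in annotations:
        print(ann.ann_type, repr(text[ann.start:ann.end]), ann.features)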

Concepts: Genome-wide association study, C, Open source, Free software, Operating system, Text mining, Source text, Arduino

173

MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single- or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality-control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and to predict protein-coding genes on the assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes, resulting in improvements in selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/.
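
As a rough illustration of the "queryable SQL databases" output, the sketch below writes hypothetical per-step read counts into a SQLite table and queries them back. The table layout and numbers are invented for this example and do not reflect MOCAT's actual schema.

    # Hedged sketch (not MOCAT's schema): per-step statistics in a queryable
    # SQLite database, in the spirit of MOCAT's summary outputs.
    import sqlite3

    stats = [
        ("sample_A", "raw_reads",        52_000_000),
        ("sample_A", "quality_filtered", 48_300_000),
        ("sample_A", "mapped_to_ref",    31_900_000),
    ]

    conn = sqlite3.connect("mocat_like_summary.db")
    conn.execute("CREATE TABLE IF NOT EXISTS step_stats (sample TEXT, step TEXT, reads INTEGER)")
    conn.executemany("INSERT INTO step_stats VALUES (?, ?, ?)", stats)
    conn.commit()

    # Example query: reads surviving each processing step for one sample.
    for step, reads in conn.execute(
            "SELECT step, reads FROM step_stats WHERE sample = 'sample_A'"):
        print(step, reads)
    conn.close()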

Concepts: DNA, Gene, Assembly language, Unix, C, Source code, Open source, SQL

173

The Ensembl Project provides release-specific Perl APIs for efficient, high-level programmatic access to data stored in the various Ensembl database schemas. Although Perl scripts are perfectly suited for processing large volumes of text-based data, Perl is not ideal for developing large-scale software applications or for embedding in graphical interfaces. The provision of a novel Java API would facilitate type-safe, modular, object-oriented development of new bioinformatics tools with which to access, analyse and visualize Ensembl data.

Concepts: Bioinformatics, Database, Computer program, C, Application programming interface, Graphical user interface, Computer software, Application software

172

Numerical processing has been demonstrated to be closely associated with arithmetic skills; however, our knowledge of the development of the relevant cognitive mechanisms is limited. The present longitudinal study investigated the developmental trajectories of numerical processing in 42 children with age-adequate arithmetic development and 41 children with dyscalculia over a 2-year period, from the beginning of Grade 2, when the children were 7 years 6 months old, to the beginning of Grade 4. A battery of numerical processing tasks (dot enumeration, non-symbolic and symbolic comparison of one- and two-digit numbers, physical comparison, number line estimation) was administered five times during the study (at the beginning and middle of each school year). Efficiency of numerical processing was a very good indicator of development in numerical processing, whereas within-task effects remained largely constant and showed low long-term stability before the middle of Grade 3. Children with dyscalculia showed less efficient numerical processing, reflected in specifically prolonged response times. Importantly, they showed consistently larger slopes for dot enumeration in the subitizing range, an atypically large compatibility effect when processing two-digit numbers, and consistently lower accuracy in placing numbers on a number line. Thus, we were able to identify parameters that can be used in future research to characterize numerical processing in typical and dyscalculic development. These parameters can also be helpful for identifying children who struggle in their numerical development.

Concepts: Time, Mathematics, Developmental psychology, Number, C, Educational years, Counting, Third grade

170

MOTIVATION: BLAST remains one of the most widely used tools in computational biology. The rate at which new sequence data becomes available continues to grow exponentially, driving the emergence of new fields of biological research. At the same time, multicore systems and conventional clusters are more accessible. ScalaBLAST has been designed to run on conventional multiprocessor systems with an eye to extreme parallelism, enabling parallel BLAST calculations using over 16,000 processing cores with a portable, robust, fault-resilient design that introduces little to no overhead with respect to serial BLAST. ScalaBLAST 2.0 source code can be freely downloaded from http://omics.pnl.gov/software/ScalaBLAST.php.
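
The core strategy behind parallel BLAST runs of this kind is query partitioning: the input sequences are split into chunks and each chunk is searched against the database independently. The sketch below shows that general idea with Python's process pool and the NCBI BLAST+ command line; it is not ScalaBLAST itself (which uses a fault-resilient MPI design), and the file names and database are placeholders.

    # General query-partitioned parallel BLAST sketch (not ScalaBLAST).
    # Assumes NCBI BLAST+ 'blastp' is on PATH and that 'queries.fasta' and a
    # protein database 'mydb' exist; both names are placeholders.
    import subprocess
    from concurrent.futures import ProcessPoolExecutor

    def split_fasta(path, n_chunks):
        """Split FASTA records round-robin into n_chunks temporary files."""
        records, current = [], []
        with open(path) as fh:
            for line in fh:
                if line.startswith(">") and current:
                    records.append(current)
                    current = []
                current.append(line)
            if current:
                records.append(current)
        paths = []
        for i in range(n_chunks):
            p = f"chunk_{i}.fasta"
            with open(p, "w") as out:
                out.writelines(line for rec in records[i::n_chunks] for line in rec)
            paths.append(p)
        return paths

    def run_blast(chunk_path):
        out = chunk_path.replace(".fasta", ".tsv")
        subprocess.run(["blastp", "-query", chunk_path, "-db", "mydb",
                        "-outfmt", "6", "-out", out], check=True)
        return out

    if __name__ == "__main__":
        chunks = split_fasta("queries.fasta", n_chunks=4)
        with ProcessPoolExecutor(max_workers=4) as pool:
            print(list(pool.map(run_blast, chunks)))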

Concepts: Bioinformatics, Biology, Parallel computing, Computer program, Computational biology, C, Source code, Exponential growth

164

We present a web service to access Ensembl data using Representational State Transfer (REST). The Ensembl REST Server enables the easy retrieval of a wide range of Ensembl data by most programming languages, using standard formats such as JSON and FASTA whilst minimising client work. We also introduce bindings to the popular Ensembl Variant Effect Predictor (VEP) tool permitting large-scale programmatic variant analysis independent of any specific programming language. Availability: The Ensembl REST API can be accessed at http://rest.ensembl.org and source code is freely available under an Apache 2.0 license from http://github.com/Ensembl/ensembl-rest.
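
A minimal Python call to the REST server looks like the sketch below. The endpoint and example gene ID follow the public Ensembl REST documentation as of this writing; field names in the JSON response should be checked against the current docs.

    # Minimal Ensembl REST example: look up a gene by stable ID and read a few
    # fields from the JSON response (verify endpoints at http://rest.ensembl.org).
    import requests

    server = "https://rest.ensembl.org"
    ext = "/lookup/id/ENSG00000157764?expand=1"   # human BRAF, used as an example

    r = requests.get(server + ext, headers={"Content-Type": "application/json"})
    r.raise_for_status()

    gene = r.json()
    print(gene.get("display_name"), gene.get("biotype"),
          gene.get("seq_region_name"), gene.get("start"), gene.get("end"))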

Concepts: Language, Computer program, Java, C, Programming language, Source code, Programmer, Compiler

131

The process of documentation in electronic health records (EHRs) is known to be time consuming, inefficient, and cumbersome. The use of dictation coupled with manual transcription has become an increasingly common practice. In recent years, natural language processing (NLP)-enabled data capture has become a viable alternative for data entry. It enables the clinician to maintain control of the process and potentially reduce the documentation burden. The question remains how this NLP-enabled workflow will impact EHR usability and whether it can meet the structured data and other EHR requirements while enhancing the user’s experience.
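
To make "structured data capture" concrete, the toy sketch below pulls a single drug/dose field out of a dictated sentence with a regular expression. It is illustrative only; production NLP-enabled EHR workflows rely on full clinical NLP pipelines rather than one pattern, and the example sentence and field names are invented.

    # Toy illustration of turning dictated free text into one structured field.
    import re

    dictation = "Start metoprolol 25 mg twice daily for hypertension."

    match = re.search(r"\b([A-Za-z]+)\s+(\d+(?:\.\d+)?)\s*(mg|mcg|g)\b", dictation)
    if match:
        structured = {"drug": match.group(1).lower(),
                      "dose": float(match.group(2)),
                      "unit": match.group(3)}
        print(structured)   # {'drug': 'metoprolol', 'dose': 25.0, 'unit': 'mg'}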

Concepts: Linguistics, Knowledge, C, Electronic health record, Programming language, Output, Natural language

66

DNA is an attractive medium to store digital information. Here we report a storage strategy, called DNA Fountain, that is highly robust and approaches the information capacity per nucleotide. Using our approach, we stored a full computer operating system, a movie, and other files with a total of 2.14 × 10⁶ bytes in DNA oligonucleotides and perfectly retrieved the information from a sequencing coverage equivalent to a single tile of Illumina sequencing. We also tested a process that can allow 2.18 × 10¹⁵ retrievals using the original DNA sample and were able to perfectly decode the data. Finally, we explored the limit of our architecture in terms of bytes per molecule and obtained a perfect retrieval from a density of 215 petabytes per gram of DNA, orders of magnitude higher than previous reports.
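
The "information capacity per nucleotide" being approached is roughly two bits per base (four possible bases). The sketch below shows only that ceiling with a naive 2-bit-per-base mapping of bytes to DNA and back; the actual DNA Fountain scheme instead uses a Luby-transform fountain code and screens candidate oligonucleotides, so treat this purely as a toy illustration.

    # Naive 2-bits-per-base byte<->DNA mapping (NOT the DNA Fountain algorithm).
    BASE_FOR_BITS = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
    BITS_FOR_BASE = {b: v for v, b in BASE_FOR_BITS.items()}

    def bytes_to_dna(data: bytes) -> str:
        bases = []
        for byte in data:
            for shift in (6, 4, 2, 0):          # 4 bases encode 1 byte
                bases.append(BASE_FOR_BITS[(byte >> shift) & 0b11])
        return "".join(bases)

    def dna_to_bytes(seq: str) -> bytes:
        out = bytearray()
        for i in range(0, len(seq), 4):
            byte = 0
            for base in seq[i:i + 4]:
                byte = (byte << 2) | BITS_FOR_BASE[base]
            out.append(byte)
        return bytes(out)

    message = b"Hi"
    encoded = bytes_to_dna(message)             # 'CAGACGGC'
    assert dna_to_bytes(encoded) == message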

Concepts: DNA, Water, Computer, C, Digital Equipment Corporation, Operating system, Process management, IBM System/360

29

Healthy aging has been associated with decreased specialization in brain function. This characterization has focused largely on describing age-accompanied differences in specialization at the level of neurons and brain areas. We expand this work to describe systems-level differences in specialization in a healthy adult lifespan sample (n = 210; 20-89 y). A graph-theoretic framework is used to guide analysis of functional MRI resting-state data and describe systems-level differences in connectivity of individual brain networks. Young adults' brain systems exhibit a balance of within- and between-system correlations that is characteristic of segregated and specialized organization. Increasing age is accompanied by decreasing segregation of brain systems. Compared with systems involved in the processing of sensory input and motor output, systems mediating “associative” operations exhibit a distinct pattern of reductions in segregation across the adult lifespan. Of particular importance, the magnitude of association system segregation is predictive of long-term memory function, independent of an individual’s age.
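
One way such segregation is commonly quantified in this line of work is as the difference between mean within-system and mean between-system correlations, normalised by the within-system mean. The sketch below computes that measure on an invented correlation matrix and system assignment; both the data and the two-system labelling are placeholders, and the exact definition should be confirmed against the paper.

    # Hedged sketch of a system segregation measure:
    # (mean within-system r - mean between-system r) / mean within-system r.
    import numpy as np

    systems = np.array([0, 0, 0, 1, 1, 1])      # 6 regions, 2 hypothetical systems
    rng = np.random.default_rng(0)
    corr = rng.uniform(0.1, 0.6, size=(6, 6))
    corr = (corr + corr.T) / 2                  # symmetrise the toy matrix
    np.fill_diagonal(corr, 1.0)

    def system_segregation(corr, systems):
        within, between = [], []
        n = len(systems)
        for i in range(n):
            for j in range(i + 1, n):           # off-diagonal pairs only
                (within if systems[i] == systems[j] else between).append(corr[i, j])
        w, b = np.mean(within), np.mean(between)
        return (w - b) / w

    print(f"segregation = {system_segregation(corr, systems):.3f}")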

Concepts: Brain, Monotonic function, Function, Convex function, Magnetic resonance imaging, Systems theory, C, Functional analysis

29

We describe MetAMOS, an open source and modular metagenomic assembly and analysis pipeline. MetAMOS represents an important step towards fully automated metagenomic analysis, starting with next-generation sequencing reads and producing genomic scaffolds, open reading frames, and taxonomic or functional annotations. MetAMOS can aid in reducing assembly errors, commonly encountered when assembling metagenomic samples, and improves taxonomic assignment accuracy while also reducing computational cost. MetAMOS can be downloaded from: https://github.com/treangen/MetAMOS.
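
The "open reading frames" step can be pictured with a toy finder that scans the forward strand of a contig in three frames for ATG-to-stop stretches, as below. MetAMOS delegates this work to dedicated gene finders, so this sketch only conveys the idea; the contig sequence and length cutoff are invented.

    # Toy ORF finder: forward strand only, three frames, ATG..stop codons.
    STOPS = {"TAA", "TAG", "TGA"}

    def find_orfs(seq, min_len=30):
        seq = seq.upper()
        orfs = []
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i
                elif codon in STOPS and start is not None:
                    if i + 3 - start >= min_len:
                        orfs.append((start, i + 3, frame))
                    start = None
        return orfs

    contig = "CCATGAAATTTGGGCCCAAATTTGGGCCCTAAGG"
    print(find_orfs(contig, min_len=12))        # [(2, 32, 2)]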

Concepts: Bioinformatics, Genomics, Metagenomics, Assembly language, Unix, C