SciCombinator

Discover the most talked about and latest scientific content & concepts.

Concept: SQL

183

An ever-growing number of molecular phylogenetic studies is being published, due in part to the advent of new techniques that allow cheap and quick DNA sequencing. Hence, the demand is increasing for relational databases with which to manage and annotate the accumulating DNA sequences, genes, voucher specimens and associated biological data. In addition, a user-friendly interface is necessary for easy integration and management of the data stored in the database back-end. Available databases allow management of a wide variety of biological data. However, most database systems are not specifically constructed as an organizational tool for researchers working in phylogenetic inference. Here we report new software that facilitates easy management of voucher and sequence data, consisting of a relational database as the back-end for a graphical user interface accessed via a web browser. The application, VoSeq, includes tools for creating molecular datasets of DNA or amino acid sequences ready to be used in common phylogenetic software such as RAxML, TNT, MrBayes and PAUP, as well as for creating tables ready for publishing. It also has inbuilt BLAST capabilities against all DNA sequences stored in VoSeq as well as sequences in NCBI GenBank. By using mash-ups and calls to web services, VoSeq allows easy integration with public services such as Yahoo! Maps, Flickr, Encyclopedia of Life (EOL) and GBIF (by generating data-dumps that can be processed with GBIF’s Integrated Publishing Toolkit).

Concepts: DNA, Molecular biology, Biology, Database, Relational database, Microsoft, SQL, Relational model

174

MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/.

Concepts: DNA, Gene, Assembly language, Unix, C, Source code, Open source, SQL

154

Alveolar echinococcosis (AE) is an endemic zoonosis in France caused by the cestode Echinococcus multilocularis. The French National Reference Centre for Alveolar Echinococcosis (CNR-EA), connected to the FrancEchino network, is responsible for recording all AE cases diagnosed in France. Administrative, epidemiological and medical information on the French AE cases may currently be considered exhaustive only at the time of diagnosis. To constitute a reference data set, an information system (IS) was developed using a relational database management system (MySQL). The current data set will evolve towards a dynamic surveillance system, including follow-up data (e.g. imaging, serology), and will be connected to environmental and parasitological data on E. multilocularis to better understand the pathogen transmission pathway. A particularly important goal is the possible interoperability of the IS with similar European and other databases abroad; this new IS could play a supporting role in the creation of new AE registries.

Concepts: Database, Cestoda, Databases, SQL, Echinococcus multilocularis, Database management system, Relational database management system, Database model

138

The value of metabolomics in translational research is undeniable, and metabolomics data are increasingly generated in large cohorts. The functional interpretation of disease-associated metabolites though is difficult, and the biological mechanisms that underlie cell type or disease-specific metabolomics profiles are oftentimes unknown. To help fully exploit metabolomics data and to aid in its interpretation, analysis of metabolomics data with other complementary omics data, including transcriptomics, is helpful. To facilitate such analyses at a pathway level, we have developed RaMP (Relational database of Metabolomics Pathways), which combines biological pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, WikiPathways, and the Human Metabolome DataBase (HMDB). To the best of our knowledge, an off-the-shelf, public database that maps genes and metabolites to biochemical/disease pathways and can readily be integrated into other existing software is currently lacking. For consistent and comprehensive analysis, RaMP enables batch and complex queries (e.g., list all metabolites involved in glycolysis and lung cancer), can readily be integrated into pathway analysis tools, and supports pathway overrepresentation analysis given a list of genes and/or metabolites of interest. For usability, we have developed a RaMP R package (https://github.com/Mathelab/RaMP-DB), including a user-friendly RShiny web application, that supports basic simple and batch queries, pathway overrepresentation analysis given a list of genes or metabolites of interest, and network visualization of gene-metabolite relationships. The package also includes the raw database file (mysql dump), thereby providing a stand-alone downloadable framework for public use and integration with other tools. In addition, the Python code needed to recreate the database on another system is also publicly available (https://github.com/Mathelab/RaMP-BackEnd). 
Updates for databases in RaMP will be checked multiple times a year and RaMP will be updated accordingly.
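
The kind of batch query described above (metabolites shared by several pathways) can be sketched as a join over a pathway-to-metabolite mapping. This is a minimal illustration only: the table and column names are hypothetical and are not RaMP's actual schema, the rows are placeholders, and an in-memory SQLite database stands in for the MySQL back-end. It also assumes one reading of the example query, namely metabolites that appear in every listed pathway.

```python
import sqlite3

# Hypothetical three-table schema; NOT the real RaMP schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pathway (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE metabolite (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE pathway_metabolite (pathway_id INTEGER, metabolite_id INTEGER);
INSERT INTO pathway VALUES (1, 'pathway A'), (2, 'pathway B');
INSERT INTO metabolite VALUES (1, 'met1'), (2, 'met2'), (3, 'met3');
INSERT INTO pathway_metabolite VALUES (1, 1), (1, 2), (2, 2), (2, 3);
""")


def metabolites_in_all(conn, pathway_names):
    """Return metabolites that belong to every one of the named pathways."""
    names = sorted(set(pathway_names))
    placeholders = ", ".join("?" for _ in names)
    sql = f"""
        SELECT m.name
        FROM metabolite AS m
        JOIN pathway_metabolite AS pm ON pm.metabolite_id = m.id
        JOIN pathway AS p ON p.id = pm.pathway_id
        WHERE p.name IN ({placeholders})
        GROUP BY m.id, m.name
        HAVING COUNT(DISTINCT p.id) = ?
        ORDER BY m.name
    """
    # Values are passed as bound parameters; only the "?" markers are interpolated.
    rows = conn.execute(sql, [*names, len(names)]).fetchall()
    return [name for (name,) in rows]


shared = metabolites_in_all(conn, ["pathway A", "pathway B"])
```

With the placeholder rows, only `met2` is linked to both pathways, so `shared` holds that single name. The `HAVING COUNT(DISTINCT ...)` clause is what turns a plain membership lookup into an "in all of these pathways" query.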

Concepts: Database, Relational database, Relational algebra, Databases, SQL, Relational model, Database theory, Relation

28

With the advancement of pharmaceutical development, drug interactions have become increasingly complex. As a result, a computer-based drug interaction search system is required to organize the full body of drug interaction data. To overcome problems with existing systems, we developed a drug interaction search system using a hash table, which offers higher processing speeds and easier maintenance than relational databases (RDB). To compare the search speed of our system with that of a MySQL RDB, drug interaction searches were repeated for all 45 possible combinations of two drugs out of a group of 10, for two cases: data sets of 5,604 and 56,040 drug interaction records. As the principal result, our system processed the searches approximately 19 times faster than the system using the MySQL RDB. Our system has several other merits; for example, drug interaction data can be created in comma-separated value (CSV) format, which facilitates data maintenance. Although our system uses the well-known hash table method, it is expected to resolve problems common to existing systems and to be an effective system that enables the safe management of drugs.
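
A hash-table lookup of the kind described can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the drug names, the notes, and the function names `build_index` and `find_interactions` are all hypothetical. Each unordered pair of drugs is used as a dictionary key, and every pair drawn from a group of 10 drugs (45 pairs) is checked against it.

```python
from itertools import combinations

# Hypothetical interaction records, shaped like rows of a CSV file
# (drug_a, drug_b, note). Names and notes are placeholders.
RECORDS = [
    ("drug01", "drug02", "placeholder note A"),
    ("drug03", "drug07", "placeholder note B"),
    ("drug05", "drug10", "placeholder note C"),
]


def build_index(records):
    """Map each unordered drug pair to its interaction note."""
    return {frozenset((a, b)): note for a, b, note in records}


def find_interactions(index, drugs):
    """Look up every pair of the given drugs in the hash table."""
    hits = []
    for a, b in combinations(sorted(drugs), 2):
        note = index.get(frozenset((a, b)))
        if note is not None:
            hits.append((a, b, note))
    return hits


index = build_index(RECORDS)
drugs = [f"drug{i:02d}" for i in range(1, 11)]   # a group of 10 drugs
pair_count = len(list(combinations(drugs, 2)))    # 10 choose 2 = 45 pairs
hits = find_interactions(index, drugs)
```

Using a `frozenset` as the key makes the lookup independent of the order in which the two drugs are given, and each lookup takes constant time on average regardless of how many records the table holds.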

Concepts: Pharmacology, Drugs, Pharmaceutical drug, Relational database, Searching, SQL, Relational model, Relation

27

Targeted sequencing using next-generation sequencing technologies is currently being rapidly adopted for clinical sequencing and cancer marker tests. However, no existing bioinformatics tool is available for the analysis and visualization of multiple targeted sequencing datasets. In the present study, we use cancer panel targeted sequencing datasets generated by the Life Technologies Ion Personal Genome Machine (PGM) Sequencer as an example to illustrate how to develop an automated pipeline for the comparative analyses of multiple datasets. Cancer Panel Analysis Pipeline (CPAP) uses standard output files from variant calling software to generate a distribution map of SNPs among all of the samples in a circular diagram generated by Circos. The diagram is hyper-linked to a dynamic HTML table that allows the users to identify target SNPs by using different filters. CPAP also integrates additional information about the identified SNPs by linking to an integrated SQL database compiled from SNP-related databases, including dbSNP, 1000 Genomes Project, COSMIC and dbNSFP. CPAP only takes 17 minutes to complete a comparative analysis of 500 datasets. CPAP not only provides an automated platform for the analysis of multiple cancer panel datasets but can also serve as a model for any customized targeted sequencing project.

Concepts: DNA, Bioinformatics, Molecular biology, Database, All rights reserved, Copyright, SQL, HTML

27

Our objective was to develop a software application that allows us to easily manage a portable database of information on radiopharmaceutical interactions with drugs or other agents and on radiopharmaceutical adverse effects. Methods: The application was developed and compiled with a commercially available data management system and programming language. All data entered into the database came from the scientific literature and were accompanied by their bibliographic references. Results: We developed the database, which we have called Datinrad. To date, it contains 275 drug interactions and 44 records of adverse reactions to radiopharmaceuticals. Conclusion: Datinrad contains all the information published to date on drug-radiopharmaceutical interactions and adverse effects of radiopharmaceuticals and allows users to introduce new data from future publications. The collection of these data and their easy availability to all nuclear medicine personnel will be useful in the recognition of a possible adverse reaction or drug interaction that may alter the radiopharmaceutical biodistribution and lead to a misdiagnosis. This open-access database application is available free of charge in both English and Spanish at www.radiopharmacy.net.

Concepts: Pharmacology, Database, Drugs, Computer program, Adverse drug reaction, Computer software, Application software, SQL

25

Light fields (LFs) have been shown to enable photorealistic visualization of complex scenes. In practice, however, an LF tends to have a relatively small angular range or spatial resolution, which limits the scope of virtual navigation. In this paper, we show how seamless virtual navigation can be enhanced by stitching multiple LFs. Our technique consists of two key components: LF registration and LF stitching. To register LFs, we use what we call the ray-space motion matrix (RSMM) to establish pairwise ray-ray correspondences. Using Plücker coordinates, we show that the RSMM is a 5 × 6 matrix, which reduces to a 5 × 5 matrix under pure translation and/or in-plane rotation. The final LF stitching is done using multi-resolution, high-dimensional graph-cut in order to account for possible scene motion, imperfect RSMM estimation, and/or undersampling. We show how our technique allows us to create LFs with various enhanced features: extended horizontal and/or vertical field-of-view, larger synthetic aperture and defocus blur, and larger parallax.

Concepts: Optics, Virtual, The Final, Focus, SQL, Astronomical interferometer, Aperture synthesis, Angular resolution

23

Although materials and engineered surfaces are broadly utilized in creating assays and devices with wide applications in diagnostics, preservation of these immuno-functionalized surfaces on microfluidic devices remains a significant challenge to create reliable repeatable assays that would facilitate patient care in resource-constrained settings at the point-of-care (POC), where reliable electricity and refrigeration are lacking. To address this challenge, we present an innovative approach to stabilize surfaces on-chip with multiple layers of immunochemistry. The functionality of microfluidic devices using the presented method is evaluated at room temperature for up to 6-month shelf life. We integrated the preserved microfluidic devices with a lensless complementary metal oxide semiconductor (CMOS) imaging platform to count CD4(+) T cells from a drop of unprocessed whole blood targeting applications at the POC such as HIV management and monitoring. The developed immunochemistry stabilization method can potentially be applied broadly to other diagnostic immuno-assays such as viral load measurements, chemotherapy monitoring, and biomarker detection for cancer patients at the POC.

Concepts: HIV, Immune system, Blood, Patient, Engineering, Integrated circuit, CMOS, SQL

18

Reactome is a free, open-source, open-data, curated and peer-reviewed knowledgebase of biomolecular pathways. One of its main priorities is to provide easy and efficient access to its high quality curated data. At present, biological pathway databases typically store their contents in relational databases. This limits access efficiency because there are performance issues associated with queries traversing highly interconnected data. The same data in a graph database can be queried more efficiently. Here we present the rationale behind the adoption of a graph database (Neo4j) as well as the new ContentService (REST API) that provides access to these data. The Neo4j graph database and its query language, Cypher, provide efficient access to the complex Reactome data model, facilitating easy traversal and knowledge discovery. The adoption of this technology greatly improved query efficiency, reducing the average query time by 93%. The web service built on top of the graph database provides programmatic access to Reactome data by object oriented queries, but also supports more complex queries that take advantage of the new underlying graph-based data storage. By adopting graph database technology we are providing a high performance pathway data resource to the community. The Reactome graph database use case shows the power of NoSQL database engines for complex biological data types.
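
The difference between join-based and traversal-based access to interconnected data can be illustrated with a toy example. The sketch below is plain Python and makes no claim about how Neo4j, Cypher, or Reactome work internally: the node names and the "contains" edges are hypothetical. It finds all descendants of a node twice, once by rescanning an edge table for every level (loosely mimicking repeated self-joins without an index) and once by following stored neighbor links.

```python
from collections import defaultdict, deque

# Hypothetical "contains" edges as (parent, child) pairs.
EDGES = [
    ("P1", "P2"), ("P1", "P3"), ("P2", "R1"),
    ("P3", "R2"), ("P3", "P4"), ("P4", "R3"),
]


def descendants_by_joins(edges, root):
    """Relational style: one pass over the whole edge table per level."""
    found, frontier = set(), {root}
    while frontier:
        frontier = {c for p, c in edges if p in frontier} - found
        found |= frontier
    return found


def descendants_by_traversal(adjacency, root):
    """Graph style: follow the stored neighbor links of each visited node."""
    found, queue = set(), deque([root])
    while queue:
        for child in adjacency[queue.popleft()]:
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


# Build the adjacency structure once, up front.
adjacency = defaultdict(list)
for parent, child in EDGES:
    adjacency[parent].append(child)

via_joins = descendants_by_joins(EDGES, "P1")
via_graph = descendants_by_traversal(adjacency, "P1")
```

Both functions return the same set of nodes. The join-style version touches every edge once per level of depth, while the traversal touches only the edges that leave visited nodes, which is the intuition behind the efficiency argument for deeply nested data.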

Concepts: Database, Relational database, Data management, Relational algebra, Databases, SQL, Relational model, Relation