Journal: Source code for biology and medicine
Although published material exists about the skills required for a successful bioinformatics career, strangely enough no work to date has addressed the matter of how to excel at not being a bioinformatician. A set of basic guidelines and a code of conduct is hereby presented to re-address that imbalance for fellow-practitioners whose aim is to not to succeed in their chosen bioinformatics field. By scrupulously following these guidelines one can be sure to regress at a highly satisfactory rate.
BACKGROUND: The whole-genome sequences of many non-model organisms have recently been determined. Using these genome sequences, next-generation sequencing based experiments such as RNA-seq and ChIP-seq have been performed and comparisons of the experiments between related species have provided new knowledge about evolution and biological processes. Although these comparisons require transformation of the genome coordinates of the reads between the species, current software tools are not suitable to convert the massive numbers of reads to the corresponding coordinates of other species' genomes. RESULTS: Here, we introduce a set of programs, called REad COordinate Transformer (RECOT), created to transform the coordinates of short reads obtained from the genome of a query species being studied to that of a comparison target species after aligning the query and target gene/genome sequences. RECOT generates output in SAM format that can be viewed using recent genome browsers capable of displaying next-generation sequencing data. CONCLUSIONS: We demonstrate the usefulness of RECOT in comparing ChIP-seq results between two closely-related fruit flies. The results indicate position changes of a transcription factor binding site caused sequence polymorphisms at the binding site.
Background Reproducibility is the hallmark of good science.Maintaining a high degree of transparency in scientific reporting isessential not just for gaining trust and credibility within thescientific community but also for facilitating the development of newideas. Sharing data and computer code associated with publications isbecoming increasingly common, motivated partly in response to datadeposition requirements from journals and mandates from funders. Despitethis increase in transparency, it is still difficult to reproduce orbuild upon the findings of most scientific publications without accessto a more complete workflow.Findings Version control systems (VCS), which have long beenused to maintain code repositories in the software industry, are nowfinding new applications in science. One such open source VCS, git,provides a lightweight yet robust framework that is ideal for managingthe full suite of research outputs such as datasets, statistical code,figures, lab notes, and manuscripts. For individual researchers, gitprovides a powerful way to track and compare versions, retrace errors,explore new approaches in a structured manner, while maintaining a fullaudit trail. For larger collaborative efforts, git and git hostingservices make it possible for everyone to work asynchronously and mergetheir contributions at any time, all the while maintaining a completeauthorship trail. In this paper I provide an overview of git along withuse-cases that highlight how this tool can be leveraged to make sciencemore reproducible and transparent, foster new collaborations, andsupport novel uses.
Next-generation sequencing can determine DNA bases and the results of sequence alignments are generally stored in files in the Sequence Alignment/Map (SAM) format and the compressed binary version (BAM) of it. SAMtools is a typical tool for dealing with files in the SAM/BAM format. SAMtools has various functions, including detection of variants, visualization of alignments, indexing, extraction of parts of the data and loci, and conversion of file formats. It is written in C and can execute fast. However, SAMtools requires an additional implementation to be used in parallel with, for example, OpenMP (Open Multi-Processing) libraries. For the accumulation of next-generation sequencing data, a simple parallelization program, which can support cloud and PC cluster environments, is required.
High-throughput primer design is routinely performed in a wide number of molecular applications including genotyping specimens using traditional PCR techniques as well as assembly PCR, nested PCR, and primer walking experiments. Batch primer design is also required in validation experiments from RNA-seq transcriptome sequencing projects, as well as in generating probes for microarray experiments. The growing popularity of next generation sequencing and microarray technology has created a greater need for more primer design tools to validate large numbers of candidate genes and markers.
The suffix -ome conveys “comprehensiveness” in some way. The idea of the Corpasome started half-jokingly, acknowledging the efforts to sequence five members of my family. After the unexpected response from many scientists from around the world, it has become clear how useful this approach could be for understanding the genomic information contained in our personal genomics tests.
Rare disease registries (RDRs) are an essential tool to improve knowledge and monitor interventions for rare diseases. If designed appropriately, patient and disease related information captured within them can become the cornerstone for effective diagnosis and new therapies. Surprisingly however, registries possess a diverse range of functionality, operate in different, often-times incompatible, software environments and serve various, and sometimes incongruous, purposes. Given the ambitious goals of the International Rare Diseases Research Consortium (IRDiRC) by 2020 and beyond, RDRs must be designed with the agility to evolve and efficiently interoperate in an ever changing rare disease landscape, as well as to cater for rapid changes in Information Communication Technologies. In this paper, we contend that RDR requirements will also evolve in response to a number of factors such as changing disease definitions and diagnostic criteria, the requirement to integrate patient/disease information from advances in either biotechnology and/or phenotypying approaches, as well as the need to adapt dynamically to security and privacy concerns. We dispel a number of myths in RDR development, outline key criteria for robust and sustainable RDR implementation and introduce the concept of a RDR Checklist to guide future RDR development.
BACKGROUND: The graph-theoretical analysis of molecular networks has a long tradition in chemoinformatics. Asdemonstrated frequently, a well designed format to encode chemical structures and structure-relatedinformation of organic compounds is the Molfile format. But when it comes to use modernprogramming languages for statistical data analysis in Bio- and Chemoinformatics, R as one of themost powerful free languages lacks tools to process R Molfile data collections and import molecularnetwork data into R. RESULTS: We design an R object which allows a lossless information mapping of structural information fromMolfiles into R objects. This provides the basis to use the RMol object as an anchor for connectingMolfile data collections with R libraries for analyzing graphs. Associated with the RMol objects, aset of R functions completes the toolset to organize, describe and manipulate the converted data sets.Further, we bypass R-typical limits for manipulating large data sets by storing R objects inbz-compressed serialized files instead of employing RData files. CONCLUSIONS: By design, RMol is a R tool set without dependencies to other libraries or programming languages. Itis useful to integrate into pipelines for serialized batch analysis by using network data and, therefore,helps to process sdf-data sets in R effeciently. It is freely available under the BSD licence. The scriptsource can be downloaded from http://sourceforge.net/p/rmol-toolset.
Over-representation analysis (ORA) detects enrichment of genes within biological categories. Gene Ontology (GO) domains are commonly used for gene/gene-product annotation. When ORA is employed, often times there are hundreds of statistically significant GO terms per gene set. Comparing enriched categories between a large number of analyses and identifying the term within the GO hierarchy with the most connections is challenging. Furthermore, ascertaining biological themes representative of the samples can be highly subjective from the interpretation of the enriched categories.
Matched sequencing of both tumor and normal tissue is routinely used to classify variants of uncertain significance (VUS) into somatic vs. germline. However, assays used in molecular diagnostics focus on known somatic alterations in cancer genes and often only sequence tumors. Therefore, an algorithm that reliably classifies variants would be helpful for retrospective exploratory analyses. Contamination of tumor samples with normal cells results in differences in expected allelic fractions of germline and somatic variants, which can be exploited to accurately infer genotypes after adjusting for local copy number. However, existing algorithms for determining tumor purity, ploidy and copy number are not designed for unmatched short read sequencing data.