Journal: Genomics, Proteomics & Bioinformatics
Presumptive identification of Enterobacteriaceae species is routinely based on their biochemical properties. Traditional practice involves manually comparing each biochemical property of an unknown sample with those of known reference samples and inferring its identity from the reference sample with the maximum similarity. This process is labor-intensive, time-consuming, error-prone, and subjective. Therefore, automation of sorting and similarity calculation would be advantageous. Here we present a MATLAB-based graphical user interface (GUI) tool named BioCluster. This tool was designed for automated clustering and identification of Enterobacteriaceae based on biochemical test results. It implements two algorithms: traditional hierarchical clustering (HC) and Improved Hierarchical Clustering (IHC), a modified algorithm developed specifically for the clustering and identification of Enterobacteriaceae species. IHC takes into account the variability in the results of 1-47 biochemical tests within the Enterobacteriaceae family. The tool also provides different options to optimize the clustering in a user-friendly way. Using computer-generated synthetic data and some real data, we have demonstrated that BioCluster has high accuracy in clustering and identifying enterobacterial species based on biochemical test data. This tool can be freely downloaded at http://microbialgen.du.ac.bd/biocluster/.
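The manual maximum-similarity comparison described in the abstract above can be sketched in a few lines of Python. This is only an illustration of the general idea, not BioCluster's HC or IHC algorithm: the simple matching coefficient, the function names, the placeholder species names, and the five-test profiles are all assumptions made for the example.

```python
def simple_matching(a, b):
    """Fraction of tests on which two profiles give the same result."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same tests")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def identify(unknown, references):
    """Return (name, similarity) of the reference profile closest to `unknown`."""
    scored = ((name, simple_matching(unknown, profile))
              for name, profile in references.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical data: 1 = positive, 0 = negative result for five made-up tests.
references = {
    "species_A": [1, 0, 1, 1, 0],
    "species_B": [0, 0, 1, 0, 1],
    "species_C": [1, 1, 0, 1, 0],
}
unknown = [1, 0, 1, 0, 0]

best_name, best_score = identify(unknown, references)
print(best_name, best_score)
```

With these made-up profiles the unknown agrees with species_A on four of the five tests, so species_A is reported as the closest match.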
Recently, non-coding RNA (ncRNA) genes have been found to serve many important functions in the cell, such as regulation of gene expression at the transcriptional level. Potentially, many more ncRNA molecules are yet to be found, and their possible functions remain to be revealed. The discovery of ncRNAs is a difficult task because they lack sequence indicators such as the start and stop codons displayed by protein-coding RNAs. Current methods utilize either sequence motifs or structural parameters to detect novel ncRNAs within genomes. Here, we present an ab initio ncRNA finder, named ncRNAscout, that utilizes both sequence motifs and structural parameters. Specifically, our method has three components: (i) a measure of the frequency of a sequence, (ii) a measure of the structural stability of a sequence contained in a t-score, and (iii) a measure of the frequency of certain patterns within a sequence that may indicate the presence of ncRNA. Experimental results show that, given a genome and a set of known ncRNAs, our method is able to accurately identify and locate a significant number of ncRNA sequences in the genome. The ncRNAscout tool is available for download at http://bioinformatics.njit.edu/ncRNAscout.
For the past few decades, intensive studies have been carried out in an attempt to understand how the amino acid sequences of proteins encode the three-dimensional structures that perform their specific functions. To understand the sequence-structure relationship of proteins, several sub-sequence search studies have been undertaken in non-redundant sequence-structure databases, and these have given some fruitful clues. In our earlier work, we analyzed a set of 3124 non-redundant protein sequences from the Protein Data Bank (PDB) and retrieved 30 identical octapeptides having different secondary structures. These octapeptides were characterized by using different computational procedures. This prompted us to explore the presence of octapeptides with reverse sequences and to analyze whether these octapeptides would adopt structures similar to those of their parent octapeptides. Our identical reverse octapeptide search found eight octapeptide pairs (octapeptide and reverse octapeptide) with similar secondary structure and 23 octapeptide pairs with different secondary structure. In the present work, the geometrical and biophysical characteristics of identical reverse octapeptides were explored and compared with unrelated octapeptide pairs by using various computational tools. We thus conclude that proteins containing identical reverse octapeptides are not very abundant and residues in the octapeptide pairs do not contribute to the stability of the protein. Furthermore, compared to unrelated octapeptides, identical reverse octapeptides do not show certain biophysical and geometrical properties.
Nanopore-based sequencers, as the fourth-generation DNA sequencing technology, have the potential to quickly and reliably sequence the entire human genome for less than $1000, and possibly for even less than $100. The single-molecule techniques used by this technology allow us to further study the interaction between DNA and protein, as well as between protein and protein. Nanopore analysis opens a new door to molecular biology investigation at the single-molecule scale. In this article, we have reviewed academic achievements in nanopore technology from the past as well as the latest advances, including both biological and solid-state nanopores, and discussed their recent and potential applications.
Single-molecule, real-time sequencing developed by Pacific Biosciences (PacBio) offers longer read lengths than the second-generation sequencing (SGS) technologies, making it well-suited for unsolved problems in genome, transcriptome, and epigenetics research. The highly contiguous de novo assemblies using PacBio sequencing can close gaps in current reference assemblies and characterize structural variation (SV) in personal genomes. With longer reads, we can sequence through extended repetitive regions and detect mutations, many of which are associated with diseases. Moreover, PacBio transcriptome sequencing is advantageous for the identification of gene isoforms and facilitates reliable discoveries of novel genes and novel isoforms of annotated genes, due to its ability to sequence full-length transcripts or fragments with significant lengths. Additionally, PacBio's sequencing technique provides information that is useful for the direct detection of base modifications, such as methylation. In addition to using PacBio sequencing alone, many hybrid sequencing strategies have been developed to make use of more accurate short reads in conjunction with PacBio long reads. In general, hybrid sequencing strategies are more affordable and scalable than using PacBio sequencing alone, especially for small laboratories. The advent of PacBio sequencing has made available much information that could not be obtained via SGS alone.
Variation of maternal gut microbiota may increase the risk of autism spectrum disorders (ASDs) in offspring. Animal studies have indicated that maternal gut microbiota is related to neurodevelopmental abnormalities in mouse offspring, while it is unclear whether there is a correlation between gut microbiota of ASD children and their mothers. We examined the relationships between gut microbiome profiles of ASD children and those of their mothers, and evaluated the clinical discriminatory power of discovered bacterial biomarkers. Gut microbiome was profiled and evaluated by 16S ribosomal RNA gene sequencing in stool samples of 59 mother-child pairs of ASD children and 30 matched mother-child pairs of healthy children. Significant differences were observed in the gut microbiome composition between ASD and healthy children in our Chinese cohort. Several unique bacterial biomarkers, such as Alcaligenaceae and Acinetobacter, were identified. Mothers of ASD children had more Proteobacteria, Alphaproteobacteria, Moraxellaceae, and Acinetobacter than mothers of healthy children. There was a clear correlation between gut microbiome profiles of children and their mothers; however, children with ASD still had unique bacterial biomarkers, such as Alcaligenaceae, Enterobacteriaceae, and Clostridium. Candidate biomarkers discovered in this study had remarkable discriminatory power. The identified patterns of mother-child gut microbiome profiles may be important for assessing risks during the early stage and planning of personalized treatment and prevention of ASD via microbiota modulation.
The revolution of genome sequencing is continuing after the successful second-generation sequencing (SGS) technology. The third-generation sequencing (TGS) technology, led by Pacific Biosciences (PacBio), is progressing rapidly, moving from a technology once only capable of providing data for small genome analysis, or for performing targeted screening, to one that promises high-quality de novo assembly and structural variation detection for human-sized genomes. In 2014, the MinION, the first commercial sequencer using nanopore technology, was released by Oxford Nanopore Technologies (ONT). MinION identifies DNA bases by measuring the changes in electrical conductivity generated as DNA strands pass through a biological pore. Its portability, affordability, and speed in data production make it suitable for real-time applications, and the release of this long-read sequencer has thus generated much excitement and interest in the genomics community. Whilst de novo genome assemblies can be cheaply produced from SGS data, assembly continuity is often relatively poor, due to the limited ability of short reads to handle long repeats. Assembly quality can be greatly improved by using TGS long reads, since longer reads can span repetitive regions, despite having higher error rates at the base level. The potential of nanopore sequencing has been demonstrated by various studies in genome surveillance at locations where rapid and reliable sequencing is needed, but where resources are limited.
The rapid development of high-throughput sequencing technologies has led to a dramatic decrease in the money and time required for de novo genome sequencing or genome resequencing projects, with new genome sequences released every week. In such projects, the plethora of updated genome assemblies creates a need for version-dependent annotation files and other compatible public datasets for downstream analysis. To handle these tasks in an efficient manner, we developed the reference-based genome assembly and annotation tool (RGAAT), a flexible toolkit for resequencing-based consensus building and annotation update. RGAAT can detect sequence variants with comparable precision, specificity, and sensitivity to GATK and with higher precision and specificity than Freebayes and SAMtools on four DNA-seq datasets tested in this study. RGAAT can also identify sequence variants based on cross-cultivar or cross-version genomic alignments. Unlike GATK and SAMtools/BCFtools, RGAAT builds the consensus sequence by taking into account the true allele frequency. Finally, RGAAT generates a coordinate conversion file between the reference and query genomes using sequence variants and supports annotation file transfer. Compared to the rapid annotation transfer tool (RATT), RGAAT displays better performance characteristics for annotation transfer between different genome assemblies, strains, and species. In addition, RGAAT can be used for genome modification, genome comparison, and coordinate conversion. RGAAT is available at https://sourceforge.net/projects/rgaat/ and https://github.com/wushyer/RGAAT_v2 at no cost.
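The idea of converting coordinates between a reference and a query genome using sequence variants can be illustrated with a minimal Python sketch. It is not RGAAT's implementation or file format: the `(position, ref_allele, alt_allele)` tuples, the 1-based convention, and the function names are assumptions for illustration, and the variants are assumed to be non-overlapping.

```python
from bisect import bisect_right


def build_offsets(variants):
    """Summarise indels as (ends, shifts).

    `variants` holds hypothetical (ref_pos, ref_allele, alt_allele) tuples with
    1-based positions. Reference positions after ends[i] are shifted by
    shifts[i] in the query, until the next indel takes over.
    """
    ends, shifts = [], []
    shift = 0
    for pos, ref, alt in sorted(variants):
        delta = len(alt) - len(ref)
        if delta == 0:
            continue  # substitutions do not move downstream coordinates
        shift += delta
        ends.append(pos + len(ref) - 1)  # last reference base of the allele
        shifts.append(shift)
    return ends, shifts


def ref_to_query(pos, ends, shifts):
    """Map a 1-based reference position to the query genome.

    Positions inside a deleted allele have no true counterpart; for them the
    returned value is only an approximation.
    """
    i = bisect_right(ends, pos - 1)  # indels whose allele ends before `pos`
    return pos + (shifts[i - 1] if i else 0)


# Hypothetical variants: a 2-base insertion after position 5,
# a substitution at 8, and a 2-base deletion of positions 11-12.
variants = [(5, "A", "ATT"), (8, "C", "T"), (10, "GCC", "G")]
ends, shifts = build_offsets(variants)

for p in (5, 6, 10, 13):
    print(p, "->", ref_to_query(p, ends, shifts))
```

Storing cumulative shifts at indel boundaries keeps each lookup to a single binary search, which is one simple way to make such a conversion table reusable across many annotation features.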
DNA methylation is an important epigenetic mark that plays a vital role in gene expression and cell differentiation. The average DNA methylation level among a group of cells has been extensively documented. However, the cell-to-cell heterogeneity in DNA methylation, which reflects the differentiation of epigenetic status among cells, remains less investigated. Here we established a gold standard of the cell-to-cell heterogeneity in DNA methylation based on single-cell bisulfite sequencing (BS-seq) data. With that we optimized a computational pipeline for estimating the heterogeneity in DNA methylation from bulk BS-seq data. We further built HeteroMeth, a database for searching, browsing, visualizing, and downloading the data for heterogeneity in DNA methylation for a total of 141 samples in humans, mice, Arabidopsis, and rice. Three genes are used as examples to illustrate the power of HeteroMeth in the identification of unique features in DNA methylation. The optimization of the computational strategy and the construction of the database in this study complement the recent experimental attempts on single-cell DNA methylomes and will facilitate the understanding of epigenetic mechanisms underlying cell differentiation and embryonic development. HeteroMeth is publicly available at http://qianlab.genetics.ac.cn/HeteroMeth.
Circular RNAs (circRNAs) from back-splicing of exon(s) have recently been found to be broadly expressed in eukaryotes, in tissue- and species-specific manners. Although the functions of most circRNAs remain elusive, some circRNAs have been shown to function in the regulation of gene expression and to be potentially related to diseases. Due to their stability, circRNAs can also be used as biomarkers for diagnosis. Profiling circRNAs by integrating their expression among different samples thus provides a molecular basis for further functional study of circRNAs and their potential clinical application. Here, we report CIRCpedia v2, an updated database for comprehensive circRNA annotation from over 180 RNA-seq datasets across six different species. This atlas allows users to search, browse, and download circRNAs with expression features in various cell types/tissues, including disease samples. In addition, the updated database incorporates conservation analysis of circRNAs between humans and mice. Finally, the web interface also contains computational tools to compare circRNA expression among samples. CIRCpedia v2 is accessible at http://www.picb.ac.cn/rnomics/circpedia.