Concept: DNA binding site
Gene regulatory networks are ultimately encoded by the sequence-specific binding of transcription factors (TFs) to short DNA segments. Although it is customary to represent the binding specificity of a TF by a position-specific weight matrix (PSWM), which assumes each position within a site contributes independently to the overall binding affinity, evidence has been accumulating that there can be significant dependencies between positions. Unfortunately, methodological challenges have so far hindered the development of a practical and generally accepted extension of the PSWM model. On the one hand, simple models that only consider dependencies between nearest-neighbor positions are easy to use in practice, but fail to account for the distal dependencies that are observed in the data. On the other hand, models that allow for arbitrary dependencies are prone to overfitting, requiring regularization schemes that are difficult for non-experts to use in practice. Here we present a new regulatory motif model, called dinucleotide weight tensor (DWT), that incorporates arbitrary pairwise dependencies between positions in binding sites, rigorously from first principles, and free from tunable parameters. We demonstrate the power of the method on a large set of ChIP-seq datasets, showing that DWTs outperform both PSWMs and motif models that only incorporate nearest-neighbor dependencies. We also demonstrate that DWTs outperform two previously proposed methods. Finally, we show that DWTs inferred from ChIP-seq data also outperform PSWMs on HT-SELEX data for the same TF, suggesting that DWTs capture inherent biophysical properties of the interactions between the DNA binding domains of TFs and their binding sites. We make a suite of DWT tools available at dwt.unibas.ch, which allow users to automatically perform ‘motif finding’, i.e. the inference of DWT motifs from a set of sequences, binding site prediction with DWTs, and visualization of DWT ‘dilogo’ motifs.
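The independence assumption behind a PSWM can be made concrete with a small sketch. The code below is not the DWT method and is not taken from the DWT tools; it is a minimal illustration, assuming a uniform background, a pseudocount of 1, and a handful of made-up aligned sites. The function names `build_pwm`, `score` and `best_hit` are hypothetical. Because the score of a segment is a plain sum of per-position weights, the model cannot express any dependency between positions.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Build log2-odds weights per position from aligned sites.

    Assumes a uniform background (0.25 per base); this is an
    illustrative choice, not something stated in the abstract.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in BASES})
    return pwm

def score(pwm, segment):
    """Sum of independent per-position weights (the PSWM assumption)."""
    return sum(column[base] for column, base in zip(pwm, segment))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in sequence."""
    width = len(pwm)
    hits = [(score(pwm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)

# Made-up example sites, for illustration only.
example_sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
example_pwm = build_pwm(example_sites)
```

Calling `best_hit(example_pwm, "GGGTATAATCCC")` scans every window of the sequence and returns the score and start offset of the best one. A model with pairwise dependencies would replace the per-position sum with terms that depend on pairs of positions.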
Toll-like receptor 9 (TLR9) recognizes DNA containing CpG motifs derived from bacteria and viruses and activates the innate immune response to eliminate them. TLR9 is known to bind to CpG DNA, and here, we identified another DNA binding site in TLR9 that binds DNA containing cytosine at the second position from the 5' end (5'-xCx DNA). 5'-xCx DNAs bound to TLR9 in the presence of CpG DNA and cooperatively promoted dimerization and activation of TLR9. Binding at both sites was important for efficient activation of TLR9. The 5'-xCx DNA bound the site corresponding to the nucleoside binding site in TLR7 and TLR8 as revealed by the structural analysis. This study revealed that TLR9 recognizes two types of DNA through its two binding sites for efficient activation. This information may contribute to the development of drugs that control the activity of TLR9.
Identifying transcription factor (TF) binding sites (TFBSs) is important in the computational inference of gene regulation. Widely used computational methods of TFBS prediction based on position weight matrices (PWMs) usually have high false positive rates. Moreover, computational studies of transcription regulation in eukaryotes frequently require numerous PWM models of TFBSs due to the large number of TFs involved. To overcome these problems we developed DRAF, a novel method for TFBS prediction that requires only 14 prediction models for 232 human TFs, while at the same time significantly improving prediction accuracy. DRAF models use more features than PWM models, as they combine information from TFBS sequences and physicochemical properties of TF DNA-binding domains into machine learning models. Evaluation of DRAF on 98 human ChIP-seq datasets shows on average 1.54-, 1.96- and 5.19-fold reduction of false positives at the same sensitivities compared to models from HOCOMOCO, TRANSFAC and DeepBind, respectively. This observation suggests that one can efficiently replace the PWM models for TFBS prediction by a small number of DRAF models that significantly improve prediction accuracy. The DRAF method is implemented in a web tool and as stand-alone software freely available at http://cbrc.kaust.edu.sa/DRAF.
The position weight matrix (PWM) and the sequence logo are the most widely used representations of transcription factor binding sites (TFBSs) in biological sequences. The sequence logo, a graphical representation of a PWM, has been widely used in scientific publications and reports because it is easy for humans to read, rich in information, and simple in format. In contrast, the PWM serves as a precise and compact numerical form that can be used directly by a variety of motif analysis software. A few tools are available to generate sequence logos from PWMs; however, no tool does the reverse. A tool that converts a sequence logo back to a PWM is needed in order to scan for a TFBS that is shown as a logo in a publication where the PWM is not provided or is hard to obtain. A major difficulty in developing such a tool is dealing with the diversity of sequence logo images.
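The numerical relationship between a logo and a PWM can be sketched as follows. This is only the arithmetic step, assuming the common logo convention in which a column's total height is its information content (2 bits minus the Shannon entropy, with no small-sample correction) and each letter's height is its probability times that total. It does not address the image-processing problem described above, and the function names are hypothetical.

```python
import math

BASES = "ACGT"

def column_information(probs):
    """Information content in bits: 2 - Shannon entropy of the column."""
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return 2.0 - entropy

def logo_heights(ppm):
    """Letter heights per column for a position probability matrix."""
    heights = []
    for column in ppm:
        ic = column_information(column)
        heights.append({b: p * ic for b, p in column.items()})
    return heights

def heights_to_probs(column_heights):
    """Recover base probabilities from letter heights in one column.

    A column of zero total height carries no information, so it is
    mapped to the uniform distribution.
    """
    total = sum(column_heights.values())
    if total == 0:
        return {b: 0.25 for b in BASES}
    return {b: h / total for b, h in column_heights.items()}
```

Under these assumptions the conversion is invertible column by column: dividing each letter height by the column's total height returns the original probabilities, which is why measured letter heights are in principle enough to rebuild a matrix.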
Distance measurements by pulse EPR techniques, such as double electron-electron resonance (DEER, also called PELDOR), have become an established tool to explore structural properties of bio-macromolecules and their assemblies. In such measurements a pair of spin labels provides a single distance constraint. Here we show that by employing three different types of spin labels that differ in their spectroscopic and spin dynamics properties it is possible to extract three independent distances from a single sample. We demonstrate this using the Antennapedia homeodomain orthogonally labeled with Gd3+ and Mn2+ tags in complex with its cognate DNA binding site labeled with a nitroxide.
Recent studies have shown that the traditional position weight matrix model is often insufficient for modeling transcription factor binding sites, as intra-motif dependencies play a significant role in an accurate description of binding motifs. Here, we present the Java application InMoDe, a collection of tools for learning, leveraging and visualizing such dependencies of putative higher order. The distinguishing feature of InMoDe is a robust model selection from a class of parsimonious models, taking dependencies into account only if justified by the data and opting for simplicity otherwise.
- IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM
- Published over 3 years ago
Through sequence-based classification, this paper tries to accurately predict the DNA binding sites of transcription factors (TFs) in an unannotated cellular context. Related methods in the literature fail to perform such predictions accurately, since they do not consider the sample distribution shift of sequence segments from an annotated (source) context to an unannotated (target) context. We, therefore, propose a method called “Transfer String Kernel” (TSK) that achieves improved prediction of transcription factor binding sites (TFBSs) using knowledge transfer via cross-context sample adaptation. TSK maps sequence segments to a high-dimensional feature space using a discriminative mismatch string kernel framework. In this high-dimensional space, labeled examples of the source context are re-weighted so that the revised sample distribution matches the target context more closely. We have experimentally verified TSK for TFBS identification on fourteen different TFs under a cross-organism setting. We find that TSK consistently outperforms the state-of-the-art TFBS tools, especially when working with TFs whose binding sequences are not conserved across contexts. We also demonstrate the generalizability of TSK by showing its cutting-edge performance on a different set of cross-context tasks for MHC peptide binding prediction.
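The feature mapping mentioned above can be illustrated with a generic (k, m)-mismatch string kernel. The sketch below is not the TSK implementation and omits the source-to-target re-weighting step entirely; it only shows, under the usual textbook definition, how a sequence is mapped to counts of k-mers within m mismatches and how two sequences are then compared by a dot product. All function names and the default values of `k` and `m` are illustrative assumptions.

```python
from collections import Counter

BASES = "ACGT"

def neighbors(kmer, m):
    """All k-mers within Hamming distance m of kmer."""
    result = {kmer}
    for _ in range(m):
        step = set()
        for s in result:
            for i in range(len(s)):
                for b in BASES:
                    # Substituting the same base keeps s itself in the set.
                    step.add(s[:i] + b + s[i + 1:])
        result |= step
    return result

def mismatch_features(seq, k=3, m=1):
    """Count, for each k-mer, how many windows of seq lie within m mismatches."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        for n in neighbors(seq[i:i + k], m):
            feats[n] += 1
    return feats

def mismatch_kernel(x, y, k=3, m=1):
    """Dot product of the two mismatch feature vectors."""
    fx = mismatch_features(x, k, m)
    fy = mismatch_features(y, k, m)
    return sum(count * fy[kmer] for kmer, count in fx.items())
```

With `m=0` this reduces to the plain k-mer spectrum kernel. Allowing mismatches lets sequences that share no exact k-mer still have a non-zero similarity, which is the property that makes such kernels tolerant of degenerate binding sites.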
- Toxicon : official journal of the International Society on Toxinology
- Published almost 4 years ago
The marine polycyclic-ether toxin gambierol and 1-butanol (n-alkanol) inhibit Shaker-type Kv channels by interfering with the gating machinery. Competition experiments indicated that the two compounds do not share an overlapping binding site, but that gambierol is able to affect the affinity of 1-butanol for Shaker through an allosteric effect. Furthermore, the Shaker-P475A mutant, which inverts the effect of 1-butanol, is inhibited by gambierol with nM affinity. Thus, gambierol and 1-butanol inhibit Shaker-type Kv channels via distinct parts of the gating machinery.
Protein-nucleic acid interactions are among the most important intermolecular interactions in the regulation of cellular events. Identifying residues involved in these interactions from protein structure alone is an important challenge. Here we introduce the webserver interface to DBSI (DNA Binding Site Identifier), a powerful structure-based SVM model for the prediction and visualization of DNA binding sites on protein structures. DBSI has been shown to be a top-performing model to predict DNA binding sites on the surface of a protein or peptide and shows promise in predicting RNA binding sites.
Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, was the standard model for this task for more than three decades, but its simple assumptions are increasingly being called into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting, and while model classes that dynamically adapt the model complexity to the data have been developed, effective model selection is to date only possible for fully observable data, but not, e.g., within de novo motif discovery.