SUMMARY: Two methods to add unaligned sequences into an existing multiple sequence alignment have been implemented as the “–add” and “–addfragments” options in the MAFFT package. The former option is a basic one and applicable only to full-length sequences, while the latter option is applicable even when the unaligned sequences are short and fragmentary. These methods internally infer the phylogenetic relationship among the sequences in the existing alignment, as well as the phylogenetic positions of unaligned sequences. Benchmarks based on two independent simulations consistently suggest that the “–addfragments” option outperforms recent methods, PaPaRa and PAGAN, in accuracy for difficult problems and that these three methods appropriately handle easy problems. AVAILABILITY: http://mafft.cbrc.jp/alignment/software/ CONTACT: firstname.lastname@example.org SUPPLEMENTARY INFORMATION: Available at Bioinformatics online.
Multiple sequence alignment (MSA) is an extremely useful tool for molecular and evolutionary biology and there are several programs and algorithms available for this purpose. Although previous studies have compared the alignment accuracy of different MSA programs, their computational time and memory usage have not been systematically evaluated. Given the unprecedented amount of data produced by next generation deep sequencing platforms, and increasing demand for large-scale data analysis, it is imperative to optimize the application of software. Therefore, a balance between alignment accuracy and computational cost has become a critical indicator of the most suitable MSA program. We compared both accuracy and cost of nine popular MSA programs, namely CLUSTALW, CLUSTAL OMEGA, DIALIGN-TX, MAFFT, MUSCLE, POA, Probalign, Probcons and T-Coffee, against the benchmark alignment dataset BAliBASE and discuss the relevance of some implementations embedded in each program’s algorithm. Accuracy of alignment was calculated with the two standard scoring functions provided by BAliBASE, the sum-of-pairs and total-column scores, and computational costs were determined by collecting peak memory usage and time of execution.
We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a “top-down” strategy with a favorable asymptotic time complexity that first identifies regions generally shared by all the input sequences, and then realigns closely related subgroups in tandem. (ii) It infers position-specific gap penalties that favor insertions or deletions (indels) within each sequence at alignment positions in which indels are invoked in other sequences. This favors the placement of insertions between conserved blocks, which can be understood as making up the proteins' structural core. (iii) It uses a Bayesian statistical measure of alignment quality based on the minimum description length principle and on Dirichlet mixture priors. Consequently, GISMO aligns sequence regions only when statistically justified. This is unlike methods based on the ad hoc, but widely used, sum-of-the-pairs scoring system, which will align random sequences. (iv) It defines a system for exploring alignment space that provides natural avenues for further experimentation through the development of new sampling strategies for more efficiently escaping from suboptimal traps. GISMO’s superior performance is illustrated using 408 protein sets containing, on average, 235 sequences. These sets correspond to NCBI Conserved Domain Database alignments, which have been manually curated in the light of available crystal structures, and thus provide a means to assess alignment accuracy. GISMO fills a different niche than other MSA programs, namely identifying and aligning a conserved domain present within a large, diverse set of full length sequences. The GISMO program is available at http://gismo.igs.umaryland.edu/.
Since 2004 the European Bioinformatics Institute (EMBL-EBI) has provided access to a wide range of databases and analysis tools via Web Services interfaces. This comprises services to search across the databases available from the EMBL-EBI and to explore the network of cross-references present in the data (e.g. EB-eye), services to retrieve entry data in various data formats and to access the data in specific fields (e.g. dbfetch), and analysis tool services, for example, sequence similarity search (e.g. FASTA and NCBI BLAST), multiple sequence alignment (e.g. Clustal Omega and MUSCLE), pairwise sequence alignment and protein functional analysis (e.g. InterProScan and Phobius). The REST/SOAP Web Services (http://www.ebi.ac.uk/Tools/webservices/) interfaces to these databases and tools allow their integration into other tools, applications, web sites, pipeline processes and analytical workflows. To get users started using the Web Services, sample clients are provided covering a range of programming languages and popular Web Service tool kits, and a brief guide to Web Services technologies, including a set of tutorials, is available for those wishing to learn more and develop their own clients. Users of the Web Services are informed of improvements and updates via a range of methods.
We present a new feature of the MAFFT multiple alignment program for suppressing over-alignment (aligning unrelated segments). Conventional MAFFT is highly sensitive in aligning conserved regions in remote homologs, but the risk of over-alignment is recently becoming greater, as low-quality or noisy sequences are increasing in protein sequence databases, due, for example, to sequencing errors and difficulty in gene prediction.
We report an update for the MAFFT multiple sequence alignment program to enable parallel calculation of large numbers of sequences. The G-INS-1 option of MAFFT was recently reported to have higher accuracy than other methods for large data, but this method has been impractical for most largescale analyses, due to the requirement of large computational resources. We introduce a scalable variant, G-large-INS-1, that has equivalent accuracy to G-INS-1 and is applicable to 50,000 or more sequences.
Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein families databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilization of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. What matters is that its implementation is highly optimized and parallelized to make the most of modern computer platforms. Thanks to the above, quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT for datasets exceeding a few thousand sequences. Quality does not compromise on time or memory requirements, which are an order of magnitude lower than those in the existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.
- Journal of computational biology : a journal of computational molecular cell biology
- Published over 1 year ago
The alignment among three or more nucleotides/amino acids sequences at the same time is known as multiple sequence alignment (MSA), a nondeterministic polynomial time (NP)-hard optimization problem. The time complexity of finding an optimal alignment raises exponentially when the number of sequences to align increases. In this work, we deal with a multiobjective version of the MSA problem wherein the goal is to simultaneously optimize the accuracy and conservation of the alignment. A parallel version of the hybrid multiobjective memetic metaheuristics for MSA is proposed. To evaluate the parallel performance of our proposal, we have selected a pull of data sets with different number of sequences (up to 1000 sequences) and study its parallel performance against other well-known parallel metaheuristics published in the literature, such as MSAProbs, tree-based consistency objective function for alignment evaluation (T-Coffee), Clustal [Formula: see text], and multiple alignment using fast Fourier transform (MAFFT). The comparative study reveals that our parallel aligner obtains better results than MSAProbs, T-Coffee, Clustal [Formula: see text], and MAFFT. In addition, the parallel version is around 25 times faster than the sequential version with 32 cores, obtaining an efficiency around 80%.
This research work focus on the multiple sequence alignment, as developing an exact multiple sequence alignment for different protein sequences is a difficult computational task. In this research, a hybrid algorithm named Bacterial Foraging Optimization-Genetic Algorithm (BFO-GA) algorithm is aimed to improve the multi-objectives and carrying out measures of multiple sequence alignment. The proposed algorithm employs multi-objectives such as variable gap penalty minimization, maximization of similarity and non-gap percentage. The proposed BFO-GA algorithm is measured with various MSA methods such as T-Coffee, Clustal Omega, Muscle, K-Align, MAFFT, GA, ACO, ABC and PSO. The experiments were taken on four benchmark datasets such as BAliBASE 3.0, Prefab 4.0, SABmark 1.65 and Oxbench 1.3 databases and the outcomes prove that the proposed BFO-GA algorithm obtains better statistical significance results as compared with the other well-known methods. This research study also evaluates the practicability of the alignments of BFO-GA by applying the optimal sequence to predict the phylogenetic tree by using ClustalW2 Phylogeny tool and compare with the existing algorithms by using the Robinson-Foulds (RF) distance performance metric. Lastly, the statistical implication of the proposed algorithm is computed by using the Wilcoxon Matched-Pair Signed- Rank test and also it infers better results.
Intracellular trafficking and localization studies of spike protein from SARS and OC43 showed that SARS spike protein is localized in the ER or ERGIC compartment and OC43 spike protein is predominantly localized in the lysosome. Differential localization can be explained by signal sequence. The sequence alignment using Clustal W shows that the signal sequence present at the cytoplasmic tail plays an important role in spike protein localization. A unique GYQEL motif is identified at the cytoplasmic terminal of OC43 spike protein which helps in localization in the lysosome, and a novel KLHYT motif is identified in the cytoplasmic tail of SARS spike protein which helps in ER or ERGIC localization. This study sheds some light on the role of cytoplasmic tail of spike protein in cell-to-cell fusion, coronavirus host cell fusion and subsequent pathogenicity.