Concept: Amdahl's law
BACKGROUND: Most Bayesian models for the analysis of complex traits are not analytically tractable and inferences are based on computationally intensive techniques. This is true of Bayesian models for genome-enabled selection, which uses whole-genome molecular data to predict the genetic merit of candidate animals for breeding purposes. In this regard, parallel computing can overcome the bottlenecks that can arise from series computing. Hence, a major goal of the present study is to bridge the gap to high-performance Bayesian computation in the context of animal breeding and genetics. RESULTS: Parallel Monte Carlo Markov chain algorithms and strategies are described in the context of animal breeding and genetics. Parallel Monte Carlo algorithms are introduced as a starting point including their applications to computing single-parameter and certain multipleparameter models. Then, two basic approaches for parallel Markov chain Monte Carlo are described: one aims at parallelization within a single chain; the other is based on running multiple chains, yet some variants are discussed as well. Features and strategies of the parallel Markov chain Monte Carlo are illustrated using real data, including a large beef cattle dataset with 50K SNP genotypes. CONCLUSIONS: Parallel Markov chain Monte Carlo algorithms are useful for computing complex Bayesian models, which does not only lead to a dramatic speedup in computing but can also be used to optimize model parameters in complex Bayesian models. Hence, we anticipate that use of parallel Markov chain Monte Carlo will have a profound impact on revolutionizing the computational tools for genomic selection programs.
A new algorithm, “HiER-leap” (hierarchical exact reaction-leaping), is derived which improves on the computational properties of the ER-leap algorithm for exact accelerated simulation of stochastic chemical kinetics. Unlike ER-leap, HiER-leap utilizes a hierarchical or divide-and-conquer organization of reaction channels into tightly coupled “blocks” and is thereby able to speed up systems with many reaction channels. Like ER-leap, HiER-leap is based on the use of upper and lower bounds on the reaction propensities to define a rejection sampling algorithm with inexpensive early rejection and acceptance steps. But in HiER-leap, large portions of intra-block sampling may be done in parallel. An accept/reject step is used to synchronize across blocks. This method scales well when many reaction channels are present and has desirable asymptotic properties. The algorithm is exact, parallelizable and achieves a significant speedup over the stochastic simulation algorithm and ER-leap on certain problems. This algorithm offers a potentially important step towards efficient in silico modeling of entire organisms.
A wide variety of large-scale data have been produced in bioinformatics. In response, the need for efficient handling of biomedical big data has been partly met by parallel computing. However, the time demand of many bioinformatics programs still remains high for large-scale practical uses because of factors that hinder acceleration by parallelization. Recently, new generations of storage devices have emerged, such as NAND flash-based solid-state drives (SSDs), and with the renewed interest in near-data processing, they are increasingly becoming acceleration methods that can accompany parallel processing. In certain cases, a simple drop-in replacement of hard disk drives by SSDs results in dramatic speedup. Despite the various advantages and continuous cost reduction of SSDs, there has been little review of SSD-based profiling and performance exploration of important but time-consuming bioinformatics programs. For an informative review, we perform in-depth profiling and analysis of 23 key bioinformatics programs using multiple types of devices. Based on the insight we obtain from this research, we further discuss issues related to design and optimize bioinformatics algorithms and pipelines to fully exploit SSDs. The programs we profile cover traditional and emerging areas of importance, such as alignment, assembly, mapping, expression analysis, variant calling and metagenomics. We explain how acceleration by parallelization can be combined with SSDs for improved performance and also how using SSDs can expedite important bioinformatics pipelines, such as variant calling by the Genome Analysis Toolkit and transcriptome analysis using RNA sequencing. We hope that this review can provide useful directions and tips to accompany future bioinformatics algorithm design procedures that properly consider new generations of powerful storage devices.
Parallelization of molecular dynamics (MD) simulation is essential for investigating conformational dynamics of large biological systems, such as ribosomes, viruses, and multiple proteins in cellular environments. To improve efficiency in the parallel computation, we have to reduce the amount of data transfer between processors by introducing domain decomposition schemes. Also, it is important to optimize the computational balance between real-space non-bonded interactions and reciprocal-space interactions for long-range electrostatic interactions. Here, we introduce a novel parallelization scheme for large-scale MD simulations on massively parallel supercomputers consisting of only CPUs. We make use of a multiple program/multiple data (MPMD) approach for separating the real-space and reciprocal-space computations on different processors. We also utilize the r-RESPA multiple time step integrator on the framework of the MPMD approach in an efficient way: when the reciprocal-space computations are skipped in r-RESPA, processors assigned for them are utilized for half of the real-space computations. The new scheme allows us to use twice as many as processors that are available in the conventional single program approach. The best performances of all-atom MD simulations for 1 million (STMV), 8.5 million (8_STMV), and 28.8 million (27_STMV) atom systems on K computer are 65, 36, and 24 ns/day, respectively. The MPMD scheme can accelerate 23.4, 10.2, and 9.2 ns/day from the maximum performance of single-program approach for STMV, 8_STMV, and 27_STMV systems, respectively, which correspond to 57%, 39%, and 60% speed up. This suggests significant speedups by increasing the number of processors without losing parallel computational efficiency. © 2016 Wiley Periodicals, Inc.
MrBayes tgMC^3++: a High Performance and Resource-Efficient GPU-oriented Phylogenetic Analysis Method
- IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM
- Published over 3 years ago
MrBayes is a widespread phylogenetic inference tool harnessing empirical evolutionary models and Bayesian statistics. However, the computational cost on the likelihood estimation is very expensive, resulting in undesirably long execution time. Although a number of multi-threaded optimizations have been proposed to speed up MrBayes, there are bottlenecks that severely limit the GPU thread-level parallelism of likelihood estimations. This study proposes a high performance and resource-efficient method for GPU-oriented parallelization of likelihood estimations. Instead of having to rely on empirical programming, the proposed novel decomposition storage model implements high performance data transfers implicitly. In terms of performance improvement, a speedup factor of up to 178 can be achieved on the analysis of simulated datasets by 4 Tesla K40 cards. In comparison to the other publicly available GPU-oriented MrBayes, the tgMC3++ method (proposed herein) outperforms the tgMC3 (v1.0), nMC3 (v2.1.1) and oMC3 (v1.00) methods by speedup factors of up to 1.6, 1.9 and 2.9, respectively. Moreover, tgMC3++ supports more evolutionary models and gamma categories, which previous GPU-oriented methods fail to take into analysis.
Forward Wright-Fisher simulations are powerful in their ability to model complex demography and selection scenarios, but suffer from slow execution on the CPU, thus limiting their usefulness. The single-locus Wright-Fisher forward algorithm is, however, exceedingly parallelizable, with many steps which are so-called embarrassingly parallel, consisting of a vast number of individual computations that are all independent of each other and thus capable of being performed concurrently. The rise of modern Graphics Processing Units (GPUs) and programming languages designed to leverage the inherent parallel nature of these processors have allowed researchers to dramatically speed up many programs that have such high arithmetic intensity and intrinsic concurrency. The presented GPU Optimized Wright-Fisher simulation, or GO Fish for short, can be used to simulate arbitrary selection and demographic scenarios while running over 250-fold faster than its serial counterpart on the CPU. Even modest GPU hardware can achieve an impressive speedup of over two orders of magnitude. With simulations so accelerated, one can not only do quick parametric bootstrapping of previously estimated parameters, but also use simulated results to calculate the likelihoods and summary statistics of demographic and selection models against real polymorphism data - all without restricting the demographic and selection scenarios that can be modeled or requiring approximations to the single-locus forward algorithm for efficiency. Further, as many of the parallel programming techniques used in this simulation can be applied to other computationally intensive algorithms important in population genetics, GO Fish serves as an exciting template for future research into accelerating computation in evolution. GO Fish is part of the Parallel PopGen Package available at: http://dl42.github.io/ParallelPopGen/.
Infectious diseases have complex transmission cycles, and effective public health responses require the ability to monitor outbreaks in a timely manner. Space-time statistics facilitate the discovery of disease dynamics including rate of spread and seasonal cyclic patterns, but are computationally demanding, especially for datasets of increasing size, diversity and availability. High-performance computing reduces the effort required to identify these patterns, however heterogeneity in the data must be accounted for. We develop an adaptive space-time domain decomposition approach for parallel computation of the space-time kernel density. We apply our methodology to individual reported dengue cases from 2010 to 2011 in the city of Cali, Colombia. The parallel implementation reaches significant speedup compared to sequential counterparts. Density values are visualized in an interactive 3D environment, which facilitates the identification and communication of uneven space-time distribution of disease events. Our framework has the potential to enhance the timely monitoring of infectious diseases.
A massively parallel program for quantum mechanical-molecular mechanical (QM/MM) molecular dynamics simulation, called Platypus (PLATform for dYnamic Protein Unified Simulation), was developed to elucidate protein functions. The speedup and the parallelization ratio of Platypus in the QM and QM/MM calculations were assessed for a bacteriochlorophyll dimer in the photosynthetic reaction center (DIMER) on the K computer, a massively parallel computer achieving 10 PetaFLOPs with 705,024 cores. Platypus exhibited the increase in speedup up to 20,000 core processors at the HF/cc-pVDZ and B3LYP/cc-pVDZ, and up to 10,000 core processors by the CASCI(16,16)/6-31G** calculations. We also performed excited QM/MM-MD simulations on the chromophore of Sirius (SIRIUS) in water. Sirius is a pH-insensitive and photo-stable ultramarine fluorescent protein. Platypus accelerated on-the-fly excited-state QM/MM-MD simulations for SIRIUS in water, using over 4000 core processors. In addition, it also succeeded in 50-ps (200,000-step) on-the-fly excited-state QM/MM-MD simulations for the SIRIUS in water. © 2016 Wiley Periodicals, Inc.
- Risk analysis : an official publication of the Society for Risk Analysis
- Published about 3 years ago
Simulation models are widely used in risk analysis to study the effects of uncertainties on outcomes of interest in complex problems. Often, these models are computationally complex and time consuming to run. This latter point may be at odds with time-sensitive evaluations or may limit the number of parameters that are considered. In this article, we give an introductory tutorial focused on parallelizing simulation code to better leverage modern computing hardware, enabling risk analysts to better utilize simulation-based methods for quantifying uncertainty in practice. This article is aimed primarily at risk analysts who use simulation methods but do not yet utilize parallelization to decrease the computational burden of these models. The discussion is focused on conceptual aspects of embarrassingly parallel computer code and software considerations. Two complementary examples are shown using the languages MATLAB and R. A brief discussion of hardware considerations is located in the Appendix.
Intel Xeon Phi is a new addition to the family of powerful parallel accelerators. The range of its potential applications in computationally driven research is broad; however, at present, the repository of scientific codes is still relatively limited. In this study, we describe the development and benchmarking of a parallel version of eFindSite, a structural bioinformatics algorithm for the prediction of ligand-binding sites in proteins. Implemented for the Intel Xeon Phi platform, the parallelization of the structure alignment portion of eFindSite using pragma-based OpenMP brings about the desired performance improvements, which scale well with the number of computing cores. Compared to a serial version, the parallel code runs 11.8 and 10.1 times faster on the CPU and the coprocessor, respectively; when both resources are utilized simultaneously, the speedup is 17.6. For example, ligand-binding predictions for 501 benchmarking proteins are completed in 2.1 hours on a single Stampede node equipped with the Intel Xeon Phi card compared to 3.1 hours without the accelerator and 36.8 hours required by a serial version. In addition to the satisfactory parallel performance, porting existing scientific codes to the Intel Xeon Phi architecture is relatively straightforward with a short development time due to the support of common parallel programming models by the coprocessor. The parallel version of eFindSite is freely available to the academic community at www.brylinski.org/efindsite.