### Concept: Parallel computing

#### 28

PURPOSE: To develop and test a new algorithm for fast direct Fourier transform (DrFT) reconstruction of MR data on non-Cartesian trajectories composed of lines with equally spaced points. THEORY AND METHODS: The DrFT, which is normally used as a reference in evaluating the accuracy of other reconstruction methods, can reconstruct images directly from non-Cartesian MR data without interpolation. However, DrFT reconstruction is computationally intensive, which makes the DrFT impractical for routine clinical applications. In this article, the Chirp transform algorithm was introduced to accelerate the DrFT reconstruction of radial and Periodically Rotated Overlapping ParallEL Lines with Enhanced Reconstruction (PROPELLER) MRI data located on trajectories that are composed of lines with equally spaced points. The performance of the proposed Chirp transform algorithm-DrFT algorithm was evaluated by using simulation and in vivo MRI data. RESULTS: After implementing the algorithm on a graphics processing unit, the proposed Chirp transform algorithm-DrFT algorithm achieved an acceleration of approximately one order of magnitude, and the speed-up factor was further increased to approximately three orders of magnitude compared with the traditional single-thread DrFT reconstruction. CONCLUSION: Implementing the Chirp transform algorithm-DrFT algorithm on the graphics processing unit makes it possible to efficiently calculate the DrFT reconstruction of radial and PROPELLER MRI data. Magn Reson Med, 2012. © 2012 Wiley Periodicals, Inc.
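
The abstract does not describe how the chirp transform is applied to two-dimensional radial or PROPELLER data, so the following is only a one-dimensional sketch of the general chirp (Bluestein-type) identity that such methods build on. It evaluates a Fourier sum at equally spaced frequencies `alpha * k` by rewriting `n*k` as `(n² + k² - (k - n)²) / 2`, which turns the sum into a convolution that an FFT can compute. The function name `chirp_dft` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def chirp_dft(x, alpha, m):
    """Return X[k] = sum_n x[n] * exp(-2j*pi*alpha*n*k) for k = 0..m-1.

    Illustrative 1-D sketch only (hypothetical helper, not the authors' code).
    """
    x = np.asarray(x, dtype=complex)
    n = len(x)
    idx = np.arange(max(n, m))
    # Chirp sequence exp(-j*pi*alpha*k^2), shared by the pre- and post-multiply.
    chirp = np.exp(-1j * np.pi * alpha * idx**2)

    # Pre-multiply the input by the chirp.
    a = x * chirp[:n]

    # FFT length: a power of two that is at least n + m - 1, so the
    # circular convolution matches the linear one on the samples we keep.
    size = 1 << (n + m - 2).bit_length()

    # Convolution kernel conj(chirp)[|j|] for lags j = -(n-1) .. (m-1),
    # with negative lags wrapped to the end of the buffer.
    b = np.zeros(size, dtype=complex)
    b[:m] = np.conj(chirp[:m])
    b[size - n + 1:] = np.conj(chirp[1:n][::-1])

    conv = np.fft.ifft(np.fft.fft(a, size) * np.fft.fft(b))

    # Post-multiply by the chirp and keep the first m outputs.
    return chirp[:m] * conv[:m]
```

With `alpha = 1 / len(x)` and `m = len(x)` this reduces to the ordinary DFT, which gives a convenient check against `np.fft.fft`. The cost is a few FFTs of length about `n + m` rather than the `n * m` operations of the direct sum.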

#### 27

##### Speeding Up Ecological and Evolutionary Computations in R; Essentials of High Performance Computing for Biologists

- OPEN
- PLoS computational biology
- Published almost 5 years ago

Computation has become a critical component of research in biology. A risk has emerged that computational and programming challenges may limit research scope, depth, and quality. We review various solutions to common computational efficiency problems in ecological and evolutionary research. Our review pulls together material that is currently scattered across many sources and emphasizes those techniques that are especially effective for typical ecological and environmental problems. We demonstrate how straightforward it can be to write efficient code and implement techniques such as profiling or parallel computing. We supply a newly developed R package (aprof) that helps to identify computational bottlenecks in R code and determine whether optimization can be effective. Our review is complemented by a practical set of examples and detailed Supporting Information material (S1-S3 Texts) that demonstrate large improvements in computational speed (ranging from 10.5 times to 14,000 times faster). By improving computational efficiency, biologists can feasibly solve more complex tasks, ask more ambitious questions, and include more sophisticated analyses in their research.

#### 27

##### VinaMPI: Facilitating multiple receptor high-throughput virtual docking on high-performance computers

- Journal of computational chemistry
- Published over 6 years ago

The program VinaMPI has been developed to enable massively large virtual drug screens on leadership-class computing resources, using a large number of cores to decrease the time-to-completion of the screen. VinaMPI is a massively parallel Message Passing Interface (MPI) program based on the multithreaded virtual docking program AutodockVina, and is used to distribute tasks while multithreading is used to speed-up individual docking tasks. VinaMPI uses a distribution scheme in which tasks are evenly distributed to the workers based on the complexity of each task, as defined by the number of rotatable bonds in each chemical compound investigated. VinaMPI efficiently handles multiple proteins in a ligand screen, allowing for high-throughput inverse docking that presents new opportunities for improving the efficiency of the drug discovery pipeline. VinaMPI successfully ran on 84,672 cores with a continual decrease in job completion time with increasing core count. The ratio of the number of tasks in a screening to the number of workers should be at least around 100 in order to have a good load balance and an optimal job completion time. The code is freely available and downloadable. Instructions for downloading and using the code are provided in the Supporting Information. © 2013 Wiley Periodicals, Inc.
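
The abstract states only that tasks are distributed evenly according to a complexity measure (the number of rotatable bonds); it does not give the scheme itself. As a minimal sketch of one way such balancing could work, the example below uses a greedy "largest task first, least-loaded worker next" heuristic. This heuristic is an assumption chosen for illustration, not VinaMPI's documented algorithm, and the name `distribute_tasks` is hypothetical.

```python
import heapq

def distribute_tasks(complexities, n_workers):
    """Assign task indices to workers so that summed complexity is balanced.

    complexities: one number per task (e.g. rotatable-bond count).
    Returns a list with one list of task indices per worker.
    Illustrative greedy heuristic; not taken from VinaMPI.
    """
    # Min-heap of (current load, worker id): the least-loaded worker pops first.
    heap = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]

    # Place the most complex tasks first so small tasks can fill the gaps.
    order = sorted(range(len(complexities)),
                   key=lambda t: complexities[t], reverse=True)
    for task in order:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(task)
        heapq.heappush(heap, (load + complexities[task], worker))
    return assignment


if __name__ == "__main__":
    bonds = [8, 7, 6, 5, 4]          # made-up complexity values
    plan = distribute_tasks(bonds, 2)
    print(plan)
    print([sum(bonds[t] for t in tasks) for tasks in plan])
```

A heuristic like this balances better when there are many more tasks than workers, which is consistent with the abstract's recommendation of a task-to-worker ratio of at least about 100.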

#### 27

##### Accelerating image reconstruction in three-dimensional optoacoustic tomography on graphics processing units

- Medical physics
- Published about 7 years ago

Purpose: Optoacoustic tomography (OAT) is inherently a three-dimensional (3D) inverse problem. However, most studies of OAT image reconstruction still employ two-dimensional imaging models. One important reason is that 3D image reconstruction is computationally burdensome. The aim of this work is to accelerate existing image reconstruction algorithms for 3D OAT by use of parallel programming techniques. Methods: Parallelization strategies are proposed to accelerate a filtered backprojection (FBP) algorithm and two different pairs of projection/backprojection operations that correspond to two different numerical imaging models. The algorithms are designed to fully exploit the parallel computing power of graphics processing units (GPUs). In order to evaluate the parallelization strategies for the projection/backprojection pairs, an iterative image reconstruction algorithm is implemented. Computer simulation and experimental studies are conducted to investigate the computational efficiency and numerical accuracy of the developed algorithms. Results: The GPU implementations improve the computational efficiency by factors of 1000, 125, and 250 for the FBP algorithm and the two pairs of projection/backprojection operators, respectively. Accurate images are reconstructed by use of the FBP and iterative image reconstruction algorithms from both computer-simulated and experimental data. Conclusions: Parallelization strategies for 3D OAT image reconstruction are proposed for the first time. These GPU-based implementations significantly reduce the computational time for 3D image reconstruction, complementing our earlier work on 3D OAT iterative image reconstruction.

#### 27

##### Message passing interface and multithreading hybrid for parallel molecular docking of large databases on petascale high performance computing machines

- Journal of computational chemistry
- Published about 7 years ago

A mixed parallel scheme that combines message passing interface (MPI) and multithreading was implemented in the AutoDock Vina molecular docking program. The resulting program, named VinaLC, was tested on the petascale high performance computing (HPC) machines at Lawrence Livermore National Laboratory. To exploit the typical cluster-type supercomputers, thousands of docking calculations were dispatched by the master process to run simultaneously on thousands of slave processes, where each docking calculation takes one slave process on one node, and within the node each docking calculation runs via multithreading on multiple CPU cores and shared memory. Input and output of the program and the data handling within the program were carefully designed to deal with large databases and ultimately achieve HPC on a large number of CPU cores. Parallel performance analysis of the VinaLC program shows that the code scales up to more than 15K CPUs with a very low overhead cost of 3.94%. One million flexible compound docking calculations took only 1.4 h to finish on about 15K CPUs. The docking accuracy of VinaLC has been validated against the DUD data set by the re-docking of X-ray ligands and an enrichment study; 64.4% of the top scoring poses have RMSD values under 2.0 Å. The program has been demonstrated to have good enrichment performance on 70% of the targets in the DUD data set. An analysis of the enrichment factors calculated at various percentages of the screening database indicates VinaLC has very good early recovery of actives. © 2013 Wiley Periodicals, Inc.

#### 25

##### Will solid-state drives accelerate your bioinformatics? In-depth profiling, performance analysis and beyond

- Briefings in bioinformatics
- Published over 4 years ago

A wide variety of large-scale data have been produced in bioinformatics. In response, the need for efficient handling of biomedical big data has been partly met by parallel computing. However, the time demand of many bioinformatics programs still remains high for large-scale practical uses because of factors that hinder acceleration by parallelization. Recently, new generations of storage devices have emerged, such as NAND flash-based solid-state drives (SSDs), and with the renewed interest in near-data processing, they are increasingly becoming acceleration methods that can accompany parallel processing. In certain cases, a simple drop-in replacement of hard disk drives by SSDs results in dramatic speedup. Despite the various advantages and continuous cost reduction of SSDs, there has been little review of SSD-based profiling and performance exploration of important but time-consuming bioinformatics programs. For an informative review, we perform in-depth profiling and analysis of 23 key bioinformatics programs using multiple types of devices. Based on the insight we obtain from this research, we further discuss issues related to design and optimize bioinformatics algorithms and pipelines to fully exploit SSDs. The programs we profile cover traditional and emerging areas of importance, such as alignment, assembly, mapping, expression analysis, variant calling and metagenomics. We explain how acceleration by parallelization can be combined with SSDs for improved performance and also how using SSDs can expedite important bioinformatics pipelines, such as variant calling by the Genome Analysis Toolkit and transcriptome analysis using RNA sequencing. We hope that this review can provide useful directions and tips to accompany future bioinformatics algorithm design procedures that properly consider new generations of powerful storage devices.

#### 24

##### Multiple program/multiple data molecular dynamics method with multiple time step integrator for large biological systems

- Journal of computational chemistry
- Published over 3 years ago

Parallelization of molecular dynamics (MD) simulation is essential for investigating conformational dynamics of large biological systems, such as ribosomes, viruses, and multiple proteins in cellular environments. To improve efficiency in the parallel computation, we have to reduce the amount of data transfer between processors by introducing domain decomposition schemes. Also, it is important to optimize the computational balance between real-space non-bonded interactions and reciprocal-space interactions for long-range electrostatic interactions. Here, we introduce a novel parallelization scheme for large-scale MD simulations on massively parallel supercomputers consisting of only CPUs. We make use of a multiple program/multiple data (MPMD) approach for separating the real-space and reciprocal-space computations on different processors. We also utilize the r-RESPA multiple time step integrator on the framework of the MPMD approach in an efficient way: when the reciprocal-space computations are skipped in r-RESPA, processors assigned for them are utilized for half of the real-space computations. The new scheme allows us to use twice as many processors as are available in the conventional single program approach. The best performances of all-atom MD simulations for 1 million (STMV), 8.5 million (8_STMV), and 28.8 million (27_STMV) atom systems on K computer are 65, 36, and 24 ns/day, respectively. The MPMD scheme improves on the maximum performance of the single-program approach by 23.4, 10.2, and 9.2 ns/day for the STMV, 8_STMV, and 27_STMV systems, respectively, which corresponds to speedups of 57%, 39%, and 60%. This suggests that significant speedups can be obtained by increasing the number of processors without losing parallel computational efficiency. © 2016 Wiley Periodicals, Inc.

#### 24

##### Real-time 3D digital image correlation method and its application in human pulse monitoring

- Applied optics
- Published about 4 years ago

In industrial measurements and online monitoring, full-field and high-efficiency deformation analysis has been increasingly important and highly demanded in recent years. In this paper, a fast three-dimensional digital image correlation (3D-DIC) method was proposed to implement real-time measurement. Two improvements were suggested to accelerate the computation speed without sacrificing the accuracy. First, an efficient inverse compositional Gauss-Newton (IC-GN) algorithm was developed to avoid redundant computation. Moreover, a seed point-based parallel method was extended for 3D-DIC to achieve parallel computation and faster convergence speed. The detailed process of the real-time measurement using the proposed method was also introduced. Benefiting from the efficient IC-GN algorithm and parallel processing software we developed, full-field, real-time 3D deformation monitoring was realized at a frame rate of 10 frames/s with resolution of 5000 points per frame. For validation, the displacement field of a four-point bending beam was determined by the real-time 3D-DIC. As an application, the real-time human pulse diagnosis was also performed based on the presented technique. Experimental results verify that the proposed real-time 3D-DIC is practicable and effective for traditional Chinese medicine.

#### 24

##### MrBayes tgMC^3++: a High Performance and Resource-Efficient GPU-oriented Phylogenetic Analysis Method

- IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM
- Published over 4 years ago

MrBayes is a widespread phylogenetic inference tool harnessing empirical evolutionary models and Bayesian statistics. However, the computational cost on the likelihood estimation is very expensive, resulting in undesirably long execution time. Although a number of multi-threaded optimizations have been proposed to speed up MrBayes, there are bottlenecks that severely limit the GPU thread-level parallelism of likelihood estimations. This study proposes a high performance and resource-efficient method for GPU-oriented parallelization of likelihood estimations. Instead of having to rely on empirical programming, the proposed novel decomposition storage model implements high performance data transfers implicitly. In terms of performance improvement, a speedup factor of up to 178 can be achieved on the analysis of simulated datasets by 4 Tesla K40 cards. In comparison to the other publicly available GPU-oriented MrBayes, the tgMC3++ method (proposed herein) outperforms the tgMC3 (v1.0), nMC3 (v2.1.1) and oMC3 (v1.00) methods by speedup factors of up to 1.6, 1.9 and 2.9, respectively. Moreover, tgMC3++ supports more evolutionary models and gamma categories, which previous GPU-oriented methods fail to take into analysis.

#### 14

##### Next-generation mapping: a novel approach for detection of pathogenic structural variants with a potential utility in clinical diagnosis

- OPEN
- Genome medicine
- Published over 2 years ago

Massively parallel DNA sequencing, such as exome sequencing, has become a routine clinical procedure to identify pathogenic variants responsible for a patient’s phenotype. Exome sequencing has the capability of reliably identifying inherited and de novo single-nucleotide variants, small insertions, and deletions. However, due to the use of 100-300-bp fragment reads, this platform is not well powered to sensitively identify moderate to large structural variants (SV), such as insertions, deletions, inversions, and translocations.