We present a way to improve the performance of the electronic structure Vienna Ab initio Simulation Package (VASP) program. We show that high-performance computers equipped with graphics processing units (GPUs) as accelerators may reduce drastically the computation time when offloading these sections to the graphic chips. The procedure consists of (i) profiling the performance of the code to isolate the time-consuming parts, (ii) rewriting these so that the algorithms become better-suited for the chosen graphic accelerator, and (iii) optimizing memory traffic between the host computer and the GPU accelerator. We chose to accelerate VASP with NVIDIA GPU using CUDA. We compare the GPU and original versions of VASP by evaluating the Davidson and RMM-DIIS algorithms on chemical systems of up to 1100 atoms. In these tests, the total time is reduced by a factor between 3 and 8 when running on n (CPU core + GPU) compared to n CPU cores only, without any accuracy loss. © 2012 Wiley Periodicals, Inc.
Purpose: The three-dimensional digital differential analyzer (3D-DDA) algorithm is a widely used ray traversal method, which is also at the core of many convolution∕superposition (C∕S) dose calculation approaches. However, porting existing C∕S dose calculation methods onto graphics processing unit (GPU) has brought challenges to retaining the efficiency of this algorithm. In particular, straightforward implementation of the original 3D-DDA algorithm inflicts a lot of branch divergence which conflicts with the GPU programming model and leads to suboptimal performance. In this paper, an efficient GPU implementation of the 3D-DDA algorithm is proposed, which effectively reduces such branch divergence and improves performance of the C∕S dose calculation programs running on GPU.Methods: The main idea of the proposed method is to convert a number of conditional statements in the original 3D-DDA algorithm into a set of simple operations (e.g., arithmetic, comparison, and logic) which are better supported by the GPU architecture. To verify and demonstrate the performance improvement, this ray traversal method was integrated into a GPU-based collapsed cone convolution∕superposition (CCCS) dose calculation program.Results: The proposed method has been tested using a water phantom and various clinical cases on an NVIDIA GTX570 GPU. The CCCS dose calculation program based on the efficient 3D-DDA ray traversal implementation runs 1.42 ∼ 2.67× faster than the one based on the original 3D-DDA implementation, without losing any accuracy.Conclusions: The results show that the proposed method can effectively reduce branch divergence in the original 3D-DDA ray traversal algorithm and improve the performance of the CCCS program running on GPU. Considering the wide utilization of the 3D-DDA algorithm, various applications can benefit from this implementation method.
- IEEE transactions on visualization and computer graphics
- Published over 6 years ago
We propose the first GPU solution to compute the 2D constrained Delaunay triangulation (CDT) of a planar straight line graph (PSLG) consisting of points and edges. There are many existing CPU algorithms to solve the CDT problem in computational geometry, yet there has been no prior approach to solve this problem efficiently using the parallel computing power of the GPU. For the special case of the CDT problem where the PSLG consists of just points, which is simply the normal Delaunay triangulation problem, a hybrid approach using the GPU together with the CPU to partially speed up the computation has already been presented in the literature. Our work, on the other hand, accelerates the entire computation on the GPU. Our implementation using the CUDA programming model on NVIDIA GPUs is numerically robust, and runs up to an order of magnitude faster than the best sequential implementations on the CPU. This result is reflected in our experiment with both randomly generated PSLGs and real-world GIS data having millions of points and edges.
Modern parallel hardware such as multi-core processors (CPUs) and graphics processing units (GPUs) have a high computational power which can be greatly beneficial to the simulation of large-scale neural networks. Over the past years, a number of efforts have focused on developing parallel algorithms and simulators best suited for the simulation of spiking neural models. In this article, we aim at investigating the advantages and drawbacks of the CPU and GPU parallelization of mean-firing rate neurons, widely used in systems-level computational neuroscience. By comparing OpenMP, CUDA and OpenCL implementations towards a serial CPU implementation, we show that GPUs are better suited than CPUs for the simulation of very large networks, but that smaller networks would benefit more from an OpenMP implementation. As this performance strongly depends on data organization, we analyze the impact of various factors such as data structure, memory alignment and floating precision. We then discuss the suitability of the different hardware depending on the networks' size and connectivity, as random or sparse connectivities in mean-firing rate networks tend to break parallel performance on GPUs due to the violation of coalescence.
We present a General Purpose Graphics Processing Unit (GPGPU) based real-time traffic sign detection and recognition method that is robust against illumination changes. There have been many approaches to traffic sign recognition in various research fields; however, previous approaches faced several limitations when under low illumination or wide variance of light conditions. To overcome these drawbacks and improve processing speeds, we propose a method that 1) is robust against illumination changes, 2) uses GPGPU-based real-time traffic sign detection, and 3) performs region detecting and recognition using a hierarchical model. This method produces stable results in low illumination environments. Both detection and hierarchical recognition are performed in real-time, and the proposed method achieves 0.97 F1-score on our collective dataset, which uses the Vienna convention traffic rules (Germany and South Korea).
Our previously published CUDA-only application PaSWAS for Smith-Waterman (SW) sequence alignment of any type of sequence on NVIDIA-based GPUs is platform-specific and therefore adopted less than could be. The OpenCL language is supported more widely and allows use on a variety of hardware platforms. Moreover, there is a need to promote the adoption of parallel computing in bioinformatics by making its use and extension more simple through more and better application of high-level languages commonly used in bioinformatics, such as Python.
Technologies for scalable analysis of very large datasets have emerged in the domain of internet computing, but are still rarely used in neuroimaging despite the existence of data and research questions in need of efficient computation tools especially in fMRI. In this work, we present software tools for the application of Apache Spark and Graphics Processing Units (GPUs) to neuroimaging datasets, in particular providing distributed file input for 4D NIfTI fMRI datasets in Scala for use in an Apache Spark environment. Examples for using this Big Data platform in graph analysis of fMRI datasets are shown to illustrate how processing pipelines employing it can be developed. With more tools for the convenient integration of neuroimaging file formats and typical processing steps, big data technologies could find wider endorsement in the community, leading to a range of potentially useful applications especially in view of the current collaborative creation of a wealth of large data repositories including thousands of individual fMRI datasets.
The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data are craving for extreme computational power and special computing facilities (i.e. super-computers). An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU), can be adapted to tackle this challenge, but the limitation of the device internal memory can pose a new problem of scalability. An efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem.
Shear wave calculations induced by an acoustic radiation force are very time-consuming on desktop computers, and high-performance graphics processing units (GPUs) achieve dramatic reductions in the computation time for these simulations. The acoustic radiation force is calculated using the fast near field method (FNM) and the angular spectrum approach (ASA), and then the shear waves are calculated in parallel with Green’s functions on a GPU. This combination enables rapid evaluation of shear waves for push beams with different spatial samplings and for apertures with different f/#. Relative to shear wave simulations that evaluate the same algorithm on an Intel i7 desktop computer, a high performance nVidia GPU reduces the time required for these calculations by a factor of 45 and 700 when applied to elastic and viscoelastic shear wave simulation models, respectively. These GPU-accelerated simulations also compared to measurements in different viscoelastic phantoms, and the results are similar. For parametric evaluations and for comparisons with measured shear wave data, shear wave simulations with the Green’s function approach are ideally suited for high-performance GPUs.
GATK HaplotypeCaller (HC) is a popular variant caller, which is widely used to identify variants in complex genomes. However, due to its high variants detection accuracy, it suffers from long execution time. In GATK HC, the pair-HMMs forward algorithm accounts for a large percentage of the total execution time. This article proposes to accelerate the pair-HMMs forward algorithm on graphics processing units (GPUs) to improve the performance of GATK HC. This article presents several GPU-based implementations of the pair-HMMs forward algorithm. It also analyzes the performance bottlenecks of the implementations on an NVIDIA Tesla K40 card with various data sets. Based on these results and the characteristics of GATK HC, we are able to identify the GPU-based implementations with the highest performance for the various analyzed data sets. Experimental results show that the GPU-based implementations of the pair-HMMs forward algorithm achieve a speedup of up to 5.47× over existing GPU-based implementations.