Concept: Computer architecture
The availability of a universal quantum computer may have a fundamental impact on a vast number of research fields and on society as a whole. An increasingly large scientific and industrial community is working toward the realization of such a device. An arbitrarily large quantum computer may best be constructed using a modular approach. We present a blueprint for a trapped ion-based scalable quantum computer module, making it possible to create a scalable quantum computer architecture based on long-wavelength radiation quantum gates. The modules control all operations as stand-alone units, are constructed using silicon microfabrication techniques, and are within reach of current technology. To perform the required quantum computations, the modules make use of long-wavelength radiation-based quantum gate technology. To scale this microwave quantum computer architecture to a large size, we present a fully scalable design that makes use of ion transport between different modules, thereby allowing arbitrarily many modules to be connected to construct a large-scale device. A high error-threshold surface error correction code can be implemented in the proposed architecture to execute fault-tolerant operations. With appropriate adjustments, the proposed modules are also suitable for alternative trapped ion quantum computer architectures, such as schemes using photonic interconnects.
This software package provides an R-based framework to make use of multi-core computers when running analyses in the population genetics program STRUCTURE. It is especially addressed to those users of STRUCTURE dealing with numerous and repeated data analyses, and who could take advantage of an efficient script to automatically distribute STRUCTURE jobs among multiple processors. It also consists of additional functions to divide analyses among combinations of populations within a single data set without the need to manually produce multiple projects, as it is currently the case in STRUCTURE. The package consists of two main functions: MPI_structure() and parallel_structure() as well as an example data file. We compared the performance in computing time for this example data on two computer architectures and showed that the use of the present functions can result in several-fold improvements in terms of computation time. ParallelStructure is freely available at https://r-forge.r-project.org/projects/parallstructure/.
We present a fast and model-free 2D and 3D single-molecule localization algorithm that allows more than 3 × 106localizations per second to be calculated on a standard multi-core central processing unit with localization accuracies in line with the most accurate algorithms currently available. Our algorithm converts the region of interest around a point spread function to two phase vectors (phasors) by calculating the first Fourier coefficients in both the x- and y-direction. The angles of these phasors are used to localize the center of the single fluorescent emitter, and the ratio of the magnitudes of the two phasors is a measure for astigmatism, which can be used to obtain depth information (z-direction). Our approach can be used both as a stand-alone algorithm for maximizing localization speed and as a first estimator for more time consuming iterative algorithms.
A heterogeneous computing accelerated SCE-UA global optimization method using OpenMP, OpenCL, CUDA, and OpenACC
- Water science and technology : a journal of the International Association on Water Pollution Research
- Published over 1 year ago
The shuffled complex evolution optimization developed at the University of Arizona (SCE-UA) has been successfully applied in various kinds of scientific and engineering optimization applications, such as hydrological model parameter calibration, for many years. The algorithm possesses good global optimality, convergence stability and robustness. However, benchmark and real-world applications reveal the poor computational efficiency of the SCE-UA. This research aims at the parallelization and acceleration of the SCE-UA method based on powerful heterogeneous computing technology. The parallel SCE-UA is implemented on Intel Xeon multi-core CPU (by using OpenMP and OpenCL) and NVIDIA Tesla many-core GPU (by using OpenCL, CUDA, and OpenACC). The serial and parallel SCE-UA were tested based on the Griewank benchmark function. Comparison results indicate the parallel SCE-UA significantly improves computational efficiency compared to the original serial version. The OpenCL implementation obtains the best overall acceleration results however, with the most complex source code. The parallel SCE-UA has bright prospects to be applied in real-world applications.
Next generation sequencing (NGS) data analysis is highly compute intensive. In-memory computing, vectorization, bulk data transfer, CPU frequency scaling are some of the hardware features in the modern computing architectures. To get the best execution time and utilize these hardware features, it is necessary to tune the system level parameters before running the application. We studied the GATK-HaplotypeCaller which is part of common NGS workflows, that consume more than 43% of the total execution time. Multiple GATK 3.x versions were benchmarked and the execution time of HaplotypeCaller was optimized by various system level parameters which included: (i) tuning the parallel garbage collection and kernel shared memory to simulate in-memory computing, (ii) architecture-specific tuning in the PairHMM library for vectorization, (iii) including Java 1.8 features through GATK source code compilation and building a runtime environment for parallel sorting and bulk data transfer (iv) the default ‘on-demand’ mode of CPU frequency is over-clocked by using ‘performance-mode’ to accelerate the Java multi-threads. As a result, the HaplotypeCaller execution time was reduced by 82.66% in GATK 3.3 and 42.61% in GATK 3.7. Overall, the execution time of NGS pipeline was reduced to 70.60% and 34.14% for GATK 3.3 and GATK 3.7 respectively.
Massively parallel algorithms are presented in this paper to reduce the computational burden associated with quantum transport simulations from first-principles. The power of modern hybrid computer architectures is harvested in order to determine the open boundary conditions that connect the simulation domain with its environment and to solve the resulting Schrödinger equation. While the former operation takes the form of an eigenvalue problem that is solved by a contour integration technique on the available central processing units (CPUs), the latter can be cast into a linear system of equations that is simultaneously processed by SplitSolve, a two-step algorithm, on general-purpose graphics processing units (GPUs). A significant decrease of the computational time by up to two orders of magnitude is obtained as compared to standard solution methods.
Resistive switching memory (RRAM) is considered as one of the most promising devices for parallel computing solutions that may overcome the von Neumann bottleneck of today’s electronic systems. However, the existing RRAM-based parallel computing architectures suffer from practical problems such as device variations and extra computing circuits. In this work, we propose a novel parallel computing architecture for pattern recognition by implementing k-nearest neighbor classification on metal-oxide RRAM crossbar arrays. Metal-oxide RRAM with gradual RESET behaviors is chosen as both the storage and computing components. The proposed architecture is tested by the MNIST database. High speed (~100 ns per example) and high recognition accuracy (97.05%) are obtained. The influence of several non-ideal device properties is also discussed, and it turns out that the proposed architecture shows great tolerance to device variations. This work paves a new way to achieve RRAM-based parallel computing hardware systems with high performance.
Neuromorphic engineering combines the architectural and computational principles of systems neuroscience with semiconductor electronics, with the aim of building efficient and compact devices that mimic the synaptic and neural machinery of the brain. The energy consumptions promised by neuromorphic engineering are extremely low, comparable to those of the nervous system. Until now, however, the neuromorphic approach has been restricted to relatively simple circuits and specialized functions, thereby obfuscating a direct comparison of their energy consumption to that used by conventional von Neumann digital machines solving real-world tasks. Here we show that a recent technology developed by IBM can be leveraged to realize neuromorphic circuits that operate as classifiers of complex real-world stimuli. Specifically, we provide a set of general prescriptions to enable the practical implementation of neural architectures that compete with state-of-the-art classifiers. We also show that the energy consumption of these architectures, realized on the IBM chip, is typically two or more orders of magnitude lower than that of conventional digital machines implementing classifiers with comparable performance. Moreover, the spike-based dynamics display a trade-off between integration time and accuracy, which naturally translates into algorithms that can be flexibly deployed for either fast and approximate classifications, or more accurate classifications at the mere expense of longer running times and higher energy costs. This work finally proves that the neuromorphic approach can be efficiently used in real-world applications and has significant advantages over conventional digital devices when energy consumption is considered.
Wireless Sensor and Actuators Networks (WSANs) constitute one of the most challenging technologies with tremendous socio-economic impact for the next decade. Functionally and energy optimized hardware systems and development tools maybe is the most critical facet of this technology for the achievement of such prospects. Especially, in the area of agriculture, where the hostile operating environment comes to add to the general technological and technical issues, reliable and robust WSAN systems are mandatory. This paper focuses on the hardware design architectures of the WSANs for real-world agricultural applications. It presents the available alternatives in hardware design and identifies their difficulties and problems for real-life implementations. The paper introduces SensoTube, a new WSAN hardware architecture, which is proposed as a solution to the various existing design constraints of WSANs. The establishment of the proposed architecture is based, firstly on an abstraction approach in the functional requirements context, and secondly, on the standardization of the subsystems connectivity, in order to allow for an open, expandable, flexible, reconfigurable, energy optimized, reliable and robust hardware system. The SensoTube implementation reference model together with its encapsulation design and installation are analyzed and presented in details. Furthermore, as a proof of concept, certain use cases have been studied in order to demonstrate the benefits of migrating existing designs based on the available open-source hardware platforms to SensoTube architecture.
Computational modeling of drug binding to proteins is an integral component of direct drug design. Particularly, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of a large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon the Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern, multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, Intel Xeon Phi and NVIDIA Graphics Processing Unit (GPU). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs yielding a near-ideal scaling. Further, using Xeon Phi gives 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249.