Concept: Central processing unit
This software package provides an R-based framework to make use of multi-core computers when running analyses in the population genetics program STRUCTURE. It is especially addressed to users of STRUCTURE who deal with numerous and repeated data analyses and who could take advantage of an efficient script to automatically distribute STRUCTURE jobs among multiple processors. It also includes additional functions to divide analyses among combinations of populations within a single data set without the need to manually produce multiple projects, as is currently the case in STRUCTURE. The package consists of two main functions, MPI_structure() and parallel_structure(), as well as an example data file. We compared the computing time for this example data on two computer architectures and showed that the use of the present functions can result in several-fold improvements in computation time. ParallelStructure is freely available at https://r-forge.r-project.org/projects/parallstructure/.
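The job-distribution idea described above can be illustrated with a minimal sketch. It is written in Python rather than R and is not the ParallelStructure interface: build_jobs, run_job and run_all are hypothetical names, and run_job is a placeholder where a real runner would launch the external program. A thread pool is assumed to be sufficient because each worker would mostly wait on an external process.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def build_jobs(population_sets, k_values, replicates):
    """Enumerate one job per (population set, K, replicate) combination."""
    combos = product(population_sets, k_values, range(1, replicates + 1))
    return [
        {"job_id": i, "populations": pops, "k": k, "replicate": rep}
        for i, (pops, k, rep) in enumerate(combos, start=1)
    ]


def run_job(job):
    """Hypothetical placeholder for one run.

    A real runner would start the external program here (for example
    with subprocess.run) and collect its output files.
    """
    summary = f"pops={job['populations']} K={job['k']} rep={job['replicate']}"
    return job["job_id"], summary


def run_all(jobs, n_workers, runner=run_job):
    """Dispatch independent jobs to a pool of workers, keeping job order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(runner, jobs))


if __name__ == "__main__":
    # Two population subsets of one data set, three K values, two replicates.
    jobs = build_jobs([(1, 2), (1, 2, 3)], k_values=[1, 2, 3], replicates=2)
    for job_id, summary in run_all(jobs, n_workers=4):
        print(job_id, summary)
```

Because the jobs share no state, the number of workers can be matched to the number of available cores without changing the job list.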
QIIME (Quantitative Insights Into Microbial Ecology) is one of the most popular open-source bioinformatics suites for performing metagenome, 16S rRNA amplicon and Internal Transcribed Spacer (ITS) data analysis. Although it is a very comprehensive and powerful tool, it lacks a method to produce publication-ready taxonomic pie charts. The script plot_taxa_summary.py bundled with QIIME generates an HTML file and a folder containing the taxonomic pie chart and legend as separate images. The images have randomly generated alphanumeric names, so it is difficult to associate a pie chart with its legend and the corresponding sample identifier. Even if the option to place the legend within the HTML file is selected when executing plot_taxa_summary.py, it is very tedious to crop a complete image (containing both the pie chart and the legend) because the image sizes are unequal. Manually preparing the pie charts for multiple samples for publication takes a lot of time. Moreover, the random alphanumeric image names make it easy to mismatch a pie chart and its legend. To bypass these bottlenecks and make the process efficient, we have developed a Python-based program, prepare_taxa_charts.py, to automate the renaming, cropping and merging of each taxonomic pie chart and its corresponding legend image into a single, good-quality, publication-ready image. This program not only augments the functionality of plot_taxa_summary.py but is also user friendly and very fast in terms of CPU time.
Modern parallel hardware such as multi-core processors (CPUs) and graphics processing units (GPUs) offers high computational power, which can greatly benefit the simulation of large-scale neural networks. Over the past years, a number of efforts have focused on developing parallel algorithms and simulators best suited for the simulation of spiking neural models. In this article, we investigate the advantages and drawbacks of the CPU and GPU parallelization of mean-firing rate neurons, widely used in systems-level computational neuroscience. By comparing OpenMP, CUDA and OpenCL implementations against a serial CPU implementation, we show that GPUs are better suited than CPUs for the simulation of very large networks, but that smaller networks would benefit more from an OpenMP implementation. As this performance strongly depends on data organization, we analyze the impact of various factors such as data structure, memory alignment and floating-point precision. We then discuss the suitability of the different hardware depending on the networks' size and connectivity, as random or sparse connectivities in mean-firing rate networks tend to break parallel performance on GPUs due to the violation of coalescence.
A mixed parallel scheme that combines message passing interface (MPI) and multithreading was implemented in the AutoDock Vina molecular docking program. The resulting program, named VinaLC, was tested on the petascale high performance computing (HPC) machines at Lawrence Livermore National Laboratory. To exploit the typical cluster-type supercomputers, thousands of docking calculations were dispatched by the master process to run simultaneously on thousands of slave processes, where each docking calculation takes one slave process on one node, and within the node each docking calculation runs via multithreading on multiple CPU cores and shared memory. Input and output of the program and the data handling within the program were carefully designed to deal with large databases and ultimately achieve HPC on a large number of CPU cores. Parallel performance analysis of the VinaLC program shows that the code scales up to more than 15K CPUs with a very low overhead cost of 3.94%. One million flexible compound docking calculations took only 1.4 h to finish on about 15K CPUs. The docking accuracy of VinaLC has been validated against the DUD data set by the re-docking of X-ray ligands and an enrichment study; 64.4% of the top scoring poses have RMSD values under 2.0 Å. The program has been demonstrated to have good enrichment performance on 70% of the targets in the DUD data set. An analysis of the enrichment factors calculated at various percentages of the screening database indicates VinaLC has very good early recovery of actives. © 2013 Wiley Periodicals, Inc.
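The throughput figures quoted above can be put in per-docking terms with some simple arithmetic. The sketch below is illustrative only: it assumes "about 15K CPUs" means exactly 15,000 and that all CPUs were busy for the full 1.4 h, and the function names are hypothetical.

```python
def cpu_hours(wall_hours, n_cpus):
    """Total CPU time consumed, assuming every CPU is busy for the whole run."""
    return wall_hours * n_cpus


def cpu_seconds_per_task(wall_hours, n_cpus, n_tasks):
    """Average CPU-seconds spent per task (here, per docking calculation)."""
    return cpu_hours(wall_hours, n_cpus) * 3600 / n_tasks


def tasks_per_wall_second(wall_hours, n_tasks):
    """Aggregate throughput of the whole machine in tasks per second."""
    return n_tasks / (wall_hours * 3600)


if __name__ == "__main__":
    # Figures from the abstract; 15_000 is an assumed value for "about 15K".
    wall_hours, n_cpus, n_tasks = 1.4, 15_000, 1_000_000
    print(f"CPU-hours used:         {cpu_hours(wall_hours, n_cpus):,.0f}")
    print(f"CPU-seconds per docking: {cpu_seconds_per_task(wall_hours, n_cpus, n_tasks):.1f}")
    print(f"Dockings per second:     {tasks_per_wall_second(wall_hours, n_tasks):.1f}")
```

Under these assumptions the run corresponds to roughly 21,000 CPU-hours, or about 75.6 CPU-seconds per docking, with close to 200 dockings completing per wall-clock second.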
Conventional hardware platforms consume a huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage allow cognitive tasks to be completed more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry-friendly materials. The device shows bidirectional continuous weight modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using an Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result obtained using a central processing unit. These experimental results consolidate the feasibility of analogue synaptic arrays and pave the way toward building an energy-efficient and large-scale neuromorphic system.
Parallelization of molecular dynamics (MD) simulation is essential for investigating conformational dynamics of large biological systems, such as ribosomes, viruses, and multiple proteins in cellular environments. To improve efficiency in the parallel computation, we have to reduce the amount of data transfer between processors by introducing domain decomposition schemes. It is also important to optimize the computational balance between real-space non-bonded interactions and reciprocal-space interactions for long-range electrostatic interactions. Here, we introduce a novel parallelization scheme for large-scale MD simulations on massively parallel supercomputers consisting of only CPUs. We make use of a multiple program/multiple data (MPMD) approach for separating the real-space and reciprocal-space computations on different processors. We also use the r-RESPA multiple time step integrator efficiently within the framework of the MPMD approach: when the reciprocal-space computations are skipped in r-RESPA, the processors assigned to them are used for half of the real-space computations. The new scheme allows us to use twice as many processors as are available in the conventional single-program approach. The best performances of all-atom MD simulations for 1 million (STMV), 8.5 million (8_STMV), and 28.8 million (27_STMV) atom systems on the K computer are 65, 36, and 24 ns/day, respectively. The MPMD scheme improves on the maximum performance of the single-program approach by 23.4, 10.2, and 9.2 ns/day for the STMV, 8_STMV, and 27_STMV systems, respectively, which corresponds to speedups of 57%, 39%, and 60%. This suggests that significant speedups can be obtained by increasing the number of processors without losing parallel computational efficiency. © 2016 Wiley Periodicals, Inc.
- International journal of surgery (London, England)
- Published over 5 years ago
Personal portable information technology is advancing at a breathtaking speed. Google has recently introduced Glass, a device that is worn like conventional glasses, but that combines a computerized central processing unit, touchpad, display screen, high-definition camera, microphone, bone-conduction transducer, and wireless connectivity. We have obtained a Glass device through Google’s Explorer program and have tested its applicability in our daily pediatric surgical practice and in relevant experimental settings.
Integrated photonics changes the scaling laws of information and communication systems, offering architectural choices that combine photonics with electronics to optimize performance, power, footprint, and cost. Application-specific photonic integrated circuits, where particular circuits/chips are designed to optimally perform particular functionalities, require a considerable number of design and fabrication iterations, leading to long development times. A different approach, inspired by electronic Field Programmable Gate Arrays, is the programmable photonic processor, where common hardware implemented by a two-dimensional photonic waveguide mesh realizes different functionalities through programming. Here, we report the demonstration of such a reconfigurable waveguide mesh in silicon. We demonstrate over 20 different functionalities with a simple seven hexagonal cell structure, which can be applied to different fields including communications, chemical and biomedical sensing, signal processing, multiprocessor networks, and quantum information systems. Our work is an important step toward this paradigm. Integrated optical circuits today are typically designed for a few special functionalities and require complex design and development procedures. Here, the authors demonstrate a reconfigurable but simple silicon waveguide mesh with different functionalities.
Forward Wright-Fisher simulations are powerful in their ability to model complex demography and selection scenarios, but suffer from slow execution on the CPU, thus limiting their usefulness. The single-locus Wright-Fisher forward algorithm is, however, exceedingly parallelizable, with many steps that are so-called embarrassingly parallel, consisting of a vast number of individual computations that are all independent of each other and thus capable of being performed concurrently. The rise of modern Graphics Processing Units (GPUs) and of programming languages designed to leverage the inherent parallel nature of these processors has allowed researchers to dramatically speed up many programs that have such high arithmetic intensity and intrinsic concurrency. The presented GPU Optimized Wright-Fisher simulation, or GO Fish for short, can be used to simulate arbitrary selection and demographic scenarios while running over 250-fold faster than its serial counterpart on the CPU. Even modest GPU hardware can achieve an impressive speedup of over two orders of magnitude. With simulations so accelerated, one can not only do quick parametric bootstrapping of previously estimated parameters, but also use simulated results to calculate the likelihoods and summary statistics of demographic and selection models against real polymorphism data, all without restricting the demographic and selection scenarios that can be modeled or requiring approximations to the single-locus forward algorithm for efficiency. Further, as many of the parallel programming techniques used in this simulation can be applied to other computationally intensive algorithms important in population genetics, GO Fish serves as an exciting template for future research into accelerating computation in evolution. GO Fish is part of the Parallel PopGen Package available at: http://dl42.github.io/ParallelPopGen/.
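The independence that makes the single-locus model embarrassingly parallel can be seen in a small serial sketch. This is not the GO Fish algorithm or its API: it is a textbook haploid Wright-Fisher model with genic selection, and the function names, parameter defaults and seeding scheme are all assumptions made for illustration.

```python
import random


def next_frequency(p, s, n_chromosomes, rng):
    """Advance one generation: deterministic selection, then binomial drift."""
    # Genic selection: the allele has fitness 1 + s, mean fitness is 1 + s*p.
    p_sel = p * (1 + s) / (1 + s * p)
    # Binomial sampling of the next generation, one Bernoulli draw per chromosome.
    copies = sum(rng.random() < p_sel for _ in range(n_chromosomes))
    return copies / n_chromosomes


def simulate_locus(seed, p0=0.5, s=0.0, n_chromosomes=200, generations=50):
    """Simulate one locus with its own random stream; return the final frequency."""
    rng = random.Random(seed)
    p = p0
    for _ in range(generations):
        p = next_frequency(p, s, n_chromosomes, rng)
    return p


def simulate_loci(n_loci, **kwargs):
    """Simulate many loci; no locus reads or writes another locus's state."""
    return [simulate_locus(seed, **kwargs) for seed in range(n_loci)]


if __name__ == "__main__":
    for i, freq in enumerate(simulate_loci(5, s=0.02)):
        print(f"locus {i}: final allele frequency {freq:.3f}")
```

Since each call to simulate_locus depends only on its own seed and parameters, the loop in simulate_loci could be handed to parallel workers, or mapped onto GPU threads, without changing the per-locus logic.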
Deep belief networks hold great promise for the simulation of human cognition because they show how structured and abstract representations may emerge from probabilistic unsupervised learning. These networks build a hierarchy of progressively more complex distributed representations of the sensory data by fitting a hierarchical generative model. However, learning in deep networks typically requires big datasets and can involve millions of connection weights, which implies that simulations on standard computers are unfeasible. Developing realistic, medium-to-large-scale learning models of cognition would therefore seem to require expertise in programming parallel-computing hardware, and this might explain why the use of this promising approach is still largely confined to the machine learning community. Here we show how simulations of deep unsupervised learning can be easily performed on a desktop PC by exploiting the processors of low-cost graphics cards (graphics processing units) without any specific programming effort, thanks to the use of high-level programming routines (available in MATLAB or Python). We also show that even an entry-level graphics card can outperform a small high-performance computing cluster in terms of learning time and with no loss of learning quality. We therefore conclude that graphics card implementations pave the way for a widespread use of deep learning among cognitive scientists for modeling cognition and behavior.