Concept: Simultaneous multithreading


Understanding the detailed dynamics of neuronal networks will require the simultaneous measurement of spike trains from hundreds of neurons (or more). Currently, approaches to extracting spike times and labels from raw data are time-consuming, lack standardization, and involve manual intervention, making it difficult to maintain data provenance and assess the quality of scientific results. Here, we describe an automated clustering approach and associated software package that address these problems and provide novel cluster quality metrics. We show that our approach has accuracy comparable to or exceeding that achieved using manual or semi-manual techniques, with desktop central processing unit (CPU) runtimes faster than acquisition time for up to hundreds of electrodes. Moreover, a single choice of parameters in the algorithm is effective for a variety of electrode geometries and across multiple brain regions. This algorithm has the potential to enable reproducible and automated spike sorting of larger-scale recordings than is currently possible.

Concepts: Nervous system, Neuron, Brain, Human brain, Psychometrics, Computer program, Central processing unit, Simultaneous multithreading


In the present study, the multi-threading performance of the Geant4, MCNP6, and PHITS codes was evaluated as a function of the number of threads (N) and the complexity of the tetrahedral-mesh phantom. For this, three tetrahedral-mesh phantoms of different complexity (simple, moderately complex, and highly complex) were prepared and implemented in the three Monte Carlo codes, and photon and neutron transport simulations were carried out. For each case, the initialization time, calculation time, and memory usage were then measured as a function of the number of threads used in the simulation. For all codes, the initialization time increased significantly with the complexity of the phantom, but only slightly with the number of threads. Geant4 showed a much longer initialization time than the other codes, especially for the complex phantom (MRCP). The improvement in computation speed due to multi-threading was quantified as a speed-up factor, defined as the ratio of the computation speed of the multi-threaded code to that of the single-threaded code. Geant4 showed the best multi-threading performance among the codes considered, with the speed-up factor increasing almost linearly with the number of threads and reaching ~30 at N = 40. PHITS and MCNP6 showed a much smaller increase of the speed-up factor with the number of threads. For PHITS, the speed-up factor was no more than a few at N = 40. MCNP6 scaled better, but its speed-up factor was still less than ~10 at N = 40. Geant4 was also found to use more memory than the other codes.
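
The speed-up factor defined in this abstract can be illustrated with a minimal Python sketch. For a fixed amount of work, the ratio of computation speeds equals the ratio of wall-clock times, so the factor is computed here as single-threaded time divided by multi-threaded time. The function names and the timings are hypothetical and chosen only to reproduce a factor of 30 at N = 40; the parallel efficiency (speed-up divided by N) is an additional, commonly used quantity that the abstract does not report.

```python
def speedup_factor(single_thread_time_s, multi_thread_time_s):
    """Speed-up = multi-threaded speed / single-threaded speed.

    For the same workload this equals T(1 thread) / T(N threads).
    """
    if single_thread_time_s <= 0 or multi_thread_time_s <= 0:
        raise ValueError("times must be positive")
    return single_thread_time_s / multi_thread_time_s


def parallel_efficiency(speedup, n_threads):
    """Fraction of ideal (linear) scaling achieved with n_threads."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return speedup / n_threads


# Hypothetical timings (not taken from the study): 1200 s on 1 thread, 40 s on 40 threads.
t_single = 1200.0
t_multi = 40.0
n = 40

s = speedup_factor(t_single, t_multi)
print(f"speed-up at N={n}: {s:.1f}")
print(f"parallel efficiency: {parallel_efficiency(s, n):.2f}")
```

Under these assumed timings, a speed-up of 30 on 40 threads corresponds to 75% of ideal linear scaling.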

Concepts: Present, Time, Mathematics, Ratio, Monte Carlo, Monte Carlo method, Concurrency, Simultaneous multithreading


Along with advances in thread lift techniques and materials, ancillary procedures such as fat grafting, liposuction, or filler injections are increasingly performed at the same time. Some surgeons think that these ancillary procedures might affect the aesthetic outcomes of thread lifting, possibly through inadvertent injury to the threads or loosening of soft tissue when the cannula passes through the surgical plane of the thread lift. The purpose of the current study is to determine the effect of such ancillary procedures on the outcome of thread lifts in human and cadaveric settings.

Concepts: Surgery, Tissues, Plastic surgery, Soft tissue, The Current, Outcome, Threads, Simultaneous multithreading


As energy consumption has been surging in an unsustainable way, it is important to understand the impact of existing architecture designs from an energy efficiency perspective, which is especially valuable for High Performance Computing (HPC) and datacenter environments hosting tens of thousands of servers. One obstacle hindering comprehensive evaluation of energy efficiency is the lack of adequate power measurement approaches. Most energy studies rely on either external power meters or power models, and both methods have intrinsic drawbacks in practical adoption and measurement accuracy. Fortunately, the advent of the Intel Running Average Power Limit (RAPL) interfaces has taken power measurement to the next level, with higher accuracy and finer time resolution. We therefore argue that now is the right time to conduct an in-depth evaluation of existing architecture designs to understand their impact on system energy efficiency. In this paper, we use representative benchmark suites, including serial and parallel workloads from diverse domains, to evaluate architecture features such as Non-Uniform Memory Access (NUMA), Simultaneous Multithreading (SMT) and Turbo Boost. Energy is tracked at the subcomponent level, covering Central Processing Unit (CPU) cores, uncore components and Dynamic Random-Access Memory (DRAM), by exploiting the power measurement ability exposed by RAPL.
The experiments reveal non-intuitive results: 1) the mismatch between local compute and remote memory nodes caused by the NUMA effect not only generates a dramatic surge in power and energy but also significantly degrades energy efficiency; 2) for multithreaded applications such as those in the Princeton Application Repository for Shared-Memory Computers (PARSEC), most workloads gain a notable increase in energy efficiency with SMT, with a decline of more than 40% in average power consumption; 3) Turbo Boost is effective at accelerating workload execution and also saves energy; however, it may not be applicable on systems with a tight power budget.
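
As a rough illustration of how RAPL energy counters can be turned into energy and average power figures, here is a minimal Python sketch. It assumes a Linux system that exposes RAPL through the powercap sysfs interface, with the package-0 domain at `/sys/class/powercap/intel-rapl:0`; that path, the helper names, and the single-wraparound handling are assumptions made for this example and are not described in the abstract. Reading the counters may require elevated privileges.

```python
import time
from pathlib import Path

# Assumption: package-0 RAPL domain as exposed by the Linux powercap framework.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_counter(path):
    """Read an integer value (microjoules) from a sysfs file."""
    return int(Path(path).read_text().strip())


def energy_delta_uj(start_uj, end_uj, max_range_uj):
    """Energy consumed between two readings, allowing for one counter wraparound."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    # The counter wrapped past max_range_uj and restarted near zero.
    return (max_range_uj - start_uj) + end_uj


def average_power_w(delta_uj, elapsed_s):
    """Average power in watts from an energy delta in microjoules."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (delta_uj / 1e6) / elapsed_s


def measure(workload, domain=RAPL_DOMAIN):
    """Run workload() and return (energy in joules, average power in watts)."""
    max_range = read_counter(domain / "max_energy_range_uj")
    e0 = read_counter(domain / "energy_uj")
    t0 = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - t0
    e1 = read_counter(domain / "energy_uj")
    delta = energy_delta_uj(e0, e1, max_range)
    return delta / 1e6, average_power_w(delta, elapsed)
```

On a machine where the assumed sysfs files are readable, `measure(lambda: sum(range(10**7)))` would return the package energy and average power for that toy workload. Per-subcomponent figures such as cores, uncore or DRAM would need the corresponding subdomains, which vary between processors.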

Concepts: Parallel computing, Computer, Central processing unit, Thread, Efficient energy use, Energy consumption, Simultaneous multithreading, Multithreading


In this paper, we propose a new visual-inertial Simultaneous Localization and Mapping (SLAM) algorithm. With tightly coupled sensor fusion of a global shutter monocular camera and a low-cost Inertial Measurement Unit (IMU), the algorithm achieves robust, real-time estimates of the sensor poses in unknown environments. To address the real-time visual-inertial fusion problem, we present a parallel framework with a novel IMU initialization method. Our algorithm also benefits from a novel IMU factor, a continuous preintegration method, a vision factor based on directional error, the separability trick and a robust initialization criterion, which together allow it to output reliable estimates efficiently in real time on a modern Central Processing Unit (CPU). Extensive experiments validate the proposed algorithm and show that it is comparable to state-of-the-art methods.

Concepts: Measurement, Metrology, Psychometrics, Parallel computing, Computer program, Units of measurement, Central processing unit, Simultaneous multithreading


For high-resolution, iterative 3D PET image reconstruction, the efficient implementation of forward-backward projectors is essential to minimise the calculation time. Mathematically, the projectors are summarised as a system response matrix (SRM) whose elements define the contribution of image voxels to lines-of-response (LORs). The SRM easily comprises billions of non-zero matrix elements to evaluate the tremendous number of LORs provided by state-of-the-art PET scanners. Hence, the performance of iterative algorithms, e.g. maximum-likelihood-expectation-maximisation (MLEM), suffers from severe computational problems due to the intensive memory access and the huge number of floating point operations.
Here, symmetries play a key role in efficient implementation. They reduce the number of independent SRM elements, thus allowing for a significant matrix compression according to the number of exploitable symmetries. With our previous work, the PET REconstruction Software TOolkit (PRESTO), very high compression factors (>300) were demonstrated by using specific non-Cartesian voxel patterns involving discrete polar symmetries. In this way, a pre-calculated memory-resident SRM using complex volume-of-intersection calculations can be achieved. However, our original ray-driven implementation addresses voxels, projection data and SRM elements in unfavourable memory access patterns. As a consequence, only a rather limited numerical throughput is observed, owing to the massive waste of memory bandwidth and the inefficient usage of cache.
In this work, an advantageous symmetry-driven evaluation of the forward-backward projectors is proposed to overcome these inefficiencies. The polar symmetries applied in PRESTO suggest a novel organisation of image data and LOR projection data in memory to enable an efficient single instruction multiple data vectorisation, i.e. simultaneous use of any SRM element for symmetric LORs.
In addition, the calculation time is further reduced by using simultaneous multi-threading (SMT). A global speedup factor of 11 without SMT and above 100 with SMT has been achieved for the improved CPU-based implementation while obtaining equivalent numerical results.
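
To make the role of the forward and backward projectors in MLEM concrete, the following minimal Python sketch runs the standard MLEM update on a tiny dense SRM. It is not the PRESTO implementation: it uses no symmetries, compression, vectorisation or multi-threading, and the 3-LOR, 2-voxel matrix, the function names and the data are hypothetical values chosen for illustration.

```python
def forward_project(srm, image):
    """Expected counts per LOR: each SRM row applied to the image."""
    return [sum(a * x for a, x in zip(row, image)) for row in srm]


def back_project(srm, lor_values):
    """Per-voxel sum of LOR values weighted by the SRM column."""
    n_voxels = len(srm[0])
    return [
        sum(srm[i][j] * lor_values[i] for i in range(len(srm)))
        for j in range(n_voxels)
    ]


def mlem(srm, measured, n_iter=50):
    """Standard MLEM: image *= backproject(measured / forward(image)) / sensitivity."""
    sensitivity = back_project(srm, [1.0] * len(srm))
    image = [1.0] * len(srm[0])  # uniform, strictly positive start
    for _ in range(n_iter):
        expected = forward_project(srm, image)
        ratios = [m / e if e > 0 else 0.0 for m, e in zip(measured, expected)]
        correction = back_project(srm, ratios)
        image = [
            x * c / s if s > 0 else 0.0
            for x, c, s in zip(image, correction, sensitivity)
        ]
    return image


# Hypothetical toy system: 3 LORs, 2 voxels, noise-free data from the image [2, 3].
srm = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
measured = forward_project(srm, [2.0, 3.0])
print(mlem(srm, measured))
```

Every iteration performs one forward and one backward projection over all SRM elements, which is why the memory access pattern of those two loops dominates the runtime for realistically sized matrices.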

Concepts: Mathematics, Symmetry, Geometry, Group, Number, Pixel, SIMD, Simultaneous multithreading