78 Hits in 0.33 sec

Cluster Computing for Determining Three-Dimensional Protein Structure

Paulius Micikevicius, Narsingh Deo
2005 Journal of Supercomputing  
Determining the three-dimensional structure of proteins is crucial to efficient drug design and understanding biological processes. One successful method for computing the molecule's shape relies on inter-atomic distance bounds provided by Nuclear Magnetic Resonance spectroscopy. The accuracy of computed structures as well as the time required to obtain them are greatly improved if the gaps between the upper and lower distance-bounds are reduced. These gaps are reduced most effectively by applying the tetrangle inequality, derived from the Cayley-Menger determinant, to all atom-quadruples. However, tetrangle-inequality bound-smoothing is an extremely computation-intensive task, requiring O(n^4) time for an n-atom molecule. To reduce computation time, we propose a novel coarse-grained parallel algorithm intended for a Beowulf-type cluster of PCs. The algorithm employs p ≤ n/6 processors and requires O(n^4/p) time and O(p^2) communications, where n is the number of atoms in a molecule. The number of communications is at least an order of magnitude lower than in the earlier parallelizations. Our implementation utilized processors with at least 59% efficiency (including the communication overhead) — an impressive figure for a non-embarrassingly parallel problem on a cluster of workstations.
doi:10.1007/s11227-005-1168-0 fatcat:4sm4jkpoi5bvdfjbxxx7ktquyi
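Editorial aside: the geometric test that drives this smoothing step is compact enough to sketch. Below is a minimal Python illustration of the Cayley-Menger check for a single atom-quadruple; it is our illustration only (function names are ours), and the paper's contribution is the parallel decomposition, not this check.

```python
# Sketch of the test behind tetrangle-inequality smoothing: six pairwise
# distances among four atoms are realizable in 3-D space iff their
# Cayley-Menger determinant (288 * tetrahedron volume^2) is >= 0.
import itertools
import numpy as np

def cayley_menger(d):
    """d is a 4x4 symmetric matrix of pairwise distances (zero diagonal)."""
    m = np.ones((5, 5))
    m[0, 0] = 0.0
    for i, j in itertools.product(range(4), repeat=2):
        m[i + 1, j + 1] = d[i][j] ** 2
    return np.linalg.det(m)

def realizable_in_3d(d, tol=1e-9):
    return cayley_menger(d) >= -tol

# Regular unit tetrahedron: volume^2 = 1/72, so the determinant is 288/72 = 4.
unit = np.ones((4, 4)) - np.eye(4)
assert abs(cayley_menger(unit) - 4.0) < 1e-6
```

Bound smoothing applies this test across atom-quadruples, tightening an upper or lower bound whenever a combination of extreme distances would make the determinant negative.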

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation [article]

Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, Paulius Micikevicius
2020 arXiv   pre-print
Quantization techniques can reduce the size of Deep Neural Networks and improve inference latency and throughput by taking advantage of high throughput integer instructions. In this paper we review the mathematical aspects of quantization parameters and evaluate their choices on a wide range of neural network models for different application domains, including vision, speech, and language. We focus on quantization techniques that are amenable to acceleration by processors with high-throughput
more » ... teger math pipelines. We also present a workflow for 8-bit quantization that is able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are more difficult to quantize, such as MobileNets and BERT-large.
arXiv:2004.09602v1 fatcat:ykqrhfoa7zdqbjj7n6pd3l5u2i
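For readers skimming: the core quantization parameter is a scale choice. A minimal, hedged sketch of symmetric per-tensor int8 quantization with max calibration follows; this is our illustration of one parameter choice the paper evaluates, not the paper's workflow (which also covers per-channel scales and smarter calibration).

```python
# Minimal sketch: symmetric, per-tensor int8 quantization with max calibration.
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0   # max calibration; assumes x is not all zeros
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float32)
q, s = quantize_int8(x)
max_err = np.abs(dequantize(q, s) - x).max()   # rounding error bounded by ~s/2
```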

Coarse-Grained Parallelization of Distance-Bound Smoothing for the Molecular Conformation Problem [chapter]

Narsingh Deo, Paulius Micikevicius
2002 Lecture Notes in Computer Science  
Determining the three-dimensional structure of proteins is crucial to efficient drug design and understanding biological processes. One successful method for computing the molecule's shape relies on the inter-atomic distance bounds provided by Nuclear Magnetic Resonance (NMR) spectroscopy. The accuracy of computed structures as well as the time required to obtain them are greatly improved if the gaps between the upper and lower distance-bounds are reduced. These gaps are reduced most effectively by applying the tetrangle inequality, derived from the Cayley-Menger determinant, to all atom-quadruples. However, tetrangle-inequality bound-smoothing is an extremely computation-intensive task, requiring O(n^4) time for an n-atom molecule. To reduce the computation time, we propose a novel coarse-grained parallel algorithm intended for a Beowulf-type cluster of PCs. The algorithm employs p ≤ n/6 processors and requires O(n^4/p) time and O(p^2) communications. The number of communications is at least an order of magnitude lower than in the earlier parallelizations. Our implementation utilized processors with at least 59% efficiency (including the communication overhead) — an impressive figure for a non-embarrassingly parallel problem on a cluster of workstations.
doi:10.1007/3-540-36385-8_6 fatcat:z2nowmkupfebpprlvclf7l7xza
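This abstract is essentially the journal version above; the detail worth illustrating here is the coarse-grained division of the O(n^4) quadruple workload. The toy ownership rule below is hypothetical and far simpler than the paper's block distribution, which is what actually achieves the O(p^2) communication bound.

```python
# Toy sketch of coarse-grained work division: each of p workers smooths only
# the atom-quadruples it "owns". The rule here (first index mod p) is our
# hypothetical stand-in, not the paper's block distribution.
from itertools import combinations

def quadruples_for_worker(n, p, rank):
    for quad in combinations(range(n), 4):
        if quad[0] % p == rank:
            yield quad
```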

Mixed Precision Training [article]

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu
2018 arXiv   pre-print
Deep neural networks have enabled progress in a wide variety of applications. Growing the size of the neural network typically results in improved accuracy. As model sizes grow, the memory and compute requirements for training these models also increase. We introduce a technique to train deep neural networks using half-precision floating-point numbers. In our technique, weights, activations, and gradients are stored in IEEE half-precision format. Half-precision floating-point numbers have limited numerical range compared to single-precision numbers. We propose two techniques to handle this loss of information. First, we recommend maintaining a single-precision copy of the weights that accumulates the gradients after each optimizer step. This single-precision copy is rounded to half-precision format during training. Second, we propose scaling the loss appropriately to handle the loss of information with half-precision gradients. We demonstrate that this approach works for a wide variety of models, including convolutional neural networks, recurrent neural networks, and generative adversarial networks. This technique works for large-scale models with more than 100 million parameters trained on large datasets. Using this approach, we can reduce the memory consumption of deep learning models by nearly 2x. On future processors, we can also expect a significant computation speedup using half-precision hardware units.
arXiv:1710.03740v3 fatcat:azpkxhsw7bahljufrylfysbqo4
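The two techniques named in the abstract (an FP32 master copy of the weights, plus loss scaling) fit in a few lines. Below is a hedged numpy sketch of one optimizer step; `grad_fn`, the learning rate, and the fixed loss-scale value are illustrative stand-ins, not the paper's implementation.

```python
# Sketch of one mixed-precision SGD step: FP16 forward/backward with an FP32
# master copy of the weights and a fixed loss scale. Scaling the loss keeps
# small FP16 gradients (e.g. ~1e-6, subnormal in FP16) in representable range.
import numpy as np

def train_step(master_w, grad_fn, lr=0.01, loss_scale=1024.0):
    w16 = master_w.astype(np.float16)           # round weights to FP16 for fwd/bwd
    scaled_g16 = grad_fn(w16, loss_scale)       # FP16 gradient of the *scaled* loss
    g32 = scaled_g16.astype(np.float32) / loss_scale  # unscale in FP32
    master_w -= lr * g32                        # update the FP32 master copy
    return master_w
```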

3D finite difference computation on GPUs using CUDA

Paulius Micikevicius
2009 Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units - GPGPU-2  
In this paper we describe a GPU parallelization of the 3D finite difference computation using CUDA. Data access redundancy is used as the metric to determine the optimal implementation for both the stencil-only computation and the discretization of the wave equation, which is currently of great interest in seismic computing. For the larger stencils, the described approach achieves a throughput of between 2,400 and over 3,000 million output points per second on a single Tesla 10-series GPU. This is roughly an order of magnitude higher than a 4-core Harpertown CPU running similar code from the seismic industry. Multi-GPU parallelization is also described, achieving linear scaling with the number of GPUs by overlapping inter-GPU communication with computation.
doi:10.1145/1513895.1513905 dblp:conf/asplos/Micikevicius09 fatcat:btkwmljcpbbrrcoe3lj6zsilu4
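To make the workload concrete, here is a hedged numpy version of the stencil-only pass for a radius-4 (25-point) stencil. The paper's CUDA kernels tile x-y planes in shared memory and stream along z, which this sketch deliberately omits; the coefficient array `c` is an assumed input.

```python
# Sketch of a 25-point, 8th-order 3-D stencil (the stencil-only computation).
# c[0] is the center tap of each 1-D stencil; c[1..4] are the symmetric taps.
import numpy as np

def stencil_3d(u, c):
    R = len(c) - 1                    # stencil radius (4 for 8th order)
    nx, ny, nz = u.shape
    out = 3.0 * c[0] * u[R:nx-R, R:ny-R, R:nz-R]   # center tap, once per axis
    for r in range(1, R + 1):
        out += c[r] * (u[R-r:nx-R-r, R:ny-R, R:nz-R] + u[R+r:nx-R+r, R:ny-R, R:nz-R])
        out += c[r] * (u[R:nx-R, R-r:ny-R-r, R:nz-R] + u[R:nx-R, R+r:ny-R+r, R:nz-R])
        out += c[r] * (u[R:nx-R, R:ny-R, R-r:nz-R-r] + u[R:nx-R, R:ny-R, R+r:nz-R+r])
    return out                        # interior points only; halo region untouched
```

Each output point reads 25 inputs, so minimizing redundant loads of shared neighbors is exactly the data-access-redundancy metric the paper optimizes.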

Fusing convolution kernels through tiling

Mahesh Ravishankar, Paulius Micikevicius, Vinod Grover
2015 Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming - ARRAY 2015  
Image processing pipelines are continuously being developed to deduce more information about objects captured in images. To facilitate the development of such pipelines, several Domain Specific Languages (DSLs) have been proposed that provide constructs for easy specification of such computations. It is then up to the DSL compiler to generate code that executes the pipeline efficiently on multiple hardware architectures. While such compilers are getting ever more sophisticated, to achieve large-scale adoption these DSLs have to beat, or at least match, the performance that can be achieved by a skilled programmer. Many of these pipelines use a sequence of convolution kernels that are memory-bandwidth bound. One way to address this bottleneck is through the use of tiling. In this paper we describe an approach to tiling within the context of a DSL called Forma. Using the high-level specification of the pipeline in this DSL, we describe a code generation algorithm that fuses multiple stages of the pipeline through the use of tiling to reduce the memory bandwidth requirements on both GPU and CPU. Using this technique improves the performance of pipelines like Canny Edge Detection by 58% on NVIDIA GPUs, and of the Harris Corner Detection pipeline by 71% on CPUs.
doi:10.1145/2774959.2774965 dblp:conf/pldi/RavishankarMG15 fatcat:6g6un5g6crcgxmuow5fllc4ziu
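The fusion idea itself is easy to show in miniature. The hedged Python sketch below fuses two 3x3 blur stages by recomputing stage one on a haloed tile, so the intermediate never round-trips through memory; the kernels and tile size are toy stand-ins of ours, whereas Forma derives halo sizes from the DSL specification.

```python
# Toy sketch of fusing two convolution stages via tiling: to produce one
# T x T tile of blur(blur(img)), load a (T+4) x (T+4) haloed input tile and
# keep the intermediate stage entirely in registers/cache.
import numpy as np

def blur3(x):                                  # stand-in 3x3 box-filter stage
    out = np.zeros((x.shape[0] - 2, x.shape[1] - 2))
    for dy in range(3):
        for dx in range(3):
            out += x[dy:dy + out.shape[0], dx:dx + out.shape[1]]
    return out / 9.0

def fused_tile(img, y, x, T):
    halo = img[y:y + T + 4, x:x + T + 4]       # 2-pixel halo covers both stages
    return blur3(blur3(halo))                  # T x T output, no DRAM round-trip
```

The trade-off is the one the paper quantifies: fusion saves a full read and write of the intermediate image at the cost of redundant computation in the halo.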

Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq [article]

Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Jason Li, Huyen Nguyen, Carl Case, Paulius Micikevicius
2018 arXiv   pre-print
To prevent accuracy loss due to the reduced precision, we use two techniques suggested by Micikevicius et al. (2017): 1. ... • Support for mixed-precision training (Micikevicius et al., 2017), which utilizes Tensor Cores introduced in NVIDIA Volta GPUs. • Fast, simple-to-use, Horovod-based distributed training via data parallelism ...
arXiv:1805.10387v2 fatcat:sccepevlovfvppigqjwq56zrv4

Scaling the Power Wall: A Path to Exascale

Oreste Villa, Daniel R. Johnson, Mike Oconnor, Evgeny Bolotin, David Nellans, Justin Luitjens, Nikolai Sakharnykh, Peng Wang, Paulius Micikevicius, Anthony Scudiero, Stephen W. Keckler, William J. Dally
2014 SC14: International Conference for High Performance Computing, Networking, Storage and Analysis  
Modern scientific discovery is driven by an insatiable demand for computing performance. The HPC community is targeting development of supercomputers able to sustain 1 ExaFlops by the year 2020, and power consumption is the primary obstacle to achieving this goal. A combination of architectural improvements, circuit design, and manufacturing technologies must provide over a 20× improvement in energy efficiency. In this paper, we present some of the progress NVIDIA Research is making toward the design of Exascale systems by tailoring features to address the scaling challenges of performance and energy efficiency. We evaluate several architectural concepts for a set of HPC applications, demonstrating expected energy efficiency improvements resulting from circuit and packaging innovations such as low-voltage SRAM, low-energy signaling, and on-package memory. Finally, we discuss the scaling of these features with respect to future process technologies and provide power and performance projections for our Exascale research architecture.
doi:10.1109/sc.2014.73 dblp:conf/sc/VillaJOBNLSWMSKD14 fatcat:63yyeyb5jrhqpp6mypnn5hz44m

MLPerf Training Benchmark [article]

Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen (+24 others)
2020 arXiv   pre-print
... 2017; Micikevicius et al., 2018) and performance optimizations, ML benchmarks must include accuracy metrics. ...
arXiv:1910.01500v3 fatcat:ciwfjyu3x5crrm2fy2275g3wiy

MLPerf Inference Benchmark [article]

Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, Ramesh Chukka, Cody Coleman (+35 others)
2020 arXiv   pre-print
Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. In this paper, we present our benchmarking method for evaluating ML inference systems. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures. The first call for submissions garnered more than 600 reproducible inference-performance measurements from 14 organizations, representing over 30 systems that showcase a wide range of capabilities. The submissions attest to the benchmark's flexibility and adaptability.
arXiv:1911.02549v2 fatcat:jewandlmivctjb7wywadxuuqju

GPU implementation of minimal dispersion recursive operators for reverse time migration

Allon Bartana*, Dan Kosloff, Brandon Warnell, Chris Connor, Jeff Codd, David Kessler, Paulius Micikevicius, Ty Mckercher, Peng Wang, Paul Holzhauer
2015 SEG Technical Program Expanded Abstracts 2015   unpublished
To understand the GPU implementation with recursive operators, and the challenges it presents, we first review the main elements of the GPU implementation using FD approximation of the derivatives (Micikevičius ...) ...
doi:10.1190/segam2015-5754164.1 fatcat:yyic3x5xkjge3erdmrz5jl3mxe

OpenSeq2Seq: Extensible Toolkit for Distributed and Mixed Precision Training of Sequence-to-Sequence Models

Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Carl Case, Paulius Micikevicius
2018 Proceedings of Workshop for NLP Open Source Software (NLP-OSS)   unpublished
In particular, OpenSeq2Seq adds support for mixed precision training as described in (Micikevicius et al., 2017). ... However, this method has proven to be robust across a large variety of complex models (Micikevicius et al., 2017). ...
doi:10.18653/v1/w18-2507 fatcat:v4jgh6kkpbgzlaebk5cyritmk4

Page 8573 of Mathematical Reviews, Issue 2001K [page]

2001 Mathematical Reviews
[Author index column from a scanned page; the relevant entry is "Micikevicius, Paulius".]

Page 7601 of Mathematical Reviews, Issue 2000k [page]

2000 Mathematical Reviews
Fennessey, On a finite polynomial generating function for Catalan subsequences: an 18th century observation proved (49-60); Narsingh Deo and Paulius Micikevicius, A heuristic for a leaf constrained minimum ...

Page 786 of Mathematical Reviews, Issue 2003A [page]

2003 Mathematical Reviews
[Author index column from a scanned page; the relevant entry is "Micikevicius, Paulius".]
Showing results 1 — 15 out of 78 results