
Understanding Scalability and Fine-Grain Parallelism of Synchronous Data Parallel Training

Jiali Li, Bogdan Nicolae, Justin Wozniak, George Bosilca
2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC)
With increasing complexity of learning models and amounts of training data, data-parallel approaches based on frequent all-reduce synchronization steps are increasingly popular.  ...  Index Terms: deep learning, data-parallel training, behavior analysis  ...  It used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.  ...
doi:10.1109/mlhpc49564.2019.00006 dblp:conf/sc/LiNWB19 fatcat:pcxwhll7xncrdp2m652gpx323u
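The all-reduce synchronization step this abstract refers to can be sketched in a few lines. The following is a hedged, pure-Python simulation of one synchronous data-parallel step, not code from the paper; worker count, gradient values, and function names are illustrative:

```python
# Minimal simulation of one synchronous data-parallel training step:
# each worker contributes its local gradients, an all-reduce averages
# them, and every replica applies the identical update.

def allreduce_mean(grads_per_worker):
    """Elementwise average over workers: the result of the all-reduce."""
    n = len(grads_per_worker)
    dim = len(grads_per_worker[0])
    return [sum(g[i] for g in grads_per_worker) / n for i in range(dim)]

def sync_step(weights, grads_per_worker, lr=0.1):
    """Apply the averaged gradient; all replicas end with identical weights."""
    avg = allreduce_mean(grads_per_worker)
    return [w - lr * g for w, g in zip(weights, avg)]

weights = [1.0, 2.0]
local_grads = [[0.2, 0.4], [0.6, 0.0]]   # gradients from two workers
weights = sync_step(weights, local_grads)  # averaged gradient: [0.4, 0.2]
```

Because every replica sees the same averaged gradient, the model copies never diverge; the cost is that all workers must wait at the all-reduce, which is why the paper studies its scalability.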

Towards a Scalable and Distributed Infrastructure for Deep Learning Applications [article]

Bita Hasheminezhad, Shahrzad Shirzad, Nanmiao Wu, Patrick Diehl, Hannes Schulz, Hartmut Kaiser
2020 arXiv   pre-print
effective and efficient fine-grained inter-node communication.  ...  and concurrency (HPX), leveraging fine-grained threading and an active messaging task-based runtime system.  ...  Acknowledgements The authors are grateful for the support of this work by the LSU Center for Computation & Technology and by the DTIC project: Phylanx Engine Enhancement and Visualizations Development  ... 
arXiv:2010.03012v1 fatcat:2hy7evtvdra2dotv35dvbhv7mu

Coarse grain parallelization of deep neural networks

Marc Gonzalez Tallada
2016 Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP '16  
This paper describes the implementation and analysis of a network-agnostic and convergence-invariant coarse-grain parallelization of the DNN training algorithm.  ...  The coarse-grain parallelization is achieved through the exploitation of batch-level parallelism. This strategy is independent of the support of specialized and optimized libraries.  ...  Overall Performance Coarse-grain vs Fine-grain: Pros and Cons In this section we present the lessons learnt regarding the coarse-grain and fine-grain parallelizations of the DNN training process.  ...
doi:10.1145/2851141.2851158 dblp:conf/ppopp/Tallada16 fatcat:ipabmp2f6revhfly6agikqkt74

BigDL: A Distributed Deep Learning Framework for Big Data [article]

Jason Dai, Yiheng Wang, Xin Qiu, Ding Ding, Yao Zhang, Yanzhang Wang, Xianyan Jia, Cherry Zhang, Yan Wan, Zhichao Li, Jiao Wang, Shengsheng Huang, Zhongyuan Wu, Yang Wang (+6 others)
2018 arXiv   pre-print
as to achieve highly scalable, data-parallel distributed training.  ...  It is implemented on top of Apache Spark, and allows users to write their deep learning applications as standard Spark programs (running directly on large-scale big data clusters in a distributed fashion  ...  Acknowledgement We gratefully acknowledge contributions from our (current and former) colleagues at Intel  ... 
arXiv:1804.05839v3 fatcat:u5afdn37l5c7lalqxqmlj5se6e

DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models

Bogdan Nicolae, Jiali Li, Justin M. Wozniak, George Bosilca, Matthieu Dorier, Franck Cappello
2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)
However, with increasing size of the learning models and popularity of distributed data-parallel training approaches, simple checkpointing techniques used so far face several limitations: low serialization  ...  In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications.  ...  and vertical scalability (e.g., fine-grain layerwise parallelism).  ... 
doi:10.1109/ccgrid49817.2020.00-76 dblp:conf/ccgrid/NicolaeLWBDC20 fatcat:s4565nfzczhfzmk4gir3tgkt64
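The asynchronous checkpointing idea this abstract points at can be illustrated generically. The sketch below is not DeepFreeze's actual design; it shows the common pattern of taking a cheap in-memory snapshot and serializing it from a background thread so training can continue, with all names and the JSON format chosen for illustration:

```python
# Generic sketch of asynchronous checkpointing (not DeepFreeze itself):
# snapshot the model state synchronously, then push the slow I/O to a
# background thread so the training loop is blocked only by the copy.
import copy
import json
import threading

def checkpoint_async(model_state, path):
    """Snapshot now, write later; returns the writer thread for joining."""
    snapshot = copy.deepcopy(model_state)      # fast, blocking copy
    def _write():
        with open(path, "w") as f:             # slow I/O off the critical path
            json.dump(snapshot, f)
    t = threading.Thread(target=_write)
    t.start()
    return t

state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
writer = checkpoint_async(state, "ckpt.json")
state["step"] += 1            # training proceeds; the snapshot is unaffected
writer.join()                 # wait only when durability is actually needed
```

The deep copy decouples the checkpoint from subsequent updates; the serialization-cost and memory-overhead trade-offs of that copy are exactly the kind of limitation the paper analyzes.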

Graph Analytics Through Fine-Grained Parallelism

Zechao Shang, Feifei Li, Jeffrey Xu Yu, Zhiwei Zhang, Hong Cheng
2016 Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16  
Among them, synchronous processing has the potential to achieve the best performance due to fine-grained parallelism, while ensuring the correctness and the convergence of the computation, if an effective  ...  To increase efficiency and scalability, in-memory computation and parallelism have been explored extensively to speed up various graph analytical workloads.  ...  The work was supported by grant of Research Grants Council of the Hong Kong SAR, China No. 14209314, The Chinese University of Hong Kong Direct Grant No. 4055048.  ... 
doi:10.1145/2882903.2915238 dblp:conf/sigmod/ShangLYZC16 fatcat:ru3d5ebn75b3pihqyqieqsbcje

Complementing user-level coarse-grain parallelism with implicit speculative parallelism

Nikolas Ioannou, Marcelo Cintra
2011 Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-44 '11  
understand the issue and provide invaluable feedback.  ...  This thesis proposes using cores that do not contribute to performance improvement for running implicit fine-grain speculative threads.  ...  However, parallel programming is often hard and error prone, especially when addressing fine-grain threading which involves complex synchronization, communication, data partitioning, and scheduling [67  ... 
doi:10.1145/2155620.2155654 dblp:conf/micro/IoannouC11 fatcat:fskfxnf45jcvhm4n3vnw3crvm4


Gokcen Kestor, Vasileios Karakostas, Osman S. Unsal, Adrian Cristal, Ibrahim Hur, Mateo Valero
2011 Proceeding of the second joint WOSP/SIPEW international conference on Performance engineering - ICPE '11  
Its main goal is to make parallel programming for Chip Multiprocessors (CMPs) easier than using the traditional lock synchronization constructs, without compromising the performance and the scalability  ...  In addition to featuring current TM research issues such as nesting and I/O and system calls inside transactions, the RMS-TM applications also provide a mix of short and long transactions with small/large  ...
doi:10.1145/1958746.1958795 dblp:conf/wosp/KestorKUCHV11 fatcat:v4o3drvdtbc5haop4zykxovwrq

A view of programming scalable data analysis: from clouds to exascale

Domenico Talia
2019 Journal of Cloud Computing: Advances, Systems and Applications  
Scalable big data analysis today can be achieved by parallel implementations that are able to exploit the computing and storage facilities of high performance computing (HPC) systems and clouds, whereas  ...  This paper discusses how clouds currently support the development of scalable data mining solutions, and outlines and examines the main challenges to be addressed and solved for implementing innovative  ...  Availability of data and materials Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.  ...
doi:10.1186/s13677-019-0127-x fatcat:l5mimqzwibh7fn4fedlsz4jkji


Giorgis Georgakoudis, Hans Vandierendonck, Peter Thoman, Bronis R. De Supinski, Thomas Fahringer, Dimitrios S. Nikolopoulos
2017 ACM Transactions on Architecture and Code Optimization (TACO)  
Unlike previous approaches, SCALO differs by including dynamic contention effects on scalability and by controlling the parallelism during the execution of parallel regions.  ...  SCALO monitors co-executing applications at runtime to evaluate their scalability. Its optimizing thread allocator analyzes these scalability estimates to adapt the parallelism of each program.  ...  Fig. 9. Thread allocations for the workload t-mg. 4.2.3 Variations in scalability and fine-grain control of parallelism.  ...
doi:10.1145/3158643 fatcat:dh4dljqmg5fjlb73y6p4ouyh7m

VELOC: VEry Low Overhead Checkpointing in the Age of Exascale [article]

Bogdan Nicolae and Adam Moody and Gregory Kosinovsky and Kathryn Mohror and Franck Cappello
2021 arXiv   pre-print
Checkpointing large amounts of related data concurrently to stable storage is a common I/O pattern of many HPC applications.  ...  However, such a pattern frequently creates I/O bottlenecks that cause poor scalability and performance.  ...  Furthermore, the separation between fine-grained declarations of critical memory regions and the actual checkpoint request opens several optimization opportunities compared with writing the critical data  ...
arXiv:2103.02131v1 fatcat:53tvxe2iszde5gwkr4dy6gxeeq

Adaptive Space-Shared Scheduling for Shared-Memory Parallel Programs [chapter]

Younghyun Cho, Surim Oh, Bernhard Egger
2017 Lecture Notes in Computer Science  
A simple performance model based solely on online profile data is used to characterize the performance scalability of applications.  ...  To properly assign the disjoint set of cores to simultaneously running parallel applications, the proposed scheme considers the performance characteristics of the executing (parallel) code section of all  ...  Also, in this work, we have only focused on coarse-grained scheduling issues and left the fine-grained task-to-core mapping to the Linux scheduler.  ... 
doi:10.1007/978-3-319-61756-5_9 fatcat:zi4nrki3d5f7fjsxxsre7at2vq

Putting The "Learning" In Machine Learning Processors: An Introduction To The Loihi Neuromorphic Research Chip

Mike Davies
2018 Zenodo  
By maintaining the same locality of information processing and integrated memory-compute architecture as the brain, Loihi promises to provide highly efficient and scalable learning performance for supervised  ...  This talk provides an overview of the Loihi architecture and preliminary results towards our vision of low power and real-time on-chip learning.  ...  Network configuration: synchronous design. Neuromorphic core: LIF neuron model, programmable learning, 128 KB synaptic memory, up to 1,024 neurons, asynchronous design. Mesh Operation: Fine-Grained  ...
doi:10.5281/zenodo.1313406 fatcat:jjnmtsopvveqdiu6zcfwcwyozi

Scalable fine-grained call path tracing

Nathan R. Tallent, John Mellor-Crummey, Michael Franco, Reed Landrum, Laksono Adhianto
2011 Proceedings of the international conference on Supercomputing - ICS '11  
Moreover, to obtain actionable performance feedback for modular parallel software systems, it is often necessary to collect and present fine-grained context-sensitive data, the very thing scalable tools  ...  This paper describes how to collect, analyze and present fine-grained call path traces for parallel programs.  ...  However, using fine-grained instrumentation to generate a fine-grained trace has not been shown to be scalable.  ...
doi:10.1145/1995896.1995908 dblp:conf/ics/TallentMFLA11 fatcat:fv3pz2ifovadrpg5gmyhxmtshe

Parallel Phase Model: A Programming Model for High-end Parallel Machines with Manycores

Ron Brightwell, Mike Heroux, Zhaofang Wen, Junfeng Wu
2009 2009 International Conference on Parallel Processing  
The programming abstraction will be suitable for expressing both fine-grained and coarse-grained parallelism.  ...  Several unstructured applications that inherently require high-volume random fine-grained data accesses have been implemented in PPM with very promising results.  ...
doi:10.1109/icpp.2009.69 dblp:conf/icpp/BrightwellHWW09 fatcat:cosq32nilrbv5bkom7usbczkbi
Showing results 1 — 15 out of 2,857 results