
Exploring the limits of Concurrency in ML Training on Google TPUs [article]

Sameer Kumar and James Bradbury and Cliff Young and Yu Emma Wang and Anselm Levskaya and Blake Hechtman and Dehao Chen and HyoukJoong Lee and Mehmet Deveci and Naveen Kumar and Pankaj Kanwar and Shibo Wang and Skye Wanderman-Milne and Steve Lacy and Tao Wang and Tayo Oguntebi and Yazhou Zu and Yuanzhong Xu and Andy Swing
2021 arXiv   pre-print
We also present performance results from the recent Google submission to the MLPerf-v0.7 benchmark contest, achieving record training times from 16 to 28 seconds in four MLPerf models on the Google TPU-v3  ...  This paper presents techniques to scale ML models on the Google TPU Multipod, a mesh with 4096 TPU-v3 chips.  ...  In order to explore the limits of concurrency in the MLPerf models we assembled a TPU-v3 multipod with 4096 chips.  ... 
arXiv:2011.03641v3
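
The abstract describes mapping models onto a mesh of TPU chips. As a rough illustration of the general idea (not the paper's actual setup), the following JAX sketch shards a batch across a named device mesh and replicates the weights; the mesh shape, axis name, and model are placeholders:

    # Illustrative only: shard a batch across whatever devices are visible.
    import jax
    import jax.numpy as jnp
    from jax.experimental import mesh_utils
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    devices = mesh_utils.create_device_mesh((jax.device_count(),))
    mesh = Mesh(devices, axis_names=("data",))

    batch = jnp.ones((128, 512))
    weights = jnp.ones((512, 512))
    batch = jax.device_put(batch, NamedSharding(mesh, P("data", None)))    # split rows
    weights = jax.device_put(weights, NamedSharding(mesh, P(None, None)))  # replicate

    @jax.jit
    def forward(x, w):
        # XLA inserts any cross-device communication the sharding requires.
        return jnp.tanh(x @ w)

    y = forward(batch, weights)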

Pathways: Asynchronous Distributed Dataflow for ML [article]

Paul Barham and Aakanksha Chowdhery and Jeff Dean and Sanjay Ghemawat and Steven Hand and Dan Hurt and Michael Isard and Hyeontaek Lim and Ruoming Pang and Sudip Roy and Brennan Saeta and Parker Schuh and Ryan Sepassi and Laurent El Shafey and Chandramohan A. Thekkath and Yonghui Wu
2022 arXiv   pre-print
Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state of the art performance for current models.  ...  Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane.  ...  ACKNOWLEDGEMENTS We gratefully acknowledge contributions to the design and implementation of the PATHWAYS system from many colleagues at Google, and from members of the wider machine learning community  ... 
arXiv:2203.12533v1
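
The phrase "the control plane execute[s] in parallel despite dependencies in the data plane" is the key design point. A toy Python sketch of that general idea (emphatically not the PATHWAYS implementation): the client enqueues a dependent operation immediately, holding only a future, while worker threads resolve the actual values in the background.

    # Toy illustration of asynchronous dispatch with futures; all names here
    # are invented for the example.
    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor(max_workers=4)

    def compute(a, b):          # stand-in for a data-plane computation
        return a * b

    f1 = pool.submit(compute, 3, 4)
    # The control plane does not block on f1: it enqueues the dependent op
    # right away, passing the future instead of the value.
    f2 = pool.submit(lambda fut: compute(fut.result(), 2), f1)
    print(f2.result())          # 24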

Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models [article]

Shang Wang, Peiming Yang, Yuxuan Zheng, Xin Li, Gennady Pekhimenko
2021 arXiv   pre-print
To show the generality of our solution, we apply HFTA to the training of six DL models on state-of-the-art accelerators (GPUs and TPUs).  ...  Driven by the tremendous effort in researching novel deep learning (DL) algorithms, the training cost of developing new models has increased staggeringly in recent years.  ...  We want to thank Google for TPU credits and early access to the GCP A2 Alpha version instances.  ... 
arXiv:2102.02344v3
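
HFTA's core idea, per the abstract, is to fuse the training of many small, similarly-shaped models into one batched computation so they share accelerator kernels. A minimal sketch of that general idea using jax.vmap (HFTA itself is a separate implementation; the shapes and toy loss are illustrative):

    import jax
    import jax.numpy as jnp

    n_models, d_in, d_out = 6, 32, 1
    keys = jax.random.split(jax.random.PRNGKey(0), n_models)
    # One weight matrix per model, stacked along a leading "model" axis.
    W = jnp.stack([jax.random.normal(k, (d_in, d_out)) for k in keys])

    def loss(w, x, y):                    # loss of a single model
        return jnp.mean((x @ w - y) ** 2)

    # vmap over the model axis: one fused launch computes all six gradients.
    fused_grads = jax.vmap(jax.grad(loss), in_axes=(0, None, None))

    x = jnp.ones((64, d_in))
    y = jnp.zeros((64, d_out))
    W = W - 0.01 * fused_grads(W, x, y)   # one SGD step for all models at once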

Molecular Dynamics Simulations on Cloud Computing and Machine Learning Platforms [article]

Prateek Sharma, Vikram Jadhao
2021 arXiv   pre-print
However, we are seeing a paradigm shift in the computational structure, design, and requirements of these applications.  ...  Finally, we present some low-hanging fruit and long-term challenges in cloud resource management, and the integration of molecular dynamics simulations into ML platforms (such as TensorFlow).  ...  that can be run on ML system backends such as Google Colab (compared to the cumbersome configuration of simulations on different HPC systems). • Integration with data-driven approaches: Data operations in ML  ... 
arXiv:2111.06466v1
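
One reason MD maps well onto ML platforms, as the abstract suggests, is that a simulation step is just array math plus derivatives. A small illustrative sketch (arbitrary parameters, not from the paper): a pairwise Lennard-Jones energy written in JAX yields forces by automatic differentiation and runs unchanged on CPU, GPU, or TPU backends.

    import jax
    import jax.numpy as jnp

    def lj_energy(pos, eps=1.0, sig=1.0):
        # Pairwise Lennard-Jones potential summed over all particle pairs.
        d = pos[:, None, :] - pos[None, :, :]
        r2 = (d ** 2).sum(-1) + jnp.eye(pos.shape[0])  # pad diagonal; masked below
        inv6 = (sig ** 2 / r2) ** 3
        return 4.0 * eps * jnp.sum(jnp.triu(inv6 * inv6 - inv6, k=1))

    forces = jax.grad(lambda p: -lj_energy(p))         # F = -dU/dx via autodiff
    pos = jax.random.normal(jax.random.PRNGKey(0), (8, 3))
    pos = pos + 1e-4 * forces(pos)                     # one crude relaxation step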

A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms

Yu Wang, Gu-Yeon Wei, David Brooks
2020 Conference on Machine Learning and Systems  
We demonstrate its utility by comparing two generations of specialized platforms (Google's Cloud TPU v2/v3), three heterogeneous platforms (Google TPU, Nvidia GPU, and Intel CPU), and specialized software  ...  Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware and software specialization to improve performance.  ...  ACKNOWLEDGEMENT This work was supported in part by Google's TensorFlow Research Cloud (TFRC) program, NSF Grant CCF-1533737, and the Center for Applications Driving Architectures (ADA), one of six centers  ... 
dblp:conf/mlsys/WangW020

Training and Serving ML workloads with Kubeflow at CERN

Dejan Golubovic, Ricardo Rocha, C. Biscarat, S. Campana, B. Hegner, S. Roiser, C.I. Rovelli, G.A. Stewart
2021 EPJ Web of Conferences  
We describe a new service available at CERN, based on Kubeflow and managing the full ML lifecycle: data preparation and interactive analysis, large scale distributed model training and model serving.  ...  Machine Learning (ML) has been growing in popularity in multiple areas and groups at CERN, covering fast simulation, tracking, anomaly detection, among many others.  ...  Figure 6: (left) The impact of the total number of GPUs assigned to workers on the time to process one epoch of training. (right) Cost of running one epoch on the Google Cloud Platform.  ... 
doi:10.1051/epjconf/202125102067

Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks [article]

Amirali Boroumand, Saugata Ghose, Berkin Akin, Ravi Narayanaswami, Geraldo F. Oliveira, Xiaoyu Ma, Eric Shiu, Onur Mutlu
2021 arXiv   pre-print
To understand how edge ML accelerators perform, we characterize the performance of a commercial Google Edge TPU, using 24 Google edge NN models (which span a wide range of NN model types) and analyzing  ...  Our characterization reveals that the one-size-fits-all, monolithic design of the Edge TPU ignores the high degree of heterogeneity both across different NN models and across different NN layers within  ...  We acknowledge the generous gifts of our industrial partners, especially Google, Huawei, Intel, Microsoft, and VMware. This research was partially supported by the Semiconductor Research Corporation.  ... 
arXiv:2109.14320v1

GPTPU: Accelerating Applications using Edge Tensor Processing Units [article]

Kuan-Chieh Hsu, Hung-Wei Tseng
2021 arXiv   pre-print
NN accelerators share the idea of providing native hardware support for operations on multidimensional tensor data.  ...  GPTPU includes a powerful programming interface with efficient runtime system-level support -- similar to that of CUDA/OpenCL in GPGPU computing -- to bridge the gap between application demands and mismatched  ...  ACKNOWLEDGMENTS The authors would like to thank the anonymous reviewers for their helpful comments. This work was sponsored by an National Science Foundation (NSF) award, 2007124.  ... 
arXiv:2107.05473v2

No DNN Left Behind: Improving Inference in the Cloud with Multi-Tenancy [article]

Amit Samanta and Suhas Shrinivasan and Antoine Kaufmann and Jonathan Mace
2019 arXiv   pre-print
With the rise of machine learning, inference on deep neural networks (DNNs) has become a core building block on the critical path for many cloud applications.  ...  A shared system enables cost-efficient operation with consistent performance across the full spectrum of workloads.  ...  Aside from dedicated VMs, cloud customers can alternatively use a hosted system to serve their model, such as Google ML Engine [5] and Microsoft Azure ML [3]. In the rest of the paper we use VMs  ... 
arXiv:1901.06887v2

DiVa: An Accelerator for Differentially Private Machine Learning [article]

Beomsik Park, Ranggi Hwang, Dongho Yoon, Yoonhyuk Choi, Minsoo Rhu
2022 arXiv   pre-print
In this work, we conduct a detailed workload characterization on a state-of-the-art differentially private ML training algorithm named DP-SGD.  ...  The widespread deployment of machine learning (ML) is raising serious concerns on protecting the privacy of users who contributed to the collection of training data.  ...  ACKNOWLEDGMENT This research is partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government(MSIT) (NRF-2021R1A2C2091753), the Engineering Research Center Program  ... 
arXiv:2208.12392v1
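
For context, DP-SGD (the algorithm being characterized) differs from plain SGD in two compute-heavy ways: it needs a gradient per example, each clipped to a norm bound, before Gaussian noise is added. A minimal sketch of that step (toy loss; the clip norm C and noise multiplier sigma are illustrative):

    import jax
    import jax.numpy as jnp

    def loss(w, x, y):                     # single-example loss
        return (jnp.dot(x, w) - y) ** 2

    def dp_sgd_step(w, xs, ys, key, C=1.0, sigma=1.1, lr=0.1):
        # Per-example gradients: the part that makes DP-SGD expensive.
        grads = jax.vmap(jax.grad(loss), in_axes=(None, 0, 0))(w, xs, ys)
        norms = jnp.linalg.norm(grads, axis=1, keepdims=True)
        clipped = grads * jnp.minimum(1.0, C / (norms + 1e-12))  # clip each to C
        noise = sigma * C * jax.random.normal(key, w.shape)
        return w - lr * (clipped.sum(0) + noise) / xs.shape[0]

    key = jax.random.PRNGKey(0)
    w = jnp.zeros(16)
    xs = jax.random.normal(key, (64, 16))
    ys = jnp.ones(64)
    w = dp_sgd_step(w, xs, ys, key)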

MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving

Chengliang Zhang, Minchen Yu, Wei Wang, Feng Yan
2019 USENIX Annual Technical Conference  
Advances in Machine Learning (ML) have sparked a growing demand for ML-as-a-Service: developers train ML models and publish them in the cloud as online services to provide low-latency inference at scale  ...  We evaluated the performance of MArk using several state-of-the-art ML models trained in popular frameworks including TensorFlow, MXNet, and Keras.  ...  Acknowledgement This work was supported in part by RGC ECS grant 26213818, NSF grant CCF-1756013, and IIS-1838024 (using resources provided by AWS as part of the NSF BIGDATA program).  ... 
dblp:conf/usenix/ZhangYWY19

Concurrent Adversarial Learning for Large-Batch Training [article]

Yong Liu, Xiangning Chen, Minhao Cheng, Cho-Jui Hsieh, Yang You
2022 arXiv   pre-print
Large-batch training has become a commonly used technique when training neural networks with a large number of GPU/TPU processors.  ...  In this paper, we propose to use adversarial learning to increase the batch size in large-batch training.  ...  ACKNOWLEDGEMENTS We thank Google TFRC for supporting us to get access to the Cloud TPUs.  ... 
arXiv:2106.00221v2
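
The abstract's proposal is to pair large-batch training with adversarial perturbations of the inputs. A bare-bones sketch of the generic adversarial-training step (FGSM-style; the concurrent scheduling that gives the paper its name is not reproduced here, and the loss and epsilon are toy choices):

    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        return jnp.mean((x @ w - y) ** 2)

    def adversarial_step(w, x, y, eps=0.05, lr=0.1):
        # Perturb inputs along the gradient sign, then train on the result.
        x_adv = x + eps * jnp.sign(jax.grad(loss, argnums=1)(w, x, y))
        return w - lr * jax.grad(loss)(w, x_adv, y)

    w = jnp.zeros((16, 4))
    x = jnp.ones((256, 16))          # stand-in for one large batch shard
    y = jnp.zeros((256, 4))
    w = adversarial_step(w, x, y)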

Benchmarking TPU, GPU, and CPU Platforms for Deep Learning [article]

Yu Emma Wang, Gu-Yeon Wei, David Brooks
2019 arXiv   pre-print
We also provide a thorough comparison of the platforms and find that each has unique strengths for some types of models.  ...  Finally, we quantify the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms.  ...  Data parallelism is implemented on the TPU, where one batch of training data is split evenly and sent to the 8 cores on the TPU board.  ... 
arXiv:1907.10701v4
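
The last snippet describes the classic single-board data-parallel recipe: split one batch evenly over the 8 TPU-v3 cores and all-reduce the gradients. A minimal JAX sketch of that pattern (jax.pmap with a toy model; on a TPU-v3 board jax.local_device_count() reports 8, elsewhere it may be 1):

    from functools import partial
    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        return jnp.mean((x @ w - y) ** 2)

    @partial(jax.pmap, axis_name="cores")
    def parallel_step(w, x, y):
        g = jax.grad(loss)(w, x, y)
        g = jax.lax.pmean(g, axis_name="cores")   # all-reduce across cores
        return w - 0.1 * g

    n = jax.local_device_count()                  # 8 on a TPU-v3 board
    W = jnp.stack([jnp.zeros((16, 4))] * n)       # replicate the weights
    X = jnp.ones((n, 32, 16))                     # batch split evenly, n ways
    Y = jnp.zeros((n, 32, 4))
    W = parallel_step(W, X, Y)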

A Survey of Neural Network Hardware Accelerators in Machine Learning

Fatimah Jasem, Manar AlSaraf
2021 Machine Learning and Applications An International Journal  
However, due to the exponential growth in technology constraints (especially in terms of energy), which could lead to heterogeneous multicores, and the increasing number of defects, the strategy of defect-tolerant  ...  The use of Machine Learning in Artificial Intelligence is the inspiration that shaped technology as it is today. Machine Learning has the power to greatly simplify our lives.  ...  In May 2016, Google announced the Tensor Processing Unit (TPU), a custom ASIC created specifically for ML.  ... 
doi:10.5121/mlaij.2021.8402

The OoO VLIW JIT Compiler for GPU Inference [article]

Paras Jain, Xiangxi Mo, Ajay Jain, Alexey Tumanov, Joseph E. Gonzalez, Ion Stoica
2019 arXiv   pre-print
Current trends in Machine Learning (ML) inference on hardware-accelerated devices (e.g., GPUs, TPUs) point to alarmingly low utilization.  ...  We quantify the inefficiencies of space-only and time-only multiplexing alternatives and demonstrate an achievable 7.7x opportunity gap through spatial coalescing.  ...  Amazon estimates that 90% of production ML infrastructure costs are for inference, not training [27].  ... 
arXiv:1901.10008v2