130 Hits in 5.1 sec

Transferable Graph Optimizers for ML Compilers [article]

Yanqi Zhou, Sudip Roy, Amirali Abdolrashidi, Daniel Wong, Peter Ma, Qiumin Xu, Hanxiao Liu, Phitchaya Mangpo Phothilimthana, Shen Wang, Anna Goldie, Azalia Mirhoseini, James Laudon
2021 arXiv   pre-print
over the prior state of the art with 15x faster convergence, on a device placement task evaluated in real systems.  ...  Existing learning-based approaches in the literature are sample-inefficient, tackle a single optimization problem, and do not generalize to unseen graphs, making them infeasible to deploy in practice  ...  Importantly, with the efficient end-to-end single-shot placement, GO-one has a 15x speedup in convergence time of the placement network over HDP.  ... 
arXiv:2010.12438v2 fatcat:ju26bxgmajbgfa4wtwvgrc6k2a

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems [article]

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp (+24 others)
2016 arXiv   pre-print
A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed  ...  The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research  ...  Without a doubt, the usability and functionality of TensorFlow have been greatly expanded by listening to their feedback. Many  ... 
arXiv:1603.04467v2 fatcat:v7vqnzquxrffrojafup72yggni

Machine Learning for Systems

Heiner Litz, Milad Hashemi
2020 IEEE Micro  
In "A Single-Shot Generalized Device Placement for Large Dataflow Graphs," a new  ...  As machine learning models and systems improve, there is a growing  ...  From a learning perspective, systems is a challenging application, where input features are often large and sparse, action spaces are gigantic, and generalization is a key attribute.  ... 
doi:10.1109/mm.2020.3016551 fatcat:7lbbknjtmjamha2ek5fnc2tbem

A Transferable Approach for Partitioning Machine Learning Models on Multi-Chip-Modules [article]

Xinfeng Xie, Prakash Prabhu, Ulysse Beaugnon, Phitchaya Mangpo Phothilimthana, Sudip Roy, Azalia Mirhoseini, Eugene Brevdo, James Laudon, Yanqi Zhou
2021 arXiv   pre-print
The architectural choices we make for the policy network allow us to generalize across different ML graphs.  ...  One such problem is the multi-chip partitioning problem where compilers determine the optimal partitioning and placement of operations in tensor computation graphs on chiplets in MCMs.  ...  ., 2021) is the first single-shot method that generates placement decisions for an entire graph and generalizes to unseen data.  ... 
arXiv:2112.04041v1 fatcat:2m64g7rdabevdpoc4xi6io6fa4

AutoSync: Learning to Synchronize for Data-Parallel Distributed Deep Learning

Hao Zhang, Yuan Li, Zhijie Deng, Xiaodan Liang, Lawrence Carin, Eric P. Xing
2020 Neural Information Processing Systems  
In this paper, we develop a model- and resource-dependent representation for synchronization, which unifies multiple synchronization aspects ranging from architecture, message partitioning, placement scheme  ...  Existing synchronization systems often only consider a single or a few synchronization aspects, and the burden of deciding the right synchronization strategy is then placed on the ML practitioners, who  ...  scenarios where single-shot model training happens often.  ... 
dblp:conf/nips/0025LDLCX20 fatcat:arubsyv6wvgd7cs75n3ywmd73e

A Model and Survey of Distributed Data-Intensive Systems [article]

Alessandro Margara, Gianpaolo Cugola, Nicolò Felicioni, Stefano Cilloni
2022 arXiv   pre-print
Data is a precious resource in today's society, and is generated at an unprecedented and constantly growing pace.  ...  The need to store, analyze, and make data promptly available to a multitude of users introduces formidable challenges in modern software platforms.  ...  Pregel [64] is a programming and execution model for computations on large-scale graph data structures.  ... 
arXiv:2203.10836v1 fatcat:xbg34nuzhndwjpv2asbbhtgvlu

Automatic Mediation of Privacy-Sensitive Resource Access in Smartphone Applications

Benjamin Livshits, Jaeyeon Jung
2013 USENIX Security Symposium  
Our approach scales well: once an app is represented in the form of a graph, the remaining static analysis takes under a second on average.  ...  We design and implement a graph-theoretic algorithm to place mediation prompts that protect every resource access, while avoiding repetitive prompting and prompting in background tasks or third-party libraries  ...  This approach is linear in the size of the graph, and is generally quite fast, even when there are hundreds of nodes reachable from a resource access.  ... 
dblp:conf/uss/LivshitsJ13 fatcat:gi3gpzhhtjdknercymncv3cnsq

Prioritizing Attention in Analytic Monitoring

Peter Bailis, Edward Gan, Kexin Rong, Sahaana Suri
2017 Conference on Innovative Data Systems Research  
As a result, users need analytics engines that can assist in prioritizing attention in this fast data that is too large for manual inspection.  ...  for fast data.  ...  Acknowledgments We thank the many members of the Stanford InfoLab, our collaborators at MIT and Waterloo, and the early adopters of the MacroBase prototype for providing feedback on and inspiration for  ... 
dblp:conf/cidr/BailisGRS17 fatcat:i4lixrxqybcmrag4lt7dsbfhqy

Designing applications for heterogeneous many-core architectures with the FlexTiles Platform

Benedikt Janssen, Fynn Schwiegelshohn, Martijn Koedam, Francois Duhem, Leonard Masing, Stephan Werner, Christophe Huriaux, Antoine Courtay, Emilie Wheatley, Kees Goossens, Fabrice Lemonnier, Philippe Millet (+3 others)
2015 2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)  
Therefore, the concept contains a dedicated automated tool-flow for creating both the hardware and the software, a simulation platform that can execute the same binaries as the FPGA prototype, and a virtualization layer to manage the final heterogeneous many-core architecture for run-time adaptability.  ...  A good example of this is that often, when porting large applications, a big 'one-shot' effort has to be made to shrink them down, only allowing verification afterwards.  ... 
doi:10.1109/samos.2015.7363683 dblp:conf/samos/JanssenSKDM0HCW15 fatcat:cfbyosvh2rbcppp57x6t63baqy

FASTER: Facilitating Analysis and Synthesis Technologies for Effective Reconfiguration

D. Pnevmatikatos, T. Becker, A. Brokalakis, K. Bruneel, G. Gaydadjiev, W. Luk, K. Papadimitriou, I. Papaefstathiou, O. Pell, C. Pilato, M. Robart, M.D. Santambrogio (+3 others)
2012 2012 15th Euromicro Conference on Digital System Design  
with general-purpose processors and acceleration modules implemented in the latest reconfigurable technology.  ...  This is a clear advantage over the more straightforward software component adaptivity. However, designing a changing hardware system is both challenging and time consuming.  ...  Each dataflow engine utilizes a large Xilinx Virtex-6 FPGA attached to up to 48GB of DDR3 DRAM. The FPGAs are connected to the CPUs via PCI Express.  ... 
doi:10.1109/dsd.2012.59 dblp:conf/dsd/PnevmatikatosBBBGLPPPPRSSST12 fatcat:etaunro6tfcw3oi6ml3ifidhf4

Park: An Open Platform for Learning-Augmented Computer Systems

Hongzi Mao, Parimarjan Negi, Akshay Narayan, Hanrui Wang, Jiacheng Yang, Haonan Wang, Ryan Marcus, Ravichandra Addanki, Mehrdad Khani Shirkoohi, Songtao He, Vikram Nathan, Frank Cangialosi (+5 others)
2019 Neural Information Processing Systems  
We present Park, a platform for researchers to experiment with Reinforcement Learning (RL) for computer systems.  ...  Using RL for improving the performance of systems has a lot of potential, but is also in many ways very different from, for example, using RL for games.  ...  We thank the anonymous NeurIPS reviewers for their constructive feedback.  ... 
dblp:conf/nips/MaoNN0YWMASHNCV19 fatcat:ns664s6rjnhe7ioiz56z57vkai

Memory-Optimized Software Synthesis from Dataflow Program Graphs with Large Size Data Samples

Hyunok Oh, Soonhoi Ha
2003 EURASIP Journal on Advances in Signal Processing  
The proposed algorithm reduces memory by 67% for a JPEG encoder and by 40% for an H.263 encoder compared with unshared versions, and by 22% compared with the previous sharing algorithm for the H.263 encoder.  ...  This paper addresses the problem of minimizing the buffer memory requirement for such applications in embedded software synthesis from graphical dataflow programs based on the synchronous dataflow (SDF  ... 
doi:10.1155/s1110865703212130 fatcat:lsl4wejmhvatzoocizsahj3ubm

Automap: Towards Ergonomic Automated Parallelism for ML Models [article]

Michael Schaarschmidt and Dominik Grewe and Dimitrios Vytiniotis and Adam Paszke and Georg Stefan Schmid and Tamara Norman and James Molloy and Jonathan Godwin and Norman Alexander Rink and Vinod Nair and Dan Belov
2021 arXiv   pre-print
The rapid rise in demand for training large neural network architectures has brought into focus the need for partitioning strategies, for example by using data, model, or pipeline parallelism.  ...  Through a combination of inductive tactics and search in a platform-independent partitioning IR, automap can recover expert partitioning strategies such as Megatron sharding for transformer layers.  ...  GSPMD: general and scalable parallelization for ML computation graphs. CoRR, abs/2105.04663, 2021.  ... 
arXiv:2112.02958v1 fatcat:tlda37oxgjeezggohojvh4sdni

An Overview of Efficient Interconnection Networks for Deep Neural Network Accelerators

Seyed Morteza Nabavinejad, Mohammad Baharloo, Kun-Chih Chen, Maurizio Palesi, Tim Kogel, Masoumeh Ebrahimi
2020 IEEE Journal on Emerging and Selected Topics in Circuits and Systems  
Currently, a large body of research aims to find an efficient on-chip interconnection to achieve low-power and high-bandwidth DNN computing.  ...  As a result, efficient interconnection and data movement mechanisms for future on-chip artificial intelligence (AI) accelerators are worthy of study.  ...  However, the bandwidth deficiency between PEs and memory in a single GPU or the inter-GPU communication in multi-GPU configurations (which is widely used for training large and complex NNs) has always  ... 
doi:10.1109/jetcas.2020.3022920 fatcat:idqitgwnrnegbd4dhrly3xsxbi

Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications

Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Haotian Tang, Hanrui Wang, Ligeng Zhu, Song Han
2022 ACM Transactions on Design Automation of Electronic Systems  
To reduce the large design cost of these manual solutions, we discuss the AutoML framework for each of them, such as neural architecture search (NAS) and automated pruning and quantization.  ...  Apart from general acceleration techniques, we also showcase several task-specific accelerations for point cloud, video, and natural language processing by exploiting their spatial sparsity and temporal  ...  [140] combine the FPGA device placement with neural architecture design by conducting device placement optimization and NAS iteratively: the best pipelined FPGA configuration is identified for the proposed  ... 
doi:10.1145/3486618 fatcat:h6xwv2slo5eklift2fl24usine
Showing results 1 — 15 out of 130