Ansor: Generating High-Performance Tensor Programs for Deep Learning [article]

Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, Ion Stoica
2020 arXiv   pre-print
We present Ansor, a tensor program generation framework for deep learning applications.  ...  High-performance tensor programs are crucial to guarantee efficient execution of deep neural networks.  ...  Acknowledgement We would like to thank Weizhao Xian, Tianqi Chen, Frank Luan, anonymous reviewers, and our shepherd, Derek Murray, for their insightful feedback.  ... 
arXiv:2006.06762v4 fatcat:as6rrj2bvjcwtmkjremrrfkqhq

Reusing Auto-Schedules for Efficient DNN Compilation [article]

Perry Gibson, José Cano
2022 arXiv   pre-print
Auto-scheduling is a process where a search algorithm automatically explores candidate schedules (program transformations) for a given tensor program on a given hardware platform to improve its performance  ...  However, this can be a very time-consuming process, depending on the complexity of the tensor program and the capacity of the target device, with often many thousands of program variants being explored.  ...  TVM builds on the ideas of Halide to bring a usable schedule compiler for deep learning.  ... 
arXiv:2201.05587v1 fatcat:vo66466dkrhwrkmanqyu2r3kbi

Tuna: A Static Analysis Approach to Optimizing Deep Neural Networks [article]

Yao Wang, Xingyu Zhou, Yanming Wang, Rui Li, Yong Wu, Vin Sharma
2021 arXiv   pre-print
We use static analysis of the relative performance of tensor operations to optimize the deep learning program.  ...  The optimization of tensor operations such as convolutions and matrix multiplications is the key to improving the performance of deep neural networks.  ...  Current machine learning compilers use two common ways to generate high performance deep learning code for multiple target hardware.  ... 
arXiv:2104.14641v3 fatcat:yea2apeyjbf65i22omcioejoky

LoopStack: a Lightweight Tensor Algebra Compiler Stack [article]

Bram Wasti, José Pablo Cambronero, Benoit Steiner, Hugh Leather, Aleksandar Zlateski
2022 arXiv   pre-print
We present LoopStack, a domain specific compiler stack for tensor operations, composed of a frontend, LoopTool, and an efficient optimizing code generator, LoopNest.  ...  exceeds the performance of state-of-the-art machine learning frameworks in both cases.  ...  Halide [42] and TVM [7] are compilers for general computational pipelines (with operators commonly used in image and tensor processing) and deep learning, respectively.  ... 
arXiv:2205.00618v1 fatcat:tk7gbuxl6feczkx36khx5s3bmu

MetaTune: Meta-Learning Based Cost Model for Fast and Efficient Auto-tuning Frameworks [article]

Jaehun Ryu, Hyojin Sung
2021 arXiv   pre-print
Deep learning compiler frameworks are gaining ground as a more portable back-end for deep learning applications on increasingly diverse hardware.  ...  This paper proposes MetaTune, a meta-learning based cost model that more quickly and accurately predicts the performance of optimized codes with pre-trained model parameters.  ...  Auto-scheduling for tensor programs FlexTensor (Zheng et al., 2020b) aims to generate dynamic schedules for tensor operations and perform automatic optimization with them on heterogeneous systems.  ... 
arXiv:2102.04199v2 fatcat:5jxb6kizvfbbpaeynljfgnfakm

Learning from distinctive candidates to optimize reduced-precision convolution program on tensor cores [article]

Junkyeong Choi, Hyucksung Kwon, Woongkyu Lee, Jungwook Choi, Jieun Lim
2022 arXiv   pre-print
However, it is challenging to achieve optimal performance since the best scheduling of MMA instructions varies for different convolution sizes.  ...  Convolution is one of the fundamental operations of deep neural networks with demanding matrix computation.  ...  There also exists an approach for automatically learning a performance model for Tensor Processing Unit (TPU) (Kaufman et al., 2021).  ... 
arXiv:2202.06819v2 fatcat:nax323vvvjhype3d5fjhhqfdqm

Machine Learning for CUDA+MPI Design Rules [article]

Carl Pearson, Aurya Javeed, Karen Devine
2022 arXiv   pre-print
In our approach, a directed acyclic graph of CUDA and MPI operations defines the design space for the program.  ...  A sequence-to-vector transformation defines features for each explored implementation, and each implementation is assigned a class label according to its relative performance.  ...  A neural network is used to estimate the performance of the proposed networks during MCTS. Zheng et al. [7] propose Ansor for automatically generating tensor programs.  ... 
arXiv:2203.02530v2 fatcat:anvgdyvdljhiplzsyclo2htv2m

Joint Program and Layout Transformations to enable Convolutional Operators on Specialized Hardware based on Constraint Programming [article]

Dennis Rieber, Axel Acosta, Holger Fröning
2021 arXiv   pre-print
The success of Deep Artificial Neural Networks (DNNs) in many domains created a rich body of research concerned with hardware accelerators for compute-intensive DNN operators.  ...  Further, we show that dynamically determining the data layout based on intrinsic and workload is beneficial for hardware utilization and performance.  ...  Several tools in the Deep Learning community leverage this representation to generate efficient DNNs kernels.  ... 
arXiv:2104.04731v4 fatcat:g4uzwaivtjhajjwaimdct7y5wu

FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads [article]

Zhen Zheng, Pengzhan Zhao, Guoping Long, Feiwen Zhu, Kai Zhu, Wenyi Zhao, Lansong Diao, Jun Yang, Wei Lin
2021 arXiv   pre-print
We show in this work that memory intensive computations can result in severe performance problems due to off-chip memory access and CPU-GPU context switch overheads in a wide range of deep learning models  ...  For this problem, current just-in-time (JIT) kernel fusion and code generation techniques have limitations, such as rough fusion plan exploration strategies and limited code generation ability.  ...  We show that memory intensive ops are vital to end-to-end performance of various deep learning models.  ... 
arXiv:2009.10924v2 fatcat:6lrkhmesljgfrlysahmnp64nhe

Joint Program and Layout Transformations to Enable Convolutional Operators on Specialized Hardware Based on Constraint Programming

Dennis Rieber, Axel Acosta, Holger Fröning
2022 ACM Transactions on Architecture and Code Optimization (TACO)  
The success of Deep Artificial Neural Networks (DNNs) in many domains created a rich body of research concerned with hardware accelerators for compute-intensive DNN operators.  ...  Further, we show that dynamically determining the data layout based on intrinsic and workload is beneficial for hardware utilization and performance.  ...  is designed for Deep Learning (DL) applications and programmed with TVM.  ... 
doi:10.1145/3487922 fatcat:gnuvco7rffcdzcuirotonjnssi

DynaComm: Accelerating Distributed CNN Training between Edges and Clouds through Dynamic Communication Scheduling [article]

Shangming Cai, Dongsheng Wang, Haixia Wang, Yongqiang Lyu, Guangquan Xu, Xi Zheng, Athanasios V. Vasilakos
2021 arXiv   pre-print
To reduce uploading bandwidth and address privacy concerns, deep learning at the network edge has been an emerging topic.  ...  Typically, edge devices collaboratively train a shared model using real-time generated data through the Parameter Server framework.  ...  of deep learning on edge devices.  ... 
arXiv:2101.07968v1 fatcat:x4i5orbq5bdhlasfhaycsawlay

IOS: Inter-Operator Scheduler for CNN Acceleration [article]

Yaoyao Ding, Ligeng Zhu, Zhihao Jia, Gennady Pekhimenko, Song Han
2021 arXiv   pre-print
To accelerate CNN inference, existing deep learning frameworks focus on optimizing intra-operator parallelization.  ...  However, a single operator can no longer fully utilize the available parallelism given the rapid advances in high-performance hardware, resulting in a large gap between the peak performance and the real  ...  Frameworks such as TVM and Ansor (Zheng et al., 2020) search the tensor program schedule for each kernel.  ... 
arXiv:2011.01302v2 fatcat:cisjar7tjrcszgcjer2rxdjhse

D1.1 - State of the Art Analysis

Danilo Ardagna
2021 Zenodo  
Then, the deliverable provides a background on AI applications design, also considering some advanced design trends (e.g., Network Architecture Search, Federated Learning, Deep Neural Networks partitioning  ...  In the last part of the deliverable, we report an overview of the performance modelling solutions, security, and privacy problems for AI applications in edge environments.  ...  General Purpose Deep Learning Frameworks: Keras is a high-level API specifically developed to enable fast experimentation.  ... 
doi:10.5281/zenodo.6372377 fatcat:f6ldfuwivbcltew4smiiwphfty

Moses: Efficient Exploitation of Cross-device Transferable Features for Tensor Program Optimization [article]

Zhihe Zhao, Xian Shuai, Yang Bai, Neiwen Ling, Nan Guan, Zhenyu Yan, Guoliang Xing
To generate tensor programs efficiently, a key component of DNN compilers is the cost model that can predict the performance of each configuration on specific devices.  ...  Achieving efficient execution of machine learning models has attracted significant attention recently.  ...  This dataset includes randomly generated tensor programs for widely used deep learning models. Step 2.  ... 
doi:10.48550/arxiv.2201.05752 fatcat:kmgz6dfumrervdgk3dwchf6ogm