
Minimizing the cost of iterative compilation with active learning

William F. Ogilvie, Pavlos Petoumenos, Zheng Wang, Hugh Leather
2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)  
Balaprakash, of Argonne National Laboratory, for his kind help in providing us the initial data for our research.  ...  Acknowledgments: This work was partly supported by the UK Engineering and Physical Sciences Research Council (EPSRC) under grants EP/L000055/1 (ALEA), EP/M01567X/1 (SANDeRs), EP/M015823/1, and EP/M015793  ...  No existing work has used sequential analysis and active learning to reduce the overhead of iterative compilation.  ... 
doi:10.1109/cgo.2017.7863744 fatcat:uwbdczsdevgiplgbwmeaxpdifu
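A minimal sketch of the active-learning loop this entry describes: a crude acquisition score picks the configuration we know least about, and only that one is compiled and measured each round. The flag space and the measurement function below are synthetic stand-ins, not the paper's setup.

```python
import random

random.seed(0)

def measure(flags):
    """Synthetic stand-in for an expensive compile-and-run measurement."""
    t = 10.0 - 2.0 * (flags[0] and flags[3]) - sum(flags[1:3])
    return t + random.gauss(0, 0.05)

# Candidate space: on/off vectors for 8 hypothetical optimization flags.
candidates = [tuple((n >> i) & 1 for i in range(8)) for n in range(256)]
labeled = {}  # flag vector -> measured runtime

def uncertainty(flags):
    """Hamming distance to the nearest measured point: a crude acquisition score."""
    return min(sum(a != b for a, b in zip(l, flags)) for l in labeled)

for flags in random.sample(candidates, 4):   # small random seed set
    labeled[flags] = measure(flags)

for _ in range(20):                          # active-learning rounds
    unlabeled = [c for c in candidates if c not in labeled]
    pick = max(unlabeled, key=uncertainty)   # most informative candidate
    labeled[pick] = measure(pick)

print(min(labeled, key=labeled.get))         # best flag vector found so far
```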

Optimizing DNN Compilation for Distributed Training With Joint OP and Tensor Fusion

Xiaodong Yi, Shiwei Zhang, Lansong Diao, Chuan Wu, Zhen Zheng, Shiqing Fan, Siyu Wang, Jun Yang, Wei Lin
2022 IEEE Transactions on Parallel and Distributed Systems  
This paper proposes DisCo, an automatic deep learning compilation module for data-parallel distributed training.  ...  A backtracking search algorithm is driven by the simulator, navigating efficiently in the large strategy space to identify good operator/tensor fusion strategies that minimize distributed training time  ...  The goal is to minimize per-iteration training time of the DNN model, i.e., end-to-end execution time of the HLO module in the distributed setting (including execution time of computation ops and AllReduce  ... 
doi:10.1109/tpds.2022.3201531 fatcat:ejikycvxnjfhdgcrr6fvekxnze
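The backtracking-search-over-a-simulator idea reads roughly as follows. This toy is not DisCo's algorithm (the op chain, cost constants, and simulator are invented here), but it shows how a simulated time both scores complete fusion strategies and prunes partial ones.

```python
OPS = ["matmul", "bias", "relu", "matmul", "bias", "relu"]
KERNEL_LAUNCH = 1.0   # assumed per-group launch overhead
FUSE_PENALTY = 0.2    # assumed cost growth for oversized fused groups

def simulate(groups):
    """Simulator stand-in: launch overhead per group plus a size penalty."""
    return sum(KERNEL_LAUNCH + FUSE_PENALTY * (len(g) - 1) ** 2 for g in groups)

best = {"time": float("inf"), "groups": None}

def backtrack(i, groups):
    cost = simulate(groups)
    if cost >= best["time"]:      # prune: extending a strategy only adds cost
        return
    if i == len(OPS):
        best["time"], best["groups"] = cost, [list(g) for g in groups]
        return
    if groups:                    # fuse op i into the current group...
        groups[-1].append(OPS[i])
        backtrack(i + 1, groups)
        groups[-1].pop()
    groups.append([OPS[i]])       # ...or start a new group with it
    backtrack(i + 1, groups)
    groups.pop()

backtrack(0, [])
print(best["time"], best["groups"])
```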

Practical Design Space Exploration [article]

Luigi Nardi and David Koeplinger and Kunle Olukotun
2019 arXiv   pre-print
We apply and evaluate the new methodology to the automatic static tuning of hardware accelerators within the recently introduced Spatial programming language, with minimization of design run-time and compute  ...  Our results show that HyperMapper 2.0 provides better Pareto fronts compared to state-of-the-art baselines, with better or competitive hypervolume indicator and with 8x improvement in sampling budget for  ...  This process is repeated over a number of iterations forming the active learning loop.  ... 
arXiv:1810.05236v3 fatcat:65la5kczxrbtvchokqp6nqivuq
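The multi-objective side of this is Pareto-front bookkeeping: a design stays on the front if nothing else beats it on every objective at once. A small sketch, assuming two minimized objectives (say design run-time and logic utilization) and made-up design points:

```python
def pareto_front(points):
    """Keep points not dominated by any other (both objectives minimized)."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]

# (runtime_ms, utilization) pairs for hypothetical accelerator designs
designs = [(8.2, 0.41), (7.9, 0.55), (9.1, 0.30), (7.9, 0.41), (10.0, 0.28)]
print(pareto_front(designs))   # -> [(9.1, 0.3), (7.9, 0.41), (10.0, 0.28)]
```

An active-learning DSE loop of the kind described would re-measure promising designs, refit its model, and update this front on each iteration.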

Teaching-Learning based Task Scheduling Optimization in Cloud Computing Environments

2019 International Journal of Recent Technology and Engineering  
The proposed algorithm finds the best solution by minimizing the execution time and response time while maximizing the throughput of all resources to complete the assigned tasks.  ...  In this paper, a new and efficient evolutionary algorithm named the teaching-learning based algorithm has been implemented for the first time to solve the task scheduling problem in cloud environments.  ...  Fig. 5: Fitness variation with TLBO iterations of Case-1. Fig. 6: Compilation of TLBO algorithm in MATLAB cloud environment of Case-2. Fig. 7: Fitness variation with TLBO iterations of Case-2. Table  ... 
doi:10.35940/ijrte.b2672.078219 fatcat:bo5f4h6usnbzjbdanhtun7pawi
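For reference, the teacher phase and learner phase of TLBO have this shape. The cloud-scheduling fitness (execution time, response time, throughput) is replaced below by a toy numeric objective, so only the algorithm's structure matches the paper.

```python
import random

random.seed(1)

def fitness(x):
    """Toy objective standing in for the scheduling fitness (lower is better)."""
    return sum(v * v for v in x)

DIM, POP, ITERS = 5, 10, 50
pop = [[random.uniform(-5, 5) for _ in range(DIM)] for _ in range(POP)]

for _ in range(ITERS):
    teacher = min(pop, key=fitness)
    mean = [sum(x[d] for x in pop) / POP for d in range(DIM)]
    for i, x in enumerate(pop):
        # Teacher phase: move toward the teacher, away from the class mean.
        tf = random.choice([1, 2])   # teaching factor
        cand = [x[d] + random.random() * (teacher[d] - tf * mean[d])
                for d in range(DIM)]
        if fitness(cand) < fitness(x):
            pop[i] = x = cand
        # Learner phase: learn from (or move away from) a random classmate.
        other = pop[random.randrange(POP)]
        sign = 1 if fitness(other) < fitness(x) else -1
        cand = [x[d] + random.random() * sign * (other[d] - x[d])
                for d in range(DIM)]
        if fitness(cand) < fitness(x):
            pop[i] = cand

print(round(fitness(min(pop, key=fitness)), 6))
```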

Declarative recursive computation on an RDBMS

Dimitrije Jankov, Shangyu Luo, Binhang Yuan, Zhuhua Cai, Jia Zou, Chris Jermaine, Zekai J. Gao
2019 Proceedings of the VLDB Endowment  
A number of popular systems, most notably Google's TensorFlow, have been implemented from the ground up to support machine learning tasks.  ...  We consider how to make a very small set of changes to a modern relational database management system (RDBMS) to make it suitable for distributed learning computations.  ...  We unroll 60 iterations of the learning and compare the per-iteration running time using the full cutting algorithm along with the cost model of Section 6.3 with a monolithic execution of the entire, unrolled  ... 
doi:10.14778/3317315.3317323 fatcat:2r5it5kfbzfwfphrr7mev2gswm
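The flavor of pushing unrolled learning iterations into a relational engine can be shown with sqlite3: each iteration is one SQL aggregation that computes a gradient. This is a drastic simplification of the paper's system (the table, query, and learning task are invented), but the 60 unrolled iterations echo the experiment quoted above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE points (x REAL, y REAL)")
con.executemany("INSERT INTO points VALUES (?, ?)",
                [(k / 10.0, 2.0 * k / 10.0) for k in range(100)])  # y = 2x

w, lr = 0.0, 0.01
for _ in range(60):   # 60 unrolled iterations of gradient descent
    (grad,) = con.execute(
        "SELECT AVG(2 * (? * x - y) * x) FROM points", (w,)   # d/dw of MSE
    ).fetchone()
    w -= lr * grad

print(round(w, 3))    # converges toward the true slope 2.0
```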

Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation [article]

Byung Hoon Ahn, Prannoy Pilligundla, Amir Yazdanbakhsh, Hadi Esmaeilzadeh
2020 arXiv   pre-print
Achieving faster execution with shorter compilation time can foster further diversity and innovation in neural networks.  ...  Experimentation with real hardware shows that Chameleon provides a 4.45x speedup in optimization time over AutoTVM, while also improving inference time of the modern deep networks by 5.6%.  ...  Later, AutoTVM (Chen et al., 2018b) incorporates learning with boosted trees within the cost model for TVM to reduce the number of real hardware measurements.  ... 
arXiv:2001.08743v1 fatcat:we2zi5q3mvc4hezbcou2kvrleu
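The AutoTVM-style loop mentioned in the last fragment looks roughly like this: fit a cost model on measured configurations, rank the untried ones, measure only the most promising few on hardware, and refit. The config space and timing function below are synthetic, with scikit-learn's boosted trees standing in for the real cost model.

```python
import random
from sklearn.ensemble import GradientBoostingRegressor

random.seed(0)

def hardware_time(cfg):   # stand-in for a real on-device measurement
    tile, unroll, vec = cfg
    return abs(tile - 32) * 0.02 + abs(unroll - 4) * 0.05 + (0.0 if vec else 0.3)

space = [(t, u, v) for t in (8, 16, 32, 64) for u in (1, 2, 4, 8) for v in (0, 1)]
X, y = [], []
for cfg in random.sample(space, 6):           # initial random measurements
    X.append(list(cfg)); y.append(hardware_time(cfg))

for _ in range(5):                            # tuning rounds
    model = GradientBoostingRegressor().fit(X, y)
    untried = [c for c in space if list(c) not in X]
    untried.sort(key=lambda c: model.predict([list(c)])[0])
    for cfg in untried[:2]:                   # measure only the top candidates
        X.append(list(cfg)); y.append(hardware_time(cfg))

print(min(zip(y, X)))                         # best (time, config) found
```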

Efficient global register allocation [article]

Ian Rogers
2020 arXiv   pre-print
An advantageous property of the approach is an ability to make these trade-offs. A key result is that the 'future-active' set can remove any liveness model for over 90% of instructions and 80% of methods.  ...  Registers are allocated and freed in the manner of linear scan, although other ordering heuristics could improve code quality or lower runtime cost.  ...  ACKNOWLEDGMENTS: The author wishes to thank Google for their support. This work wouldn't have been possible without the encouragement, feedback, support and energy of Lirong  ... 
arXiv:2011.05608v1 fatcat:ksfjpztvnvdk3gsg4kfbjbwl5m
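For context, classic linear scan over live intervals looks like the sketch below. The paper's 'future-active' refinement, which avoids building liveness information at all for most code, is not reproduced here, and the intervals are hypothetical.

```python
NUM_REGS = 2
intervals = [("v1", 0, 4), ("v2", 1, 3), ("v3", 2, 6), ("v4", 5, 7)]

intervals.sort(key=lambda iv: iv[1])   # process in order of start point
active, free, assignment, spills = [], list(range(NUM_REGS)), {}, []

for name, start, end in intervals:
    # Expire intervals that ended before this one starts, freeing registers.
    for iv in [a for a in active if a[2] <= start]:
        active.remove(iv)
        free.append(assignment[iv[0]])
    if free:
        assignment[name] = free.pop()
        active.append((name, start, end))
    else:
        # Spill whichever active interval ends last (classic heuristic).
        victim = max(active, key=lambda a: a[2])
        if victim[2] > end:
            assignment[name] = assignment.pop(victim[0])
            spills.append(victim[0])
            active.remove(victim)
            active.append((name, start, end))
        else:
            spills.append(name)

print(assignment, spills)   # {'v1': 1, 'v2': 0, 'v4': 0} ['v3']
```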

Accelerating Human-in-the-loop Machine Learning: Challenges and Opportunities [article]

Doris Xin, Litian Ma, Jialin Liu, Stephen Macke, Shuchen Song, Aditya Parameswaran
2018 arXiv   pre-print
Development of machine learning (ML) workflows is a tedious process of iterative experimentation: developers repeatedly make changes to workflows until the desired accuracy is attained.  ...  We describe our vision for a "human-in-the-loop" ML system that accelerates this process: by intelligently tracking changes and intermediate results over time, such a system can enable rapid iteration,  ...  CONCLUSIONS We presented our vision for an efficient end-to-end ML system focused on supporting iterative, human-in-the-loop workflow development.  ... 
arXiv:1804.05892v1 fatcat:pu2ywdlvj5dddjtpnf7ncpqgey
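One concrete ingredient of such a system is a cache of intermediate results keyed on each step's code and inputs, so that an edit re-runs only what actually changed. A minimal sketch; the decorator and the two steps are invented for illustration, not taken from the paper.

```python
import hashlib, inspect, pickle

_cache = {}

def tracked(step):
    """Memoize a workflow step on a hash of its source code and arguments."""
    def run(*args):
        key = hashlib.sha256(
            inspect.getsource(step).encode() + pickle.dumps(args)
        ).hexdigest()
        if key not in _cache:
            _cache[key] = step(*args)
        return _cache[key]
    return run

@tracked
def featurize(rows):
    return [r * 2 for r in rows]

@tracked
def train(features):
    return sum(features) / len(features)   # stand-in "model"

print(train(featurize([1, 2, 3])))   # a repeat call hits the cache; editing a
                                     # step's code changes its key and re-runs it
```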

Iterative compilation on mobile devices [article]

Paschalis Mpeis, Pavlos Petoumenos, Hugh Leather
2016 arXiv   pre-print
During idle periods, this minimal state is combined with different binaries of the application, each one built with different optimizations enabled.  ...  Replaying the targeted functions allows us to evaluate the effectiveness of each set of optimizations for the actual way the user interacts with the application.  ...  Finally, on the capture side, which is the only active component while the user interacts with the device, we use novel ideas to minimize the overhead both in terms of runtime and storage so that the user  ... 
arXiv:1511.02603v4 fatcat:b6xbzindujddti2amqbz5edsou
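The idle-time side of this capture-and-replay scheme boils down to a loop like the following; the binary names, replay flag, and capture file are hypothetical placeholders for the paper's mechanism.

```python
import subprocess, time

BUILDS = ["app.O2", "app.O3", "app.unroll"]   # variant binaries (assumed names)
CAPTURE = "capture.bin"                       # recorded function state (assumed)

def replay(binary):
    """Time one replay of the captured functions under a given build."""
    start = time.perf_counter()
    subprocess.run(["./" + binary, "--replay", CAPTURE], check=True)
    return time.perf_counter() - start

# At an idle period: take the best of three runs per build, keep the winner.
timings = {b: min(replay(b) for _ in range(3)) for b in BUILDS}
print(min(timings, key=timings.get))          # build to install as the default
```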

ProTuner: Tuning Programs with Monte Carlo Tree Search [article]

Ameer Haj-Ali, Hasan Genc, Qijing Huang, William Moses, John Wawrzynek, Krste Asanović, Ion Stoica
2020 arXiv   pre-print
We further explore modifications to the standard MCTS algorithm as well as combining real execution time measurements with the cost model.  ...  We build our framework on top of Halide and show that MCTS can outperform the state-of-the-art beam-search algorithm.  ...  The schedule with the best real execution time, rather than the one with the lowest cost, is used as the new root of the MCTSes for the next iteration.  ... 
arXiv:2005.13685v1 fatcat:3yrh5sxgbrfojgcjsjfqcylnvi
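A compact MCTS skeleton over a toy schedule space, showing the select/expand/rollout/backpropagate loop the entry refers to. ProTuner's Halide schedule space and cost model are not reproduced; the synthetic cost below only gives the search something to optimize.

```python
import math, random

random.seed(0)
DEPTH = 4   # a schedule here is just a sequence of 4 binary decisions

def cost(decisions):   # synthetic: lower is better, early decisions interact
    return sum(decisions) + (2 if decisions[:2] == [1, 1] else 0)

class Node:
    def __init__(self, decisions):
        self.decisions, self.children = decisions, {}
        self.visits, self.value = 0, 0.0   # value accumulates rewards

def rollout(decisions):
    while len(decisions) < DEPTH:          # finish the schedule at random
        decisions = decisions + [random.choice([0, 1])]
    return -cost(decisions)                # reward = negated cost

def mcts(root, iters=200, c=1.4):
    for _ in range(iters):
        node, path = root, [root]
        while len(node.decisions) < DEPTH and len(node.children) == 2:
            node = max(node.children.values(),        # UCT selection
                       key=lambda n: n.value / n.visits
                       + c * math.sqrt(math.log(node.visits) / n.visits))
            path.append(node)
        if len(node.decisions) < DEPTH:               # expansion
            d = random.choice([d for d in (0, 1) if d not in node.children])
            node.children[d] = node = Node(node.decisions + [d])
            path.append(node)
        reward = rollout(node.decisions)
        for n in path:                                # backpropagation
            n.visits += 1
            n.value += reward
    return max(root.children.values(), key=lambda n: n.visits).decisions

print(mcts(Node([])))   # most-visited first decision, e.g. [0]
```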

A semi-agnostic ansatz with variable structure for quantum machine learning [article]

M. Bilkis, M. Cerezo, Guillaume Verdon, Patrick J. Coles, Lukasz Cincio
2022 arXiv   pre-print
Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest.  ...  Quantum machine learning (QML) offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics  ...  Figure 13 shows VAns results for n = 10 qubit QFT compilation. Panel (a) depicts how the value of the cost function C(k, θ) is minimized over the iterations.  ... 
arXiv:2103.06712v2 fatcat:3tlpagnddfa3zbmx4xdvyleadu
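The grow/shrink structural search can be imitated classically. In this sketch the "ansatz" is a variable-length sum of sine terms fitted to a toy target, with a small depth penalty; it mirrors the insert/remove-then-reoptimize loop only in shape and is not a quantum simulation.

```python
import math, random

random.seed(3)
XS = [i / 20.0 for i in range(40)]
TARGET = [math.sin(2 * x) + 0.5 * math.sin(5 * x) for x in XS]

def cost(terms):   # terms: [amplitude, frequency] pairs; depth is penalized
    err = sum((sum(a * math.sin(f * x) for a, f in terms) - t) ** 2
              for x, t in zip(XS, TARGET))
    return err + 0.01 * len(terms)

def tune(terms, steps=400, scale=0.1):
    """Hill-climb the continuous parameters of a fixed structure."""
    terms = [list(t) for t in terms]
    best = cost(terms)
    for _ in range(steps):
        i, j = random.randrange(len(terms)), random.randrange(2)
        old = terms[i][j]
        terms[i][j] += random.gauss(0, scale)
        new = cost(terms)
        if new < best:
            best = new
        else:
            terms[i][j] = old
    return terms

terms = [[1.0, 1.0]]
for _ in range(8):   # structural moves: try inserting and removing a term
    grown = tune(terms + [[0.1, random.uniform(0.5, 6.0)]])
    shrunk = tune(terms[:-1]) if len(terms) > 1 else terms
    terms = min((tune(terms), grown, shrunk), key=cost)

print(len(terms), round(cost(terms), 4))   # structure size and final cost
```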

Active Sampler: Light-weight Accelerator for Complex Data Analytics at Scale [article]

Jinyang Gao, H.V.Jagadish, Beng Chin Ooi
2015 arXiv   pre-print
We propose an Active Sampler algorithm, where training data with more "learning value" to the model are sampled more frequently.  ...  Most popular algorithms for model training are iterative. Due to the surging volumes of data, we can usually afford to process only a fraction of the training data in each iteration.  ...  This idea is very similar to active learning but with a major difference: the objective of active learning is to reduce the number of training samples, while the objective of active sampling is to reduce  ... 
arXiv:1512.03880v1 fatcat:ezvhnggpljd5bca6aob7umbfei
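Loss-proportional sampling, the core of the idea, fits in a few lines; the linear model and data below are synthetic stand-ins. Examples the current model gets most wrong are drawn most often for the next SGD step.

```python
import random

random.seed(0)
data = [(k / 50.0, 3.0 * k / 50.0 + 1.0) for k in range(50)]   # y = 3x + 1
w, b, lr = 0.0, 0.0, 0.1

def loss(example):
    x, y = example
    return (w * x + b - y) ** 2

for _ in range(300):
    # Sampling weight = current loss (+ epsilon so no example starves).
    weights = [loss(ex) + 1e-3 for ex in data]
    x, y = random.choices(data, weights=weights, k=1)[0]
    err = w * x + b - y          # SGD step on the sampled example
    w -= lr * err * x
    b -= lr * err

print(round(w, 2), round(b, 2))  # approaches (3.0, 1.0)
```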

Universal compiling and (No-)Free-Lunch theorems for continuous variable quantum learning [article]

Tyler Volkoff and Zoë Holmes and Andrew Sornborger
2021 arXiv   pre-print
We use these results to motivate several closely related, short-depth CV algorithms for the quantum compilation of a target unitary U with a parameterized circuit V(θ).  ...  We analyse the trainability of our proposed cost functions and numerically demonstrate our algorithms for learning arbitrary Gaussian operations and Kerr non-linearities.  ...  ACKNOWLEDGMENTS: The authors thank Kunal Sharma  ... 
arXiv:2105.01049v1 fatcat:cx32zucifzg45itiq24khmgt4i
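A standard shape for such a compilation cost is C(θ) = 1 − |Tr(V(θ)†U)|²/d², which vanishes exactly when V(θ) matches U up to global phase. The paper works with continuous-variable circuits; the single-qubit toy below only illustrates the cost structure and its minimization.

```python
import cmath

U = [[cmath.exp(-0.35j), 0], [0, cmath.exp(0.35j)]]   # target: Rz(0.7)

def V(theta):                                         # ansatz: Rz(theta)
    return [[cmath.exp(-0.5j * theta), 0], [0, cmath.exp(0.5j * theta)]]

def cost(theta):
    v = V(theta)
    # Tr(V(θ)† U) = Σ_ij conj(V_ij) · U_ij ; here d = 2, so d² = 4.
    tr = sum(v[i][j].conjugate() * U[i][j] for i in range(2) for j in range(2))
    return 1 - abs(tr) ** 2 / 4

theta, step = 0.0, 0.5
while step > 1e-6:             # simple derivative-free descent on θ
    for cand in (theta + step, theta - step):
        if cost(cand) < cost(theta):
            theta = cand
            break
    else:
        step /= 2

print(round(theta, 4), round(cost(theta), 8))   # θ -> 0.7, cost -> 0
```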

PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [article]

Sanket Tavarageri, Alexander Heinecke, Sasikanth Avancha, Gagandeep Goyal, Ramakrishna Upadrasta, Bharat Kaul
2020 arXiv   pre-print
The two approaches have their drawbacks: even though a custom-built library can deliver very good performance, the cost and time of development of the library can be high.  ...  compilation represented by the TensorFlow XLA compiler.  ...  The state-of-the-art polyhedral compiler, Pluto [7], derives a schedule for the code that attempts to minimize data reuse distances.  ... 
arXiv:2002.02145v1 fatcat:rs77unnkyfdg3c7zwnpolml2yy
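The reuse-distance objective in that last fragment is what loop tiling serves: restructure a loop nest so each block of data is reused while it is still cache-resident. Shown by hand below on matrix multiply; a polyhedral scheduler derives this kind of schedule automatically.

```python
N, T = 64, 16   # problem size and tile size (assumed cache-friendly)
A = [[1.0] * N for _ in range(N)]
B = [[1.0] * N for _ in range(N)]
C = [[0.0] * N for _ in range(N)]

for ii in range(0, N, T):            # tile loops: walk T x T x T blocks
    for kk in range(0, N, T):
        for jj in range(0, N, T):
            for i in range(ii, ii + T):        # intra-tile loops touch a
                for k in range(kk, kk + T):    # small, reusable working set
                    a = A[i][k]
                    for j in range(jj, jj + T):
                        C[i][j] += a * B[k][j]

assert C[0][0] == N   # same result as the untiled i-k-j loop nest
```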

Compiler optimization on VLIW instruction scheduling for low power

Chingren Lee, Jenq Kuen Lee, Tingting Hwang, Shi-Chun Tsai
2003 ACM Transactions on Design Automation of Electronic Systems  
activities of the instruction bus as compared with conventional list scheduling for an extensive set of benchmarks.  ...  The additional reduction in transitional activities of the instruction bus from horizontal to vertical scheduling with window size four is around 4.57 to 10.42%, and the average is 7.66%.  ...  The algorithm in Figure 4 always gives the optimal solution to the problem of minimizing the transition activity of the instruction bus when microinstructions are rescheduled with horizontal moves for  ... 
doi:10.1145/762488.762494 fatcat:rpg7af5firendhlm23a4zfsjoa
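The quantity being minimized is the total bit switching between consecutive instruction words on the bus. A toy greedy reordering makes the objective concrete; real VLIW scheduling must also respect data dependences and slot constraints, which are ignored here, and the words are made up.

```python
def hamming(a, b):
    """Bits that flip on the bus between two instruction words."""
    return bin(a ^ b).count("1")

words = [0b10110010, 0b10110011, 0b01001100, 0b11110000, 0b01001110]

def total_transitions(order):
    return sum(hamming(x, y) for x, y in zip(order, order[1:]))

def greedy_order(start):
    """Chain words by always taking the nearest (fewest-flips) next word."""
    order, rest = [start], set(words) - {start}
    while rest:
        nxt = min(rest, key=lambda w: hamming(order[-1], w))
        order.append(nxt)
        rest.remove(nxt)
    return order

best = min((greedy_order(s) for s in words), key=total_transitions)
print(total_transitions(words), "->", total_transitions(best))   # 20 -> 9
```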
Showing results 1 — 15 out of 43,205 results