368 Hits in 1.6 sec

ProTuner: Tuning Programs with Monte Carlo Tree Search [article]

Ameer Haj-Ali, Hasan Genc, Qijing Huang, William Moses, John Wawrzynek, Krste Asanović, Ion Stoica
2020 arXiv   pre-print
We explore applying the Monte Carlo Tree Search (MCTS) algorithm to a notoriously difficult task: tuning programs for high-performance deep learning and image processing. We build our framework on top of Halide and show that MCTS can outperform the state-of-the-art beam-search algorithm. Unlike beam search, which is guided by greedy intermediate performance comparisons between partial and less meaningful schedules, MCTS compares complete schedules and looks ahead before making any intermediate scheduling decision. We further explore modifications to the standard MCTS algorithm, as well as combining real execution-time measurements with the cost model. Our results show that MCTS can outperform beam search on a suite of 16 real benchmarks.
arXiv:2005.13685v1 fatcat:3yrh5sxgbrfojgcjsjfqcylnvi
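The search strategy the abstract describes can be sketched in miniature. The toy MCTS below searches a six-decision schedule space against an invented cost model and compares only complete schedules via random rollouts; ProTuner's actual Halide schedules, learned cost model, and MCTS modifications are not reproduced here.

```python
# Toy MCTS over a sequence of binary "scheduling decisions" (illustrative
# only; DEPTH, TARGET, and cost_model are invented for this sketch).
import math
import random

DEPTH = 6                      # decisions per schedule (hypothetical)
TARGET = (1, 0, 1, 1, 0, 1)    # best schedule under the toy cost model

def cost_model(schedule):
    # stand-in for a learned cost model: reward agreement with TARGET
    return sum(int(a == b) for a, b in zip(schedule, TARGET))

class Node:
    def __init__(self, prefix):
        self.prefix, self.children = prefix, {}
        self.visits, self.value = 0, 0.0

def rollout(prefix):
    # complete the partial schedule randomly, then score the FULL schedule
    tail = [random.randint(0, 1) for _ in range(DEPTH - len(prefix))]
    return cost_model(tuple(prefix) + tuple(tail))

def mcts(iterations=2000, c=1.4):
    random.seed(0)
    root = Node(())
    for _ in range(iterations):
        node, path = root, [root]
        # selection: descend through fully expanded nodes via UCT
        while len(node.prefix) < DEPTH and len(node.children) == 2:
            node = max(node.children.values(),
                       key=lambda ch: ch.value / (ch.visits + 1e-9)
                       + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))
            path.append(node)
        if len(node.prefix) < DEPTH:       # expansion: add one untried action
            action = 0 if 0 not in node.children else 1
            node.children[action] = Node(node.prefix + (action,))
            node = node.children[action]
            path.append(node)
        score = rollout(node.prefix)       # simulation
        for n in path:                     # backpropagation
            n.visits += 1
            n.value += score
    best, node = [], root                  # extract most-visited path
    while node.children:
        action, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        best.append(action)
    return tuple(best)
```

The contrast with beam search is the rollout step: a partial schedule is scored by completing it, not by comparing partial schedules directly.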

AutoCkt: Deep Reinforcement Learning of Analog Circuit Designs [article]

Keertana Settaluri, Ameer Haj-Ali, Qijing Huang, Kourosh Hakhamaneshi, Borivoje Nikolic
2020 arXiv   pre-print
Domain specialization under energy constraints in deeply-scaled CMOS has been driving the need for agile development of Systems on a Chip (SoCs). While digital subsystems have design flows that are conducive to rapid iterations from specification to layout, analog and mixed-signal modules face the challenge of a long human-in-the-middle iteration loop that requires expert intuition to verify that post-layout circuit parameters meet the original design specification. Existing automated solutions that optimize circuit parameters for a given target design specification have limitations of being schematic-only, inaccurate, sample-inefficient or not generalizable. This work presents AutoCkt, a machine learning optimization framework trained using deep reinforcement learning that not only finds post-layout circuit parameters for a given target specification, but also gains knowledge about the entire design space through a sparse subsampling technique. Our results show that for multiple circuit topologies, AutoCkt is able to converge and meet all target specifications on at least 96.3% of tested design goals in schematic simulation, on average 40X faster than a traditional genetic algorithm. Using the Berkeley Analog Generator, AutoCkt is able to design 40 LVS-passed operational amplifiers in 68 hours, 9.6X faster than the state-of-the-art when considering layout parasitics.
arXiv:2001.01808v2 fatcat:5gq6flvajnahphzguaxswsdynm
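The sizing loop AutoCkt automates can be caricatured as discrete parameter updates driven by the gap to a target spec. Everything below (the spec values, the `simulate` stand-in, the update rule) is invented for illustration; AutoCkt's actual agent is a deep RL policy trained against real schematic and post-layout simulations.

```python
# Caricature of spec-driven circuit sizing (not AutoCkt's RL agent, and not
# a real circuit model; `simulate`, SPEC, and the step sizes are made up).

SPEC = {"gain": 40.0, "bandwidth": 10.0}   # hypothetical target specs

def simulate(width, bias):
    # invented monotone stand-in for a circuit simulation
    return {"gain": 2.0 * width, "bandwidth": 0.5 * bias}

def tune(width=1.0, bias=1.0, steps=100):
    for _ in range(steps):
        result = simulate(width, bias)
        if all(result[k] >= SPEC[k] for k in SPEC):
            return result                  # every target spec met
        # discrete actions: bump whichever parameter misses its spec
        if result["gain"] < SPEC["gain"]:
            width += 1.0
        if result["bandwidth"] < SPEC["bandwidth"]:
            bias += 1.0
    return simulate(width, bias)
```

AutoCkt's advantage over a loop like this is exactly what the abstract claims: the trained agent generalizes across the design space instead of re-searching from scratch per target.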

A View on Deep Reinforcement Learning in System Optimization [article]

Ameer Haj-Ali, Nesreen K. Ahmed, Ted Willke, Joseph Gonzalez, Krste Asanovic, Ion Stoica
2019 arXiv   pre-print
Correspondence to: Ameer Haj-Ali.  ... 
arXiv:1908.01275v3 fatcat:ih52psaazzcs3pulz4nnnjk2di

AutoPhase: Juggling HLS Phase Orderings in Random Forests with Deep Reinforcement Learning [article]

Qijing Huang, Ameer Haj-Ali, William Moses, John Xiang, Ion Stoica, Krste Asanovic, John Wawrzynek
2020 arXiv   pre-print
Correspondence to: Qijing Huang, Ameer Haj-Ali. ucb-bar/autophase.  ...  NeuroVectorizer (Haj-Ali et al., 2020; 2019a) used deep RL for automatically tuning compiler pragmas such as vectorization and interleaving factors.  ... 
arXiv:2003.00671v2 fatcat:xemglojhkfhllo7oeo4aosqala

IMAGING: In-Memory AlGorithms for Image processiNG

Ameer Haj-Ali, Rotem Ben-Hur, Nimrod Wald, Ronny Ronen, Shahar Kvatinsky
2018 IEEE Transactions on Circuits and Systems Part 1: Regular Papers  
Data-intensive applications such as image processing suffer from massive data movement between memory and processing units. The severe limitations on system performance and energy efficiency imposed by this data movement are further exacerbated with any increase in the distance the data must travel. This data transfer and its associated obstacles could be eliminated by the use of emerging non-volatile resistive memory technologies (memristors) that make it possible to both store and process data within the same memory cells. In this paper, we propose four in-memory algorithms for efficient execution of fixed point multiplication using MAGIC gates. These algorithms achieve much better latency and throughput than previous work and significantly reduce the area cost. They can thus be feasibly implemented inside the size-limited memory arrays. We use these fixed point multiplication algorithms to efficiently perform more complex in-memory operations such as image convolution, and further show how to partition large images across multiple memory arrays so as to maximize parallelism. All the proposed algorithms are evaluated and verified using a cycle-accurate and functional simulator. Our algorithms provide on average 200× better performance over state-of-the-art APIM, a processing in-memory architecture for data-intensive applications. Index Terms: von Neumann bottleneck, memristors, MAGIC, algorithms, processing in memory.
doi:10.1109/tcsi.2018.2846699 fatcat:wt5t7eks45d7ngi7quqrxe55au
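The arithmetic at the core of the paper, fixed point multiplication, reduces to shift-and-add. A plain-software sketch of that arithmetic follows; the in-memory MAGIC-gate realization is not modeled, and the Q8 format plus the non-negative-operand restriction are illustration choices.

```python
# Fixed point multiplication via shift-and-add: the arithmetic the paper's
# in-memory algorithms realize. Non-negative operands only, for brevity.

FRAC_BITS = 8  # assumed Q8 fixed-point format for illustration

def to_fixed(x):
    # encode a real number as an integer with FRAC_BITS fractional bits
    return int(round(x * (1 << FRAC_BITS)))

def to_float(x):
    return x / (1 << FRAC_BITS)

def fixed_mul(a, b):
    # accumulate shifted copies of `a` for each set bit of `b`,
    # then renormalize by the fractional width
    acc = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            acc += a << i
    return acc >> FRAC_BITS
```

Each "shift a copy, conditionally add" step is the kind of bit-level primitive that maps onto in-memory logic gates, which is why this decomposition suits memristive arrays.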

AutoPhase: Compiler Phase-Ordering for High Level Synthesis with Deep Reinforcement Learning [article]

Ameer Haj-Ali, Qijing Huang, William Moses, John Xiang, Ion Stoica, Krste Asanovic, John Wawrzynek
2019 arXiv   pre-print
The performance of the code generated by a compiler depends on the order in which the optimization passes are applied. In high-level synthesis, the quality of the generated circuit relates directly to the code generated by the front-end compiler. Choosing a good order--often referred to as the phase-ordering problem--is an NP-hard problem. In this paper, we evaluate a new technique to address the phase-ordering problem: deep reinforcement learning. We implement a framework in the context of the LLVM compiler to optimize the ordering for HLS programs and compare the performance of deep reinforcement learning to state-of-the-art algorithms that address the phase-ordering problem. Overall, our framework runs one to two orders of magnitude faster than these algorithms, and achieves a 16% improvement in circuit performance over the -O3 compiler flag.
arXiv:1901.04615v2 fatcat:nga3sq2wqrconhmlm7ndfblsry
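The phase-ordering problem itself is easy to demonstrate with invented passes: the same pipeline applied in different orders yields different results, and only exhaustive search over orderings (which deep RL replaces at scale) is guaranteed to find the best one. The three "passes" and the size metric below are made up for illustration, not LLVM passes.

```python
# Toy demonstration of phase ordering: pass order changes the outcome.
from itertools import permutations

def inline_pass(size):      # inlining grows code but exposes later wins
    return size + 4

def dce_pass(size):         # dead-code elimination shrinks code
    return max(size - 6, 0)

def unroll_pass(size):      # unrolling doubles the remaining size
    return size * 2

PASSES = {"inline": inline_pass, "dce": dce_pass, "unroll": unroll_pass}

def apply_order(order, size=10):
    for name in order:
        size = PASSES[name](size)
    return size

def best_order():
    # exhaustive search: feasible for 3 passes, hopeless for real pipelines
    return min(permutations(PASSES), key=apply_order)
```

With n passes there are n! orderings (more if passes repeat), which is why the abstract calls the problem NP-hard and turns to learned policies instead.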

Supporting the Momentum Training Algorithm Using a Memristor-Based Synapse

Tzofnat Greenberg-Toledo, Roee Mazor, Ameer Haj-Ali, Shahar Kvatinsky
2019 IEEE Transactions on Circuits and Systems Part 1: Regular Papers  
Despite the increasing popularity of deep neural networks (DNNs), they cannot be trained efficiently on existing platforms, and efforts have thus been devoted to designing dedicated hardware for DNNs. In our recent work, we have provided direct support for the stochastic gradient descent (SGD) training algorithm by constructing the basic element of neural networks, the synapse, using emerging technologies, namely memristors. Due to the limited performance of SGD, optimization algorithms are commonly employed in DNN training. Therefore, DNN accelerators that only support SGD might not meet DNN training requirements. In this paper, we present a memristor-based synapse that supports the commonly used momentum algorithm. Momentum significantly improves the convergence of SGD and facilitates the DNN training stage. We propose two design approaches to support momentum: 1) a hardware-friendly modification of the momentum algorithm using memory external to the synapse structure, and 2) updating each synapse with a built-in memory. Our simulations show that the proposed DNN training solutions are as accurate as training on a GPU platform while speeding up the performance by 886× and decreasing energy consumption by 7×, on average.
doi:10.1109/tcsi.2018.2888538 fatcat:3fm3kro25jctfemo2o2stt2az4
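In software form, the momentum update this synapse supports is v <- mu*v - lr*grad followed by w <- w + v. A minimal sketch on a toy quadratic (no memristor or circuit behavior is modeled; the learning-rate and momentum values are arbitrary):

```python
# Reference implementation of momentum SGD (the algorithm only; the paper's
# memristor-based hardware realization is not modeled here).

def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """One momentum update: v <- mu*v - lr*grad, then w <- w + v."""
    v = mu * v - lr * grad
    return w + v, v

# Minimize f(w) = w^2 (gradient 2w) starting from w = 5.0.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w)
```

The paper's two design approaches differ precisely in where the velocity term v lives: in memory external to the synapse, or in a built-in per-synapse memory.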

NeuroVectorizer: End-to-End Vectorization with Deep Reinforcement Learning [article]

Ameer Haj-Ali, Nesreen K. Ahmed, Ted Willke, Sophia Shao, Krste Asanovic, Ion Stoica
2020 arXiv   pre-print
One of the key challenges arising when compilers vectorize loops for today's SIMD-compatible architectures is to decide if vectorization or interleaving is beneficial. Then, the compiler has to determine how many instructions to pack together and how many loop iterations to interleave. Compilers are designed today to use fixed-cost models that are based on heuristics to make vectorization decisions on loops. However, these models are unable to capture the data dependency, the computation graph, or the organization of instructions. Alternatively, software engineers often hand-write the vectorization factors of every loop. This, however, places a huge burden on them, since it requires prior experience and significantly increases the development time. In this work, we explore a novel approach for handling loop vectorization and propose an end-to-end solution using deep reinforcement learning (RL). We conjecture that deep RL can capture different instructions, dependencies, and data structures to enable learning a sophisticated model that can better predict the actual performance cost and determine the optimal vectorization factors. We develop an end-to-end framework, from code to vectorization, that integrates deep RL in the LLVM compiler. Our proposed framework takes benchmark codes as input and extracts the loop codes. These loop codes are then fed to a loop embedding generator that learns an embedding for these loops. Finally, the learned embeddings are used as input to a Deep RL agent, which determines the vectorization factors for all the loops. We further extend our framework to support multiple supervised learning methods. We evaluate our approaches against the currently used LLVM vectorizer and loop polyhedral optimization techniques. Our experiments show 1.29X-4.73X performance speedup compared to baseline and only 3% worse than the brute-force search on a wide range of benchmarks.
arXiv:1909.13639v4 fatcat:pl3mcmsmxbaizgimjxekt5qstu
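The per-loop decision NeuroVectorizer learns, choosing a vectorization factor (VF) and an interleaving factor (IF), is shown below as the brute-force baseline the abstract compares against. The cost model here is invented, not LLVM's; the RL agent's job is to predict the winning pair without enumerating it.

```python
# Brute-force (VF, IF) selection against a made-up cost model, standing in
# for the exhaustive baseline the abstract mentions.

def toy_cost(vf, interleave, trip_count=64):
    # invented model: wider vectors amortize work, but factors that do not
    # divide the trip count pay a scalar-epilogue penalty
    body = trip_count / (vf * interleave)
    epilogue = trip_count % (vf * interleave)
    return body + 2 * epilogue

def brute_force(trip_count=64):
    candidates = [(vf, il) for vf in (1, 2, 4, 8, 16) for il in (1, 2, 4, 8)]
    return min(candidates, key=lambda c: toy_cost(*c, trip_count))
```

Even this toy search grows multiplicatively in the candidate factors per loop, which is why exhaustive search is only used as an upper bound in the evaluation.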

Ansor: Generating High-Performance Tensor Programs for Deep Learning [article]

Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, Ion Stoica
2020 arXiv   pre-print
High-performance tensor programs are crucial to guarantee efficient execution of deep neural networks. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging. Currently, deep learning systems rely on vendor-provided kernel libraries or various search strategies to get performant tensor programs. These approaches either require significant engineering effort to develop platform-specific optimization code or fall short of finding high-performance programs due to restricted search space and ineffective exploration strategy. We present Ansor, a tensor program generation framework for deep learning applications. Compared with existing search strategies, Ansor explores many more optimization combinations by sampling programs from a hierarchical representation of the search space. Ansor then fine-tunes the sampled programs with evolutionary search and a learned cost model to identify the best programs. Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches. In addition, Ansor utilizes a task scheduler to simultaneously optimize multiple subgraphs in deep neural networks. We show that Ansor improves the execution performance of deep neural networks relative to the state-of-the-art on the Intel CPU, ARM CPU, and NVIDIA GPU by up to 3.8×, 2.6×, and 1.7×, respectively.
arXiv:2006.06762v4 fatcat:as6rrj2bvjcwtmkjremrrfkqhq
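Ansor's fine-tuning stage, evolutionary search ranked by a learned cost model, has roughly the loop shape below. The sampler, mutation operator, and cost model here are toy stand-ins; Ansor's real candidates are tensor programs and its cost model is learned from measurements.

```python
# Minimal evolutionary search ranked by a (hand-written stand-in) cost model.
import random

def cost_model(candidate):
    # stand-in for a learned model: lower is better, optimum at all 3s
    return sum((g - 3) ** 2 for g in candidate)

def mutate(candidate):
    # perturb one "knob" by one step
    c = list(candidate)
    i = random.randrange(len(c))
    c[i] += random.choice((-1, 1))
    return tuple(c)

def evolve(generations=200, pop_size=16, genes=5):
    random.seed(0)
    pop = [tuple(random.randint(0, 8) for _ in range(genes))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost_model)
        survivors = pop[: pop_size // 2]          # keep the best half
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=cost_model)
```

Ranking by a model rather than by real measurements is what makes the loop cheap enough to run many generations; only the top candidates need hardware measurement.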

SIMPLER MAGIC: Synthesis and Mapping of In-Memory Logic Executed in a Single Row to Improve Throughput

Rotem Ben-Hur, Ronny Ronen, Ameer Haj-Ali, Debjyoti Bhattacharjee, Adi Eliahu, Natan Peled, Shahar Kvatinsky
2019 IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems  
In-memory processing can dramatically improve the latency and energy consumption of computing systems by minimizing the data transfer between the memory and the processor. Efficient execution of processing operations within the memory is therefore a highly motivated objective in modern computer architecture. This article presents a novel automatic framework for efficient implementation of arbitrary combinational logic functions within a memristive memory. Using tools from logic design, graph theory and compiler register allocation technology, we developed Synthesis and In-memory Mapping of Logic Execution in a Single Row (SIMPLER), a tool that optimizes the execution of in-memory logic operations in terms of throughput and area. Given a logical function, SIMPLER automatically generates a sequence of atomic memristor-aided logic (MAGIC) NOR operations and efficiently locates them within a single size-limited memory row, reusing cells to save area when needed. This approach fully exploits the parallelism offered by the MAGIC NOR gates. It allows multiple instances of the logic function to be performed concurrently, each compressed into a single row of the memory. This virtue makes SIMPLER an attractive candidate for designing in-memory single instruction, multiple data (SIMD) operations. Compared to previous work (which optimizes latency rather than throughput for a single function), SIMPLER achieves an average throughput improvement of 435×. When the previous tools are parallelized similarly to SIMPLER, SIMPLER achieves higher throughput of at least 5×, with 23× improvement in area and 20× improvement in area efficiency. These improvements more than fully compensate for the increase (up to 17% on average) in latency. Index Terms: Logic design, logic synthesis, memristor-aided logic (MAGIC), memristive systems, memristor, memristive memory-processing unit (mMPU), throughput, von Neumann architecture.
doi:10.1109/tcad.2019.2931188 fatcat:djvgvsavqzeflhujqthrqfu62u
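The register-allocation idea SIMPLER borrows can be sketched directly: walk the NOR-operation sequence, track each value's last use, and reuse a cell once its value is dead. The netlist below is invented for illustration; SIMPLER's actual synthesis flow and MAGIC execution constraints are not modeled.

```python
# Liveness-based cell reuse for a NOR-only op sequence (register-allocation
# analogy only; the example netlist is made up).

def allocate_cells(ops, outputs):
    # ops: list of (result, in1, in2) NOR operations in execution order
    last_use = {}
    for idx, (_, a, b) in enumerate(ops):
        last_use[a] = idx
        last_use[b] = idx
    free, cell_of, next_cell = [], {}, 0
    for idx, (name, a, b) in enumerate(ops):
        # free cells whose value is last used by this op (unless it is
        # a function output, which must survive)
        for v in (a, b):
            if v in cell_of and last_use.get(v) == idx and v not in outputs:
                free.append(cell_of.pop(v))
        cell_of[name] = free.pop() if free else next_cell
        if cell_of[name] == next_cell:
            next_cell += 1
    return next_cell  # number of cells needed for produced values

# invented example: a small NOR netlist over inputs x, y, z
OPS = [("t1", "x", "y"),
       ("t2", "y", "z"),
       ("t3", "t1", "t2"),   # t1 and t2 die here
       ("t4", "t3", "x"),
       ("out", "t4", "t4")]
```

Five produced values fit in two cells here; packing each function instance into fewer cells of a single row is what lets many instances run in parallel across rows.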

Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration [article]

Hasan Genc, Seah Kim, Alon Amid, Ameer Haj-Ali, Vighnesh Iyer, Pranav Prakash, Jerry Zhao, Daniel Grubb, Harrison Liew, Howard Mao, Albert Ou, Colin Schmidt (+7 others)
2021 arXiv   pre-print
DNN accelerators are often developed and evaluated in isolation without considering the cross-stack, system-level effects in real-world environments. This makes it difficult to appreciate the impact of System-on-Chip (SoC) resource contention, OS overheads, and programming-stack inefficiencies on overall performance/energy-efficiency. To address this challenge, we present Gemmini, an open-source*, full-stack DNN accelerator generator. Gemmini generates a wide design-space of efficient ASIC accelerators from a flexible architectural template, together with flexible programming stacks and full SoCs with shared resources that capture system-level effects. Gemmini-generated accelerators have also been fabricated, delivering up to three orders-of-magnitude speedups over high-performance CPUs on various DNN benchmarks.
arXiv:1911.09925v3 fatcat:yftbmax3c5dqtfvovhyz57oihy


2011 Camden Fifth Series  
Ali Waris Ameer Ali Esq.  ...  by Sir Torick Ameer Ali.  ... 
doi:10.1017/s096011631000028x fatcat:wt2aih5gnvfirby2qlf7jr2z74

Page 441 of The Lancet Vol. 191, Issue 4853 [page]

1916 The Lancet  
Ali, Syed, Mw A.  ...  —Ali bin Musa, or Ali III., son of the preceding, died near to Meshed A.D. 818 (A.H. 203); buried at Meshed. Ninth Imam.  ... 


S. M. Zwemer
1912 The Muslim world  
Ameer Ali has written perhaps the most clever, though unhistorical, apology possible for the life of the Prophet.  ...  Koelle, Mohammed and Mohammedanism Critically Considered (Rivington, London, 1888); Ameer Ali, The Spirit of Islam; or, The Life and Teachings of Mohammed (S. K.  ... 
doi:10.1111/j.1478-1913.1912.tb00114.x fatcat:4mvencf7wrfnjcyhbtvjufxv5q


2011 Camden Fifth Series  
Mohemadi Aqil Jung, and Al Haj Ali Raza. 60 BL, IOR, L/P&J/12/468, fos 373-374, G.H.  ...  Forward, 'Syed Ameer Ali: a bridge-builder?', Islam and Christian-Muslim Relations, 6, no. 1 (1995), pp. 50-51.  ... 
doi:10.1017/s0960116310000278 fatcat:j6kelrkj35bfxjcozk6fhf6pym
Showing results 1 — 15 out of 368 results