1,771 Hits in 5.3 sec

Fine-grained partitioning for aggressive data skipping

Liwen Sun, Michael J. Franklin, Sanjay Krishnan, Reynold S. Xin
2014 Proceedings of the 2014 ACM SIGMOD international conference on Management of data - SIGMOD '14  
In this paper, we propose a fine-grained blocking technique that reorganizes the data tuples into blocks with a goal of enabling queries to skip blocks aggressively.  ...  By maintaining some metadata for each block of tuples, a query may skip a data block if the metadata indicates that the block does not contain relevant data.  ...  We thank Ion Stoica for providing the Conviva dataset. We also thank Sameer Agarwal, Ameet Talwalkar, Di Wang, Jiannan Wang and the reviewers for their insightful feedback.  ... 
doi:10.1145/2588555.2610515 dblp:conf/sigmod/SunFKX14 fatcat:xcszhfpzebctxbqtztr5xjzhki
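The per-block metadata idea in this abstract can be illustrated with a minimal zone-map sketch (illustrative Python, not from the paper; all names are made up):

```python
# Minimal zone-map sketch: per-block min/max metadata lets a range
# query skip blocks that provably contain no matching tuples.

def build_blocks(values, block_size):
    """Partition values into fixed-size blocks, recording min/max metadata."""
    blocks = []
    for i in range(0, len(values), block_size):
        rows = values[i:i + block_size]
        blocks.append({"min": min(rows), "max": max(rows), "rows": rows})
    return blocks

def range_query(blocks, lo, hi):
    """Scan only blocks whose [min, max] range overlaps [lo, hi]."""
    hits, scanned = [], 0
    for b in blocks:
        if b["max"] < lo or b["min"] > hi:
            continue  # skip: metadata proves the block is irrelevant
        scanned += 1
        hits.extend(v for v in b["rows"] if lo <= v <= hi)
    return hits, scanned

blocks = build_blocks(list(range(100)), block_size=10)
hits, scanned = range_query(blocks, 42, 47)  # only 1 of 10 blocks scanned
```

The paper's contribution is not the skipping check itself but how tuples are reorganized into blocks at load time so that skipping stays effective for realistic query predicates, not just sorted keys.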

A partitioning framework for aggressive data skipping

Liwen Sun, Sanjay Krishnan, Reynold S. Xin, Michael J. Franklin
2014 Proceedings of the VLDB Endowment  
We propose to demonstrate a fine-grained partitioning framework that reorganizes the data tuples into small blocks at data loading time.  ...  The goal is to enable queries to maximally skip scanning data blocks.  ...  The fine-grained tuple-level partitioning decision output by WARP offers greater flexibility and better chances for data skipping.  ... 
doi:10.14778/2733004.2733044 fatcat:ohnfkv6o6restdnixuznng6aiu

An asynchronous matrix-vector multiplier for discrete cosine transform

Kyeounsoo Kim, Peter A. Beerel, Youpyo Hong
2000 Proceedings of the 2000 international symposium on Low power electronics and design - ISLPED '00  
In particular, it skips multiplication by zero and dynamically activates/deactivates required bit-slices of fine-grain bit-partitioned adders using simplified, static-logic-based speculative completion  ...  The design achieves low power and high performance by taking advantage of the typically large fraction of zero and small-valued data in DCT and IDCT applications.  ...  The proposed architecture is partitioned into fine-grain bit-slices to better take advantage of the data statistics than previously developed two-way partitioning [7] .  ... 
doi:10.1145/344166.344621 dblp:conf/islped/KimBH00 fatcat:2z4p3gzexbhf3jza5oxfud45uq
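The zero-skipping behavior this design exploits in hardware can be mirrored in a toy software sketch (illustrative only; the actual design is an asynchronous circuit with bit-sliced adders, which this does not model):

```python
def mat_vec_zero_skip(matrix, vector):
    """Matrix-vector product that skips multiplications involving a zero
    operand, mirroring in software the hardware's zero-skipping idea."""
    result = []
    for row in matrix:
        acc = 0
        for a, x in zip(row, vector):
            if a == 0 or x == 0:
                continue  # zero operand: no multiply-accumulate needed
            acc += a * x
        result.append(acc)
    return result
```

In DCT/IDCT workloads a large fraction of coefficients are zero or small, so a hardware implementation of this skip saves a correspondingly large fraction of switching activity.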

An asynchronous matrix-vector multiplier for discrete cosine transform

Kyeounsoo Kim, P.A. Beerel, Youpyo Hong
2000 ISLPED'00: Proceedings of the 2000 International Symposium on Low Power Electronics and Design (Cat. No.00TH8514)  
In particular, it skips multiplication by zero and dynamically activates/deactivates required bit-slices of fine-grain bit-partitioned adders using simplified, static-logic-based speculative completion  ...  The design achieves low power and high performance by taking advantage of the typically large fraction of zero and small-valued data in DCT and IDCT applications.  ...  The proposed architecture is partitioned into fine-grain bit-slices to better take advantage of the data statistics than previously developed two-way partitioning [7] .  ... 
doi:10.1109/lpe.2000.155295 fatcat:mv3m4nxe55gytfzel7ejlpgzga

Row-based configuration mechanism for a 2-D processing element array in coarse-grained reconfigurable architecture

LeiBo Liu, YanSheng Wang, ShouYi Yin, Min Zhu, Xing Wang, ShaoJun Wei
2014 Science China Information Sciences  
Row-based configuration mechanism for a 2-D processing element array in coarse-grained reconfigurable architecture.  ...  Furthermore, the proposed RCM offers much more efficient storage for the contexts.  ...  Configuration granularity As discussed above, besides the coarse-grained configuration and the fine-grained configuration, aggressive trade-offs have been made in exploiting the configuration granularity  ... 
doi:10.1007/s11432-013-4973-8 fatcat:2f4ajx2u7bdjjbs6nrzw25tqde

Adaptive parallelism for web search

Myeongjae Jeon, Yuxiong He, Sameh Elnikety, Alan L. Cox, Scott Rixner
2013 Proceedings of the 8th ACM European Conference on Computer Systems - EuroSys '13  
The experimental results show that the mean and 95th-percentile response times for queries are reduced by more than 50% under light or moderate load.  ...  Since each server may be processing multiple queries concurrently, we also present an adaptive resource management algorithm that chooses the degree of parallelism at run-time for each query, taking into  ...  from Microsoft Research for the insightful discussions and feedback.  ... 
doi:10.1145/2465351.2465367 dblp:conf/eurosys/JeonHECR13 fatcat:26g6fk3vi5crfa7bosgpuncivi
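The load-dependent parallelism choice described in this abstract can be sketched with a simple policy (a hypothetical heuristic for illustration; the paper's algorithm also accounts for per-query cost and parallelization efficiency):

```python
def choose_parallelism(queue_length, max_workers=8):
    """Pick a per-query degree of parallelism from current load:
    parallelize aggressively when the server is lightly loaded,
    and fall back toward sequential execution as the queue grows."""
    if queue_length == 0:
        return max_workers  # idle server: give one query all workers
    return max(1, max_workers // (queue_length + 1))
```

The key property is the one the paper targets: under light load, spare cores cut tail latency; under heavy load, running queries sequentially preserves throughput.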

A GPGPU compiler for memory optimization and parallelism management

Yi Yang, Ping Xiang, Jingfei Kong, Huiyang Zhou
2010 SIGPLAN notices  
Our optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread block remapping or address-offset insertion for partition-camping elimination.  ...  Acknowledgements We thank the anonymous reviewers and Professor Vivek Sarkar for their valuable comments to improve our paper. This work is supported by an NSF CAREER award CCF-0968667.  ... 
doi:10.1145/1809028.1806606 fatcat:olbf6a5zuvcwnnkyqoiw5lyrse

A GPGPU compiler for memory optimization and parallelism management

Yi Yang, Ping Xiang, Jingfei Kong, Huiyang Zhou
2010 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation - PLDI '10  
Our optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread block remapping or address-offset insertion for partition-camping elimination.  ...  Acknowledgements We thank the anonymous reviewers and Professor Vivek Sarkar for their valuable comments to improve our paper. This work is supported by an NSF CAREER award CCF-0968667.  ... 
doi:10.1145/1806596.1806606 dblp:conf/pldi/YangXKZ10 fatcat:d36xrccpdra6de3s2cvganylf4

Using Aggressor Thread Information to Improve Shared Cache Management for CMPs

Wanli Liu, D. Yeung
2009 2009 18th International Conference on Parallel Architectures and Compilation Techniques  
To make AGGRESSOR-VT feasible for real systems, we develop a sampling algorithm that "learns" the identity of aggressor threads via runtime performance feedback.  ...  Techniques like cache partitioning can address this problem by performing explicit allocation to prevent aggressor threads from taking over the cache.  ...  ACKNOWLEDGMENTS The authors would like to thank the anonymous reviewers for their helpful comments, and Aamer Jaleel, Bruce Jacob, Meng-Ju Wu, and Rajeev Barua for insightful discussion.  ... 
doi:10.1109/pact.2009.13 dblp:conf/IEEEpact/LiuY09 fatcat:ke4rbiln4zafvm52zvquvctv4q
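The explicit-allocation idea this abstract mentions can be sketched as a toy way-partitioning policy (illustrative names and quotas; the paper's AGGRESSOR-VT mechanism identifies aggressors dynamically via runtime sampling rather than a static flag):

```python
def assign_ways(threads, total_ways, aggressor_quota=1):
    """Toy way-partitioning policy: threads flagged as aggressors get a
    small fixed quota of cache ways; remaining ways are split evenly
    among the other (victim) threads to protect their occupancy."""
    aggressors = [t for t in threads if t["aggressor"]]
    victims = [t for t in threads if not t["aggressor"]]
    alloc = {t["id"]: aggressor_quota for t in aggressors}
    remaining = total_ways - aggressor_quota * len(aggressors)
    share = remaining // max(1, len(victims))
    for t in victims:
        alloc[t["id"]] = share
    return alloc
```

For example, with one aggressor and two victims sharing a 16-way cache, the aggressor is capped at its quota and the victims split the rest, preventing the aggressor from taking over the cache.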

FORMS: Fine-grained Polarized ReRAM-based In-situ Computation for Mixed-signal DNN Accelerator [article]

Geng Yuan, Payman Behnam, Zhengang Li, Ali Shafiee, Sheng Lin, Xiaolong Ma, Hang Liu, Xuehai Qian, Mahdi Nazm Bojnordi, Yanzhi Wang, Caiwen Ding
2021 arXiv   pre-print
To achieve high accuracy, we propose to use fine-grained sub-array columns, which provide a unique opportunity for input zero-skipping, significantly avoiding unnecessary computations.  ...  To better solve this problem, this paper proposes FORMS, a fine-grained ReRAM-based DNN accelerator with polarized weights.  ...  At the hardware level, we design a fine-grained DNN accelerator architecture leveraging fine-grained computations.  ... 
arXiv:2106.09144v1 fatcat:qsn6nmh6u5entbk5qjdm6fkcoe

SPARTA: A Divide and Conquer Approach to Address Translation for Accelerators [article]

Javier Picorel, Seyed Alireza Sanaee Kohroudi, Zi Yan, Abhishek Bhattacharjee, Babak Falsafi, Djordje Jevdjic
2020 arXiv   pre-print
Performing the translation for memory accesses on the memory side allows SPARTA to overlap data fetch with translation, and avoids the replication of TLB entries for data shared among accelerators.  ...  To further improve the performance and efficiency of the memory-side translation, SPARTA logically partitions the memory space, delegating translation to small and efficient per-partition translation hardware  ...  SPARTA preserves the paged organization and fine-grained memory protection.  ... 
arXiv:2001.07045v1 fatcat:kra7tnl5nza7dga5qut2vddghy

MLP-Aware Instruction Queue Resizing: The Key to Power-Efficient Performance [chapter]

Pavlos Petoumenos, Georgia Psychou, Stefanos Kaxiras, Juan Manuel Cebrian Gonzalez, Juan Luis Aragon
2010 Lecture Notes in Computer Science  
We propose a novel mechanism that deals with this realization by collecting fine-grain information about the maximum IQ resizing that does not affect the MLP of the program.  ...  We compare our technique to a previously proposed non-MLP-aware management technique and our results show a significant increase in EDP savings for most benchmarks of the SPEC2000 suite.  ...  We simulate 300M instructions after skipping 1B instructions for all benchmarks except for vpr, twolf, and mcf, where we skip 2B instructions, and ammp, where we skip 3B instructions.  ... 
doi:10.1007/978-3-642-11950-7_11 fatcat:sovpsn4ezffd7nzkeaac4ekj5u

Data-centric execution of speculative parallel programs

Mark C. Jeffrey, Suvinay Subramanian, Maleen Abeydeera, Joel Emer, Daniel Sanchez
2016 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)  
Multicore systems must exploit locality to scale, scheduling tasks to minimize data movement.  ...  A hint is an abstract integer, given when a speculative task is created, that denotes the data that the task is likely to access.  ...  William Hasenplaugh and Chia-Hsin Chen graciously shared the serial code for the color [30] and nocsim benchmarks.  ... 
doi:10.1109/micro.2016.7783708 dblp:conf/micro/JeffreySAES16 fatcat:b6nbzdafhzcazp74ify77niwa4
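The hint mechanism this abstract describes — an abstract integer attached at task creation that names the data a task will likely touch — can be sketched as hint-based placement (a simplified illustration; the real system maps hints to tiles in hardware and handles speculation and load balance):

```python
from collections import defaultdict

def schedule_by_hint(tasks, num_tiles):
    """Group speculative tasks by their spatial hint so tasks expected
    to access the same data are dispatched to the same tile."""
    queues = defaultdict(list)
    for task in tasks:
        tile = task["hint"] % num_tiles  # same hint -> same tile
        queues[tile].append(task["name"])
    return dict(queues)

tasks = [{"name": "t0", "hint": 5},
         {"name": "t1", "hint": 6},
         {"name": "t2", "hint": 13}]
placement = schedule_by_hint(tasks, num_tiles=4)
# t0 and t2 share a hint class, so they land on the same tile
```

Because tasks with equal (or congruent) hints serialize on one tile, this placement both reduces data movement and avoids conflicts among tasks likely to touch the same data.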

Trading cache hit rate for memory performance

Wei Ding, Mahmut Kandemir, Diana Guttman, Adwait Jog, Chita R. Das, Praveen Yedlapalli
2014 Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14  
Second, it discusses a more aggressive strategy that sacrifices some cache performance in order to further improve row-buffer performance (i.e., it trades cache performance for memory system performance).  ...  First, it presents a compiler-runtime cooperative data layout optimization approach that takes as input an irregular program that has already been optimized for cache locality and generates an output code  ...  If the assignment fails due to the limited memory positions in a row, we simply skip this edge and proceed with the next one.  ... 
doi:10.1145/2628071.2628082 dblp:conf/IEEEpact/DingKGJDY14 fatcat:a7j3vxei3bas3bbmasntq2bcua

Hardware support for protective and collaborative cache sharing

Raj Parihar, Jacob Brock, Chen Ding, Michael C. Huang
2016 SIGPLAN notices  
We show that rationing provides good resource protection and full cache utilization of the shared cache for a variety of co-runs.  ...  This paper explores cache management policies that allow conservative sharing to protect the cache occupancy for individual programs, yet enable full cache utilization whenever there is an opportunity  ...  Coarse-grained rationing can victimize a less aggressive application in the sets in which its data occupies all cache ways.  ... 
doi:10.1145/3241624.2926705 fatcat:6fp3bq6cc5bhjk36b76wsfsmsm
Showing results 1 — 15 out of 1,771 results