343 Hits in 6.2 sec

An experimental comparison of cache-oblivious and cache-conscious programs

Kamen Yotov, Tom Roeder, Keshav Pingali, John Gunnels, Fred Gustavson
2007 Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures - SPAA '07  
An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem?  ...  Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cache-conscious programs.  ...  Acknowledgements: We would like to thank Matteo Frigo and Gianfranco Bilardi for useful discussions.  ... 
doi:10.1145/1248377.1248394 dblp:conf/spaa/YotovRPGG07 fatcat:jysuedsrvnhx3awp2w6vya3agq

Ideal and Predictable Hit Ratio for Matrix Transposition in Data Caches

Alba Pedro-Zapater, Clemente Rodríguez, Juan Segarra, Rubén Gran Tejero, Víctor Viñals-Yúfera
2020 Mathematics  
We also analyze the energy consumption and execution time of matrix transposition on real hardware with pseudo-LRU (PLRU) caches.  ...  Our analytical hit/miss assessment enables the usage of a data cache for matrix transposition in real-time systems, since the number of misses in the worst case is bound.  ...  An Experimental Comparison of Cache-oblivious and Cache-conscious Programs Reference [10] compares experimentally cache-oblivious and cache-conscious (tiling transformation applied) programs for matrix  ... 
doi:10.3390/math8020184 fatcat:7ig76wmj65hvbnzcbyra33wpwy

Improving locality of nonserial polyadic dynamic programming

Guangming Tan, Ninghui Sun, Dongbo Bu
2006 Proceedings 20th IEEE International Parallel & Distributed Processing Symposium  
We exploit the property of the algorithm to develop a high performance implementation using the combination of cache-oblivious and cache-conscious strategy.  ...  The efficiency in our improved algorithm comes from two sources: reducing the number of cache misses and TLB misses.  ...  In this paper, we develop a high performance implementation of nonserial polyadic dynamic programming algorithm through the combination of cache conscious and oblivious approaches.  ... 
doi:10.1109/ipdps.2006.1639718 dblp:conf/ipps/TanSB06 fatcat:zpaudytz7jfdhjrbb2esdtmpsi

Automatically Tuned Dynamic Programming with an Algorithm-by-Blocks

Jiajia Li, Guangming Tan, Mingyu Chen
2010 2010 IEEE 16th International Conference on Parallel and Distributed Systems  
First, an algorithm-by-blocks for dynamic programming is designed to facilitate optimizing with well-known techniques including cache and register tiling.  ...  In this paper, we propose an Automatically Tuned Dynamic Programming (ATDP) to optimize performance of dynamic programming algorithm across various architectures.  ...  [20] found that cache-oblivious programs are defeated by cache conscious ones for dense linear algebra, even though the cache-oblivious programs are highly optimized.  ... 
doi:10.1109/icpads.2010.117 dblp:conf/icpads/LiTC10 fatcat:v3h7txafvfb57faqhnaxwv6jhm

Efficiency vs. portability in cluster-based network servers

Enrique V. Carrera, Ricardo Bianchini
2001 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming - PPoPP '01  
To fill this gap, in this paper we use modeling and experimentation to study this tradeoff in the context of an interesting class of content-based network servers, the locality-conscious servers, under  ...  Efficiency and portability are conflicting objectives for cluster-based network servers that distribute the clients' requests across the cluster based on the actual content requested.  ...  Acknowledgements We would like to thank Liviu Iftode, Rich Martin, Thu Nguyen, and Tao Yang for discussions on the topic of this paper.  ... 
doi:10.1145/379539.379589 dblp:conf/ppopp/CarreraB01 fatcat:tckl2xig5bayrcl5zvif3bhzim

Improving main memory hash joins on Intel Xeon Phi processors

Saurabh Jha, Bingsheng He, Mian Lu, Xuntao Cheng, Huynh Phung Huynh
2015 Proceedings of the VLDB Endowment  
Second, hardware oblivious algorithms can outperform hardware conscious algorithms on a wide parameter window.  ...  For each camp, we study the impact of architectural features and software optimizations on Xeon Phi in comparison with results on multi-core CPUs.  ...  Acknowledgment We would like to thank the authors of [7] and [8] for providing the source code. This work is supported by a MoE AcRF Tier 2 grant (MOE2012-T2-2-067) in Singapore.  ... 
doi:10.14778/2735703.2735704 fatcat:ion4mquxq5difphvo3fe6pqfma

Cache oblivious algorithms for nonserial polyadic programming

Guangming Tan, Shengzhong Feng, Ninghui Sun
2007 Journal of Supercomputing  
Experimental results on several platforms show that the optimized algorithms improve the cache performance and achieves speedups of 2-10 times.  ...  Based on the ideal cache model of the cache oblivious algorithm, the approximate bound of cache misses is given by O(n^3 / (L√Z)).  ...  Both our work and previous research [14, 19] show that the algorithmic transformation of dynamic programming is an efficient and important approach to optimize irregular programs.  ... 
doi:10.1007/s11227-007-0106-8 fatcat:3dvaov2xhzhcplqc6cfa5uhdvi

Efficient sorting using registers and caches

Rajiv Wickremesinghe, Lars Arge, Jeffrey S. Chase, Jeffrey Scott Vitter
2002 ACM Journal of Experimental Algorithmics  
Inadequate models lead to poor algorithmic choices and an incomplete understanding of algorithm behavior on real machines.  ...  Common machine models for algorithm analysis do not reflect many of the features of these systems, e.g., large register sets, lockup-free caches, cache hierarchies, associativity, cache line fetching,  ...  Figs. 3 and 4. Comparison with cache-conscious sort programs of LaMarca and Ladner [1997]. Time per key (microseconds/key) vs. number of keys (×2^20 keys).  ... 
doi:10.1145/944618.944627 fatcat:yewe6tclkrhn7lbvt3ktip343q

An Experimental Study of Self-Optimizing Dense Linear Algebra Software

M. Kulkarni, K. Pingali
2008 Proceedings of the IEEE  
The cache-oblivious approach uses divide-and-conquer to perform approximate blocking; how well does approximate blocking perform compared to precise blocking?  ...  Each step of divide-and-conquer generates problems of smaller size.  ...  Acknowledgment The authors would like to acknowledge the contributions of a number of people who participated in the work discussed here and reported in earlier papers: M. Garzaran, J. Gunnels, F.  ... 
doi:10.1109/jproc.2008.917732 fatcat:4kyi7ju3vzcgxjgscujjzmoo6y

Redesigning the string hash table, burst trie, and BST to exploit cache

Nikolas Askitis, Justin Zobel
2010 ACM Journal of Experimental Algorithmics  
We then replace the chains of the hash table, burst trie, and BST using dynamic arrays, creating new cache-conscious array representations called the array hash, array burst trie, and array BST, respectively  ...  Our results show that, in an architecture with cache, our array data structures can yield startling improvements over their standard, compact, and clustered chained variants.  ...  ACKNOWLEDGMENTS We thank the anonymous reviewers of this article and software.  ... 
doi:10.1145/1671970.1921704 fatcat:eimxbg3zvjcpfefwa7jr7wx76i

Efficient Sorting Using Registers and Caches [chapter]

Lars Arge, Jeff Chase, Jeffrey S. Vitter, Rajiv Wickremesinghe
2001 Lecture Notes in Computer Science  
Inadequate models lead to poor algorithmic choices and an incomplete understanding of algorithm behavior on real machines.  ...  Common machine models for algorithm analysis do not reflect many of the features of these systems, e.g., large register sets, lockup-free caches, cache hierarchies, associativity, cache line fetching,  ...  Figs. 3 and 4. Comparison with cache-conscious sort programs of LaMarca and Ladner [1997]. Time per key (microseconds/key) vs. number of keys (×2^20 keys).  ... 
doi:10.1007/3-540-44691-5_5 fatcat:ilvnnmf6xvbmrkpaxnlb7hnhoa

Lists Revisited: Cache Conscious STL Lists [chapter]

Leonor Frias, Jordi Petit, Salvador Roura
2006 Lecture Notes in Computer Science  
We present three cache conscious implementations of STL standard compliant lists.  ...  In this paper, we show the competitiveness of our implementations with an extensive experimental analysis. This shows, for instance, 5-10 times faster traversals and 3-5 times faster internal sort.  ...  We have implemented and experimentally evaluated three different variants of cache conscious lists supporting fully standard iterator functionality.  ... 
doi:10.1007/11764298_11 fatcat:766m6ty2ejgulgtmlt52327y3u

Lists revisited

Leonor Frias, Jordi Petit, Salvador Roura
2009 ACM Journal of Experimental Algorithmics  
We present three cache conscious implementations of STL standard compliant lists.  ...  In this paper, we show the competitiveness of our implementations with an extensive experimental analysis. This shows, for instance, 5-10 times faster traversals and 3-5 times faster internal sort.  ...  We have implemented and experimentally evaluated three different variants of cache conscious lists supporting fully standard iterator functionality.  ... 
doi:10.1145/1498698.1564505 fatcat:fa7cdtpqxfaarho47hat5regju

Towards pB+Trees in the Field: Implementation Choices and Performance

Árni Már Jónsson, Björn Þór Jónsson
2006 Evaluation of Data Management Systems  
This paper is part of this trend towards the deployment of cache-conscious structures "in the field".  ...  In particular, B+-trees have been shown to utilize cache memory poorly, triggering the development of many cache-conscious indices.  ...  parts of the program and where cache-misses occur.  ... 
dblp:conf/expdb/JonssonJ06 fatcat:fueegi56trhafo7ecnqh3q4du4

Dynamic Data Layouts for Cache-Conscious Implementation of a Class of Signal Transforms

N. Park, V.K. Prasanna
2004 IEEE Transactions on Signal Processing  
Experimental results show that our FFT and WHT achieve performance improvement of up to 3.52 times over other state-of-the-art FFT and WHT packages.  ...  In this paper, we develop a cache-conscious technique, called a dynamic data layout, to improve the performance of large signal transforms.  ...  Moura, and M. Veloso. They also thank G. Govindu, B. Hong, and B. Gundala for their editorial assistance.  ... 
doi:10.1109/tsp.2004.828946 fatcat:3kw3rbjumrg33gbllyisj6pdri