The Cache-Oblivious Gaussian Elimination Paradigm: Theoretical Framework, Parallelization and Experimental Evaluation

Rezaul Alam Chowdhury, Vijaya Ramachandran
2010 Theory of Computing Systems  
We consider triply-nested loops of the type that occur in the standard Gaussian elimination algorithm, which we denote by GEP (or the Gaussian Elimination Paradigm).  ...  Cache-oblivious I-GEP computes in-place and solves most of the known applications of GEP including Gaussian elimination and LU-decomposition without pivoting and Floyd-Warshall all-pairs shortest paths  ...  In this paper we introduce a cache-oblivious framework, which we call GEP or the Gaussian Elimination Paradigm.  ... 
doi:10.1007/s00224-010-9273-8 fatcat:emgjqjjvh5dg3gq6bcexmu4jkq


Cyril Gavoille, Boaz Patt-Shamir, Christian Scheideler
2010 Theory of Computing Systems  
The conference defines parallelism very broadly and therefore includes papers on various problems in wired and wireless networks as well as parallel and multicore systems, cache-oblivious algorithms, network  ...  games and concurrent and parallel programming approaches.  ...  Framework, Parallelization and Experimental Evaluation" presents parallel variants of the standard Gaussian elimination algorithm that achieve a good speed-up and match the sequential caching performance  ... 
doi:10.1007/s00224-010-9284-5 fatcat:twbodq7n7ngkvmxxppywh2pmny

Cache-Oblivious Dynamic Programming for Bioinformatics

Rezaul Alam Chowdhury, Hai-Son Le, Vijaya Ramachandran
2010 IEEE/ACM Transactions on Computational Biology & Bioinformatics  
For each of these problems we present cache-oblivious algorithms that match the best-known time complexity, match or improve the best-known space complexity, and improve significantly over the cache-efficiency  ...  We present efficient cache-oblivious algorithms for some well-studied string problems in bioinformatics including the longest common subsequence, global pairwise sequence alignment and 3-way sequence alignment  ...  ACKNOWLEDGMENT We thank Mike Brudno for the CFTR DNA sequences, Robin Gutell for the rRNA sequences, and David Zhao for the MED-Knudsen, MED-ukk.alloc and MED-ukk.checkp code.  ... 
doi:10.1109/tcbb.2008.94 pmid:20671320 fatcat:vn2vc4qlqnervbqj3veiyt36e4

PCOT: Cache Oblivious Tiling of Polyhedral Programs [article]

Waruna Ranasinghe, Nirmal Prajapati, Tomofumi Yuki, Sanjay Rajopadhye
2018 arXiv   pre-print
This paper studies two variants of tiling: iteration space tiling (or loop blocking) and cache-oblivious methods that recursively split the iteration space with divide-and-conquer.  ...  The conclusion is that cache oblivious code is most useful when the aim is to have reduced off-chip memory accesses, e.g., lower energy, albeit certain situations that diminish its effectiveness exist.  ...  Common Subsequence [10, 13], global pairwise sequence alignment problem in bioinformatics [12], Gaussian Elimination Paradigm [14], etc.  ... 
arXiv:1802.00166v1 fatcat:tsnkzovxmzdg5pbo3lotif3hoe

Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors

Kaushik Datta, Shoaib Kamil, Samuel Williams, Leonid Oliker, John Shalf, Katherine Yelick
2009 SIAM Review  
Our work targets cache reuse methodologies across single and multiple stencil sweeps, examining cache-aware algorithms as well as cache-oblivious techniques on the Intel Itanium2, AMD Opteron, and IBM  ...  We also show that a cache-aware implementation is significantly faster than a cache-oblivious approach, while the explicitly managed memory on Cell enables the highest overall efficiency: Cell attains  ...  The authors would like to thank Parry Husbands for his many contributions.  ... 
doi:10.1137/070693199 fatcat:3ov7comgd5c2ziqpqnbf6wmrg4

A high-level characterisation and generalisation of communication-avoiding programming techniques [article]

Tobias Weinzierl
2019 arXiv   pre-print
They have changed our notion of performance and, hence, of what a good code is: Good code has, first of all, to be able to exploit the unprecedented levels of parallelism.  ...  We characterise and classify the field of communication-avoiding algorithms.  ...  As a consequence, an algorithm becomes inherently cache-optimal (cache oblivious)-the probability that the head of a stack remains in a cache is always very high as long as the number of stacks remains  ... 
arXiv:1909.10853v2 fatcat:72wuro6bhjhmfnilmvulwx4eqm

D7.5: HPC Programming Techniques

Cevdet Aykanat, Antun Balaz, Iris Christadler, Ivan Girotto, Jose Gracia, Vladimir Slavnic, Andy Sunderland, Ata Türk
2012 Zenodo  
This task worked with users to implement new programming techniques, paradigms and algorithms for Tier-1 and Tier-0 systems, which have the potential to facilitate significant improvements in their applications  ...  the hybridization of important user codes to test the mixed OpenMP and MPI programming model.  ...  This program retains the cache oblivious property.  ... 
doi:10.5281/zenodo.6552939 fatcat:z2gdhmnojrh6bj2lordyl7lmnq


Martin Maas, Eric Love, Emil Stefanov, Mohit Tiwari, Elaine Shi, Krste Asanovic, John Kubiatowicz, Dawn Song
2013 Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security - CCS '13  
the data needed, and completing and reviewing the collection of information.  ...  Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining  ...  The L1 instruction cache is 2-way associative, the L1 data cache 4-way associative and the L2 cache 8-way associative We enforced σ 2 ≥ 1 since otherwise the optimization can result in Gaussians with  ... 
doi:10.1145/2508859.2516692 dblp:conf/ccs/MaasLSTSAKS13 fatcat:q2y4y6qas5fobe5nb47xla42be

Solving path problems on the GPU

Aydın Buluç, John R. Gilbert, Ceren Budak
2010 Parallel Computing  
The blocked recursive elimination strategy we use is applicable to a class of algorithms (such as all-pairs shortest-paths, transitive closure, and LU decomposition without pivoting) having similar data  ...  The impressive computational power and memory bandwidth of the GPU make it an attractive platform to run such computationally intensive algorithms.  ...  Acknowledgments We acknowledge the kind permission of Charles Leiserson and CilkArts to use an alpha release of the Cilk++ language.  ... 
doi:10.1016/j.parco.2009.12.002 fatcat:gpdffk6s4fa4tifrawtmq5x22a

Minimizing Communication in All-Pairs Shortest Paths

Edgar Solomonik, Aydin Buluc, James Demmel
2013 2013 IEEE 27th International Symposium on Parallel and Distributed Processing  
The 2.5D APSP algorithm, which is based on the divide-and-conquer paradigm, satisfies both of these requirements: it can utilize any extra available memory to perform asymptotically less communication,  ...  Our implementation achieves impressive performance and scaling to 24,576 cores of a Cray XE6 supercomputer by utilizing well-tuned intra-node kernels within the distributed memory algorithm.  ...  The Gaussian elimination paradigm of Chowdhury and Ramachandran [13] provides a cache-oblivious framework for these problems, similar to Toledo's recursive blocked LU factorization [41].  ... 
doi:10.1109/ipdps.2013.111 dblp:conf/ipps/SolomonikBD13 fatcat:ctp25zrbdfa2fjsefxpdjw2jz4

Load-balanced and locality-aware scheduling for data-intensive workloads at extreme scales

Ke Wang, Kan Qiao, Iman Sadooghi, Xiaobing Zhou, Tonglin Li, Michael Lang, Ioan Raicu
2015 Concurrency and Computation  
We implemented the technique in MATRIX, a distributed MTC task execution framework.  ...  In this work, we devise an analytical sub-optimal upper bound of the proposed technique; compare MATRIX with other scheduling systems; and explore the scalability of the technique at extreme scales.  ...  We implemented the technique in MATRIX [22, 64, 70], a MTC task execution framework, and evaluated up to 200 cores.  ... 
doi:10.1002/cpe.3617 fatcat:ka2sslewpzbyjc4n3yuwq27gsi

Distributed clustering of ubiquitous data streams

Pedro Pereira Rodrigues, João Gama
2013 Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery  
, being subject to the same interactions required by previous static and centralized applications.  ...  Conflict of interest: The authors have declared no conflicts of interest for this article. controlled by both the human user and a common centralized control process.  ...  ACKNOWLEDGMENTS This work is financed by the ERDF-European Regional Development Fund through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the FCT-Fundação  ... 
doi:10.1002/widm.1109 fatcat:4ybb44a6szgkxkhtwni3z3apw4

Reconstructing Hardware Transactional Memory for Workload Optimized Systems [chapter]

Kunal Korgaonkar, Prabhat Jain, Deepak Tomar, Kashyap Garimella, Veezhinathan Kamakoti
2011 Lecture Notes in Computer Science  
The two-day technical program of APPT 2011 provided an excellent venue capturing the state of the art and practice in parallel architectures, parallel software and distributed and cloud computing.  ...  With the continuity of Moore's law in the multicore era and the emerging cloud computing, parallelism has been pervasively available almost everywhere, from traditional processor pipelines to large-scale  ...  The research was funded by Intel  ... 
doi:10.1007/978-3-642-24151-2_1 fatcat:32cx745cn5cfdm5sbeah6eyiey

A Survey of General-Purpose Computation on Graphics Hardware [article]

John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, Timothy J. Purcell
2005 Eurographics State of the Art Reports  
We begin with the technical motivations that underlie general-purpose computation on graphics processors (GPGPU) and describe the hardware and software developments that have led to the recent interest  ...  Second, we survey and categorize the latest developments in general-purpose application development on graphics hardware.  ...  Acknowledgements Thanks to Ian Buck, Jeff Bolz, Daniel Horn, Marc Pollefeys, and Robert Strzodka for their thoughtful comments, and to the anonymous reviewers for their helpful and constructive criticism  ... 
doi:10.2312/egst.20051043 fatcat:7jved5a5v5ezjpvfgxtye5xscu

Software challenges in extreme scale systems

Vivek Sarkar, William Harrod, Allan E Snavely
2009 Journal of Physics, Conference Series  
" and similar paradigm shifts.  ...  probably represents the first true parallel processor to fly in space, and one of the earliest examples of multi-threaded architectures.  ...  The static mapping eliminated much of the overhead.  ... 
doi:10.1088/1742-6596/180/1/012045 fatcat:iukutry2dvbitfdh6ng7kgz564