2,539 Hits in 3.9 sec

Bit Reversal on Uniprocessors

Alan H. Karp
1996 SIAM Review  
This paper collects 30 methods for bit reversing an array.  ...  Many versions of the fast Fourier transform require a reordering of either the input or the output data that corresponds to reversing the order of the bits in the array index.  ...  For example, they have used memory chips that are considerably slower than those used in the processor.  ...
doi:10.1137/1038001 fatcat:wkxtljudwrdglkykaoxb3o6myi

A Highly Efficient Multicore Floating-Point FFT Architecture Based on Hybrid Linear Algebra/FFT Cores

Ardavan Pedram, John D. McCalpin, Andreas Gerstlauer
2014 Journal of Signal Processing Systems  
Starting with a highly efficient hybrid linear algebra/FFT core, we co-design the on-chip memory hierarchy, on-chip interconnect, and FFT algorithms for a multicore FFT processor.  ...  efficiencies of 30 GFLOPS/W and 2.66 GFLOPS/mm², respectively.  ...  Acknowledgements Authors wish to thank John Brunhaver for providing synthesis results for the raw components of the Transposer.  ...
doi:10.1007/s11265-014-0896-x fatcat:ce5vw2a4dne5bmlkwbjuxdr75q

Fast Bit-Reversals on Uniprocessors and Shared-Memory Multiprocessors

Zhao Zhang, Xiaodong Zhang
2001 SIAM Journal on Scientific Computing  
In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations of bit-reversals.  ...  , are cache-optimal and fast. (2) We show that our padding methods outperform other software-oriented methods, and we believe they are the fastest in terms of minimizing both CPU and memory access cycles  ...  We feel fortunate to have had Alan Karp's expert views and comments on this work. We have also exchanged bit-reversal programs of different methods to compare the performance.  ... 
doi:10.1137/s1064827599359709 fatcat:6t7kyn7ghfe7nhqrpmjm6gvzha

Optimized strategies for mapping three-dimensional FFTs onto CUDA GPUs

Jing Wu, Joseph JaJa
2012 2012 Innovative Parallel Computing (InPar)  
The bandwidths achieved by our algorithms reach over 90 GB/s for the GTX280 and around 140 GB/s for the GTX480.  ...  We exploit the high-degree of multithreading offered by the CUDA environment while carefully managing the multiple levels of the memory hierarchy in such a way that: (i) all global memory accesses are  ...  with bit-reversed order output.  ... 
doi:10.1109/inpar.2012.6339608 fatcat:tl4cwrovvjgvfkbgkqfthkcqyi

Towards a Theory of Cache-Efficient Algorithms [article]

Sandeep Sen, Siddhartha Chatterjee, Neeraj Dumir
2000 arXiv   pre-print
We further extend our model to multiple levels of cache with limited associativity and present optimal algorithms for matrix transpose and sorting.  ...  Our techniques may be used for systematic exploitation of the memory hierarchy starting from the algorithm design stage, and dealing with the hitherto unresolved problem of limited associativity.  ...  Acknowledgments We are grateful to Alvin Lebeck for valuable discussions related to present and future trends of different aspects of memory hierarchy design.  ... 
arXiv:cs/0010007v1 fatcat:zb2jij6j6rc3bk4o7z7pentngi

The potential of the cell processor for scientific computing

Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick
2006 Proceedings of the 3rd conference on Computing frontiers - CF '06  
Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture.  ...  First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs.  ...  The authors gratefully thank Bracy Elton and Adrian Tate for their assistance in obtaining X1E FFT performance data, and Eduardo D'Azevedo for providing us with an optimized X1E SpMV implementation.  ... 
doi:10.1145/1128022.1128027 dblp:conf/cf/WilliamsSOKHY06 fatcat:vmlyxmeyazgrrn5vseohoaohdi

Permuting Web and Social Graphs

Paolo Boldi, Massimo Santini, Sebastiano Vigna
2009 Internet Mathematics  
In particular, we show that for the transposed web graph URL ordering is significantly less effective, and that some new mixed orderings combining host information and Gray/lexicographic orderings outperform  ...  all previous methods: in some large transposed graphs they yield the quite incredible compression rate of 1 bit per link.  ...  Second, we have shown that transposed graphs behave in a radically different manner when permuted with our techniques, giving rise to extreme compression rates.  ... 
doi:10.1080/15427951.2009.10390641 fatcat:urnthxifszarnor5neant2rpwa

Towards a theory of cache-efficient algorithms

Sandeep Sen, Siddhartha Chatterjee, Neeraj Dumir
2002 Journal of the ACM  
Our techniques may be used for systematic exploitation of the memory hierarchy starting from the algorithm design stage, and for dealing with the hitherto unresolved problem of limited associativity.  ...  We further extend our model to multiple levels of cache with limited associativity and present optimal algorithms for matrix transpose and sorting.  ...  We are grateful to Alvin Lebeck for valuable discussions related to present and future trends of different aspects of memory hierarchy design.  ... 
doi:10.1145/602220.602225 fatcat:72kwpxglfvfdxb4qqhnnzpqvte

In-place transposition of rectangular matrices on accelerators

I-Jui Sung, Juan Gómez-Luna, José María González-Linares, Nicolás Guil, Wen-Mei W. Hwu
2014 Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '14  
Intuitively, in-place transposition should be a good fit for GPU architectures due to limited available on-board memory capacity and high throughput.  ...  Additionally, when transposition is done as part of the memory transfer between GPU and host, our staged approach allows hiding transposition overhead by overlap with PCIe transfer.  ...  Acknowledgments This project was partly supported by the STARnet Center for  ... 
doi:10.1145/2555243.2555266 dblp:conf/ppopp/SungGGGH14 fatcat:3rot3hcnzrgdjksdwfezafcwha

Conformal Computing: Algebraically connecting the hardware/software boundary using a uniform approach to high-performance computation for software and hardware applications [article]

Lenore R. Mullin, James E. Raynolds
2008 arXiv   pre-print
We present a systematic, algebraically based, design methodology for efficient implementation of computer programs optimized over multiple levels of the processor/memory and network hierarchy.  ...  Extensive discussion and benchmark results are presented for the Fast Fourier Transform and other important algorithms.  ...  That is, the ONF is a specification given: Iteration, Sequence, and Control for each level of processor/memory hierarchy desired.  ... 
arXiv:0803.2386v1 fatcat:flbqoemiwfh4ljar6jnq3ttpum

Speeding up decimal multiplication [article]

Viktor Krapivensky
2020 arXiv   pre-print
We also present a simple cache-efficient algorithm for in-place 2n × n or n × 2n matrix transposition, the need for which arises in the "six-step algorithm" variation of the matrix Fourier algorithm, and  ...  Another finding is that use of two prime moduli instead of three makes sense even considering the worst case of increasing the size of the input, and makes for simpler answer recovery.  ...  Algorithm for bit-reversal permutation See [17] for overview of algorithms for bit-reversal permutation.  ... 
arXiv:2011.11524v4 fatcat:smi7l3qehbfwhi673pyinegu64

Scientific Computing Kernels on the Cell Processor

Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick
2007 International journal of parallel programming  
Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture.  ...  First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs.  ...  The authors gratefully thank Bracy Elton and Adrian Tate for their assistance in obtaining X1E FFT performance data, and Eduardo D'Azevedo for providing us with an optimized X1E SpMV implementation.  ... 
doi:10.1007/s10766-007-0034-5 fatcat:e26uq4azkzf4hizdwfu6mt6hqy

Practically efficient methods for performing bit-reversed permutation in C++11 on the x86-64 architecture [article]

Christian Knauth, Boran Adas, Daniel Whitfield, Xuesong Wang, Lydia Ickler, Tim Conrad, Oliver Serang
2017 arXiv   pre-print
The bit-reversed permutation is a famous task in signal processing and is key to efficient implementation of the fast Fourier transform.  ...  approach, which reduces the bit-reversed permutation to smaller bit-reversed permutations and a square matrix transposition.  ...  Acknowledgements We are grateful to Thimo Wellner and Guy Ling for their contributions. This paper grew out of the masters course in Scientific Computing taught by Oliver Serang.  ... 
arXiv:1708.01873v1 fatcat:vx3zpajytrcf7o3hyyk6weozum

Emerging Database Systems in Support of Scientific Data [chapter]

Per Svensson, Peter Boncz, Milena Ivanova, Martin Kersten, Niels Nes, Doron Rotem
2009 Scientific Data Management  
Next, the chapter covers in detail the architecture and design considerations of a particular (open source) vertical database system, called MonetDB.  ...  This is followed by an example of using MonetDB for the SkyServer data, and the query processing improvements it offers.  ...  the resulting column of positions for the bit-vector column can be copied from the appropriate bit-vector for the value that matched the predicate.  ... 
doi:10.1201/9781420069815-c7 fatcat:ft3mckhzr5agfhopo6awmhwk7e

Astronomical Data Preprocessing Implementation Based on FPGA and Data Transformation Strategy for the FAST Telescope as a Giant CPS

Yuefeng Song, Yongxin Zhu, Junjie Hou, Sen Du, Shijin Song
2020 IEEE Access  
This makes Bitshuffle on FPGAs a candidate for meeting the computational and energy efficiency constraints of radio telescopes and provide reference for CPSS facing the same situation.  ...  In the paper, we propose the implementation of this algorithm on Field Programmable Gate Array (FPGA) and present a unique data transformation strategy to turn raw FAST data in classic FITS format into  ...  The process is completely reversible for all components mapped to the new data format.  ...
doi:10.1109/access.2020.2981816 fatcat:3bzxf6bq2nemtexulru6ttifym
Showing results 1–15 out of 2,539 results