Supporting x86-64 address translation for 100s of GPU lanes

Jason Power, Mark D. Hill, David A. Wood
2014 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)  
However, even a modest GPU might need 100s of translations per cycle (6 CUs * 64 lanes/CU) with memory access patterns designed for throughput more than locality.  ...  For increased programmability, this memory should be uniformly virtualized, necessitating compatible address translation support for GPU memory references.  ...  Acknowledgements This work is supported in part by the National Science Foundation (CCF-1017650, CNS-1117280, CCF-1218323 and CNS-1302260) and a University of Wisconsin Vilas award.  ... 
doi:10.1109/hpca.2014.6835965 dblp:conf/hpca/PowerHW14 fatcat:sonrdgcadbcwtopl2a4hfh72em

State-of-the-Art and Trends for Computing and Interconnect Network Solutions for HPC and AI

A. Tekin, A.Tuncer Durak, C. Piechurski, D. Kaliszan, F. Aylin Sungur, F. Robertsén, P. Gschwandtner
2021 Zenodo  
The present report provides a consolidated view on the current and mid-term technologies (2019-2022+) for two important components of an HPC/AI system: computing (general purpose processor and accelerators  ...  Since 2000, High Performance Computing (HPC) resources have been extremely homogeneous in terms of underlying processors technologies.  ...  Acknowledgements This work was financially supported by the PRACE project funded in part by the EU's Horizon 2020 Research and Innovation programme (2014-2020) under grant agreement 823767.  ... 
doi:10.5281/zenodo.5717283 fatcat:irgzrdxr6ncijcfxsdb3sdodii

State-of-the-Art and Trends for Computing and Interconnect Network Solutions for HPC and AI

A. Tekin, A.Tuncer Durak, C. Piechurski, D. Kaliszan, F. Aylin Sungur, F. Robertsén, P. Gschwandtner
2021 Zenodo  
The present report provides a consolidated view on the current and mid-term technologies (2019-2022+) for two important components of an HPC/AI system: computing (general purpose processor and accelerators  ...  Since 2000, High Performance Computing (HPC) resources have been extremely homogeneous in terms of underlying processors technologies.  ...  Acknowledgements This work was financially supported by the PRACE project funded in part by the EU's Horizon 2020 Research and Innovation programme (2014-2020) under grant agreement 823767.  ... 
doi:10.5281/zenodo.5534079 fatcat:fdknu7w4mfc5foa4gnmt5vqdna

In-Database Processing and In-Memory Analytics [chapter]

Pethuru Raj, Anupama Raman, Dhivya Nagaraj, Siddhartha Duggirala
2015 Computer Communications and Networks  
We also discuss the results of a hybrid query scheduling when interleaving the execution of the SIMD operators between PIM and x86 processing hardware.  ...  However, this is the first experimental study, in the database community, to discuss the trade-offs of execution time and energy consumption between PIM and x86 in the main query execution systems: materialized  ...  The random access shows low data reuse as at most 32 memory addresses from the 64 possible addresses in the SIMD lanes can be accessed at once.  ... 
doi:10.1007/978-3-319-20744-5_8 fatcat:or56nfiaknhcrkzo2wkklkhqmy

Caracal

Rodrigo Domínguez, Dana Schaa, David Kaeli
2011 Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units - GPGPU-4  
Graphics Processing Units (GPU) have become the platform of choice for accelerating a large range of data parallel and task parallel applications.  ...  Here we target the AMD Evergreen family of GPUs. We discuss the challenges of compatibility and correctness faced by the translator using specific examples.  ...  CUDA Research Centers Program, and by support by the Vice Provost's Office of Research at Northeastern University.  ... 
doi:10.1145/1964179.1964186 dblp:conf/asplos/DominguezSK11 fatcat:p3wi6reknbbajdbtcyic7w5iru

Processing Panorama Video in Real-time

Håkon Kvale Stensland, Vamsidhar Reddy Gaddam, Marius Tennøe, Espen Helgedagsrud, Mikkel Næss, Henrik Kjus Alstad, Carsten Griwodz, Pål Halvorsen, Dag Johansen
2014 International Journal of Semantic Computing (IJSC)  
The P2G framework is designed for multimedia workloads and supports heterogeneous architectures. To demonstrate the feasibility of the framework, we construct a proof-of-concept implementation.  ...  For a very long time, one of the important means of increasing performance was to increase the clock frequency.  ...  The cores have been modified with support for the 64-bit x86 instruction set and support for four-way SMT.  ... 
doi:10.1142/s1793351x14400054 fatcat:hafewx3ekrcfpat2osb67fjugi

Multi2Sim

Rafael Ubal, Byunghyun Jang, Perhaad Mistry, Dana Schaa, David Kaeli
2012 Proceedings of the 21st international conference on Parallel architectures and compilation techniques - PACT '12  
In this paper, we present Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU.  ...  Focusing on a model of the AMD Radeon 5870 GPU, we address program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite.  ...  The authors would also like to thank Norman Rubin (AMD) for his advice and feedback on this work.  ... 
doi:10.1145/2370816.2370865 dblp:conf/IEEEpact/UbalJMSK12 fatcat:ixqg7hugsnarxph4vltnywkj2q

AMD Fusion APU: Llano

Alexander Branover, Denis Foley, Maurice Steinman
2012 IEEE Micro  
Acknowledgments We thank the remaining authors of the LN APU presentation at Hot Chips: Antonio Asaro (AMD fellow), Greg Smaus (AMD principal member of technical staff), Ljubisa Bajic (senior manager at  ...  Llano represents the combined effort of many talented AMD engineers across multiple locations in the US, Canada, India, and Germany.  ...  I/O and display capability As Figure 3 shows, Llano supports eight lanes dedicated to PCI Express, eight lanes dedicated to DisplayPort, and 16 lanes that can be used for either PCI Express or DisplayPort  ... 
doi:10.1109/mm.2012.2 fatcat:t7p6vuydp5grlm3vs2crktxdyi

The Case for Polymorphic Registers in Dataflow Computing

Cătălin Bogdan Ciobanu, Georgi Gaydadjiev, Christian Pilato, Donatella Sciuto
2017 International journal of parallel programming  
We use a separable 2D convolution case study to evaluate the impact of memory latency and bandwidth on performance compared to a state-of-the-art NVIDIA Tesla C2050 GPU.  ...  We improve the throughput up to 56.17X and show that the PRF-augmented system outperforms the GPU for 9 × 9 or larger mask sizes, even in bandwidth-constrained systems.  ...  Hwu and Nasser Salim Anssari from the University of Illinois at Urbana-Champaign for assisting us with obtaining the NVIDIA Tesla C2050 2D separable convolution results.  ... 
doi:10.1007/s10766-017-0494-1 fatcat:bcttuesbpbhp7jrtcv5b5kl5hi

D9.3.3: Report on prototypes evaluation

Lennart Johnsson, Gilbert Netzer
2013 Zenodo  
DSPs common for embedded systems and with a TDP about one order of magnitude less than x86 CPUs, the emerging heterogeneous CPUs integrating x86 and GPU cores, and traditional GPUs with a novel direct  ...  Prototype efforts assessed the use of FPGAs for function acceleration, the use of CPUs for the mobile market and with a TDP about two orders of magnitude less than typical x86 CPUs for the HPC market,  ...  The parallel efficiency for the x86+GPU cluster is close to 100% for the 8 nodes in the cluster, whereas it is about 75% for the Magny-Cours cluster.  ... 
doi:10.5281/zenodo.6553033 fatcat:nvxbrlq5jzdfhbkh5fde3kpl4e

FAST

Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. Nguyen, Tim Kaldewey, Victor W. Lee, Scott A. Brandt, Pradeep Dubey
2010 Proceedings of the 2010 international conference on Management of data - SIGMOD '10  
FAST eliminates impact of memory latency, and exploits thread-level and data-level parallelism on both CPUs and GPUs to achieve 50 million (CPU) and 85 million (GPU) queries per second, 5X (CPU) and 1.7X  ...  FAST supports efficient bulk updates by rebuilding index trees in less than 0.1 seconds for datasets as large as 64M keys and naturally integrates compression techniques, overcoming the memory bandwidth  ...  map well to SIMD on modern architectures (CPU and GPU), with native instruction support for 32-bit elements in each SIMD lane.  ... 
doi:10.1145/1807167.1807206 dblp:conf/sigmod/KimCSSNKLBD10 fatcat:cpc26e36xnft3owjv7npmn3z2e

Machines and Algorithms [article]

Peter A Boyle
2017 arXiv   pre-print
I discuss the evolution of computer architectures with a focus on QCD and with reference to the interplay between architecture, engineering, data motion and algorithms.  ...  Right: a passive optical cable carrying 64 bit lanes.  ...  Like previous GPU architectures, GP100 supports full IEEE 754-2008 compliant single precision and double precision arithmetic, including support for the fused multiply-add (FMA) operation and full speed  ... 
arXiv:1702.00208v1 fatcat:er6sxgrduvf5rjirxjknwb43ym

Hyperion: A Case for Unified, Self-Hosting, Zero-CPU Data-Processing Units (DPUs) [article]

Marco Spaziani Brunella and Marco Bonola and Animesh Trivedi
2022 arXiv   pre-print
Since the inception of computing, we have been reliant on CPU-powered architectures.  ...  In this paper, we present the case for Hyperion, its design choices, initial work-in-progress details, and seek feedback from the systems community.  ...  Acknowledgments This work is generously supported by the NWO grant number OCENW.XS3.030, Project Zero: Imagining a Brave CPU-free World!, and the Xilinx University Donation Program.  ... 
arXiv:2205.08882v1 fatcat:otiko5erkzeabm3fu2m4z32mo4

Data-Centric and Data-Aware Frameworks for Fundamentally Efficient Data Handling in Modern Computing Systems [article]

Nastaran Hajinazar
2021 arXiv   pre-print
This thesis studies the root cause of inefficiency in modern computing systems when handling modern applications' data demand, and aims to fundamentally address such inefficiencies, with a focus on two  ...  demand in modern applications, and 2) is built from the ground up to understand, convey, and exploit data properties, to create opportunities for performance and efficiency improvements.  ...  Similar to x86-64 [342], our base address translation mechanism stores VBI-to-physical address translation information in multi-level tables.  ... 
arXiv:2109.05881v1 fatcat:iwup66vxsjct3bm5thl35nzyuq

A programming system for xeon phis with runtime SIMD parallelization

Xin Huo, Bin Ren, Gagan Agrawal
2014 Proceedings of the 28th ACM international conference on Supercomputing - ICS '14  
We use implementations of overloaded functions as a mechanism for providing SIMD code, which is assisted by runtime data reordering and our methods to effectively manage control flow.  ...  The Intel Xeon Phi offers a promising solution to coprocessing, since it is based on the popular x86 instruction set.  ...  In a one dimension matrix, if the address of matrix[i] is aligned by 64 bytes, addresses of its neighbors, matrix[i-1] and matrix[i+1], will not be aligned.  ... 
doi:10.1145/2597652.2597682 dblp:conf/ics/HuoRA14 fatcat:klizb5inbzgi5ipwgzic6n334i