Supporting x86-64 address translation for 100s of GPU lanes
2014
2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
However, even a modest GPU might need 100s of translations per cycle (6 CUs * 64 lanes/CU) with memory access patterns designed for throughput more than locality. ...
For increased programmability, this memory should be uniformly virtualized, necessitating compatible address translation support for GPU memory references. ...
Acknowledgements This work is supported in part by the National Science Foundation (CCF-1017650, CNS-1117280, CCF-1218323 and CNS-1302260) and a University of Wisconsin Vilas award. ...
doi:10.1109/hpca.2014.6835965
dblp:conf/hpca/PowerHW14
fatcat:sonrdgcadbcwtopl2a4hfh72em
State-of-the-Art and Trends for Computing and Interconnect Network Solutions for HPC and AI
2021
Zenodo
The present report provides a consolidated view on the current and mid-term technologies (2019-2022+) for two important components of an HPC/AI system: computing (general purpose processor and accelerators ...
Since 2000, High Performance Computing (HPC) resources have been extremely homogeneous in terms of underlying processors technologies. ...
Acknowledgements This work was financially supported by the PRACE project funded in part by the EU's Horizon 2020 Research and Innovation programme (2014-2020) under grant agreement 823767. ...
doi:10.5281/zenodo.5717283
fatcat:irgzrdxr6ncijcfxsdb3sdodii
State-of-the-Art and Trends for Computing and Interconnect Network Solutions for HPC and AI
2021
Zenodo
The present report provides a consolidated view on the current and mid-term technologies (2019-2022+) for two important components of an HPC/AI system: computing (general purpose processor and accelerators ...
Since 2000, High Performance Computing (HPC) resources have been extremely homogeneous in terms of underlying processors technologies. ...
Acknowledgements This work was financially supported by the PRACE project funded in part by the EU's Horizon 2020 Research and Innovation programme (2014-2020) under grant agreement 823767. ...
doi:10.5281/zenodo.5534079
fatcat:fdknu7w4mfc5foa4gnmt5vqdna
In-Database Processing and In-Memory Analytics
[chapter]
2015
Computer Communications and Networks
We also discuss the results of a hybrid query scheduling when interleaving the execution of the SIMD operators between PIM and x86 processing hardware. ...
However, this is the first experimental study, in the database community, to discuss the trade-offs of execution time and energy consumption between PIM and x86 in the main query execution systems: materialized ...
The random access shows low data reuse as at most 32 memory addresses from the 64 possible addresses in the SIMD lanes can be accessed at once. ...
doi:10.1007/978-3-319-20744-5_8
fatcat:or56nfiaknhcrkzo2wkklkhqmy
Caracal
2011
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units - GPGPU-4
Graphics Processing Units (GPU) have become the platform of choice for accelerating a large range of data parallel and task parallel applications. ...
Here we target the AMD Evergreen family of GPUs. We discuss the challenges of compatibility and correctness faced by the translator using specific examples. ...
CUDA Research Centers Program, and by support by the Vice Provost's Office of Research at Northeastern University. ...
doi:10.1145/1964179.1964186
dblp:conf/asplos/DominguezSK11
fatcat:p3wi6reknbbajdbtcyic7w5iru
Processing Panorama Video in Real-time
2014
International Journal of Semantic Computing (IJSC)
The P2G framework is designed for multimedia workloads and supports heterogeneous architectures. To demonstrate the feasibility of the framework, we construct a proof-of-concept implementation. ...
For a very long time, one of the important means of increasing performance was to increase the clock frequency. ...
The cores have been modified with support for the 64-bit x86 instruction set and support for four-way SMT. ...
doi:10.1142/s1793351x14400054
fatcat:hafewx3ekrcfpat2osb67fjugi
Multi2Sim
2012
Proceedings of the 21st international conference on Parallel architectures and compilation techniques - PACT '12
In this paper, we present Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU. ...
Focusing on a model of the AMD Radeon 5870 GPU, we address program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite. ...
The authors would also like to thank Norman Rubin (AMD) for his advice and feedback on this work. ...
doi:10.1145/2370816.2370865
dblp:conf/IEEEpact/UbalJMSK12
fatcat:ixqg7hugsnarxph4vltnywkj2q
AMD Fusion APU: Llano
2012
IEEE Micro
Acknowledgments We thank the remaining authors of the LN APU presentation at Hot Chips: Antonio Asaro (AMD fellow), Greg Smaus (AMD principal member of technical staff), Ljubisa Bajic (senior manager at ...
Llano represents the combined effort of many talented AMD engineers across multiple locations in the US, Canada, India, and Germany. ...
I/O and display capability As Figure 3 shows, Llano supports eight lanes dedicated to PCI Express, eight lanes dedicated to DisplayPort, and 16 lanes that can be used for either PCI Express or DisplayPort ...
doi:10.1109/mm.2012.2
fatcat:t7p6vuydp5grlm3vs2crktxdyi
The Case for Polymorphic Registers in Dataflow Computing
2017
International journal of parallel programming
We use a separable 2D convolution case study to evaluate the impact of memory latency and bandwidth on performance compared to a state-of-the-art NVIDIA Tesla C2050 GPU. ...
We improve the throughput up to 56.17X and show that the PRF-augmented system outperforms the GPU for 9 × 9 or larger mask sizes, even in bandwidth-constrained systems. ...
Hwu and Nasser Salim Anssari from the University of Illinois at Urbana-Champaign for assisting us with obtaining the NVIDIA Tesla C2050 2D separable convolution results. ...
doi:10.1007/s10766-017-0494-1
fatcat:bcttuesbpbhp7jrtcv5b5kl5hi
D9.3.3: Report on prototypes evaluation
2013
Zenodo
DSPs common for embedded systems and with a TDP about one order of magnitude less than x86 CPUs, the emerging heterogeneous CPUs integrating x86 and GPU cores, and traditional GPUs with a novel direct ...
Prototype efforts assessed the use of FPGAs for function acceleration, the use of CPUs for the mobile market and with a TDP about two orders of magnitude less than typical x86 CPUs for the HPC market, ...
The parallel efficiency for the x86+GPU cluster is close to 100% for the 8 nodes in the cluster, whereas it is about 75% for the Magny-Cours cluster. ...
doi:10.5281/zenodo.6553033
fatcat:nvxbrlq5jzdfhbkh5fde3kpl4e
FAST eliminates impact of memory latency, and exploits thread-level and data-level parallelism on both CPUs and GPUs to achieve 50 million (CPU) and 85 million (GPU) queries per second, 5X (CPU) and 1.7X ...
FAST supports efficient bulk updates by rebuilding index trees in less than 0.1 seconds for datasets as large as 64M keys and naturally integrates compression techniques, overcoming the memory bandwidth ...
map well to SIMD on modern architectures (CPU and GPU), with native instruction support for 32-bit elements in each SIMD lane. ...
doi:10.1145/1807167.1807206
dblp:conf/sigmod/KimCSSNKLBD10
fatcat:cpc26e36xnft3owjv7npmn3z2e
Machines and Algorithms
[article]
2017
arXiv
pre-print
I discuss the evolution of computer architectures with a focus on QCD and with reference to the interplay between architecture, engineering, data motion and algorithms. ...
Right: a passive optical cable carrying 64 bit lanes. ...
Like previous GPU architectures, GP100 supports full IEEE 754-2008 compliant single precision and double precision arithmetic, including support for the fused multiply-add (FMA) operation and full speed ...
arXiv:1702.00208v1
fatcat:er6sxgrduvf5rjirxjknwb43ym
Hyperion: A Case for Unified, Self-Hosting, Zero-CPU Data-Processing Units (DPUs)
[article]
2022
arXiv
pre-print
Since the inception of computing, we have been reliant on CPU-powered architectures. ...
In this paper, we present the case for Hyperion, its design choices, initial work-in-progress details, and seek feedback from the systems community. ...
Acknowledgments This work is generously supported by the NWO grant number OCENW.XS3.030, Project Zero: Imagining a Brave CPU-free World!, and the Xilinx University Donation Program. ...
arXiv:2205.08882v1
fatcat:otiko5erkzeabm3fu2m4z32mo4
Data-Centric and Data-Aware Frameworks for Fundamentally Efficient Data Handling in Modern Computing Systems
[article]
2021
arXiv
pre-print
This thesis studies the root cause of inefficiency in modern computing systems when handling modern applications' data demand, and aims to fundamentally address such inefficiencies, with a focus on two ...
demand in modern applications, and 2) is built from the ground up to understand, convey, and exploit data properties, to create opportunities for performance and efficiency improvements. ...
Similar to x86-64 [342], our base address translation mechanism stores VBI-to-physical address translation information in multi-level tables. ...
arXiv:2109.05881v1
fatcat:iwup66vxsjct3bm5thl35nzyuq
A programming system for xeon phis with runtime SIMD parallelization
2014
Proceedings of the 28th ACM international conference on Supercomputing - ICS '14
We use implementations of overloaded functions as a mechanism for providing SIMD code, which is assisted by runtime data reordering and our methods to effectively manage control flow. ...
The Intel Xeon Phi offers a promising solution to coprocessing, since it is based on the popular x86 instruction set. ...
In a one dimension matrix, if the address of matrix[i] is aligned by 64 bytes, addresses of its neighbors, matrix[i-1] and matrix[i+1], will not be aligned. ...
doi:10.1145/2597652.2597682
dblp:conf/ics/HuoRA14
fatcat:klizb5inbzgi5ipwgzic6n334i