Effectively Prefetching Remote Memory with Leap [article]

Hasan Al Maruf, Mosharaf Chowdhury
2019 arXiv   pre-print
In this paper, we propose Leap, a prefetching solution for remote memory accesses due to memory disaggregation.  ...  Memory disaggregation over RDMA can improve the performance of memory-constrained applications by replacing disk swapping with remote memory accesses.  ...  We have integrated Leap with two major memory disaggregation systems (namely, Infiniswap and Remote Regions), and Leap improves the median and tail remote page access latencies by up to 104.04× and 22.62  ... 
arXiv:1911.09829v1 fatcat:4jhmagwqhfcbdfb37424id2lmi

Canvas: Isolated and Adaptive Swapping for Multi-Applications on Remote Memory [article]

Chenxi Wang, Yifan Qiao, Haoran Ma, Shi Liu, Yiying Zhang, Wenguang Chen, Ravi Netravali, Miryung Kim, Guoqing Harry Xu
2022 arXiv   pre-print
Remote memory techniques for datacenter applications have recently gained a great deal of popularity. Existing remote memory techniques focus on the efficiency of a single application setting only.  ...  Canvas is a redesigned swap system that fully isolates swap paths for remote-memory applications.  ...  As effective prefetching is paramount to remote-memory performance, Canvas employs a two-tier prefetching design.  ... 
arXiv:2203.09615v1 fatcat:stucu3muffaqvjy6dqmgnyknoq

Systems for Memory Disaggregation: Challenges & Opportunities [article]

Anil Yelam
2022 arXiv   pre-print
Memory disaggregation addresses memory imbalance in a cluster by decoupling CPU and memory allocations of applications while also increasing the effective memory capacity for (memory-intensive) applications  ...  We conclude with a discussion on some open questions and potential future directions that can render disaggregation more amenable for adoption.  ...  Leap [6] is an advanced prefetcher for remote paging that monitors page faults and uses the faulted addresses to predict future pages.  ... 
arXiv:2202.02223v1 fatcat:4sht3aonuvg63nxhdr22jv4chi

Knowledge-Based Out-of-Core Algorithms for Data Management in Visualization [article]

David Chisnall, Min Chen, Charles Hansen
2006 EUROVIS 2005: Eurographics / IEEE VGTC Symposium on Visualization  
We carried out our evaluation in conjunction with an example application where rendering multiple point sets in a volume scene graph put a great strain on the rendering algorithm in terms of memory management  ...  Many existing out-of-core algorithms used in visualization are closely coupled with application-specific logic.  ...  An effective out-of-core [SCESL02], or external memory [Vit01], strategy requires an efficient prefetching algorithm (such as in [VM02]) in order to prevent disk latency being the limiting factor  ... 
doi:10.2312/vissym/eurovis06/107-114 fatcat:74xzztoz2fgrrk7sg3o2fbkjbq

Building a single-box 100 Gbps software router

Sangjin Han, Keon Jang, KyoungSoo Park, Sue Moon
2010 2010 17th IEEE Workshop on Local & Metropolitan Area Networks (LANMAN)  
Commodity-hardware technology has advanced in great leaps in terms of CPU, memory, and I/O bus speeds.  ...  For the former, we propose reducing per-packet processing overhead with software-level optimizations and buying extra computing power with GPUs.  ...  This is because remote memory access is expensive in NUMA systems in terms of latency and may overload interconnects between nodes.  ... 
doi:10.1109/lanman.2010.5507157 dblp:conf/lanman/HanJPM10 fatcat:bviru6uydbh7zccq47tgstjiu4

Modeling memory concurrency for multi-socket multi-core systems

Anirban Mandal, Rob Fowler, Allan Porterfield
2010 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS)  
We present a detailed experimental multi-socket, multi-core memory study based on the PCHASE benchmark, which can vary memory loads by controlling the number of concurrent memory references per thread.  ...  controller; and limits on the global memory concurrency.  ...  Rather, the effective latency of any particular operation may be less than some base figure if the operation can be overlapped with code execution through the use of prefetching or code scheduling.  ... 
doi:10.1109/ispass.2010.5452064 dblp:conf/ispass/MandalFP10 fatcat:nzjuvo3grrdrnhwkj64upcab4q

Warp speed

Peter D. Barnes, Christopher D. Carothers, David R. Jefferson, Justin M. LaPre
2013 Proceedings of the 2013 ACM SIGSIM conference on Principles of advanced discrete simulation - SIGSIM-PADS '13  
We attribute this to significant cache-related performance acceleration as we moved to higher scales with fewer LPs per core.  ...  Prompted by historical performance results we propose a new, long term performance metric called Warp Speed that grows logarithmically with the PHOLD event rate.  ...  To help mitigate the lower overall memory bandwidth to FLOP ratio, the Blue Gene/Q provides an L1 cache prefetch engine for each core along with scalable atomic operations.  ... 
doi:10.1145/2486092.2486134 dblp:conf/pads/BarnesCJL13 fatcat:jugkxffdafg25kpwluaqcyqmay

Data caching in IR systems

P. Simpson, R. Alonso
1987 Proceedings of the 10th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '87  
This paper outlines methods of integrating personal computers (PCs) into large information systems, with emphasis on effective use of the storage and processing capabilities of these computers.  ...  Information retrieval (IR) systems provide individual remote access to centrally managed data.  ...  The slower lines, however, cannot build the cache quickly enough to have much effect. Even with very high locality, response times are much worse than with no caching at all (Figure 4).  ... 
doi:10.1145/42005.42038 dblp:conf/sigir/SimpsonA87 fatcat:46hapub3ybferiaw5ql6h5svbi

The pressure is on [computer systems research]

K. Kavi, J.C. Browne, A. Tripathi
1999 Computer  
As applications become more demanding, computer systems research must not only redefine traditional roles but also unite diverse disciplines in a common goal: To make quantum leaps  ...  With mobile agents, the user sends the agent only once (with its initial parameters) to the remote server.  ...  The immediately obvious solution is to add more of something, such as on-chip (L1) cache memory, pipelines (providing a higher degree of superscalars), registers, hardware context, prefetch buffers, or  ... 
doi:10.1109/2.738301 fatcat:37juh2i5yjhxtdajris5w5ewvq

Interactive Remote Exploration of Massive Cityscapes [article]

Marco Di Benedetto, Paolo Cignoni, Fabio Ganovelli, Enrico Gobbetti, Fabio Marton, Roberto Scopigno
2009 VAST: International Symposium on Virtual Reality  
Thanks to the constant footprint of BlockMaps, memory management is particularly simple and effective.  ...  In particular, no fragmentation effects occur throughout the memory hierarchy, and data transfers at all levels can be optimized by grouping BlockMaps for tuning message sizes.  ... 
doi:10.2312/vast/vast09/009-016 fatcat:63em2yhfh5hmzfkmuf4wbvcexi

Performance characteristics of MAUI

Justin Teller, Charles B. Silio, Bruce Jacob
2005 Proceedings of the 2005 workshop on Memory system performance - MSP '05  
The MAUI's computational engine performs memory-bound SIMD computations close to the memory system, enabling more efficient memory pipelining.  ...  Because the "intelligence" of the MAUI intelligent memory system architecture is located in the memory-controller, logic and DRAM are not required to be integrated into a single chip, and use of off-the-shelf  ...  Note that for the memory systems shown in Figure 4 : Graph illustrating the effect memory configuration has on the speedup due to the MAUI architecture for the MAUI-one benchmark simulated with a processor  ... 
doi:10.1145/1111583.1111590 dblp:conf/ACMmsp/TellerSJ05 fatcat:j5ckax5325dtzlbpd7ictbc7re

A view of the parallel computing landscape

Krste Asanovic, John Wawrzynek, David Wessel, Katherine Yelick, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen
2009 Communications of the ACM  
To provide a common model of memory across machines with coherent caches, local stores, and relatively slow off-chip memory, we are defining an API based on the idea of logically partitioned shared memory  ...  Our approach aims to allow programmers to quickly turn a cache into an explicitly managed local store and the prefetch engines into explicitly controlled Direct Memory Access engines.  ... 
doi:10.1145/1562764.1562783 fatcat:telznhkcrzgwnm25ifqszd44lu

Processing Data Where It Makes Sense: Enabling In-Memory Computation [article]

Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, Rachata Ausavarungnirun
2019 arXiv   pre-print
We discuss at least two promising directions for processing-in-memory (PIM): (1) performing massively-parallel bulk operations in memory by exploiting the analog operational properties of DRAM, with low-cost  ...  As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost.  ...  Despite the leap that LazyPIM [70, 209] represents for memory coherence in computing systems with PIM support, we believe that it is still necessary to explore other solutions for memory coherence that  ... 
arXiv:1903.03988v1 fatcat:l2sl2wqwmrejvfbi3sxrpwasby

Parallel STEPS: Large Scale Stochastic Spatial Reaction-Diffusion Simulation with High Performance Computers

Weiliang Chen, Erik De Schutter
2017 Frontiers in Neuroinformatics  
with 100 processes.  ...  In a more realistic scenario with dynamic calcium influx and data recording, the parallel simulation with 1,000 processes and no load balancing is still 500 times faster than the conventional serial SSA  ...  Simulation with finer mesh achieves much higher speedup in massive parallelization, thanks to the memory caching effect.  ... 
doi:10.3389/fninf.2017.00013 pmid:28239346 pmcid:PMC5301017 fatcat:hctchl5ttndqdmk4n4jz65y7nm

RT-CUDA: A Software Tool for CUDA Code Restructuring

Ayaz H. Khan, Mayez Al-Mouhamed, Muhammed Al-Mulhem, Adel F. Ahmed
2016 International journal of parallel programming  
Update row index calculations with multiple of number of blocks to be merged Task Distribution among all threads with block merging applied Prefetching using Shared Memory To effectively use the shared  ...  memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache.  ...  Generated Jacobi original CUDA kernel Generated Jacobi CUDA kernel after modification Generated Jacobi params file Generated Jacobi params file after modification The modification on the code has no effect  ... 
doi:10.1007/s10766-016-0433-6 fatcat:xxikpkyrkvdizgijskmk2qxkay