31 Hits in 0.79 sec

GPU System Calls [article]

Ján Veselý, Arkaprava Basu, Abhishek Bhattacharjee, Gabriel Loh, Mark Oskin, Steven K. Reinhardt
2017 arXiv pre-print
GPUs are becoming first-class compute citizens and are being tasked to perform increasingly complex work. Modern GPUs increasingly support programmability-enhancing features such as shared virtual memory and hardware cache coherence, enabling them to run a wider variety of programs. But a key aspect of general-purpose programming where GPUs are still found lacking is the ability to invoke system calls. We explore how to directly invoke generic system calls in GPU programs. We examine how system calls should be meshed with prevailing GPGPU programming models, where thousands of threads are organized in a hierarchy of execution groups: Should a system call be invoked at the individual GPU task, or at different execution group levels? What are reasonable ordering semantics for GPU system calls across this hierarchy of execution groups? To study these questions, we implemented GENESYS, a mechanism to allow GPU programs to invoke system calls in the Linux operating system. Numerous subtle changes to Linux were necessary, as the existing kernel assumes that only CPUs invoke system calls. We analyze the performance of GENESYS using micro-benchmarks and three applications that exercise the filesystem, networking, and memory allocation subsystems of the kernel. We conclude by analyzing the suitability of all of Linux's system calls for the GPU.
arXiv:1705.06965v2 fatcat:eejpt3clujap3nlbkzm323s434
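The granularity question the abstract poses can be illustrated with a toy model. This is our own sketch, not GENESYS code; the function name and policy labels are invented for illustration:

```python
# Hypothetical model of the design question GENESYS explores: should each
# GPU work-item trap to the OS individually, should one leader per
# work-group coalesce its peers' requests, or should the whole kernel
# launch share a single call? (All names here are ours, not the paper's.)

def count_syscalls(num_workgroups, workitems_per_group, granularity):
    """Count host system-call invocations for a bulk I/O pattern."""
    if granularity == "workitem":
        # every work-item issues its own system call
        return num_workgroups * workitems_per_group
    if granularity == "workgroup":
        # one leader per work-group aggregates the group's requests
        return num_workgroups
    if granularity == "kernel":
        # a single call covers the entire grid
        return 1
    raise ValueError(f"unknown granularity: {granularity}")
```

Coarser granularities trade per-work-item flexibility and ordering freedom for far fewer OS traps, which is exactly the tension the paper's ordering-semantics questions probe.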

BadgerTrap

Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift
2014 SIGARCH Computer Architecture News  
Arkaprava Basu's contribution to the tool occurred while at the University of Wisconsin-Madison.  ...  Basu et al. [1] used an earlier version of the tool that later evolved into BadgerTrap.
doi:10.1145/2669594.2669599 fatcat:aytmbxdw3nf2dpm37lkfbr7hfi

Karma

Arkaprava Basu, Jayaram Bobba, Mark D. Hill
2011 Proceedings of the international conference on Supercomputing - ICS '11  
Recent research in deterministic record-replay seeks to ease debugging, security, and fault tolerance on otherwise nondeterministic multicore systems. The important challenge of handling shared memory races (which can occur on any memory reference) can be made more efficient with hardware support. Recent proposals record how long threads run in isolation on top of snooping coherence (IMRR), implicit transactions (DeLorean), or directory coherence (Rerun). As core counts scale, Rerun's directory-based parallel record gets more attractive, but its nearly sequential replay becomes unacceptably slow. This paper proposes Karma for both scalable recording and replay. Karma builds an episodic memory race recorder using a conventional directory cache coherence protocol and records the order of the episodes as a directed acyclic graph. Karma also enables extension of episodes even after some conflicts. During replay, Karma uses wakeup messages to trigger a partially ordered parallel episode replay. Results with several commercial workloads on a 16-core system show that Karma can achieve replay speed (a) within 19%-28% of native execution speed without record-replay and (b) four times faster than even an idealized Rerun replay. Additional results explore tradeoffs between log size and replay speed.
doi:10.1145/1995896.1995950 dblp:conf/ics/BasuBH11 fatcat:gveqwsqb4vgghfegwn6wf7bela
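Karma's partially ordered replay can be sketched as a wave-by-wave traversal of the recorded episode DAG. This is a minimal model of the idea in the abstract, with all names ours; the hardware triggers episodes with wakeup messages rather than an explicit topological sort:

```python
from collections import defaultdict

def replay_waves(deps):
    """deps: episode -> set of predecessor episodes (the recorded DAG).
    Groups episodes into waves; all episodes in a wave have every
    predecessor already replayed, so they can replay in parallel."""
    indeg = {e: len(p) for e, p in deps.items()}
    children = defaultdict(list)
    for ep, preds in deps.items():
        for p in preds:
            children[p].append(ep)
    wave = [e for e, d in indeg.items() if d == 0]
    waves = []
    while wave:
        waves.append(sorted(wave))   # sorted only for deterministic output
        nxt = []
        for ep in wave:
            for child in children[ep]:
                indeg[child] -= 1    # one more predecessor satisfied
                if indeg[child] == 0:
                    nxt.append(child)
        wave = nxt
    return waves
```

For a diamond-shaped DAG of four episodes, this yields three waves instead of four sequential steps, which is the source of Karma's replay speedup over a nearly sequential scheme.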

Faastlane: Accelerating Function-as-a-Service Workflows

Swaroop Kotni, Ajay Nayak, Vinod Ganapathy, Arkaprava Basu
2021 USENIX Annual Technical Conference  
In FaaS workflows, a set of functions implements application logic by interacting and exchanging data among themselves. Contemporary FaaS platforms execute each function of a workflow in a separate container. When functions in a workflow interact, the resulting latency slows execution. Faastlane minimizes function interaction latency by striving to execute functions of a workflow as threads within a single process of a container instance, which eases data sharing via simple load/store instructions. For FaaS workflows that operate on sensitive data, Faastlane provides lightweight thread-level isolation domains using Intel Memory Protection Keys (MPK). While threads ease sharing, implementations of languages such as Python and Node.js (widely used in FaaS applications) disallow concurrent execution of threads. Faastlane dynamically identifies opportunities for parallelism in FaaS workflows and forks processes (instead of threads) or spawns new container instances to concurrently execute parallel functions of a workflow. We implemented Faastlane atop Apache OpenWhisk and show that it accelerates workflow instances by up to 15×, and reduces function interaction latency by up to 99.95% compared to OpenWhisk.
dblp:conf/usenix/KotniNGB21 fatcat:qxibm654yvg4roeelm2pckytgi
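A rough sketch of the execution strategy the abstract describes, in standard Python; the stage functions and shared-state dict are invented for illustration, and the real Faastlane additionally isolates stages with MPK:

```python
import threading
import multiprocessing

# Illustrative-only model of Faastlane's choices: sequential workflow
# functions run as threads in one process, so they exchange data through
# plain shared-memory loads/stores; CPU-bound parallel branches run as
# separate processes because CPython's GIL serializes threads.
# (Assumes a fork-capable platform for the process pool.)

def stage_a(state):          # first "function" of the workflow
    state["order"] = {"item": "book", "qty": 2}

def stage_b(state):          # consumes stage_a's output via a dict load
    state["total"] = state["order"]["qty"] * 10

def run_sequential_as_threads():
    state = {}               # shared memory: no serialization, no copies
    for fn in (stage_a, stage_b):
        t = threading.Thread(target=fn, args=(state,))
        t.start()
        t.join()
    return state

def branch(x):               # a parallel, CPU-bound branch of the workflow
    return x * x

def run_parallel_as_processes(inputs):
    with multiprocessing.Pool(2) as pool:
        return pool.map(branch, inputs)
```

The thread path shows why interaction latency shrinks (one load/store instead of copying data between containers); the process path shows the fallback Faastlane takes when the language runtime would otherwise serialize parallel stages.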

Reducing memory reference energy with opportunistic virtual caching

Arkaprava Basu, Mark D. Hill, Michael M. Swift
2012 SIGARCH Computer Architecture News  
Most modern cores perform a highly-associative translation lookaside buffer (TLB) lookup on every memory access. These designs often hide the TLB lookup latency by overlapping it with L1 cache access, but this overlap does not hide the power dissipated by TLB lookups. It can even exacerbate the power dissipation by requiring a higher-associativity L1 cache. With today's concern for power dissipation, designs could instead adopt a virtual L1 cache, wherein TLB access power is dissipated only on L1 cache misses. Unfortunately, virtual caches have compatibility issues, such as supporting writeable synonyms and x86's physical page table walker. This work proposes an Opportunistic Virtual Cache (OVC) that exposes virtual caching as a dynamic optimization by allowing some memory blocks to be cached with virtual addresses and others with physical addresses. OVC relies on small OS changes to signal which pages can use virtual caching (e.g., no writeable synonyms), but defaults to physical caching for compatibility. We show OVC's promise with analysis that finds virtual cache problems exist, but are dynamically rare. We change 240 lines in Linux 2.6.28 to enable OVC. On experiments with Parsec and commercial workloads, the resulting system saves 94-99% of TLB lookup energy and nearly 23% of L1 cache dynamic lookup energy.
doi:10.1145/2366231.2337194 fatcat:v2mxbwbyr5he5e3tasvjq6ons4
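A back-of-the-envelope model of where OVC saves energy, assuming (our simplification, not the paper's model) that a TLB lookup costs one fixed unit of energy and that accesses to virtual-cached pages skip the TLB entirely:

```python
# Hedged sketch of the OVC energy argument: the OS marks pages that are
# safe to cache virtually (e.g., no writeable synonyms); lookups to those
# pages pay no TLB energy, while all other pages default to physical
# caching and pay a TLB lookup on every access.

TLB_ENERGY = 1.0   # arbitrary units per TLB lookup (our assumption)

def tlb_lookup_energy(accesses, virtual_ok):
    """accesses: sequence of page ids; virtual_ok: pages the OS marked
    safe for virtual caching. Returns total TLB lookup energy."""
    energy = 0.0
    for page in accesses:
        if page not in virtual_ok:   # physical caching: TLB on every access
            energy += TLB_ENERGY
    return energy
```

If 95 of 100 accesses hit virtually-cacheable pages, energy drops from 100 units to 5, which mirrors the paper's finding that synonym problems exist but are dynamically rare.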

Efficient virtual memory for big memory servers

Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, Michael M. Swift
2013 SIGARCH Computer Architecture News  
Our analysis shows that many "big-memory" server workloads, such as databases, in-memory caches, and graph analytics, pay a high cost for page-based virtual memory. They consume as much as 10% of execution cycles on TLB misses, even using large pages. On the other hand, we find that these workloads use read-write permission on most pages, are provisioned not to swap, and rarely benefit from the full flexibility of page-based virtual memory. To remove the TLB miss overhead for big-memory workloads, we propose mapping part of a process's linear virtual address space with a direct segment, while page mapping the rest of the virtual address space. Direct segments use minimal hardware (base, limit, and offset registers per core) to map contiguous virtual memory regions directly to contiguous physical memory. They eliminate the possibility of TLB misses for key data structures such as database buffer pools and in-memory key-value stores. Memory mapped by a direct segment may be converted back to paging when needed. We prototype direct-segment software support for x86-64 in Linux and emulate direct-segment hardware. For our workloads, direct segments eliminate almost all TLB misses and reduce the execution time wasted on TLB misses to less than 0.5%.
doi:10.1145/2508148.2485943 fatcat:vix4kkpe5veefmas7inuv72uay
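The translation rule the abstract describes can be sketched directly. The base/limit/offset register names follow the text; the 4 KB page-table fallback is our simplification of conventional paging:

```python
PAGE_SIZE = 4096  # assume 4 KB base pages for the fallback path

def translate(va, base, limit, offset, page_table):
    """Direct-segment sketch: if base <= va < limit, translation is pure
    arithmetic (no TLB, no page walk possible); otherwise fall back to a
    conventional page-table lookup."""
    if base <= va < limit:
        return va + offset                    # contiguous VA -> contiguous PA
    vpn, page_off = divmod(va, PAGE_SIZE)     # paged remainder of the space
    return page_table[vpn] * PAGE_SIZE + page_off
```

Because the in-segment path never consults the TLB, a buffer pool placed inside [base, limit) simply cannot take a TLB miss, which is the mechanism behind the reported sub-0.5% overhead.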

Reducing memory reference energy with opportunistic virtual caching

Arkaprava Basu, Mark D. Hill, Michael M. Swift
2012 2012 39th Annual International Symposium on Computer Architecture (ISCA)  
Most modern cores perform a highly-associative translation lookaside buffer (TLB) lookup on every memory access. These designs often hide the TLB lookup latency by overlapping it with L1 cache access, but this overlap does not hide the power dissipated by TLB lookups. It can even exacerbate the power dissipation by requiring a higher-associativity L1 cache. With today's concern for power dissipation, designs could instead adopt a virtual L1 cache, wherein TLB access power is dissipated only on L1 cache misses. Unfortunately, virtual caches have compatibility issues, such as supporting writeable synonyms and x86's physical page table walker. This work proposes an Opportunistic Virtual Cache (OVC) that exposes virtual caching as a dynamic optimization by allowing some memory blocks to be cached with virtual addresses and others with physical addresses. OVC relies on small OS changes to signal which pages can use virtual caching (e.g., no writeable synonyms), but defaults to physical caching for compatibility. We show OVC's promise with analysis that finds virtual cache problems exist, but are dynamically rare. We change 240 lines in Linux 2.6.28 to enable OVC. On experiments with Parsec and commercial workloads, the resulting system saves 94-99% of TLB lookup energy and nearly 23% of L1 cache dynamic lookup energy.
doi:10.1109/isca.2012.6237026 dblp:conf/isca/BasuHS12 fatcat:x3qswkwwlvanbg2ztfoysnatcm

FreshCache: Statically and dynamically exploiting dataless ways

Arkaprava Basu, Derek R. Hower, Mark D. Hill, Michael M. Swift
2013 2013 IEEE 31st International Conference on Computer Design (ICCD)  
Last level caches (LLCs) account for a substantial fraction of the area and power budget in many modern processors. Two recent trends, dwindling die yield that falls off sharply with larger chips and increasing static power, make a strong case for a fresh look at LLC design. Inclusive caches are particularly interesting because many, if not most, commercially successful processors use inclusion to ease coherence, at the cost of some data being stale or redundant. LLC designs can be improved statically (at design time) or dynamically (at runtime). The "static dataless ways" design removes the data, but not the tags, from some cache ways to save energy and area without complicating inclusive-LLC coherence. A dynamic version ("dynamic dataless ways") could dynamically turn off data, but not tags, effectively adapting the classic selective cache ways idea to save energy in the LLC but not area. Our data show that (a) all our benchmarks benefit from dataless ways, but (b) the best number of dataless ways varies by workload. Thus, a pure static dataless design leaves energy-saving opportunity on the table, while a pure dynamic dataless design misses area-saving opportunity. To surpass both pure static and dynamic approaches, we develop the FreshCache LLC design that both statically and dynamically exploits dataless ways, including repurposing a predictor to adapt the number of dynamic dataless ways as well as detailed cache management policies. Results show that FreshCache saves more energy than static dataless ways alone (e.g., 72% vs. 9% of LLC) and more area than dynamic dataless ways alone (e.g., 8% vs. 0% of LLC).
doi:10.1109/iccd.2013.6657055 dblp:conf/iccd/BasuHHS13 fatcat:74nmpmttqnaevd7lcl7lt6rw3m

Efficient virtual memory for big memory servers

Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, Michael M. Swift
2013 Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA '13  
Our analysis shows that many "big-memory" server workloads, such as databases, in-memory caches, and graph analytics, pay a high cost for page-based virtual memory. They consume as much as 10% of execution cycles on TLB misses, even using large pages. On the other hand, we find that these workloads use read-write permission on most pages, are provisioned not to swap, and rarely benefit from the full flexibility of page-based virtual memory. To remove the TLB miss overhead for big-memory workloads, we propose mapping part of a process's linear virtual address space with a direct segment, while page mapping the rest of the virtual address space. Direct segments use minimal hardware (base, limit, and offset registers per core) to map contiguous virtual memory regions directly to contiguous physical memory. They eliminate the possibility of TLB misses for key data structures such as database buffer pools and in-memory key-value stores. Memory mapped by a direct segment may be converted back to paging when needed. We prototype direct-segment software support for x86-64 in Linux and emulate direct-segment hardware. For our workloads, direct segments eliminate almost all TLB misses and reduce the execution time wasted on TLB misses to less than 0.5%.
doi:10.1145/2485922.2485943 dblp:conf/isca/BasuGCHS13 fatcat:2p7dghs7g5axrn7dh2tttcufoe

Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks

Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift
2014 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture  
Arkaprava Basu's contribution to this work occurred while at UW-Madison.
doi:10.1109/micro.2014.37 dblp:conf/micro/GandhiBHS14 fatcat:ncrdyszrhzgqxjzu2vgp4x6ieq

Scavenger: A New Last Level Cache Architecture with Global Block Priority

Arkaprava Basu, Nevin Kirman, Meyrem Kirman, Mainak Chaudhuri, Jose Martinez
2007 Proceedings of the Annual International Symposium on Microarchitecture (MICRO)
Addresses suffering from cache misses typically exhibit repetitive patterns due to the temporal locality inherent in the access stream. However, we observe that the number of intervening misses at the last-level cache between the eviction of a particular block and its reuse can be very large, preventing traditional victim caching mechanisms from exploiting this repeating behavior. In this paper, we present Scavenger, a new architecture for last-level caches. Scavenger divides the total storage budget into a conventional cache and a novel victim file architecture, which employs a skewed Bloom filter in conjunction with a pipelined priority heap to identify and retain the blocks that most frequently missed in the conventional part of the cache in the recent past. When compared against a baseline configuration with a 1MB 8-way L2 cache, a Scavenger configuration with a 512kB 8-way conventional cache and a 512kB victim file achieves an IPC improvement of up to 63% and on average (geometric mean) 14.2% for nine memory-bound SPEC 2000 applications. On a larger set of sixteen SPEC 2000 applications, Scavenger achieves an average speedup of 8%.
doi:10.1109/micro.2007.4408273 fatcat:cgcf5aeekff7jcg5drqwpawvvi
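A functional sketch of the victim-file policy described above, assuming (our simplifications) exact miss counts in place of the paper's skewed Bloom filter and Python's heapq in place of its pipelined hardware heap:

```python
import heapq
from collections import Counter

class VictimFile:
    """Retain the blocks that missed most often in the recent past.
    Illustrative model only: the class name and structure are ours."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.miss_count = Counter()   # stand-in for the skewed Bloom filter
        self.heap = []                # (priority, block) min-heap
        self.resident = set()

    def on_l2_miss(self, block):
        self.miss_count[block] += 1
        prio = self.miss_count[block]
        if block in self.resident:
            return                    # simplification: resident priority goes stale
        if len(self.resident) < self.capacity:
            heapq.heappush(self.heap, (prio, block))
            self.resident.add(block)
        else:
            low_prio, low_block = self.heap[0]
            if prio > low_prio:       # evict the least-frequently-missed block
                heapq.heapreplace(self.heap, (prio, block))
                self.resident.discard(low_block)
                self.resident.add(block)
```

The min-heap keeps the lowest-priority resident block at the root, so a newly missing block displaces it only once its own miss count is higher, which approximates the "global block priority" of the title.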

Scavenger: A New Last Level Cache Architecture with Global Block Priority

Arkaprava Basu, Nevin Kirman, Meyrem Kirman, Mainak Chaudhuri, Jose Martinez
2007 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007)  
Addresses suffering from cache misses typically exhibit repetitive patterns due to the temporal locality inherent in the access stream. However, we observe that the number of intervening misses at the last-level cache between the eviction of a particular block and its reuse can be very large, preventing traditional victim caching mechanisms from exploiting this repeating behavior. In this paper, we present Scavenger, a new architecture for last-level caches. Scavenger divides the total storage budget into a conventional cache and a novel victim file architecture, which employs a skewed Bloom filter in conjunction with a pipelined priority heap to identify and retain the blocks that most frequently missed in the conventional part of the cache in the recent past. When compared against a baseline configuration with a 1MB 8-way L2 cache, a Scavenger configuration with a 512kB 8-way conventional cache and a 512kB victim file achieves an IPC improvement of up to 63% and on average (geometric mean) 14.2% for nine memory-bound SPEC 2000 applications. On a larger set of sixteen SPEC 2000 applications, Scavenger achieves an average speedup of 8%.
doi:10.1109/micro.2007.42 dblp:conf/micro/BasuKKCM07 fatcat:4wuxz6mqcjh4hhywwemfy2x26i

The gem5 simulator

Nathan Binkert, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, David A. Wood, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi (+4 others)
2011 SIGARCH Computer Architecture News  
A novel method to protect a system against errors resulting from soft errors occurring in virtual address (VA) storing structures, such as translation lookaside buffers (TLBs), the physical register file (PRF), and the program counter (PC), is proposed in this paper. The work is motivated by showing how soft errors impact the structures that store virtual page numbers (VPNs). A solution is proposed by employing linear block encoding methods to be used as a virtual addressing scheme at link time. Using the encoding scheme to assign VPNs for VAs, it is shown that the system can tolerate soft errors using software with the help of the discussed decoding techniques applied to the page fault handler. The proposed solution can be used on all of the architectures using virtually indexed addressing. The main contribution of this paper is the decreasing of the AVF for the data TLB by 42.5%, the instruction TLB by 40.3%, the PC by 69.2%, and the PRF by 33.3%.
doi:10.1145/2024716.2024718 fatcat:4rj2ut4pyve5dacs5ostiwshji

Trident: Harnessing Architectural Resources for All Page Sizes in x86 Processors

Venkat Sri Sai Ram, Ashish Panwar, Arkaprava Basu
2021 MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture  
Arkaprava is supported by a Young Investigator Fellowship by Pratiksha Trust, Bangalore.  ... 
doi:10.1145/3466752.3480062 fatcat:s7aninoyrvc2phvqdg57ykyrjq

Heterogeneous system coherence for integrated CPU-GPU systems

Jason Power, Arkaprava Basu, Junli Gu, Sooraj Puthoor, Bradford M. Beckmann, Mark D. Hill, Steven K. Reinhardt, David A. Wood
2013 Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-46  
Basu et al. extend region coherence to directory-based systems [8]. This work proposes a dual-granular directory design that tracks both block- and region-level permissions.
doi:10.1145/2540708.2540747 dblp:conf/micro/PowerBGPBHRW13 fatcat:tfbw4k74j5avzppy2zkk6rkxte