Cache-Efficient Fork-Processing Patterns on Large Graphs [article]

Shengliang Lu, Shixuan Sun, Johns Paul, Yuchen Li, Bingsheng He
2021 pre-print
As large graph processing emerges, we observe a costly fork-processing pattern (FPP) common in many graph algorithms.  ...  In this paper, we propose ForkGraph, a cache-efficient FPP processing system on multi-core architectures.  ...  To improve the efficiency of handling FPPs, we develop ForkGraph, a cache-efficient system for processing FPPs for in-memory graphs on multi-core machines.  ... 
doi:10.1145/3448016.3457253 arXiv:2103.14915v1 fatcat:uif6fkyxyrb2vev2llllroqi2m

Executing Optimized Irregular Applications Using Task Graphs within Existing Parallel Models

Christopher D. Krieger, Michelle Mills Strout, Jonathan Roelofs, Amanreet Bajwa
2012 2012 SC Companion: High Performance Computing, Networking Storage and Analysis  
We present performance and scalability results for 8 and 40 core shared memory systems on a sparse matrix iterative solver and a molecular dynamics benchmark.  ...  These optimizations result in asynchronous parallelism that can be represented by arbitrary task graphs.  ...  DE-FC-0206-ER-25774, as part of its SciDAC program, a Department of Energy Early Career Grant DE-SC0003956, by a National Science Foundation CAREER grant CCF 0746693, and by the Department of Energy CACHE  ... 
doi:10.1109/sc.companion.2012.43 dblp:conf/sc/KriegerSRB12 fatcat:6yfyqol2knajpix5fj5k4pegoi

Data Oblivious Algorithms for Multicores [article]

Vijaya Ramachandran, Elaine Shi
2021 arXiv   pre-print
We first show that data-oblivious sorting can be accomplished by a binary fork-join algorithm with optimal total work and optimal (cache-oblivious) cache complexity, and in O(log n log log n) span (i.e  ...  Using our sorting algorithm as a core primitive, we show how to data-obliviously simulate general PRAM algorithms in the binary fork-join model with non-trivial efficiency.  ...  For graph problems, n is the number of vertices, and m = Ω(n) is the number of edges. We compare with insecure cache-efficient CREW binary fork-join algorithms.  ... 
arXiv:2008.00332v2 fatcat:fpj5e7tqjfbe7audobaqj7xv3y

Relating layered queueing networks and process algebra models

Mirco Tribastone
2010 Proceedings of the first joint WOSP/SIPEW international conference on Performance engineering - WOSP/SIPEW '10  
This paper presents a process-algebraic interpretation of the Layered Queueing Network model.  ...  The semantics of layered multi-class servers, resource contention, multiplicity of threads and processors are mapped into a model described in the stochastic process algebra PEPA.  ...  When one of such activities is chosen, the process behaves as the initial state of the main flow of the execution graph corresponding to that entry.  ... 
doi:10.1145/1712605.1712634 dblp:conf/wosp/Tribastone10 fatcat:zv7nkmafzzhgrpmpxcbe73ysce

Parallel triangle counting in massive streaming graphs

Kanat Tangwongsan, A. Pavan, Srikanta Tirthapura
2013 Proceedings of the 22nd ACM international conference on Conference on information & knowledge management - CIKM '13  
Driven by these applications and the trend that modern graph datasets are both large and dynamic, we present the design and implementation of a fast and cache-efficient parallel algorithm for estimating  ...  In these applications, modern graphs of interest tend to be both large and dynamic.  ...  It can also be used to process large static graphs, reaping the benefits of parallelism and small memory footprints.  ... 
doi:10.1145/2505515.2505741 dblp:conf/cikm/TangwongsanPT13 fatcat:ic5ijt6g6vdjhjsmzhwxascihq

Parallel program performance prediction using deterministic task graph analysis

Vikram S. Adve, Mary K. Vernon
2004 ACM Transactions on Computer Systems  
For the applications we have examined, we find that the deterministic task graph model is very efficient to evaluate even for programs with large and complex task graphs.  ...  First, an experimental evaluation shows that our analysis technique is accurate and efficient for a variety of shared-memory programs, including programs with large and/or complex task graphs, sophisticated  ...  Overall, we found that the deterministic task graph model is quite efficient for programs with moderately large and complex task graphs.  ... 
doi:10.1145/966785.966788 fatcat:z4ojdrx6infnvaj4zgkcljzne4

Parallel Triangle Counting in Massive Streaming Graphs [article]

Kanat Tangwongsan, A. Pavan, Srikanta Tirthapura
2013 arXiv   pre-print
Driven by these applications and the trend that modern graph datasets are both large and dynamic, we present the design and implementation of a fast and cache-efficient parallel algorithm for estimating  ...  By leveraging the parallel cache-oblivious framework, it makes efficient use of the memory hierarchy of modern multicore machines without needing to know its specific parameters.  ...  It can also be used to process large static graphs, reaping the benefits of parallelism and small memory footprints.  ... 
arXiv:1308.2166v1 fatcat:rw5amayb3nexndocrkfb2j67ni

A Virtual Cache for Overlapped Memory Accesses of Path ORAM

Naoki Fujieda, Ryo Yamauchi, Hiroki Fujita, Shuichi Ichikawa
2017 International Journal of Networking and Computing  
This paper presents last path caching, which removes the redundancy of Path ORAM with a simpler protocol than an existing method called Fork Path ORAM.  ...  According to our evaluation with a prototyped FPGA implementation, the number of LUTs used with the last path caching was 1.4%-7.8% smaller than Fork Path ORAM.  ...  This paper also points out that Fork Path ORAM has a disadvantage on security: the derived access pattern may reflect the original access pattern in a specific condition.  ... 
doi:10.15803/ijnc.7.2_106 fatcat:tnan77d32veznibnoitxnygf54

Efficient Parallel Graph Exploration on Multi-Core CPU and GPU

Sungpack Hong, Tayo Oguntebi, Kunle Olukotun
2011 2011 International Conference on Parallel Architectures and Compilation Techniques  
In graph-based applications, a systematic exploration of the graph such as a breadth-first search (BFS) often serves as a key component in the processing of their massive data sets.  ...  Such a hybrid approach provides the best performance for each graph size while avoiding poor worst-case performance on high-diameter graphs.  ...  However, efficient processing of large graphs is still considered challenging [6] ; one reason is the natural random memory access patterns exhibited in graph traversal.  ... 
doi:10.1109/pact.2011.14 dblp:conf/IEEEpact/HongOO11 fatcat:56n4bqxpv5hkjlioutdhq7xj3u

Scheduling FFT computation on SMP and multicore systems

Ayaz Ali, Lennart Johnsson, Jaspal Subhlok
2007 Proceedings of the 21st annual international conference on Supercomputing - ICS '07  
We evaluate the performance of OpenMP and PThreads implementations of FFT on a number of latest architectures.  ...  In this paper, we develop heuristics to simplify the generation of better schedules for parallel FFT computations on CMP/SMP systems.  ...  The efficiency on Xeon Woodcrest was the lowest: 14% for moderately large sizes and 9% for very large sizes.  ... 
doi:10.1145/1274971.1275011 dblp:conf/ics/AliJS07 fatcat:ngkx3wztgzanvgg3ybakgh4bva

Operating system benchmarking in the wake of lmbench

Aaron B. Brown, Margo I. Seltzer
1997 Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems - SIGMETRICS '97  
technology, and memory-bus and cache coherency protocols) can essentially nullify the performance benefits of the aggressive execution core and sophisticated on-chip memory system of a modern processor  ...  Our analysis shows that off-chip memory system design continues to influence operating system performance in a significant way and that key design decisions (such as suboptimal choices of DRAM and cache  ...  hbench:OS measures three methods of process invocation: a simple fork, a fork and exec, and process invocation via the shell.  ... 
doi:10.1145/258612.258690 dblp:conf/sigmetrics/BrownS97 fatcat:yt5uohe44zhftgwk6a6iib5pxi

Operating system benchmarking in the wake of lmbench

Aaron B. Brown, Margo I. Seltzer
1997 Performance Evaluation Review  
technology, and memory-bus and cache coherency protocols) can essentially nullify the performance benefits of the aggressive execution core and sophisticated on-chip memory system of a modern processor  ...  Our analysis shows that off-chip memory system design continues to influence operating system performance in a significant way and that key design decisions (such as suboptimal choices of DRAM and cache  ...  hbench:OS measures three methods of process invocation: a simple fork, a fork and exec, and process invocation via the shell.  ... 
doi:10.1145/258623.258690 fatcat:vgsbgckfarcs7gsqjqtalp7awy

Scalable Graph Exploration on Multicore Processors

Virat Agarwal, Fabrizio Petrini, Davide Pasetto, David A. Bader
2010 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis  
similar graph on a Cray MTA-2 with 40 processors and (3) 5 times faster than 256 BlueGene/L processors on a graph with average degree 50.  ...  Our performance on several benchmark problems representative of the power-law graphs found in real-world problems reaches processing rates that are competitive with supercomputing results in the recent  ...  Searching large graphs poses difficult challenges, because the potentially vast data set is combined with the lack of spatial and temporal locality in the access pattern.  ... 
doi:10.1109/sc.2010.46 dblp:conf/sc/AgarwalPPB10 fatcat:7pqcphe3pjb2bieecp4hsomwmy

Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems [article]

Vivek Seshadri
2016 arXiv   pre-print
For these access patterns, GS-DRAM achieves near-ideal bandwidth and cache utilization, without increasing the latency of fetching data from memory.  ...  In addition to improving the efficiency of bulk data coherence, DBI has several applications including high-performance memory scheduling, efficient cache lookup bypassing, and enabling heterogeneous ECC  ...  Similarly, in graph processing, operations that update individual nodes in the graph have different access patterns than those that traverse the graph.  ... 
arXiv:1605.06483v1 fatcat:5pa4zmkbdzgulim2jsqjkry3pu

Synthesizing synchronous elastic flow networks

Greg Hoover, Forrest Brewer
2008 Proceedings of the conference on Design, automation and test in Europe - DATE '08  
We present the language syntax, semantics and synthesis techniques illustrated by the design of a latency tolerant cache controller.  ...  Figure 4. Compiled behavior graph from the simple cache controller specification in Algorithm 2.  ...  Our aim is to automate the process of creating a distributed token network which efficiently manages the behavior exploring architectural changes in the design.  ... 
doi:10.1145/1403375.1403449 fatcat:agpv7llburgurlj5ohwwkvovwq