276 Hits in 2.3 sec

A Decoupled KILO-Instruction Processor

M. Pericas, A. Cristal, R. Gonzalez, D.A. Jimenez, M. Valero
The Twelfth International Symposium on High-Performance Computer Architecture, 2006.  
In this paper we propose a decoupled microarchitecture that executes low latency instructions on a Cache Processor and high latency instructions on a Memory Processor.  ...  Building processors with large instruction windows has been proposed as a mechanism for overcoming the memory wall, but finding a feasible and implementable design has been an elusive goal.  ...  Daniel A. Jiménez is supported by NSF Grant CCR-0311091 as well as a grant from the Ministerio de Educación y Ciencia of Spain, SB2003-0357. In addition, we would like to thank James C.  ... 
doi:10.1109/hpca.2006.1598112 dblp:conf/hpca/PericasCGJV06 fatcat:hadimequkjfhnbx3gayqhd7nem

Exploiting Execution Locality with a Decoupled Kilo-Instruction Processor [chapter]

Miquel Pericàs, Adrian Cristal, Ruben González, Daniel A. Jiménez, Mateo Valero
High-Performance Computing  
This Decoupled Kilo-Instruction Processor (D-KIP) is very effective in recovering lost potential performance.  ...  Theoretically, increasing the size of the instruction window would allow much longer latencies to be hidden.  ...  Daniel A. Jiménez is supported by NSF Grant CCF-0545898.  ... 
doi:10.1007/978-3-540-77704-5_5 dblp:conf/ishpc/PericasCGJV05 fatcat:hdvvounmebeyfb2lqxwqud6ciq

Full-system timing-first simulation

Carl J. Mauer, Mark D. Hill, David A. Wood
2002 Performance Evaluation Review  
) up to 36% (16 processors) Full-System Timing-First Simulation Carl Mauer Performance Comparison • Absolute simulation performance comparison - In kilo-instructions committed per second (KIPS)  ...  Full-System Timing-First Simulation Carl Mauer Related Work Name Dynamic Full System Out-of-Order Multi-processor Decoupled Multiscalar Simulator [6] Yes Yes FastSim [24] Yes Yes  ... 
doi:10.1145/511399.511349 fatcat:kd5ka4oovba4ravjhhkpe3khgm

SQRL

Snehasish Kumar, Arrvindh Shriraman, Vijayalakshmi Srinivasan, Dan Lin, Jordon Phillips
2014 Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14  
The collector runs ahead of the PEs in a decoupled fashion and gathers data objects into the LLC.  ...  Unfortunately, the narrow load/store interfaces of general-purpose processors are not efficient for data structure traversals leading to wasteful instructions, low memory level parallelism, and energy  ...  Specialized co-processor approaches [3, 10, 22] and kilo-instruction processors [20] alike can use SQRL to increase the memory level parallelism in an energy-efficient manner. • SQRL targets supporting  ... 
doi:10.1145/2628071.2628118 dblp:conf/IEEEpact/KumarSS0P14 fatcat:kpikhzfx7zh37io3hofzkce2ii

Micro BTB: A High Performance and Lightweight Last-Level Branch Target Buffer for Servers [article]

Vishal Gupta
2021 arXiv   pre-print
Recent industry trend shows usage of large BTBs (100s of KB per core) that provide performance closer to the ideal BTB along with a decoupled front-end that provides efficient fetch-directed L1I instruction  ...  We observe that not all branch instructions require a full branch target address. Instead, we can store the branch target as a branch offset, relative to the branch instruction.  ...  Recent industry trend shows that modern processors employ a decoupled front-end [5]-[8] with a multi-level BTB design.  ... 
arXiv:2106.04205v2 fatcat:spy73o5abbd45jmuutnccjhaqq

Full-system timing-first simulation

Carl J. Mauer, Mark D. Hill, David A. Wood
2002 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems - SIGMETRICS '02  
Furthermore, we define an approach, called timing-first simulation, that uses an augmented timing simulator to execute instructions important to performance in conjunction with a functional simulator to  ...  To manage simulator complexity, this paper advocates decoupled simulator organizations that separate functional and performance concerns.  ...  For example, each processor in a four processor system would execute approximately 50 million instructions.  ... 
doi:10.1145/511334.511349 dblp:conf/sigmetrics/MauerHW02 fatcat:thhf66fncfcr7f56k264bpym3q

Full-system timing-first simulation

Carl J. Mauer, Mark D. Hill, David A. Wood
2002 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems - SIGMETRICS '02  
Furthermore, we define an approach, called timing-first simulation, that uses an augmented timing simulator to execute instructions important to performance in conjunction with a functional simulator to  ...  To manage simulator complexity, this paper advocates decoupled simulator organizations that separate functional and performance concerns.  ...  For example, each processor in a four processor system would execute approximately 50 million instructions.  ... 
doi:10.1145/511348.511349 fatcat:eyrv6ydcjzetrlkzzhvw4ffwou

Scalable and Flexible heterogeneous multi-core system

Rashmi, Dr. Dinesh
2012 International Journal of Advanced Computer Science and Applications  
Micro architecture contains a set of small and fast cache processors which execute high locality code.  ...  A network of small in-order memory engines use low locality code to improve performance by using instruction level parallelism (ILP).  ...  THE DECOUPLED KILO-INSTRUCTION PROCESSOR (D-KIP) In the D-KIP, two cores used to implement an application.  ... 
doi:10.14569/ijacsa.2012.031227 fatcat:m4vqub3x2fc7jlwm7dsb47ngzy

DeSC

Tae Jun Ham, Juan L. Aragón, Margaret Martonosi
2015 Proceedings of the 48th International Symposium on Microarchitecture - MICRO-48  
We propose Decoupled Supply-Compute (DeSC) as a way to attack memory bottlenecks automatically, while maintaining good portability and low complexity.  ...  Across the evaluated workloads, DeSC offers an average of 2.04x speedup over baseline (on homogeneous CMPs) and 1.56x speedup when a DeSC data supplier feeds data to a hardware accelerator.  ...  Tae Jun Ham was supported in part by a Samsung Fellowship. Prof. Aragón was supported by a fellowship from the Spanish MEC under grant "Subprograma Estatal de Movilidad del Profesorado 2015".  ... 
doi:10.1145/2830772.2830800 dblp:conf/micro/HamAM15 fatcat:eo7ko5m3ivcofokpnjshdqs3vu

A regulated transitive reduction (RTR) for longer memory race recording

Min Xu, Mark D. Hill, Rastislav Bodik
2006 Proceedings of the 12th international conference on Architectural support for programming languages and operating systems - ASPLOS-XII  
We propose a new partial order  ...  Memory race recording is a key technology for multithreaded deterministic replay.  ...  We report the average log growth rate of RTR/CMP in MegaBytes/core/second and Bytes/kilo-instructions.  ... 
doi:10.1145/1168857.1168865 dblp:conf/asplos/XuHB06 fatcat:vkew2esr45fdpmypiwzfgtzmbm

Scaling towards kilo-core processors with asymmetric high-radix topologies

N. Abeyratne, R. Das, Qingkun Li, K. Sewell, B. Giridhar, R. G. Dreslinski, D. Blaauw, T. Mudge
2013 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)  
In this paper, we explore the challenges in scaling on-chip networks towards kilo-core processors.  ...  To address both local and global communication optimizations independently, we decouple the interconnect design using asymmetric high-radix topologies.  ...  Processors with 10 to 100 cores [1, 2, 3, 4, 6] are already in the market today, and a processor with 1000 cores (kilo-core) may soon be a reality. While off-chip interconnection networks for 100s of  ... 
doi:10.1109/hpca.2013.6522344 dblp:conf/hpca/AbeyratneDLSGDBM13 fatcat:azgwgm33qffgnekwnmqcgk264m

Chameleon

Dong Hyuk Woo, Joshua B. Fryman, Allan D. Knies, Hsien-Hsin S. Lee
2010 ACM Transactions on Architecture and Code Optimization (TACO)  
PEs can also communicate with each other through a mesh network. Communication is fully software controlled by a communication instruction, which allows each PE to transfer 64-bit or 128-bit data to  ...  Because all communication is fully controlled by explicit instructions and because all execution is fully orchestrated by the host processor, communication patterns are completely deterministic and exempt  ...  To execute instructions in the PE array, the host processor needs to broadcast three-wide 96-bit VLIW instructions to the PEs via an instruction bus (IBus) shown in Figure 1.  ... 
doi:10.1145/1736065.1736068 fatcat:zumt57x4y5fhjjc3ljkrn7lflu

Efficient Data Supply for Parallel Heterogeneous Architectures

Tae Jun Ham, Juan L. Aragón, Margaret Martonosi
2019 ACM Transactions on Architecture and Code Optimization (TACO)  
Drawing from the early decoupled access-execute (DAE) approach [44, 45], recent works evolve and adapt such ideas for modern processors [8, 15, 16, 22, 25, 37].  ...  Until now, most works on decoupled data supply systems have primarily focused on them in single-threaded contexts: a single DDS unit and a single CU operating as a pair.  ...  The kilo-instruction processor [10], Bolt [21], waiting instruction buffer [31], continual flow pipeline [46], EMC [17], and several other previous works [1, 6, 40, 41] explored the potential  ... 
doi:10.1145/3310332 fatcat:ojgjxw2fmnczfik5hmpbaik6cy

Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM

Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun, Jongmoo Choi, Onur Mutlu
2015 2015 International Conference on Parallel Architecture and Compilation (PACT)  
To this end, we propose a hardware-software cooperative data transfer mechanism, Decoupled DMA (DDMA) that provides a specialized low-cost memory channel for IO accesses.  ...  By effectively decoupling accesses for CPU-GPU communication and in-memory communication from CPU accesses, our DDMA-based design achieves significant performance improvement across a wide variety of system  ...  Donghyuk Lee is supported in part by a Ph.D. scholarship from Samsung and the John and Claire Bertucci Graduate Fellowship.  ... 
doi:10.1109/pact.2015.51 dblp:conf/IEEEpact/LeeSACM15 fatcat:sm7bb67vnneyrkqerox66ck7ve

A Two-Level Load/Store Queue Based on Execution Locality

Miquel Pericàs, Adrian Cristal, Francisco J. Cazorla, Ruben González, Alex Veidenbaum, Daniel A. Jiménez, Mateo Valero
2008 2008 International Symposium on Computer Architecture  
Such cores will need to address the memory-wall by implementing kilo-instruction windows.  ...  By exploiting locality among loads and stores, ELSQ outperforms even an idealized central LSQ when implemented on top of a decoupled processor design.  ...  This concept has been used to propose decoupled processor designs. A first core executes these high locality instructions just after decode.  ... 
doi:10.1109/isca.2008.10 dblp:conf/isca/PericasCCGVJV08 fatcat:q4v55zqrcberhanqvurr4kc3xm