Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor

Mark Gebhart, Stephen W. Keckler, Brucek Khailany, Ronny Krashinsky, William J. Dally
2012 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture  
Therefore, we propose a unified local memory which can dynamically change the partitioning among registers, cache, and scratchpad on a per-application basis.  ...  These threads require substantial on-chip storage for registers, cache, and scratchpad memory.  ...  This research was funded in part by DARPA contract HR0011-10-9-0008 and NSF grant CCF-0936700.  ...
doi:10.1109/micro.2012.18 dblp:conf/micro/GebhartKKKD12 fatcat:z5zusxqalnbb7j2bvpgjrjhkbe

High-Performance Throughput Computing

S. Chaudhry, P. Caprioli, S. Yip, M. Tremblay
2005 IEEE Micro  
For a T-threaded core, we assume T copies of the register file and T copies of the internal processor registers.  ...  In Figure 6, the y-axis shows the number of additional misses overlapped to memory from a 2-Mbyte unified L2 cache.  ...
doi:10.1109/mm.2005.49 fatcat:wq3ukuhpg5gubkia7bdjwz2nfu

A Comparative Study of Heterogeneous Processor Simulators

Shagufta S., Muhammad Aleem, Muhammad Arshad, Muhammad Azhar
2016 International Journal of Computer Applications  
In 1970's, Gordon Moore perceived that the number of transistors in a processor would double after every 18 months.  ...  In this study, we present a detailed comparative analysis of gem5-gpu, gem5, and multi2sim simulators.  ...  Each compute unit contains a fetch and decode register file, execution lanes, and scratch-pad memory, and coalesce [2]. This architecture also shows on-chip and off-chip memories.  ...
doi:10.5120/ijca2016911316 fatcat:t7532ev45nhu7m3pt4r4fwqjna

Application driven embedded system design

Karthik Ramani, Al Davis
2007 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems - CASES '07  
The key to increasing performance without a commensurate increase in power consumption in modern processors lies in increasing both parallelism and core specialization.  ...  The resulting core running the compiled code delivers a 1.65x throughput improvement over a high performance processor (Pentium 4) while simultaneously achieving an 80x energy-delay improvement over an  ...  In a traditional super-scalar processor, instructions are fetched, decoded, issued and retired. Function units receive operands from a register file and return results to the register file.  ... 
doi:10.1145/1289881.1289902 dblp:conf/cases/RamaniD07 fatcat:ybbpppvjcjgkrivcw54lz66buu

Federation

Michael Boyer, David Tarjan, Kevin Skadron
2010 ACM Transactions on Architecture and Code Optimization (TACO)  
We reuse the large register file in the multi-threaded cores to implement some out-of-order structures and reengineer other large, associative structures into simpler lookup tables.  ...  For applications or phases with more limited parallelism, we describe creating an out-of-order processor on the fly, by federating two neighboring in-order cores.  ...  The unified register file, a portion of the issue queue, and the store buffer are mapped onto multiple banks in the existing in-order register file.  ... 
doi:10.1145/1880043.1880046 fatcat:x6atfaij3bb2hc25owjck3c57a

OUTRIDER

Neal Clayton Crago, Sanjay Jeram Patel
2011 SIGARCH Computer Architecture News  
The key insight is that by decoupling the instruction streams, the processor pipeline can tolerate memory latency in a way similar to out-of-order designs while relying on a low-complexity in-order micro-architecture  ...  We present Outrider, an architecture for throughput-oriented processors that provides memory latency tolerance to improve performance on highly threaded workloads.  ...  Johnson, William Tuohy, Wooil Kim, and the anonymous referees for their feedback.  ... 
doi:10.1145/2024723.2000079 fatcat:2ny5ydqgmffkvglkm2b2v6fxka

OUTRIDER

Neal Clayton Crago, Sanjay Jeram Patel
2011 Proceeding of the 38th annual international symposium on Computer architecture - ISCA '11  
The key insight is that by decoupling the instruction streams, the processor pipeline can tolerate memory latency in a way similar to out-of-order designs while relying on a low-complexity in-order micro-architecture  ...  We present Outrider, an architecture for throughput-oriented processors that provides memory latency tolerance to improve performance on highly threaded workloads.  ...  Johnson, William Tuohy, Wooil Kim, and the anonymous referees for their feedback.  ... 
doi:10.1145/2000064.2000079 dblp:conf/isca/CragoP11 fatcat:w56fto3w4vgoxamabvgcrkb2z4

High Performance Datacenter Networks: Architectures, Algorithms, and Opportunities

Dennis Abts, John Kim
2011 Synthesis Lectures on Computer Architecture  
Acknowledgments First we would like to thank Mark Hill and Michael Morgan for having invited us to write a synthesis lecture and for their support. Many thanks to reviews from Tor M. Aamodt  ...  Although a register file cache was originally proposed to reduce the register file access time [29], in GPUs the register file cache is proposed to reduce register reads and writes.  ...  Register File Cache To reduce the pressure on the register file, Gebhart et al. propose a register file cache [41].  ...
doi:10.2200/s00341ed1v01y201103cac014 fatcat:rjpziqdnezdrnhfiygrg3jdz4m

Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)

Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, Wen-mei Hwu
2012 Synthesis Lectures on Computer Architecture  
Acknowledgments First we would like to thank Mark Hill and Michael Morgan for having invited us to write a synthesis lecture and for their support. Many thanks to reviews from Tor M. Aamodt  ...  Although a register file cache was originally proposed to reduce the register file access time [29], in GPUs the register file cache is proposed to reduce register reads and writes.  ...  Register File Cache To reduce the pressure on the register file, Gebhart et al. propose a register file cache [41].  ...
doi:10.2200/s00451ed1v01y201209cac020 fatcat:ll4uas6lmjbcll5zqzomhcv5vq

Shire: Making FPGA-accelerated Middlebox Development More Pleasant [article]

Moein Khazraee, Alex Forencich, George Papen, Alex C. Snoeren, Aaron Schulman
2022 arXiv   pre-print
We show the benefits of Shire framework by building a firewall based on a large blacklist and porting the Pigasus IDS pattern-matching accelerator in less than a month.  ...  This separation of concerns allows hardware developers to focus on optimizing custom accelerators while freeing software programmers to reuse, configure, and debug accelerators in a fashion akin to software  ...  a match it raises a flag in a register.  ... 
arXiv:2201.08978v1 fatcat:sofxiih4qjddjafvgujils2ruq

Performance characterization of a Quad Pentium Pro SMP using OLTP workloads

Kimberly Keeton, David A. Patterson, Yong Qiang He, Roger C. Raphael, Walter E. Baker
1998 SIGARCH Computer Architecture News  
advances in processor design.  ...  We find that caches are effective at reducing processor traffic to memory; even larger caches would be helpful to satisfy more data requests.  ...  Acknowledgments We thank Seckin Unlu of Intel for his help in deciphering the Pentium Pro hardware counters and interpreting experimental results.  ... 
doi:10.1145/279361.279364 fatcat:krdbxhrf5jfvzhi6akmvpxca4q

A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

Mark Gebhart, Daniel R. Johnson, David Tarjan, Stephen W. Keckler, William J. Dally, Erik Lindholm, Kevin Skadron
2012 ACM Transactions on Computer Systems  
We consider both a hardware-managed caching scheme and a software-managed scheme, where the compiler is responsible for orchestrating all data movement within the register file hierarchy.  ...  Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency.  ...  This research was funded in part by DARPA contract HR0011-10-9-0008 and NSF grant CCF-0936700.  ...
doi:10.1145/2166879.2166882 fatcat:cwh624dhdbbcffra6mr6kkorgu

The Floating-Point Unit of the Jaguar x86 Core

J. Rupley, J. King, E. Quinnell, F. Galloway, K. Patton, P. Seidel, J. Dinh, Hai Bui, A. Bhowmik
2013 2013 IEEE 21st Symposium on Computer Arithmetic  
The FPU issues to the execution units with a dedicated out-of-order, dual-issue scheduler. Execution units source operands from a synthesized physical register file (PRF) and bypass network.  ...  The verification of the unit required complex pseudo-random and formal verification techniques. The Jaguar FPU is built in a 28nm CMOS process.  ...  floating-point consultation; Todd Swanson and Xiang Wu for their verification efforts early in the project; and finally all of the Jaguar core design team for continuing to achieve small miracles.  ... 
doi:10.1109/arith.2013.24 dblp:conf/arith/RupleyKQGPSDBB13 fatcat:whygoiiacrdg3cwoik4asvhyhq

Mamba: A scalable communication centric multi-threaded processor architecture

Gregory A. Chadwick, Simon W. Moore
2012 2012 IEEE 30th International Conference on Computer Design (ICCD)  
Communication is also a key issue in multi-core architecture.  ...  However a fine-grained approach implies many interworking threads and the overhead of synchronising and scheduling these threads can eradicate any scalability advantages a fine-grained program may have  ...  Acknowledgments The authors would like to acknowledge Arnab Banerjee for implementing the interconnect architecture used by the Mamba system and would like to thank Paul Fox, Timothy Jones and Theo Markettos  ... 
doi:10.1109/iccd.2012.6378652 dblp:conf/iccd/ChadwickM12 fatcat:peugqonstvbtlgb64iq67aw344

The Amoeba distributed operating system — A status report

Andrew S Tanenbaum, M Frans Kaashoek, Robbert van Renesse, Henri E Bal
1991 Computer Communications  
Each processor pool consists of a substantial number of CPUs, each with its own local memory and its own network connection.  ...  The pool processor model is more flexible, and provides for a better sharing of resources. The second element in our architecture is the workstation.  ...  In addition, Leendert van Doorn provided valuable feedback about the paper.  ... 
doi:10.1016/0140-3664(91)90058-9 fatcat:4xsejwxxvrdhphbjaw6gmzifom