Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor
2012
2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Therefore, we propose a unified local memory which can dynamically change the partitioning among registers, cache, and scratchpad on a per-application basis. ...
These threads require substantial on-chip storage for registers, cache, and scratchpad memory. ...
This research was funded in part by DARPA contract HR0011-10-9-0008 and NSF grant CCF-0936700. ...
doi:10.1109/micro.2012.18
dblp:conf/micro/GebhartKKKD12
fatcat:z5zusxqalnbb7j2bvpgjrjhkbe
High-Performance Throughput Computing
2005
IEEE Micro
For a T-threaded core, we assume T copies of the register file and T copies of the internal processor registers. ...
In Figure 6, the y-axis shows the number of additional misses overlapped to memory from a 2-Mbyte unified L2 cache. ...
doi:10.1109/mm.2005.49
fatcat:wq3ukuhpg5gubkia7bdjwz2nfu
A Comparative Study of Heterogeneous Processor Simulators
2016
International Journal of Computer Applications
In the 1970s, Gordon Moore perceived that the number of transistors in a processor would double every 18 months. ...
In this study, we present a detailed comparative analysis of gem5-gpu, gem5, and multi2sim simulators. ...
Each compute unit contains a fetch and decode register file, execution lanes, and scratch-pad memory, and coalesce [2]. This architecture also shows on-chip and off-chip memories. ...
doi:10.5120/ijca2016911316
fatcat:t7532ev45nhu7m3pt4r4fwqjna
Application driven embedded system design
2007
Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems - CASES '07
The key to increasing performance without a commensurate increase in power consumption in modern processors lies in increasing both parallelism and core specialization. ...
The resulting core running the compiled code delivers a 1.65x throughput improvement over a high performance processor (Pentium 4) while simultaneously achieving an 80x energy-delay improvement over an ...
In a traditional super-scalar processor, instructions are fetched, decoded, issued and retired. Function units receive operands from a register file and return results to the register file. ...
doi:10.1145/1289881.1289902
dblp:conf/cases/RamaniD07
fatcat:ybbpppvjcjgkrivcw54lz66buu
Federation
2010
ACM Transactions on Architecture and Code Optimization (TACO)
We reuse the large register file in the multi-threaded cores to implement some out-of-order structures and reengineer other large, associative structures into simpler lookup tables. ...
For applications or phases with more limited parallelism, we describe creating an out-of-order processor on the fly, by federating two neighboring in-order cores. ...
The unified register file, a portion of the issue queue, and the store buffer are mapped onto multiple banks in the existing in-order register file. ...
doi:10.1145/1880043.1880046
fatcat:x6atfaij3bb2hc25owjck3c57a
OUTRIDER
2011
SIGARCH Computer Architecture News
The key insight is that by decoupling the instruction streams, the processor pipeline can tolerate memory latency in a way similar to out-of-order designs while relying on a low-complexity in-order micro-architecture ...
We present Outrider, an architecture for throughput-oriented processors that provides memory latency tolerance to improve performance on highly threaded workloads. ...
Johnson, William Tuohy, Wooil Kim, and the anonymous referees for their feedback. ...
doi:10.1145/2024723.2000079
fatcat:2ny5ydqgmffkvglkm2b2v6fxka
OUTRIDER
2011
Proceeding of the 38th annual international symposium on Computer architecture - ISCA '11
The key insight is that by decoupling the instruction streams, the processor pipeline can tolerate memory latency in a way similar to out-of-order designs while relying on a low-complexity in-order micro-architecture ...
We present Outrider, an architecture for throughput-oriented processors that provides memory latency tolerance to improve performance on highly threaded workloads. ...
Johnson, William Tuohy, Wooil Kim, and the anonymous referees for their feedback. ...
doi:10.1145/2000064.2000079
dblp:conf/isca/CragoP11
fatcat:w56fto3w4vgoxamabvgcrkb2z4
High Performance Datacenter Networks: Architectures, Algorithms, and Opportunities
2011
Synthesis Lectures on Computer Architecture
Acknowledgments First we would like to thank Mark Hill and Michael Morgan for having invited us to write a synthesis lecture and for their support. Many thanks to reviews from Tor M. Aamodt ...
Although a register file cache was originally proposed to reduce the register file access time [29], in GPUs the register file cache is proposed to reduce register reads and writes. ...
Register File Cache: To reduce the pressure on the register file, Gebhart et al. propose a register file cache [41]. ...
doi:10.2200/s00341ed1v01y201103cac014
fatcat:rjpziqdnezdrnhfiygrg3jdz4m
Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)
2012
Synthesis Lectures on Computer Architecture
Acknowledgments First we would like to thank Mark Hill and Michael Morgan for having invited us to write a synthesis lecture and for their support. Many thanks to reviews from Tor M. Aamodt ...
Although a register file cache was originally proposed to reduce the register file access time [29], in GPUs the register file cache is proposed to reduce register reads and writes. ...
Register File Cache: To reduce the pressure on the register file, Gebhart et al. propose a register file cache [41]. ...
doi:10.2200/s00451ed1v01y201209cac020
fatcat:ll4uas6lmjbcll5zqzomhcv5vq
Shire: Making FPGA-accelerated Middlebox Development More Pleasant
[article]
2022
arXiv
pre-print
We show the benefits of Shire framework by building a firewall based on a large blacklist and porting the Pigasus IDS pattern-matching accelerator in less than a month. ...
This separation of concerns allows hardware developers to focus on optimizing custom accelerators while freeing software programmers to reuse, configure, and debug accelerators in a fashion akin to software ...
a match it raises a flag in a register. ...
arXiv:2201.08978v1
fatcat:sofxiih4qjddjafvgujils2ruq
Performance characterization of a Quad Pentium Pro SMP using OLTP workloads
1998
SIGARCH Computer Architecture News
advances in processor design. ...
We find that caches are effective at reducing processor traffic to memory; even larger caches would be helpful to satisfy more data requests. ...
Acknowledgments We thank Seckin Unlu of Intel for his help in deciphering the Pentium Pro hardware counters and interpreting experimental results. ...
doi:10.1145/279361.279364
fatcat:krdbxhrf5jfvzhi6akmvpxca4q
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors
2012
ACM Transactions on Computer Systems
We consider both a hardware-managed caching scheme and a software-managed scheme, where the compiler is responsible for orchestrating all data movement within the register file hierarchy. ...
Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. ...
This research was funded in part by DARPA contract HR0011-10-9-0008 and NSF grant CCF-0936700. ...
doi:10.1145/2166879.2166882
fatcat:cwh624dhdbbcffra6mr6kkorgu
The Floating-Point Unit of the Jaguar x86 Core
2013
2013 IEEE 21st Symposium on Computer Arithmetic
The FPU issues to the execution units with a dedicated out-of-order, dual-issue scheduler. Execution units source operands from a synthesized physical register file (PRF) and bypass network. ...
The verification of the unit required complex pseudo-random and formal verification techniques. The Jaguar FPU is built in a 28nm CMOS process. ...
floating-point consultation; Todd Swanson and Xiang Wu for their verification efforts early in the project; and finally all of the Jaguar core design team for continuing to achieve small miracles. ...
doi:10.1109/arith.2013.24
dblp:conf/arith/RupleyKQGPSDBB13
fatcat:whygoiiacrdg3cwoik4asvhyhq
Mamba: A scalable communication centric multi-threaded processor architecture
2012
2012 IEEE 30th International Conference on Computer Design (ICCD)
Communication is also a key issue in multi-core architecture. ...
However a fine-grained approach implies many interworking threads and the overhead of synchronising and scheduling these threads can eradicate any scalability advantages a fine-grained program may have ...
Acknowledgments The authors would like to acknowledge Arnab Banerjee for implementing the interconnect architecture used by the Mamba system and would like to thank Paul Fox, Timothy Jones and Theo Markettos ...
doi:10.1109/iccd.2012.6378652
dblp:conf/iccd/ChadwickM12
fatcat:peugqonstvbtlgb64iq67aw344
The Amoeba distributed operating system — A status report
1991
Computer Communications
Each processor pool consists of a substantial number of CPUs, each with its own local memory and its own network connection. ...
The pool processor model is more flexible, and provides for a better sharing of resources. The second element in our architecture is the workstation. ...
In addition, Leendert van Doorn provided valuable feedback about the paper. ...
doi:10.1016/0140-3664(91)90058-9
fatcat:4xsejwxxvrdhphbjaw6gmzifom