A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2006; you can also visit the original URL.
The file type is application/pdf.
sequential execution stream and speculatively executes them in parallel on multiple processor cores. ...
Multiplex exploits the similarities between implicit and explicit threading, and provides unified support for the two threading models without additional hardware. ...
Multiplex: Unifying Explicit/Implicit Threading. In this paper, we propose Multiplex, an architecture that unifies explicit and implicit threading on a chip multiprocessor. ...
doi:10.1145/377792.377863
dblp:conf/ics/OoiKPEFV01
fatcat:fuke25apgnbn7eqhxod62zsz3m
Multi-Threaded Processors
[chapter]
2011
Encyclopedia of Parallel Computing
The chip multiprocessor integrates two or more complete processors on a single chip. Every unit of a processor is duplicated and used independently of its copies on the chip. ...
The instruction-level parallelism found in a conventional instruction stream is limited. Studies have shown the limits of processor utilization even for today's superscalar microprocessors. ...
Here thread-level parallelism is utilized, typically in combination with thread-level speculation [15]. ...
doi:10.1007/978-0-387-09766-4_423
fatcat:heb3n2cfwnbi5nvxv5kvxd2xgm
Multithreaded Processors
2002
Computer journal
The chip multiprocessor integrates two or more complete processors on a single chip. Every unit of a processor is duplicated and used independently of its copies on the chip. ...
The instruction-level parallelism found in a conventional instruction stream is limited. Studies have shown the limits of processor utilization even for today's superscalar microprocessors. ...
Here thread-level parallelism is utilized, typically in combination with thread-level speculation [15]. ...
doi:10.1093/comjnl/45.3.320
fatcat:hlkkabuhrzhkrmuyqomzfmc6zm
A survey of processors with explicit multithreading
2003
ACM Computing Surveys
The contexts of two or more threads of control are often stored in separate on-chip register sets. ...
Underutilization of a superscalar processor due to missing instruction-level parallelism can be overcome by simultaneous multithreading, where a processor can issue multiple instructions from multiple ...
speculative multithreading), and chip multiprocessors. ...
doi:10.1145/641865.641867
fatcat:u6x7jdmkfvexnm3culskjsoxwi
A single-chip multiprocessor
1997
Computer
Additionally, having all eight of the CPUs on a single chip allows designers to exploit thread-level parallelism even when threads communicate frequently. ...
The processor core dynamically allocates instruction fetch and execution resources among the different threads on a cycle-by-cycle basis to find as much thread-level and instruction-level parallelism as ...
doi:10.1109/2.612253
fatcat:l645n6krxnaphalnk5w6pogwye
Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency
2007
Synthesis Lectures on Computer Architecture
Figure 4.8: Overall speedup obtained in different ...
Unlike conventional uniprocessors, multicore chips can use TLP, and can therefore also take advantage of threads to utilize parallelism from the traditional large-grain task and process level parallelism ...
Olukotun led the Stanford Hydra project which developed the first chip multiprocessor (multicore chip) with support for thread-level speculation. ...
doi:10.2200/s00093ed1v01y200707cac003
fatcat:qyjilavdhfcmlnc46l5sxg7ssq
A Survey on Hardware and Software Support for Thread Level Parallelism
[article]
2016
arXiv
pre-print
Today's computers are built upon multiple processing cores and run applications consisting of a large number of threads, making runtime thread management a complex process. ...
To support growing massive parallelism, functional components and also the capabilities of current processors are changing and continue to do so. ...
The aim of Cilk is to help programmers to build applications optimized for a maximum level of parallelism on shared-memory multiprocessors (SMPs). ...
arXiv:1603.09274v3
fatcat:75isdvgp5zbhplocook6273sq4
High-Performance Energy-Efficient Multicore Embedded Computing
2012
IEEE Transactions on Parallel and Distributed Systems
With Moore's law supplying billions of transistors on-chip, embedded systems are undergoing a transition from single-core to multicore to exploit this high-transistor density for high performance. ...
The increase in on-chip transistor density exacerbates power/thermal issues in embedded systems, which necessitates novel hardware/software power/thermal management techniques to meet the ever-increasing ...
ACKNOWLEDGMENTS This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the US National Science Foundation (NSF) (CNS-0953447 and CNS-0905308). ...
doi:10.1109/tpds.2011.214
fatcat:vagqmojdsjevvc2u2ewqrcjjpq
Factored operating systems (fos)
2009
ACM SIGOPS Operating Systems Review
The next decade will afford us computer chips with 100's to 1,000's of cores on a single piece of silicon. ...
Contemporary operating systems have been designed to operate on a single core or small number of cores and hence are not well suited to manage and provide operating system services at such large scale. ...
Acknowledgments This work is funded by DARPA, Quanta Computing, Google, and the NSF. We thank Robert Morris and Frans Kaashoek for feedback on this work. ...
doi:10.1145/1531793.1531805
fatcat:vdak4y4dt5cavlcqj7s7q4p3bu
A Survey of Coarse-Grained Reconfigurable Architecture and Design
2019
ACM Computing Surveys
This article reviews the architecture and design of CGRAs thoroughly for the purpose of exploiting their full potential. First, a novel multidimensional taxonomy is proposed. ...
As general-purpose processors have hit the power wall and chip fabrication cost escalates alarmingly, coarse-grained reconfigurable architectures (CGRAs) are attracting increasing interest from both academia ...
Fig. 10. (a) Conventional processor system, (b) expanding the on-chip memory size, (c) PIM at memory interface, and (d) distributed PIM at memory array. ...
doi:10.1145/3357375
fatcat:pqi4d33i6bg45a6llswhwd44qi
Core-Selectability in Chip Multiprocessors
2009
2009 18th International Conference on Parallel Architectures and Compilation Techniques
The centralized structures necessary for the extraction of instruction-level parallelism (ILP) are consuming progressively smaller portions of the total die area of chip multiprocessors (CMP). ...
In addition, it can provide significantly greater throughput to multiprogrammed workloads by providing the potential for the system to transform into a heterogeneous design. ...
CCF-0811707, and funding from Intel and IBM. ...
doi:10.1109/pact.2009.44
dblp:conf/IEEEpact/Najaf-abadiCR09
fatcat:qvlgvgvhdrd5rkxh23xatavjv4
A case for FAME
2010
Proceedings of the 37th annual international symposium on Computer architecture - ISCA '10
The study demonstrates FAME's capabilities: we run a modern parallel benchmark suite on a research operating system, simulate 64-core target architectures with multi-level memory hierarchy timing models ...
To clear up misconceptions about FPGA-based simulation methodologies, we propose a FAME taxonomy to distinguish the cost-performance of variations on these ideas. ...
Thanks to the ROS implementers for their assistance on the RAMP Gold port, and to Kevin Klues in particular for providing page coloring support to facilitate our case study. ...
doi:10.1145/1815961.1815999
dblp:conf/isca/TanWCBAP10
fatcat:4u2ves3qn5ckdhgocyzfplx4z4
A case for FAME
2010
SIGARCH Computer Architecture News
The study demonstrates FAME's capabilities: we run a modern parallel benchmark suite on a research operating system, simulate 64-core target architectures with multi-level memory hierarchy timing models ...
To clear up misconceptions about FPGA-based simulation methodologies, we propose a FAME taxonomy to distinguish the cost-performance of variations on these ideas. ...
Thanks to the ROS implementers for their assistance on the RAMP Gold port, and to Kevin Klues in particular for providing page coloring support to facilitate our case study. ...
doi:10.1145/1816038.1815999
fatcat:v3bebnwebzdmzdutz4d345rzdq
Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads
2007
2007 IEEE International Symposium on Performance Analysis of Systems & Software
The proposed protocol extensions yield a speedup of 1.45 for parallel reduction and 1.29 for matrix transpose on a 16-node DSM multiprocessor when compared to non-active memory baseline systems and achieve ...
In this paper we make the important observation that on a traditional flexible distributed shared memory (DSM) multiprocessor node, equipped with a coherence protocol thread context as in SMTp or a simple ...
(DSM) multiprocessor with nodes capable of running one or two application threads, thereby allowing us to experiment with 16- and 32-way threaded parallel applications. ...
doi:10.1109/ispass.2007.363754
dblp:conf/ispass/KalamkarCH07
fatcat:c4ycfckyozfnho2jxcbllzv7a4
Achieving Programming Model Abstractions for Reconfigurable Computing
2008
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
This enables all platform components to be abstracted into a unified multiprocessor architecture platform. ...
Presently accepted hybrid CPU/FPGA computational models, and access to these computational models via high-level languages, focus on programming language extensions to increase accessibility and portability ...
Two hardware threads running in parallel show a speedup of 7.1 when compared to two software threads time-multiplexing on the CPU. ...
doi:10.1109/tvlsi.2007.912106
fatcat:qepiyqik7nh67df27xjddlzzcm