Instruction pre-processing in trace processors

Q. Jacobson, J.E. Smith
1999 Proceedings Fifth International Symposium on High-Performance Computer Architecture  
In trace processors, a sequential program is partitioned at run time into "traces." A trace is an encapsulation of a dynamic sequence of instructions.  ...  Traces are "pre-processed" to optimize the instructions for execution together. We propose three specific optimizations: instruction scheduling, constant propagation, and instruction collapsing.  ...  That is, a trace of instructions can be optimized via a "pre-process" phase prior to being placed in the trace cache.  ... 
doi:10.1109/hpca.1999.744347 dblp:conf/hpca/JacobsonS99 fatcat:igbwi5hnczgtzm5jm4tayutqwy

A case for shared instruction cache on chip multiprocessors running OLTP

Partha Kundu, Murali Annavaram, Trung Diep, John Shen
2004 SIGARCH Computer Architecture News  
In fact, in a CMP system, an I-cache shared between multiple processors incurs similar miss rate as a dedicated I-cache per processor where the per processor I-cache has the same capacity as the shared  ...  Based on these observations, this paper makes the case for a shared I-cache organization in a CMP, instead of the traditional approach of using a dedicated I-cache per processor.  ...  A 2-P and a 4-P cluster is studied by post-processing the instruction trace records from two and four of the eight processors in the trace, respectively.  ... 
doi:10.1145/1024295.1024297 fatcat:jfknco5jbfcy3hcpn7luajusna

A case for shared instruction cache on chip multiprocessors running OLTP

Partha Kundu, Murali Annavaram, Trung Diep, John Shen
2003 Proceedings of the 2003 workshop on MEmory performance DEaling with Applications, systems and architecture - MEDEA '03  
In fact, in a CMP system, an I-cache shared between multiple processors incurs similar miss rate as a dedicated I-cache per processor where the per processor I-cache has the same capacity as the shared  ...  Based on these observations, this paper makes the case for a shared I-cache organization in a CMP, instead of the traditional approach of using a dedicated I-cache per processor.  ...  A 2-P and a 4-P cluster is studied by post-processing the instruction trace records from two and four of the eight processors in the trace, respectively.  ... 
doi:10.1145/1152923.1024297 fatcat:5wxozscxabfrbiivwyp4a2wfgu

Implementing virtual secure circuit using a custom-instruction approach

Zhimin Chen, Ambuj Sinha, Patrick Schaumont
2010 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems - CASES '10  
The emulation is done by introducing two simple complementary instructions to the processor and applying a secure programming style.  ...  Experiments on a prototype implementation demonstrated that the new countermeasure considerably increases the difficulty of the attacks by 20 times, which is in the same order as the improvement achieved  ...  Besides the balanced instructions, the processor also has another set of instructions that perform the pre-charge operations in the same way as the DRP circuits.  ... 
doi:10.1145/1878921.1878933 dblp:conf/cases/ChenSS10 fatcat:thy3ceorg5bvteb5pudoh2simi

Accurately approximating superscalar processor performance from traces

Kiyeon Lee, Shayne Evans, Sangyeun Cho
2009 2009 IEEE International Symposium on Performance Analysis of Systems and Software  
The dynamic nature of superscalar processors combined with the static nature of traces can lead to large inaccuracies in the results, especially when traces contain only a subset of executed instructions  ...  Our work forms a basis for fast, accurate, and configurable multicore processor simulation using a pre-determined processor core design.  ...  Acknowledgment This work was supported in part by NSF grant CCF-0702236 and an A. Richard Newton Graduate Scholarship from the 45th Design Automation Conf. (DAC).  ... 
doi:10.1109/ispass.2009.4919655 dblp:conf/ispass/LeeEC09 fatcat:hhq4ihnt7jd63dscz3b7xwh7ky

Automated design of application specific superscalar processors

Tejas S. Karkhanis, James E. Smith
2007 SIGARCH Computer Architecture News  
Analytical modeling is applied to the automated design of application-specific superscalar processors.  ...  The output is the set of out-of-order superscalar processors that are Pareto-optimal with respect to performance-energy-area.  ...  Finally, the synthetic instruction trace is simulated with a cycle accurate processor simulator.  ... 
doi:10.1145/1273440.1250712 fatcat:7apekedwtvf5zbkyvbbdb6expa

A survey of new research directions in microprocessors

J. Šilc, T. Ungerer, B. Robic
2000 Microprocessors and microsystems  
Multiscalar and trace processors define several processing cores that speculatively execute different parts of a sequential program in parallel.  ...  Multiscalar processors use a compiler to partition the program segments, whereas a trace processor uses a trace cache to generate dynamically trace segments for the processing cores.  ...  Blocks of instructions are pre-processed before being put in the trace cache, which greatly simplifies processing after they are fetched.  ... 
doi:10.1016/s0141-9331(00)00072-7 fatcat:55y6n4wzijaeppl3l5qp6x2koa

Impact of configurability and extensibility on IPSec protocol execution on embedded processors

R. Potlapally, S. Ravi, A. Raghunathan, R.B. Lee, N.K. Jha
2006 19th International Conference on VLSI Design held jointly with 5th International Conference on Embedded Systems Design (VLSID'06)  
A promising approach for improving performance in embedded systems is to use application-specific instruction set processors that are designed based on configurable and extensible processors.  ...  as instruction and data cache sizes, processor-memory interface width, write buffers, etc., and (b) extending the base instruction set of the processor using custom instructions for both cryptographic  ...  The methodology consists of three phases: (i) gathering real-world IPSec traces 1 in a data collection phase, (ii) using the traces to analyze the performance of IPSec on the Xtensa processor through instruction-set  ... 
doi:10.1109/vlsid.2006.102 dblp:conf/vlsid/PotlapallyRRLJ06 fatcat:67fiirtkefg3de3dqsaliq52la

Automated design of application specific superscalar processors

Tejas S. Karkhanis, James E. Smith
2007 Proceedings of the 34th annual international symposium on Computer architecture - ISCA '07  
Analytical modeling is applied to the automated design of application-specific superscalar processors.  ...  The output is the set of out-of-order superscalar processors that are Pareto-optimal with respect to performance-energy-area.  ...  Finally, the synthetic instruction trace is simulated with a cycle accurate processor simulator.  ... 
doi:10.1145/1250662.1250712 dblp:conf/isca/KarkhanisS07 fatcat:pmvkhmofira2rjnkj46lboxfl4

SLAP: A Split Latency Adaptive VLIW pipeline architecture which enables on-the-fly variable SIMD vector-length [article]

Ashish Shrivastava and Alan Gatherer and Tong Sun and Sushma Wokhlu and Alex Chandra
2021 arXiv   pre-print
The VLIW architecture, the de facto signal processing engine, suffers badly from a breakdown in lockstep execution of scalar and vector instructions.  ...  Various techniques were deployed to improve average memory access latencies, such as speculative pre-fetching and branch-prediction, often leading to high variance in execution time which is unacceptable  ...  It is therefore difficult to hand optimize the loops or successfully use speculative pre-fetching techniques as shown in 1.b.  ... 
arXiv:2102.13301v1 fatcat:j6wol5bddbhwddwhac5rll54rq

Towards optimized packet processing for multithreaded network processor

Yeim-Kuan Chang, Fang-Chen Kuo
2010 2010 International Conference on High Performance Switching and Routing  
In this paper, we investigate several optimization issues and programming techniques that should be considered by the developers to achieve higher packet processing rate on network processors.  ...  How to efficiently program multiple processing elements and utilize various memory modules as well as the hardware resources on network processors are always challenges.  ...  Rule and Trace for Packet Classification: To benchmark these techniques, we use ClassBench [9]  ...  To simulate the search process of HBPS without constructing the needed data structure first, we pre-built  ... 
doi:10.1109/hpsr.2010.5580281 dblp:conf/hpsr/ChangK10 fatcat:4ygf46jjavfurpgvxilpjsfcsy

Trace preconstruction

Quinn Jacobson, James E. Smith
2000 SIGARCH Computer Architecture News  
The trace preconstruction mechanism observes the processor's instruction dispatch stream to detect opportunities for jumping ahead of the processor.  ...  After doing so, the preconstruction mechanism fetches static instructions from the predicted future region of the program, and constructs a set of traces in advance of when they are needed.  ...  The trace processor has 4 processing elements each with a window of 16 instructions (one trace length) for a total window size of 64 instructions.  ... 
doi:10.1145/342001.339653 fatcat:qokalsl5xzfmreloy6x2lwiaka

Instruction Trace Compression for Rapid Instruction Cache Simulation

Andhi Janapsatya, Aleksandar Ignjatovic, Sri Parameswaran, Joerg Henkel
2007 2007 Design, Automation & Test in Europe Conference & Exhibition  
Modern Application Specific Instruction Set Processors (ASIPs) have customizable caches, where the size, associativity and line size  ...  Although compression allows the reduction of the program trace file  ...  Our experimental results  ...  trace files is a time consuming process.  ...  Subsequently, a set of stored pre-processed (compressed) traces is reused many times for different  ...  the right branch corresponds to the trace shown in Figure 3 (a), and the right part of the right branch  ... 
doi:10.1109/date.2007.364389 dblp:conf/date/JanapsatyaIPH07 fatcat:ilg7vb2tgzfxhil6tj7fawi4jm

Trace preconstruction

Quinn Jacobson, James E. Smith
2000 Proceedings of the 27th annual international symposium on Computer architecture - ISCA '00  
The trace preconstruction mechanism observes the processor's instruction dispatch stream to detect opportunities for jumping ahead of the processor.  ...  After doing so, the preconstruction mechanism fetches static instructions from the predicted future region of the program, and constructs a set of traces in advance of when they are needed.  ...  The trace processor has 4 processing elements each with a window of 16 instructions (one trace length) for a total window size of 64 instructions.  ... 
doi:10.1145/339647.339653 fatcat:zmo6qxkldbdpfpgp7bhqms4sui

Execution and Cache Performance of the Scheduled Dataflow Architecture

Joseph Arul, Krishna Kavi, Roberto Giorgi
2000 Journal of universal computer science (Online)  
This paper presents an evaluation of our Scheduled Dataflow (SDF) Processor. Recent focus in the field of new processor architectures is mainly on VLIW (e.g.  ...  Then we investigated the expected cache-memory performance by collecting address traces from programs and using a trace-driven cache simulator (Dinero-IV). We present these results in this paper.  ...  This research is supported in part by NSF grants: CCR 9796310, EIA 9729889, EIA 9820147.  ... 
doi:10.3217/jucs-006-10-0948 dblp:journals/jucs/KaviAG00 fatcat:ayzfhn4jyra4jbhra6jpulqz6m