Trace Substitution [chapter]

Hans Vandierendonck, Hans Logie, Koen De Bosschere
2003 Lecture Notes in Computer Science  
We show that trace substitution consistently improves the fetch bandwidth with 0.2 instructions per access.  ...  On a trace cache miss, trace substitution overrides the predicted trace with a cached trace. If the substitution is correct, the fetch bandwidth increases.  ...  Hans Vandierendonck is supported by the Flemish Institute for the Promotion of Scientific-Technological Research in the Industry (IWT).  ... 
doi:10.1007/978-3-540-45209-6_80 fatcat:jcktlpwwevftlas3gg27yxgfae

Memory optimization of dynamic binary translators for embedded systems

Apala Guha, Kim Hazelwood, Mary Lou Soffa
2012 ACM Transactions on Architecture and Code Optimization (TACO)  
For unified cache flushing, we developed a pseudo LRU heuristic to determine which traces to preserve across flushes.  ...  We consider the class of translation-based DBTs and their sources of memory demand: cached translated code, cached auxiliary code and DBT data structures.  ...  Packing long-lived traces into a smaller area offers improved instruction locality.  ... 
doi:10.1145/2355585.2355595 fatcat:hlpbrne63fbcldcanuwjxwdgce

Optimising power efficiency in trace cache fetch unit

J. Hu, N. Vijaykrishnan, M.J. Irwin, M. Kandemir
2007 IET Computers & Digital Techniques  
Our study shows that conventional trace caches (CTC) may increase power consumption in the fetch unit because of the simultaneous access to both the trace cache and the instruction cache, and sequential  ...  SLTC uses both compiler and hardware support to selectively control trace cache lookup and update.  ...  Another two techniques, branch promotion and trace packing, were examined by Patel et al. [11].  ... 
doi:10.1049/iet-cdt:20060170 fatcat:zm2dw64aw5fxjnsadtbayzdoqa

Do Trace Cache, Value Prediction and Prefetching Improve SMT Throughput? [chapter]

Chen-Yong Cher, Il Park, T. N. VijayKumar
2006 Lecture Notes in Computer Science  
SMT's sharing of the instruction storage (i.e., trace cache or i-cache), physical registers, and issue queue impacts the effectiveness of trace cache, value prediction, and prefetching, respectively.  ...  While trace cache, value prediction, and prefetching have been shown to be effective in the single-threaded superscalar, there has been no analysis of these techniques in a Simultaneously Multithreaded  ...  [19] proposed branch promotion and trace packing for improving trace cache bandwidth. To achieve better utilization of trace cache space, Black et al.  ... 
doi:10.1007/11682127_17 fatcat:xsb65e4pcnb37kh3jye2xrjrza

On-the-fly structure splitting for heap objects

Zhenjiang Wang, Chenggang Wu, Pen-Chung Yew, Jianjun Li, Di Xu
2012 ACM Transactions on Architecture and Code Optimization (TACO)  
Sophisticated techniques are needed more than ever to improve an application's spatial and temporal locality.  ...  Technology, Chinese Academy of Sciences With the advent of multicore systems, the gap between processor speed and memory latency has grown worse because of their complex interconnect.  ...  This explains why they have significant performance improvements. 181.mcf, 429.mcf and 472.moldyn all have higher counts on "not promoted" loops and traces.  ... 
doi:10.1145/2086696.2086705 fatcat:tsy2yrsrafdqtdwamvo3bvyily

A Hierarchical Approach to Modeling and Improving the Performance of Scientific Applications on the KSR1

E.L. Boyd, W. Azeem, Hsien-Hsin Lee, Tien-Pao Shih, Shih-Hao Hung, E.S. Davidson
1994 1994 International Conference on Parallel Processing Vol. 3  
allows parallel code to be instrumented to produce a memory reference trace), and K-Cache (which simulates inter-cache communications based on a memory reference trace).  ...  Comparing delivered performance with bounds focuses attention on areas for improvement and indicates how much improvement might be attainable.  ...  K-Trace and K-Cache K-Trace instruments the assembly code of an application and generates memory traces of the code on the KSR1.  ... 
doi:10.1109/icpp.1994.30 dblp:conf/icpp/BoydALSHD94 fatcat:mq3wqmpcpngdpmlpzt2dj2gdne

Superspeculative microarchitecture for beyond AD 2000

M.H. Lipasti, J.P. Shen
1997 Computer  
The experimental, superspeculative microarchitecture Superflow has a potential performance of 9.0 instructions per cycle and realizable performance of 7.3 IPC for the SPEC95 integer suite, without requiring  ...  This research also benefited from discussions with other MIG members: Bryan Black, Yuan Chou, Andrew Huang, Chris Newburn, and Derek Noonburg. Chris Wilkerson coined the name Superflow.  ...  Acknowledgements ONR grants N00014-96-1-0928 and N00014-96-1-0347 and Intel Corp. supported this research.  ... 
doi:10.1109/2.612250 fatcat:ezukhsogtvcnjfga5hqpj5jcsi

Detection of P53 Consensus Sequence: A Novel String Matching With Classes Algorithm

Gıyasettin ÖZCAN
2016 Uludağ University Journal of The Faculty of Engineering  
For efficient solution, we consider bitwise string matching algorithms with classes and present a novel search technique which is based on 64-bit packed variables.  ...  We compare algorithm performance and three architectures with various level CPU parallelism.  ...  Recently, CPU hardware technology presented new cache, branch prediction, and prefetch techniques. Such improvements could revise the performance outputs of the string matching algorithms.  ... 
doi:10.17482/uujfe.21385 doaj:1bb624e9c49b4ec2b80b76cca1b55357 fatcat:dkwel55jhvhftoucnzsftxnidi

Accelerate Cycle-Level Full-System Simulation of Multi-Core RISC-V Systems with Binary Translation [article]

Xuan Guo, Robert Mullins
2020 arXiv pre-print
Cycle-level simulations of RISC-V multi-core processors are possible at more than 20 MIPS, a useful middle ground in terms of accuracy and performance with simulation speeds nearly 100 times those of more  ...  Its functional simulation mode outperforms QEMU and, if desired, it is possible to switch between functional and timing modes at run-time.  ...  . branches or memory accesses, and later replay the trace against a specific model.  ... 
arXiv:2005.11357v1 fatcat:6n6odq53ujgrhjlouzt6cjft7y

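Detection of P53 Consensus Sequence: A Novel String Matching With Classes Algorithm [original Turkish title: P53 KONSENSÜS SEKANSININ YAKALANMASI: SINIF ÖZELLİKLİ YENİ BİR SEKANS EŞLEŞTİRME ALGORİTMASI]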

Gıyasettin ÖZCAN
2016 Uludağ University Journal of The Faculty of Engineering  
For efficient solution, we consider bitwise string matching algorithms with classes and present a novel search technique which is based on 64-bit packed variables.  ...  In order to prevent obstacles based on variable length of the pattern, we search right and left side indexes of P53 and reduce search space.  ...  Recently, CPU hardware technology presented new cache, branch prediction, and prefetch techniques. Such improvements could revise the performance outputs of the string matching algorithms.  ... 
doi:10.17482/uumfd.273970 fatcat:m3lphrbkyzeo5iwo35enhauvky

BOLT: A Practical Binary Optimizer for Data Centers and Beyond [article]

Maksim Panchenko, Rafael Auler, Bill Nell, Guilherme Ottoni
2018 arXiv pre-print
This has motivated recent investigation of practical techniques to improve code layout at both compile time and link time.  ...  Utilizing sample-based profiling, BOLT boosts the performance of real-world applications even for highly optimized binaries built with both feedback-driven optimizations (FDO) and link-time optimizations  ...  We would also like to thank Sergey Pupyrev for his work on improving the basic block layout algorithms used by BOLT.  ... 
arXiv:1807.06735v2 fatcat:kqysvek7ajeyfhcubo4inexnia

The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-ready 1.7GHz 64bit RISC-V Core in 22nm FDSOI Technology [article]

Florian Zaruba, Luca Benini
2019 arXiv pre-print
Our analysis indicates that ISA heterogeneity and simpler cores with a few critical instruction extensions (e.g. packed SIMD) can significantly boost a RISC-V core's compute energy efficiency.  ...  Furthermore, openness promotes the availability of various open-source and commercial implementations.  ...  ACKNOWLEDGMENTS The authors would like to thank Michael Schaffner and Fabian Schuiki for comments that greatly improved the manuscript.  ... 
arXiv:1904.05442v1 fatcat:qawowriv7zawxio4ayy76lngfe

Drndalo: Lightweight Control Flow Obfuscation Through Minimal Processor/Compiler Co-Design [article]

Novak Boskov, Mihailo Isakov, Michel A. Kinsy
2019 arXiv pre-print
We evaluate the security of Drndalo by training classifiers on pairs of obfuscated and unobfuscated binaries.  ...  However, the same technique may be employed by an attacker to analyze the original binaries in order to reverse engineer them and extract exploitable weaknesses.  ...  with side effects to make them transparent.  ... 
arXiv:1912.01560v1 fatcat:34pudt4opngi5nw2x6tfzzkoqe

Boosting mobile GPU performance with a decoupled access/execute fragment processor

José-María Arnau, Joan-Manuel Parcerisa, Polychronis Xekalakis
2012 SIGARCH Computer Architecture News  
A Tegra with perfect caches can provide up to 285% over the non-threaded version.  ...  The CPU/GPU component is only left with a shrinking fraction of the power budget, since most of the energy is consumed by the screen and the antenna.  ...  [6] propose the dynamic formation of warps to deal with diverging branch outcomes. Tarjan et al.  ... 
doi:10.1145/2366231.2337169 fatcat:f7j676py7jhehcl55u2v2k3b5u

Boosting mobile GPU performance with a decoupled access/execute fragment processor

Jose-Maria Arnau, Joan-Manuel Parcerisa, Polychronis Xekalakis
2012 2012 39th Annual International Symposium on Computer Architecture (ISCA)  
A Tegra with perfect caches can provide up to 285% over the non-threaded version.  ...  The CPU/GPU component is only left with a shrinking fraction of the power budget, since most of the energy is consumed by the screen and the antenna.  ...  [6] propose the dynamic formation of warps to deal with diverging branch outcomes. Tarjan et al.  ... 
doi:10.1109/isca.2012.6237008 dblp:conf/isca/ArnauPX12 fatcat:wu7qgj5rqnh2rhftcvo5xdenge