
Store Vectors for Scalable Memory Dependence Prediction and Scheduling

S. Subramaniam, G.H. Loh
The Twelfth International Symposium on High-Performance Computer Architecture, 2006.  
In this paper, we use the idea of dependency vectors from matrix schedulers for non-memory instructions, and adapt them to implement a new dependence prediction algorithm.  ...  For applications that experience frequent memory ordering violations, our "store vector" prediction algorithm delivers an 8.4% speedup over blind speculation (compared to 8.5% for perfect dependence prediction  ...  Acknowledgments This research was sponsored in part by equipment and funding donations from Intel Corporation. We are grateful for the constructive feedback provided by the anonymous reviewers.  ... 
doi:10.1109/hpca.2006.1598113 dblp:conf/hpca/SubramaniamL06 fatcat:eb7n45dcobhhzfsyib6s46cuw4

Multi-objective Prediction based Task Scheduling Method in Cloud Computing

2019 International journal of recent technology and engineering  
Our major goal is to predict optimal node for task scheduling which satisfies objectives like resource utilization and load balancing with accuracy.  ...  Cloud Computing is Internet based computing where one can store and access their personal resources from any computer through Internet.  ...  The scheduler S will be called Z Scalable if lim P m→∞ then N l → C( Unbiased Scheduling (P5): If prediction of scheduling node is not depending up on no of decision vector then the method supports  ... 
doi:10.35940/ijrte.d9702.118419 fatcat:kreyhbrfpjfkjisuvxwknrrp7y

Reducing CPU-GPU Interferences to Improve CPU Performance in Heterogeneous Architectures

Hao Wen, Wei Zhang
2020 Journal of Computing Science and Engineering  
• Twice as many targets • Much more effective storage for history • Much longer history for data dependent behaviors Decoded Uop Cache ~1.5 Kuops Branch Prediction Unit 32k L1 Instruction Cache  ...  Cluster • Memory Unit can service two memory requests per cycle -16 bytes load and 16 bytes store per cycle Memory Control 32kx8-way L1 Data Cache 32 bytes/cycle Load Store Address Store  ... 
doi:10.5626/jcse.2020.14.4.131 fatcat:kxyaponoenc2zgas2mzgdbnqy4

Scalable Rate-Distortion-Computation Hardware Accelerator for MCTF and ME

Yi-hau Chen, Ching-yeh Chen, Chih-chi Cheng, Liang-gee Chen
2006 IEEE International Conference on Multimedia and Expo  
Motion-Compensated Temporal Filtering (MCTF) is an innovative prediction scheme for video coding and the core technology of scalable extension of H.264/AVC.  ...  With the frame-level searching range data reuse and frame-interleaved MB pipelining scheme, external memory bandwidth is reduced by 33%, and 10 Kbits of buffer are saved.  ...  MVP L and MVP R represent the motion vectors from the left and right neighbor frames for the prediction stage, respectively, and so present MVU L and MVU R for the update stage.  ... 
doi:10.1109/icme.2006.262512 dblp:conf/icmcs/ChenCCC06 fatcat:6u73jrhe5fawle2kzypwrdugju

An Overview of H.264 Hardware Encoder Architectures Including Low-Power Features

Ngoc-Mai Nguyen, Duy-Hieu Bui, Nam-Khanh Dang, Edith Beigne, Suzanne Lesecq, Pascal Vivet, Xuan-Tu Tran
2014 REV Journal on Electronics and Communications  
For its efficiency, the H.264 is expected to encode real-time and/or high-definition video. However, the H.264 standard also requires highly complex and long lasting computation.  ...  Besides, with the revolution of portable devices, multimedia chips for mobile environments are more and more developed.  ...  After having inter-prediction results, TQIF memory can be invalidated to store new transformed results for inter module.  ... 
doi:10.21553/rev-jec.72 fatcat:us45zrwuxff3tpf32ms2wn5tse

Architecture scalability of parallel vector computers with a shared memory

E. Dekker
1998 IEEE transactions on computers  
Based on a model of a parallel vector computer with a shared memory, its scalability properties are derived.  ...  Index Terms: Architecture scalability, parallel vector computers, shared memory, sustainable peak performance, theoretical peak performance.  ...  ACKNOWLEDGMENTS The author would like to thank the referees for their valuable comments.  ... 
doi:10.1109/12.677257 fatcat:yz73ntpeb5ajnfumbom5tuuely

Available Task-Level Parallelism on the Cell BE

Alejandro Rico, Alex Ramirez, Mateo Valero
2009 Scientific Programming  
We expect a significant performance impact if task management would run using a fast private memory to store the task dependency graph instead of relying on the cache hierarchy.  ...  High-level programming models which allow the programmer to identify parallel tasks, and the runtime management of the inter-task dependencies, have been identified as a suitable model for programming  ...  Perez and Pieter Bellens from the Barcelona Supercomputing Center and Felipe Cabarcas from the Universitat Politecnica de Catalunya for their collaboration on the study of CellSs.  ... 
doi:10.1155/2009/741282 fatcat:lkqbeei4ovfchbowk7et2kdp2q

A 135 MHz 542 k Gates High Throughput H.264/AVC Scalable High Profile Decoder

Gwo-Long Li, Yu-Chen Chen, Yuan-Hsin Liao, Po-Yuan Hsu, Meng-Hsun Wen, Tian-Sheuan Chang
2012 IEEE transactions on circuits and systems for video technology (Print)  
For decoding flow, this paper proposes a one-pass macroblock-based quality layer decoding flow for SNR scalability and 71% of external memory bandwidth and 66% of macroblock processing cycles can be saved  ...  vector generator to save area cost and decoding time.  ...  First, the memory storage issue due to additional SVC data dependency is addressed in [3] and [4] .  ... 
doi:10.1109/tcsvt.2011.2171213 fatcat:d5pxven3wnbptny32jmijodflm

Task-based parallel H.264 video encoding for explicit communication architectures

Michail Alvanos, George Tzenakis, Dimitrios S. Nikolopoulos, Angelos Bilas
2011 2011 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation  
We use c264, an end-to-end H.264 video encoder for the Cell processor based on x264, to show that exploiting fine-grain parallelism remains challenging and requires significant advancement in runtime support  ...  Future multi-core processors will necessitate exploitation of fine-grain, architecture-independent parallelism from applications to utilize many cores with relatively small local memories.  ...  dependence checking and issue across antidiagonals; (iii) Memory includes memory optimizations and dynamic scheduling.  ... 
doi:10.1109/samos.2011.6045464 dblp:conf/samos/AlvanosTNB11 fatcat:c2c4hminvzfldd7mnbo77fsvy4

Strategies for improving performance and energy efficiency on a many-core

Elkin Garcia, Guang Gao
2013 Proceedings of the ACM International Conference on Computing Frontiers - CF '13  
The research proposed here will provide an analysis of these new scenarios, proposing new methodologies and solutions that leverage these new challenges in order to increase the performance and energy  ...  This new environment has prompted the development of new techniques that seek finer granularity and a greater interplay in the sharing of resources.  ...  ACKNOWLEDGEMENTS This work has been made possible by the generous support of the NSF through research grants CCF-0833122, CCF-0925863, CCF-0937907, CNS-0720531, and OCI-0904534.  ... 
doi:10.1145/2482767.2482779 dblp:conf/cf/GarciaG13 fatcat:n2syt3shvndjxazsdjyggacrka

A Highly Scalable Parallel Implementation of H.264 [chapter]

Arnaldo Azevedo, Ben Juurlink, Cor Meenderinck, Andrei Terechko, Jan Hoogerbrugge, Mauricio Alvarez, Alex Ramirez, Mateo Valero
2011 Lecture Notes in Computer Science  
The demand for computational power increases continuously in the consumer market as it forecasts new applications such as Ultra High Definition (UHD) video [1], 3D TV [2] , and real-time High Definition  ...  In previous work [3] we have proposed the 3D-Wave parallelization strategy for H.264 video decoding.  ...  The authors would like to thank Anirban Lahiri from NXP for his collaboration on the experiments.  ... 
doi:10.1007/978-3-642-24568-8_6 fatcat:wcngzpflerci5aszahnj7cusme

The SARC Architecture

Alex Ramirez, Felipe Cabarcas, Ben Juurlink, Mauricio Alvarez Mesa, Friman Sanchez, Arnaldo Azevedo, Cor Meenderinck, Catalin Ciobanu, Sebastian Isaza, Georgi Gaydadjiev
2010 IEEE Micro  
However, chip multiprocessors (CMPs) often struggle with programmability and scalability issues such as cache coherency and offchip memory bandwidth and latency.  ...  StarSs also allows annotating the task input and output operands, thereby enabling the runtime system to reason about intertask data dependencies when scheduling tasks and data transfers.  ...  For 16 vector lanes, we estimate the efficiency at 84 percent. The reason for this behavior is that we did not use any blocking technique to overlap the local store loads and stores with computation.  ... 
doi:10.1109/mm.2010.79 fatcat:xle4zkaarnbdvlryyq7f674544

Parallel H.264 Decoding on an Embedded Multicore Processor [chapter]

Arnaldo Azevedo, Cor Meenderinck, Ben Juurlink, Andrei Terechko, Jan Hoogerbrugge, Mauricio Alvarez, Alex Ramirez
2009 Lecture Notes in Computer Science  
The results show that our policies combat memory and latency issues with a negligible effect on the performance scalability.  ...  This strategy is based on the observation that inter-frame dependencies have a limited spatial range.  ...  The authors would like to thank Anirban Lahiri from NXP for his collaboration on the experiments.  ... 
doi:10.1007/978-3-540-92990-1_29 fatcat:lxvhhruv4jawhhmkelw3lukpua

SparseNN: An Energy-Efficient Neural Network Accelerator Exploiting Input and Output Sparsity [article]

Jingyang Zhu, Jingbo Jiang, Xizi Chen, Chi-Ying Tsui
2017 arXiv   pre-print
SparseNN is a scalable architecture with distributed memories and processing elements connected through a dedicated on-chip network.  ...  The large computation and memory requirements pose a challenge to the hardware design.  ...  The scalability with rank and the predicted sparsity are better than the traditional truncated SVD scheme.  ... 
arXiv:1711.01263v1 fatcat:ttorqjphzzg5xmrcasorbevbw4


2006 Parallel Processing Letters  
The model is fine grain and provides synchronisation in a distributed register file, making it a promising candidate for scalable chip-multiprocessors.  ...  This paper explores the model's opportunity to provide the simultaneous issue of instructions, required for chip multiprocessors, and discusses the issues of scalability with regard to support structures  ...  Pointers for global and local contexts are required as before but in addition, a pointer must be stored for each thread in order to access its dependent registers and a bit must be stored to define whether  ... 
doi:10.1142/s0129626406002587 fatcat:4khonednobh65av53v7q7d64j4