MLP Aware Scheduling Techniques in Multithreaded Processors [article]

Murthy Durbhakula
2019 arXiv   pre-print
By observing the MLP available in each thread and by balancing it with available MLP resources in the system the OS will come up with a new schedule of threads for the next quantum that could potentially  ...  In this paper I propose a MLP aware operating system (OS) scheduling algorithm for Multithreaded Multi-core processors.  ...  In this paper I propose Operating System scheduling policies which take memory-level parallelism (MLP) into account while scheduling threads on multithreaded multi-core processors.  ... 
arXiv:1908.04236v1 fatcat:uazkitkg2zduzeaxc2d2raltcu

Heterogeneous multi-core architectures with dynamically reconfigurable processors for wireless communication

Wei Han, Ying Yi, Xin Zhao, Mark Muir, Tughrul Arslan, Ahmet T. Erdogan
2009 2009 IEEE 7th Symposium on Application Specific Processors  
Both WiMAX transmitter and receiver are partitioned and mapped on to the proposed heterogeneous architectures.  ...  In this paper, we introduce new heterogeneous multi-core architectures using coarse-grained dynamically reconfigurable processors.  ...  When the workload is balanced, the area optimization just removes those redundant ICs in the DR cores.  ... 
doi:10.1109/sasp.2009.5226347 dblp:conf/sasp/HanYZMAE09 fatcat:byctjj5fzrekxbgkm2qg73tjaa

Low-overhead load-balanced scheduling for sparse tensor computations

Muthu Baskaran, Benoit Meister, Richard Lethin
2014 2014 IEEE High Performance Extreme Computing Conference (HPEC)  
We achieve around 4-5x improvement in performance over existing parallel approaches and observe "scalable" parallel performance on modern multicore systems with up to 32 processor cores.  ...  Irregular computations over large-scale sparse data are prevalent in critical data applications and they have significant room for improvement on modern computer systems from the aspects of parallelism  ...  Further, the irregular codes are usually memory-bound and spend a lot of time in memory accesses.  ... 
doi:10.1109/hpec.2014.7041006 dblp:conf/hpec/BaskaranML14 fatcat:g7srlfdtrvgltda4fiqrrsmwbi

Code layout optimizations for transaction processing workloads

Alex Ramirez, Luiz André Barroso, Kourosh Gharachorloo, Robert Cohn, Josep Larriba-Pey, P. Geoffrey Lowney, Mateo Valero
2001 Proceedings of the 28th annual international symposium on Computer architecture - ISCA '01  
Finally, we show that better code layout can also provide substantial improvements in the behavior of other memory system components such as the instruction TLB and the unified second-level cache.  ...  However, there has been little research on the impact of software and compiler-level optimizations for improving the behavior of such workloads.  ...  Acknowledgments We would like to thank Jennifer Anderson for her early involvement in this work. We also thank the anonymous reviewers for their comments.  ... 
doi:10.1145/379240.379260 dblp:conf/isca/RamirezBGCLLV01 fatcat:4gowbzd2rrhodhyclvikketqmm

Code layout optimizations for transaction processing workloads

Alex Ramirez, Luiz André Barroso, Kourosh Gharachorloo, Robert Cohn, Josep Larriba-Pey, P. Geoffrey Lowney, Mateo Valero
2001 SIGARCH Computer Architecture News  
Finally, we show that better code layout can also provide substantial improvements in the behavior of other memory system components such as the instruction TLB and the unified second-level cache.  ...  However, there has been little research on the impact of software and compiler-level optimizations for improving the behavior of such workloads.  ...  Acknowledgments We would like to thank Jennifer Anderson for her early involvement in this work. We also thank the anonymous reviewers for their comments.  ... 
doi:10.1145/384285.379260 fatcat:p3k6jgq7wzhgrc7mpo5ynsdkcu

Parallelizing Complex Streaming Applications on Distributed Scratchpad Memory Multicore Architecture

Shin-Kai Chen, Cheng-Yu Hung, Ching-Chih Chen, Chih-Wei Liu
2013 International journal of parallel programming  
It is difficult to exploit all available capabilities and achieve maximal throughput, due to the combined complexity of inter-processor communication, synchronization, and workload balancing.  ...  In this study, we developed an efficient design flow for parallelizing multimedia applications on a distributed scratchpad memory multicore architecture.  ...  Acknowledgments This work was supported in part by the National Science Council, Taiwan, under Grant NSC-102-2220-E-009-013-and Ministry of Economic Affairs, Taiwan, under Grant MOEA-101-EC-17-A-02-S1-202  ... 
doi:10.1007/s10766-013-0256-7 fatcat:f5gwz3str5a2jlnjyhqfshf4my

Accomodating Diversity in CMPs with Heterogeneous Frequencies [chapter]

Major Bhadauria, Vince Weaver, Sally A. McKee
2009 Lecture Notes in Computer Science  
[16] achieve scalable speedups with different processors working in unison by extending OpenMP and hand optimizing codes. Wong et al.  ...  We measure execution time to quantify improvement in delay on high-performance, multithreaded scientific codes.  ... 
doi:10.1007/978-3-540-92990-1_19 fatcat:ptrr36gzczfd5dmsbkcwddsgtu

Improving the scalabiliy of neutron cross-section lookup codes on multicore NUMA system [article]

Kazutomo Yoshii, John Tramm, Andrew Siegel, Pete Beckman
2019 arXiv   pre-print
In addition to the NUMA optimization we evaluate a page-size optimization to XSBench and observe a 1.5x performance improvement, compared with a nonoptimized one.  ...  memory access (NUMA) systems.  ... 
arXiv:1909.03632v1 fatcat:tlg5i6pxg5f3lfgvwwhearbwqu

BOPS, Not FLOPS! A New Metric and Roofline Performance Model For Datacenter Computing [article]

Lei Wang, Jianfeng Zhan, Wanling Gao, KaiYong Yang, ZiHan Jiang, Rui Ren, Xiwen He, Chunjie Luo
2019 arXiv   pre-print
One is the BOPS based system evaluation, we illustrate that BOPS can compare performance of workloads from multiple domains. The other is BOPS based optimizations.  ...  We perform experiments with seventeen DC workloads on three typical Intel processors platforms.  ...  The optimization can be shown in Table 9, we improve the OI of MMK from 3.1 to 3.2, and BOPS from 2.2 G to 2.4 G; To reduce data movement cost, we replace the default malloc algorithm with the Jemalloc  ... 
arXiv:1801.09212v4 fatcat:cr7mbjx4zfbnrcnce4jhhfsudm

Locality-conscious workload assignment for array-based computations in MPSOC architectures

Feihui Li, Mahmut Kandemir
2005 Proceedings of the 42nd annual conference on Design automation - DAC '05  
Programming MPSOCs can be challenging as several potentially conflicting issues such as data locality, parallelism and load balance across processors should be considered simultaneously.  ...  An important characteristic of the proposed approach is that, in deciding the workloads of the processors (i.e., in parallelizing the application) it considers all the loop nests in the application simultaneously  ...  Memory optimizations for embedded systems were addressed, among others, by Shiue and Chakrabarti [17].  ... 
doi:10.1145/1065579.1065609 dblp:conf/dac/LiK05 fatcat:swqiqss4cffv7a5avy7dwydqoe

Locality-conscious workload assignment for array-based computations in MPSOC architectures

Feihui Li, M. Kandemir
2005 Proceedings. 42nd Design Automation Conference, 2005.  
Programming MPSOCs can be challenging as several potentially conflicting issues such as data locality, parallelism and load balance across processors should be considered simultaneously.  ...  An important characteristic of the proposed approach is that, in deciding the workloads of the processors (i.e., in parallelizing the application) it considers all the loop nests in the application simultaneously  ...  Memory optimizations for embedded systems were addressed, among others, by Shiue and Chakrabarti [17].  ... 
doi:10.1109/dac.2005.193780 fatcat:yn7oi5z2ozaqvb2y52znxbicxy

Network Coding on Heterogeneous Multi-Core Processors for Wireless Sensor Networks

Deokho Kim, Karam Park, Won W. Ro
2011 Sensors  
While network coding is well known for its efficiency and usefulness in wireless sensor networks, the excessive costs associated with decoding computation and complexity still hinder its adoption into  ...  The proposed idea is fully tested on one of the currently available heterogeneous multi-core processors referred to as the Cell Broadband Engine.  ...  The performance improvement by well balanced workload is tested in the next subsection.  ... 
doi:10.3390/s110807908 pmid:22164053 pmcid:PMC3231739 fatcat:mtrzhnyffbh6zga2vv2jqn2lt4

Performance of database workloads on shared-memory systems with out-of-order processors

Parthasarathy Ranganathan, Kourosh Gharachorloo, Sarita V. Adve, Luiz André Barroso
1998 Proceedings of the eighth international conference on Architectural support for programming languages and operating systems - ASPLOS-VIII  
However, most current system designs have been optimized to perform well on scientific and engineering workloads.  ...  This paper examines the behavior of database workloads on shared-memory multiprocessors with aggressive out-of-order processors, and considers simple optimizations that can provide further performance  ...  Acknowledgements This paper benefited from discussions with Norm Jouppi, Jack Lo, and Dan Scales, and from comments by the anonymous reviewers.  ... 
doi:10.1145/291069.291067 dblp:conf/asplos/RanganathanGAB98 fatcat:x5qbk25rdzg45gsfimyiwuxmy4

Study and evaluation of an Irregular Graph Algorithm on Multicore and GPU Processor Architectures [article]

Varun Nagpal
2016 arXiv   pre-print
Since the gap between processor and memory performance continues to exist, difficulty to hide and decrease this gap is one of the important factors which results in poor performance of these applications  ...  Such applications have very little computation and unpredictable memory access patterns making them memory-bound in contrast to compute-bound applications.  ...  refined load balancing strategy improves balancing of workload.  ... 
arXiv:1603.02655v1 fatcat:nklt3op66vdfdpmd3ygeckhwla

Blocked All-Pairs Shortest Paths Algorithm on Intel Xeon Phi KNL Processor: A Case Study [chapter]

Enzo Rucci, Armando De Giusti, Marcelo Naiouf
2018 Communications in Computer and Information Science  
While optimizing applications on CPUs, GPUs and first Xeon Phi's has been largely studied in the last years, the new features in Knights Landing processors require the revision of programming and optimization  ...  In this work, we selected the Floyd-Warshall algorithm as a representative case study of graph and memory-bound applications.  ...  Acknowledgments The authors thank the ArTeCS Group from Universidad Complutense de Madrid for letting use their Xeon Phi KNL system.  ... 
doi:10.1007/978-3-319-75214-3_5 fatcat:oj6fqtz5azco5fnp2zqbj7ud4m