Parallelization and vectorization of ROOT fitting classes

X Valls Pla, L Moneta
2018 Journal of Physics, Conference Series  
We report on the improvements obtained by adding the support for SIMD vectorization and multithreaded parallelization when fitting ROOT histograms and datasets represented by ROOT trees.  ...  These different parallelization tools are applied together when parallelizing the minimization process for solving fitting problems.  ...  In both cases, the map function accumulates the results of the objective function evaluation into one partial result per thread.  ... 
doi:10.1088/1742-6596/1085/3/032024 fatcat:hnbjjx4t35hixhdg4cpowyfv5q

Integrating program optimizations and transformations with the scheduling of instruction level parallelism [chapter]

David A. Berson, Pohua Chang, Rajiv Gupta, Mary Lou Soffa
1997 Lecture Notes in Computer Science  
In this paper we present an integrated approach to scheduling that enables the selective application of optimizations and restructuring transformations by the scheduler when it determines their application  ...  In particular, optimizations for redundancy elimination and restructuring transformations for increasing parallelism are often accompanied with an increase in register pressure.  ...  The maximum number of resources required to exploit all exposed parallelism is given by the minimum number of allocation chains that cover the ReuseR DAG.  ... 
doi:10.1007/bfb0017254 fatcat:bqiapdr5wrarrhnbumionm4sxq

Software Carry-Save: A Case Study for Instruction-Level Parallelism [chapter]

David Defour, Florent de Dinechin
2003 Lecture Notes in Computer Science  
We show that a combination of today's processors, today's compilers, and algorithms written in C using a data representation which exposes parallelism, is able to outperform the reference GMP library which  ...  We observe that the gain is related to a better use of the processor's instruction parallelism.  ...  Acknowledgements The support of Intel and HP through the donation of an Itanium based machine is gratefully acknowledged. Some experiments were also performed thanks to the HP TestDrive program.  ... 
doi:10.1007/978-3-540-45145-7_18 fatcat:egmnt4vpyjbqtjyelovnvybdwy

A New Generation of Task-Parallel Algorithms for Matrix Inversion in Many-Threaded CPUs

Sandra Catalán, Francisco D. Igual, Rafael Rodríguez-Sánchez, José R. Herrero, Enrique S. Quintana-Ortí
2021 Proceedings of the 12th International Workshop on Programming Models and Applications for Multicores and Manycores  
Our algorithms perform a partitioning of the matrix operand into two levels of tasks: The matrix is first divided vertically, by column blocks (or panels), in order to accommodate the standard partial  ...  The results of the experimental evaluation show the performance benefits of the advanced tasking algorithms on an Intel Xeon Gold processor with 20 cores.  ...  Acknowledgments This research was sponsored by projects RTI2018-093684-B-I00 and TIN2017-82972-R of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; and  ... 
doi:10.1145/3448290.3448563 fatcat:ozde23q3gvarhomrkmeytsxjn4

Coarse-grain speculation for emerging processors

Hari K. Pyla
2011 Proceedings of the ACM international conference companion on Object oriented programming systems languages and applications companion - SPLASH '11  
Approach: This work equips programmers with a powerful tool for exploiting parallelism by means of coarse-grain speculation.  ...  ., codeblocks, methods, algorithms) offers a promising programming model for exploiting parallelism for many hard-to-parallelize applications.  ... 
doi:10.1145/2048147.2048215 dblp:conf/oopsla/Pyla11a fatcat:msvxbpqz3jbhrfq5y467wg5xee

A hybrid implementation of Hamming weight

Enric Morancho
2014 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing  
While some implementations expose just scalar parallelism, others expose vector parallelism.  ...  This implementation will be useful on platforms that can exploit both kinds of parallelism simultaneously.  ...  ACKNOWLEDGMENT We thankfully acknowledge the support of the Spanish Ministry of Education (TIN2012-34557) and the European Commission in the context of the HiPEAC3 Network of Excellence (FP7/ICT 287759  ... 
doi:10.1109/pdp.2014.26 dblp:conf/pdp/Llena14 fatcat:aqpmb7kszzfdbaca2ovobffe3m

Exploiting task and data parallelism in ILUPACK's preconditioned CG solver on NUMA architectures and many-core accelerators

José I. Aliaga, Rosa M. Badia, Maria Barreda, Matthias Bollhöfer, Ernesto Dufrechou, Pablo Ezzatti, Enrique S. Quintana-Ortí
2016 Parallel Computing  
For the graphics processor we exploit data parallelism by off-loading the computationally expensive kernels to the accelerator while keeping the numeric semantics of the sequential case.  ...  For the conventional x86 architectures, our approach exploits task parallelism via the OmpSs runtime as well as a message-passing implementation based on MPI, respectively yielding a dynamic and static  ...  Badia was supported by project TIN2012-34557 of MINECO and EU FEDER, and by the Generalitat de Catalunya (contract 2009-SGR-980).  ... 
doi:10.1016/j.parco.2015.12.004 fatcat:y7orzgrtfjhzzjwucyvyi6dbhq

Experiences on the characterization of parallel applications in embedded systems with Extrae/Paraver

Adrian Munera, Sara Royuela, Germán Llort, Estanislao Mercadal, Franck Wartel, Eduardo Quiñones
2020 49th International Conference on Parallel Processing - ICPP  
importance to fully exploit the performance capabilities of parallel embedded architectures.  ...  This paper contributes to the state-of-the-art of analysis tools for embedded systems by: (1) analyzing the particular constraints of embedded systems compared to HPC systems (e.g., static setting, restricted  ...  ACKNOWLEDGMENTS This work has been partially funded from the HP4S (High Performance Parallel Payload Processing for Space) project under the ESA-ESTEC ITI contract № 4000124124/18/NL/CRS.  ... 
doi:10.1145/3404397.3404440 dblp:conf/icpp/MuneraRLMWQ20 fatcat:y4yd5pegwnhqdkg7t2u6drpnxq

Optimizing scientific application loops on stream processors

Li Wang, Xuejun Yang, Jingling Xue, Yu Deng, Xiaobo Yan, Tao Tang, Quan Hoang Nguyen
2008 Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems - LCTES '08  
We evaluate the performance of our compiler framework by actually running nine representative scientific computing kernels on our FT64 stream processor.  ...  Then the three SRF management tasks are solved in a unified manner via graph coloring: (1) placing streams in the SRF, (2) exploiting stream use, and (3) maximizing parallelism.  ...  Stream scheduling captures the sizes and access patterns of stream accesses and the data flow between kernels by partial evaluation.  ... 
doi:10.1145/1375657.1375679 dblp:conf/lctrts/WangYXDYTN08 fatcat:3ifb65yrzzafjobur4chskyyzy

Optimizing scientific application loops on stream processors

Li Wang, Xuejun Yang, Jingling Xue, Yu Deng, Xiaobo Yan, Tao Tang, Quan Hoang Nguyen
2008 SIGPLAN notices  
We evaluate the performance of our compiler framework by actually running nine representative scientific computing kernels on our FT64 stream processor.  ...  Then the three SRF management tasks are solved in a unified manner via graph coloring: (1) placing streams in the SRF, (2) exploiting stream use, and (3) maximizing parallelism.  ...  Stream scheduling captures the sizes and access patterns of stream accesses and the data flow between kernels by partial evaluation.  ... 
doi:10.1145/1379023.1375679 fatcat:omdqm3svlfhppldqfrxkplwtd4

A multigrain Delaunay mesh generation method for multicore SMT-based architectures

Christos D. Antonopoulos, Filip Blagojevic, Andrey N. Chernikov, Nikos P. Chrisochoides, Dimitrios S. Nikolopoulos
2009 Journal of Parallel and Distributed Computing  
The exploitation of medium-grain parallelism allows performance improvement at the single node level.  ...  We focus on Parallel Constrained Delaunay Mesh (PCDM) generation. We exploit coarse-grain parallelism at the subdomain level, medium-grain at the cavity level and fine-grain at the element level.  ...  Acknowledgments This work was supported in part by the following NSF grants: EIA-9972853, ACI-0085963, EIA-0203974, ACI-0312980, Career award CCF-0346867, CNS-0521381 and DOE grant DE-FG02-05ER2568.  ... 
doi:10.1016/j.jpdc.2009.03.009 fatcat:ytds5g2b6jgn3m5mrvxnhgyivi

Just-In-Time GPU Compilation for Interpreted Languages with Partial Evaluation

Juan Fumero, Michel Steuwer, Lukas Stadler, Christophe Dubach
2017 Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments - VEE '17  
Computer systems are increasingly featuring powerful parallel devices with the advent of many-core CPUs and GPUs.  ...  However, exploiting heterogeneous hardware requires the use of low-level programming language approaches such as OpenCL, which is incredibly challenging, even for advanced programmers.  ...  Acknowledgments The authors would also like to thank the anonymous reviewers as well as Roland Schatz, Stefan Marr and Gilles Duboscq for fruitful discussions.  ... 
doi:10.1145/3050748.3050761 dblp:conf/vee/FumeroSSD17 fatcat:xh7trtbg6rhubaby3j2xoyqtqy

Just-In-Time GPU Compilation for Interpreted Languages with Partial Evaluation

Juan Fumero, Michel Steuwer, Lukas Stadler, Christophe Dubach
2017 SIGPLAN notices  
Computer systems are increasingly featuring powerful parallel devices with the advent of many-core CPUs and GPUs.  ...  However, exploiting heterogeneous hardware requires the use of low-level programming language approaches such as OpenCL, which is incredibly challenging, even for advanced programmers.  ...  Acknowledgments The authors would also like to thank the anonymous reviewers as well as Roland Schatz, Stefan Marr and Gilles Duboscq for fruitful discussions.  ... 
doi:10.1145/3140607.3050761 fatcat:svzfaag4evg3hlfnosicgcabvy

A Cluster-as-Accelerator Approach for SPMD-Free Data Parallelism

Maurizio Drocco, Claudia Misale, Marco Aldinucci
2016 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)  
We implemented the proposed approach in SkeDaTo, a prototyping C++ library of data-parallel skeletons exploiting cluster-as-accelerator at the bottom layer of the runtime software stack.  ...  under the perspective of the upcoming exascale era.  ...  ACKNOWLEDGMENT This work has been partially supported by the EU FP7 REPARA project (no. 609666), the EU H2020 Rephrase project (no. 644235) and the 2015-2016 IBM Ph.D. Scholarship program.  ... 
doi:10.1109/pdp.2016.97 dblp:conf/pdp/DroccoMA16 fatcat:bbbvx77tcbhhbhrtdtppjltyfy

Recognising the Capacities of Dynamic Reconfiguration for the QoS Assurance of Running Systems in Concurrent and Parallel Environments

Wei Li, Maolin Tang
2012 2012 Sixth International Symposium on Theoretical Aspects of Software Engineering  
This paper reinvestigates our evaluations, extending them into concurrent and parallel environments by abstracting hardware and software conditions to design an evaluation context.  ...  The rationale is that the impaired QoS caused by inappropriate use of dynamic approaches is unacceptable for such running systems. To predict in advance the impact, the challenge is two-fold.  ...  ACKNOWLEDGEMENT The authors would like to thank Dr. Ian Peake from RMIT University Australia for his proof-reading and other useful comments for the writing of this paper.  ... 
doi:10.1109/tase.2012.10 dblp:conf/tase/LiT12 fatcat:o5gemrl22vbvbiw3j63u4og6hq