A scalable and flexible data synchronization scheme for embedded HW-SW shared-memory systems

Om Prakash Gangwal, André Nieuwland, Paul Lippens
2001 Proceedings of the 14th international symposium on Systems synthesis - ISSS '01  
In this scheme, synchronization primitives are chosen such that they can be implemented efficiently in both hardware and software on distributed shared memory architectures, without the need for atomic  ...  This paper describes the implementation of a data-synchronization scheme that can be used in the functional description and hardware realization of algorithms for heterogeneous multi-processor architectures  ...  We clearly separate synchronization from data transportation since in a shared memory architecture no copying of data is required.  ... 
doi:10.1145/500002.500003 fatcat:dhuvxzosfvamxo7oqrhlgrghxm

A scalable and flexible data synchronization scheme for embedded HW-SW shared-memory systems

Om Prakash Gangwal, André Nieuwland, Paul Lippens
2001 Proceedings of the 14th international symposium on Systems synthesis - ISSS '01  
In this scheme, synchronization primitives are chosen such that they can be implemented efficiently in both hardware and software on distributed shared memory architectures, without the need for atomic  ...  This paper describes the implementation of a data-synchronization scheme that can be used in the functional description and hardware realization of algorithms for heterogeneous multi-processor architectures  ...  We clearly separate synchronization from data transportation since in a shared memory architecture no copying of data is required.  ... 
doi:10.1145/500001.500003 fatcat:6jdsnn4gpjdahnqodahxhiszzy

Generation of Heterogeneous Distributed Architectures for Memory-Intensive Applications Through High-Level Synthesis

Chao Huang, Srivaths Ravi, Anand Raghunathan, Niraj K. Jha
2007 IEEE Transactions on Very Large Scale Integration (VLSI) Systems  
We use a combination of clustering and min-cut style partitioning techniques to yield distributed architectures, based on simulation profiling while considering various factors including data access locality  ...  Synthesis should, therefore, be capable of determining a partitioned architecture, wherein array data and computations may have to be heterogeneously distributed for achieving the best performance speed-up  ...  Their work has motivated our research on memory data organization and optimization.  ... 
doi:10.1109/tvlsi.2007.904096 fatcat:czc256r4zfc7hbe44ir6smrqwu

Massive Parallel Join in NUMA Architecture

Wei He, Minqi Zhou, Xueqing Gong, Xiaofeng He
2013 2013 IEEE International Congress on Big Data  
Compared to traditional on-disk database, IMDB has advantages such as faster access to storage and simpler internal optimization algorithms.  ...  In SMP architecture, threads can communicate through shared memory, thus the optimized join algorithms for SMP need to consider more about the processor synchronization cost when accessing shared memory  ... 
doi:10.1109/bigdata.congress.2013.37 dblp:conf/bigdata/HeZGH13 fatcat:uwwmzrfkjzebpnv2ren4cxzrpm

Long DNA Sequence Comparison on Multicore Architectures [chapter]

Friman Sánchez, Felipe Cabarcas, Alex Ramirez, Mateo Valero
2010 Lecture Notes in Computer Science  
We analyze two different SW implementations on the CellBE and use simulation tools to study the performance scalability in a multicore architecture.  ...  We study the memory organization that delivers the maximum bandwidth with the minimum cost.  ...  [Figure and table residue: T block(b,k), time required to process a block of size b * k; (a) Data dependency, (b) Different optimal regions, (c) Computation distribution; (a) SPEs store data in memory.]  ... 
doi:10.1007/978-3-642-15291-7_24 fatcat:vv2w3yjanjhjrjxrxgzlkayvda

Instruction set extensions for photonic synchronous coalesced accesses

Paul Keltcher, David Whelihan, Jeffrey Hughes
2013 2013 IEEE High Performance Extreme Computing Conference (HPEC)  
on modern architectures.  ...  This operation is described, and its ISA implications explored in the context of the distributed matrix transpose, which exhibits a high degree of data non-locality, and is difficult to efficiently parallelize  ...  Related work, specifically how existing parallel architectures work with distributed data, is discussed in section V, followed by conclusions in section VI. II.  ... 
doi:10.1109/hpec.2013.6670326 dblp:conf/hpec/KeltcherWH13 fatcat:y7fki3y375fsvpdumbgldwy4ze

Re-engineering the ant colony optimization for CMP architectures

José M. Cecilia, José M. García
2019 Journal of Supercomputing  
Moreover, parallel efficiency is provided for all targeted architectures, finding that core load imbalance, memory bandwidth limitations, and NUMA effects on data placement are some of the key factors  ...  In the latter case, the parallel efficiency is affected by the synchronization frequency, which also affects the quality of the solution found by the distributed implementation.  ...  NUMA architectures have a different memory latency depending on the NUMA node accessing the data, and may also vary depending on the consistency state of the accessed data.  ... 
doi:10.1007/s11227-019-02869-8 fatcat:aajgzsgk3rddrpbrnbbnjckrse

A dataflow-like programming model for future hybrid clusters

Jens Breitbart
2013 International Journal of Networking and Computing  
in case the memory consistency model is not optimal.  ...  Broadcast, scatter and gather are modeled based on data distribution among the nodes, whereas reduction and scan follow a combining PRAM approach of having multiple threads write to the same memory location  ...  The synchronization size can differ for different data and the optimal synchronization size depends on the algorithm and hardware used.  ... 
doi:10.15803/ijnc.3.1_15 fatcat:hzcymccayzfs7dt3t6ukkcg274

Architecture optimizations for synchronization and communication on chip multiprocessors

Sevin Fide, Stephen Jenks
2008 Proceedings, International Parallel and Distributed Processing Symposium (IPDPS)  
running on CMPs. Problems: • Synchronization Overhead • Spin Waits • Memory Bandwidth Bottleneck • Many Simultaneous Accesses • Cache Pollution • Data Evictions from Shared Cache • Demand-Based  ...  Data Transfers • Depend on Coherence Mechanisms. Conventional Parallel Programming: • Data parallelism by splitting data across multiple threads • Memory interface is overburdened • Performance  ... 
doi:10.1109/ipdps.2008.4536357 dblp:conf/ipps/FideJ08 fatcat:lclrjbqwlfbvxpvyukfpbhg3e4

Scalable distributed memory embedded system with a low-cost hardware message passing interface

Ha-young Jeong, Won Hur, Yong-surk Lee
2009 IEICE Electronics Express  
In this paper, we propose a scalable distributed memory system with a low-cost hardware message-passing interface.  ...  The proposed interface improves the communication performance between nodes to decrease the synchronization overhead with a receiver reservation technique.  ...  On a distributed memory architecture there are synchronization issues between receive and send signals due to imperfect synchronization.  ... 
doi:10.1587/elex.6.837 fatcat:4nml3maoa5hpdemsgpxgqfxp2m

From algorithm and architecture specifications to automatic generation of distributed real-time executives: a seamless flow of graphs transformations

T. Grandpierre, Y. Sorel
2003 First ACM and IEEE International Conference on Formal Methods and Models for Co-Design, 2003. MEMOCODE '03. Proceedings.  
We present an original architecture model which makes it possible to perform accurate sequencer modeling, memory allocation, and heterogeneous inter-processor communications for both modes shared memory and message  ...  This paper presents a seamless flow of transformations which performs dedicated distributed executive generation from a high level specification of a pair: algorithm, architecture.  ...  Thanks to our architecture model, it is possible to cover a large number of architectures based on various memory and communication networks.  ... 
doi:10.1109/memcod.2003.1210097 dblp:conf/memocode/GrandpierreS03 fatcat:ahxuoranenh7rix3xduw7ojmpe

A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations

Jiayuan Meng, Kevin Skadron
2010 International Journal of Parallel Programming  
Both communication and synchronization may incur significant overhead on parallel architectures with shared memory.  ...  However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in distributed  ...  The concept of computation replication involved in ghost zones is related to data replication and distribution in the context of distributed memory systems [4] [23] , which are used to wisely distribute  ... 
doi:10.1007/s10766-010-0142-5 fatcat:7ygnx3qccbbllivzjfrtvwx4di

Automatic DSP Cache Memory Management and Fast Prototyping for Multiprocessor Image Applications

O. Deforges, Jean-François Nezan, Mickael Raulet, Fabrice Urban
2006 Zenodo  
The parallel aspect of multicomponent architectures raises problems in terms of application distribution: handmade data transfers and synchronizations quickly become very complex and result in lost time  ...  Moreover, when external memory is used without cache, data localisation has a great impact on performance. The distribution of data between external and internal memory is crucial.  ... 
doi:10.5281/zenodo.39900 fatcat:skf3b52qkzc5lhrfus3hrw3d74

P-sync: A Photonically Enabled Architecture for Efficient Non-local Data Access

David Whelihan, Jeffrey J. Hughes, Scott M. Sawyer, Eric Robinson, Michael Wolf, Sanjeev Mohindra, Julie Mullen, Anna Klein, Michelle Beard, Nadya T. Bliss, Johnnie Chan, Robert Hendry (+2 others)
2013 2013 IEEE 27th International Symposium on Parallel and Distributed Processing  
This paper describes a novel synchronized global photonic bus and system architecture called P-sync that uses photonics' distance independence to greatly improve performance on many important applications  ...  The architecture is evaluated in the context of a non-local yet common application: the distributed Fast Fourier Transform.  ...  of optimized architectures for the user code, optimized generated code, and results from a run on target architectures.  ... 
doi:10.1109/ipdps.2013.56 dblp:conf/ipps/WhelihanHSRWMMKBBCHBC13 fatcat:3qw2gowypndh7h32jynxnga7b4

Using RTOS in the AAA Methodology Automatic Executive Generation

O. Deforges, Jean-François Nezan, Mickael Raulet, Ghislain Roquier
2006 Zenodo  
One of them is generic and does not depend on the algorithm. It supports the architecture specification such as memory allocations, sequence synchronizations and also inter-operator transfers.  ...  The optimization problem aims to select the most efficient one among them (real-time constraints, architecture resources...).  ... 
doi:10.5281/zenodo.39917 fatcat:a5wcm3qcwzdojimyx7lwkvhd4y