Supporting non-contiguous processor allocation in mesh-based CMPs using virtual point-to-point links

M Asadinia, M Modarressi, A Tavakkol, H Sarbazi-Azad
2011 2011 Design, Automation & Test in Europe  
In this work, we benefit from the advantages of non-contiguous processor allocation mechanisms, by allowing the tasks of the input application mapped onto disjoint regions (sub-meshes) and then virtually  ...  In this paper, we propose a processor allocation mechanism for run-time assignment of a set of communicating tasks of input applications onto the processing nodes of a Chip Multiprocessor (CMP), when the  ...  Lifting the contiguity condition in non-contiguous allocation is expected to increase processor utilization.  ... 
doi:10.1109/date.2011.5763072 dblp:conf/date/AsadiniaMTS11 fatcat:sexhjoibe5evfavim6x67fmvfa

Energy characteristic of a processor allocator and a network-on-chip

Dawid Zydek, Henry Selvaraj, Grzegorz Borowik, Tadeusz Łuba
2011 International Journal of Applied Mathematics and Computer Science  
Simulation results show that a PA with an IFF allocation algorithm for mesh systems and a torus-based NoC with express-virtual-channel flow control are very energy efficient.  ...  Energy characteristic of a processor allocator and a network-on-chip Energy consumption in a Chip MultiProcessor (CMP) is one of the most important costs.  ...  Acknowledgment This work has been supported by the European Union in the framework of the European Social Fund through the Warsaw University of Technology Development Programme, and by the Ministry of  ... 
doi:10.2478/v10006-011-0029-7 fatcat:cby3q2ucczczzpdknrm5lwk2ly

Scalable locality-conscious multithreaded memory allocation

Scott Schneider, Christos D. Antonopoulos, Dimitrios S. Nikolopoulos
2006 Proceedings of the 2006 international symposium on Memory management - ISMM '06  
Streamflow introduces an innovative design which uses only synchronization-free operations in the most common case of local allocations and deallocations, while requiring minimal, non-blocking synchronization  ...  a shared-memory system with four two-way SMT processors-four state-of-the-art multiprocessor allocators by sizeable margins in our experiments.  ...  Acknowledgments This work is supported by the National Science Foundation (Grants CCR-0346867 and ACI-0312980) the U.S.  ... 
doi:10.1145/1133956.1133968 dblp:conf/iwmm/SchneiderAN06 fatcat:m44spg3warga5dacwcbpmycz6q

Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors

Qingda Lu, Christophe Alias, Uday Bondhugula, Thomas Henretty, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, P. Sadayappan, Yongjian Chen, Haibo Lin, Tin-fook Ngai
2009 2009 18th International Conference on Parallel Architectures and Compilation Techniques  
Using a polyhedral model, the program's localizability is determined by analysis of its index set and array reference functions, followed by non-canonical data layout transformation to reduce non-local  ...  With increasing numbers of cores, future CMPs (Chip Multi-Processors) are likely to have a tiled architecture with a portion of shared L2 cache on each tile and a bank-interleaved distribution of the address  ...  ACKNOWLEDGMENT This research is supported in part by the U.S. National Science Foundation through awards 0403342, 0811781 and 0509467.  ... 
doi:10.1109/pact.2009.36 dblp:conf/IEEEpact/LuABHKRRSCLN09 fatcat:vsvqsjncfna6bkfn4caivb2ghe


Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Chunyue Liu, Glenn Reinman
2012 Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design - ISLPED '12  
In this paper we propose a Buffer-in-NUCA (BiN) scheme with the following contributions: (1) a dynamic interval-based global allocation method to assign spaces to accelerators that can best utilize the  ...  Moreover, no prior work considers the space fragmentation problem in a shared buffer, especially when allocating buffers in a non-uniform cache architecture (NUCA) with distributed cache banks.  ...  This makes physically non-contiguous spaces in NUCA appear to be contiguous, analogous to a typical OS-managed virtual memory.  ... 
doi:10.1145/2333660.2333715 dblp:conf/islped/CongGGLR12 fatcat:mc5upgzoqfe5hl52bavuifexqm

Design tradeoffs for tiled CMP on-chip networks

James Balfour, William J. Dally
2014 25th Anniversary International Conference on Supercomputing Anniversary Volume  
Drawing on insights from our analysis, we present a concentrated mesh topology with replicated subnetworks and express channels which provides a 24% improvement in area efficiency and a 48% improvement  ...  in energy efficiency over other networks evaluated in this study.  ...  Acknowledgments We wish to acknowledge the contributions of Rebecca Schultz and thank the reviewers for insightful comments.  ... 
doi:10.1145/2591635.2667187 fatcat:mbz2ftfiobe6no5jovyvvmaf54

Design tradeoffs for tiled CMP on-chip networks

James Balfour, William J. Dally
2006 Proceedings of the 20th annual international conference on Supercomputing - ICS '06  
Drawing on insights from our analysis, we present a concentrated mesh topology with replicated subnetworks and express channels which provides a 24% improvement in area efficiency and a 48% improvement  ...  in energy efficiency over other networks evaluated in this study.  ...  Acknowledgments We wish to acknowledge the contributions of Rebecca Schultz and thank the reviewers for insightful comments.  ... 
doi:10.1145/1183401.1183430 dblp:conf/ics/BalfourD06 fatcat:yynugxhzxbfj5lgssikysmgnhm

Network-on-Chip with Long-Range Wireless Links for High-Throughput Scientific Computation

Turbo Majumder, Partha Pratim Pande, Ananth Kalyanaraman
2013 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum  
ACKNOWLEDGMENT This work was supported by NSF grant IIS-0916463.  ...  for long-range links arises when allocated nodes are non-contiguous and lie in neighboring quadrants, the mean distance between which is equal to the diameter, as shown in Fig. 3 (b) .  ...  We model the NoC-based multicore platform as a co-processor connected using a PCIe interface.  ... 
doi:10.1109/ipdpsw.2013.72 dblp:conf/ipps/MajumderPK13 fatcat:nazbyn5jr5fl3e5wb2pmtt3sdi

ALLARM: Optimizing sparse directories for thread-local data

Amitabha Roy, Timothy M. Jones
2014 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2014  
In this paper we show how the memory allocation strategy for non-uniform memory access (NUMA) systems can be exploited to remove any coherence-related traffic for thread-local data, as well as removing the  ...  Our strategy is to allocate directory state only on a miss from a node in a different affinity domain from the directory. We call this ALLocAte on Remote Miss, or ALLARM.  ...  A purely software-based is Fensch et al.'s [14] proposed coherence scheme that uses OS support for coherence.  ... 
doi:10.7873/date.2014.091 dblp:conf/date/RoyJ14 fatcat:aput6bfzhnf7bclixsmfop4jvi

Distance-aware round-robin mapping for large NUCA caches

Alberto Ros, Marcelo Cintra, Manuel E. Acacio, Jose M. Garcia
2009 2009 International Conference on High Performance Computing (HiPC)  
We also show that the private cache indexing commonly used in many-core architectures is not the most appropriate for OS-managed distance-aware mapping policies, and propose to employ different bits for  ...  Our policy tries to map the pages accessed by a core to its closest (local) bank, like in a first-touch policy.  ...  Alberto Ros is supported by a research grant from Spanish MEC under the FPU national plan (AP2004-3735).  ... 
doi:10.1109/hipc.2009.5433220 dblp:conf/hipc/RosCAG09 fatcat:scsn6jbwcvesddpspej3bphrda

A scalable micro wireless interconnect structure for CMPs

Suk-Bok Lee, Lixia Zhang, Jason Cong, Sai-Wang Tam, Ioannis Pefkianakis, Songwu Lu, M. Frank Chang, Chuanxiong Guo, Glenn Reinman, Chunyi Peng, Mishali Naik
2009 Proceedings of the 15th annual international conference on Mobile computing and networking - MobiCom '09  
It makes the case for using a two-tier hybrid wireless/wired architecture to interconnect hundreds to thousands of cores in chip multiprocessors (CMPs), where current interconnect technologies face severe  ...  Our simulations show that our protocol suite can reduce the observed latency by 20% to 45%, and consumes power that is comparable to or less than current 2-D wired mesh designs.  ...  This work was supported in part by SRC grant #1796, and the U.S. Army Research Laboratory and the U.K. Ministry of Defense under Agreement Number W911NF-06-3-0001.  ... 
doi:10.1145/1614320.1614345 dblp:conf/mobicom/LeeTPLCGRPNZC09 fatcat:fkelyqnjzndzxenh4uthc2dutu

DASC-DIR: a low-overhead coherence directory for many-core processors

Alberto Ros, Manuel E. Acacio
2014 Journal of Supercomputing  
Current trends point towards future many-core processors being implemented using the hardware-managed, implicitly-addressed, coherent caches memory model.  ...  Communication between cores is performed by writing to and reading from shared memory, and a scalable point-to-point interconnection network is in charge of transmitting messages.  ...  Acknowledgements This work has been supported by the Spanish MINECO, as well as European Commission FEDER funds, under grant "TIN2012-38341-C04-03", and also by the "Fundación Séneca-Agencia de Ciencia  ... 
doi:10.1007/s11227-014-1325-4 fatcat:ae3zlcyoxvfpbcna4oklxon6im

Explicit Communication and Synchronization in SARC

Manolis Katevenis, Vassilis Papaefstathiou, Stamatis Kavadias, Dionisios Pnevmatikatos, Federico Silla, Dimitrios Nikolopoulos
2010 IEEE Micro  
In order to hide IPC latency, when using implicit communication, we need large issue windows in out-of-order-execution processors, or sophisticated data prefetchers, or both.  ...  The fully virtualized and protected user-level API is based on specially marked lines in the scratchpad space that respond as command buffers, counters, or queues.  ...  ACKNOWLEDGMENTS This work was supported by the European Commission in the context of the project SARC (FP6 IP #27648) and partially by the projects UNiSIX (Marie-Curie #509595), I-Cores (IRG #224759) and  ... 
doi:10.1109/mm.2010.77 fatcat:jzsphc2sqrgpfh6rxdgvuswv5y

Performance and power optimization through data compression in Network-on-Chip architectures

Reetuparna Das, Asit K. Mishra, Chrysostomos Nicopoulos, Dongkook Park, Vijaykrishnan Narayanan, Ravishankar Iyer, Mazin S. Yousif, Chita R. Das
2008 High-Performance Computer Architecture  
Thus, the NoC plays a critical role in optimizing the performance and power consumption of such non-uniform cache-based multicore architectures.  ...  These performance benefits in the interconnect translate up to 17% reduction in CPI.  ...  Instead, we use variable-line cache set in order to support cache compression.  ... 
doi:10.1109/hpca.2008.4658641 dblp:conf/hpca/DasMNPNIYD08 fatcat:su5ocscq6bhajeqrlqjueltxmm

The NoX router

Mitchell Hayenga, Mikko Lipasti
2011 Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-44 '11  
This paper proposes the use of a novel coding-based crossbar architecture to perform packet arbitration in parallel with switch traversal.  ...  The new NoX router is compared to traditional sequential and speculative single cycle router implementations on a 64-node CMP mesh.  ...  A 64-node, 8x8 mesh network with 2mm 64-bit interconnection links is assumed in all simulations.  ... 
doi:10.1145/2155620.2155626 dblp:conf/micro/HayengaL11 fatcat:f4eiqclvrfc6jes3u4rn6z2sq4