AISC: Approximate Instruction Set Computer [article]

Alexandra Ferreron, Jesus Alastruey-Benede, Dario Suarez-Gracia, Ulya R. Karpuzcu
2018 arXiv   pre-print
This paper makes the case for a single-ISA heterogeneous computing platform, AISC, where each compute engine (be it a core or an accelerator) supports a different subset of the very same ISA. An ISA subset may not be functionally complete, but the union of the (per compute engine) subsets renders a functionally complete, platform-wide single ISA. Tailoring the microarchitecture of each compute engine to the subset of the ISA that it supports can easily reduce hardware complexity. At the same time, the energy efficiency of computing can improve by exploiting algorithmic noise tolerance: by mapping code sequences that can tolerate (any potential inaccuracy induced by) the incomplete ISA-subsets to the corresponding compute engines.
arXiv:1803.06955v1 fatcat:lw2arxhptzegbbo7tiqmkrluv4
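
The coverage property described in the AISC abstract above can be illustrated with a small sketch. The opcode names and the two engine subsets are hypothetical (the abstract does not list them), and the sketch only checks exact opcode support; it does not model the approximate mapping that the paper exploits.

```python
# Hypothetical opcodes and engines, chosen only for illustration.
FULL_ISA = frozenset({"add", "sub", "mul", "div", "load", "store", "branch"})

ENGINE_SUBSETS = {
    "core0": frozenset({"add", "sub", "load", "store", "branch"}),
    "accel0": frozenset({"add", "mul", "div", "load", "store"}),
}


def platform_is_complete(subsets, full_isa):
    """Return True when the union of the per-engine subsets covers the ISA."""
    covered = frozenset().union(*subsets.values())
    return covered >= full_isa


def engines_for(sequence, subsets):
    """Return the sorted names of engines supporting every opcode in sequence."""
    needed = set(sequence)
    return sorted(name for name, ops in subsets.items() if needed <= ops)


if __name__ == "__main__":
    print(platform_is_complete(ENGINE_SUBSETS, FULL_ISA))  # True
    print(engines_for(["mul", "add"], ENGINE_SUBSETS))      # ['accel0']
    print(engines_for(["mul", "branch"], ENGINE_SUBSETS))   # []
```

The last call returns an empty list: no single engine in this toy platform supports both `mul` and `branch`, so such a sequence would have to be split across engines or, as the abstract suggests, mapped approximately.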

Analytical Model of Memory-Bound Applications Compiled with High Level Synthesis [article]

Maria A. Dávila-Guzmán and Rubén Gran Tejero and María Villarroya-Gaudó and Darío Suárez Gracia
2020 arXiv   pre-print
The increasing demand for dedicated accelerators to improve energy efficiency and performance has highlighted FPGAs as a promising option to deliver both. However, programming FPGAs in hardware description languages requires considerable time and effort to achieve optimal results, which discourages many programmers from adopting this technology. High Level Synthesis tools improve the accessibility of FPGAs, but the optimization process is still time-consuming due to the long compilation time, between minutes and days, required to generate a single bitstream. Whereas placing and routing take most of this time, the RTL pipeline and memory organization are known in seconds. This early information about the organization of the upcoming bitstream is enough to provide an accurate and fast performance model. This paper presents an analytical performance model for HLS designs focused on memory-bound applications. With a careful analysis of the generated memory architecture and DRAM organization, the model predicts the execution time with a maximum error of 9.2%. Compared with previous works, our predictions reduce the estimation error by at least 2× on average.
arXiv:2003.13054v1 fatcat:mfajkur5gzgupaypcnru34ue3m

Reducción del Efecto BTI en el Banco de Registros de las GPU (Reducing the BTI Effect in the GPU Register File)

Alejandro Valero, Francisco Candel, Salvador Petit, Darío Suárez-Gracia, Julio Sahuquillo
2017 Zenodo  
In this case we add the Off state, which refers to the amount of time a cell remains powered off thanks to the proposal.  ...  b) Maximum duty cycle '1' Departamento de Informática e Ingeniería de Sistemas, Instituto Universitario de Investigación en Ingeniería de Aragón, Universidad de Zaragoza, e-mails: {alvabre, dario  ... 
doi:10.5281/zenodo.897724 fatcat:owg33qfuo5cmtnmfcyezi2sahy

Thermal Intelligent Control DVFS for Cyber-physical Systems

Pablo Hernandez Almudi, Eduardo Montijano, Darío Suárez Gracia
2022 Jornada de Jóvenes Investigadores del I3A  
This work explores the use of well-known control techniques to tackle the heat dissipation problem in embedded devices. These devices lack good solutions for managing heat, and the existing ones may lead to unpredictable performance variations. We show a way of solving the problem with classic control theory and workload identification.
doi:10.26754/jjii3a.20227005 fatcat:i7t43vc25zhcjlx3jl6q755sxy

DC-Patch: A Microarchitectural Fault Patching Technique for GPU Register Files

Alejandro Valero, Dario Suarez-Gracia, Ruben Gran-Tejero
2020 IEEE Access  
The ever-increasing parallelism demand of General-Purpose Graphics Processing Unit (GPGPU) applications pushes toward larger and more energy-hungry register files in successive GPU generations. Reducing the supply voltage beyond its safe limit is an effective way to improve the energy efficiency of register files. However, at these operating voltages, the reliability of the circuit is compromised. This work aims to tolerate permanent faults from process variations in large GPU register files operating below the safe supply voltage limit. To do so, this paper proposes a microarchitectural patching technique, DC-Patch, exploiting the inherent data redundancy of applications to compress registers at run-time with neither compiler assistance nor instruction set modifications. Instead of disabling an entire faulty register file entry, DC-Patch leverages the reliable cells within a faulty entry to store compressed register values. Experimental results show that, with more than a third of faulty register entries, DC-Patch ensures a reliable operation of the register file and reduces the energy consumption by 47% with respect to a conventional register file working at nominal supply voltage. The energy savings are 21% compared to a voltage noise smoothing scheme operating at the safe supply voltage limit. These benefits are obtained with less than 2% and 6% impact on the system performance and area, respectively.
doi:10.1109/access.2020.3025899 fatcat:bvwhzkmssjhodd7ji3zour3cc4

Exploración de Métodos Formales para la Gestión de la Energía y la Temperatura en Sistemas Operativos (Exploring Formal Methods for Energy and Temperature Management in Operating Systems)

Pablo Hernández Almudi, Eduardo Montijano, Darío Suárez Gracia
2018 Zenodo  
In turn, thanks to the use of the math library, these multiplications are performed simultaneously on several cores. B.  ...  10: Execution result K=100k Ti=180 DIIS-I3A-Univ. de Zaragoza, e-mail: 2 DIIS-I3A-Univ. de Zaragoza, e-mail: 3 DIIS-I3A-Univ. de Zaragoza, HiPEAC, e-mail: dario  ... 
doi:10.5281/zenodo.1403431 fatcat:iv4z2vdxijfh7gqds3mbj4l7qm

Parallelizing Workload Execution in Embedded and High-Performance Heterogeneous Systems [article]

Jose Nunez-Yanez, Mohammad Hosseinabady, Moslem Amiri, Andrés Rodríguez, Rafael Asenjo, Angeles Navarro, Rubén Gran-Tejero, Darío Suárez-Gracia
2018 arXiv   pre-print
In this paper, we introduce a software-defined framework that enables the parallel utilization of all the programmable processing resources available in a heterogeneous system-on-chip (SoC), including FPGA-based hardware accelerators and programmable CPUs. Two platforms with different architectures are considered, and a single C/C++ source code is used in both of them for the CPU and FPGA resources. Instead of simply using the hardware accelerator to offload a task from the CPU, we propose a scheduler that dynamically distributes the tasks among all the resources to fully exploit all computing devices while minimizing load imbalance. The multi-architecture study compares an ARMv7 and an ARMv8 implementation with different numbers and types of CPU cores and also different FPGA micro-architectures and sizes. We measure that both platforms benefit from having the CPU cores assist FPGA execution at the same level of energy requirements.
arXiv:1802.03316v1 fatcat:ijjbrz6mdjg63muvrkfnftcml4
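
The idea of dynamically distributing tasks among all devices, as stated in the abstract above, can be sketched with a shared work queue from which each device pulls its next chunk as soon as it is idle. This is a generic illustration, not the paper's scheduler: the device names and the worker functions are assumptions, and both "devices" here are ordinary Python functions running in threads.

```python
import queue
import threading


def run_dynamic(chunks, workers):
    """Process chunks by letting idle workers pull from a shared queue.

    workers maps a (hypothetical) device name to a function that handles
    one chunk. Returns {device name: list of results it produced}.
    """
    pending = queue.Queue()
    for chunk in chunks:
        pending.put(chunk)
    done = {name: [] for name in workers}

    def loop(name, fn):
        # Keep taking work until the queue is empty; a faster device
        # simply ends up taking more chunks, which balances the load.
        while True:
            try:
                chunk = pending.get_nowait()
            except queue.Empty:
                return
            done[name].append(fn(chunk))

    threads = [threading.Thread(target=loop, args=item) for item in workers.items()]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return done


if __name__ == "__main__":
    chunks = [[i, i + 1] for i in range(20)]
    result = run_dynamic(chunks, {"cpu": sum, "fpga": sum})
    print({name: len(values) for name, values in result.items()})
```

How many chunks each device ends up with depends on thread timing, so the printed split varies between runs; the only guarantee is that every chunk is processed exactly once.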

RRCD: Redirección de Registros Basada en Compresión de Datos para Tolerar Fallos Permanentes en una GPU (RRCD: Data-Compression-Based Register Redirection to Tolerate Permanent Faults in a GPU) [article]

Yamilka Toca-Díaz, Alejandro Valero, Rubén Gran-Tejero, Darío Suárez-Gracia
2021 arXiv   pre-print
Servicios Informáticos de Geocuba, Universidad de Camagüey, Cuba, e-mail: 2 Dpto. de Informática e Ingeniería de Sistemas, Universidad de Zaragoza, Spain, e-mail: {alvabre,rgran,dario  ...  In this context, the prior work GR-Guard identifies reliable entries that hold useless data at compile time and redirects faulty accesses to those entries at run time thanks  ... 
arXiv:2105.03859v1 fatcat:3c7woexlk5akvmixw2jl5zhe2q

Revisiting LP-NUCA Energy Consumption

Darío Suárez Gracia, Alexandra Ferrerón, Luis Montesano Del Campo, Teresa Monreal Arnal, Víctor Viñals Yúfera
2014 ACM Transactions on Architecture and Code Optimization (TACO)  
Fig. 3. Dynamic energy breakdown of the LP-NUCA activities with parallel tile cache access.  ...  Fig. 8. ADR organization. Size does not indicate the complexity.  ... 
doi:10.1145/2632217 fatcat:dfuzo6cvybcjtihd6bnz7lphfm

RRCD: Redirección de Registros Basada en Compresión de Datos para Tolerar Fallos Permanentes en una GPU (RRCD: Data-Compression-Based Register Redirection to Tolerate Permanent Faults in a GPU)

Yamilka Toca-Díaz, Alejandro Valero, Rubén Gran-Tejero, Darío Suárez-Gracia
2021 Zenodo  
Servicios Informáticos de Geocuba, Universidad de Camagüey, Cuba, e-mail: 2 Dpto. de Informática e Ingeniería de Sistemas, Universidad de Zaragoza, Spain, e-mail: {alvabre,rgran,dario  ...  In this context, the prior work GR-Guard identifies reliable entries that hold useless data at compile time and redirects faulty accesses to those entries at run time thanks  ... 
doi:10.5281/zenodo.4934744 fatcat:v3aqblq54jfjjfs2lmlgd4egra

An Aging-Aware GPU Register File Design Based on Data Redundancy

Alejandro Valero, Francisco Candel, Dario Suarez-Gracia, Salvador Petit, Julio Sahuquillo
2018 IEEE transactions on computers  
Suárez-Gracia are with the Departamento de Informática e Ingeniería de Sistemas, Instituto Universitario de Ingeniería de Aragón, Universidad de Zaragoza, Spain.  ...  E-mails: {alvabre, dario} • F. Candel, S. Petit, and J. Sahuquillo are with the Department of Computer Engineering, Universitat Politècnica de València, Spain.  ... 
doi:10.1109/tc.2018.2849376 fatcat:tmhnfv7f2rcdvpbn7jl54hm36a

Analysis of network-on-chip topologies for cost-efficient chip multiprocessors

Marta Ortín-Obón, Darío Suárez-Gracia, María Villarroya-Gaudó, Cruz Izu, Víctor Viñals-Yúfera
2016 Microprocessors and microsystems  
Abstract: As chip multiprocessors accommodate a growing number of cores, they demand interconnection networks that simultaneously provide low latency, high bandwidth, and low power. Our goal is to provide a comprehensive study of the interactions between the interconnection network and the memory hierarchy to enable a better co-design of both components. We explore the implications of the interconnect choice on overall performance by comparing the behaviour of three topologies (mesh, torus, and ring) and their concentrated versions. Simply choosing the concentrated mesh over the ring improves performance by over 40% in a 64-core chip. The key strength of this work is the holistic analysis of the network-on-chip and the memory hierarchy. Experiments are carried out with a full-system simulator that carefully models the processors (single and multithreaded), memory hierarchy, and interconnection network, and executes realistic parallel and multiprogrammed workloads. We corroborate conclusions from several previous works: network diameter is critical, the concentrated mesh offers the best area-energy-delay trade-off, and traffic is very light and highly unbalanced. We also provide interesting insights about application-specific features that are hidden when studying only average results. We include a fairness analysis for multiprogrammed applications, and refute the idea that memory controller placement greatly affects performance.
 ...  application traces that do not entirely capture the behaviour of a real execution [10, 25, 30, 6]. This work simulates both parallel and multiprogrammed workloads with real applications, carefully modelling all the above-mentioned components. This allows us to study the effect of the interconnection network configuration on the whole system and the real interactions between the memory subsystem and the interconnect. We revisit the comparison of several topologies with our detailed simulation framework to update the results, validate or refute previous conclusions, and complete them with further analysis. We present an analysis of three topologies with varying degrees of complexity, performance, power, and area: mesh, torus, and ring. We model CMPs with 16 and 64 single-threaded cores, including a configuration with 16 4-threaded cores, and explore the effect of modifying the location and number of memory controllers. Our goal is to draw meaningful conclusions on the studied network configurations and study the details, pointing out the best choice from an integrated performance, area, and energy standpoint. The rest of this document is organized as follows: Section 2 presents the related work; Section 3 describes the CMP architecture and the interconnection network configuration; Section 4 introduces the methodology followed in this work; Section 5 describes the qualitative analysis of the topologies; Section 6 explains our simulation results; and Section 7 concludes the paper.
2 Related work
Several publications have highlighted the impact of the network on performance, energy, and chip area. However, only a few papers focus on the comparison of interconnection network configurations. Balfour and Dally present an analysis of how different topologies affect performance, area, and energy efficiency [6]. However, they do not model the memory subsystem, only use synthetic traffic patterns, and do not consider simple topologies like the ring. Gilabert et al. focus on physical synthesis of several networks, but do not simulate real applications or systems larger than 16 cores [16]. Villanueva et al. highlight the importance of a comprehensive simulation framework and present results of the execution of real parallel applications and its close relationship with cache behaviour [41]. Sanchez et al. explore the implications of interconnection network design for CMPs [36]. We complement their results by including a simple topology (ring), multiprogrammed workloads, traffic distribution analysis, the effect of memory controller placement, and the influence of the network topology on fairness.
Many papers propose alternatives to conventional router architectures, topologies, and flow control methods in isolation. However, they do not consider the impact on the overall system and back up the results with network-only simulations of synthetic traffic and traces. Carara et al. revisit circuit switching which, as opposed to packet switching, allows buffer size to be reduced and guarantees throughput and latency [10]; Walter et al. try to avoid hotspots on systems on chip by implementing a distributed access regulation technique that fairly allocates resources for certain modules [42]; Mishra et al. propose a heterogeneous on-chip interconnect that allocates more resources for routers suffering higher traffic, but they only get good results with a mesh topology [33]; Koibuchi et al. find that adding random links to a ring topology results in large performance gains, although they only experiment with a network simulator [25]. All these studies either do not model the whole system, do not include a significant variety of real workloads, or do not experiment with different topologies. Also, most of them only include network-related metrics and fail to report on overall performance, or draw conclusions based on IPC (instructions per cycle), which has been reported to be unsuitable for parallel applications [47].
Another approach consists of designing the network considering the behaviour of the memory subsystem and the coherence protocol. Yoon et al. propose an architecture with parallel physical networks with narrower links and smaller routers that eliminates virtual channels [45]. Seiculescu et al. propose to use two dedicated networks: one for requests and one for replies [37]. Lodde et al. introduce a smaller network for invalidation messages, but only test their design with memory access traces [30]. Agarwal et al. propose embedding small in-network coherence filters inside on-chip routers to dynamically track sharing patterns and eliminate broadcast messages [5]. These studies try to improve the performance of the most commonly used networks, but do not venture into less conventional topologies. Also, they only experiment with a maximum of 16 cores. Krishna et al. propose a system  ... 
doi:10.1016/j.micpro.2016.01.005 fatcat:u6n3inj7f5emjeae7nbnasy6hm
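
The abstract above names network diameter as a critical factor. As a rough illustration, the diameter (worst-case hop count between two routers) of the three topologies can be computed with the standard textbook formulas. The sketch assumes a bidirectional ring and square k × k meshes and tori with one router per node; the function names are illustrative and nothing here is taken from the paper's simulator.

```python
import math


def ring_diameter(n):
    """Worst-case hop count in a bidirectional ring of n routers."""
    return n // 2


def _side(n):
    """Side length k of a square k x k network with n routers."""
    k = math.isqrt(n)
    if k * k != n:
        raise ValueError("n must be a perfect square")
    return k


def mesh_diameter(n):
    """Corner-to-corner distance in a k x k mesh."""
    return 2 * (_side(n) - 1)


def torus_diameter(n):
    """Wrap-around links halve the distance in each dimension."""
    return 2 * (_side(n) // 2)


if __name__ == "__main__":
    for n in (16, 64):
        print(n, ring_diameter(n), mesh_diameter(n), torus_diameter(n))
    # 16 8 6 4
    # 64 32 14 8
```

Under the further assumption that a concentrated topology attaches four cores to each router, a 64-core chip needs only 16 routers, so the 16-router row also approximates the concentrated case.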

Concertina: Squeezing in Cache Content to Operate at Near-Threshold Voltage

Alexandra Ferreron, Dario Suarez-Gracia, Jesus Alastruey-Benede, Teresa Monreal-Arnal, Pablo Ibanez
2016 IEEE transactions on computers  
Darío Suárez-Gracia (S'08, M'12) received the PhD degree in Computer Engineering from the Universidad de Zaragoza, Spain, in 2011.  ...  Suarez Gracia is also a member of the IEEE Computer Society and the Association for Computing Machinery.  ... 
doi:10.1109/tc.2015.2479585 fatcat:ndyxtieje5bavkewwibypqrk2e

Automatic discovery of performance and energy pitfalls in HTML and CSS

Adrian Sampson, Calin Cascaval, Luis Ceze, Pablo Montesinos, Dario Suarez Gracia
2012 2012 IEEE International Symposium on Workload Characterization (IISWC)  
WebChar is a tool for analyzing browsers holistically to discover properties of HTML and CSS that lead to poor performance and high energy consumption. It analyzes a large collection of Web pages to mine a model for their performance based on static attributes of the content. An evaluation on two platforms, a netbook and a smartphone, demonstrates that WebChar can yield actionable conclusions for both content developers and browser implementors.
doi:10.1109/iiswc.2012.6402904 dblp:conf/iiswc/SampsonCCMG12 fatcat:gxtvi2fdi5hxjl3z6rhb6vrvwm

Shrinking L1 Instruction Caches to Improve Energy–Delay in SMT Embedded Processors [chapter]

Alexandra Ferrerón-Labari, Marta Ortín-Obón, Darío Suárez-Gracia, Jesús Alastruey-Benedé, Víctor Viñals-Yúfera
2013 Lecture Notes in Computer Science  
Instruction caches are responsible for a high percentage of the chip energy consumption, becoming a critical issue for battery-powered embedded devices. We can potentially reduce the energy consumption of the first level instruction cache (L1-I) by decreasing its size and associativity. However, demanding applications may suffer a dramatic performance degradation, especially in superscalar multi-threaded processors, where, in each cycle, multiple threads access the L1-I to fetch instructions. We introduce iLP-NUCA (Instruction Light Power NUCA), a new instruction cache that substitutes the conventional L2, improving the Energy-Delay of the system. iLP-NUCA adds a new tree-based transport network topology that reduces latency and energy consumption relative to former LP-NUCA implementations. With iLP-NUCA we reduce the size of the L1-I, outperforming conventional cache hierarchies and reducing the overall consumption, independently of the number of threads.
doi:10.1007/978-3-642-36424-2_22 fatcat:f57hsdiahjgwzfb6mxpodjuzgq