Programming Abstractions and Optimization Techniques for GPU-based Heterogeneous Systems [thesis]

Lu Li
CPU/GPU heterogeneous systems have shown remarkable advantages in performance and energy consumption compared to homogeneous ones such as standard multi-core systems. Such heterogeneity represents one of the most promising trends for the near-future evolution of high-performance computing hardware. However, as a double-edged sword, heterogeneity also brings significant programming complexity that prevents the easy and efficient use of such systems. In this thesis, we are interested in four fundamental complexities associated with these heterogeneous systems: measurement complexity (the effort required to measure a metric, e.g., energy), CPU-GPU selection complexity, platform complexity, and data management complexity. We explore new low-cost programming abstractions to hide these complexities, and propose new optimization techniques that can be performed under the hood. Regarding measurement complexity: although measuring time is trivial with native library support, measuring energy consumption, especially on systems with GPUs, is complex because of the low-level details involved, such as choosing the right measurement method, handling the trade-off between sampling rate and accuracy, and switching between different measurement metrics. We propose a clean interface, with an implementation, that not only hides the complexity of energy measurement but also unifies different kinds of measurements. This unification bridges the gap between time measurement and energy measurement: if no metric-specific assumptions are built into a time optimization technique, energy optimization can be performed by reusing that technique unchanged. For the CPU-GPU selection complexity, which concerns the efficient utilization of heterogeneous hardware, we propose a new adaptive-sampling-based mechanism for constructing selection predictors; it adapts to different hardware platforms automatically and shows non-trivial advantages over random sampling. For the platform complexity, we propose a new modular platform modeling language, with an implementation, to formally and systematically describe a computer system, enabling zero-overhead platform information queries for high-level software tool chains and for programmers, as a basis for making software adaptive.
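As a rough illustration of such a unifying measurement interface, the sketch below puts time and energy measurement behind one abstract meter, so that tuning code written against the interface is metric-agnostic. All names here are hypothetical, not the thesis's actual API, and the energy meter is stubbed because real sampling (e.g., via CPU energy counters or a GPU power sensor) is platform-specific.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Common interface: every meter returns a scalar cost for a code region.
struct Meter {
    virtual double measure(const std::function<void()>& region) = 0;
    virtual ~Meter() = default;
};

// Time is trivial to measure with the standard clock.
struct TimeMeter : Meter {
    double measure(const std::function<void()>& region) override {
        auto t0 = std::chrono::steady_clock::now();
        region();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }
};

// Energy would wrap a platform-specific source; stubbed here, since
// sampling details (rate, accuracy) are hardware-dependent.
struct EnergyMeter : Meter {
    double measure(const std::function<void()>& region) override {
        double joules = 0.0;  // accumulate sensor readings around region()
        region();
        return joules;
    }
};

// A metric-agnostic tuner: picks the cheapest variant, regardless of
// whether the meter reports seconds or joules.
int pick_best(Meter& m, const std::vector<std::function<void()>>& variants) {
    int best = 0;
    double best_cost = m.measure(variants[0]);
    for (int i = 1; i < static_cast<int>(variants.size()); ++i) {
        double c = m.measure(variants[i]);
        if (c < best_cost) { best_cost = c; best = i; }
    }
    return best;
}
```

Because `pick_best` only sees the `Meter` interface, swapping `TimeMeter` for `EnergyMeter` turns a time-tuning routine into an energy-tuning one without changing the tuner itself.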
For the data management complexity, we propose a new mechanism that provides a unified memory view on heterogeneous systems with separate memory spaces. This mechanism enables programmers to write significantly less code that runs as fast as expert-written code and outperforms the current commercially available solution, Nvidia's Unified Memory. We further propose two data movement optimization techniques, lazy allocation and transfer fusion, both based on adaptively merging messages to reduce data transfer latency. We show that these techniques can be beneficial, and we prove that our greedy fusion algorithm is optimal. Finally, we show that our approaches to the different complexities can be combined, so that programmers can use them simultaneously. This research has been partly funded by two EU FP7 projects (PEPPHER and EXCESS) and by SeRC.

Popular Science Summary (translated from Swedish)

We live in a society where science and technology advance at an ever faster pace, and where computers are now everywhere. People have become dependent on computers for their daily work as well as for entertainment and communication. Modern life can hardly be imagined without computers. In our highly computerized society, productivity and welfare would suffer significantly if computer performance faltered. Faster computers can also open new avenues for research in other areas, such as the deep learning technology that enables self-driving cars. They can also facilitate discoveries in other scientific disciplines: for example, by running larger simulations, more precise experimental data can be generated, which however requires a faster workstation or even a supercomputer. Since computers are so important and we keep moving more tasks onto them, science and engineering are working hard to further improve computer performance.
To reach this goal, software and hardware must cooperate better. In the past, software could automatically profit from faster hardware in each generation and thereby become faster by itself. But those good old days are over. Worse still: faster computers also consume considerably more energy, which creates new problems for society. The solution that the hardware industry has adopted since around 2005 is the transition to multi- and many-core computer architectures, i.e., parallel, distributed, and usually heterogeneous computer systems in which ordinary processors (CPUs) are complemented by graphics processors (GPUs) or other forms of programmable hardware accelerators. These systems require complex programming and careful, resource-aware optimization of the program code for performance and energy efficiency. It is a great challenge for software engineers to create fast code for these complex computer architectures that can meet modern society's steadily growing performance demands. Moreover, the rapid pace of hardware development can make existing software incompatible or inefficient on new hardware generations. In summary, there are four main problems: (1) It is hard to write efficient program code. (2) For existing performance-critical code, it is hard to guarantee that it can run at all on each new hardware generation. (3) Even if the code itself is portable, it is hard to automatically maintain its level of efficiency on the next hardware generation. (4) We need methods that can optimize not only a program's execution time but also its energy consumption. In this thesis, we explore programming abstractions (e.g., for software components) and techniques for heterogeneous computer systems that address these problems. Our methods and frameworks relieve the programmer of several important tasks without negatively affecting the software's performance.
(A) One of our approaches automates memory management and optimizes data transfers so that the program executes faster than the hardware vendor's own automated solution. The same approach allows the programmer to write more compact, more readable code that nevertheless executes as efficiently as expert-handwritten code, thereby increasing programmer productivity. (B) We developed a platform description language that makes it easier to systematically describe complex computer systems with their hardware and system software components, and that can promote portability, optimization, and adaptivity of software to the execution platform. (C) We developed a new mechanism for constructing smart predictors that can make program execution adaptive to the execution platform, enable efficient use of the hardware, and show significant improvements over the state-of-the-art solution. (D) We bridge the gap between performance optimization and energy optimization in a way that makes it possible, under certain conditions, to reuse performance optimization techniques to reduce a program's energy consumption. Finally, all of these methods and frameworks can be used simultaneously by integrating them in a suitable way. We make our software prototypes publicly available as open source. In this way they can be used (and in fact already have been used), e.g., by other researchers in our field to handle some of the aforementioned complexities and as building blocks in other research prototypes.

Popular Science Summary

We live in a society where science and technology are evolving faster than ever, and computers are everywhere. People rely on computers to perform their daily jobs and for entertainment. Modern life is hard to imagine without computers. In the heavily computerized society we live in, productivity and welfare would suffer significantly if computers ran slowly.
Moreover, faster computers can unlock the true power of research in other fields, such as the deep learning technology that enables self-driving cars. They can also facilitate discoveries in other scientific areas: for example, more precise experimental data can be obtained by running larger simulations, which requires a faster workstation or even a supercomputer. Since computers are so important and we keep putting more tasks on them, scientists and engineers are working hard to further improve their performance. To achieve this goal, software and hardware must collaborate. In the past, software could rely on faster hardware in each generation, thereby becoming faster automatically. But the good old days are gone, possibly forever. To make things worse, faster computers also consume significantly more energy. The alternative is to introduce multi-core and many-core designs, which scale performance at a sustainable energy cost but require parallel and distributed programming of often heterogeneous systems with GPUs, along with careful optimization for performance and energy efficiency. Producing fast software on these complex parallel computers, to meet the insatiable needs of society, is very challenging for software engineers, even before considering that fast-evolving hardware may break, or run very inefficiently, the software already produced. In summary, there are four main problems: 1) it is hard to produce fast software; 2) for existing high-performance software, it is hard to guarantee that it can still run on each new generation of hardware; 3) it is hard to automatically maintain its efficiency on each new generation of hardware; 4) we need methods that reduce the energy consumption of software in addition to making it faster. In this thesis, we explore new programming abstractions (for software components) and techniques to tackle these problems.
We remove four important responsibilities (handling of measurement complexity, CPU/GPU selection, platform complexity, and data management) from software engineers without sacrificing software performance. Our framework VectorPU enables software engineers to write significantly less code with the same efficiency as expert-written code, resulting in a productivity boost; it also allows software to run significantly faster than the current commercially available solution. We design a new platform description language, XPDL, to systematically describe a computer system and to protect software from breaking on different machines, and possibly on future computers. We design a new construction mechanism for smart predictors that makes software execution adaptive to different machines, allows efficient hardware utilization, and shows non-trivial advantages over the state-of-the-art solution. We bridge the gap between performance optimization and energy optimization, so that if no metric-specific assumptions are built into a time optimization technique, we can easily reuse it to reduce energy consumption instead. Finally, we gain all of these benefits simultaneously by integrating the approaches in meaningful ways. We make our software framework prototypes available as open source, so that they can help (and already have helped) other researchers to tackle these complexities and to generate new knowledge.
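The unified memory view mentioned above can be illustrated with a small coherence-tracking container: each array keeps a host and a device copy plus validity flags, and accessors copy data lazily only when the requested side is stale. This is only a sketch under stated assumptions, not VectorPU's actual API, and "device" memory is simulated with a second host buffer so the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

template <typename T>
class UnifiedVector {
    std::vector<T> host_, device_;   // device_ stands in for GPU memory
    bool host_valid_ = true, device_valid_ = false;
public:
    explicit UnifiedVector(std::size_t n) : host_(n), device_(n) {}

    // Writable host view: refreshes the host copy if stale, then
    // invalidates the device copy, since the host may modify the data.
    T* host_write() {
        if (!host_valid_) { host_ = device_; host_valid_ = true; }
        device_valid_ = false;
        return host_.data();
    }

    // Read-only device view: uploads only if the device copy is stale,
    // so repeated device reads cost no extra transfers.
    const T* device_read() {
        if (!device_valid_) { device_ = host_; device_valid_ = true; }
        return device_.data();
    }
};
```

The point of the design is that the programmer only asks for a view on one side; the container decides whether a transfer is actually needed, which is where redundant copies are eliminated automatically.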
doi:10.3384/diss.diva-145304