Variability Expeditions: A Retrospective

Rajesh K. Gupta, Subhasish Mitra, Puneet Gupta
IEEE Design & Test, 2019
UCLA  ExpEditions in Computing arE among the largest and most ambitious projects sponsored by the National Science Foundation that seek to explore far out ideas in computing with potential for significant impact on computing and its industry. In 2009, we conceived of an Expeditions project in response to an alarming trend in the semiconductor industry: the manufactured chips were demonstrating significant variation in their performance and power consumption that was increasing at a rapid pace,
leading designers to overdesign circuits with margins (e.g., margins on speed and voltage, also referred to as guard bands) that often exceeded 40% of the nominal target specifications. This overdesign threatened to make new process nodes ineffective by causing significant loss of yield, effectively wiping out the gains from scaling geometries; a back-of-the-envelope illustration of its cost appears below. In addition to manufacturing variation, computing machines were experiencing increasing variation in operating conditions as these devices proliferated in mobile and wireless applications.

We envisioned an alternate universe in which computing systems would sense their own condition and their environment and adapt to both. Software would drive such adaptation as the underlying components deviated from nominal manufacturing parameters, aged, or simply faced different operating conditions; a minimal code sketch of such a loop appears below. Besides improved efficiency (e.g., in energy) from reduced or eliminated overdesign, such systems would be inherently more reliable and available, since they continue to operate through uncertain hardware or environments. The latter aspect is becoming critical as our reliance on autonomous systems (e.g., autonomous driving) continues to grow.

The Expedition was an audacious attempt to rethink the computing universe, one worthy of an expedition to a new way of computing, where sensing circuits provide signatures that propagate through a new software stack, one that matches application needs to the underlying physical capabilities, scaling one or the other as appropriate. Software, enabled by sensing hardware, would ultimately provide an expanding source of new capabilities that trade off reliability and cost against the quality of computing and/or storage.

The Expedition undertook a detailed characterization of uncertainty in computing, from spatial dimensions such as manufacturing, to temporal dimensions such as circuit aging, to dynamic variations caused by workloads and operating environments. As we built circuits and microarchitectures and devised coding methods and adaptive algorithms, the research accelerated the trend toward fault tolerance in programming languages, from fringe efforts such as principled approximation and probabilistic accuracy bounds to more mainstream approximate computing. The research also showed that some classical fault-tolerance techniques had outlived their utility in the new computing systems. Instead, resilience enabled by cross-layer measures took on a prominent role in building robust systems that relied on new methods for online self-test, diagnostics, and self-repair, assisted by techniques such as concurrent autonomous chip self-test using stored test patterns.
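As a rough, back-of-the-envelope illustration of what such guard bands cost (the 40% figure is from the text above; the symbols and arithmetic are ours): a timing margin m inflates the clock period by a factor of 1 + m, so

\[
f_{\mathrm{op}} = \frac{f_{\mathrm{nom}}}{1+m}, \qquad m = 0.4 \;\Rightarrow\; f_{\mathrm{op}} \approx 0.71\, f_{\mathrm{nom}},
\]

i.e., roughly 29% of nominal performance is forfeited on every part whose silicon does not actually need the margin.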
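To make the sense-and-adapt idea concrete, here is a minimal, self-contained C sketch of the kind of control loop described above. It is illustrative only: the sensor model, the V/f levels, and every name and threshold in it (read_delay_sensor, set_vf_point, MIN_SLACK_PS, and so on) are hypothetical stand-ins, not the Expedition's actual software stack.

/* sense_adapt.c -- hypothetical sketch of a sense-and-adapt loop:
 * an on-chip sensor exports a "signature" (here, timing slack), and
 * software trades guard band for performance at run time. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int slack_ps;   /* measured timing slack (ps) at the current V/f point */
} hw_signature_t;

enum { MIN_SLACK_PS = 50, MAX_LEVEL = 7 };

/* Stand-in for a real delay-monitor driver: simulate slack that shrinks
 * as the operating level (frequency) rises, with some noise. */
static hw_signature_t read_delay_sensor(int level) {
    hw_signature_t s = { 400 - 50 * level + rand() % 20 };
    return s;
}

/* Stand-in for a real voltage/frequency-scaling driver. */
static void set_vf_point(int level) {
    printf("operating at V/f level %d\n", level);
}

int main(void) {
    int level = 0;  /* start at the safest, most guard-banded point */
    for (int i = 0; i < 10; i++) {
        hw_signature_t sig = read_delay_sensor(level);
        if (sig.slack_ps < MIN_SLACK_PS && level > 0)
            level--;                    /* marginal silicon: back off */
        else if (sig.slack_ps > 2 * MIN_SLACK_PS && level < MAX_LEVEL)
            level++;                    /* ample slack: reclaim margin */
        set_vf_point(level);
    }
    return 0;
}

The point of the sketch is that the guard band becomes a run-time variable rather than a design-time constant: the loop grants margin to parts and conditions that need it and reclaims it everywhere else.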
doi: 10.1109/MDAT.2018.2889103