Chip multi-processor generator
Proceedings - Design Automation Conference
Recent changes in technology scaling have made power dissipation today's major performance limiter. As a result, designers struggle to meet performance requirements under stringent power budgets. At the same time, the traditional solution to power efficiency, application specific designs, has become prohibitively expensive due to increasing nonrecurring engineering (NRE) costs. Most concerning are the development costs for design, validation, and software for new systems. One direction that
... stry has attempted, with the goal of mitigating the rising costs of per-application designs, is to add a layer of programmability that specifies how the hardware operates. Examples of this approach include baseband processors for software-defined-radio (SDR) wireless devices [28, 100, 51]. Similarly, our previous study, Stanford Smart Memories (SSM), showed that it is possible to build a reconfigurable chip multiprocessor memory system that can be customized for specific application needs [71, 41, 92, 89]. These programmable, or reconfigurable, hardware solutions enable per-application customization and amortization of NRE costs, to a limited extent. However, reconfigurability introduces overheads at the circuit level, and customization is limited to those resources that were decided upon, and verified, upfront.
In this thesis, we argue that one can harness the ideas of reconfigurable designs to build a design framework that can generate semi-custom chips: a Chip Generator. A domain-specific chip generator codifies the designer knowledge and design trade-offs into a template that can be used to create many different chips. Like reconfigurable designs, these systems fix the top-level system architecture, amortizing software, validation, and design costs, and enabling a rich system simulation environment for application developers. Meanwhile, below the top level, the developer can "program" the individual inner components of the architecture. Unlike reconfigurable chips, a generator "compiles" the program to create a customized chip. This compilation process occurs at elaboration time, long before silicon
Dedicated with love to my grandfather Simcha.
As much as, and perhaps more than, this thesis represents my personal abilities, it represents how lucky I was to be surrounded by incredible people who guided me, mentored me, taught me, and provided a shoulder to lean on and examples to live by.
First and foremost, I would like to thank Professor Mark Horowitz, my advisor. When I had just started at Stanford, Mark needed someone to help wrap up a project, and hired me with the promise that this is "not research, just grant work for a couple of quarters," and I was happy to take it. But I stuck around, and I am now closing in on my sixth year. Throughout these years, from being my teacher (and my boss), Mark became my advisor, my mentor, and my role model. I learned so much about circuit and chip design from Mark, but even more than Mark taught me academically, he taught me how to think.
I would also like to thank my co-advisor, Professor Subhasish Mitra. I first met Subhasish through an independent study project I performed under his supervision in my first year. I quickly realized what an endless source of information he is, and that his door was always open for me to just wander in with any random question. Subhasish always welcomed me with a big smile, with great patience (and with chocolates from his most recent trip), and with a clear answer to whatever I was puzzled about.
Many thanks are extended to Professor Christos Kozyrakis for serving on my reading and defense committee, and for teaching me so much, whether in his classes, a random corridor talk, or during the Stanford Smart Memories project.
A special thank you is extended to Professor Dan Boneh, for being the chair of my defense committee and for all the wonderful things I learned in his classes. Somehow, Dan was able to take arguably the most difficult topics, cryptography and computer security, and make them understandable to mere humans like myself.
A great and special thanks is extended to Consulting Professor Stephen Richardson, who also served on my defense and reading committee. When I met Steve for the first time, he stated his goal as being to "help students understand what they are doing and what they need to do to achieve their PhD goals."
I took Steve at his word, and we had numerous meetings in which he always provided good, useful advice. Over the years, and even more important than any academic advice, co-authoring of papers, and research collaborations, Steve and I became good friends.
Stanford is a wonderful place with many wonderful people, many of whom became my friends. An enormous thank you is sent to Megan Wachs and Zain Asgar. Without my good friend Megan, much of the research presented here would not have existed. Many thanks are also extended to Amin Firoozshahian, Alex Solomatnikov and Francois Labonte, who took me into the group and helped, taught, and mentored me in my early days at Stanford. Finally, I would like to thank the rest of the Stanford VLSI research group, who have been such a fertile ground for thoughts and research.
This thesis, and my entire career at Stanford for that matter, would not have existed if it were not for the kindness and generosity of Mr. and Ms. Irving and Harriet Sands, who not only supported me financially, but also supported me morally and mentally throughout these years. In fact, as I was considering my options before going to Stanford, Harriet and Irving, who at that point had last seen me as a little kid, called me to explain why I must not miss out on this great opportunity, and even offered to help. They were indeed very right.
Finally, I would like to thank my family, who supported me through all these years of living abroad. I would like to thank my parents, Itzhak and Lea Shacham, who always encouraged me to strive higher and be better, and gave me the foundations I needed to complete this task. I would also like to thank my wife's parents, Rafi and Chaya Taterka, who supported our decision to live on the other side of the world for this adventure even though I know how tough this is for them, and who made constant trips to visit us.
My greatest thanks are for my daughters, Ori and Alma, who joined us along the way and made our lives so wonderful, and made it practically impossible to work too hard! And to top it all off, Neta my love, there are not enough words to express my love and gratitude: this PhD is as much yours as it is mine!
CHAPTER 1. INTRODUCTION
reminiscent of chip design problems in the early 1980s, when all chips were designed by full custom techniques. At that time, few companies had the skills or the dollars to create chips. The invention of synthesis and place-and-route tools dramatically reduced design costs and enabled cost-effective ASICs. Over the past 25 years, however, complexity has grown, creating the need for another design innovation.
To enable this innovation, we first need to face the main issue: building a completely new complex system is expensive. The cost of design and verification has long exceeded tens of millions of dollars. Moreover, hardware is only half the story. New architectures require expensive new software ecosystems to be useful. Developing these tools and code is also expensive. Providing a designer with complex IP blocks does not solve this problem: the assembled system is still complex and still requires custom verification and software. Furthermore, verification costs still trend with system complexity and not with the number of individual blocks used.
To address some of these design costs, the industry has been moving toward platform-based designs, where the system architecture has been fixed to provide an interface, an abstraction layer, for the design space exploration, validation, and software efforts. A platform in this sense is an architecture that, rather than being assembled from a collection of independently developed blocks of silicon, is derived from a specific "family" of micro-architectures, oriented toward a particular class of problems.
Most often, to make these platforms serve a wide class of problems, design houses rely on hardware programmability and/or reconfigurability [71, 64, 51, 61]. While such strategies address some of the design costs, these general, programmable platforms still do not provide the desired ASIC-like performance and power efficiency. The amount of resources in a programmable platform (e.g., compute engines, instruction and data caches, processor width, memory bandwidth, etc.) is never optimal for any particular application. Since the power and area of the chip are limited, a compromise among the expected use-cases is typically implemented. Similarly, adding configuration registers to a design also implies adding circuit inefficiencies, such as muxes in data paths or table look-ups for control, impeding both performance and energy. Furthermore, while a reconfigurable chip is likely to work in the modes for which it was designed and tested, and perhaps for some closely related configurations, it is doubtful whether a completely new use-case would work efficiently the first time.
It therefore seems that on one hand, a reconfigurable platform-based approach does not provide the required performance and power efficiency, and on the other, ASIC-based solutions are too expensive for most applications. The key to solving this impasse is to understand that while we cannot afford to build a customized chip for every application, we can reuse one application's design process to generate multiple new chips. For example, many applications within a domain may require similar systems with small variations in hardware units, or the same application may be used in multiple target devices with different power and performance constraints.
While a configurable chip cannot be as efficient as its set of application-specific counterparts, suppose we could introduce the one piece of "secret sauce" that makes that application work, then generate (rather than program) a system configuration that meets the power and performance constraints, and only then fabricate the chip; we would certainly end up with a much more efficient chip. Furthermore, every time a chip is built, we inherently evaluate different design decisions, either implicitly using micro-architectural and domain knowledge, or explicitly through custom evaluation tools. While this process could help create other, similar chips, today these trade-offs are often not recorded: we either settle on a particular target implementation and record our solution, or we create a chip that is a super-set or a compromise among design choices (and is thus less than optimal). We argue that this implicit and explicit knowledge should be embedded in the modules we construct, allowing others, with different goals or constraints, to create different chip instances. Rather than building a custom chip, designers should create a module that can generate the specialized chip: a chip generator.
As presented in Chapter 2 of this thesis, the chip generator approach uses a fixed system architecture, or "template," to simplify both software development and hardware verification. This template is composed of highly parametrized modules, to enable pervasive customization of the hardware. The user, an application developer, tunes the parameters to meet a desired specification. The chip generator compiles this information and deploys optimization procedures to produce the final chip. This process results in customized function units and memories that increase compute efficiency.
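The flow described above (a fixed template of parametrized modules, user-tuned parameters, and a compile step that fixes the design before fabrication) can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the generator developed in this thesis: the parameter names (`num_cores`, `issue_width`, `dcache_kb`, `has_fft_unit`), their legal values, and the `elaborate` function are all hypothetical, and the optimization procedures mentioned above are omitted.

```python
from dataclasses import dataclass

# Hypothetical template: the top-level architecture is fixed, and each
# parameter lists its legal values (the first entry is the default).
TEMPLATE = {
    "num_cores": (1, 2, 4, 8),
    "issue_width": (1, 2, 4),
    "dcache_kb": (8, 16, 32, 64),
    "has_fft_unit": (False, True),
}


@dataclass(frozen=True)
class ChipInstance:
    """One fully resolved design; frozen because nothing is configurable later."""
    num_cores: int
    issue_width: int
    dcache_kb: int
    has_fft_unit: bool

    def module_list(self):
        """List the modules that this instance would instantiate."""
        modules = [
            f"core{i}(width={self.issue_width}, dcache={self.dcache_kb}KB)"
            for i in range(self.num_cores)
        ]
        if self.has_fft_unit:
            modules.append("fft_unit")
        return modules


def elaborate(user_params):
    """Resolve user-tuned parameters against the template at 'elaboration time'."""
    unknown = set(user_params) - set(TEMPLATE)
    if unknown:
        raise ValueError(f"parameters not in template: {sorted(unknown)}")
    resolved = {}
    for name, legal in TEMPLATE.items():
        value = user_params.get(name, legal[0])
        if value not in legal:
            raise ValueError(f"{name}={value!r} is outside the template range {legal}")
        resolved[name] = value
    return ChipInstance(**resolved)


# The application developer tunes only the parameters that matter to them.
chip = elaborate({"num_cores": 2, "has_fft_unit": True})
print(chip.module_list())
```

Unlike a reconfigurable chip, the resulting instance carries no configuration state: every choice is resolved before the design is produced, and a request outside the template is rejected rather than approximated.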