Inference Acceleration: Adding Brawn to the Brains

Mark Campbell
Computer, 2020
Since the artificial intelligence (AI) spring, a burgeoning ecosystem of model frameworks, reference algorithms, vast inexpensive training data repositories, and tools to streamline model development, training, and deployment has emerged. How to execute these models after creation is left as an exercise for the reader. Until recently, discussions about model optimization, execution platforms, and model tuning were not pressing issues for most AI shops. The conventional thinking has been that model creation is the tricky part.
If you obtain enough of the right data, get a hefty training platform, and train the model successfully, then the heavy lifting is over. Just throw the trained model onto whatever hardware you have, stand back, and let it do its "thang." For the most part, this approach works. After all, the precious hours of a data scientist's day are not needed to execute the model but to make it deeper and smarter, collect more and better training data, and create new purpose-built versions for edge cases. This all makes very good sense, except this approach is breaking down.

THE AI BLOAT PROBLEM

AI models do not grow fast; they grow very, very, very fast. One example that illustrates this model bloat is a group of natural language generators called transformers. In late 2018, Google AI Language published a landmark paper on Bidirectional Encoder Representations from Transformers (BERT), a natural language transformer with 345 million parameters.
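To put that parameter count in perspective, here is a rough back-of-the-envelope sketch of the memory such a model occupies just to hold its weights. The 345-million figure comes from the text above; the bytes-per-parameter values are illustrative assumptions, not measurements of any particular deployment.

```python
# Rough memory footprint of a 345-million-parameter transformer's weights.
# The numeric formats below are assumptions chosen for illustration.

PARAMS = 345_000_000  # parameter count cited for BERT

BYTES_PER_PARAM = {
    "fp32 (typical training precision)": 4,
    "fp16 (half precision)": 2,
    "int8 (quantized inference)": 1,
}

for fmt, nbytes in BYTES_PER_PARAM.items():
    gib = PARAMS * nbytes / 2**30
    print(f"{fmt}: ~{gib:.2f} GiB for the weights alone")

# fp32: ~1.29 GiB, fp16: ~0.64 GiB, int8: ~0.32 GiB -- and that is before
# activations, batching, or any serving overhead are taken into account.
```

Even the most aggressive line in that sketch leaves hundreds of megabytes of weights to move through memory for every inference request, which is why "just throw it onto whatever hardware you have" eventually stops being a workable plan.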
doi:10.1109/mc.2020.2984870