LazyBatching: An SLA-aware Batching System for Cloud Machine Learning Inference
[article] · 2020 · arXiv pre-print
In cloud ML inference systems, batching is an essential technique for increasing throughput, which helps optimize total cost of ownership. Prior graph batching combines individual DNN graphs into a single one, allowing multiple inputs to be executed concurrently in parallel. We observe that this coarse-grained graph batching becomes suboptimal at handling dynamic inference request traffic, leaving significant performance on the table. This paper proposes LazyBatching, an SLA-aware batching system that considers both scheduling and batching at the granularity of individual graph nodes, rather than the entire graph.
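To make the SLA/batching trade-off concrete, here is a minimal, hypothetical sketch of the request-level idea: a scheduler that waits to grow a batch but dispatches early once waiting any longer would risk missing the tightest per-request deadline. It is not the paper's node-granularity LazyBatching algorithm; the `Request` type, `should_dispatch` function, and all numbers are illustrative assumptions.

```python
# A toy sketch of SLA-aware batching, NOT the paper's LazyBatching algorithm:
# LazyBatching batches at the granularity of individual graph nodes, whereas
# this sketch only shows the simpler request-level idea of trading batch size
# against latency deadlines. All names and numbers below are hypothetical.
from dataclasses import dataclass


@dataclass
class Request:
    arrival_ms: float
    sla_ms: float  # per-request latency budget (the SLA)

    @property
    def deadline_ms(self) -> float:
        return self.arrival_ms + self.sla_ms


def should_dispatch(batch: list[Request], now_ms: float,
                    est_exec_ms: float, max_batch: int) -> bool:
    """Dispatch when the batch is full, or when waiting for more inputs
    would push the tightest deadline past now + estimated execution time."""
    if not batch:
        return False
    if len(batch) >= max_batch:
        return True
    earliest = min(r.deadline_ms for r in batch)
    return now_ms + est_exec_ms >= earliest


# Example: the 10 ms SLA on the first request forces an early dispatch
# at t=4 ms, even though the batch holds only 2 of a possible 8 inputs.
batch = [Request(arrival_ms=0.0, sla_ms=10.0),
         Request(arrival_ms=2.0, sla_ms=50.0)]
print(should_dispatch(batch, now_ms=4.0, est_exec_ms=7.0, max_batch=8))  # True
```

The design choice this illustrates is the one the abstract argues for: a fixed, coarse-grained batching policy ignores per-request deadlines, while a deadline-aware dispatch rule can sacrifice some batch size to keep SLA violations down.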
arXiv:2010.13103v1
fatcat:4kvl5wcvxvfg3a6pvx3ulun2ki