Coding for Low-Latency Distributed Computation in Cloud Infrastructure1.5.2017
An increasing amount of computation is being carried out on massively scaled distributed “cloud” server infrastructure. In practice, there are significant variations in how long a particular machine takes to execute a given task within a larger job. However, the risk of very slow job completion times can be avoided through redundant execution of tasks, analogous to the use of error-control coding in communication systems. Despite the growing use of such infrastructure, there has been comparatively little analysis to guide system designers. In this work, we analyze the fundamental tradeoff between execution time (job latency) and machine time (redundant execution) in such systems, focusing on protocols that avoid the need for resource state information. Our analysis reveals important new principles for the design of systems that approach these limits. Among other results, we show that there are two distinct behavior regimes, corresponding to the log-convexity or log-concavity of the task service distribution. In the former regime, we show that a protocol with maximum redundancy is most effective, while in the latter regime, a protocol with less redundancy and an early cancellation policy is most effective.