Are Lock-Free Concurrent Algorithms Practically Wait-Free?
Journal of the ACM
Lock-free concurrent algorithms guarantee that some concurrent operation will always make progress in a finite number of steps. Yet programmers prefer to treat concurrent code as if it were wait-free, guaranteeing that all operations always make progress. Unfortunately, designing wait-free algorithms is generally a very complex task, and the resulting algorithms are not always efficient. While obtaining efficient wait-free algorithms has been a long-time goal for the theory community, most
non-blocking commercial code is only lock-free.

This paper suggests a simple solution to this problem. We show that, for a large class of lock-free algorithms, under scheduling conditions which approximate those found in commercial hardware architectures, lock-free algorithms behave as if they are wait-free. In other words, programmers can keep on designing simple lock-free algorithms instead of complex wait-free ones, and in practice they will get wait-free progress.

Our main contribution is a new way of analyzing a general class of lock-free algorithms under a stochastic scheduler. Our analysis relates the individual performance of processes to the global performance of the system using a Markov chain lifting between a complex per-process chain and a simpler system progress chain. We show that lock-free algorithms are not only wait-free with probability 1, but that, in fact, a general subset of lock-free algorithms can be closely bounded in terms of the average number of steps required until an operation completes. To the best of our knowledge, this is the first attempt to analyze progress conditions, typically stated with respect to a worst-case adversary, in a stochastic model capturing their expected asymptotic behavior.

1. Introduction

Minimal progress means that some process is always guaranteed to make progress by completing its operations, while maximal progress means that all processes always complete all their operations. Most non-blocking commercial code is lock-free, that is, it provides minimal progress without using locks [6, 12]. Most blocking commercial code is deadlock-free, that is, it provides minimal progress when using locks. Over the years, the research community has devised ingenious, technically sophisticated algorithms that provide maximal progress: such algorithms are either wait-free, i.e., they provide maximal progress without using locks, or starvation-free, i.e., they
provide maximal progress when using locks.

Unexpectedly, maximal-progress algorithms, and wait-free algorithms in particular, are not being adopted by practitioners, despite the fact that the completion of all method calls in a program is a natural assumption that programmers implicitly make. Recently, Herlihy and Shavit suggested that perhaps the answer lies in a surprising property of lock-free algorithms: in practice, they often behave as if they were wait-free (and, similarly, deadlock-free algorithms behave as if they were starvation-free). Specifically, most operations complete in a timely manner, and the impact of long worst-case executions on performance is negligible. In other words, in real systems, the scheduler that governs the threads' behavior in long executions does not single out any particular thread in order to cause the theoretically possible bad behaviors.

This raises the following question: could the choice of wait-free versus lock-free be based simply on what assumption a programmer is willing to make about the underlying scheduler, and, with the right kind of scheduler, would wait-free algorithms be unnecessary except in very rare cases? This question is important because the difference between a wait-free and a lock-free algorithm for any given problem typically involves the introduction of specialized "helping" mechanisms, which significantly increase the complexity (both the design complexity and the time complexity) of the solution. If one could simply rely on the scheduler, adding a helping mechanism to guarantee wait-freedom (or starvation-freedom) would be unnecessary. Unfortunately, there is currently no analytical framework that would allow answering the above question, since doing so requires predicting the behavior of a concurrent algorithm over long executions, under a scheduler that is not adversarial.

Contribution. In this paper, we take a first step towards such a framework.
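To make the setting concrete, the following sketch (our illustration, not code from the paper) simulates the canonical lock-free pattern — a compare-and-swap retry loop on a shared counter — under a uniform stochastic scheduler that picks each of n processes with probability 1/n per step (so θ = 1/n). Each operation is a read step followed by a CAS step; the CAS fails if another process changed the counter in between, forcing a retry, which is exactly why the loop is lock-free but not wait-free. All names and parameters here are hypothetical.

```python
import random

def simulate(n_processes=8, ops_per_process=100, seed=1):
    """Simulate n processes incrementing a shared counter via a CAS
    retry loop, under a uniform stochastic scheduler (theta = 1/n)."""
    rng = random.Random(seed)
    counter = 0                      # shared memory location
    local = [None] * n_processes     # value each process read for its pending CAS
    done = [0] * n_processes         # completed operations per process
    steps = 0                        # total scheduled steps taken
    while any(d < ops_per_process for d in done):
        p = rng.randrange(n_processes)   # scheduler: pick a process uniformly
        if done[p] == ops_per_process:
            continue                     # process p already finished all its ops
        steps += 1
        if local[p] is None:
            local[p] = counter           # step 1: read the shared counter
        elif local[p] == counter:
            counter = local[p] + 1       # step 2: CAS succeeds, increment lands
            local[p] = None
            done[p] += 1                 # operation completes
        else:
            local[p] = None              # CAS fails (counter changed): retry
    return counter, steps

total, steps = simulate()
```

Under an adversarial scheduler, the same loop admits executions in which one process's CAS fails forever; under the stochastic scheduler those schedules have probability mass 0, so every run of the simulation terminates with all operations complete, and the average number of steps per operation can be measured directly.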
Following empirical observations, we introduce a stochastic scheduler model, and use this model to predict the long-term behavior of a general class of concurrent algorithms. The stochastic scheduler is similar to an adversary: at each time step, it picks some process to schedule. The main distinction is that, in our model, the scheduler's choices contain some randomness. In particular, a stochastic scheduler has a probability threshold θ > 0 such that every (non-faulty) process is scheduled with probability at least θ in each step.

We start from the following observation: under any stochastic scheduler, every bounded lock-free algorithm is actually wait-free with probability 1. (A bounded lock-free algorithm guarantees that some process always makes progress within a finite progress bound.) In other words, for any such algorithm, the schedules which prevent a process from ever making progress must have probability mass 0. The intuition is that, with probability 1, each specific process eventually takes enough consecutive steps to complete its operation. This observation generalizes to any bounded minimal/maximal progress condition: we show that, under a stochastic scheduler, bounded minimal progress becomes maximal progress with probability 1.

However, this intuition is insufficient for explaining why lock-free data structures are efficient in practice: because it applies to arbitrary algorithms, the upper bound it yields on the number of steps until an operation completes is unacceptably high. Our main contribution is analyzing a general class of lock-free algorithms under a specific stochastic scheduler, and showing that not only are they wait-free with probability 1, but that in fact they provide a