Distributed computing in practice: the Condor experience

Douglas Thain, Todd Tannenbaum, Miron Livny
2005 Concurrency and Computation  
Since 1984, the Condor project has enabled ordinary users to do extraordinary computing. Today, the project continues to explore the social and technical problems of cooperative computing on scales ranging from the desktop to the world-wide computational grid. In this chapter, we provide the history and philosophy of the Condor project and describe how it has interacted with other projects and evolved along with the field of distributed computing. We outline the core components of the Condor system and describe how the technology of computing must correspond to social structures. Throughout, we reflect on the lessons of experience and chart the course traveled by research ideas as they grow into production systems.

In this environment, the Condor project was born. At the University of Wisconsin, Miron Livny combined his doctoral thesis on cooperative processing [47] with the powerful Crystal Multicomputer [24] designed by DeWitt, Finkel, and Solomon and the novel Remote Unix [46] software designed by Michael Litzkow. The result was Condor, a new system for distributed computing. In contrast to the dominant centralized control model of the day, Condor was unique in its insistence that every participant in the system remain free to contribute as much or as little as it cared to. The Condor system soon became a staple of the production computing environment at the University of Wisconsin, partially because of its concern for protecting individual interests. [44]

A production setting can be both a curse and a blessing: The Condor project learned hard lessons as it gained real users. It was soon discovered that inconvenienced machine owners would quickly withdraw from the community. This led to a longstanding Condor motto: Leave the owner in control, regardless of the cost. A fixed schema for representing users and machines was in constant flux, and so eventually led to the development of a schema-free resource allocation language called ClassAds. [59, 60, 58] It has been observed that most complex systems struggle through an adolescence of five to seven years. [42] Condor was no exception.

Scientific interests began to recognize that coupled commodity machines were significantly less expensive than supercomputers of equivalent power [66]. A wide variety of powerful batch execution systems such as LoadLeveler [22] (a descendant of Condor), LSF [79], Maui [35], NQE [34], and PBS [33] spread throughout academia and business.
Several high-profile distributed computing efforts such as SETI@Home and Napster raised the public consciousness about the power of distributed computing, generating not a little moral and legal controversy along the way [9, 67]. A vision called grid computing began to build the case for resource sharing across organizational boundaries [30].

Throughout this period, the Condor project immersed itself in the problems of production users. As new programming environments such as PVM [56], MPI [78], and Java [74] became popular, the project added system support and contributed to standards development. As scientists grouped themselves into international computing efforts such as the Grid Physics Network [3] and the Particle Physics Data Grid (PPDG) [6], the Condor project took part from initial design to end-user support. As new protocols such as GRAM [23], GSI [28], and GridFTP [8] developed, the project applied them to production systems and suggested changes based on the experience. Through the years, the Condor project adapted computing structures to fit changing human communities.

Many previous publications about Condor have described in fine detail the features of the system. In this chapter, we will lay out a broad history of the Condor project and its design philosophy. We will describe how this philosophy has led to an organic growth of computing communities and discuss the planning and scheduling techniques needed in such an uncontrolled system. Next, we will describe how our insistence on dividing responsibility has led to a unique model of cooperative computing called split execution. In recent years, the project has added a new focus on data-intensive computing. We will outline this new research area and describe our recent contributions. Security has been an increasing user concern over the years. We will describe how Condor interacts with a variety of security systems. Finally, we will conclude by describing how real users have put Condor to work.
doi:10.1002/cpe.938 fatcat:7ewsokcld5curbe6ku3r5khu2y