The Asterina Project Homepage

 

Publications
 
Group Membership and Wide-Area Master-Worker Computations, Kjetil Jacobsen, Xianan Zhang and Keith Marzullo, submitted to ICDCS 2003.

Abstract: Group communication systems have been designed to provide an infrastructure for fault tolerance in distributed systems, including wide-area systems.  In our work on master-worker computation for GriPhyN, a large project in the area of the computational grid, we asked whether we should build our wide-area master-worker computation using wide-area group communication.  This paper explains why we decided that doing so was not a good idea.

 

A Practical Approach to Fault-tolerant Master/Worker in Condor, Master's Thesis, University of California, San Diego, September 2002.

Abstract: The thesis addresses the problem of providing fault tolerance for master/worker problems from a practical point of view.  For this thesis, we develop a master/worker toolkit that can scale to wide-area systems.  It is built using Condor, which is a popular vehicle for Grid-based computing.  Our toolkit supports different configurations of replicated masters and different policies for partitioning work among workers, thereby allowing the developer to tune the exact structure of the computation to best serve the environment and problem at hand.  In developing this toolkit, we are able to extract some common rules for providing fault tolerance to master/worker programs in Condor and existing distributed systems.