


 | |
 | Group Membership and Wide-Area Master-Worker
Computations, Kjetil Jacobsen, Xianan Zhang and Keith Marzullo,
submitted to ICDCS 2003.
Abstract: Group communication systems
have been designed to provide an infrastructure for fault tolerance in
distributed systems, including wide-area systems. In our work on
master-worker computation for GriPhyN, which is a large project in the area
of the computational grid, we asked whether we should build our
wide-area master-worker computation using wide-area group communication.
This paper explains why we decided that doing so was not a good idea.
|
 | A Practical
Approach to Fault-tolerant Master/Worker in Condor, Master's Thesis,
University of California, San Diego, September 2002. Abstract:
The thesis addresses the problem of providing fault tolerance for
master/worker problems from a practical point of view. For this
thesis, we develop a master/worker toolkit that can scale to wide-area
systems. It is built using Condor, which is a popular vehicle for
Grid-based computing. Our toolkit supports different configurations of
replicated masters and different policies for partitioning out work among
workers, thereby allowing the developer to tune the exact structure of the
computation to best serve the environment and problem at hand. In
developing this toolkit, we extract some common rules for
providing fault tolerance to master/worker programs in Condor and other
existing distributed systems. |
|