The STAR Fault Manager for Distributed Environments

Pierre Sens (1), Bertil Folliot (1)
(1) SRC - Systèmes répartis et coopératifs
LIP6 - Laboratoire d'Informatique de Paris 6
Abstract : This paper presents the design, implementation, and performance evaluation of a software fault manager for distributed applications. Dubbed STAR, it uses the natural redundancy existing in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. To improve the response time of fault-tolerant applications, STAR implements non-blocking and incremental checkpointing to back up process state efficiently. Moreover, STAR is application independent and highly configurable. STAR currently runs on top of SunOS and is easily portable to UNIX-like operating systems. The current implementation is based on independent checkpointing and message logging. Measurements show the efficiency and the limits of this implementation. The challenge is to show that a software approach to fault tolerance can be implemented efficiently in a standard networked environment.
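
As an illustration of the incremental checkpointing idea named in the abstract, the following is a minimal sketch and not STAR's implementation. It assumes a fixed-size memory image, a hypothetical page size of 4096 bytes, and page hashes as a stand-in for whatever mechanism a real system uses to detect modified pages. The names `IncrementalCheckpointer`, `checkpoint`, and `restore` are invented for this example.

```python
from __future__ import annotations

import hashlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def page_digests(memory: bytes) -> list[bytes]:
    """Split a memory image into pages and hash each page."""
    return [
        hashlib.sha256(memory[i:i + PAGE_SIZE]).digest()
        for i in range(0, len(memory), PAGE_SIZE)
    ]


class IncrementalCheckpointer:
    """Hypothetical helper: saves only pages changed since the last checkpoint."""

    def __init__(self) -> None:
        self.last_digests: list[bytes] = []
        self.checkpoints: list[dict[int, bytes]] = []

    def checkpoint(self, memory: bytes) -> dict[int, bytes]:
        """Record and return the pages that differ from the previous checkpoint."""
        digests = page_digests(memory)
        delta: dict[int, bytes] = {}
        for idx, digest in enumerate(digests):
            is_new = idx >= len(self.last_digests)
            if is_new or self.last_digests[idx] != digest:
                start = idx * PAGE_SIZE
                delta[idx] = memory[start:start + PAGE_SIZE]
        self.last_digests = digests
        self.checkpoints.append(delta)
        return delta

    def restore(self) -> bytes:
        """Rebuild the latest image by applying all saved deltas in order."""
        pages: dict[int, bytes] = {}
        for delta in self.checkpoints:
            pages.update(delta)
        return b"".join(pages[i] for i in sorted(pages))
```

The first call to `checkpoint` stores every page; later calls store only the pages whose contents changed, which is what keeps the backup cost proportional to the amount of modified state rather than to the total process size.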
Document type :
Journal articles

https://hal.archives-ouvertes.fr/hal-01621560
Contributor : Lip6 Publications
Submitted on : Monday, October 23, 2017 - 4:09:32 PM
Last modification on : Thursday, March 21, 2019 - 1:07:10 PM

Citation

Pierre Sens, Bertil Folliot. The STAR Fault Manager for Distributed Environments. Software Practice and Experience, 1998, 28 (10), ⟨10.1002/(SICI)1097-024X(199808)28:10<1079::AID-SPE199>3.0.CO;2-D⟩. ⟨hal-01621560⟩
