Checkpointing Process Groups in a Grid Environment

John Mehnert-Spahn 1 Michael Schoettner 1 Christine Morin 2
2 PARIS - Programming distributed parallel systems for large scale numerical simulation
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, ENS Cachan - École normale supérieure - Cachan, Inria Rennes – Bretagne Atlantique
Abstract : The EU-funded XtreemOS project implements a grid operating system that transparently exploits the resources of virtual organizations through the standard POSIX interface. Grid checkpointing and restart requires saving and restoring jobs executing in a distributed, heterogeneous grid environment. The latter may span millions of grid nodes (PCs, clusters, and mobile devices) using different system-specific checkpointers that save and restore application and kernel data structures for processes executing on a grid node. In this paper we briefly describe the XtreemOS grid checkpointing architecture and how we bridge the gap between the abstract grid and the system-specific checkpointers. Then we discuss how we keep track of processes and how different process grouping techniques are managed to ensure that all processes of a job, and any further dependent ones, can be checkpointed and restarted. Finally, we present how Linux control groups can be used to address resource isolation issues during restart.

https://hal.archives-ouvertes.fr/hal-01271213
Contributor : Christine Morin
Submitted on : Monday, February 8, 2016 - 7:35:49 PM
Last modification on : Friday, November 16, 2018 - 1:23:18 AM

Citation

John Mehnert-Spahn, Michael Schoettner, Christine Morin. Checkpointing Process Groups in a Grid Environment. Proc. of the 9th International Conference on Parallel and Distributed Computing (PDCAT '08), Dec 2008, Dunedin, New Zealand. pp.243 - 251, ⟨10.1109/PDCAT.2008.14⟩. ⟨hal-01271213⟩