Is it Worth Relaxing Fault Tolerance to Speed Up Decommission in Distributed Storage Systems?

Abstract: Efficient resource utilization is a major concern for large-scale computing platforms. One way to lower energy consumption and operational cost is to reduce the amount of idle resources. This can be achieved through malleability, namely, the ability of resource managers to dynamically increase or decrease the amount of resources allocated to jobs while they are running. Decommissioning idle nodes (i.e., removing them from the cluster) as soon as possible allows the resource manager to quickly reallocate those nodes to other jobs. Challenges arise when such nodes host part of a distributed storage system. Such a storage system may need to transfer large amounts of data before releasing the nodes, in order to ensure data availability and a certain level of fault tolerance. In this paper, we model and evaluate the performance of the decommission operation when the level of fault tolerance (i.e., the number of replicas) is relaxed during the operation. Intuitively, this is expected to reduce the amount of data that must be transferred before nodes are released, and thus to return nodes to the resource manager faster. We quantify theoretically how much time and how many resources are saved by such a fast decommission strategy compared with a standard decommission that does not temporarily reduce the fault-tolerance level. We establish lower bounds for the duration of the different phases of a fast decommission, and we use these bounds to estimate when fast decommission reduces the usage of core-hours and when it does not. We implement a prototype of fast decommission, experimentally validate the lower bounds on the duration of the operation, and confirm our theoretical findings in practice.
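The intuition behind relaxing the replication level can be illustrated with a toy estimate. The sketch below is not the model from the paper. It assumes that each object has its replicas placed on distinct nodes chosen uniformly at random, and it only counts how much data must move before the nodes can be released. The function names and parameters are hypothetical, chosen for illustration.

```python
from math import comb


def replicas_to_move_standard(num_objects, replication, total_nodes, leaving_nodes):
    """Expected number of replicas hosted on leaving nodes.

    Assumption: a standard decommission re-creates every one of these
    replicas elsewhere before the nodes are released.
    """
    return num_objects * replication * leaving_nodes / total_nodes


def objects_to_move_fast(num_objects, replication, total_nodes, leaving_nodes):
    """Expected number of objects whose replicas ALL sit on leaving nodes.

    Assumption: a fast decommission only needs to move one copy of each
    such object before release; the remaining re-replication is deferred.
    """
    # math.comb returns 0 when replication > leaving_nodes, so no object
    # can be fully contained in a set of leaving nodes that small.
    return num_objects * comb(leaving_nodes, replication) / comb(total_nodes, replication)


if __name__ == "__main__":
    # Hypothetical cluster: 1,000,000 objects, 3 replicas, 20 nodes, 10 leaving.
    print(replicas_to_move_standard(1_000_000, 3, 20, 10))
    print(objects_to_move_fast(1_000_000, 3, 20, 10))
```

Under these assumptions, far fewer transfers are needed before release in the fast case. The deferred re-replication still has to run afterwards on the remaining nodes, which is why the abstract asks whether core-hours are saved overall.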

Cited literature: 21 references

https://hal.archives-ouvertes.fr/hal-02116727
Contributor: Nathanaël Cheriere
Submitted on: Wednesday, May 1, 2019 - 2:32:46 PM
Last modification on: Tuesday, February 25, 2020 - 8:08:10 AM

File

Paper.pdf
Files produced by the author(s)

Citation

Nathanaël Cheriere, Matthieu Dorier, Gabriel Antoniu. Is it Worth Relaxing Fault Tolerance to Speed Up Decommission in Distributed Storage Systems? CCGrid 2019 - IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, May 2019, Larnaca, Cyprus. pp.1-10, ⟨10.1109/CCGRID.2019.00024⟩. ⟨hal-02116727⟩
