Journal articles

How fast can one resize a distributed file system?

Nathanael Cheriere 1, Matthieu Dorier 2, Gabriel Antoniu 1
1 KerData - Scalable Storage for Clouds and Beyond
Inria Rennes – Bretagne Atlantique, IRISA-D1 - SYSTÈMES LARGE ÉCHELLE
Abstract: Efficient resource utilization becomes a major concern as large-scale distributed computing infrastructures keep growing in size. Malleability, the ability of resource managers to dynamically increase or decrease the amount of resources allocated to a job, is a promising way to save energy and costs. However, state-of-the-art parallel and distributed storage systems have not been designed with malleability in mind, mainly because of the supposedly high cost of the data transfers required by resizing operations. Nevertheless, as network and storage technologies evolve, old assumptions about potential bottlenecks can be revisited. In this study, we evaluate the viability of malleability as a design principle for a distributed storage system. Specifically, we model the minimal duration of the commission and decommission operations. To show how our models can be used in practice, we evaluate the performance of these operations in HDFS, a relevant state-of-the-art distributed file system. We show that the existing decommission mechanism of HDFS performs well when the network is the bottleneck, but can be accelerated by up to a factor of 3 when storage is the limiting factor. We also show that the commission in HDFS can be substantially accelerated. Using the insights provided by our models, we suggest improvements to speed up both operations in HDFS. We discuss how the proposed models can be generalized to distributed file systems with different assumptions and what perspectives are open for the design of efficient malleable distributed file systems.

https://hal.archives-ouvertes.fr/hal-02961875
Contributor: Gabriel Antoniu
Submitted on: Thursday, October 8, 2020 - 6:16:32 PM
Last modification on: Wednesday, May 26, 2021 - 3:39:55 AM
Long-term archiving on: Saturday, January 9, 2021 - 7:31:45 PM

Citation

Nathanael Cheriere, Matthieu Dorier, Gabriel Antoniu. How fast can one resize a distributed file system? Journal of Parallel and Distributed Computing, Elsevier, 2020, 140, pp.80-98. ⟨10.1016/j.jpdc.2020.02.001⟩. ⟨hal-02961875⟩
