Reducing Late-Timing Failure at Scale: Straggler Root-Cause Analysis in Cloud Datacenters

Abstract: Task stragglers hinder effective parallel job execution in Cloud datacenters, resulting in late-timing failures when specified timing constraints are violated. Straggler-tolerant methods such as speculative execution are of limited effectiveness due to (i) a lack of precise straggler root-cause knowledge and (ii) straggler identification occurring too late within the job lifecycle. This paper proposes a method to ascertain underlying straggler root-causes by analyzing key parameters within large-scale distributed systems, and to determine the correlation between straggler occurrence and factors including resource contention, task concurrency, and server failures. Our preliminary study of a production Cloud datacenter indicates that the dominant straggler root-cause is high temporal resource contention. This result can help improve straggler prediction and mitigation for tolerating late-timing failures within large-scale distributed systems.
Document type:
Conference paper
Matthieu Roy; Javier Alonso Lopez; Antonio Casimiro. Fast Abstract in the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Jun 2016, Toulouse, France. DSN2016-FAST-ABSTRACT


https://hal.archives-ouvertes.fr/hal-01316515
Contributor: Matthieu Roy
Submitted on: Tuesday, 17 May 2016 - 11:32:49
Last modified on: Thursday, 19 May 2016 - 12:44:03
Document(s) archived on: Thursday, 18 August 2016 - 10:23:20

File

Reducing Late-Timing Failure a...
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01316515, version 1

Citation

Xue Ouyang, Peter Garraghan, Renyu Yang, Paul Townend, Jie Xu. Reducing Late-Timing Failure at Scale: Straggler Root-Cause Analysis in Cloud Datacenters. Matthieu Roy; Javier Alonso Lopez; Antonio Casimiro. Fast Abstract in the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Jun 2016, Toulouse, France. DSN2016-FAST-ABSTRACT. 〈hal-01316515〉
