Reducing Late-Timing Failure at Scale: Straggler Root-Cause Analysis in Cloud Datacenters

Abstract : Task stragglers hinder effective parallel job execution in Cloud datacenters, causing late-timing failures when specified timing constraints are violated. Straggler-tolerant methods such as speculative execution have limited effectiveness because (i) they lack precise knowledge of straggler root-causes and (ii) they identify stragglers too late in the job lifecycle. This paper proposes a method to ascertain underlying straggler root-causes by analyzing key parameters within large-scale distributed systems, and to determine the correlation between straggler occurrence and factors including resource contention, task concurrency, and server failures. Our preliminary study of a production Cloud datacenter indicates that the dominant straggler root-cause is high temporal resource contention. The result can assist in enhancing straggler prediction and mitigation for tolerating late-timing failures within large-scale distributed systems.
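
To make the idea of correlating straggler occurrence with resource contention concrete, the following is a minimal sketch and not the authors' method. It assumes a straggler is any task whose duration exceeds 1.5 times the job's median duration (the abstract does not state a threshold), and it uses hypothetical per-task durations and CPU utilization values invented purely for illustration. The names flag_stragglers and pearson are illustrative helpers, not part of any library.

```python
from statistics import median


def flag_stragglers(durations, threshold=1.5):
    """Return a 0/1 flag per task: 1 if its duration exceeds
    threshold x the job's median duration (threshold is an assumption)."""
    cutoff = threshold * median(durations)
    return [1 if d > cutoff else 0 for d in durations]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: task durations (seconds) for one job, and the CPU
# utilization of the server hosting each task while it ran.
durations = [10, 11, 9, 10, 30, 12, 28, 10]
cpu_util = [0.40, 0.45, 0.35, 0.42, 0.95, 0.50, 0.90, 0.41]

flags = flag_stragglers(durations)
r = pearson(cpu_util, flags)
print("straggler flags:", flags)
print("correlation with CPU utilization:", round(r, 3))
```

A strongly positive coefficient on real trace data would be consistent with contention being a root-cause candidate, though correlation alone does not establish causation. The same comparison could be repeated for other factors the abstract names, such as task concurrency or server failures.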

https://hal.archives-ouvertes.fr/hal-01316515
Contributor : Matthieu Roy
Submitted on : Tuesday, May 17, 2016 - 11:32:49 AM
Last modification on : Thursday, May 19, 2016 - 12:44:03 PM
Document(s) archived on : Thursday, August 18, 2016 - 10:23:20 AM

File

Reducing Late-Timing Failure a...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01316515, version 1

Citation

Xue Ouyang, Peter Garraghan, Renyu Yang, Paul Townend, Jie Xu. Reducing Late-Timing Failure at Scale: Straggler Root-Cause Analysis in Cloud Datacenters. Fast Abstract in the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Jun 2016, Toulouse, France. ⟨hal-01316515⟩
