Zen and the Art of Network Troubleshooting: A Hands-on Experimental Study

Growing network complexity necessitates tools and methodologies to automate network troubleshooting. In this paper, we follow a crowd-sourcing trend and argue for the need to deploy measurement probes at end-user devices and gateways, which can be under the control of the users or the ISP. Depending on the amount of information available to the probes (e.g., ISP topology), we formalize the network troubleshooting task as either a clustering or a classification problem, which we solve with an algorithm that (i) achieves perfect classification under the assumption of a strategic selection of probes (e.g., assisted by an ISP) and (ii) operates blindly with respect to the network performance metrics, of which we consider delay and bandwidth in this paper. While previous work on network troubleshooting privileges either a theoretical or a practical approach, our workflow balances both aspects, as (i) we conduct a set of controlled experiments with a rigorous and reproducible methodology, (ii) on an emulator that we thoroughly calibrate, (iii) contrasting experimental results affected by real-world noise with expected results from a probabilistic model.


Introduction
Nowadays, broadband Internet access is vital. Many people rely on online applications in their homes to watch TV, make VoIP calls, and interact with each other through social media and emails. Unfortunately, dynamic network conditions such as device failures and congested links can affect the network performance and cause disruptions (e.g. frozen video, poor VoIP quality).
Currently, troubleshooting performance disruptions is complex and ad hoc due to the presence of different applications, network protocols, and administrative domains. Typically, troubleshooting starts with a user call to the ISP help desk. However, the intervention of the ISP technician is useless if the root cause lies outside the ISP network, possibly in the home network of the very same user. Hence, for the ISP, it would be valuable to extend its reach beyond the home gateway by instrumenting experiments directly from end-user devices. While (tech-savvy) users can be assisted in their troubleshooting efforts by software tools such as [4,6,17,19], which automate a number of useful measurements, these tools do not incorporate network tomography techniques [9,21] to identify the root causes of network disruptions (e.g., faulty links). Additionally, these tools are generally agnostic of the ISP network; hence, they would benefit from cooperation with the ISP.
In this paper, we propose a practical methodology to automate the identification of faulty links in the access network based on end-to-end measurements. Since the devices participating in the troubleshooting task can be either under the control of the end-user or the ISP, the knowledge of the ISP topology is not always available to the measurement probes. Consequently, we formalize the troubleshooting task as either a clustering or a classification problem: in the former, end-users are able to assess the severity of the fault; in the latter, ISPs are able to identify the faulty link.
This paper makes several contributions. While our troubleshooting model (Sec. 3), algorithm (Sec. 4) and software implementation (Sec. 5) are interesting per se, we believe our major contribution is the rigour of the evaluation methodology (Sec. 6), which overcomes the limits of the state of the art (Sec. 2). Indeed, on the one hand, previous practical troubleshooting efforts [4,6,16,17,19] are valuable in terms of domain knowledge and engineering, but lack theoretical foundations and rigorous verification. On the other hand, prior analytical efforts stand on solid theoretical ground [9,21], but their validation is either simplistic (e.g., simulations) or lacks ground truth (e.g., PlanetLab).
In this work, we take the best of both worlds, as we (i) propose a practical methodology for network troubleshooting with an open source implementation; (ii) provide a model of the expected fault detection probability that we contrast with experimental results; (iii) use an experimental approach where we emulate controlled network conditions with Mininet [13]; (iv) perform a calibration of the emulation setup, an often neglected albeit mandatory task; (v) in the spirit of Mininet and the TMA community, make all our source code available to the scientific community at [1,2].
Our methodology is based solely on end-to-end measurements to localize the set of links that are the most likely root cause of performance degradations. Closest to our work is the large body of work in network tomography which exploits the similarity of end-to-end network performance from a source to multiple receivers due to common paths to infer properties of internal network links such as network outages [18], delays [23], and packet losses [8]. However, these studies make simplifying assumptions that do not hold in real deployments [9,15] such as the use of multicast [23]. In addition, the proposed algorithms are computationally expensive for networks of reasonable scale and their accuracy is affected by the scale and the topology of the network [9].
In this work, we instead present a practical, general framework to identify faulty links that we instantiate on two specific metrics: delays as in [23] and bottleneck bandwidth, which is notoriously more difficult to measure. When full topological information is not available, our algorithm performs a clustering of measurement probes as in binary network tomography [21], where the inference problem is simplified by separating links (in our case probes) into good vs failed, instead of estimating the values of the link performance metrics.
Additionally, one major problem of the related literature is the realism of the ground truth data used to evaluate the accuracy of the algorithms. Even in practical approaches, ground truth in the form of user tickets [3] or user feedback [16] is extremely rare, so that the absence of ground truth is commonplace [4,6,17,19]. Theoretical work builds ground truth with simulations [8], or using syslogs and SNMP data in operational networks [18]. On the one hand, although simulations simplify the control over failure location and duration, they do not provide realistic settings. On the other hand, the ground truth is either completely missing in real operational networks (such as PlanetLab [21]) or partially missing in testbeds [15,18], where network events outside of the control of researchers can happen. Our setup employs controlled emulation through Mininet [13], which is (relatively) fast to implement, uses real code (including the kernel stack and our software), and allows testing on fairly large-scale topologies. This setup allows full control over the number, duration, and location of network problems. Additionally, by running the full network stack, Mininet keeps the real-world noise in the underlying measurements, thus providing a more challenging validation environment than simulation. As a side effect of this choice, the NetProbes software that we release as open source [2] has also undergone a significant amount of experimental validation. Most importantly, any peer researcher is capable of repeating our experiments in order to validate our results, compare their approach to ours, and extend this work.

Problem statement and model
Considering an ISP network, and focusing for the sake of simplicity on its access tree, faults can occur at multiple levels in the access network hierarchy. The ability to launch measurements between arbitrary pairs of devices in the same access network would significantly enhance the diagnosis of network performance disruptions. In this work we consider two use-cases: user-managed probes and ISP-managed probes. User-managed probes run only on end-user devices and lack topology information. In contrast, ISP-managed probes can reside in home gateways, in special locations inside the ISP network, and can also be available as "apps" on user devices (e.g., smartphones and laptops). We address both use-cases with the same algorithm: clustering in the user scenario separates measurement probes into two sets (i.e., affected and unaffected), whereas an additional mapping in the ISP scenario makes it possible to pinpoint the root cause link.
We formalize the problem and introduce the notation used in this paper with the help of Fig. 1, which depicts a binary access network tree. The troubleshooting probe software runs in the leaf nodes of the tree. However, the ISP can strategically place probes inside the network (e.g., probe 0 in the picture, attached to the root). Our algorithm runs continuously in the background to gather a baseline of network performance, and troubleshooting is triggered by the user (e.g., upon experiencing a performance degradation).
For the sake of clarity, let us assume that probe 1 launches a troubleshooting task. In this context, we can safely assume that the root cause is located somewhere in the path from the user device or gateway towards the Internet (links $\ell_4, \ell_3, \ell_2, \ell_1$, in bold in Fig. 1). In order to identify which among $\ell_4, \ldots, \ell_1$ is the root cause of the fault, probe 1 needs to send probing traffic to a number $M$ of the overall available probes $N$. Let us denote, for convenience, by $D^+ = \log_k(N)$ the maximum depth (i.e., height) of a $k$-ary tree and by $D_i$ the set of probes whose shortest path from probe 1 passes through $\ell_i$ but does not pass through $\ell_{i-1}$. In the access tree, whenever a link $\ell_f$ (located at depth $f$ in the tree) is faulty, all probes whose shortest path from the diagnostic probe (probe 1 in our example) passes through $\ell_f$ will also experience the problem, unlike probes that are reachable through $\ell_{f+1}$: it follows that the troubleshooting algorithm requires probes from both sets $D_f$ and $D_{f+1}$ to infer with certainty that the fault is located at $\ell_f$. For a $k$-ary tree, the minimum number of probes that makes it possible to identify the faulty link irrespective of the depth $f$ of the fault is $M = O(\log_k(N))$; that is, one probe in each of the strata suffices to accurately pinpoint the root cause.
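To make the strata $D_i$ concrete, the following minimal sketch enumerates them for a complete $k$-ary tree. It assumes that leaves are numbered 0 to N-1 from left to right, which is a hypothetical indexing (Fig. 1 numbers probes differently), and the helper names `stratum_of`, `strata` and `strategic_selection` are illustrative only; they are not part of NetProbes.

```python
from collections import defaultdict


def stratum_of(source: int, probe: int, k: int, depth: int) -> int:
    """Return i such that `probe` lies in D_i as seen from `source`.

    Leaves are indexed 0..k**depth - 1 (hypothetical numbering). Dividing a
    leaf index by k repeatedly walks up through its ancestors, so the loop
    stops at the lowest common ancestor of the two leaves.
    """
    if source == probe:
        raise ValueError("a probe does not belong to its own strata")
    a, b, level = source, probe, depth
    while a != b:
        a, b, level = a // k, b // k, level - 1
    # The common ancestor sits at depth `level`, so the topmost link used on
    # the source side of the path is l_{level+1}.
    return level + 1


def strata(source: int, k: int, depth: int) -> dict:
    """Group every other leaf by the stratum D_i it belongs to."""
    groups = defaultdict(list)
    for probe in range(k ** depth):
        if probe != source:
            groups[stratum_of(source, probe, k, depth)].append(probe)
    return dict(groups)


def strategic_selection(source: int, k: int, depth: int) -> list:
    """Pick one probe per stratum, i.e. M = depth probes in total."""
    groups = strata(source, k, depth)
    return [groups[i][0] for i in sorted(groups)]
```

For a binary tree of depth 4 seen from leaf 0, the strata have sizes 8, 4, 2 and 1 for $i = 1, \ldots, 4$, and `strategic_selection(0, 2, 4)` picks four probes, one per stratum.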
Such a strategic probe selection requires either topology knowledge or the assistance of a cooperating server managed by the ISP (e.g., an IETF ALTO [24] server). However, this strategy is not feasible with user-managed probes, in which probe selection is either uniformly random or based on publicly available information such as IP addresses. It is thus important to assess the detection probability of a naive random selection.
Let us denote by $p^-(f, \alpha)$ the probability that a random selection includes a probe that is useful to locate a fault at depth $f \in [1, D^+]$, with a probe budget $\alpha = M/N$. The deeper the fault location, the smaller the number of probes available to identify the faulty link. As the size of $D_f$ decreases exponentially as $f$ increases ($\mathrm{card}(D_f) = k^{D^+ - f}$), we expect the random selection strategy to easily locate faults at small depths (close to the root) and fail at large depths (close to the leaves), where a stratified selection is necessary to sample probes in the smaller set $D_f$. The probability that none of the $M$ vantage points falls into $D_f$ decreases exponentially fast with the size of $D_f$, i.e., $(1 - \alpha)^{\mathrm{card}(D_f)}$. Consequently, the probability to sample at least one probe in $D_f$ is:
$$p^-(f, \alpha) = 1 - (1 - \alpha)^{k^{D^+ - f}} \qquad (1)$$
Expression (1) is a lower bound on the expected detection probability with random selection. When a random subset of probes does not contain any probe in $D_f$, it is still possible to correctly guess the root cause link. Here, there will be ambiguity because multiple links are equally likely to be root cause candidates. At any depth $d$, ambiguity will be limited to the links located between the fault and the root of the tree (i.e., $\ell_d, \ldots, \ell_1$): since, at depth $d$, ambiguity involves $d$ links, the probability of a correct guess is $1/d$.
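The lower bound of expression (1) is easy to evaluate numerically. The sketch below is a direct transcription of $1 - (1 - \alpha)^{\mathrm{card}(D_f)}$ with $\mathrm{card}(D_f) = k^{D^+ - f}$; the function name `p_minus` and its argument names are our own choices for illustration.

```python
def p_minus(f: int, alpha: float, k: int, depth: int) -> float:
    """Lower bound on the detection probability for a fault at depth f.

    alpha is the probe budget M/N, k the fanout and depth the tree
    height D+. card(D_f) = k ** (depth - f) follows the text.
    """
    if not 1 <= f <= depth:
        raise ValueError("fault depth must lie in [1, depth]")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is a fraction of the available probes")
    card = k ** (depth - f)
    return 1.0 - (1.0 - alpha) ** card


if __name__ == "__main__":
    # Binary tree with N = 512 leaves and a budget of M = 50 probes.
    for f in range(1, 10):
        print(f, round(p_minus(f, 50 / 512, k=2, depth=9), 4))
```

With these values the bound is practically 1 near the root and drops to $\alpha$ itself at the deepest level, where $D_f$ contains a single probe.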
To compute the average probability of a correct guess $E[p_{guess}]$, we have to account for the relative frequency of the different ambiguity cases, which for depth $d$ [...]. We can then compute the expected discriminative power of a random selection, expressed in terms of the probability to correctly identify a fault at depth $f$, as:
$$E[p] = p^-(f, \alpha) + \bigl(1 - p^-(f, \alpha)\bigr)\, E[p_{guess}] \qquad (3)$$
where the first term accounts for the proportion of random selection that is structurally equivalent to a stratified selection (so that the root cause link can be found with probability 1), and the second term accounts for the proportion of random selection able to pinpoint the faulty link by luck (thus with probability $E[p_{guess}]$). By plugging (1) and (2) into (3), and keeping $E[p_{guess}]$ in symbolic form, we get:
$$E[p] = 1 - (1 - \alpha)^{k^{D^+ - f}} \bigl(1 - E[p_{guess}]\bigr)$$
Notice that the resulting expression (5) has structurally the form $1 - p_{loss}$. The term $p_{loss}$ can be interpreted as the loss of discriminative power with respect to a perfect strategic selection that always achieves correct detection. Clearly, this model is simplistic, as it does not consider all combinatorial aspects which could be used to obtain finer-grained expectations at each depth of the tree. Yet, the main purpose of the model is to serve as a reality check for our experimental results.
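The way the two terms combine can be checked with a few lines of code. In this hedged sketch the average guess probability $E[p_{guess}]$ is taken as an input, since its closed form is not reproduced here; the function names `expected_detection` and `p_loss` are ours.

```python
def expected_detection(p_minus: float, p_guess: float) -> float:
    """E[p] = p_minus * 1 + (1 - p_minus) * E[p_guess].

    p_minus is the lower bound p^-(f, alpha); p_guess is the average
    probability of a lucky guess, supplied by the caller (assumption).
    """
    return p_minus + (1.0 - p_minus) * p_guess


def p_loss(p_minus: float, p_guess: float) -> float:
    """Loss of discriminative power relative to a strategic selection."""
    return (1.0 - p_minus) * (1.0 - p_guess)


if __name__ == "__main__":
    # Illustrative values only: half of the selections contain a useful
    # probe, and a lucky guess succeeds one time out of four.
    print(expected_detection(0.5, 0.25))  # 0.625
    print(1.0 - p_loss(0.5, 0.25))        # same value, written as 1 - p_loss
```

The two printed values coincide because $p^- + (1 - p^-)E[p_{guess}] = 1 - (1 - p^-)(1 - E[p_{guess}])$.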
We treat both clustering and classification problems with a single algorithm, whose pseudocode is reported in Algorithm 1. Assuming the algorithm runs at a source node $s$, for any performance metric $Q$ (e.g., delay, bandwidth), $s$ collects baseline statistics $Q_0(p)$ with low-rate active measurements towards other peers $p$. When the troubleshooting is triggered, $s$ iteratively selects up to $R$ batches of $B$ probes, so that $R \cdot B$ represents a tuneable probing budget. Selection is made according to a selection policy $S_p$, based on a probe score $S(p)$. The probe selection is iterative because $S(p)$ can vary, and thus the next batch is selected based on the results of the previous batch. At each step, upon doing $B$ measurements, we compute, for each probe $p$, $Q(p) - Q_0(p)$ and add it to the set $P$; K-means clustering then partitions $P$ into $P^+$ and $P^-$. Two points are worth stressing. First, the algorithm does not associate any semantics with the clusters: e.g., a node in $P^+$ can be affected by large delay, whereas a node in $P^-$ can be affected by a bottleneck bandwidth. Second, in case of a single failure, it can be expected that probes in one of the two clusters exhibit $Q(p) - Q_0(p) \approx 0$, so $P^+$ and $P^-$ should be interpreted as a syntactical difference. Once the probe budget is exhausted (or once other stop criteria, which we do not mention for the sake of simplicity, are met), the algorithm either returns $P^+$ and $P^-$ (user-managed case, line 12), or continues with the mapping. When no clear partition can be established, only one set is returned.
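The clustering step can be illustrated with a one-dimensional K-means with two clusters applied to the deviations $Q(p) - Q_0(p)$. This is a minimal sketch, not the NetProbes implementation: the function name `two_means`, the choice of seeding the centroids at the extreme values, the convention that $P^+$ is the cluster with the larger centroid, and the rule "all deviations equal means no partition" are all assumptions made for the example.

```python
def two_means(deltas: dict, max_iter: int = 100):
    """Split probes into (P_plus, P_minus) by 1-D K-means with K = 2.

    `deltas` maps a probe identifier to Q(p) - Q0(p). By convention
    (assumption) P_plus is the cluster with the larger centroid.
    """
    if not deltas:
        return set(), set()
    values = list(deltas.values())
    low, high = min(values), max(values)
    if low == high:
        # No partition can be established: return a single set.
        return set(deltas), set()
    centroids = [low, high]  # deterministic seeding at the extremes
    for _ in range(max_iter):
        clusters = ([], [])
        for probe, value in deltas.items():
            d_low = abs(value - centroids[0])
            d_high = abs(value - centroids[1])
            clusters[0 if d_low <= d_high else 1].append(probe)
        updated = [sum(deltas[p] for p in c) / len(c) for c in clusters]
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return set(clusters[1]), set(clusters[0])


if __name__ == "__main__":
    # Hypothetical RTT deviations in milliseconds for four probes.
    deltas = {"p2": 0.1, "p3": -0.2, "p4": 20.3, "p5": 19.8}
    print(two_means(deltas))
```

Seeding at the minimum and maximum keeps both clusters non-empty at every iteration, which avoids the empty-cluster corner case of randomly initialised K-means.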
To map probes in $P^+$ and $P^-$ to links, the algorithm requires knowledge of the links $\ell$ in the shortest path $SP(s, p)$. The score $S(\ell)$ of $\ell \in SP(s, p)$ is incremented by 1 for $p \in P^+$ and decremented by 1 for $p \in P^-$. Because the algorithm is metric-agnostic, it needs to know whether links with the largest or the smallest $S(\ell)$ scores are to be pinpointed, which is done according to a link selection policy $S_\ell$.
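The mapping from clusters to links can be sketched as follows. The code assumes that the shortest paths $SP(s, p)$ are given as lists of link identifiers; the link names and the toy topology in the usage example are hypothetical, and `link_scores` and `argmax_link` are illustrative names.

```python
from collections import Counter


def link_scores(paths: dict, p_plus: set, p_minus: set) -> dict:
    """Compute S(l): +1 per probe in P+ crossing l, -1 per probe in P-.

    `paths` maps each probed peer p to the list of links on SP(s, p).
    """
    scores = Counter()
    for probe, links in paths.items():
        if probe in p_plus:
            step = 1
        elif probe in p_minus:
            step = -1
        else:
            continue  # probe was not clustered, ignore it
        for link in links:
            scores[link] += step
    return dict(scores)


def argmax_link(scores: dict, largest: bool = True):
    """Argmax policy: the link with the largest (or smallest) score."""
    pick = max if largest else min
    return pick(scores, key=scores.get)


if __name__ == "__main__":
    # Toy example: source-side links l3, l2, l1 with a fault on l2.
    paths = {
        "a": ["l3", "a1"],                          # avoids l2: unaffected
        "b": ["l3", "l2", "b1", "b2"],              # crosses l2: affected
        "c": ["l3", "l2", "l1", "c1", "c2", "c3"],  # crosses l2: affected
    }
    scores = link_scores(paths, p_plus={"b", "c"}, p_minus={"a"})
    print(argmax_link(scores))  # l2
```

In the toy example the faulty link is the only one crossed by every affected probe and by no unaffected probe, so it collects the highest score.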
We experiment with $S_p \in \{\text{random}, |IP(s) - IP(p)|, \text{balance}\}$ and combinations of the above. Random selection is useful as a baseline and to compare with the model. We additionally consider probe selection policies that are more complex to model, such as the absolute distance in the IP space, as well as a policy that attempts to equalize the sizes of $P^+$ and $P^-$ by selecting an IP that is close to IPs in the small cluster and far from IPs in the large cluster (exact definition omitted due to lack of space). Moreover, we consider $S_\ell \in \{\text{random}, \text{proportional}, \text{argmax}\}$. The naïve random method makes an informed guess by selecting one of the $D^+$ links in the path $\ell_{D^+}, \ldots, \ell_1$ to the root (success probability $1/D^+$, much larger than the $1/(2(k^{D^+} - 1)) = 1/(2(N - 1))$ of a random guess over all links). We also select links proportionally to their score (proportional policy), or only the link with the largest (smallest) score (argmax policy).
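As an illustration of the IP-distance score $|IP(s) - IP(p)|$, the sketch below interprets addresses as integers with Python's standard `ipaddress` module. Whether the policy should prefer close or distant probes is not specified in the text, so ranking candidates from closest to farthest is an assumption, and the function names are ours.

```python
import ipaddress


def ip_distance(a: str, b: str) -> int:
    """Absolute distance between two addresses in the IP space."""
    return abs(int(ipaddress.ip_address(a)) - int(ipaddress.ip_address(b)))


def rank_by_ip_distance(source: str, candidates: list) -> list:
    """Order candidate probes from closest to farthest (assumed order)."""
    return sorted(candidates, key=lambda peer: ip_distance(source, peer))


if __name__ == "__main__":
    peers = ["10.0.1.1", "10.0.0.5", "10.0.0.2"]
    print(rank_by_ip_distance("10.0.0.1", peers))
```

Note that this distance ignores net masks and address hierarchy, so two numerically close addresses are not necessarily close in the topology.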

Calibration of the emulation environment
Before running a full-fledged measurement campaign, it is mandatory to perform a rigorous calibration phase, yet this phase is often neglected [22]. In this work, we follow an experimental approach using emulation in Mininet, to control the duration and the location of the faults. However, it is unclear how well state-of-the-art delay and bandwidth measurement techniques perform in Mininet. In order to disambiguate inconsistencies due to Mininet from measurement errors intrinsic to measurement techniques, we perform calibration experiments for a set of delay (expectedly easy) and bandwidth (notoriously difficult) measurement tools and assess their accuracy in Mininet. In this section, we first briefly describe Mininet and NetProbes, the diagnosis software we develop for this work (Sec. 5.1), then present the calibration results (Sec. 5.2).

Algorithm 1 Detection algorithm at s
 1: Get a baseline Q0(p) for metric Q(p), ∀p    ▷ Initialization, over long timescale
 2: for round ∈ [1..R] do    ▷ When triggered upon user/ISP demand
 3:     select a batch of B probes according to a probe selection policy Sp, based on score S(p)
 4:     for p ∈ B do
 5:         perform active measurements with p to get Q(p) − Q0(p)
 6:         add probe p to probed set P
 7:         partition P into P+ and P−, by K-means clustering on Q(p) − Q0(p)
 8:     end for
 9:     update probe scores S(p), ∀p
10: end for
11: if topology is not available then    ▷ Clustering results
12:     return P+ and P−
13: else    ▷ Classification results
14:     for probe p ∈ P do
15:         for link ℓ ∈ shortest path SP(s, p) do
16:             increment S(ℓ) by 1 if p ∈ P+, decrement S(ℓ) by 1 if p ∈ P−
17:         end for
18:     end for
19:     return link ℓ according to a link selection strategy Sℓ based on scores S(ℓ)
20: end if

Software Tools
Mininet [13]. Mininet is an open source emulator that creates a virtual network of end-hosts, links, and OpenFlow virtual switches in a single Linux kernel and supports experiments with almost arbitrary network topologies. Mininet hosts execute code in real time, exchange real network traffic, and behave similarly to deployed hardware. All the software developed for a virtual Mininet network can run in hardware networks and be shared with others to reproduce the experiments. Mininet provides the functional and timing realism of testbeds in addition to the flexibility and full control of simulators. Experimenters configure packet forwarding at the switches with OpenFlow, and link characteristics (e.g., delay and bandwidth) with the Linux Traffic Control (tc). Reproductions of experiments from tier-1 conference papers indicate that results from Mininet and from testbeds are in agreement.

NetProbes [2]
We design NetProbes, a distributed software tool written in Python 3.x that runs on end-hosts and executes a set of user-defined active measurement tests. NetProbes agents deployed at end-user devices and gateways form an overlay. They perform a set of periodic measurements to monitor the paths in the overlay and collect a baseline of network performance. When the user experiences network performance issues, the NetProbes agent running at the user device launches a troubleshooting task to assess the severity of the performance issue and the location of the faulty link. It is worth pointing out that the set of measurement tasks that can be performed by NetProbes agents (e.g., HTTP or DNS requests, multicast UDP tests, etc.) is far larger than what we consider within the scope of this paper, and that the software is available at [2].

Delay and bandwidth calibration
Setup. We build a Mininet virtual network with the topology depicted in Fig. 1 on a server with four cores and 24 GB of RAM. We run the selected tools on probes 1 and 2. In our delay experiments, we impose five different delay values (0 ms, 20 ms, 100 ms, 200 ms, 1000 ms) on $\ell_3$, located at depth $d = 3$ in the tree. At each delay level, probes 1 and 2 perform 50 measurements of round trip delays to probes 7 and 6 respectively (250 measurements in total for each pair of probes). We use Mininet processes through the Python API to issue ping and traceroute to measure RTTs (we test traceroute with UDP, UDP Lite, TCP, and ICMP).
Similarly, in the bandwidth experiments, we vary the link capacity of $\ell_3$ (100 Mbps, 10 Mbps, 1 Mbps) under three different traffic shapers, namely the hierarchical token bucket (HTB), the token bucket filter (TBF), and the hierarchical fair service curve (HFSC), and we make 20 measurements of the available bandwidth between probes 1 and 7 and between probes 2 and 6 (120 in total for each value of the link capacity). There is a plethora of measurement tools designed by the research community to estimate the available bandwidth [11]. In this work we report only the calibration of three popular tools (Abing [20], ASSOLO [10], and IGI [14]), which are characterised by low intrusiveness: Abing and IGI infer the available bandwidth from the dispersion of packet pairs measured at the receiver, while ASSOLO sends a variable bit-rate stream with exponentially spaced packets and calculates the available bandwidth from the delays at the receiver side. We compare the performance of the three bandwidth estimation tools in the absence of cross traffic and under the three traffic shapers mentioned earlier.
Delay. We expect delay measurements to be flawless. Yet we observe that the first packet sent between any two hosts exhibits a large delay variance: this is due to the fact that the corresponding entry for the flow is missing in the virtual switch and thus requires data exchange between the OpenFlow controller and the virtual switch, whereas the forwarding entry is ready for subsequent packets. We thus compute the baseline $Q_0(p)$ over multiple packets (50 for delay) to mitigate this phenomenon, so that the impact of the first packet delay is factored out in the warmup phase. Computing a baseline and subtracting it from each delay measurement enables an accurate study of the effect of the imposed delay value on the accuracy of the measurement technique. Further results are shown in Fig. 2. All techniques exhibit a time evolution similar to ICMP ping, whose experiment is depicted in Fig. 2(a). We report the PDF of the measurement error (i.e., the difference between the measured and the enforced RTT) in Fig. 2(b). Results for traceroute with various protocols are similar: we observe that, for all the delay measurement techniques, the bulk of the error distribution is less than 1 ms (with outliers, not shown, up to 10 ms). Moreover, we note that using ICMP brings the absolute error to less than 0.1 ms for both traceroute and ping. From this calibration phase, we select ICMP ping to measure delay: as the measurement noise is insignificant, errors in the classification outcome should be solely attributed to our troubleshooting algorithm.
Capacity. Fig. 3 reports the evolution of the estimated available bandwidth as a function of three link capacity values for the cross product of {Abing, ASSOLO, IGI} × {HTB, TBF, HFSC}. We stress that while comparison of bandwidth estimation tools under the same experimental conditions has already been studied, we are not aware of any study jointly considering bandwidth estimation and bandwidth shaping, especially since many bandwidth measurement tools rely on effects of cross-traffic to estimate available bandwidth. As before, we use a warmup phase to factor out the extra delay incurred by the first packet. We can see that Abing systematically fails in estimating the available bandwidth under HTB and TBF shaping, while the estimation is correct with HFSC. Similarly, ASSOLO fails in estimating 1 Mbps available bandwidth under all shapers, and additionally fails the estimation of 10 Mbps under TBF. In contrast, IGI succeeds in accurately tracking changes of available bandwidth at $\ell_3$, although outliers are still possible (see IGI+TBF). A downside of IGI is that its measurements last longer than measurements with Abing or ASSOLO. These results and tradeoffs are interesting and deserve future attention, but they are beyond the scope of this work. The most important takeaway is that measurement errors of such magnitude would invalidate all experiments, showing once more the importance of this calibration phase. We additionally gather that the IGI+HFSC combination offers the most accurate estimates of available bandwidth. As accurate input is a necessary condition for troubleshooting success, we use this combination in the remainder of this paper.
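The baseline subtraction and the measurement error used in the delay calibration can be sketched as follows. This is our reading of the procedure, not the authors' code: the baseline is taken as the median of the samples after discarding the first (warmup) packet, and the error is the baseline-corrected RTT minus the enforced RTT. The use of the median, the default of one warmup packet and the function names are assumptions.

```python
from statistics import median


def baseline_rtt(samples_ms: list, warmup: int = 1) -> float:
    """Baseline Q0(p): median RTT after dropping `warmup` initial packets.

    The first packet is inflated by the flow-entry setup in the virtual
    switch, hence it is excluded (assumed default: one packet).
    """
    steady = samples_ms[warmup:]
    if not steady:
        raise ValueError("need at least one sample after the warmup")
    return median(steady)


def measurement_errors(measured_ms: list, baseline_ms: float,
                       enforced_rtt_ms: float) -> list:
    """Error of each sample: (measured - baseline) - enforced RTT."""
    return [m - baseline_ms - enforced_rtt_ms for m in measured_ms]


if __name__ == "__main__":
    # Hypothetical samples: a slow first packet, then a stable 0.2 ms RTT.
    q0 = baseline_rtt([5.0, 0.2, 0.2, 0.2])
    # Hypothetical measurements taken with 20 ms of enforced RTT.
    print(measurement_errors([20.3, 20.2], q0, enforced_rtt_ms=20.0))
```

The caller decides how the delay imposed on a link translates into an enforced RTT, since that depends on whether the delay applies to one or both directions.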

Experimental results
We now evaluate the quality of our clustering and classification for various probe budgets (namely 10, 20 and 50 probes) for faults (e.g., doubling delay or halving bandwidth) at controlled depths of the tree. All the scripts to reproduce the experiments are available at [1]. We first compare experimental results in a calibrated Mininet environment (including real-world noise) with those expected by a probabilistic model (neglecting noise) (Sec. 6.1). We next perform a sensitivity analysis by varying topological properties, probe selection policies $S_p$, and link selection policies $S_\ell$ (Sec. 6.2).

Performance at a glance
We perform experiments over a binary tree scenario ($k = 2$) with depth $D^+ = 9$ and $N = 512$ leaf nodes. In this case, a strategic probe selection would need $M = 9$ of the $N = 512$ probes ($\alpha = M/N \approx 1.75\%$) to ensure perfect classification, but we consider larger budgets $M \in \{10, 20, 50\}$ in our experiments. Unless otherwise stated, we use a random probe selection policy $S_p$ and an argmax link selection policy $S_\ell$. We first evaluate the clustering methodology by comparing the two sets of affected and unaffected probes obtained from the algorithm with our ground truth, using the well-known Rand index [12], which takes values in $[0, 1]$, with 1 indicating that the data clusters are exactly the same.
Since we have full control over the location of the fault, we build our ground truth by assigning the label "affected" to all the available probes (under a given budget constraint) for which the path to the diagnostic probe passes through the faulty link. The remaining probes constitute the unaffected set. Fig. 4(a) shows that, provided measurements are accurate, the clustering methodology successfully identifies the set of probes whose paths from the diagnostic software experience significant network performance disruptions (and as a consequence accurately identifies nodes in the complementary set of unaffected probes). For budgets of 10, 20 and 50 probes, the Rand index shows a perfect match between the ground truth and the clustering output in the case of delay measurement. Results instead degrade significantly for bandwidth measurement: we point out that the loss of accuracy is not tied to our algorithm, but rather to the measurements that are input to it, which was partly expected and confirms that calibration is a necessary, but unfortunately not sufficient, step.
Abstracting from limits in the measurement techniques, these results indicate that in practice our clustering methodology works well in assessing the impact of a faulty link without requiring knowledge of the network topology. Yet, root cause link identification is a clearly more challenging and important objective, which we analyze in the following by restricting our attention to delay experiments: as the classification step is a deterministic mapping from the clusters, as long as the measurement error remains small, the results of the classification task are not affected by the specific metric under investigation. We expect classification results to apply at large, as opposed to merely illustrating the algorithm performance under delay measurement (although they are not representative of bottleneck localization as per Fig. 4(a)).
We next show that the experimental and modelling results are in agreement, with a random probe selection policy and a budget of $M = 50$ probes, which corresponds to $\alpha \approx 9.75\%$. For each fault depth $f$, we perform 10 experiments by randomizing the set of destination probes. Results, as reported in Fig. 4(b), depict the correct classification probability of the model vs the experiments. Recall that equation (1) gives a lower bound $p^-(f, \alpha)$ to the experimental results, while (3) models the average expected detection probability $E[p]$. We consider $\alpha \approx 9.75\%$, to directly compare with experimental results, as well as $\alpha \approx 1.75\%$, to assess the loss of discriminative power from a strategic selection, which could achieve perfect classification in this setting, to a random selection (denoted with $p_{loss}$ in the figure).
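For reference, the Rand index between two labelings of the same probes is the fraction of probe pairs on which the two labelings agree, i.e., pairs that are either grouped together in both or separated in both. A minimal sketch follows; the function name and the list-based input format are our own choices.

```python
from itertools import combinations


def rand_index(labels_a: list, labels_b: list) -> float:
    """Fraction of element pairs treated consistently by both labelings.

    A pair agrees when it is in the same cluster in both labelings or in
    different clusters in both. Cluster names themselves do not matter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same probes")
    if len(labels_a) < 2:
        raise ValueError("at least two probes are needed to form a pair")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


if __name__ == "__main__":
    truth = ["affected", "affected", "unaffected", "unaffected"]
    output = ["P-", "P-", "P+", "P+"]  # same partition, different names
    print(rand_index(truth, output))   # 1.0
```

Because only pair co-membership is compared, the index is insensitive to which cluster is called $P^+$ or $P^-$, which suits an algorithm that attaches no semantics to its clusters.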

Sensitivity analysis
Impact of topology. We study the impact of the network topology on the classification performance. We use two trees with 512 probes (i.e., leaves) each. The first tree has a depth $d = 3$ and a fanout $k = 8$, while the second tree has a depth $d = 9$ and a fanout $k = 2$. Fig. 5 reports the correct detection probability of the faulty link as a function of the depth of the injected fault in the tree, using variance bars. As expected, results indicate that the correct detection probability decreases as the fault depth increases. Thus, when the root cause link is located close to the leaves of the tree, it is harder to randomly sample another probe which is also affected by the fault: we thus need a smarter probe selection strategy to improve the link classification performance.
Impact of the probe selection policy $S_p$. We consider policies based on IP-distance (IP), cluster-size (balance), and a linear combination of both. We average the results over all the depths of the binary tree and contrast them with a random selection policy. Unfortunately, our attempts are so far unsuccessful, as shown in Fig. 6(a), where the discriminative power is roughly the same over all probe selection policies. This is due to the fact that the current set of metrics we consider to select probes does not encode useful information to bias the selection. For example, the absence of a notion of net masks and hierarchy in the IP-distance makes it hard to extract information about how topologically close or far probes are from each other. An obvious improvement would be to consider the IP-TTL field. However, since Mininet uses virtual switches to construct the network, the IP-TTL field remains unchanged. As a consequence, we could not conduct experiments with this field and we leave it as future work.
Impact of the link selection policy $S_\ell$. Finally, we use three different policies to select the faulty links: $S_\ell \in \{\text{random}, \text{proportional}, \text{argmax}\}$. Results, averaged over all depths of the binary tree, are reported in Fig. 6. The plot is further annotated with the gain factor over the random selection: while proportional selection brings a constant

Conclusions and future work
In this work, we present a troubleshooting algorithm to diagnose network performance disruptions in the home and access networks. We apply a clustering methodology to evaluate the severity of the performance issue and leverage the knowledge of the access network topology to identify the root cause link with a correct classification probability of 70% using 10% of the available probes. We follow an experimental approach and use an emulated environment based on Mininet to validate our algorithm. Our choice of Mininet is guided by our requirements to have flexibility in designing the experiments, full control over the injected faults, and realistic network settings. We contrast the experimental results with an analytical model that computes the expected correct classification probability under a random probe selection policy. We also evaluate the impact of topology, probe selection policies and link selection policies on the algorithm. Our proposed solution is a first step towards the goal of having reproducible network troubleshooting algorithms, for which we make all our code publicly available. Our future work will focus on extending the algorithm to different network topologies and on diversifying the set of network performance metrics, to verify its generality. Also, while simplicity was one of the goals of this paper, and allowed us to compare analytical vs experimental results, our future work will address more practical issues, such as how our design can be integrated with, and complement, troubleshooting systems already deployed by ISPs.