Suppression Distance Computation for Hierarchical Clusterings

We discuss the computation of a distance between two hierarchical clusterings of the same set. It is defined as the minimum number of elements that have to be removed so that the remaining clusterings are equal. The problem of distance computing was extensively studied for partitions. We prove that it can be solved in polynomial time in the case of hierarchies, as it gives rise to a class of perfect graphs. We also propose an algorithm based on recursively computing maximum assignments.


Introduction
Decomposing a set into patterns of interest is a central problem in data analysis. Evaluating the distance between decompositions is an important task in this context, as it makes it possible to study the behaviour of clustering algorithms or the evolution of a set of patterns over time. A decomposition in which the detected patterns do not overlap is called a partition. Measures based on edit distance [3] or on mutual information [6] can be used to assess the distance between partitions. The former corresponds to the minimum number of elements that need to be moved from one group to another for the two partitions to be equal (called transfer distance in [7]). It was used for practical applications in bioinformatics [10]. Similar definitions can also be applied to other kinds of decompositions, e.g. with overlapping groups (called set covers).
This work focuses on hierarchical clusterings (also called hierarchies), in which each group can be recursively decomposed into smaller groups. Defining a distance between hierarchies is of interest because hierarchies can be used to represent and study a system (such as complex networks [8]) at different scales. Comparing hierarchical clusterings is related to the comparison of phylogenetic trees [9] in biology, although those objects typically have more constraints than the decompositions studied here.
We investigate the problem of finding the minimum number of elements to be removed so that the remaining hierarchical clusterings are equal or, equivalently, the size of the smallest subset of elements on which the decompositions "disagree". After defining the core concepts (Section 2), we provide two alternative proofs of the main claim (Sections 3 and 4). The first links the problem to a class of perfect graphs (generalizing the results of [3]), since the difference between hierarchies can be encoded into a graph (called the difference graph) with specific characteristics. The second provides a polynomial algorithm to compute the distance between hierarchical clusterings. Both approaches are based on similar observations (Lemmas 2 and 3). Section 5 provides concluding remarks and directions for future work.

Definitions
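We assume we have a set $S$ of elements of finite cardinality. A hierarchy $\mathcal{H} = \{H_i\}_{1 \le i \le k}$ is a collection of subsets of $S$ (called groups) in which any two groups are either disjoint or included one in the other. The inclusion relation between the groups defines a partially ordered set. It can be represented as a forest, the roots of the trees being the groups that are not included in any other group.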
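Let $N_i(\mathcal{H})$ denote the $i$-th level of $\mathcal{H}$, i.e. the groups sitting at depth $i$ in this forest. Notice that it is still well defined if $\mathcal{H}$ contains repeated groups. A level $N_i(\mathcal{H})$ is a partition since it does not contain overlapping sets. The depth of a hierarchy $d(\mathcal{H})$ is the maximum depth of its groups. We define the sub-hierarchy $\mathcal{H}[S']$ induced by $S' \subseteq S$ as the non-empty sets of $\{S' \cap H_i\}_{1 \le i \le k}$. It is the hierarchical clustering of $S'$ obtained after removing every element of $S \setminus S'$ from each group of $\mathcal{H}$ (discarding empty sets).

Definition 1. (Suppression Distance) Let $\mathcal{H}_1$ and $\mathcal{H}_2$ be two hierarchies of $S$. The suppression distance $d_s$ is defined as

$$d_s(\mathcal{H}_1, \mathcal{H}_2) = \min \{\, |S'| : S' \subseteq S,\ \mathcal{H}_1[S \setminus S'] = \mathcal{H}_2[S \setminus S'] \,\}.$$

A set $S'$ such that $\mathcal{H}_1[S \setminus S'] = \mathcal{H}_2[S \setminus S']$ is called a suppression set for $(\mathcal{H}_1, \mathcal{H}_2)$.

Theorem 1. The function $d_s$ is a metric.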
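Proof. The non-negativity, identity and symmetry properties are straightforward for $d_s$. Moreover, this distance satisfies the triangle inequality. Consider three hierarchies $\mathcal{H}_1$, $\mathcal{H}_2$ and $\mathcal{H}_3$. Let $S_{ij} \subseteq S$ be a minimum suppression set for $(\mathcal{H}_i, \mathcal{H}_j)$. Since $S_{12} \cup S_{23}$ is also a suppression set for $(\mathcal{H}_1, \mathcal{H}_3)$, we have:

$$d_s(\mathcal{H}_1, \mathcal{H}_3) \le |S_{12} \cup S_{23}| \le |S_{12}| + |S_{23}| = d_s(\mathcal{H}_1, \mathcal{H}_2) + d_s(\mathcal{H}_2, \mathcal{H}_3).$$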
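To make the definition of the suppression distance concrete, the following Python sketch computes it by exhaustive search. It only illustrates the definition and is not the polynomial method discussed in this paper. The function names `induced` and `suppression_distance_bruteforce`, the representation of a hierarchy as a list of `frozenset` groups, and the five-element example are all assumptions made for this illustration; the running time is exponential in the number of elements.

```python
from collections import Counter
from itertools import combinations


def induced(hierarchy, keep):
    """Sub-hierarchy induced by `keep`: intersect every group with `keep`
    and discard empty sets. A Counter is used because groups may repeat."""
    return Counter(group & keep for group in hierarchy if group & keep)


def suppression_distance_bruteforce(h1, h2, elements):
    """Try removal sets of increasing size until the induced hierarchies agree."""
    elements = frozenset(elements)
    for k in range(len(elements) + 1):
        for removed in combinations(sorted(elements), k):
            keep = elements - frozenset(removed)
            if induced(h1, keep) == induced(h2, keep):
                return k, set(removed)


# Hypothetical example (not taken from the paper's Figure 1).
h1 = [frozenset("abc"), frozenset("ab"), frozenset("de")]
h2 = [frozenset("abd"), frozenset("ab"), frozenset("ce")]
distance, removed = suppression_distance_bruteforce(h1, h2, "abcde")
print(distance, sorted(removed))  # distance 2, reached by removing c and d
```

The search always terminates because removing every element leaves two empty hierarchies, which are equal.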

Existence of a polynomial-time solution
We give here a non-constructive proof of the existence of a polynomial-time algorithm. It generalizes the results of Gusfield [3] on the equivalence between this problem and the minimum vertex cover problem on perfect graphs. The difference between hierarchies can be encoded in a difference graph (Definition 2). Finding a minimum suppression set for two hierarchies is equivalent to finding a minimum vertex cover in this graph (Theorem 2). Since this graph is perfect [5] (Theorem 3), there exists a polynomial-time algorithm to solve this problem.
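Definition 2. (Difference graph) Let $\mathcal{H}_1$ and $\mathcal{H}_2$ be two hierarchies of $S$. The difference graph $G(\mathcal{H}_1, \mathcal{H}_2) = (S, E)$ has the edge set

$$E = \{\, (a, b) \in S^2 : |\{H \in \mathcal{H}_1 : \{a, b\} \subseteq H\}| \ne |\{H \in \mathcal{H}_2 : \{a, b\} \subseteq H\}| \,\}.$$

This graph can contain self-loops. Two elements of $S$ are connected iff they do not appear together in the same number of groups in both hierarchies. An example of hierarchies and their difference graph can be found in Figure 1.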
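The construction of the difference graph can be sketched in a few lines of Python. This is a minimal illustration under assumptions made here: hierarchies are lists of `frozenset` groups, an edge is stored as a sorted pair, a self-loop is stored as `(a, a)`, and the function name `difference_graph` and the example hierarchies are hypothetical.

```python
from itertools import combinations_with_replacement


def together(hierarchy, a, b):
    """Number of groups of `hierarchy` that contain both a and b."""
    return sum(1 for group in hierarchy if a in group and b in group)


def difference_graph(h1, h2, elements):
    """Edges (a, b) whose co-occurrence counts differ between h1 and h2.
    Taking a == b yields the self-loops of elements whose group counts differ."""
    return {
        (a, b)
        for a, b in combinations_with_replacement(sorted(elements), 2)
        if together(h1, a, b) != together(h2, a, b)
    }


# Hypothetical example (not taken from the paper's Figure 1).
h1 = [frozenset("abc"), frozenset("ab"), frozenset("de")]
h2 = [frozenset("abd"), frozenset("ab"), frozenset("ce")]
edges = difference_graph(h1, h2, "abcde")
print(sorted(edges))
```

In this example every element belongs to the same number of groups in both hierarchies, so the resulting graph has no self-loops.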
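Lemma 1. Let $S' \subseteq S$. Then $G(\mathcal{H}_1[S'], \mathcal{H}_2[S']) = G(\mathcal{H}_1, \mathcal{H}_2)[S']$. Indeed, for $i \in \{1, 2\}$, the number of groups in which a pair $(s_1, s_2) \in S' \times S'$ appears together is the same in $\mathcal{H}_i$ and $\mathcal{H}_i[S']$ by definition of an induced hierarchy. Therefore, we have $G(\mathcal{H}_1[S'], \mathcal{H}_2[S']) = G(\mathcal{H}_1, \mathcal{H}_2)[S']$ by definition of an induced subgraph.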
Every $s \in S$ belongs to the same number of sets in both hierarchies and [...] since all elements in $S$ belong to at most one group at a given level by definition of a hierarchy. Indeed, let $(a, b)$ [...] and any $H \in \mathcal{H}$ only belongs to one level $N_i(\mathcal{H})$ by definition of a level. (d) By contradiction, assuming $E(G) = \emptyset$ and [...] should contain at least one edge as the difference graph of two partitions (Lemma 3.1 of [3]). This contradicts the hypothesis $E(G) = \emptyset$.
We now show that a minimum suppression set for $(\mathcal{H}_1, \mathcal{H}_2)$ is also a minimum vertex cover of $G$. Since $(E(G) = \emptyset) \Leftrightarrow (\mathcal{H}_1 = \mathcal{H}_2)$ and according to Lemma 1, for $S' \subseteq S$, we have

$$(\mathcal{H}_1[S \setminus S'] = \mathcal{H}_2[S \setminus S']) \Leftrightarrow (E(G[S \setminus S']) = \emptyset),$$

i.e. $S'$ is a suppression set iff it is a vertex cover of $G$.

We assume for the rest of the paper that each element of $S$ belongs to the same number of sets in both hierarchies. Indeed, if this is not the case, the elements that appear in a different number of groups are part of every suppression set (equivalently, they have self-loops and belong to every minimum vertex cover of $G$). Those elements can be found in polynomial time. If $G(\mathcal{H}_1, \mathcal{H}_2)$ contains no self-loops, then $\mathcal{H}_1$ and $\mathcal{H}_2$ have the same depth $d$.
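The correspondence between suppression sets and vertex covers can be checked on small instances with an exhaustive search. The sketch below is only an illustration and is exponential in the number of vertices. The function name `min_vertex_cover` is an assumption, and the edge list is the difference graph of the hypothetical hierarchies {abc, ab, de} and {abd, ab, ce}, which do not come from the paper.

```python
from itertools import combinations


def min_vertex_cover(vertices, edges):
    """Smallest vertex set touching every edge, found by exhaustive search.
    A self-loop (a, a) can only be covered by a itself."""
    vertices = sorted(vertices)
    for k in range(len(vertices) + 1):
        for cover in combinations(vertices, k):
            chosen = set(cover)
            if all(a in chosen or b in chosen for a, b in edges):
                return chosen


# Difference graph of the hypothetical hierarchies named in the lead-in.
edges = {("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "e"), ("d", "e")}
cover = min_vertex_cover("abcde", edges)
print(sorted(cover))  # ['c', 'd']
```

Removing c and d from both example hierarchies leaves the groups {a, b}, {a, b} and {e} on each side, so the cover of size 2 is also a suppression set of size 2.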
We now use the edge function $p : E(G) \to \mathbb{N}$ to encode the first level at which the pair $(a, b) \in E$ belongs to a group of $\mathcal{H}_1$ but not of $\mathcal{H}_2$ (or the opposite). We denote by $G_i$ the subgraph of $G$ formed by the edges $\{e \in E : p(e) \ge i\}$. Notice that $G_1 = G$. In Fig. 1, $p(e) = 1$ for the thin edges and $p(e) = 2$ for the thick edges. Observe that subsets of elements connected by edges of value 2 (e.g. $\{a, b, c\}$) belong to the same group at the first level of $\mathcal{H}_1$ and $\mathcal{H}_2$. Moreover, those subsets are either disconnected or fully pairwise connected (for example, $\{a, b, c\}$ and $d$ form a $K_{3,1}$ when looking only at thin edges). Those observations are generalized in Lemmas 2 and 3.
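The edge function $p$ can be illustrated with a short Python sketch. The representation is an assumption made here: a hierarchy is a forest given as a list of `(group, children)` pairs, so that the groups at nesting depth $i$ form level $i$. The names `levels` and `edge_value` and the two example hierarchies are hypothetical.

```python
def levels(forest):
    """List of levels: entry i-1 holds the groups sitting at depth i."""
    result = []
    while forest:
        result.append([group for group, _ in forest])
        forest = [child for _, children in forest for child in children]
    return result


def together_by_level(forest, a, b):
    """For each level, whether a and b share a group at that level."""
    return [any(a in g and b in g for g in level) for level in levels(forest)]


def edge_value(f1, f2, a, b):
    """First level at which a and b share a group in exactly one hierarchy.
    Returns None when no such level exists, i.e. (a, b) is not an edge."""
    t1, t2 = together_by_level(f1, a, b), together_by_level(f2, a, b)
    depth = max(len(t1), len(t2))
    t1 += [False] * (depth - len(t1))
    t2 += [False] * (depth - len(t2))
    for level, (x, y) in enumerate(zip(t1, t2), start=1):
        if x != y:
            return level
    return None


# Hypothetical depth-2 hierarchies sharing the same root group.
f1 = [(frozenset("abcd"), [(frozenset("ab"), []), (frozenset("cd"), [])])]
f2 = [(frozenset("abcd"), [(frozenset("ac"), []), (frozenset("bd"), [])])]
print(edge_value(f1, f2, "a", "b"))  # 2
print(edge_value(f1, f2, "a", "d"))  # None
```

Here all four edges of the difference graph have value 2, so $G_2 = G$ and the whole set {a, b, c, d} lies in one group at the first level of both hierarchies, in line with the observation above.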
Lemma 2. Let $S' \subseteq S$ and $i > 1$. If $G_i[S']$ is connected then, at a given lower level $j < i$, there exist two unique groups $H \in N_j(\mathcal{H}_1)$ and $H' \in N_j(\mathcal{H}_2)$ such that $S' \subseteq H$ and $S' \subseteq H'$.

Proof. By contradiction, assume that $G_i[S']$ is a connected subgraph but that there exist two non-overlapping subsets $A$ and $B$ of $S'$ with $(A \cup B) = S'$ such that $A$ and $B$ either (1) belong to the same group in $N_j(\mathcal{H}_1)$ but not in $N_j(\mathcal{H}_2)$, or (2) do not belong to the same group in $N_j(\mathcal{H}_1)$ and $N_j(\mathcal{H}_2)$. By definition of a hierarchy, $A$ and $B$ can be split at most once. Therefore both cases are impossible, since otherwise the edges between $(A, B)$ would (1) have a value of $j$, or (2) have a value lower than $j$ or form an empty set ($j = 0$). This contradicts our hypothesis since we assume [...]

Lemma 3. [...]

Proof. According to Lemma 2, the elements in $S'$ belong to the same groups of depth lower than $i$ in both hierarchies. If there exist $u \in S \setminus S'$ and $v \in S'$ such that $(u, v) \in E$ and $p(u, v) = j < i$, then there is a group at depth $j$ in $\mathcal{H}_1$ (or $\mathcal{H}_2$) that contains $(\{u\} \cup S')$ and a group at depth $j$ in $\mathcal{H}_2$ (or $\mathcal{H}_1$) that contains $S'$ but not $u$.

Theorem 3. Let $\mathcal{H}_1$, $\mathcal{H}_2$ be two hierarchies of finite depth of a set $S$. Computing $d_s(\mathcal{H}_1, \mathcal{H}_2)$ can be done in polynomial time.
Proof. First, the difference graph $G$ can be computed in polynomial time. Let $\Psi_d(S)$ denote the pairs of hierarchies of $S$ of depth $d$ in which each element appears in the same number of groups. We show by induction over $d$ that, for any $S$, the difference graph $G$ of any pair in $\Psi_d(S)$ is a perfect graph.

1. Basis. For $d = 1$, $\Psi_1(S)$ corresponds to pairs of partitions and the graph $G$ is therefore perfect (Theorem 3.4 of [3]).
2. Inductive step. Assuming it is true for $d$, we show it is also true for $d + 1$.
Let $\tilde{G}$, with vertex set $\tilde{S}$, denote the graph obtained from $G$ by fusing each maximal connected component of $G_2$ into a single vertex. According to Lemma 2, elements within the same connected component of $G_2$ belong to the same group in the first level. Thus, $\tilde{G} = G(\tilde{P}_1, \tilde{P}_2)$ where $(\tilde{P}_1, \tilde{P}_2) \in \Psi_1(\tilde{S})$ are obtained via the fusion of each maximal connected component of $G_2$ into a new element in the partitions $(N_1(\mathcal{H}_1), N_1(\mathcal{H}_2)) \in \Psi_1(S)$. Therefore, $\tilde{G}$ is perfect as the difference graph of a pair of $\Psi_1(\tilde{S})$.
According to Lemma 3, the graph $G$ can be recovered from $\tilde{G}$ by deleting each $u \in \tilde{S}$ and replacing it by its corresponding connected component $S'$ of $G_2$, connecting each $v \in S'$ to the vertices previously adjacent to $u$ in $\tilde{G}$ (this operation is called substitution of $u$ by $S'$). Note that $\tilde{G}$ is perfect and that every connected subgraph of $G_2$ is perfect as the difference graph of a pair in $\Psi_d(S')$ (by hypothesis). Therefore, $G$ is also perfect since it can be obtained by substituting perfect graphs for vertices of a perfect graph (Theorem 1 of [5], p. 255).
By Theorem 2, the distance $d_s(\mathcal{H}_1, \mathcal{H}_2)$ is equal to the size of a minimum vertex cover of $G$. In our case, $G$ is perfect, so a minimum vertex cover can be computed in polynomial time.

An Algorithm based on recursive maximum assignment
The minimum vertex cover problem can be solved in polynomial time for perfect graphs using the generic ellipsoid method [2]. This method is however not very practical. We therefore propose a combinatorial algorithm for computing the suppression distance based on observations made in the previous section (Lemma 3). We prove its correctness (Theorem 4) using the fact that a minimum vertex cover of $G_2$ (see previous section) is a subset of a minimum vertex cover of $G$ (Lemma 4).
We start by discussing the case of partitions $(P_1, P_2)$. The distance $d_s(P_1, P_2)$ can be computed by solving a maximum assignment problem based on the size of the intersections between all pairs of groups in $P_1$ and $P_2$, using the Hungarian algorithm [4]. The resulting complexity is O([...]).

As explained in the proof of Theorem 2, two hierarchies are equal iff the pairs $\{(N_i(\mathcal{H}_1), N_i(\mathcal{H}_2))\}_{1 \le i \le d}$ are all pairwise equal. However, finding a suppression set $S'$ using a greedy "level-by-level" approach (either top-down or bottom-up) may not lead to an optimal solution. Consider the example given in Fig. 1, where $d_s(\mathcal{H}_1, \mathcal{H}_2) = 3$: a top-down approach may fail since either $\{a, b, c\}$ or $\{d, e, f\}$ can be chosen at level 1 to be part of $S'$, but choosing $\{d, e, f\}$ would lead to a distance of 4. Alternatively, consider the sub-hierarchies induced by the set $\{a, b, c\}$: a bottom-up approach may also fail since either $a$ or $b$ can belong to $S'$ at the last level. Choosing $b$ would lead to a distance of 2 whereas [...]

Algorithm 1.
Input: $\mathcal{H}_1$, $\mathcal{H}_2$ two hierarchies of a set $S$
Output: $S' \subseteq S$ a minimum suppression set
[...]

Algorithm 1 can be used to compute a minimum suppression set (MSS) for two hierarchies. It recursively computes a suppression set for two sub-hierarchies whose elements belong to the same groups at the current level.
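For the partition case discussed above, the assignment formulation can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the paper: the maximum assignment is found here by trying every permutation, which is a simple stand-in for the Hungarian algorithm and only practical for a handful of groups. The function name `partition_suppression_distance` and the example partitions are hypothetical.

```python
from itertools import permutations


def partition_suppression_distance(p1, p2):
    """Suppression distance between two partitions of the same set.

    Each group of p1 is matched with at most one group of p2 so that the
    total size of the matched intersections is maximal; every element
    outside those intersections has to be removed."""
    n = max(len(p1), len(p2))
    # Pad with empty groups so that both sides have n groups.
    rows = list(p1) + [frozenset()] * (n - len(p1))
    cols = list(p2) + [frozenset()] * (n - len(p2))
    overlap = [[len(r & c) for c in cols] for r in rows]
    best = max(
        sum(overlap[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    total = len(frozenset().union(*p1))
    return total - best


# Hypothetical example on six elements.
p1 = [frozenset({1, 2, 3}), frozenset({4, 5}), frozenset({6})]
p2 = [frozenset({1, 2}), frozenset({3, 4, 5, 6})]
print(partition_suppression_distance(p1, p2))  # 2
```

In the example the best matching pairs {1, 2, 3} with {1, 2} and {4, 5} with {3, 4, 5, 6}, keeping four elements, so elements 3 and 6 are removed.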
The set $\{C_1 \cap C_2 : C_1 \in P_1, C_2 \in P_2\}$ contains the maximal subsets of $S$ that are in the same group in both partitions $P_1$ and $P_2$. The function flatMSS$(P_1, P_2)$ returns a minimum suppression set for the partitions $(P_1, P_2)$. The intuition behind Algorithm 1 is that if the set $S'$ constructed at line 6 is a minimum suppression set for the sub-hierarchies, then it is a subset of an optimal solution for $(\mathcal{H}_1, \mathcal{H}_2)$ (Lemma 4). Theorem 4 shows that this is actually the case.

According to Lemma 3, the edge cut $(S', S'')$ forms a complete bipartite graph, where $S''$ is the set of vertices in $(S \setminus S')$ connected to $S'$. The minimum vertex cover of $G_{i-1}$ should contain either all of $S'$ or all of $(S'' \cup C')$. Therefore, $C'$ is a subset of the cover in both cases. Since this is true for the minimum cover of every maximal connected component of $G_i$, the set $C$ is a subset of a minimum vertex cover of $G_{i-1}$.
Proof. Termination: The hierarchies are of finite depth $d$ and the recursive call is used on two sub-hierarchies of depth $d - 1$ (the "root" group is removed in both hierarchies in line 6). The condition in line 1 is always met since we assume that the elements of $S$ appear in the same number of sets in both hierarchies.
Assume that at the end of the loop (lines 5-7), the set $S'$ is the union of the elements to be removed so that those sub-hierarchies are equal. According to Lemma 4, $S'$ is a subset of a minimum suppression set for $(\mathcal{H}_1, \mathcal{H}_2)$. A possible solution is therefore the union of $S'$ and a suppression set of $\mathcal{H}_1[S \setminus S']$ and $\mathcal{H}_2[S \setminus S']$. The latter can be found by looking only at the first level of both hierarchies. The assumption on $S'$ can be shown to be true by induction, since the algorithm returns a minimum suppression set if $(\mathcal{H}_1, \mathcal{H}_2)$ are partitions.
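The recursive scheme described in this section can be sketched in Python as follows. This is one possible reading of the textual description, not a transcription of Algorithm 1, and it rests on several assumptions made here: hierarchies are forests given as lists of `(group, children)` pairs, each element belongs to the same number of groups in both hierarchies, and the partition step `flat_mss` uses an exhaustive search over assignments instead of the Hungarian algorithm. The names `restrict`, `flat_mss`, `mss` and `canon` and the example are hypothetical.

```python
from itertools import permutations


def restrict(forest, keep):
    """Induced sub-hierarchy: intersect each group with `keep`, drop empty groups."""
    return [(g & keep, restrict(ch, keep)) for g, ch in forest if g & keep]


def flat_mss(p1, p2):
    """Minimum suppression set for two lists of disjoint groups, found by
    trying every one-to-one matching of groups (small inputs only)."""
    n = max(len(p1), len(p2))
    rows = p1 + [frozenset()] * (n - len(p1))
    cols = p2 + [frozenset()] * (n - len(p2))
    best_kept = set()
    for perm in permutations(range(n)):
        kept = set().union(*(rows[i] & cols[j] for i, j in enumerate(perm)))
        if len(kept) > len(best_kept):
            best_kept = kept
    return set().union(*p1, *p2) - best_kept


def mss(f1, f2):
    """Suppression set obtained by recursing below every pair of intersecting
    first-level groups, then solving the first level as a partition problem."""
    removed = set()
    for g1, ch1 in f1:
        for g2, ch2 in f2:
            common = g1 & g2
            if common:
                # The two root groups are dropped; only their children are kept.
                removed |= mss(restrict(ch1, common), restrict(ch2, common))
    top1 = [g - removed for g, _ in f1 if g - removed]
    top2 = [g - removed for g, _ in f2 if g - removed]
    return removed | flat_mss(top1, top2)


def canon(forest):
    """Order-independent form of a forest, used to compare two hierarchies."""
    return sorted((sorted(g), canon(ch)) for g, ch in forest)


# Hypothetical depth-2 example: same root, conflicting second levels.
f1 = [(frozenset("abcd"), [(frozenset("ab"), []), (frozenset("cd"), [])])]
f2 = [(frozenset("abcd"), [(frozenset("ac"), []), (frozenset("bd"), [])])]
result = mss(f1, f2)
keep = frozenset("abcd") - result
print(sorted(result), canon(restrict(f1, keep)) == canon(restrict(f2, keep)))
```

In this example the recursion removes two elements at the second level, after which the first levels already agree, so the partition step at the root removes nothing more.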
We briefly discuss the complexity of Algorithm 1. Let $\mathcal{C}_j$ be the subsets of $S$ on which the algorithm is called at depth $j$; it is the union of all intersections computed at depth $j - 1$ during the execution of the algorithm (when $j > 1$). For $C \in \mathcal{C}_j$, the number of required operations is $O((|N_j(\mathcal{H} \ldots$

Conclusion and Future Work
We introduced a generalisation of the suppression distance, defined for partitions, to hierarchical clusterings. Algorithm 1 is polynomial in terms of the sizes of the hierarchies and the number of elements being clustered. Although the number of groups seems to be a limitation, we believe this method is efficient in practice since it recursively removes partial solutions from the hierarchies (which is not taken into account in the complexity analysis).
Hierarchies are a subclass of set covers, i.e. collections of (overlapping) subsets of $S$. The same definition of distance can be used. In this case, finding a minimum suppression set is equivalent to the maximum common subhypergraph problem, which is NP-hard [1]. The same vertex cover technique cannot be directly applied to the most general set covers. However, it might be useful for other similar structures with nested objects like hierarchies.

Lemma 4. Let $G = (S, E)$ be the difference graph of two hierarchies of $S$. Any minimum vertex cover of $G_i$ is a subset of a minimum vertex cover of $G_{i-1}$.

Proof. Let $C$ be a minimum vertex cover of $G_i$ and $S'$ be a maximal connected component of $G_i$. The set $C' = (S' \cap C)$ is a minimum vertex cover of $G_{i-1}[S']$.