Applicability and Interpretability of Hierarchical Agglomerative Clustering With or Without Contiguity Constraints

Hierarchical Agglomerative Classification (HAC) with Ward's linkage has been widely used since its introduction in Ward (1963). The present article reviews the different extensions of the method to various input data and the constrained framework, while providing applicability conditions. In addition, various versions of the graphical representation of the results as a dendrogram are also presented and their properties are clarified. While some of these results can sometimes be found in an heteroclite literature, we clarify and complete them all using a uniform background. In particular, this study reveals an important distinction between a consistency property of the dendrogram and the absence of crossover within it. Finally, a simulation study shows that the constrained version of HAC can sometimes provide more relevant results than its unconstrained version despite the fact that the latter optimizes the objective criterion on a reduced set of solutions at each step. Overall, the article provides comprehensive recommandations for the use of HAC and constrained HAC depending on the input data as well as for the representation of the results.

Hierarchical Agglomerative Clustering (HAC) with Ward's linkage has been widely used since its introduction by Ward (1963). The method is appealing since it provides a simple approach to approximate, for any given number of clusters, the partition minimizing the within-cluster inertia or "error sum of squares". In addition to its simplicity and the fact that it is based on a natural quality criterion, HAC often comes with a popular graphical representation called a dendrogram, that is used as a support for model selection (choice of the number of clusters) and result interpretation. Originally described to cluster data in R p , the method has been applied more generally to data described by arbitrary distances (or dis-similarities). Constrained versions of HAC have also been proposed to incorporate a "contiguity" relation between objects into the clustering process (Lebart, 1978;Grimm, 1987;Gordon, 1996).
However, as already shown by Murtagh and Legendre (2014), confusions still exist between the different versions and how the results are represented with a dendrogram, which is also illustrated in (Grimm, 1987) that presents different alternatives for the representation. These have resulted in different implementations of the Ward's clustering algorithm, with notable differences in the results. More importantly, the applicability framework of the different versions is not always clear: Batagelj (1981) has given very general necessary and sufficient conditions on a linkage value to ensure that it is always increasing for any given dissimilarity. This property is important to ensure the consistency between the results of HAC and their graphical display as a dendrogram. Conditions on a general constraint are also provided in Ferligoj and Batagelj (1982) to ensure a similar property and Grimm (1987) proposes alternative solutions to the standard heights to address the fact that the linkage might sometimes fail to provide a consistent representation of the results of HAC. However, none of these articles fully cover the theoretical properties of these alternatives, for unconstrained and constrained versions of the method.
The goal of the present article is to clarify the conditions of applicability and interpretability of the different versions of HAC and contiguity-constrained HAC (CCHAC).
We discuss the relevance of these methods to the analysis of different types of input data, and the interpretation of the corresponding results. We perform a systematic study of the monotonicity of the different versions of the dendrogram heights by reporting the results already available in the literature for standard HAC and its extensions and by completing the ones that were not available to our knowledge. In addition to providing a uniform presentation of a number of results partially present in the literature, this study reveals an important distinction between the consistency of representation and the absence of crossover within the dendrogram that was not discussed earlier to our knowledge.
Finally, we illustrate the respective behavior of HAC and CCHAC in a simulation study where different heights are used in order to represent the results. This simulation shows that, in addition to reducing the computational time needed to perform the method, the constrained version (CCHAC) can also provide better solutions than the standard one (HAC) when the constraint is consistent with the data, despite the fact that it optimizes the objective criterion on a reduced set of solutions at each step.

HAC and contiguity-constrained HAC
2.1 Hierarchical Agglomerative Clustering HAC was initially described by Ward (1963) for data in R p . Let Ω := {x 1 , · · · , x n } be the set of objects to be clustered, which are assumed to lie in R p . A cluster is a subset of Ω. The loss of information when grouping objects into a cluster G ⊂ Ω is quantified by the inertia (also known as Error Sum of Squares, ESS): wherex G = |G| −1 x i ∈G x i is the center of gravity of G and |G| denotes the cardinal of the set G. Starting from a partition P = {G 1 , · · · , G l } of Ω, the loss of information when merging two clusters G u and G v of P is quantified by : The quantity δ is known as Ward's linkage and it is equal to the variation of withincluster inertia (also called within-cluster sum of squares) after merging two clusters. It also corresponds to the squared distance between centers of gravity: The HAC algorithm is described in Algorithm 1. Starting from the trivial partition P 1 = {{x 1 }, {x 2 }, · · · , {x n }} with n singletons, the HAC algorithm creates a sequence of partitions by successively merging the two clusters whose linkage δ is the smallest 1 , until all objects have been merged into a single cluster. Linkage values at step t can be Algorithm 1 Standard Hierarchical Agglomerative Clustering (HAC) 1: Initialization: P 1 = {{x 1 }, {x 2 }, · · · , {x n }} 2: for t = 1 to n − 1 do

3:
Compute all pairwise linkage values between clusters of the current partition P t

4:
Merge the two clusters with minimal linkage value to obtain the next partition P t+1 5: end for 6: return {P 1 , P 2 , . . . , P n } efficiently updated using linkage values at step t − 1 with a formula known as the Lance-Williams formula (Lance and Williams, 1967). In the case of Ward's linkage, this formula has first been demonstrated by Wishart (1969): The framework of the current section can be extended straightforwardly to the case where the objects to cluster are weighted. However, this study focuses on uniform weights for the sake of simplicity. 1 In the rare situation when the minimal linkage is achieved by more than one merger, a choice between these mergers has to be made. Different choices are made by different implementations of HAC.

HAC under contiguity constraint
A priori information about relations between objects can often be available in applications. For instance, it is the case for spatial statistics, where objects possess natural proximity relations, in genomics, where genomic loci are linearly ordered along the chromosome, or in neuroimaging, with the three-dimensional structure of the brain. According to this point of view, Contiguity-Constrained HAC (CCHAC) allows only mergers between contiguous objects. Considering this approach can have two benefits: (i) more interpretable results by taking into account the natural structure of the data; (ii) a decreased computational time, because only a subset of all possible mergers are considered.
A very general framework for constrained HAC is described in Ferligoj and Batagelj (1982): the contiguity is defined by an arbitrary symmetric relation R ⊂ Ω × Ω that indicates which pairs of objects are said contiguous. Only these pairs are then allowed to be merged at the first step of the algorithm, using the same objective function than in the standard HAC algorithm. The next step iterates similarly, by using the following rule to extend the contiguity relation to merged clusters: Algorithm 2 describes contiguity-constrained hierarchical agglomerative clustering (CCHAC). The only difference with standard HAC lies in the fact that only contigu-Algorithm 2 Contiguity-Constrained Hierarchical Agglomerative Clustering (CCHAC) 1: Initialization: Compute all pairwise linkage values between contiguous clusters of the current partition P t with respect to R t

4:
Merge the two contiguous clusters, G t v 1 and G t v 2 with minimal linkage value to obtain the next partition P t+1 = {G t+1 u } u=1,...,n−t

5:
Extend the contiguity relation to the new cluster G t v 1 ∪ G t v 2 ∈ P t+1 by setting 6: end for 7: return {P 1 , P 2 , . . . , P n } ous clusters are merged. From a computational viewpoint, only the linkage values for a subset of P t × P t have to be considered, which can drastically reduce the number of values to be computed with respect to the standard algorithm. This gain in computational time comes at the price of a (potential) loss in the objective function at a given step of the algorithm, especially if the constraint is not consistent with the dissimilarity or similarity values (see Section 5 for illustration and discussion). This also has a side effect on standard representations of the result of the algorithm, which is discussed in Section 4.
Order-constrained HAC. A simple and useful case of contiguity constraint is the case when the symmetric relation is a contiguity relation defined along a line. This special case of constraint is frequently encountered in genomics (where the contiguity relation is deduced from genomic positions along a given chromosome) and will be called order-constrained HAC (OCHAC) in the sequel. In this context, every cluster has exactly two neighbors (except for the two positioned at the beginning and the end of the line) and at step t of the algorithm, only n − t values of the linkage have to be computed (instead of (n − t)(n − t − 1)/2 for standard HAC). This case is the one implemented in the R package adjclust and an efficient algorithm is described in Ambroise et al. (2019) for sparse datasets.
In this specific case, constrained HAC can be seen as a heuristic to approximate the search of an optimal segmentation (i.e., achieving minimal ESS among all possible segmentations) of the data into K(= n − t) groups, for each possible K. This problem is also known as the "multiple changepoint problem". Strategies already exist to solve this problem both in Euclidean or non-Euclidean settings, and it is known that the sequence of optimal segmentations for each K can be found efficiently in a quadratic time and space complexity using dynamic programing (Steinley and Hubert (2008); Arlot et al. (2019)).
Nevertheless, those approaches are restrained to order constraints and cannot be applied to more general contiguity constraints, contrary to CCHAC. Moreover, the nestedness of the clustering sequences obtained from HAC allows useful graphical representations such as dendrograms (discussed in Section 4.1), contrary to the previously cited methods.
In the present paper, we demonstrate the good properties of the CCHAC for the case of a general contiguity relation and illustrate the opposite situation (where some good properties are not always satisfied for CCHAC) by providing counter-examples and illustrations in the specific case of OCHAC.

Validity of HAC in possibly non-Euclidean settings
In this section, we systematically justify the use of HAC algorithm (with or without contiguity constraints) for all kinds of proximity data, including dissimilarity and similarity data.

Extension to dissimilarity data
The HAC algorithm of Ward (1963) has been designed to cluster elements of R p . In practice however, the objects to be clustered are often only indirectly described by a matrix of pairwise dissimilarities, D = (d ij ) 1≤i,j≤n . Formally, a dissimilarity is a generalization of a distance that is not necessarily embedded into a Euclidean space (e.g., because the triangle inequality does not hold). Here, we only assume that D satisfies the following properties for all i, j ∈ {1, . . . , n}: The HAC algorithm will be applicable to such a dissimilarity matrix D if D is Euclidean. Formally, D is Euclidean if there exists an Euclidean space (E, ·, · ) and n points {x 1 , ..., x n } ⊂ E such that d ij = x i − x j for all i, j ∈ {1, ..., n}, with · the norm induced by the inner product, ., . , on E. Under this assumption, the dissimilarity case is a simple extension of the original R p framework described in Section 2. Different versions of necessary and sufficient conditions for which an observed dissimilarity matrix is Euclidean have been obtained in Schoenberg (1935); Young and Householder (1938); Krislock and Wolkowicz (2012).
When such conditions do not hold, D is simply called a dissimilarity dataset, which is a particular case of proximity or relational datasets. Schleif and Tino (2015) have proposed a typology of such datasets and described different approaches that can be used to extend statistical or learning methods defined for Euclidean data to such proximity data. In brief, the first main strategy consists in finding a way to turn a non-Euclidean dissimilarity into an Euclidean distance, that is the closest (in some sense) to the original dissimilarity. This can be performed using eigenvalue corrections (Chen et al., 2009), embedding strategies (like multidimensional scaling, Kruskal (1964)) or solving a maximum alignment problem (Chen and Ye, 2008), for instance.
A general construction. Alternatively, by using an analogy between distance and dissimilarity, HAC can be directly extended to non-Euclidean data as in Chavent et al. (2018). This extension stems from the fact that, in the Euclidean case of Section 2, the inertia of a cluster may be expressed only in function of sums of the entries of the pairwise where ∆ is defined by ∆(G u , G v ) = x i ∈Gu,x j ∈Gv x i − x j 2 R p for any clusters G u and G v . As a consequence of (5), Ward's linkage between any two clusters G u and G v may itself be written in function of these pairwise distances, see, e.g., Murtagh and Legendre (2014, p. 279): Therefore, as proposed by Chavent et al. (2018), an elegant way to extend Ward's HAC to dissimilarity data is to define the inertia of a cluster using (5), with (sums of) distances replaced by (sums of) dissimilarities, that is: where∆ The corresponding HAC is then formally obtained as the output of Algorithm 1, as described in Section 2.1. In particular, Ward's linkage is still given by (6), with ∆ formally replaced by∆, and, as a consequence, the Lance-Williams update formula is also still given by (4). When the elements of Ω do belong to an Euclidean space and the dissimilarities are the pairwise Euclidean distances x i − x j R p , these two definitions of HAC coincide. Otherwise, HAC is still formally defined, and the linkage can still be seen as a measure of heterogeneity, but the interpretation of the inertia of a cluster as an average squared distance to the center of gravity of the cluster (as in Equation (1)) is lost. Since the two definitions, I andĨ coincide for the Euclidean case, we will only use the notation I in the sequel for the sake of simplicity, even when the data are non-Euclidean dissimilarity data.
The above approach based on pairwise dissimilarities and pseudo-intertia may be used to recover generalizations of Ward-based HAC to non-Euclidean distances already proposed in the literature. In particular, the Ward HAC algorithm associated the latter is also called the Manhattan distance) correspond to the methods proposed by Székely and Rizzo (2005) and Strauss and von Maltitz (2017), respectively.
Remark 1. Székely and Rizzo (2005) and Strauss and von Maltitz (2017) take a different point of view: they define the linkage between two clusters by (6) (up to a scaling factor 1/2); their generalized HAC is then the HAC associated to this linkage. Then, they prove that the Lance-Williams Equation (4) is still valid for this linkage. We favor the above construction by Chavent et al. (2018), which is simply based on pairwise dissimilarities, as it is more intrinsic. It provides a justification for the linkage formula, and the Lance-Williams formula is automatically valid with no proof required.
Finally, there is an ambiguity in the definition of the pseudo-inertia as an extension of the Ward's case. If most authors consider that the dissimilarity is associated to a distance and therefore define the pseudo-inertia based on the squared values d 2 ij , some authors (as Strauss and von Maltitz (2017)) define a linkage equal to the one that would have been obtained with Ward's linkage and a pseudo-inertia described as 1 2|G| x i ,x j ∈G d ij . This ambiguity has long been enforced by popular implemented versions of the algorithm, as it was the case in the R function hclust before Murtagh and Legendre (2014) raised and corrected this problem.

Extension to kernel data
In some cases, proximity relations between objects are described by their resemblance instead of their dissimilarity. We start with the case when the data are described by a kernel matrix. A kernel matrix is a symmetric positive-definite matrix K = (k ij ) 1≤i,j≤n whose entry k ij corresponds to a measure of resemblance between x i and x j . Here, contrary to the Euclidean setting, no specific structure is assumed for Ω, which can be an arbitrary set. Aronszajn (1950) has proved that there exists a unique Hilbert space H equipped with the inner product ·, · H and a unique map φ : This allows to consider the associated distance in H between any two elements φ(x i ) and φ(x j ) for x i , x j ∈ Ω, that implicitly defines a Euclidean distance in Ω by: Therefore, it is possible to use Algorithm 1 for kernel data, even when H is not known explicitly and/or when it is not finite-dimensional. This is an instance of the so-called "kernel trick" (Schölkopf and Smola, 2002). The associated Ward's linkage can itself be re-written directly using sums of elements of the kernel matrix, as shown, e.g., in Dehman (2015): where Contrary to the dissimilarity case described in Section 3.1, the kernel case is a truly interpretable generalization of Ward's original approach because Ward's linkage as calculated in (10) is the variation of within-cluster inertia in the associated Hilbert space H. This case has been described previously in Qin et al. (2003); Ah-Pine and Wang (2016), for instance.

Extension to similarity data
Similarity data also aim at describing pairwise resemblance relations between the objects of Ω through a matrix of similarity (or proximity) measures S = (s ij ) 1≤i,j≤n . Even though the precise definition of a similarity matrix can differ within the literature (see e.g., Hartigan (1967)), it is generally far less constrained than kernel matrices. In most cases, the only conditions required to define a similarity is the symmetry of the matrix S 2 and the positivity of its diagonal. Since both similarities and kernels describe resemblance relations, it seems natural to try to extend the background of Section 3.2 to similarity datasets by using Equation (10). This allows the definition of a linkage, δ S , between clusters via sums of elements of S. However, this heuristic is not well justified since the quantity s ii + s jj − 2s ij is not necessarily non-negative when S is not a positive definite kernel. Thus, it can not be associated to a squared distance as in Equation (9).
The previous work of Miyamoto et al. (2015) has explicitly linked similarity and kernel data in HAC results. More precisely, for any given similarity S, the matrix S λ = (s λ ij ) 1≤i,j≤n such that s λ ij := s ij + 1 {i=j} λ is definite positive for any λ larger than the absolute value of the smallest eigenvalue of S. Therefore, the kernel matrix S λ induces a well-defined linkage δ S λ via Equation (10), which is linked to δ S by: This proposition justifies the extension of Equation (10) to similarity data with Using this heuristic is indeed equivalent to using a given kernel matrix obtained by translating the diagonal of the original similarity S: doing so, the clustering is unchanged and the linkage values are all translated from +λ for the kernel matrix, which does not even change the global shape of the clustering representation when the heights in this representation are the values of the linkage (as discussed in Section 4). The invariance property to this type of correction is specific to Ward's linkage. Therefore, the choice of Ward's linkage is the only choice that provides a natural interpretation of similarity matrices as dot product matrices and that makes a direct link between general similarities and the standard case of Euclidean distances.
However, as for general dissimilarity data in Section 3.1, the interpretation of the linkage as a variation of within-cluster inertia is lost.
Conclusion. In conclusion to this section, we are finally left with only two cases: the Euclidean case (in which objects are embedded in a direct or indirect manner in a Euclidean framework) and the non-Euclidean case. The first case includes the standard case, the case of Euclidean distance matrices and the case of kernels while the latter case includes general dissimilarity and similarity matrices. In the Euclidean case, the original description of the Ward's algorithm is valid as such while, in the second, the algorithm can still be formally applied in a very similar manner at the cost of a loss of the interpretability of the criterion.

Dendrograms
The results of HAC algorithms are usually displayed as dendrograms. A dendrogram is a binary tree in which each node corresponds to a cluster, and, in particular, the leaves are the original objects to be clustered. The edges connect the two clusters (nodes) merged at a given step of the algorithm. The height of the leaves is generally supposed to be h 0 = 0.
In the case of OCHAC, these leaves are displayed as indicated by the natural ordering of the objects, while in the general case of unconstrained HAC they are ordered by a permutation of the class labels that ensures that the successive mergers are neighbors in the dendrogram. The height of the node corresponding to the cluster created at merger t, h t , is often the value of the linkage. To distinguish the height of the dendrogram from the value of the linkage, we will denote by m t the value of the linkage at step t. Alternative choices for the values of (h t ) t are discussed in Section 4.4.
Dendrograms are used to obtain clusterings by horizontal cuts of the tree structure at a chosen height. A desirable property of a dendrogram is thus that the clusterings induced by such a cut corresponds to those defined by the HAC algorithm. This property is equivalent to the fact that the sequence of heights is non-decreasing. When this monotonicity property is not satisfied, a merging step t for which h t < h t−1 , is called a reversal.
Reversals can be of two types, depending on whether or not they correspond to a visible crossover between branches of the dendrogram. Mathematically, a crossover corresponds to the case when the height of a given merger G v 1 ∪ G v 2 is less than the height of G v 1 or the height of G v 2 . A toy example of reversal with crossover is shown in Figure 1, between nodes merged at steps 1 and 2, for the result of OCHAC.
The goal of this section is to study which settings and which definitions of height guarantee the absence of reversals -with and without crossovers.

Monotonicity, crossovers and ultrametricity
A crossover in a dendrogram automatically implies the non-monotonicity of the sequence of heights. The converse is true when the height of the dendrogram corresponds to the value of the linkage (or to a non-decreasing function of the linkage) for the corresponding merger, by virtue of Proposition 1 below.
Proposition 1. Consider a dendrogram whose sequence of heights (h t ) t is a nondecreasing transformation of the linkage values (m t ) t . Then the only reversals that can occur are crossovers.
The proof of Proposition 1 is not specific to Ward's linkage and is a simple consequence of the fact that the linkage is the objective function of the clustering: Proof of Proposition 1. Consider an arbitrary merger step of the HAC, characterized by the linkage value m t . If the next merger does not involve the newly created cluster, then this merger was already a candidate at step t. Then, by optimality of the linkage value at step t, this merger can not be a reversal. Therefore, any reversal must involve the newly created cluster, and is thus a crossover.
An important consequence of Proposition 1 is that when the height of the dendrogram is the corresponding linkage, the absence of crossovers is equivalent to the monotonicity of the sequence of heights.
We shall see in Section 4.4 that for an arbitrary height, the absence of crossover in the dendrogram is not necessarily equivalent to the monotonicity of the sequence of heights. The absence of crossover can be characterized by a mathematical property of the cophenetic distance associated to the heights of the dendrogram, called ultrametricity (see e.g., Rammal et al. (1986)). Formally, let us define, for all i, j ∈ {1, . . . , n}, the cophenetic distance h ij between i and j as the value of the height h t * such that t * is the first step (or the smallest merge number) such that the i-th and j-th objects are in the same cluster. h is said to satisfy the ultrametric inequality if: ∀i, j, k ∈ {1, ..., n}, h ij ≤ max{h ik , h kj }.
As announced, this property is key to ensure the monotonicity of the sequence of heights.
More precisely, Johnson (1967) has defined an explicit bijection between a hierarchy of clusterings with an associated sequence of non-decreasing "heights" (called "values" in the article) and matrix of values with a diagonal equal to zero and satisfying the ultrametric inequality. It turns out that this bijection explicitly defines the entries of the ultrametric matrix as the cophenetic distance of the dendrogram whose heights are the one of the associated hierarchy of clusterings. In other words, this means that a given sequence of heights defining a dendrogram is non-decreasing if and only if the cophenetic distance associated to this dendrogram (or equivalently to this sequence of heights) satisfies the ultrametric inequality.

Monotonicity of Ward's linkage
Ward's linkage corresponds to the variation of within-cluster inertia, so that the monotonicity of the linkage is ensured for Ward's standard HAC algorithm with Euclidean data. More generally, Batagelj (1981) gives necessary and sufficient conditions based only on the Lance-Williams coefficients that ensures monotonicity for a given linkage. These results apply to the extensions of HAC to non-Euclidean datasets and show that the monotonicity of the linkage values is always ensured for standard HAC with Ward's linkage. In addition, Ferligoj and Batagelj (1982) give necessary and sufficient conditions on the Lance-Williams coefficients to ensure the monotonicity of the linkage values in constrained HAC, for an arbitrary symmetric relational constraint. These conditions are not fulfilled for Ward's linkage. Therefore, monotonicity is not guaranteed for CCHAC with Ward's linkage, as also noted by Grimm (1987) for the specific case of OCHAC. It can be shown that even for Euclidean data, the contiguity constraint can induce non increasing linkage values for some steps of the algorithm, as illustrated by Figure 1.
More precisely, if we consider OCHAC, the following proposition establishes necessary and sufficient conditions on a dissimilarity d to observe a reversal at a given step of OCHAC when the height is defined by Ward's linkage: δ(G l , G r ) ≥ min gl l δ(Gl, G l ) + gl r δ(Gl, G r ) gl l + gl r , g lr δ(G l , Gr) + g rr δ(G r , Gr) g lr + g rr where we have used the notation g uv : The fact that Condition (11) involves clusters contiguous to the last merger is a consequence of Proposition 1. The formulation of Condition (11) is quite intuitive: crossovers correspond to situations in which the Ward linkage between two newly merged clusters is larger than a (weighted) average Ward linkage between each of these two clusters and one of the contiguous clusters. The proof of Proposition 2 is given in Appendix A.
Let us apply Proposition 2 to the specific case of the first and second mergers in the algorithm. Assuming that the optimal merger at step 1 is between the l-th and r-th objects, and recalling that the Ward linkage between two singletons is simply δ({u}, {v}) = d 2 uv /2, Condition (11) reduces to: 2d 2 l,r > min d 2 l,l + d 2 l,r , d 2 r,r + d 2 l,r In particular, given the p − 1 distances (d 2 i,i+1 ) 1≤i≤p−1 that determine the first step of the OCHAC algorithm, it is always possible to find an adversarial dissimilarity yielding a reversal at the second step, e.g., by choosing d l,r such that d 2 l,r < 2d 2 l,r − d 2 r,r . This is the case in the counter-example of Figure 1.
An example of relevant reversal for OCHAC. Because of the possible presence of crossovers in OCHAC even in a simple Euclidean setting, CCHAC may appear as a deteriorated version of standard HAC, where the optimal merger is chosen within a reduced set of possible mergers compared to the unconstrained version. One may then expect that the total within-cluster inertia at a given step of the algorithm is larger than for the unconstrained version that chooses the "optimal" merger at this step (that is, the merger with the smallest increase of the total within-inertia). In addition, the algorithm does not necessarily exhibit a clear and understandable monotonic evolution of the objective criterion, (m t ) t . However, it can be shown, even in a very simple example, that OCHAC can lead to better solutions in terms of within-cluser intertia, when the constraint is consistent to the spatial structure of the data. This fact is illustrated in Figure 2 3 . In this example, 7 data points are displayed in R 2 with an order constraint illustrated by a line linking two points allowed to be merged. In this situation, (m t ) t is indeed non monotonic for OCHAC (bottom left figure) but leads to a better total withincluster inertia for k = 3 clusters (vertical green line), which is also more relevant for the data configuration (top figures). This is a typical case where the constraint forces the algorithm to explore under-efficient configurations but that can be aggregated into a better solution, contrary to the unconstrained algorithm. This is explained by the fact that even the unconstrained algorithm is greedy, by construction, and thus not optimal compared to an exhaustive search of the best partition in k classes. Top left: Initial configuration with the order constraint represented by straight lines. Top right: Clustering with 3 clusters as produced by OCHAC (red) and standard HAC (blue). Bottom: Evolution of (m t ) t and of the total within-cluster inertia (also called, Error Sum of Squares: (ESS t ) t ) along the clustering processes, the green line correspond to the 3 components clustering.

Monotonicity of alternative heights
Since reversals can occur in CCHAC dendrograms with Ward's linkage, alternative definitions of the height have been proposed to improve the interpretability of the result in this case. They are defined as quantities related to the heterogeneity of the partition. In this section, we study the monotonicity of such alternative heights. Grimm (1987) presents three alternative heights to the standard variation of withincluster inertia (m t ): • the within-cluster (pseudo-)inertia (or Error Sum of Squares) that corresponds to the value of the objective function. In this case, the height at step t is given by: where P t+1 = {G t+1 u } u=1,...,n−t is the partition obtained at step t of the algorithm. This alternative height is very natural (and the one implemented in the R package rioja for OCHAC) since it corresponds to the criterion whose minimization is approximated by HAC (and OCHAC) in a greedy way; • the (pseudo-)inertia of the current merger, which is defined as: where G t u and G t v are the two clusters merged at step t. Grimm (1987) remarks that this measure is very sensitive to the cluster size |G t u | + |G t v |.
• the average (pseudo-)inertia of the current merger, that has been designed so as to avoid the bias related to the cluster size in I t . It is defined as: Standard HAC: Known properties of alternative heights. Note that ESS t = t ≤t m t . As explained in Section 4, (m t ) t is monotonic for standard HAC, both for Eu-clidean and non-Euclidean data. Since m 0 = 0 by definition, this ensures the monotonicity of (ESS t ) t , for Euclidean and non-Euclidean data in the case of standard HAC.
On the contrary, I t and I t may induce reversals even for standard HAC and Euclidean data. More importantly, contrary to the case when the height of the dendrogram is m t , even when the ultrametric property is satisfied, the monotonicity is not ensured for these criteria. This is illustrated in Figure 3 (and in Figure 11 of the Appendix C), for I t (and forĪ t , respectively) and data in R 2 .
In this case, the dendrogram has a conventional look but the mergers are not ordered by increasing heights. For instance, in Figure 3, the cluster merged at step 2 is above the one at step 3. Hence, cutting the dendrogram at height h = 2.5 leads to a clustering into Euclidean case for OCHAC, and show that there is no guarantee for monotonicity in the case of general CCHAC. The fact that (Ī t ) t is not necessarily monotonous for OCHAC has already been mentioned by Grimm (1987).
CCHAC: Within-cluster pseudo-inertia for dissimilarity data. The only unanswered case is whether (ESS t ) t is monotonic or not for CCHAC and non-Euclidean data.
We provide a counter-example that proves that the monotonicity is not ensured in this case: Figure 4 shows that the dendrogram obtained from OCHAC on a given non-Euclidean dissimilarity D contains a crossover (m 4 < m 3 ). In particular, the associated sequence of heights is not monotonic. However, Proposition 1 ensures that (ESS t ) t has the nice property that the absence of crossovers is equivalent to its monotonicity. Indeed, as Top left: Configuration of the objects in R 2 . Top right: Coordinates of the objects and Euclidean distance matrix corresponding to this configuration. Bottom left: Representation of the values of the dissimilarity (dark colors correspond to larger values, so to distant objects). Bottom right: dendrogram obtained from standard HAC. Only the first 3 merges of the dendrogram is represented to ensure a comprehensive view of the sequence of heights. and ESS t is equal to the addition of ESS t−1 . As, by definition, ESS t−1 is, as any I(G t−1 u ), positive, this ensures that this mapping is non-decreasing.    HAC can be seen as a greedy algorithm to solve the problem of finding the partition with minimal within-cluster inertia ESS t of n objects into n − t classes, for each t = 1 . . . n − 1.
It may be expected that the inertia of the partitions will be lower for HAC than OCHAC, since the possible mergers in OCHAC are chosen among a subset of the possible mergers in HAC. Can we quantify the impact of the order constraint on the quality of the partitions (as measured by ESS) obtained for HAC and OCHAC, depending on the strength of the actual order structure in the data? In this section, we address this question by analyzing Hi-C data (Dixon et al., 2012), which present a strong order structure, as illustrated by Figure 5. We use a perturbation process to progressively break the consistency between the data structure and the constraint imposed in OCHAC.

Data and method
Hi-C studies aim at characterizing proximity relationships in the 3D structure of a genome, by measuring the frequency of physical interaction between pairs of genomic locations via sequencing experiments. Formally, a Hi-C map is a symmetric matrix S = (s ij ) i,j in which each entry s ij is equal to the frequency of interaction between genomic loci i and j. Here, a locus is a fixed-size interval of genomic positions, also called a "bin". Hi-C maps are classically represented by the upper triangular part of the matrix, as shown in Figure 5. The matrix has a strong diagonal structure that reflects the linear order of DNA within chromosomes (loci that are close along the genome are more frequently interacting than distant loci). An important question in Hi-C studies is to identify Topologically Associating Domains (TADs), which are self-interacting genomic regions appearing to be more compact than the rest of the genome. Indeed, TADs have been shown to play an important role in gene regulation (Dixon et al., 2012). A number of TAD detection methods have been proposed (see e.g., Zufferey et al. (2018) for a review) and some are based on HAC or OCHAC (Fraser et al., 2015;Haddad et al., 2017;Ambroise et al., 2019). This is both natural, since Hi-C maps can be seen as similarity matrices, and formally justified, as explained in Section 3.3. In practice, Hi-C maps are indeed non-positive, and downloaded Hi-C matrix contains 4,864 bins. It has been obtained with a bin size of 40kb and normalized using ICE (Imakaev et al., 2012). We further performed a logtransformation of the entries to reduce the distribution skewness prior clustering.
In order to assess the influence of the data structure on the quality of the partitions obtained by OCHAC and standard HAC algorithms, we have used a perturbation process to progressively remove the strong diagonal in the original Hi-C map. The perturbation consists in swapping two entries, s ij and s i j of the matrix, in which (i, j) and (i , j ) have Figure 6: Illustration of the perturbation process. From left to right : example of Hi-C maps corresponding to increasing perturbation levels.
been randomly sampled with uniform probability among the pairs {(u, v), (u , v )} for which u ≤ v, u ≤ v and s uv + s u v > 0, where the last condition avoids swapping entries that are both zero. The proportion of such swapped pairs, which we call perturbation level, varied from 0% up to 90% ( Figure 6).
This process was repeated 50 times to allow assessing the variability. Since obtained matrices are not necessarily positive definite, we translate their diagonal by a small quantity that ensures the positivity of all d 2 ij = s ii + s jj − 2s ij as described in Section 3.2. All simulations were performed with R. The results for standard HAC were computed with the function hclust (from the stats package) and those for OCHAC were computed with the function adjClust (from the adjclust package). Figures were obtained using adjClust or ggplot2 (Wickham, 2016).

Comparison of standard HAC and OCHAC results
In this section, the results of standard HAC and OCHAC are compared through the corresponding height sequences of the dendrograms, through dendrograms themselves and through clusterings obtained by horizontal cuts of the dendrograms. While dendrograms and height sequences are a direct output of the HAC process, clusterings are obtained using a model selection strategy. We have considered two such strategies: the broken stick (Bennett, 1996), as implemented in adjclust, and the slope heuristic (Arlot et al., 2016), as implemented in capushe. The idea of the broken stick heuristic is to test the reduction of within-cluster inertia along the clusterings sequence considered backward (starting by the clustering consisting in the whole set of objects) against the reduction obtained for a model in which within-cluster inertia is divided with uniform probability in the corresponding number of components. On the other hand, the slope heuristic assumes the existence of a true clustering which is detected by a change in the slope of within-cluster inertia along the clustering sequence.
For both strategies, the "best" clustering is defined based on the within-cluster inertia of the sequence of clusterings obtained by the hierarchical process. As both strategies gave similar results, we chose to report only the results obtained for the broken stick heuristic here. For each Hi-C map of the simulation and for both methods of hierarchical clustering, clustering comparisons will be based on the clusterings selected by the broken stick heuristic.
Height sequences. Figure 7 shows the evolution of m t (normalized by its maximal value among both methods at a given permutation level) and ESS t (normalized by the total inertia of the set of bins) along the two clustering processes for increasing perturbation levels. For the original dataset, which presents an organization strongly consistent with the order constraint, the heights of standard HAC and OCHAC are very similar. However, interestingly, OCHAC improves the objective criteria (ESS t and m t ) for low perturbation levels (15%-30%) across a wide range of merging levels.
More specifically, we compared the heights obtained for HAC and OCHAC at the merger number selected by the broken stick heuristic (Bennett (1996); vertical lines in Figure 7). At these numbers of clusters or in their close neighborhood, ESS t is always smaller for OCHAC, which we interpret as more homogeneous clusterings for OCHAC than for HAC. The magnitude of the improvement achieved by OCHAC with respect to HAC depends on the perturbation level: for the original data, it is close to 5%, whereas it is much larger (25-30%) when the perturbation level is 15%-30%. It then decreases again (< 20%) for larger perturbation levels (60%).
The fact that OCHAC can achieve lower values than HAC for ESS t and m t may be counter-intuitive, since -as explained at the beginning of Section 5-possible mergers in OCHAC are chosen among only a subset of the possible mergers in standard HAC. In fact, HAC itself is a heuristic for the minimization of ESS t , because of its hierarchical agglomerative nature; in contrast, the optimal clustering at step t in the sense of ESS t may not necessary be obtained by merging two clusters of the optimal clustering at step t − 1. This result illustrates the robustness to noise of the constrained approach, which is very interesting in practice: in Hi-C experiments, for instance, many biases (genomic, experimental, etc.) are encountered. Thus, OCHAC has to be preferred in such contexts and will additionally result in a lower computational cost. The benefit of using a relevant constraint had already been observed by Steinley and Hubert (2008): their simulations proved that a relevant order constraint (in their case, obtained from the data) could improve the recovery of the true cluster structure (although possibly at the cost of a slight decrease in ESS t compared to the unconstrained version). For perturbation levels larger than 60%, the data structure is no more compatible with the constraint (see Figure 6) and standard HAC seems to perform globally better than OCHAC, as expected. In addition, in this extreme situation, OCHAC exibits very large reversals for m t (seen with the grey shadow in Figure 7), that are due to sudden breaks in the quality of the clusterings, induced by the constraint. The presence of such large reversals is a practical and visible indication that the constraint is not relevant for the data and that OCHAC should not be used.
Dendrograms and clusterings. The same type of conclusion can be drawn when comparing not just the heights of the dendrograms but the dendrograms themselves or the clusterings induced by these dendrograms. Figure 8 shows the distribution of a measure of similarity between the order of fusion in the dendrogram. More precisely, the cophenetic distances have been computed for all pairs of objects in the dendrograms induced by standard HAC and OCHAC at different levels of perturbation and the Spearman correlation betwen these two vectors of cophenetic distances (coming from the constrained and the unconstrained version of the algorithm) has been obtained. As the perturbation level in-  creases, the Spearman correlation linearly decreases from a value close to 1 (implying very similar dendrograms) to a value close to 0 (implying completely different dendrograms).
Finally, we compared the clusterings obtained by the broken stick heuristic (Bennett, 1996) as follows. For larger perturbation levels (more than 60%) of permuted coefficients, we obtained a trivial clustering with only one cluster, a strong indication that the cluster structure had disappeared at these levels. For lower perturbation levels, the obtained clusterings were compared using the Normalized Mutual Information (NMI, Danon et al. (2005)). As for the Spearman correlation, the NMI values obtained for the original data and low levels of perturbations (up to 30%) are very close to 1, which shows a strong similarity of the induced clusterings. As the perturbation level increases, the obtained partitions became more and more different, with NMI values below 0.6 (results not shown).

Reversals for the different heights
In this section, we investigate the reversals obtained for different heights and for standard HAC and OCHAC. Figure 9 gives the evolution of the percentage of reversals (relative to the total number of simulations, 50), for standard HAC and OCHAC and for the different types of heights, along the hierarchical clustering process.  Figure 9: Evolution of the number of reversals for ESS t , m t , I t andĪ t for standard HAC (top) and OCHAC (bottom) for increasing levels of perturbation of the original Hi-C matrix. The background shadow is the actual value and the strong line is a smoothed value (box kernel, bandwith equal to 50). The dotted vertical line corresponds to the average number of clusters chosen by the broken stick heuristic.
As expected from Section 3 (Table 1), (ESS t ) t does not have reversals and (m t ) t only has reversals for OCHAC. When the perturbation level increases, the evolution of the number of reversals in (I t ) t and (Ī t ) t is markedly different from that of (m t ) t . For the smallest perturbation levels (up to 30%), the number of reversals of (m t ) t is close to 0, while it ranges from 10 to 50% for (I t ) t and (Ī t ) t . At these perturbation levels, (m t ) t almost never has a reversal at a merger number that corresponds to the number of clusters chosen by the broken stick heuristic: most reversals are concentrated at a merger number smaller than the merger chosen by the broken stick heuristic. Actually, for small perturbation levels, these reversals in m t values help improve the quality of further clusterings by choosing a solution that is less efficient than that of standard HAC but more consistent with the data (as already discussed in the example of Figure 2).
Hence, when the data structure is consistent with the constraint, (m t ) t typically provides an interpretable dendrogram. This nice property is, of course, lost when the constraint is no more consistent with the data structure (above a perturbation level of 60%), which is explained by the fact that the OCHAC has a poor performance in that context, as already discussed in the previous section.
On the contrary, (I t ) t and (Ī t ) t exhibit larger numbers of reversals. This is particularly the case for the last mergers, even for small levels of perturbation and even in the unconstrained case: 40-60% of the simulations have reversals for both OCHAC and standard HAC at a number of clusters corresponding to the selected clustering. We also observe that the percentage of simulations showing a reversal for standard HAC tends to decrease when the perturbation level in the data increases for the first steps of the hierarchical process (the same can be observed, to a much lesser extent, for OCHAC).
This phenomenon is explained below. Figure 10 displays the evolution of the merged cluster size thorough the hierarchical clustering and provides an explanation for this fact. For standard HAC, the number of clusters with a size equal to 2 during the first steps of the algorithm is strongly increasing when the perturbation level increases. For a permutation level of 90%, most of the mergers have a size equal to 2 during half of the clustering process (for fusion numbers ranging from Figure 10: Evolution of the average cardinal of mergers along the hierarchical clustering process for standard HAC (top) and OCHAC (bottom), and different levels of perturbation. Note that the average is computed over the 50 simulations, whereas the original data correspond to a unique value. Data are shown only for the first 4,000 mergers and for a cardinal smaller than 8 for the sake of readability. The two dotted vertical lines correspond, respectively to n/2 and 3n/4. 1 to at least 2,000). However, for clusters with a size equal to 2, I t is equal to m t which explains the similiarities between m t and I t curves during the first steps of the clustering process, as the perturbation level increases. Since (m t ) t is increasing for standard HAC, this explains why I t has less reversals in standard HAC for the first merger numbers when the perturbation level is higher. The same holds forĪ t up to a fixed size factor of 2.

Conclusion
In this article, we have studied the applicability of HAC and its constrained version to a wide range of input data. In particular, we have shown that these applications are justified beyond the Euclidean framework. We have also shown that the monotonicity of the sequence of heights is not always ensured, although this property is necessary for the sequence of clusterings obtained by cutting dendrograms to be consistent with the sequence of clusterings of the algorithm. We have clarified which heights have this property depending on the input data types and for the constrained and unconstrained HAC. We have also pinpointed an important distinction between this monotonicity and the existence of crossovers.
These results imply that the variance of the merged cluster, I t , or the average variance of the merged cluster,Ī t , are never ensured to be monotonic, and should thus not be chosen to represent the dendrogram heights. Strikingly, we have also shown that the constrained version of the HAC can provide more relevant and efficient solutions than its unconstrained versions, not only in terms of algorithmic complexity, but also in terms of the values of the objective function ESS t . In such cases, a small number of reversals can actually be beneficial to explore intermediate solutions closer to the data and that lead to more relevant clusters.

A Proof of Proposition 2
Proof of Proposition 2. We begin by noting that by Proposition 1, the only reversals that may occur are crossovers. With the notation of Proposition 2, a crossover at step t + 1 corresponds to the situation where δ(G l , G r ) ≥ δ(G l ∪ G r , Gr) or δ(G l , G r ) ≥ δ(G l ∪ G r , Gl).
By symmetry we focus on the first case. With the notation of Proposition 2, and using the Lance-Willams formula (4), the first condition is equivalent to δ(G l , G r ) ≥ g lr δ(G l , Gr) + g rr δ(G r , Gr) g lr + g rr while the second one is equivalent to δ(G l , G r ) ≥ gl l δ(Gl, G l ) + gl r δ(Gl, G r ) gl l + gl r hence the result.

B Step-by-step description of the counter-examples
In the following tables, red color is used to signal reversals. Green color in details of     C Counter-example of the monotonicity ofĪ t for standard HAC in the Euclidean case x 1 x 2 x 3 x 4 x 5