Stochastic self-organizing map variants with the R package SOMbrero

Self-Organizing Maps (SOM) [ ] are a popular clustering and visualization algorithm. Several implementations of the SOM algorithm exist in different mathematical/statistical softwares, the main one being probably the SOM Toolbox [2]. In this presentation, we will introduce an R package, SOMbrero, which implements several variants of the stochastic SOM algorithm. The package includes several diagnosis tools and graphics for interpretation of the results and is provided with a complete documentation and examples.


INTRODUCTION
Self-Organizing Maps (SOM), introduced by Teuvo Kohonen [1], [3], [4], are a popular clustering and visualization algorithm. The success of the method is due to its very simple definition as well as its ability to perform, in a single analysis, clustering and visualization by projection of the data onto a small dimensional grid. This is a powerful and easy-to-use method for the exploration of multidimensional data. Originally, SOM has been inspired by neuro-biological learning paradigms which were intended to model some sensory or cognitive processes in which the learning is directed by the experience and the external inputs without supervision. Later, the method has been used in a wide variety of areas ( [2] reports more than 7,000 scientific publications related to SOM with applications in industrial, biomedical and financial analyses, for instance). Several implementations of the SOM algorithm exist in different mathematical/statistical softwares, most on them usable only for numeric data. The main one is probably the SOM Toolbox 1 , which is released under GPL2 licence and runs in Matlab. The toolbox contains many different functions to learn and interpret SOM and to handle various types of data, including datasets with missing data, symbol strings datasets and relational datasets (i.e., datasets described by a dissimilarity, [5]). It also includes implementations of Generative Topographic Mapping [6] and of LVQ and metric learning for LVQ [7], [8]. The toolbox is fully documented and illustrated in the book [2].
Since the free statistical software R [9], is one of the most (if not the most) widely used program in the statistical and bioinformatics communities, we have implemented a full R package providing various versions of the SOM algorithm and many functions for diagnosis and interpretation. SOMbrero is inspired by Patrick Letremy's SAS programs 2 . Prior SOMbrero 3 , several R packages were proposing SOM implementations (e.g., class, som, popsom, kohonen, ... see [10] for a more precise description of these packages). However, as far as we can tell, SOMbrero is the most complete one, with user-friendly features and implementations of variants of the original algorithm for relational data and contingency tables, based on the stochastic version of the method. Current version of the package is 1.2 (2016/09/02) and the package runs on R (≥ 3.0.1) with dependencies to the packages wordcloud, scatterplot3d, RColorBrewer, shiny, grDevices, graphics and stats.
The next sections describe the different features of the package. More precisely Section 2 presents the general organization of the package and its standard use for numeric datasets. Section 3 presents applications of SOM to explore contingency tables and Section 4 describes a variant for dissimilarity datasets.

SOM FOR NUMERIC DATASETS AND GENERAL ORGANIZATION OF SOMbrero
This section provides an overview of the package main features. It fully describes functions that can be used for the standard SOM (that processes numeric datasets). However, most of what is described in this section is not restricted to numeric datasets and also exist for the other two variants of the algorithm that are described in Sections 3 and 4.

Training stochastic SOM for numeric datasets
SOMbrero implements the stochastic (also called "on-line") version of the SOM algorithm. More precisely, the dataset 2. current version 9.1.3, last updated in December 2005, no longer maintained, available at http://samos.univ-paris1.fr/ Programmes-bases-sur-l-algorithme 3. https://CRAN.R-project.org/package=SOMbrero X = (x ij ) i=1,...,n,j=1,...,d , made of n observations taking values in R d , is mapped into a low dimensional grid composed of U units. Conversely, every unit u is associated with a prototype p u ∈ R d . The grid induces a natural distance d on the map which provides a measure of dissimilarity between every pair of units, d(u, u ). The stochastic version of the algorithm processes the observations one by one, with the iterative application of two steps: • an assignment step where one observation (on-line version) is classified into the unit with the closest prototypes (according to R d Euclidean distance); • a representation step where all prototypes are updated according to the new assignment. For the on-line version of the algorithm, this step is performed by mimicking a stochastic gradient descent scheme.
The method is fully described in Algorithm 1, in which H t is the neighborhood function that satisfies H t : R + → R + , H t (0) = 1 and lim z→+∞ H t (z) = 0, and µ(t) is a training parameter. Generally, H t and µ(t) are supposed to be decreasing with the iteration number, t. Assignment step: find the unit of the prototype closest to where 1 i is a vector with a single non null coefficient at the ith position, equal to one. 6: end for 7: return Clustering: (f T (x i )) i=1,...,n and prototypes: (p T u ) u=1,...,U .
In SOMbrero, the following options can be passed to the function trainSOM that trains a SOM: • the distance between units on the grid, d, is Euclidean (default) or any type of distance available in the R function dist or also a relationship denoted by letremy which corresponds to the original implementation by Patrick Letremy and switches between Euclidean distance and the distance "maximum" which is the maximum between the distance on the first and the second coordinates; • the function H t , also called neighborhood relationship, can be chosen as Gaussian (default), H t (z) = e −z 2 /r(t) in which r(t) is decreasing with t, or can be of type letremy: H t (z) = 0 if z ≤ r(t) 1 otherwise , with r(t) decreasing during the training.
• prototypes can be initialized randomly as described in step 1 of Algorithm 1 or to one observation of the dataset, randomly chosen, each or by positioning them regularly on the first to PC of a PCA performed on X or finally, to values specified by the user; • X can be preprocessed by centering and scaling to unit variance or not preprocessed at all. In the case of a preprocessing of the data, the prototypes are returned in the original scale of X; • the assignment step can be performed in a standard way, as described in step 4 of Algorithm 1, or using the Heskes's modified assignment step [11]: • the number of iterations T can be set by the user. Default values is equal to 5n; • finally, the user can chose to save intermediate states of the training, which includes saving the prototypes, the clustering for all observations and the value of

Plots and diagnosis functions
SOMbrero has been conceived to help the user interpret the output of the algorithm. More precisely, two standard quality measures [12] are given with the function quality that computes: • the topographic error, which is the average frequency (over all observations) with which the prototype that comes second closest to an observation is not in the direct neighborhood (on the grid) of the winner prototype. It is a real number between 0 and 1, a value close to 0 indicating good quality; • the quantization error, computed as: Moreover, SOMbrero comes with many different plots, that can be used to represent the original dataset, the prototypes or an additional variable related to the observations, on the grid. 16 types of plots are available with the single function plot and two options what (what to plot? observations, prototypes or additional information) and type (which type of graph to plot?). Some of these plots display distances between prototypes on the grid and can be used as diagnosis plots. An exhaustive description of available graphics is provided in [10] and examples of such plots, obtained on the famous iris dataset [13], are given in Figure 1.
Finally, super-clusters can be obtained by using the function superClass on the output of the trainSOM function. This method performs hierarchical clustering on the prototypes of the map to cluster the units in "super-clusters".
These super-clusters can also been visualized on the map in various ways.

Documentation and graphical user interface
Additionally to the standard R documentation, SOMbrero also provides five vignettes (documentation files accessible from within the package and at https://cran.r-project.org/ web/packages/SOMbrero/). The vignettes illustrate the functions with three datasets that correspond to the three variants implemented in the package. The first dataset is the iris dataset [13]. The second dataset (KORRESP variant, see Section 3) is the contingency table of votes for the combinations of "Départements" and "Candidates" during the French presidential election of 2002 4 . It provided in the package under the name presidentielles2002. The last dataset (relational variant, see Section 4) is the shortest path length between all nodes in the co-occurrence network obtained from the French novel "Les misérables" (V. Hugo) [15]. This dataset (co-occurrence network and dissimilarity matrix) is provided in the package under the name lesmis. For users who are not familiar with R command lines, SOMbrero contains a graphical user interface which has 4. Source: http://www.interieur.gouv.fr/Elections/Les-resultats/ Presidentielles been programmed with the R package shiny [16] (see Figure 2). The GUI is launched from within R with the function sombreroGUI() and opens in default web browser.

SOM FOR CONTINGENCY TABLES
Another very standard type of data that can be handled by SOMbrero is the case of contingency tables in which the dataset T = (n ij ) i=1,...,p,j=1,...,q is composed of joint frequencies for a pair of categorical variables (p is the number of levels for the first categorical variable and q the number of levels for the second). Classical correspondence analysis performs a weighted principal components analysis, using the χ 2 distance simultaneously on the row profiles and on the column profiles. The same principle is used in the so-called "KORRESP" algorithm, [17], [18], which extends SOM to contingency tables and is implemented in SOMbrero.
Prototypes are defined in R p+q . The algorithm then alternatively processes a row and a column (randomly chosen during the stochastic training process). For the chosen row (or column), • the assignment step uses the profile (x i ) i=1,...,p (in R q for rows) or (x i ) i=p+1,...,p+q (in R p for columns). For a row i ∈ {1, . . . , p}, this gives: • the representation step uses the augmented profile (in R p+q ). For the chosen row i, this gives: Assignment step: find the unit of the prototype closest to x i : Representation step: ∀ u = 1, . . . , U , where 1 i is a vector with a single non null coefficient at the ith position, equal to one. 8:

9:
Start Process columns 10: Randomly choose an input i ∈ {p + 1, . . . , p + q} 11: Assignment step: find the unit of the prototype closest to x i :

12:
Representation step: ∀ u = 1, . . . , U , where 1 i is a vector with a single non null coefficient at the ith position, equal to one.
This method can be used in SOMbrero with trainSOM(..., type= "korresp") and provides a map in which levels for both variables of the contingency tables are clustered simultaneously. This approach is analogous to the common representation provided for contingency tables in correspondence analysis.

Relational SOM
In the case where the observations (x i ) i=1,...,n take values in an arbitrary input space G, a widely used representation of the data is to compute a measures of dissemblance or resemblance between pairs of observations. Well known examples of such frameworks include shortest path lengths between pairs of nodes in a graph [19], string edit distance between categorical time series [20] or DNA samples [21] or various kernels used in many application fields, including biology [22]. In this section, we will consider that the data are described by a dissimilarity measure ∆ = (δ ij ) i,j=1,...,n , which is such that δ ij = δ(x i , x j ) is the measure of dissemblance between x i and x j , is symmetric and null on the diagonal. Note that a natural Euclidean structure is not necessarily associated with this dissimilarity measure.
Several extensions of the SOM algorithm have been proposed in this context, including "median SOM" [23]- [25] and kernel SOM [26]- [28]. SOMbrero implements the relational variant of SOM [5], [21] which relies to a pseudo-Euclidean framework. More precisely, [29] shows that, given δ as described above, we can find two Euclidean spaces, E + and E − , and two mappings φ + : G → E + and φ − : , that take values in the orthogonal direct sum of the two Euclidean spaces implicitely defined by the dissimilarity, the prototypes can thus be defined as p u = n i=1 β ui φ(x i ) with β ui ≥ 0 and n i=1 β ui = 1. Using the standard operations in the pseudo-Euclidean space induced by the similarity, the representation and assignment steps of the SOM algorithm can be rewritten as described in Algorithm 3 (justifications are given in [21]). Assignment step: find the unit of the closest prototype Representation step: ∀ u = 1, . . . , U , where 1 i is a vector with a single non null coefficient at the ith position, equal to one. 6: end for If the dissimilarity matrix is a Euclidean distance, then the relational SOM is exactly identical to the standard numerical SOM as long as the prototypes of the original SOM are initialized in the convex hull of the original data (i.e., the initial prototypes can be written p 0 the relational SOM is identical to kernel SOM as described in [27], [30], [31] for a dissimilarity defined from a kernel K by

Acceleration of relational SOM
However, the complexity of the assignment and representation steps of stochastic relational SOM are, respectively, O(n 2 U ) and O(nU ), as pointed in [32]. This leads to a total complexity of O(n 2 U ) for one iteration. To obtain good convergence properties, the algorithm requires a number of iterations of the order of αn, as shown in [33], yielding a complexity of O(αn 3 U ). Hence, relational SOM is not adapted to large datasets and cannot be used to analyze more than a few thousands observations. [34] have proposed two approximate versions to overcome this issue, using sparse representations of the prototypes or DR preprocessing techniques. However, the version implemented in SOMbrero uses the solution proposed in [35] which yields to a complexity of O(αn 2 U ).
More precisely, the method is based on a re-formulation of the assignment step (step 4 in Algorithm 3): is a vector of size U and The updates of A t and B t are performed during the representation step, which is thus equivalent to ∀u = 1, . . . , U, β t u = (1−λ u (t))β t−1 u +λ u (t)1 i , in which λ u (t) = µ(t)H t (d(f t (x i ), u)). This leads to the following updates: ui . Provided that the assignment and representation steps are usually performed O(αn) times, the total complexity of the algorithm is dominated by O(αn 2 U ). This computational cost is obtained using the additional storage of A t and B t which requires an additional memory of O(U ) and O(nU ), respectively. This solution is implemented in SOMbrero since version 1.2, which considerably reduced the computational time required by the relational version, as shown in [35].

Special features for graphs
As already mentionned, graphs are a special case of relational data: if (x i ) i are the nodes of a given graphs, several types of similarities/dissimilarities can be used to describe ressemblance between those nodes. Widely used examples include shortest path lengths or kernels based on the Laplacian of the graph [36] that include, among others, the commute time kernel, K CT = L + [37] or the heat kernel, K H = e −βL (β > 0), [38].
SOMbrero includes additional functions for this type of data. When the clustered entities are nodes of a graph G, a projection of the graph onto the map can be obtained with the function projectIGraph. This projected graph has a number of nodes equal to the number of (nonempty) units in the map and edges that connect pairs of vertices that contain each at least one node connected by an edge in G. The output of this function is given as an igraph 5 that can provide a simplified representation of the graph as in Figure 3 (this figure provides a simplified representation of the graph lesmis described in Section 2.3 after a superclustering has been applied to the result of the relational SOM algorithm).