A Review on Dimensionality Reduction for Multi-Label Classification

Multi-label classification has gained importance in the last decade and is now confronted with the need to process massive raw data from heterogeneous sources. Dimensionality reduction, which aims at reducing the number of features, labels, or both, is therefore enjoying renewed interest as a way to enhance the scaling properties of classifiers and their predictive performance. In this paper we review more than fifty papers presenting dimensionality reduction approaches for multi-label classification and we propose an analysis in three steps: (i) a typology of the methods describing the main components of their strategies, the problem they tackle and the way they solve it, (ii) a unified formalization of the problems to help distinguish the similarities and differences between the approaches, and (iii) a meta-analysis of the published experimental results, inspired by consensus theory, to identify the most efficient algorithms.


INTRODUCTION
THE most popular classification paradigms are single-label classification and multi-class classification. For the first one, the objective is to decide, for each instance described by its features, whether it is associated with a given label or not. The second one is a generalization which aims at associating each instance with one label among several. However, in many real-world applications (e.g., sound analysis [1], [2], computer vision [3], [4], text analysis [5], [6], biology and health [7], [8], recommender systems [9], [10]), items are intrinsically describable with multiple labels. For instance, in a Video on Demand catalog, a movie is described by a set of complementary labels (e.g., Funny, Masterpiece, Based on novel, Futuristic) which are used by a recommender system to provide users with movies that are relevant to their preferences. Consequently, multi-label classification, which associates each instance with multiple labels, has received great attention in recent years. Since the pioneering works of Boutell et al. [11], Zhang et al. [12] and Tsoumakas et al. [13], several reviews have been published [13], [14], [15], [16], [17], [18]. They group the algorithms into three main families: (i) the problem transformation methods, which transform the multi-label problem into one or several single-label classification or regression problems, (ii) the algorithm adaptation methods, which adapt existing algorithms to learn from multi-label data, and (iii) the ensemble methods, which deduce multi-label predictions from a collection of learners.
This effervescence in research has led to a significant improvement in result quality for benchmarks routinely used in the literature. But it has also coincided with an explosion of data dimensionality. In particular, the expansion of online labeling services now generates massive raw data of varying quality. This scaling evolution has recently led to the emergence of the so-called eXtreme Multi-label Learning community, which considers problems in which the number of labels is extremely large (in the order of 10^6 and more) [19], [20], [21]. This increasing complexity entails a renewed interest in dimensionality reduction approaches, which aim at reducing the number of features, labels, or both in order to improve the scaling properties of the classifiers and their predictive performance.
Dimensionality reduction has a long history in data science [22], [23], associated with different motivations such as, in particular, data visualization and interpretation [24], data compression [25] and data denoising [26]. In short, applying dimensionality reduction to raw data offers a synthesized representation which highlights links and structures hidden in the mass of data and guides learning algorithms [27], [28]. As a promising lever for dealing with large and noisy data, dimensionality reduction in multi-label classification has been the subject of a large number of publications over the last decade, resulting in various method developments. However, to the best of our knowledge, only one state-of-the-art survey has been published, five years ago [29], and it neither explores the wide range of existing approaches nor provides a global framework to compare them.
For the study presented in this paper, we have gathered more than fifty papers to provide a macroscopic view of the dimensionality reduction strategies developed for multi-label classification and to help users select the most efficient ones. Let us note that we do not consider variable selection methods (see [30] for a recent review), which are efficient at changing the relative importance of variables but are not designed to extract semantic links between variables, as pointed out by several authors [31], [32]. Here we go beyond a classical state-of-the-art survey, often based on an organized list of existing works, by structuring our analysis of the literature along three complementary objectives: (1) a typology of the different approaches, (2) a unified formalization of the problems, and (3) a meta-analysis of the published experimental results. The typology is built from the main components which determine the nature of the problem and the way to solve it: (i) the choice of the reduced space (feature space, label space or both), (ii) the independence/dependence between the dimensionality reduction objective and the classification objective, (iii) the characteristics of the transformations which reduce the initial spaces, and (iv) the regularization functions and sets of constraints which improve the problem solving process. To help distinguish the similarities and differences between the approaches with more precision, we introduce two generic formulations which cover the large majority of the problems encountered in the literature. We complete this thorough review of the problem ingredients with a meta-analysis of the experimental comparisons carried out in the papers, inspired by consensus theory [33], [34].
For each selected evaluation measure, the published pairwise comparisons (algorithm A_i is better than algorithm A_j at a statistical significance level α) are represented by a multigraph where the vertices are the algorithms and the directed edges represent the domination relationships extracted from the published experimental results. The analysis of the multigraphs makes it possible to identify communities, which are families of algorithms that have mostly been examined separately in the literature. Moreover, in each community, the approaches which outperform the others are highlighted.
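The meta-analysis step described above can be sketched in a few lines. The algorithm names and comparison results below are made up for illustration; the score is a simple Copeland-style net-domination count, a common consensus-theory device, not necessarily the exact aggregation used in the paper.

```python
# Hypothetical sketch of the meta-analysis: published pairwise
# comparisons ("A beats B for a fixed evaluation measure") are the
# directed edges of a multigraph; each algorithm is then scored by
# its net number of dominations (Copeland-style consensus score).
from collections import defaultdict

# Each tuple (winner, loser) is one published, statistically
# significant comparison. These results are invented for the example.
comparisons = [
    ("CPLST", "PLST"), ("CPLST", "CS"),
    ("PLST", "CS"), ("FaIE", "PLST"), ("FaIE", "CPLST"),
]

def consensus_scores(edges):
    """Net dominations: wins minus losses over all reported duels."""
    score = defaultdict(int)
    for winner, loser in edges:
        score[winner] += 1
        score[loser] -= 1
    return dict(score)

scores = consensus_scores(comparisons)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # FaIE first: it dominates others and is never dominated
```

Community detection on the same multigraph (vertices sharing many edges) then isolates families of algorithms that were compared with each other in the literature.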

TYPOLOGY OF MULTI-LABEL DIMENSIONALITY REDUCTION METHODS
Throughout the paper we consider a dataset with N instances described by a set of n_x features and labeled by a set of n_y labels. We denote by X (resp. Y) the N × n_x (resp. N × n_y) matrix describing the features (resp. the labels). As usually done in the literature, X (resp. Y) also refers to the feature (resp. label) space when there is no ambiguity. The objective of multi-label classification is to predict the right label vector y ∈ R^{n_y} for any feature vector x ∈ R^{n_x}. During a training phase, given the feature matrix X, a classifier is adjusted to fit its prediction to the label matrix Y. The vast majority of multi-label classification approaches based on dimensionality reduction follow a two-step process: (1) reduction of X or Y or both, (2) prediction of the labels from the reduced spaces with a classifier. The dimensionality reduction is very often applied as an independent data pre-processing step before prediction, but recent research stimulates exploration of the coupling between reduction and classification [35]. Whatever the strategy, the impact of the reduction on the classifier performance is in fine evaluated by the quality of the label prediction, for which numerous measures have been proposed in the literature (e.g., Hamming Loss, F1) [14], [17]. Consequently, three ingredients are considered in the dimensionality reduction problem: the objective function f_d for the dimensionality reduction, which is independent from or dependent on the classifier, the objective function f_c associated with the classifier, and the final prediction quality measure m_q. Finally, the choice of the reduced space closely determines the nature of the problem and the way to solve it.
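The two-step process above can be sketched with standard tooling. This is a minimal illustration, not a method from the review: the reduction is a plain label-independent PCA, the classifier a one-vs-rest logistic regression, and the data, sizes and measure (Hamming loss) are arbitrary choices.

```python
# Minimal sketch of the common two-step process: (1) reduce the
# feature space X, (2) train a multi-label classifier on the reduced
# features; quality is measured on the label predictions.
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import hamming_loss

X, Y = make_multilabel_classification(n_samples=300, n_features=40,
                                      n_classes=8, random_state=0)
# Step 1: label-independent reduction of X (n_x = 40 -> k_x = 10).
X_red = PCA(n_components=10, random_state=0).fit_transform(X)
# Step 2: one binary classifier per label on the reduced features.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_red, Y)
print(hamming_loss(Y, clf.predict(X_red)))  # training-set loss
```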
Let us denote the reduction of X (resp. Y) by the N × k_x (resp. N × k_y) matrix X' (resp. Y'), where k_x (resp. k_y) is the dimension of the reduced space X' (resp. Y'). In practice, the values of k_x and k_y are often fixed a priori (100 and 500 are commonly used values [36], [37]) but different classical strategies can be applied to guide their choice, in particular when the reduction method performs an eigendecomposition (e.g., k_x and/or k_y are the number of eigenvalues above a fixed threshold, or necessary to preserve a percentage of the total sum of eigenvalues). There are three different ways to tackle the dimensionality reduction problem for multi-label classification (Fig. 1): (i) reduce the feature space X into X' and predict the label matrix Y from the reduced feature matrix X', (ii) reduce the label space Y into Y' and predict the reduced label matrix Y' from the feature matrix X, (iii) reduce both the label and the feature spaces into X' and Y' and predict the reduced label matrix Y' from the reduced feature matrix X'.
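The eigenvalue-based choice of k mentioned above can be sketched as follows; the 95% threshold and the synthetic data are illustrative assumptions.

```python
# Sketch of one classical way to choose k_x: keep the number of
# eigenvalues of the feature covariance needed to preserve a given
# percentage of the total eigenvalue sum (here 95%).
import numpy as np

rng = np.random.default_rng(0)
# Correlated synthetic features so that a few directions dominate.
X = rng.standard_normal((200, 30)) @ rng.standard_normal((30, 30))
X = X - X.mean(axis=0)                       # center the features
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]  # sorted, largest first
ratio = np.cumsum(eigvals) / eigvals.sum()
k_x = int(np.searchsorted(ratio, 0.95) + 1)  # smallest k reaching 95%
print(k_x)
```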
For each of these cases, the dimensionality reduction problem can be set as an optimization problem:

opt_{U,V} f_d(X, Y, U, V) + r(U, V) subject to (U, V) ∈ c,    (1)

where: U and V are either the parameters of a transformation function applied to X and Y (e.g., projection matrices P_x and P_y in the case of a linear transformation) or the reduced matrices X' and Y'. When a method reduces one space only (either X or Y), the problem is defined with one parameter (U or V). f_d is the reduction objective function, which is independent from or dependent on the classifier. r is a regularization function, often associated with a norm (L1, L2, L12) on the parameter space, which is introduced to limit overfitting and to simplify the model. c is a constraint set on the search space. Some approaches do not introduce constraints but most of them try to reduce the degrees of freedom of the problem to make its resolution easier. In the following we detail the different ingredients of problem (1). We first present the most popular objective functions f_d which are independent of the classifiers, and we specify their definitions according to the spaces targeted by the dimensionality reduction. Then, we discuss the different cases where the dimensionality reduction objective is coupled with the classification objective. For each case, only one example from the literature is given for illustration and we refer to Table 2 for a detailed state-of-the-art. In addition, Table 1 synthesizes the strategy of each of the reviewed methods. We finish with a synthetic presentation of the regularization functions and the additional constraints applied by multi-label dimensionality reduction methods.

Classifier-Independent Objective Functions
We here present the dimensionality reduction methods with an objective function independent of the classifier. They are grouped according to the space they reduce (X, Y, or both X and Y).

Feature Space Reduction (X)
The feature space reduction methods turn the initial large feature space X into a reduced space X' with the goal of extracting the essential information of the data. As features are partially noisy, redundant and/or irrelevant, some works also aspire to fix these original defects [38]. The objective function f_d is either independent from or dependent on the information carried by the labels.
Most of the label-independent methods were initially developed for other learning paradigms but quite a few of them have also been frequently applied in multi-label learning. Their objectives can be organized into three families depending on the information considered for the reduction [1]:
1) Objective FI1: maximize the conservation of the feature covariance/co-occurrences (e.g., Principal Component Analysis (PCA) [39]);
2) Objective FI2: minimize the reconstruction error formulated by a distance between X and X' (e.g., Autoencoders (AE) [40]);
3) Objective FI3: maximize the conservation of distances between items described by X and by X' (e.g., Locality Preserving Projection (LPP) [41]). The conservation is either global, if all pairwise distances are equally maintained, or local if, for example, each item only preserves its distances with its nearest neighbors.
Let us remark that these objectives may be closely linked together; for instance, PCA, classified in FI1, also implicitly minimizes the quadratic reconstruction error between a projection of X' and X (FI2). In addition, besides these approaches, random projections (Objective R) have been explored [42], [43].
The label-dependent objectives aim at guiding the reduction with label information [44], [45]. This helps to strengthen the link between the extracted reduced feature space X' and the label space Y. They cover three main strategies:
1) Objective FD1: maximize the X-Y link via a standard criterion (covariance, Hilbert-Schmidt Independence Criterion) (e.g., Multi-label Dimensionality Reduction via Dependence Maximization (MDDM) [35]);
2) Objective FD2: preserve the isometry between the instances described in the initial label space Y and the instances described in the reduced feature space X' (e.g., Hypergraph Spectral Learning (HSL) [46]);
3) Objective FD3: maximize the link between the feature and the label space by learning a subspace X' that can be used to reconstruct both X and Y (e.g., Multilabel Latent Semantic Indexing (MLSI) [47]).
In addition, several hybrid approaches optimize a parameterized trade-off (e.g., θ_1 · objective FD1 + θ_2 · objective FD2) between the above objectives (e.g., Maximizing feature Variance and feature-label Dependence simultaneously (MVMD) [48]).
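Objective FD1 can be sketched as follows, in the spirit of MDDM: find an orthonormal projection P_x maximizing the (empirical, linear-kernel) Hilbert-Schmidt dependence tr(P_x^T X^T H L H X P_x), where L = Y Y^T is the label kernel and H the centering matrix. Linear kernels on both sides and the synthetic data are assumptions of this sketch, not the full published method.

```python
# MDDM-style sketch (Objective FD1): the reducing projection P_x is
# given by the top eigenvectors of the symmetric matrix
# M = X^T H L H X, with L = Y Y^T and H the centering matrix.
import numpy as np

rng = np.random.default_rng(0)
N, n_x, n_y, k_x = 150, 20, 5, 4
X = rng.standard_normal((N, n_x))
Y = (rng.random((N, n_y)) < 0.3).astype(float)   # binary labels

H = np.eye(N) - np.ones((N, N)) / N   # centering matrix
L = Y @ Y.T                           # linear label kernel
M = X.T @ H @ L @ H @ X               # symmetric n_x x n_x
eigvals, eigvecs = np.linalg.eigh(M)  # ascending eigenvalues
P_x = eigvecs[:, -k_x:]               # top-k_x eigenvectors
X_red = X @ P_x                       # reduced feature matrix X'
print(X_red.shape)  # (150, 4)
```

The orthonormality constraint P_x^T P_x = I holds by construction since the eigenvectors of a symmetric matrix are orthonormal.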

Label Space Reduction (Y)
As some labels are correlated, it seems intuitive to take these correlations into account to improve both the quality and the scalability of the classification [17]. This can be achieved by learning a dimensionality-reduced label space. One of the first label space reductions for multi-label classification was based on compressed sensing (CS) [25]. The transformation made by CS is a random projection without training (Objective R). However, in the prediction phase, CS solves an optimization problem for each instance to reconstruct the label vector from a reduced one. Since then, various strategies have been proposed. Dually to the feature space reduction above, they are either independent from or dependent on the information carried by the features. The feature-independent objectives can be organized into three families similar to the label-independent feature space reduction:
1) Objective LI1: maximize the conservation of the label covariance (e.g., Principal Label Space Transformation (PLST) [49], which is the equivalent of PCA applied to the label space);
2) Objective LI2: minimize the reconstruction error formulated by a distance between Y and Y' (e.g., Multilabel prediction via compressed sensing [25]);
3) Objective LI3: maximize the conservation of distances between items described by Y and by Y'.

[1] Each objective is encoded to be identified in Table 2. For instance, FI1 refers to the 1st objective of the label-Independent Feature reduction methods.

Table 1 (excerpt). Strategies of some of the reviewed methods (row numbers as in the full table):
- Orthonormal Neighborhood Preserving Projection (ONPP): orthonormal feature space projection which preserves each item's location with respect to its l nearest neighbors.
- 9. Shared subspace for multi-Label learning (ML_LS): embedding resulting from a trade-off between the label space and the feature space reconstructions.
- 10. Partial Least Square (PLS): construction of a dimensionality reducing projection that maximizes the correlations between the projected feature space and the label space.
- 11. Multi-label Latent Semantic Indexing (MLSI): linear feature space projection to optimize both the reconstruction of the original feature space and the correlations between the projected feature space and the label space.
- 12. Orthonormal Partial Least Square (OPLS): extension of PLS with an orthonormality constraint on the reduced feature space.
- 13. Hypergraph Spectral Learning (HSL): spectral decomposition of a hypergraph which links instances with many common labels to obtain a reduced feature space that favors locality between them.
- 14. Joint Dimensionality Reduction and Multi-label Classification (MLSVM): simultaneous learning of a feature space reducing projection and an SVM classifier applied on the obtained reduced space.
- 39. Supervised Semantic Indexing (SSI): reduction of the feature space and the label space to increase (resp. decrease) the similarity between relevant (resp. irrelevant) pairs x'-y' of feature and label vectors.
- 45. Supervised Dual Space Reduction (2SDSR): family of methods that apply an existing dependent feature space reduction method on X (e.g., MDDM) and an existing dependent label space reduction method on Y.
- 46. Convex Co-Embedding (ILA): projection of the label and feature spaces which optimizes a similarity, for each instance, between the reduced feature vector and the reduced label vector.
- 47. Low rank Empirical risk minimization for Multi-Label Learning (LEML): simultaneous training of a classifier and a linear feature space reduction with the low-rank Empirical Risk Minimization problem.
- 48. Bi-Directional Representation Learning (Bi-Dir): simultaneous predictions of labels from features and features from labels with an intermediary dimensionality reduction based on a bi-directional neural network.
- Bayesian Multi-label Learning via Positive Labels (BMLPL): EM-based construction of a subset of reduced labels (called topics) that can (i) reconstruct all the labels (Poisson law) and (ii) be predicted from features (Gamma law).
- Sparse Local Embedding for Extreme Classification (SLEEC): clustering of the instances and construction of local embeddings, on each cluster, of the feature space to obtain the same closest neighborhood as in the label space.
We report "yes" for scalability if the complexity is strictly under quadratic. The objective codes, which refer to the type of objective considered by the methods, are detailed in Section 2.
There are also feature-dependent objectives. Indeed, when considering dimensionality reduction for classification, reducing the labels while strengthening the links with the features can be useful. Conditional Principal Label Space Transformation (CPLST) [51] is one of the first methods to reduce the labels with an objective dependent on the features, and it has opened the way to many other feature-dependent label space reduction approaches. They maximize the correlations between X and Y' to improve the predictability of one matrix from the other (Objective LD).
In addition, several hybrid approaches solve a parameterized trade-off between minimizing the reconstruction error between Y and Y' and maximizing the prediction of the feature matrix X from the reduced label matrix Y' (e.g., Dependence Maximization based Label space dimensionality Reduction (DMLR) [52]).
Note that when the label space is reduced and the classification model is trained on Y', the latter predicts reduced label vectors y' and it is necessary to reconstruct the original label vectors y from them. Three cases are commonly encountered in practice (the main associated methods are indicated in parentheses):
1) A reconstruction model C_inv: y' ↦ y is trained during the reduction phase or after it [53]. It allows the reconstruction of y from y' in the test phase. When the reduction is based on an orthogonal projection P_y, the reconstruction is often computed by the transpose projection (y' ↦ y = P_y^T y'). (PLST, MOPLMS, ML-CSSP, CPLST, BML-CS, FaIE, Rembrandt, DMLR, TRANS, LEML, Bi-Dir, BMLPL, GIMC, C2AE, WSABIE, COMB)
2) If the dimensionality reduction method explicitly provides a reduction function C: y ↦ y', then, given a reduced label vector y', the original label vector y can be recovered by solving the following structured output learning [54] problem: min_y l(y', C(y)), where l is a loss function. The optimization is often performed with matching pursuit [55] or basis pursuit [56]. (CS, MLC-BMaD, MSE)
3) The nearest neighbors of y' are computed in the reduced training set and y is deduced from the aggregation of their original label vectors [37]. (CLEMS, SSI, SLEEC)
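Case 1 above (orthogonal label projection plus transpose-projection decoding) can be sketched in the PLST spirit. The mean-shift, the ridge regressors and the 0.5 rounding threshold below are illustrative simplifications, not the exact published algorithm.

```python
# PLST-style sketch: reduce Y with an orthogonal projection obtained
# by SVD, train regressors to predict Y' from X, then decode with the
# transpose projection and round back to binary labels.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
N, n_x, n_y, k_y = 200, 15, 10, 3
X = rng.standard_normal((N, n_x))
Y = (rng.random((N, n_y)) < 0.4).astype(float)

y_mean = Y.mean(axis=0)
_, _, Vt = np.linalg.svd(Y - y_mean, full_matrices=False)
P_y = Vt[:k_y].T                      # n_y x k_y, orthonormal columns
Y_red = (Y - y_mean) @ P_y            # reduced label matrix Y'
reg = Ridge(alpha=1.0).fit(X, Y_red)  # predict Y' from X
# Decoding: y = y' P_y^T + mean, then threshold back to {0, 1}.
Y_hat = (reg.predict(X) @ P_y.T + y_mean > 0.5).astype(float)
print(Y_hat.shape)  # (200, 10)
```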

Both Feature Space and Label Space Reduction
When both spaces are reduced, the reduction of each space depends on the other, and two main strategies have been investigated:
Objective LFD1: seek the principal directions in both the label space and the feature space which maximize the linear correlations with each other. Originally developed with the popular method CCA (Canonical Correlation Analysis) [23], this approach has led to dozens of extensions in multi-label classification (e.g., an extension with a least square resolution LS-CCA [29], an extension with a sketching technique [57], an output-code extension [58]). Moreover, some methods have extended CCA by combining it with other approaches (e.g., the Two-Stage Dual Space Reduction Framework (2SDSR) [59]).
Objective LFD2: minimize a distance function between X' and Y' (e.g., Supervised Semantic Indexing (SSI) [60]).
In addition, note that there is a special case (Independent Dual Space Reduction (IDSR) [61], [62]) where the label and feature space reductions are operated independently (Objective LFI): a label-independent feature space reduction is applied on X and a feature-independent label space reduction is applied on Y.

Coupling Dimensionality Reduction with the Classifier Objective
As previously pointed out, a large majority of the reduction approaches are applied as a data pre-processing step independent of the classification stage. But this procedure can turn out to lack flexibility in some cases: its performance may be high for some problems and degrade for others. Indeed, it has been observed on many benchmarks that the impact of a reduction method on the classification performance varies with the classifier and the dataset [48]. To overcome this limitation, some works have started investigating the coupling between dimensionality reduction and classification. At first glance, this approach consists in setting the coupling as a multi-objective optimization problem which tries to optimize both the reduction and the classifier objectives (resp. f_d and f_c) simultaneously. This multi-objective/multi-parameter problem is difficult to solve [63], [64], [65] and, in practice, f_d and f_c are alternately or jointly optimized via a linear combination (Objective C1, e.g., the Simultaneous Large-margin and Subspace Learning Approach (TRANS) [66]). But, when we get down to the details, the coupling can also be set up in two other scenarios:
1) Objective C2: the dimensionality reduction is integrated within the classification model by replacing X and Y by X' and Y' in f_c, and the objective is consequently the maximization of f_c (e.g., Linear Dimensionality Reduction for Multi-label Classification (MLSVM) [67]).
2) Objective C3: the dimensionality reduction objective f_d is implicitly designed to optimize the classifier. This happens when the classifier is k-NN. For instance, Supervised Orthonormal Locality Preserving Projection (SOLPP) [68] learns a projection P_x on the feature space X that reduces the distance between instances which share numerous labels. This implicitly optimizes k-NN. Similar strategies are employed in other methods (e.g., Hypergraph Spectral Learning (HSL) [46]).

Explicit and Implicit Transformations
When the algorithm reduces the data via a transformation function, the reduction is explicit and the transformation of any instance can be computed online. Otherwise, the transformation is implicit: the algorithm directly provides the reduced matrix but not the transformation function.

Explicit Transformations
The vast majority of the methods presented in this review reduce dimensionality with projections (X' = X P_x or Y' = Y P_y). They are consequently explicit and linear. These linear transformations can be extended to non-linear transformations with the classical kernel trick, and most of the linear methods have a kernel extension (e.g., kPCA [102] for PCA, kCCA [103] for CCA). Additional non-linear explicit approaches have been adapted to the multi-label case. They can be classified into three categories:
1) Locally linear embeddings [37], [104]: they produce a non-linear transformation, deduced from a piecewise linear transformation, by partitioning the label and/or feature space and computing a specific linear transformation per region.
2) Representation learning with neural networks. The target output depends on the network architecture. For the autoencoders [40] the output is a reconstruction of the input layer. For the multi-label neural networks [105], [106], [107] the output is a prediction of Y (resp. X) and the input is X (resp. Y). More complex architectures, which combine autoencoders and multi-layer perceptrons, have recently been investigated [53], [101]. For details, we refer to the complete review [27] on representation learning, which includes several methods that have been adapted to multi-label classification [108].
3) Probabilistic processes [89], [97], [109]. The transformation from the initial space (X or Y) to the reduced space (X' or Y') is a combination of parameterized probability laws (often Normal, Dirichlet and Gamma distributions). In that case, the construction of the reduced space is achieved by inference.
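The kernel extension mentioned above can be sketched with scikit-learn's KernelPCA; the RBF kernel, gamma value and synthetic data are illustrative choices.

```python
# Kernel-trick sketch: the PCA objective applied in an implicit RBF
# feature space yields a non-linear but still explicit transformation,
# so new instances can be reduced online with transform().
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X_train = rng.standard_normal((120, 8))
X_new = rng.standard_normal((5, 8))      # unseen instances

kpca = KernelPCA(n_components=3, kernel="rbf", gamma=0.1)
X_red = kpca.fit_transform(X_train)      # 120 x 3
# Explicit transformations allow online reduction of new items:
print(kpca.transform(X_new).shape)  # (5, 3)
```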

Implicit Transformations
The implicit transformations directly provide the reduced space without explicitly computing the transformation operator (e.g., using Multi-Dimensional Scaling (MDS) [110] or Matrix Factorization [111]). They consequently have no reason to be linear. Direct learning of the reduced spaces X' or Y' offers more degrees of freedom in the optimization problem but is more prone to overfitting [112]. Moreover, it is not adapted to incremental processes: when a new item is added, the reduction must be fully relaunched. Nevertheless, the recent rise of extreme multi-label classification [19], [113] stimulates the development of implicit transformations [37], [50], [90], [93], [96]. They are adapted to label space reduction because the online reduction of a label vector is not required. Conversely, explicit transformations are more suitable for feature space reduction because the transformation of new feature vectors is necessary in the prediction phase.
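An implicit transformation can be sketched with a plain truncated SVD used as a matrix factorization; the rank and synthetic binary labels are illustrative assumptions.

```python
# Implicit-transformation sketch: a truncated SVD factorizes
# Y ~ Y' W, yielding the reduced matrix Y' directly, with no
# reduction function; adding a new instance requires refitting.
import numpy as np

rng = np.random.default_rng(0)
Y = (rng.random((100, 20)) < 0.3).astype(float)
k_y = 5

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
Y_red = U[:, :k_y] * s[:k_y]      # implicit reduced labels Y'
W = Vt[:k_y]                      # decoder: Y' @ W approximates Y
err = np.linalg.norm(Y - Y_red @ W) / np.linalg.norm(Y)
print(Y_red.shape, round(err, 2))  # best rank-5 relative error
```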

Regularization and Constraints
Adding a regularization function r to the objective function f_d, or a set of constraints to the optimization problem (1), aims at (i) reducing the degrees of freedom of the problem, (ii) providing simpler transformations of the initial spaces into the reduced ones by restricting the parameters, (iii) improving generalization and limiting overfitting, and (iv) building more classification-friendly training sets. These objectives, which are common to many machine learning problems, are integrated into the optimization problem in a variety of ways:
- Sparse transformations. Some methods impose sparsity on the reduced space variables or on the reduction function parameters [37], [96]. Formally, the sparsity of a matrix is computed from its L0-norm, but due to its non-continuity and non-differentiability the authors usually resort to the L1-trick and relax the L0-norm into an L1-norm [114]. In practice, this approach limits overfitting, optimizes storage and speeds up training and prediction;
- Limited search space. A major part of the algorithms impose the minimization of the L2-norm of the parameters. This favors solutions with low-value parameters [29], [53];
- Sparse and small parameter sets. This is achieved with an Elastic Net regularization [115], which is a linear combination of the L1 and L2 regularizations [37];
- Parameter clipping. This regularization restrains the parameter definition domain to a fixed interval with thresholding techniques [116];
- Dropout regularization. Some neural network based approaches regularize their parameters by using the dropout strategy [117], which selects a different random subset of parameters at each training step.
Moreover, constraints are also introduced to limit noise and variable correlations, which are enemies of most classifiers [118]. Two usual constraints aim at facilitating the classification task:
- Uncorrelated space. Classification is easier when the correlations in the variable space are limited. Such a constraint can be expressed in the matrix form X'^T X' = I (or Y'^T Y' = I). Let us remark that this constraint is linked to an L2-norm regularization (||X'||_F^2 = tr(X'^T X') = tr(I));
- Orthonormal projection. This constraint is expressed in the popular linear case by P^T P = I. Some authors have also proposed a trade-off P^T((1 − μ) X^T X + μ I)P = I between these two constraints [35], [48].
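The L1-trick mentioned above is usually applied through the soft-thresholding (proximal) operator of the L1-norm; the matrix and threshold below are illustrative.

```python
# Sketch of the L1-trick: the proximal operator of lam * ||P||_1
# shrinks entries toward 0 and zeroes the small ones, which is how an
# L1 regularization produces sparse transformation parameters.
import numpy as np

def soft_threshold(P, lam):
    """Prox of lam * ||P||_1: shrink toward 0, clip small entries."""
    return np.sign(P) * np.maximum(np.abs(P) - lam, 0.0)

P = np.array([[0.9, -0.05], [0.02, -1.2]])
P_sparse = soft_threshold(P, 0.1)
print(P_sparse)  # the two small entries become exactly 0
```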

TWO GENERIC PROBLEM FORMULATIONS
The previous section highlights the great variability of the ways to address the issue of dimensionality reduction for multi-label classification. In the literature, where each author resorts to his/her own formulation, this variability is an obstacle to a fine understanding of the similarities and differences between the approaches. To make comparisons easier, we here propose two generic formulations of the general problem (1). The first one, closely linked to a generic resolution scheme based on eigendecomposition, makes it possible to express more than half of the problems. The second one is an extension which covers all cases. It is associated with a large variety of optimization processes (e.g., gradient descent, the Newton method, Lagrangian techniques).

The Basic Framework
As we show in Table 3, a large number of problems can be written as follows:

opt_U tr(U^T A_XY U) subject to U^T B_XY U = I,    (3)

where: 1) A_XY and B_XY are matrices which are functions of X and Y; 2) according to the space that is reduced and the type of transformation, the parameter U is one of the following matrices: X', Y', P_x, or P_y; 3) the optimization goal (opt) is either a minimization or a maximization objective.
Problems expressed by (3) can be solved with an eigendecomposition. It is well known that, using the Lagrangian method [120], the problem (3) is equivalent to the following generalized eigenvalue problem:

A_XY u = λ B_XY u.    (4)

The solution U of (3) in the maximization (resp. minimization) case is therefore the matrix of the eigenvectors associated with the k largest (resp. smallest) eigenvalues of (4). In the frequent case where the matrix A_XY is symmetric positive, the eigenvectors of A_XY can also be retrieved by a singular value decomposition [121] of the square root R_XY of A_XY, defined by R_XY^T R_XY = A_XY. Despite its elegant solution, the eigenvalue decomposition (4) is computationally complex: in the order of n^2 real-valued numbers for spatial complexity and n^3 operations for temporal complexity [122]. For scaling, different approaches are used: fast eigendecomposition techniques (e.g., Jacobi [123] and QR [124]), approximation of the largest eigenvalues (power iteration algorithm [125], Lanczos method [126]), matrix sketching [127] (e.g., in randomized PCA [71] or Rembrandt [91]). In addition, a reformulation of problem (3) in a least square form is also popular to resort to a numerical optimization method (e.g., the least square version of CCA [92] or LDA [128]). Indeed, the initial least square form min_U min_M ||R_XY − M U^T||_F^2, where R_XY is the square root of A_XY, is equivalent to max_U tr(U^T A_XY U) with the constraint U^T U = I.

Notations for Table 3: M^†: pseudo-inverse of a matrix M; L (resp. L_n): graph (resp. normalized) Laplacian; φ: kernel transformation; W, S_C, S_M, S_b, S_w, S: pairwise weight matrices; θ, α, β: trade-off parameters.
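Solving the basic framework (3) in the maximization case can be sketched directly with SciPy's generalized symmetric eigensolver; the matrices A_XY and B_XY below are arbitrary symmetric (positive definite for B_XY) stand-ins for illustration.

```python
# Sketch of solving problem (3): the generalized eigenvalue problem
# A_XY u = lambda B_XY u, keeping the eigenvectors of the k largest
# eigenvalues as the columns of the solution U.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, k = 10, 3
A = rng.standard_normal((n, n)); A_XY = A @ A.T               # symmetric PSD
B = rng.standard_normal((n, n)); B_XY = B @ B.T + n * np.eye(n)  # sym. PD

eigvals, eigvecs = eigh(A_XY, B_XY)   # ascending eigenvalues
U = eigvecs[:, -k:]                   # solution of the max problem
# U satisfies the constraint U^T B_XY U = I of problem (3):
print(np.allclose(U.T @ B_XY @ U, np.eye(k)))  # True
```

scipy.linalg.eigh normalizes the eigenvectors so that they are B-orthonormal, which is exactly the constraint of (3).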
For illustration, let us consider the classical formulation of PCA, max_{P_x} tr(P_x^T (X^T X) P_x) subject to P_x^T P_x = I. With simple algebra, it can be reformulated into the mean squared reconstruction error minimization problem min_{X', P_x} ||X - X' P_x^T||_F^2. The strong constraint U^T U = I is sometimes replaced with a simpler L_2-regularization on U.
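This equivalence can be checked numerically: for any orthonormal P_x, the reconstruction error differs from tr(P_x^T X^T X P_x) only by the constant tr(X^T X), so maximizing the trace and minimizing the error select the same projection. A sketch with invented toy data:

```python
# Hedged sketch: verifying ||X - X'P^T||_F^2 = tr(X^T X) - tr(P^T X^T X P)
# for an orthonormal P.  X and the angle t are arbitrary toy choices.
import math

X = [[1.0, 2.0], [3.0, 1.0], [0.5, 2.5]]   # 3 instances, 2 features
t = 0.3
P = [[math.cos(t)], [math.sin(t)]]          # one orthonormal column (k_x = 1)

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

XtX = matmul(transpose(X), X)
trace_obj = matmul(matmul(transpose(P), XtX), P)[0][0]   # tr(P^T X^T X P)
trace_XtX = XtX[0][0] + XtX[1][1]                        # tr(X^T X)

Xp = matmul(X, P)                   # reduced data X' = X P
recon = matmul(Xp, transpose(P))    # reconstruction X' P^T
err = sum((X[i][j] - recon[i][j]) ** 2
          for i in range(3) for j in range(2))  # ||X - X'P^T||_F^2

print(abs(err - (trace_XtX - trace_obj)) < 1e-9)  # True: the identity holds
```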
Let us note that a portion of the methods expressed within the basic framework (3) is based on graph spectral decompositions [129], [130]. They follow a two-step procedure: (i) build a graph which links the instances with a proximity property (e.g., a distance on the label space) and (ii) embed the instances in a reduced space by preserving the graph neighborhood structure. The transformation is computed by an eigendecomposition of the normalized Laplacian of the graph (A_XY is the normalized Laplacian and B_XY the identity matrix).
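As an illustration of step (i), a minimal sketch of building the normalized Laplacian L_n = I - D^{-1/2} W D^{-1/2} from a toy symmetric weight matrix (the values of W are invented, e.g., label-space proximities between three instances):

```python
# Hedged sketch: normalized graph Laplacian from a toy weight matrix W.
import math

W = [[0.0, 1.0, 0.5],
     [1.0, 0.0, 0.2],
     [0.5, 0.2, 0.0]]
n = len(W)
deg = [sum(row) for row in W]                      # degree of each vertex
L_n = [[(1.0 if i == j else 0.0) - W[i][j] / math.sqrt(deg[i] * deg[j])
        for j in range(n)] for i in range(n)]
# L_n is symmetric; its eigenvectors for the smallest eigenvalues give
# the neighborhood-preserving embedding of step (ii).
print(all(abs(L_n[i][j] - L_n[j][i]) < 1e-12
          for i in range(n) for j in range(n)))  # True: L_n is symmetric
```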

Towards a General Framework
The equivalence between the basic framework (3) and a least square formulation highlights both its flexibility and its limits. L_1-regularizations [131], multi-label loss functions other than the mean square error [15] and many other ingredients cannot be expressed as a matrix trace. An attempt at generalization has been proposed in [96]. The problem is set as an empirical risk minimization (ERM) problem [132] which requires neither a specific loss function nor a specified regularization. Let us denote by h(x; Z): x -> ŷ the classification model of parameter Z, by l(y, ŷ) = l(y, h(x; Z)) the loss function between the predicted label vector ŷ and the true label vector y, and by r(Z) the parameter regularization. The low-rank empirical risk minimization problem is expressed as follows:

    min_Z Σ_{i=1}^{n} Σ_{j=1}^{n_y} l(Y_ij, h_j(x_i; Z)) + r(Z)  subject to  rank(Z) ≤ k.    (5)

Let us remark that this formulation differs from the classical ERM problem: the added rank constraint on Z ∈ R^{n_x × n_y} entails a dimensionality reduction [133].
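A common way to handle the rank constraint of (5) in practice, used here as a hedged illustration rather than as the actual method of [96], is to factorize Z = u v^T (so rank(Z) ≤ 1) and run plain gradient descent on a squared loss. Data, sizes and step size below are toy choices:

```python
# Hedged sketch: rank-constrained ERM via the factorization Z = u v^T,
# minimized by gradient descent on a squared loss.  All values are toys.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 instances, n_x = 2
Y = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]   # n_y = 2, rank-1 targets
u, v = [0.1, 0.1], [0.1, 0.1]
lr = 0.01

def loss(u, v):
    s = [sum(X[i][p] * u[p] for p in range(2)) for i in range(3)]  # s = X u
    return sum((Y[i][j] - s[i] * v[j]) ** 2 for i in range(3) for j in range(2))

start = loss(u, v)
for _ in range(5000):
    s = [sum(X[i][p] * u[p] for p in range(2)) for i in range(3)]
    r = [[Y[i][j] - s[i] * v[j] for j in range(2)] for i in range(3)]
    gv = [sum(-2.0 * r[i][j] * s[i] for i in range(3)) for j in range(2)]
    gi = [sum(-2.0 * r[i][j] * v[j] for j in range(2)) for i in range(3)]
    gu = [sum(X[i][p] * gi[i] for i in range(3)) for p in range(2)]
    u = [u[p] - lr * gu[p] for p in range(2)]
    v = [v[j] - lr * gv[j] for j in range(2)]
print(loss(u, v) < start)  # True: the empirical risk decreased
```

The factorized parameterization enforces the rank constraint implicitly, which avoids the combinatorial difficulty of optimizing over the non-convex set {Z : rank(Z) ≤ k}.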
The formulation (5) covers a large part of the methods of the literature but, to include the remaining uncovered cases, we propose a generic formulation of the objective function as an additive combination of the essential ingredients encountered in the multi-label dimensionality reduction typology:

    J(X', Y', Z_x, Z_y, Z_xy) = α_x e_x(X', X, Z_x) + α_y e_y(Y', Y, Z_y) + α_xy e_xy(X', X, Y, Y', Z_xy) + α_p p(X', Y') + α_r r(X', Y', Z_x, Z_y, Z_xy),    (6)

where:
- e_x is a reconstruction error between X and its reduced version X';
- e_y is a reconstruction error between Y and its reduced version Y';
- e_xy is a joint error between X, Y, X', and Y' which can, for instance, express the classification error;
- r is a parameter regularization;
- Z_x, Z_y, Z_xy are the parameters of the reduction and the classification functions;
- p denotes additional properties imposed on both reduced spaces.

The reconstruction error e_x can be expressed with both the encoding loss l_x1 (reconstruction of X' from X) and the decoding loss l_x2 (reconstruction of X from X'):

    e_x(X', X, Z_x) = α_x1 l_x1(X', f_x1(X; Z_x)) + α_x2 l_x2(X, f_x2(X'; Z_x)),    (7)

where the f functions are parametric models. This is also valid for e_y and e_xy. In most cases, the regularization r can be additively decomposed:

    α_r r(X', Y', Z_x, Z_y, Z_xy) = α_r1 r_1(X') + α_r2 r_2(Y') + α_r3 r_3(Z_x) + α_r4 r_4(Z_y) + α_r5 r_5(Z_xy).    (8)

In (6), (7) and (8), the α constants are weights that allow trade-offs between the different components of the problem.
All forms of (6) are tackled with customized numerical optimization methods [134], [135]. Depending on the convexity, smoothness, order, differentiability and conditioning of the formulation, the problem is sometimes reformulated (convex relaxation [136], primal/dual conversion [137], preconditioning [138]) and the resolution is performed either with an adapted variant of gradient descent [139], [140], with coordinate descent [141], or with higher-order algorithms such as Newton's method [142] or the Frank-Wolfe algorithm [143]. Also, constrained problems are generally solved with a Lagrangian method [144], with one of its diverse extensions (e.g., an Augmented Lagrangian scheme like ADMM [37], [145]), or with a projected gradient descent. The choice of the formulation/resolution pair is essential: it affects the spatial and temporal complexities of the computations and the quality of the convergence towards the solution.
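A hedged, deliberately minimal illustration of projected gradient descent, the last of the constrained-resolution strategies above: each gradient step is followed by a projection onto the feasible set, here the unit sphere; the target c is arbitrary:

```python
# Hedged sketch: projected gradient descent for
#   min ||u - c||^2  subject to  ||u|| = 1,
# whose solution is c / ||c||.  c and the step size are toy choices.
import math

c = [3.0, 4.0]
u = [1.0, 0.0]
for _ in range(100):
    g = [2.0 * (u[i] - c[i]) for i in range(2)]   # gradient step
    u = [u[i] - 0.1 * g[i] for i in range(2)]
    norm = math.sqrt(u[0] ** 2 + u[1] ** 2)       # projection onto ||u|| = 1
    u = [x / norm for x in u]
print([round(x, 4) for x in u])  # converges to c/||c|| = [0.6, 0.8]
```

The same step-then-project pattern applies to the orthogonality or rank constraints of the methods surveyed here, with the normalization replaced by the projection onto the corresponding feasible set.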
To be complete, let us point out that two families of dimensionality reduction methods explored for multi-label classification reach the limits of the generic formulation. The first one includes approaches based on mixture models [97], which are solved with the suitable state-of-the-art EM algorithm and its variants [146], [147]. The second one includes the ensemble strategies (bagging [80], [88] and boosting [72]) where multiple dimensionality reducing transformations are trained on bootstraps and aggregated according to two main strategies: either the reduced spaces produced by the transformations are aggregated into a global reduced space on which the classifier is trained, or a classifier is trained on each reduced space and the predictions of the classifiers are aggregated.

META-ANALYSIS
Our previous generic frameworks allow us to explicitly identify the different ingredients involved in the various approaches proposed in the literature and help to understand their common points and differences. However, in practice, a question persists: which are the most efficient approaches? It is difficult to answer because only partial comparisons are generally reported in the articles and, to the best of our knowledge, no experimental study compares all the approaches presented in Table 2. Moreover, the computational implementations are very diverse and for some approaches the source codes or the parameter settings are not even available. Hence, a normalized comparison would entail a recoding of all the algorithms and a new battery of tests on a unified framework that remains to be defined in the research community. With the large set of algorithms to be considered, this would require a considerable and time-consuming effort, and consequently the exploitation of the outcomes of the existing published research works appears as a more realistic alternative. Since the experimental protocols (datasets, classifiers, performance measures, etc.) vary from one publication to another, we here propose a new meta-analysis methodology.
Often defined as "the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings" [148], meta-analysis has developed steadily since its pioneering works in the 1930s [149], [150]. One of its favored fields is medicine, where the aggregation of the available pieces of information is required to make decisions as rational as possible. In computer science, this approach is still unusual, as the great majority of researchers prefer to compare their own approaches with a restricted subset of existing ones on a set of benchmarks they habitually use, but the first attempts (e.g., [151], [152], [153]) seem promising.
In this paper we aim at identifying the dimensionality reduction approaches used in multi-label classification for which several pieces of evidence show their domination over others: these approaches statistically obtain the best performances in the results published in international conferences and journals with a review process. As it is well known in multi-label classification that the performances can be evaluated with a wide range of measures, we extract the relevant methods for each of the most frequently used and independent quality measures. In the following we first present a descriptive analysis of the observed occurrences and co-occurrences of the algorithms and of the measures, then we detail the process to extract the dominant approaches, and finally we discuss the obtained results.

Methodology
From all the papers referenced in Table 2, we have retained a corpus C of 27 papers (marked in bold type) which present relevant and exploitable results for a meta-analysis. More precisely, we have first extracted the 32 papers that compare at least two methods, and then we have removed the 5 papers whose results are given on graphics only, since they are difficult to exploit.

The Considered Algorithm Set
Let us denote by A the set of the 42 algorithms that appear in the selected papers of the corpus C. The published pairwise comparisons can be described by a multigraph G c : the vertices represent the algorithms of A and an edge is added between two algorithms when they are compared in a paper. In the graph layout (Fig. 2), the vertex diameter is proportional to the frequency of the associated algorithm and the edge set between a vertex pair is represented by a single edge whose width is proportional to the set cardinality.
The obtained layout is very different from that of a complete graph, which would be the ideal model, but one far from reality, in which each algorithm would be compared to all the others in many experiments. However, it highlights two communities that correspond to two families of algorithms which have mostly been studied separately. This confirms that the published comparisons have been done on subsets of algorithms which share common properties. The first community C_1 regroups approaches that reduce the feature space dimension and the second one C_2 regroups label space and co-label and feature space reduction algorithms, including those developed in the context of extreme multi-label learning. Moreover, two vertices (CCA and MDDM) appear at the intersection of C_1 and C_2: they have been considered as baselines for a long time, and CCA, which reduces both feature and label spaces, naturally belongs to both communities. Three algorithms (BiDir, MLSVM, MSE) are linked to CCA or MDDM only and consequently, in addition to C_1 and C_2, we consider an "in-between subset" C_1-2 which includes those five algorithms. Edges between C_1, C_2 and C_1-2 mainly originate from the reference [62], which is a recent comparison of different multi-label classification approaches. The vertex diameters highlight the most frequently occurring methods, which are often mentioned among the pioneers in their community: CCA, MDDM, LEML, PCA, CS, PLST and CPLST. In the following we aim at identifying the significant relationships in the multigraph G_c.

Table 4 shows the occurrences and co-occurrences of the different measures used in the articles of the corpus C. It underlines the great variability of the considered criteria, and the frequency distribution allows to distinguish the most popular measures: Hamming Loss (44 percent), AUC (33 percent), F1 (26 percent), Macro-F1 (26 percent), and Micro-F1 (26 percent).
In addition to these observations, our selection of the suitable measures for the meta-analysis is guided by a recent comparison [154] which has experimentally shown that some measures are highly correlated whereas some others are independent. More precisely, its authors have tested a set of 16 measures (those present in Table 4 plus some variants) and have compared them with the Pearson and Spearman correlations on 100,000 simulations. Results show that Hamming Loss, Coverage and Ranking Loss are independent, but here only Hamming Loss is taken into account because the frequency of the two others is very low on C. Results also detect a strong correlation between the measures of a large set M = {Subset Accuracy (or 0/1 Loss), Accuracy, Precision, Recall, F1, One Error, Average Precision, Micro Precision, Macro Precision, Micro F1, Macro F1, Micro Recall, Macro Recall}. Consequently, when several measures of M are used for a comparison of two algorithms in a same paper, we only retain the most frequent one. Let us precise that AUC and P@3 have not been considered in [154]. They have nevertheless been added here to M, as the computation on our data of their Pearson correlation coefficients with the other measures of M confirms the correlation: the value ranges between 0.576 (with One Error) and 0.829 (with Macro-F1) for AUC and is close to 1 (with One Error) for P@3. The two studies, with Hamming Loss and with the subset of selected measures from M (respectively referred to in the following as H and M), are conducted separately on the article subsets of C which take them into account (12 articles for H and 24 for M).
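For reference, the Pearson coefficient used for these correlation checks can be sketched as follows; the two score lists are invented toy values, not the data of [154] or of the corpus C:

```python
# Hedged sketch: Pearson correlation between two lists of measure scores.
# The auc and macro_f1 values below are hypothetical per-dataset scores.
import math

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((a[i] - ma) * (b[i] - mb) for i in range(n))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((x - mb) ** 2 for x in b))
    return cov / (sa * sb)

auc      = [0.61, 0.70, 0.74, 0.82, 0.90]
macro_f1 = [0.40, 0.52, 0.55, 0.66, 0.71]
print(round(pearson(auc, macro_f1), 3))  # close to 1: strongly correlated
```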

The Consensus Based Approach
Our meta-analysis inspired by the consensus theory [33], [34] is decomposed into two successive steps: (i) filtering the statistically significant domination relationships for the measures H and M, and (ii) extracting the dominant algorithms for each measure. Finally we identify the algorithms which statistically dominate the others in the two cases. With a similar process we complete the analysis by distinguishing the algorithms which are dominated.
More precisely, the significant domination relationships are extracted with Friedman and post-hoc Nemenyi tests [155] at a standard significance level α = 0.05. For each measure, we build a directed domination multigraph, denoted respectively by G_D(H) and G_D(M), from G_c by retaining the significant edges and by orienting them according to the direction of the domination: a directed edge from A_i to A_j means that the algorithm A_i significantly outperforms the algorithm A_j in a paper of the corpus C. The first stage of a topological sorting [156] on each multigraph allows identifying the subsets D(H) and D(M) of A which contain the dominant algorithms: A_i is dominant when its indegree is null and its outdegree is strictly positive. Similarly, the dominated algorithms are those with a null outdegree and a strictly positive indegree.
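The dominance-extraction step can be sketched as follows; the (winner, loser) edges are invented for illustration and do not reproduce the actual content of G_D(H) or G_D(M):

```python
# Hedged sketch: extracting dominant (indegree 0, outdegree > 0) and
# dominated (outdegree 0, indegree > 0) algorithms from significant
# directed edges.  The edge list is a toy example.
edges = [("MVMD", "CCA"), ("SSMDDM", "MLSI"), ("MDDM", "CCA"),
         ("MVMD", "MLSI")]

nodes = {a for e in edges for a in e}
indeg  = {a: sum(1 for _, l in edges if l == a) for a in nodes}
outdeg = {a: sum(1 for w, _ in edges if w == a) for a in nodes}

dominant  = sorted(a for a in nodes if indeg[a] == 0 and outdeg[a] > 0)
dominated = sorted(a for a in nodes if outdeg[a] == 0 and indeg[a] > 0)
print(dominant, dominated)
```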

Results
A multigraph overview reveals communities of algorithms with similar behaviors. A detailed analysis of these communities helps to identify the dominant algorithms.

Algorithm Community Detection
The two directed multigraphs G_D(H) and G_D(M) are represented in Fig. 3. They both have far fewer relationships than the co-occurrence graph G_c. With greater α thresholds, additional directed edges appear but the confidence that can be placed in them is weaker. As a consequence, for the standard threshold α = 0.05, the directed multigraphs become digraphs with at most one directed edge between each vertex pair, and some algorithms of A with a null degree are no longer represented. Indeed, in some articles the number of experiments is too low to detect a domination which is statistically significant. Table 5 indicates, for each article of C and regardless of the considered quality measure, the ratio CD/r_max between the critical difference CD of the post-hoc Nemenyi test for α = 0.05 and the theoretical maximal ranking difference r_max between the compared algorithms. If q algorithms are compared, then r_max = q - 1. The higher this ratio is, the fewer significant relationships can be expected, and when it is greater than 1 none of them can be extracted. In the multigraphs G_D(H) and G_D(M), the communities identified in Section 4.1.1 are associated to different connected components: algorithms from different communities have been infrequently compared and are not linked by a significant relationship. Hence, we present the dominant methods for each community. Results are summarized in Table 6. Due to the bibliographic effect, which favors the presence of the best approaches at each period, the most recent (resp. oldest) approaches are more likely to be dominant (resp. dominated), but there are noticeable exceptions such as MDDM and MLLS.
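The ratio of Table 5 can be sketched as follows. For the Nemenyi test, CD = q_α · sqrt(q(q+1)/(6N)) for q algorithms compared on N datasets; q_α = 2.728 is the usual α = 0.05 studentized-range value for q = 5, and q = 5, N = 10 are illustrative numbers, not figures taken from a paper of C:

```python
# Hedged sketch: CD / r_max ratio for a hypothetical study comparing
# q = 5 algorithms on N = 10 datasets (toy values).
import math

q, N, q_a = 5, 10, 2.728           # q_a: Nemenyi alpha = 0.05 value for q = 5
cd = q_a * math.sqrt(q * (q + 1) / (6 * N))
r_max = q - 1
ratio = cd / r_max
print(round(ratio, 3))  # < 1, so significant relationships can be extracted
```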

In-Depth Comparison
The three methods (MVMD, SSMDDM and MDDM) that dominate for both measures belong to community C_1 (label-dependent feature space reduction methods) or to C_1-2, and they have close strategies. Let us recall that MDDM maximizes the Hilbert-Schmidt Independence Criterion between the reduced feature space and the label space, and that MVMD and SSMDDM are hybrid methods whose objective is a trade-off between the objective of MDDM and that of another method (see Table 1). MVMD and SSMDDM are recent approaches which have been extensively compared to others, but in a single paper, whereas MDDM, which is older, resists a larger number of comparisons.
However, to the best of our knowledge, these three methods have not been directly compared to one another. Consequently, in an attempt to better understand their behaviors, we have compared them within the same framework. More precisely, the algorithms have been re-implemented (in Python) and tested in the same computational environment (a standard computer with 16 GB of RAM). We used ten multi-label datasets often selected in previous multi-label learning studies. They are divided into train/test sets in the Python library scikit-multilearn.2 The feature (resp. label) dimensionality varies from 72 to 1836 (resp. from 6 to 983). The reduction methods are combined with the ML-kNN classifier and the parameter settings are extracted from the publications. For each method, ten dimensions k_x have been tested (from 10 to 100 percent of the feature dimensionality n_x) and the best results with the F1-score in the measure set M and with the Hamming Loss (H) are presented in Fig. 4. With the average rank criterion, SSMDDM outperforms MVMD and MDDM for both measures, and MVMD is better (resp. worse) than MDDM for M (resp. H). However, when getting into the details, we observe that the results may depend on the datasets: e.g., SSMDDM obtains poor results on the Birds dataset. This confirms the interest of the meta-analysis, which aggregates results gathered from different experiments involving a variety of datasets. Even if a method outperforms the others on average on a limited number of datasets, the methods that it does not dominate are also worth considering in a meta-analysis.
The community C_2 is very small in G_D(H): the authors of the algorithms of C_2 are mostly interested in data with a large number of labels and give more importance to ranking measures than to global classification errors such as Hamming Loss. Consequently, no method of C_2 is simultaneously dominant for the two measures. For the M measure, SLEEC and REML dominate in several papers. These methods have been especially designed for extreme multi-label learning and, contrary to the others, which build low-rank data representations that can miss the information brought by the long-tail label distribution, they compute high-rank representations which capture more useful information. Due to their efficiency, they have gained popularity in recent years. In addition to these most visible results, it is interesting to identify the approaches that are dominant for one measure and dominated for the other. LS-CCA and HCCA (resp. LDA) are dominant in G_D(H) (resp. G_D(M)) and dominated in G_D(M) (resp. G_D(H)). These methods are not intrinsically expected to be more efficient for the Hamming Loss than for the M measures and, due to the absence of correlation between M and H, it is not surprising to find different behaviors. This result confirms the interest of this double analysis.

2. http://scikit.ml/
The dominated methods for the two measures are HSL, DMLDA, MLSI, and CCA. They all belong to C 1 or C 1À2 due to the lack of Hamming Loss measurements in C 2 . They illustrate the bibliographic effect: they are early methods that have been dominated by more recent proposals. More precise interpretations should be made very carefully. HSL performs poorly in the available experiments but it has been considered only once in the corpus C. Moreover, CCA might have been implemented in the majority of papers with a version that is not the optimal one. In the experiments, the feature, label, and feature/label covariance matrices are often badly conditioned due to the multi-label dataset characteristics and it is known that the integration of a slight regularization with a Penrose generalized inversion leads to much better performances [92].
For the set of algorithms from A which do not belong to these two extremal classes, two cases must be considered. For the algorithms which are not in the G_D graphs, our meta-analysis cannot conclude anything more than the lack of significance of the comparisons which take them into account. For the others, which have both a non-null indegree and a non-null outdegree, a reasonable recommendation is to replace them with one of the dominant algorithms of their community defined for a similar task.

CONCLUSION
This review is written in a context where multi-label classification is receiving growing attention and meets today's needs to process high dimensional data. To tackle the complexity of the problems, a large number of multi-label dimensionality reduction methods have been published in the last decades. These publications greatly enrich the literature but it remains difficult to link them, to pick the right one for the problem at hand, and to determine the work that remains to be done on the topic. Our review attempts to provide these elements.
More precisely, we have proposed an overview of the methods through a unifying typology completed by a generalized formulation of the problems and a meta-analysis of the experimental results which can be used as a guideline for algorithm selection and which gives insights for future research.
Overview of the Methods. Constructing a typology was required to disentangle the links between the various methods. It is based on three major criteria. The first one is the space that the methods reduce (feature space, label space or both). Feature space reduction approaches are prevalent for now but, with the increasing interest in extreme multi-label learning, methods that also reduce the label space dimensionality are quickly catching up. The second one distinguishes the methods which reduce one space by taking into account the information carried by the other from those which perform the reduction independently. Unlike a few years ago, dependent methods predominate today: by preserving the link between the attributes and the labels, they are more efficient for the classification task. The third criterion is the presence/absence of coupling between the classifier and the dimensionality reduction strategy. Today, the two scenarios are very imbalanced and the large majority of the approaches are not coupled with the classifier. In addition to these major structuring aspects, the methods differ in two additional components: the type of transformation (implicit, explicit) that they perform and the constraints that they impose on the problem resolution. Although these differences do not distinguish the approaches on their very nature, they can heavily impact their efficiency and deployability in real-world applications.

(Table 6 notes: repeated references are associated to multiple sets of experimental comparisons; dominant methods for the two measures appear in bold; different significance thresholds (α = 0.01 to 0.1) have been tested and, except for REML at α = 0.01, the highlighted algorithms remain at the top.)
Despite all the variability highlighted by the typology, strong similarities between the problems are observable and we have introduced two generic formulations to identify them. The first one covers the problems that rely on loss functions and constraints which can be formulated as a matrix trace minimization and solved by an eigendecomposition. They represent more than half of the publications. A more general formulation covers almost all of the publications by integrating the whole set of the implemented ingredients. The combinatorics of these ingredients gives a glimpse of the variety of the approaches. Although a wide spectrum of formulations has been explored, most of the methods focus on ingredients that have interesting properties in terms of numerical optimization (e.g., differentiability, convexity, smoothness). Then, the chosen resolution method is essential because it affects the scalability (i.e., the temporal and spatial complexities) of the reduction algorithm and therefore its potential applications.
Experimental Comparisons of the Methods. In addition to the theoretical specificities of the approaches, the experimental results remain a major selection criterion. A meta-analysis has been conducted to identify the most significant algorithm performances from the numerical comparisons inventoried in the publications. The results depend on both the quality measure used and the main information guiding the dimensionality reduction (feature space versus label space or co-label and feature space). Three methods based on feature space reduction, MVMD, SSMDDM and MDDM, dominate for the two uncorrelated retained measures (Hamming Loss and a measure selected among a large set of correlated ones including Micro-F1, Macro-F1 and AUC). For the latter, the results also highlight SLEEC and REML, which are recent approaches especially designed for extreme multi-label learning. A dual examination of the domination relationships completes the analysis by pointing out the methods dominated for the two measures. However, from a methodological point of view, the generalization of the conclusions should be considered cautiously. As numerous pairwise comparisons are absent from the published experiments, the meta-analysis has been computed on a non-complete graph. Moreover, the heterogeneity of both the datasets used in the different studies and the number of times each algorithm was evaluated adds biases to the comparisons. Despite these limitations, we believe that this first meta-analysis can help identify recurrent properties in the most efficient approaches and also flaws in the experimental protocols (e.g., the lack of some pairwise comparisons). More broadly speaking, the growth of publications in machine learning will certainly foster meta-analysis procedures in the near future.
Insights for Future Research. The rich literature on dimensionality reduction for multi-label classification offers some major leads for improvement. Theoretical works on stability and robustness guarantees are still in their infancy.
In particular, robustness to sampling, to noise, to geometric transformations and to the type of data (e.g., sparse, dense) are major concerns that are almost never addressed. Furthermore, the combinatorics of the key components in the generic formulation could be exploited for future proposals. The coupling between dimensionality reduction and classification especially appears, both intuitively and in the experimental comparisons, as a promising component for improving today's state of the art. Finally, the meta-analysis opens a discussion towards the collective construction of a shared experimental protocol which should allow evaluating the performances with limited bias.
Wissam Siblini received the MEng degree from Ecole Centrale de Nantes, the MSc degree from the University Claude Bernard in Lyon, and the PhD degree from the University of Nantes in partnership with Orange Labs, Lannion, in 2018. His research interests include extreme multi-label classification, dimensionality reduction, anomaly detection, and learning from imbalanced data.
Pascale Kuntz is a tenured professor with Polytech Nantes, the Graduate School of Engineering, University of Nantes, France. She is the head of the Data Science and Decision-Making Department, Computer Science Laboratory of Nantes and vice-president of the French-speaking Classification Society. Her research interests include classification, exploratory data analysis, and digital humanities.
Frank Meyer received the MS degree in computer science from the University of Montpellier, France, in 1995 and the PhD degree in computer science from the University of Grenoble, France, in 2012. He is a senior research engineer and project manager in a team of data scientists at Orange Labs in Lannion, France. He has worked in fraud detection, marketing segmentation and targeting, and automatic recommender systems. His current research interests include semi-supervised learning, text-mining, real time and interactive learning, and extreme multi-label learning.