Student Sliced Inverse Regression

Sliced Inverse Regression (SIR) has been extensively used to reduce the dimension of the predictor space before performing regression. SIR is originally a model-free method but it has been shown to actually correspond to the maximum likelihood of an inverse regression model with Gaussian errors. This intrinsic Gaussianity of standard SIR may explain its high sensitivity to outliers as observed in a number of studies. To improve robustness, the inverse regression formulation of SIR is therefore extended to non-Gaussian errors with heavy-tailed distributions. Considering Student distributed errors, it is shown that the inverse regression remains tractable via an Expectation-Maximization (EM) algorithm. The algorithm is outlined and tested in the presence of outliers, both on simulated and real data, showing improved results in comparison to a number of other existing approaches.


Introduction
Let us consider a regression setting where the goal is to estimate the relationship between a univariate response variable $Y$ and a predictor $X$. When the dimension $p$ of the predictor space is 1 or 2, a simple 2D or 3D plot can visually reveal the relationship and can be useful to determine the regression strategy to be used. If $p$ becomes large, such an approach is not feasible. A possibility to overcome problems arising in this context is to assume that the response variable does not depend on the whole predictor space but only on a projection of $X$ onto a subspace of smaller dimension. Such a dimensionality reduction leads to the concept of sufficient dimension reduction and to that of central subspace [1]. The central subspace is the intersection of all dimension-reduction subspaces (d.r.s.). A subspace $S$ is a d.r.s. if $Y$ is independent of $X$ given $P_S X$, where $P_S$ is the orthogonal projection onto $S$. In other words, all the information carried by the predictors $X$ on $Y$ can be compressed into $P_S X$. It has been shown under weak assumptions that the intersection of all d.r.s., and therefore the central subspace, is itself a d.r.s. [2]. It is of particular interest to develop methods to estimate the central subspace since, once it is identified, the regression problem can be solved equivalently using the lower-dimensional representation $P_S X$ of $X$ in the subspace.
Among the methods that lead to an estimation of the central subspace, Sliced Inverse Regression (SIR) [3] is one of the most popular. SIR is a semiparametric method assuming that the link function depends on $d$ linear combinations of the predictors and a random error $\varepsilon$ independent of $X$: $Y = f(\beta_1^T X, \ldots, \beta_d^T X, \varepsilon)$. When this model holds, the projection of $X$ onto the space spanned by the vectors $\{\beta_i,\ i = 1, \ldots, d\}$ captures all the information about $Y$. In addition, [3] shows that a basis of this space can be recovered using an inverse regression strategy provided that the so-called linearity condition holds. It has been shown that the linearity condition is satisfied as soon as $X$ is elliptically distributed. Moreover, this condition approximately holds in high-dimensional datasets, see [4]. However, solutions have been proposed to deal with non-elliptically distributed predictors and to overcome the linearity condition limitation [5,6,7].
The inverse regression approach to dimensionality reduction then rapidly gained attention [8] and was generalized in [9], which shows the link between the axes spanning the central subspace and an inverse regression problem with Gaussian distributed errors. More specifically, in [10,9] it appears that, for a Gaussian error term and under appropriate conditions, the SIR estimator can be recovered as the maximum likelihood estimator of the parameters of an inverse regression model. In other words, although SIR is originally a model-free method, the standard SIR estimates are shown to correspond to maximum likelihood estimators for a Gaussian inverse regression model. It is then not surprising that SIR has been observed, e.g. in [11], to perform best under normality and that its performance may degrade otherwise. Indeed, the Gaussian distribution is known to have tails too light to properly accommodate extreme values. In particular, [12] observes that SIR is highly sensitive to outliers, with additional studies, evidence and analysis given in [13]. To mitigate this sensitivity, robust versions of SIR have been proposed, mainly starting from the standard model-free estimators and trying to make them more resistant to outliers. Typically, in [14] classical estimators are replaced by high-breakdown robust estimators and, recently, two approaches were proposed in [15]: a weighted version of SIR and a solution based on the intra-slice multivariate median estimator.
As an alternative, we propose to rather exploit the inverse regression formulation of SIR [10,9]. A new error term modeled by a multivariate Student distribution [16] is introduced. Among the elliptically contoured distributions, the multivariate Student is a natural generalization of the multivariate Gaussian but its heavy tails can better accommodate outliers. The result in Proposition 6 of [9] is extended from Gaussian to Student errors, showing that the inverse regression approach of SIR remains valid outside the Gaussian case, meaning that the central subspace can still be estimated by maximum likelihood estimation of the inverse regression parameters. It is then shown that the computation of the maximum likelihood estimators remains tractable in the Student case via an Expectation-Maximization (EM) algorithm which has a simple implementation and desirable properties.
The paper is organized as follows. In Section 2 general properties of the multivariate Student distribution and some of its variants are first recalled. The inverse regression model is introduced in Section 3 followed by the EM strategy to find the maximum likelihood estimator, the link with SIR and the resulting Student SIR algorithm. A simulation study is carried out in Section 4 and a real data application, showing the interest of this technique, is detailed in Section 5. The final section contains concluding remarks and perspectives. Proofs are postponed to the Appendix.

Multivariate generalized Student distributions
Multivariate Student distributions, also called t-distributions, are useful when dealing with real data because of their heavy tails. They are a robust alternative to the Gaussian distribution, which is known to be very sensitive to outliers. In contrast to the Gaussian case though, no closed-form solution exists for the maximum likelihood estimation of the parameters of the t-distribution. Tractability is, however, maintained both in the univariate and multivariate case via the EM algorithm [17], thanks to a useful representation of the t-distribution as a so-called infinite mixture of scaled Gaussians or Gaussian scale mixture [18]. A Gaussian scale mixture distribution has a probability density function of the form
$$p(x; \mu, \Sigma, \psi) = \int_0^\infty \mathcal{N}_p(x; \mu, \Sigma/u)\, f_U(u; \psi)\, du, \qquad (1)$$
where $\mathcal{N}_p(\,\cdot\,; \mu, \Sigma/u)$ denotes the density function of the $p$-dimensional Gaussian distribution with mean $\mu$ and covariance $\Sigma/u$, and $f_U$ is the probability distribution of a univariate positive variable $U$ referred to hereafter as the weight variable. When $f_U$ is a Gamma distribution $\mathcal{G}(\nu/2, \nu/2)$, where $\nu$ denotes the degrees of freedom, expression (1) leads to the standard $p$-dimensional t-distribution denoted by $t_p(x; \mu, \Sigma, \nu)$ with parameters $\mu$ (location vector), $\Sigma$ ($p \times p$ positive definite scale matrix) and $\nu$ (positive degrees of freedom parameter). Its density is given by
$$t_p(x; \mu, \Sigma, \nu) = \frac{\Gamma((\nu+p)/2)}{\Gamma(\nu/2)\,(\pi\nu)^{p/2}\,|\Sigma|^{1/2}} \left[1 + \frac{\delta(x, \mu, \Sigma)}{\nu}\right]^{-(\nu+p)/2}, \qquad (2)$$
where $\delta(x, \mu, \Sigma) = (x - \mu)^T \Sigma^{-1} (x - \mu)$ is the Mahalanobis distance between $x$ and $\mu$. The Gamma distribution has probability density function $\mathcal{G}(u; \alpha, \gamma) = u^{\alpha-1}\, \Gamma(\alpha)^{-1} \exp(-\gamma u)\, \gamma^{\alpha}$, where $\Gamma$ denotes the Gamma function.
If $f_U(u; \psi)$ is set equal to a Gamma distribution $\mathcal{G}(\alpha, \gamma)$ without imposing $\alpha = \gamma$, (1) results in a multivariate Pearson type VII distribution (see e.g. [19] vol. 2, chap. 28), also referred to as Arellano-Valle and Bolfarine's Generalized t distribution in [16]. This generalized version is the multivariate version of the t-distribution considered in this work; its density is given by
$$\mathcal{S}_p(x; \mu, \Sigma, \alpha, \gamma) = \frac{\Gamma(\alpha + p/2)}{\Gamma(\alpha)\,(2\pi\gamma)^{p/2}\,|\Sigma|^{1/2}} \left[1 + \frac{\delta(x, \mu, \Sigma)}{2\gamma}\right]^{-(\alpha + p/2)}. \qquad (4)$$
For a random variable $X$ following distribution (4), an equivalent representation useful for simulation is $X = \mu + U^{-1/2}\tilde{X}$, where $U$ follows a $\mathcal{G}(\alpha, \gamma)$ distribution and $\tilde{X}$ follows a $\mathcal{N}(0, \Sigma)$ distribution.
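The stochastic representation above translates directly into a sampler. The sketch below (Python with NumPy; the function name `sample_generalized_student` is ours, not from the paper) draws the weight $U \sim \mathcal{G}(\alpha, \gamma)$ first and then a Gaussian vector scaled by $U^{-1/2}$:

```python
import numpy as np

def sample_generalized_student(n, mu, Sigma, alpha, gamma=1.0, rng=None):
    """Draw n samples from the Pearson type VII / generalized Student
    distribution via its Gaussian scale mixture representation:
    X = mu + U**(-1/2) * Xtilde, with U ~ Gamma(alpha, rate=gamma)
    and Xtilde ~ N(0, Sigma)."""
    rng = np.random.default_rng(rng)
    p = len(mu)
    # numpy parameterizes Gamma by shape and scale = 1/rate
    u = rng.gamma(shape=alpha, scale=1.0 / gamma, size=n)
    z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return np.asarray(mu) + z / np.sqrt(u)[:, None]
```

With $\alpha = \gamma = \nu/2$ this reduces to sampling the standard $t_p(\mu, \Sigma, \nu)$ distribution.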
Remark 1 (Identifiability). The expression (4) depends on $\gamma$ and $\Sigma$ only through the product $\gamma\Sigma$, which means that an additional constraint is required to make the parameterization unique. One possibility is to impose that $\Sigma$ has determinant 1. It is easy to see that this is equivalent to having an unconstrained $\Sigma$ with $\gamma = 1$.
Unconstrained parameters are easier to deal with in inference algorithms. Therefore, we will rather assume without loss of generality that $\gamma = 1$, with the notation $\mathcal{S}_p(0, V, \alpha, 1) \equiv \mathcal{S}_p(0, V, \alpha)$ adopted in the next section.

Student Sliced Inverse Regression
Let $X \in \mathbb{R}^p$ be a random vector, $Y \in \mathbb{R}$ the real response variable and $S_{Y|X}$ the central subspace, spanned by the columns of the matrix $\beta \in \mathbb{R}^{p \times d}$. In the following, it is assumed that $\dim(S_{Y|X}) = d$, where $d$ is known and $d \leq p$. To address the estimation of the central subspace, we consider the inverse regression formulation of [9], which models the link from $Y$ to $X$. In addition to being a simpler regression problem, the inverse regression approach is of great interest because Proposition 6 in [9] states that, in the Gaussian case, an estimation of the central subspace is provided by the estimation of the inverse regression parameters. In Subsection 3.1, the inverse regression model of [9] is extended by considering Student distributed errors. It is then shown in Subsection 3.2 that the estimation of the extended model is tractable via an Expectation-Maximization (EM) algorithm. A link with SIR is presented in Subsection 3.3 and the resulting Student SIR algorithm is described in Subsection 3.4.

Student multi-index inverse regression model
In the spirit of [9,10], the following inverse regression model is considered:
$$X = \mu + VB\,c(Y) + \varepsilon, \qquad (5)$$
where $\mu \in \mathbb{R}^p$ is a non-random vector, $B$ is a non-random $p \times d$ matrix with $B^T B = I_d$, $\varepsilon \in \mathbb{R}^p$ is a centered generalized Student random vector following the distribution given in (4) with scale matrix $V$, $\varepsilon$ is assumed independent of $Y$, and $c: \mathbb{R} \to \mathbb{R}^d$ is a non-random function. It directly follows from (5) that
$$E[X \mid Y = y] = \mu + VB\,c(y), \qquad (6)$$
and thus, after translation by $\mu$, the conditional expectation of $X$ given $Y$ is a random vector located in the space spanned by the columns of $VB$. When $\varepsilon$ is assumed to be Gaussian distributed, Proposition 6 in [9] states that $B$ corresponds to the directions of the central subspace $\beta$. In [9,10], it appears then that, under appropriate conditions, the maximum likelihood estimator of $B$ is (up to a full rank linear transformation) the SIR estimator of $\beta$, i.e. $\mathrm{Span}\{B\} = \mathrm{Span}\{\beta\}$. Proposition 6 in [9] can be generalized to our Student setting, so that $B$ still corresponds to the central subspace. The generalization is given below.
Proposition 1. Let $X_y$ be a random variable distributed as $X \mid Y = y$, and let us assume that
$$X_y = \mu + VB\,c(y) + \varepsilon, \qquad (7)$$
with $\varepsilon$ following a generalized Student distribution $\mathcal{S}_p(0, V, \alpha)$, $c(y) \in \mathbb{R}^d$ a function of $y$, and $VB$ a $p \times d$ matrix of rank $d$. Under model (7), the distribution of $Y \mid X = x$ is the same as the distribution of $Y \mid B^T X = B^T x$ for all values $x$.
The proof is given in Appendix 7.1. According to this proposition, $X$ can be replaced by $B^T X$ without loss of information on the regression of $Y$ on $X$.
A procedure to estimate $B$ is proposed in the next section.

Maximum likelihood estimation via EM algorithm
Let $(X_i, Y_i)$, $i = 1, \ldots, n$, be a set of independent random variables distributed according to the distribution of $(X, Y)$ as defined in (5). The unknown quantities to be estimated in model (5) are $\mu$, $V$, $B$ and the function $c$. To make the estimation tractable, $c$ is expanded on a set of $h$ basis functions $s_1, \ldots, s_h$ as
$$c_k(y) = \sum_{j=1}^{h} c_{jk}\, s_j(y), \quad k = 1, \ldots, d,$$
where the coefficients $c_{jk}$, $j = 1, \ldots, h$ and $k = 1, \ldots, d$, are unknown and to be estimated, while $h$ is supposed to be known. Let $C$ be the $h \times d$ matrix with $k$th column given by $(c_{1k}, \ldots, c_{hk})^T$ and $s(\cdot) = (s_1(\cdot), \ldots, s_h(\cdot))^T$. Then, model (5) can be rewritten as
$$X = \mu + VBC^T s(Y) + \varepsilon. \qquad (9)$$
The density of the generalized Student distribution is available in closed form and given in (4). However, to perform the estimation, a more useful representation of this distribution is given by its Gaussian scale mixture representation (3). Introducing an additional set of latent variables $U = \{U_1, \ldots, U_n\}$ with $U_i$ independent of $Y_i$, one can equivalently write
$$X_i \mid Y_i = y_i,\, U_i = u_i \sim \mathcal{N}_p\!\left(\mu + VBC^T s(y_i),\, V/u_i\right), \quad U_i \sim \mathcal{G}(\alpha, 1).$$
Let us denote by $\theta = \{\mu, V, B, C, \alpha\}$ the parameters to estimate from realizations $\{x_i, y_i,\ i = 1, \ldots, n\}$. In contrast to the Gaussian case, the maximum likelihood estimates are not available in closed form for the t-distributions. However, they are reachable using an Expectation-Maximization (EM) algorithm. More specifically, at iteration $(t)$ of the algorithm, $\theta$ is updated from a current value $\theta^{(t-1)}$ to a new value $\theta^{(t)}$ defined as $\theta^{(t)} = \arg\max_\theta Q(\theta, \theta^{(t-1)})$. Considering the scale mixture representation above, a natural choice for $Q$ is the following expected value of the complete log-likelihood:
$$Q(\theta, \theta^{(t-1)}) = E\!\left[\log p(X, U \mid Y; \theta) \mid X, Y; \theta^{(t-1)}\right]. \qquad (12)$$
Note that all computations are made conditionally on the $Y_i$'s and no assumption is made on the distribution of the $Y_i$'s. The E-step therefore consists of computing the quantities
$$\bar{u}_i^{(t)} = E[U_i \mid x_i, y_i; \theta^{(t-1)}] \qquad (13)$$
and
$$\tilde{u}_i^{(t)} = E[\log U_i \mid x_i, y_i; \theta^{(t-1)}], \qquad (14)$$
while the M-step divides into two independent M-steps involving separately the parameters $(\mu, V, B, C)$ and $\alpha$. The second quantity (14) is needed only in the estimation of $\alpha$. The sample means $\bar{u} = \frac{1}{n}\sum_{i=1}^n \bar{u}_i^{(t)}$ and $\tilde{u} = \frac{1}{n}\sum_{i=1}^n \tilde{u}_i^{(t)}$ are also used in the sequel.
E-step. The quantities (13) and (14) above require the posterior distribution of the $U_i$'s.
This distribution can be easily determined using the well-known conjugacy of the Gamma and Gaussian distributions for the mean. It follows from standard Bayesian computations that the posterior distribution is still a Gamma distribution, namely
$$U_i \mid X_i = x_i, Y_i = y_i;\, \theta^{(t-1)} \sim \mathcal{G}\!\left(\alpha^{(t-1)} + \frac{p}{2},\ 1 + \frac{\delta(x_i, \mu_i^{(t-1)}, V^{(t-1)})}{2}\right),$$
where $\mu_i^{(t-1)} = \mu^{(t-1)} + V^{(t-1)} B^{(t-1)} C^{(t-1)T} s(y_i)$. The required moments (13) and (14) are well known for a Gamma distribution, so that
$$\bar{u}_i^{(t)} = \frac{\alpha^{(t-1)} + p/2}{1 + \delta(x_i, \mu_i^{(t-1)}, V^{(t-1)})/2}, \qquad \tilde{u}_i^{(t)} = \Psi\!\left(\alpha^{(t-1)} + \frac{p}{2}\right) - \log\!\left(1 + \frac{\delta(x_i, \mu_i^{(t-1)}, V^{(t-1)})}{2}\right),$$
where $\Psi$ is the Digamma function. As will become clear in the following M-step, $\bar{u}_i^{(t)}$ acts as a weight for $x_i$: whenever the Mahalanobis distance of $x_i$ to its conditional mean increases, $\bar{u}_i^{(t)}$ decreases and the influence of $x_i$ in the estimation of the parameters is downweighted at the next iteration. The idea of using weights to handle outliers is common in the literature; Weighted Inverse Regression (WIRE) [15] assigns weights through a deterministic kernel function to ensure the existence of the first moment. Our approach does not require prior knowledge to select an appropriate kernel and relies on the wide family of t-distributions (which contains the Cauchy distribution, for which the first moment is not defined).
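The E-step formulas above can be sketched in a few lines (Python with NumPy and SciPy, under the $\gamma = 1$ convention of Remark 1; the function name and argument layout are ours, not the paper's):

```python
import numpy as np
from scipy.special import digamma

def e_step_weights(X, means, V, alpha):
    """E-step of Student SIR: the posterior of U_i given (x_i, y_i) is
    Gamma(alpha + p/2, 1 + delta_i/2), where delta_i is the Mahalanobis
    distance of x_i to its conditional mean mu + V B C^T s(y_i).
    Returns (u_bar, u_tilde) = (E[U_i | x_i, y_i], E[log U_i | x_i, y_i])."""
    p = X.shape[1]
    diff = X - means                                   # x_i - mu_i, row-wise
    delta = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(V), diff)
    a_post = alpha + p / 2.0                           # posterior shape
    b_post = 1.0 + delta / 2.0                         # posterior rate
    u_bar = a_post / b_post                            # Gamma mean
    u_tilde = digamma(a_post) - np.log(b_post)         # Gamma log-moment
    return u_bar, u_tilde
```

Observations far from their conditional mean get a large `delta` and hence a small weight `u_bar`, which is exactly the downweighting mechanism described above.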
M-step. The M-step divides into the following two independent sub-steps.
M-$(\mu, V, B, C)$ sub-step. Omitting terms that do not depend on the parameters in (12), estimating $(\mu, V, B, C)$ by maximization of $Q$ consists, at iteration $(t)$, of minimizing with respect to $(\mu, V, B, C)$ the following function $G$:
$$G(\mu, V, B, C) = n \log |V| + \sum_{i=1}^{n} \bar{u}_i^{(t)} \left(x_i - \mu - VBC^T s(y_i)\right)^T V^{-1} \left(x_i - \mu - VBC^T s(y_i)\right).$$
To this aim, let us introduce (omitting the iteration index $(t)$ in the notation) the $h \times h$ weighted covariance matrix $W$ of $s(Y)$ defined by
$$W = \frac{1}{n} \sum_{i=1}^{n} \bar{u}_i \left(s(y_i) - \bar{s}\right)\left(s(y_i) - \bar{s}\right)^T, \quad \text{with } \bar{s} = \frac{\sum_{i=1}^n \bar{u}_i\, s(y_i)}{\sum_{i=1}^n \bar{u}_i}.$$
We then derive the following lemma.
Lemma 1. Using the above notations, $G(\mu, V, B, C)$ can be rewritten in terms of the weighted moments introduced above. The proof is given in Appendix 7.2. Thanks to this representation of $G(\cdot)$, it is possible to derive the following proposition, which generalizes to the multi-index case and the Student setting the result obtained for Gaussian errors in [10].
The proof is detailed in Appendix 7.3. Regarding the parameter $\alpha$, it can be updated using an independent part of $Q$, as detailed in the next M-step.

M-α substep.
Parameter $\alpha$ can be estimated by maximizing independently with respect to $\alpha$ the remaining part of $Q$, which depends on $\alpha$ only through terms of the form $\alpha\, \tilde{u}_i^{(t)} - \log \Gamma(\alpha)$. Setting the derivative with respect to $\alpha$ to zero, we obtain $\hat{\alpha} = \Psi^{-1}(\tilde{u})$, where $\Psi(\cdot)$ is the Digamma function and $\tilde{u} = \frac{1}{n}\sum_{i=1}^n \tilde{u}_i^{(t)}$.
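Since $\Psi$ has no closed-form inverse, $\hat{\alpha} = \Psi^{-1}(\tilde{u})$ must be computed numerically. A minimal sketch using root bracketing (Python with SciPy; the paper does not prescribe a particular solver, so `brentq` is an illustrative choice):

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def inverse_digamma(y):
    """Solve Psi(alpha) = y for alpha > 0.
    digamma is increasing on (0, inf), from -inf to +inf,
    so a sign change is guaranteed on a wide enough bracket."""
    return brentq(lambda a: digamma(a) - y, 1e-8, 1e8)
```

In the M-$\alpha$ sub-step one would then call `alpha_hat = inverse_digamma(np.mean(u_tilde))` with `u_tilde` the E-step quantities $\tilde{u}_i^{(t)}$.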
In practice, for the procedure to be complete, the choice of the $h$ basis functions $s_j$ needs to be specified. Many possibilities are available in the literature, such as classical Fourier series, polynomials, etc. In the next section, we discuss a choice of basis functions which provides the connection with Sliced Inverse Regression (SIR) [3].

Connection to Sliced Inverse Regression
As in the Gaussian case [9,10], a clear connection with SIR can be established for a specific choice of the $h$ basis functions. When $Y$ is univariate, a natural approach is to first partition the range of $Y$ into $h+1$ bins $S_j$, $j = 1, \ldots, h+1$, also referred to as slices, and then to define $h$ basis functions from the first $h$ slices as
$$s_j(y) = \mathrm{1\!I}\{y \in S_j\}, \quad j = 1, \ldots, h, \qquad (23)$$
where $\mathrm{1\!I}$ is the indicator function. Note that it is important to remove one of the slices so that the basis functions remain linearly independent. However, the following related quantities are defined for $j = 1, \ldots, h+1$:
$$n_j = \sum_{i=1}^{n} \bar{u}_i\, \mathrm{1\!I}\{y_i \in S_j\} \quad \text{and} \quad f_j = \frac{n_j}{\sum_{i=1}^n \bar{u}_i}.$$
They represent respectively the number of $y_i$ in slice $j$ weighted by the $\bar{u}_i$ and the corresponding weighted proportion in slice $j$. The following weighted mean of $X$ given $Y \in S_j$ is then denoted by
$$\bar{x}_j = \frac{1}{n_j} \sum_{i=1}^{n} \bar{u}_i\, x_i\, \mathrm{1\!I}\{y_i \in S_j\},$$
and the $p \times p$ "between slices" covariance matrix by
$$\Gamma = \sum_{j=1}^{h+1} f_j\, (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T,$$
where $\bar{x}$ denotes the overall weighted mean of the $x_i$. In this context, the following consequence of Proposition 2 can be established.
Corollary 1. Under (9) and (23), if $\Sigma$ is regular, then the updated estimation $\hat{B}$ of $B$ is given by the eigenvectors associated with the largest eigenvalues of $\Sigma^{-1}\Gamma$. The proof is given in Appendix 7.4. When all $\bar{u}_i = 1$, the iterative EM algorithm reduces to one M-step and the quantities defined in this section correspond to the standard SIR estimators. The EM algorithm resulting from this choice of basis functions is referred to as the Student SIR algorithm. It is outlined in the next section.
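The weighted-slice update of Corollary 1 can be illustrated as follows. This is a simplified sketch, not the authors' Matlab implementation: it slices $y$ by quantiles, forms a weighted covariance $\Sigma$ and between-slice matrix $\Gamma$, and returns the $d$ leading eigenvectors of $\Sigma^{-1}\Gamma$; with all weights equal to 1 it reduces to standard SIR:

```python
import numpy as np

def student_sir_directions(X, y, u_bar, h, d):
    """One B-update of Student SIR: weighted slice means, between-slice
    covariance Gamma, and the d leading eigenvectors of Sigma^{-1} Gamma."""
    n, p = X.shape
    w = u_bar / u_bar.sum()                      # normalized weights
    xbar = w @ X                                 # weighted overall mean
    Xc = X - xbar
    Sigma = (w[:, None] * Xc).T @ Xc             # weighted covariance of X
    edges = np.quantile(y, np.linspace(0, 1, h + 1))
    Gamma = np.zeros((p, p))
    for j in range(h):
        upper = y <= edges[j + 1] if j == h - 1 else y < edges[j + 1]
        in_j = (y >= edges[j]) & upper           # membership of slice j
        f_j = w[in_j].sum()                      # weighted slice proportion
        if f_j == 0:
            continue
        m_j = (w[in_j] @ X[in_j]) / f_j - xbar   # centered weighted slice mean
        Gamma += f_j * np.outer(m_j, m_j)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sigma, Gamma))
    order = np.argsort(evals.real)[::-1]         # sort by decreasing eigenvalue
    return np.real(evecs[:, order[:d]])
```

Calling it with `u_bar = np.ones(n)` gives the standard SIR directions, as stated above.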

Central subspace estimation via Student SIR algorithm
The EM algorithm can be outlined using Proposition 2 and Corollary 1. It relies on two additional features to be specified: initialization and stopping rule. As the algorithm alternates the E and M steps, it is equivalent to start with either of these steps. It is convenient to start with the Maximization step since the initialization of the quantities $\bar{u}_i$, $\tilde{u}_i$ can be better interpreted: if $\bar{u}_i$ is constant and $\tilde{u}_i = 0$, the first M-step of the algorithm amounts to performing standard SIR. Regarding an appropriate stopping rule, a fundamental property of EM is to increase the log-likelihood at each iteration. A standard criterion is then the relative increase in log-likelihood, denoted by $\Delta(\theta^{(t)}, \theta^{(t-1)})$, between two iterations. At each iteration, for the current parameter values, the log-likelihood is easy to compute using (4) and (9). Another natural criterion is to assess when the parameter estimation stabilizes. Typically, focusing on the central subspace $B$, the following proximity measure [20,21] can be considered:
$$r(B, \hat{B}) = \frac{\mathrm{trace}(P_B P_{\hat{B}})}{d}, \qquad (26)$$
where $P_B$ and $P_{\hat{B}}$ denote the orthogonal projections onto the subspaces spanned by the columns of $B$ and $\hat{B}$. The quantity $r$ ranges from 0 to 1 and evaluates the distance between the two subspaces. If $d = 1$, $r$ is the squared cosine between the two spanning vectors. Although not directly related to the EM algorithm, in practice this criterion gave similar results in terms of parameter estimation. Experiments on simulated and real data are reported in the next two sections.
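The proximity measure (26) is straightforward to compute from orthonormal bases of the two subspaces, e.g. (a small Python helper, name ours):

```python
import numpy as np

def proximity(B1, B2):
    """Proximity r between Span(B1) and Span(B2): r = trace(P1 P2) / d,
    with P the orthogonal projector onto the span. r lies in [0, 1] and
    equals the squared cosine between spanning vectors when d = 1."""
    def proj(B):
        Q, _ = np.linalg.qr(B)    # orthonormal basis of Span(B)
        return Q @ Q.T
    d = B1.shape[1]
    return np.trace(proj(B1) @ proj(B2)) / d
```

Identical subspaces give $r = 1$ and orthogonal subspaces give $r = 0$.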

Determination of the central subspace dimension
Determining the dimension $d$ of the central subspace is an important issue for which different solutions have been proposed in the literature. Most users rely on graphical considerations, e.g. [22]. A more quantitative approach is to use cross-validation after the link function is found, although in that case $d$ may vary depending on the specific regression approach that the user selected. Other methods that can easily be used on real data are mainly based on (sequential) tests [3,23,24,20,25,11]. An alternative that uses a penalized likelihood criterion has been proposed in [26]. In our setting, formulated as a maximum likelihood problem, the penalized likelihood approach is the most natural. For a given value $d$ of the central subspace dimension, we therefore propose to compute the Bayesian information criterion [27]
$$\mathrm{BIC}(d) = -2 L(d) + \eta(d) \log n,$$
where $\eta(d)$ is the number of free parameters in the model and $L(d)$ is the maximized log-likelihood computed at the parameter values obtained at convergence of the EM algorithm. Computing $L(d)$ is a straightforward byproduct of the algorithm described above as this quantity is already used in our stopping criterion. Following the BIC principle, an estimator of $d$ can then be defined as the minimizer of $\mathrm{BIC}(d)$ over $d \in \{1, \ldots, \min(p, h)\}$. The performance of this criterion is investigated in the simulation study in Section 4 and used on the real data example of Section 5. The simulation study reveals that BIC can provide correct selections but requires large enough sample sizes. This limitation has already been pointed out in the literature (see e.g. [28]).
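The BIC selection rule can be sketched in a few lines (Python; the number of free parameters $\eta(d)$ must be supplied by the caller since it depends on the chosen parameterization):

```python
import math

def bic(loglik, n_free_params, n):
    """BIC(d) = -2 L(d) + eta(d) * log n (smaller is better)."""
    return -2.0 * loglik + n_free_params * math.log(n)

def select_dimension(candidates, n):
    """candidates: list of (d, loglik, eta) triples, one per fitted model.
    Returns the dimension d minimizing BIC."""
    return min(candidates, key=lambda c: bic(c[1], c[2], n))[0]
```

For example, `select_dimension([(1, -120.0, 8), (2, -100.0, 14), (3, -99.0, 20)], n=200)` picks $d = 2$: the gain in log-likelihood from $d = 2$ to $d = 3$ does not offset the penalty.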

Simulation study
Student SIR is tested on simulated data under a variety of models and distributions for the $p$-dimensional random variable $X$. Its behavior is compared to SIR and to four other techniques from the literature that claim some robustness. For comparison, the simulation setup described in [15,14] is adopted.

Simulation setup
Three different regression models are considered, among them
$$\mathrm{III}: \quad Y = \frac{X_1}{0.5 + (X_2 + 1.5)^2} + 0.2\,\varepsilon,$$
where $\varepsilon$ follows a standard normal distribution. The three models are combined with three possible distributions for the predictors $X$: (i) $X$ is multivariate normal with mean vector $0$ and covariance matrix with entries $\sigma_{ij} = 0.5^{|i-j|}$; (ii) $X$ is standard multivariate Cauchy; (iii) $X = (X_1, \ldots, X_p)^T$, where each $X_i$ is generated independently from a mixture of normal and uniform distributions, $0.8\,\mathcal{N}(0, 1) + 0.2\,\mathcal{U}(-\nu, \nu)$, where $\nu$ is a positive scalar.
Models I and III are homoscedastic while model II is heteroscedastic. Case (ii) is built to test the sensitivity to outliers while the distribution of $X$ remains elliptical. In (iii), a non-elliptical distribution of $X$ is considered. The dimension is set to $p = 10$; the dimension of the e.d.r. space is $d = 1$ for I, II and $d = 2$ for III. The nine different configurations of $X$ and $Y$ are simulated with a number of samples varying depending on the experiment. In all tables, Student SIR is compared with standard SIR and four other approaches. Contour Projection (CP-SIR) [29,30] applies the SIR procedure to a rescaled version of the predictors. Weighted Canonical Correlation (WCAN) [31] uses a basis of B-splines, first estimating the dimension $d$ of the central subspace and then the directions from the nonzero robustified versions of the correlation matrices between the predictors and the B-spline basis functions. The idea of Weighted Inverse Regression (WIRE) [15] is to use a different weight function capable of dealing with both outliers and inliers; SIR is a particular case of WIRE with constant weight function. Sliced Inverse Median Estimation (SIME) replaces the intra-slice mean estimator with the median, which is well known to be more robust. All values referring to CP-SIR, WCAN, WIRE and SIME in the tables are directly extracted from [15]. Values relative to SIR have been recomputed using [32].
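For reproducibility, the three predictor designs (i)-(iii) can be simulated as below (a Python sketch with NumPy; the function name is ours). The Cauchy case uses the fact that a standard multivariate Cauchy is a multivariate Student with one degree of freedom:

```python
import numpy as np

def gen_predictors(n, p, case, nu=0.5, rng=None):
    """Simulate the predictor designs of the simulation study:
    (i)   Gaussian with covariance sigma_ij = 0.5**|i-j|,
    (ii)  standard multivariate Cauchy (Student, 1 df),
    (iii) iid components from 0.8*N(0,1) + 0.2*U(-nu, nu)."""
    rng = np.random.default_rng(rng)
    if case == 'i':
        idx = np.arange(p)
        Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
        return rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    if case == 'ii':
        z = rng.standard_normal((n, p))
        u = rng.chisquare(1, size=n)            # Cauchy = z / sqrt(chi2_1)
        return z / np.sqrt(u)[:, None]
    if case == 'iii':
        is_normal = rng.random((n, p)) < 0.8    # mixture component indicator
        norm = rng.standard_normal((n, p))
        unif = rng.uniform(-nu, nu, size=(n, p))
        return np.where(is_normal, norm, unif)
    raise ValueError(case)
```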

Results
To assess the sensitivity of the compared methods to different setting parameters, four sets of tests are carried out and reported in Tables 1 and 2. First, the 9 configurations of $X$ and $Y$ models are tested for fixed sample size $n = 200$, number of slices $h = 5$ and $p = 10$ (Table 1 (a)). Then, the effect of the sample size is illustrated for model I (Table 1 (b)). The number of slices is varied to evaluate the sensitivity to the value of $h$ (Table 2 (a)) and, at last, different values of $\nu$ are tested in the case (iii) setting (Table 2 (b)). In all cases and tables, the performance of the different methods is assessed through their ability to recover the central subspace, measured via the proximity measure $r$ (26).
Student SIR shows its capability to deal with different configurations: the proximity criterion (26) in Table 1 (a) is very close to one for the first two regression models, independently of the distribution of the predictors.
In the Gaussian case, Student SIR and SIR perform equally well, showing that our approach has no undesirable effects when dealing with simple cases. For configuration III-(iii), a slightly different value was found for SIR compared to [15]. In this configuration, however, the trend is clear: standard SIR, Student SIR, WIRE and SIME show similar performance. In contrast, configurations I-(ii), II-(ii) and III-(ii) illustrate that Student SIR can significantly outperform SIR. This is not surprising since the standard multivariate Cauchy has heavy tails and SIR is sensitive to outliers [14]. Table 1 (b) illustrates on model I the effect of the sample size $n$: Student SIR exhibits the best performance among all methods. It is interesting to observe that, in case (ii), the small value of $r$ for standard SIR does not depend on the sample size $n$. In contrast, adding observations results in a better estimation for Student SIR.
It is known that SIR is not very sensitive to the number of slices $h$ [22]. In Table 2 (a), an analysis is performed with varying $h$. Student SIR appears likewise not very sensitive to the number of slices.
Extra inliers as well as outliers can affect the estimation. In case (iii), the parameter $\nu$ controls the magnitude of the extra observations. For different values $\nu = 0.5, 0.2, 0.1, 0.05$, Table 2 (b) shows that both SIR and Student SIR are robust to inliers, while CP-SIR and WCAN fail when $\nu$ is small and the extra observations behave as inliers concentrated around the average.
In addition, a study on the behavior of SIR and Student SIR when $X$ follows a standard multivariate Student distribution, with different degrees of freedom (df), is shown in Table 3 (a). The multivariate Cauchy of case (ii) coincides with the multivariate Student with one degree of freedom. This setting is favorable to our model, which is designed to handle heavy tails. Not surprisingly, Student SIR provides better results for small degrees of freedom, but the difference with SIR reduces as the degrees of freedom increase and the multivariate Student gets closer to a Gaussian. The standard deviation follows the same trend. In case III-(ii) the convergence of SIR becomes extremely slow. Regarding computational time, results are reported in Table 3 (b). Student SIR performs multiple iterations, which increases computational time compared to SIR. It is interesting that, in the cases in which SIR fails (I-(ii), II-(ii), III-(ii), see Table 1), the convergence of Student SIR is fast, requiring less than a second on a standard laptop (our Matlab code is available at https://hal.inria.fr/hal-01294982). All reported results have been obtained using a threshold of 0.01 for the relative increase of the log-likelihood.
At last, the use of BIC as a selection criterion for the central subspace dimension $d$ is investigated. As an illustration, the last column of Table 3 (b) shows the number of times the criterion succeeded in selecting the correct dimension (i.e. $d = 2$ in this example) over 200 repetitions. BIC performs very well provided the sample size is large enough, this phenomenon becoming more critical as the number of outlying data increases. This is not surprising as this limitation of BIC has often been reported in the literature.
To summarize, through these simulations Student SIR shows good performance, outperforming SIR when the distribution of $X$ is heavy-tailed (case (ii)) and preserving properties characteristic of SIR such as insensitivity to the number of slices and robustness to inliers.

Data
The Galaxy dataset contains $n = 362{,}887$ different galaxies. This dataset has already been used in [33] with a preprocessing based on expert supervision to remove outliers. In this study, all the original observations are considered, removing only points with missing values, which requires no expertise. The response variable $Y$ is the stellar formation rate. The predictor $X$ consists of spectral characteristics of the galaxies and is of dimension $p = 46$.

Evaluation setting
The number of samples $n$ is very large and the proportion of outliers is very small compared to the whole dataset. The following strategy is therefore adopted: 1000 random subsets of $X$ of size $n_a = 3{,}000$, $X^a_i$, $i = 1, \ldots, 1000$, and of size $n_b = 30{,}000$, $X^b_i$, $i = 1, \ldots, 1000$, are considered to compare the performance of SIR and Student SIR. First, reference estimates $\hat{B}^{SIR}$ and $\hat{B}^{st\text{-}SIR}$ are obtained on the whole dataset $X$, using respectively SIR and Student SIR with the dimension of the e.d.r. space set to $d = 3$ and the number of slices to $h = 1000$. The value $d = 3$ was selected via BIC computed for $d = 1, \ldots, 20$, which is reliable for such a large sample size. The proximity measure $r$ (26) between the two reference spaces is $r(\hat{B}^{SIR}, \hat{B}^{st\text{-}SIR}) = 0.95$: SIR and Student SIR identify approximately the same e.d.r. space. For each subset $X^a_i$, the proximities $r^{SIR}_i$ and $r^{st\text{-}SIR}_i$ between the estimates computed on the subset and the corresponding reference space are then evaluated. Results are reported in Figure 1 (a), where histograms show that Student SIR performs better than SIR most of the time. As expected, SIR is less robust than Student SIR, obtaining low values of $r$ with a higher frequency. The histograms show a difference between values around $r = 0.96$ (23.8% of random subsets for Student SIR, 17.2% for SIR).

In the second test, the sample size of the subsets is increased to $n_b = 30{,}000$. Accordingly, the number of slices is increased to $h = 100$. Not surprisingly, the means (and standard deviations) of $r^{SIR}_i$ and $r^{st\text{-}SIR}_i$ increase to 0.97 (0.04) and 0.99 (0.00). Student SIR still performs better than SIR (Figure 1 (b)), with some low values of the proximity measure for SIR, while Student SIR has almost all values (93.4% of random subsets) concentrated around $r = 0.98$. The difference between the two approaches is further emphasized in Figure 2, where the cloud of points in the upper-left corner of the plot corresponds to datasets for which SIR was not able to estimate a correct basis of the e.d.r. space while Student SIR shows good performance. Even if the true e.d.r. space is unknown, this analysis suggests that Student SIR is robust to outliers and can be profitably used in real applications.

Conclusion and future work
We proposed a new approach, referred to as Student SIR, to robustify SIR. In contrast to most existing approaches, which aim at replacing the standard SIR estimators by robust versions, we considered the intrinsic characterization of SIR as a Gaussian inverse regression model [9] and modified it into a Student model with heavier tails. While SIR is not robust to outliers, Student SIR has been shown able to deal with different kinds of situations that depart from normality. As expected, when SIR provides good results, Student SIR performs similarly but at a higher computational cost due to the need for an iterative EM algorithm for estimation.
Limitations of the approach include the difficulty of dealing with the case $p > n$ or with strong correlations between variables: Student SIR, like SIR, still requires the inversion of large covariance matrices. A regularization to overcome this problem has been proposed in [10] and could be extended to our Student setting. Another practical issue is how to set the dimension $d$ of the central subspace. We have proposed the use of BIC as a natural tool in our maximum likelihood setting. It provided good results but may not be suited when the sample size is too small. A more complete study and comparison with other solutions would be interesting.
To conclude, Student SIR shows good performance in the presence of outliers and performs equally well in the case of Gaussian errors. In our experiments, the algorithm has shown fast convergence, making it a promising alternative to SIR since most real datasets nowadays include outliers. Future work could extend this setting to a multivariate response, following the lead of [34,35].

Proof of Proposition 1
The proof generalizes the proof of Proposition 6 in [9] to the generalized Student case. It follows from (7) that $X_y$ follows a generalized Student distribution $\mathcal{S}_p(\mu_y, V, \alpha)$, where $\mu_y = \mu + VB\,c(y)$. Generalized Student distributions have properties similar to those of Gaussian distributions (see for instance Section 5.5 in [16]). In particular, any affine transformation of a generalized Student distribution remains in this family. It follows that $B^T X \mid Y = y$ is distributed as $\mathcal{S}_d(B^T \mu_y, B^T V B, \alpha)$. Similarly, marginal and conditional distributions are retained in the family. It follows that $X \mid B^T X = B^T x, Y = y$ is also a generalized Student distribution $\mathcal{S}_p(\tilde{\mu}, \tilde{V}, \tilde{\alpha}, \tilde{\gamma})$, whose parameters $\tilde{V}$, $\tilde{\alpha}$, $\tilde{\gamma}$ and $\tilde{\mu}$ can be checked not to depend on $y$. It follows that $X \mid B^T X = B^T x, Y = y$ has the same distribution as $X \mid B^T X = B^T x$ for all values $x$. Consequently, $Y$ is independent of $X$ conditionally on $B^T X$, which implies that $Y \mid X = x$ and $Y \mid B^T X = B^T x$ have identical distributions for all values $x$.
Note that for the proof of the proposition, it was necessary to show that the independence of $y$ holds for each parameter of the distribution and not only for the mean. The independence of the mean with respect to $y$ is actually straightforward using [9], where it appears that the proof that $E[X \mid B^T X, Y = y]$ does not depend on $y$ is independent of the distribution of $\varepsilon$. Indeed, the proof uses only the properties of the conditional expectation seen as a projection operator. This means that in our case also, $B$ corresponds to the mean central subspace as defined by $E[X \mid B^T X, Y = y] = E[X \mid B^T X]$.

Proof of Lemma 1
The proof is adapted from the proof of Lemma 1 in [10], taking into account the additional quantities $\bar{u}_i$. Let us remark that, where we have defined for $i = 1, \ldots, n$, since $Z_{2,\cdot}$ and $Z_{3,\cdot}$ are centered, replacing the previous expansion in (27)