Conference poster, Year: 2017

Model-based multivariate discretization for logistic regression

Abstract

Credit institutions are interested in the repayment probability of a loan given the applicant’s characteristics, in order to assess the worthiness of the credit. For regulatory and interpretability reasons, logistic regression is still widely used to learn this probability from the data. Although logistic regression naturally handles both quantitative and qualitative data, two pre-processing steps are usually performed: first, continuous features are discretized by assigning factor levels to pre-determined intervals; second, qualitative features that take numerous values are regrouped into variables with fewer factor levels. This communication focuses on the discretization of continuous variables, which is performed for two main reasons: first, it produces a “scorecard” with a direct correspondence from intervals to score “points”; second, it makes it possible to deal with the nonlinearity of the score with respect to the continuous variables. Many discretization algorithms already exist (see the review by Ramírez‐Gallego et al. (2016)). To the best of our knowledge, the few multivariate supervised algorithms are unsatisfactory in our setup, mainly because they are not fully automated, their optimized criterion does not produce discretized features suitable for logistic regression, and their approaches are empirical. By reinterpreting discretized features as latent variables, we are able, through the use of a Stochastic Expectation-Maximization (SEM) algorithm and a Gibbs sampler, to overcome these shortcomings and to find the best discretization scheme w.r.t. the logistic regression loss. The good performance of this approach is illustrated on simulated and real data from Crédit Agricole Consumer Finance.
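To make the "scorecard" idea concrete, the following minimal Python sketch discretizes one continuous feature with pre-determined cut points and derives one score per interval. It is not the SEM/Gibbs procedure of the poster: the cut points are fixed by hand, and the names `discretize`, `scorecard`, `ages`, `repaid`, `cuts` and the `smoothing` parameter, as well as the toy data, are all assumptions made for illustration. The sketch relies on a standard fact: for a single one-hot encoded feature without intercept, the maximum-likelihood logistic regression coefficient of each level equals the empirical log-odds in that interval (here with a small additive smoothing to avoid infinite values in pure intervals).

```python
import bisect
import math


def discretize(x, cuts):
    """Return the index of the interval containing x, given sorted cut points.

    With cuts = [30, 60] the intervals are (-inf, 30), [30, 60), [60, +inf).
    """
    return bisect.bisect_right(cuts, x)


def scorecard(xs, ys, cuts, smoothing=0.5):
    """Return one score ("points") per interval: the smoothed log-odds of y = 1.

    With smoothing = 0 and non-pure intervals, these values coincide with the
    coefficients of a logistic regression fitted on the one-hot encoded
    intervals without intercept. The smoothing term is an illustrative
    assumption, not part of the original work.
    """
    n_bins = len(cuts) + 1
    pos = [0] * n_bins  # count of y = 1 per interval
    neg = [0] * n_bins  # count of y = 0 per interval
    for x, y in zip(xs, ys):
        b = discretize(x, cuts)
        if y == 1:
            pos[b] += 1
        else:
            neg[b] += 1
    return [math.log((p + smoothing) / (n + smoothing)) for p, n in zip(pos, neg)]


# Hypothetical toy data: applicant age and whether the loan was repaid (1) or not (0).
ages = [22, 25, 28, 35, 40, 45, 50, 62, 68, 71]
repaid = [0, 0, 1, 1, 1, 1, 1, 0, 1, 0]
cuts = [30, 60]  # hand-picked cut points, an assumption of this sketch

points = scorecard(ages, repaid, cuts)
for label, p in zip(["< 30", "30 to 59", ">= 60"], points):
    print(f"age {label}: {p:+.2f} points")
```

In this toy data the middle interval scores higher than both outer intervals, a non-monotone effect that a single linear term in age could not represent; this is the nonlinearity argument made above.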
No file deposited

Dates and versions

hal-02075126, version 1 (21-03-2019)

License

Attribution

Identifiers

  • HAL Id: hal-02075126, version 1

Cite

Adrien Ehrhardt, Christophe Biernacki, Vincent Vandewalle, Philippe Heinrich. Model-based multivariate discretization for logistic regression. Data Science Summer School, Aug 2017, Paris, France. 2017. ⟨hal-02075126⟩