Semi-supervised co-selection : instances and features: application to diagnosis of dry port by rail

Raywat Makkhongkaew 1
1 DM2L - Data Mining and Machine Learning
LIRIS - Laboratoire d'InfoRmatique en Image et Systèmes d'information
Abstract : We are drowning in massive data but starved for knowledge. It is well known from the dimensionality tradeoff that more data carry more information, but at a price in computational complexity that has to be made up in some way. When the labeled sample is too small to carry sufficient information about the target concept, supervised learning fails. Unsupervised learning can be an alternative. However, because these algorithms ignore label information, important hints from labeled data are left out, which generally degrades their performance. Semi-supervised learning, which uses both labeled and unlabeled data, is expected to perform better and is better suited to large-domain applications where labels are hard and costly to obtain. In addition, when data are large, feature selection and instance selection are two important dual operations for removing irrelevant information. Combined with semi-supervised learning, both tasks pose distinct challenges to the machine learning and data mining communities for dimensionality reduction and knowledge retrieval. In this thesis, we focus on co-selection of instances and features in the context of semi-supervised learning. In this context, co-selection becomes a more challenging problem because the data contain both labeled and unlabeled examples sampled from the same population. To perform such semi-supervised co-selection, we propose two unified frameworks that efficiently integrate the labeled and unlabeled parts into the co-selection process. The first framework is based on weighting constrained clustering and the second on similarity preserving selection. Both approaches evaluate the usefulness of features and instances in order to select the most relevant ones simultaneously. Finally, we present a variety of empirical studies on high-dimensional data sets that are well known in the literature.
The results are promising and support the efficiency and effectiveness of the proposed approaches. In addition, the developed methods are validated on a real-world application, using data provided by the State Railway of Thailand (SRT). The aim is to derive application models from our methodological contributions to diagnose the performance of rail dry port systems. First, we present the results of some ensemble methods applied to a first data set, which is fully labeled. Second, we show how our co-selection approaches can improve the performance of learning algorithms on partially labeled data provided by SRT.
Document type :
Preprints, Working Papers, ...

https://hal.archives-ouvertes.fr/hal-01512600
Contributor : Équipe Gestionnaire Des Publications Si Liris
Submitted on : Monday, April 24, 2017 - 9:54:29 AM
Last modification on : Wednesday, November 20, 2019 - 3:04:40 AM

Identifiers

  • HAL Id : hal-01512600, version 1

Citation

Raywat Makkhongkaew. Semi-supervised co-selection : instances and features: application to diagnosis of dry port by rail. 2016. ⟨hal-01512600⟩
