Journal article, Information and Software Technology, Year: 2015

A Systematic Analysis of Data Preprocessing for Machine Learning-based Software Cost Estimation

Abstract

Context. Because software development is a complex process, traditional parametric models and statistical methods are often inadequate for modeling the increasingly complicated relationship between project development cost and project features (cost drivers). Machine learning (ML) methods, with several reported successful applications, have gained popularity for software cost estimation in recent years. Many researchers describe data preprocessing as a fundamental stage of ML methods; however, very few studies have focused on the effects of data preprocessing techniques.

Objective. This study empirically assesses the effectiveness of data preprocessing techniques for ML methods in the context of software cost estimation.

Method. We first survey recent publications that use data preprocessing techniques, then conduct a systematic empirical study to analyze the strengths and weaknesses of individual data preprocessing techniques and of their combinations.

Results. Data preprocessing techniques may significantly influence the final prediction, and they sometimes have a negative impact on the prediction performance of ML methods.

Conclusion. To reduce prediction errors and improve efficiency, data preprocessing techniques must be selected carefully according to the characteristics of the ML methods and of the datasets used for software cost estimation.
Main file
IST_hal_submission.pdf (1.23 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-01340341 , version 1 (30-06-2016)

Identifiers

HAL Id: hal-01340341
DOI: 10.1016/j.infsof.2015.07.004

Cite

Jianglin Huang, Yan-Fu Li, Min Xie. A Systematic Analysis of Data Preprocessing for Machine Learning-based Software Cost Estimation. Information and Software Technology, 2015, 67, pp.108-127. ⟨10.1016/j.infsof.2015.07.004⟩. ⟨hal-01340341⟩
193 views
1336 downloads
