Partially synthesised dataset to improve prediction accuracy

Abstract : Real-world data sources, such as statistical agencies, library databanks and research institutes, are the major data sources for researchers. Using this type of data has several advantages: it improves the credibility and validity of the experiment and, more importantly, it relates to real-world problems and is typically unbiased. However, this type of data is often unavailable or inaccessible, for the following reasons. First, privacy and confidentiality concerns mean the data must be protected on legal and ethical grounds. Second, collecting real-world data is costly and time consuming. Third, the data may not exist, particularly for newly arising research subjects. Therefore, many studies have used fully and/or partially synthesised data instead of real-world data, because it is simple to create, requires relatively little time, and can be generated in sufficient quantity to fit the requirements. In this context, this study introduces the use of partially synthesised data to improve the prediction of heart disease from risk factors. We propose generating partially synthetic data from agreed principles using a rule-based method, in which an extra risk factor is added to the real-world data. In the conducted experiment, more than 85 % of the data was derived from observed values (i.e., real-world data), while the remaining data was synthetically generated using a rule-based method in accordance with World Health Organisation criteria. The analysis revealed an improvement in the variance of the data explained by the first two principal components of the partially synthesised data. A further evaluation was conducted using five popular supervised machine-learning classifiers, in which the partially synthesised data considerably improved the prediction of heart disease: the majority of classifiers approximately doubled their predictive performance when using the extra risk factor.
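
The idea of adding a rule-based risk factor to real records can be illustrated with a minimal Python sketch. It is not the authors' implementation: the column names (`age`, `systolic_bp`, `cholesterol`), the thresholds and the helper functions are all hypothetical, the thresholds are not the World Health Organisation criteria mentioned in the abstract, and the sketch assumes the rule takes observed values as its input.

```python
from typing import Dict, List

Record = Dict[str, float]


def synthetic_risk_factor(record: Record) -> int:
    """Derive an extra, rule-based risk factor from observed values.

    The thresholds are illustrative assumptions only; they are not the
    WHO criteria referred to in the abstract.
    """
    score = 0
    if record["age"] >= 55:
        score += 1
    if record["systolic_bp"] >= 140:
        score += 1
    if record["cholesterol"] >= 240:
        score += 1
    return score


def partially_synthesise(records: List[Record]) -> List[Record]:
    """Return copies of the records with the synthetic column appended."""
    return [{**r, "synthetic_risk": synthetic_risk_factor(r)} for r in records]


def observed_share(records: List[Record], synthetic_columns: List[str]) -> float:
    """Fraction of all values in the table that come from observed columns."""
    total = sum(len(r) for r in records)
    synthetic = sum(1 for r in records for c in synthetic_columns if c in r)
    return (total - synthetic) / total


# Two made-up records standing in for real-world observations.
real = [
    {"age": 63, "systolic_bp": 145, "cholesterol": 233},
    {"age": 41, "systolic_bp": 120, "cholesterol": 250},
]
augmented = partially_synthesise(real)
```

Calling `observed_share(augmented, ["synthetic_risk"])` reports how much of the augmented table is still observed data, which is the kind of proportion the abstract quotes. The toy table has only three observed columns, so its share is lower than the paper's figure; the original records are left unmodified.
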
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-01377390
Contributor : Hani Hamdan
Submitted on : Friday, October 7, 2016 - 1:12:13 AM
Last modification on : Thursday, April 5, 2018 - 12:30:05 PM

Citation

Ahmed Aljaaf, Dhiya Al-Jumeily, Abir Jaafar Hussain, Paul Fergus, Mohammed Al-Jumaily, et al. Partially synthesised dataset to improve prediction accuracy. International Conference on Intelligent Computing (ICIC 2016), Aug 2016, Lanzhou, China. pp.855-866, ⟨10.1007/978-3-319-42291-6_84⟩. ⟨hal-01377390⟩
