MTCopula: Synthetic Complex Data Generation Using Copula
Abstract
Nowadays, marketing strategies are data-driven, and their quality depends significantly on the quality and quantity of available data. As it is not always possible to access this data, there is a need for synthetic data generation. Most existing techniques work well for low-dimensional data but may fail to capture complex dependencies between data dimensions. Moreover, identifying the right combination of models and their respective parameters is a tedious task that remains an open problem. In this paper, we present MTCopula, a novel approach for synthetic complex data generation based on Copula functions. MTCopula is a flexible and extensible solution that automatically chooses the best Copula model, between the Gaussian Copula and T-Copula models, and the best-fitting marginals to capture the complexity of the data. It relies on Maximum Likelihood Estimation to fit the candidate marginal distribution models and uses the Akaike Information Criterion to choose both the best marginals and the best Copula model, thus removing the need for a tedious manual exploration of their possible combinations. Comparisons with state-of-the-art synthetic data generators on a private real-world dataset, called AdWanted, and on datasets from the literature show that our approach better preserves the variable behaviors and the dependencies between variables in the generated synthetic datasets.
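The selection step described in the abstract (fit each candidate marginal by Maximum Likelihood Estimation, then keep the one with the lowest Akaike Information Criterion, AIC = 2k - 2 log L) can be illustrated with a minimal sketch. This is not the MTCopula implementation: the three candidate families (normal, exponential, log-normal), the function names, and the use of closed-form MLE formulas are all assumptions made for illustration, and the Copula selection step is not shown.

```python
import math
import random


def fit_normal(xs):
    """Closed-form MLE for a normal distribution (hypothetical candidate)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # MLE divides by n, not n - 1
    if var <= 0:
        return None  # degenerate sample, the fit is undefined
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return {"name": "normal", "params": (mu, math.sqrt(var)), "loglik": loglik, "k": 2}


def fit_exponential(xs):
    """Closed-form MLE for an exponential distribution (support x >= 0)."""
    n = len(xs)
    mean = sum(xs) / n
    if min(xs) < 0 or mean <= 0:
        return None  # data outside the support of this family
    rate = 1.0 / mean
    loglik = n * math.log(rate) - n  # n*log(rate) - rate*sum(xs), with rate*sum(xs) = n
    return {"name": "exponential", "params": (rate,), "loglik": loglik, "k": 1}


def fit_lognormal(xs):
    """Closed-form MLE for a log-normal distribution (support x > 0)."""
    if min(xs) <= 0:
        return None
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n
    if var <= 0:
        return None
    loglik = -sum(logs) - 0.5 * n * (math.log(2 * math.pi * var) + 1)
    return {"name": "lognormal", "params": (mu, math.sqrt(var)), "loglik": loglik, "k": 2}


def aic(fit):
    """Akaike Information Criterion: 2k - 2 log L (lower is better)."""
    return 2 * fit["k"] - 2 * fit["loglik"]


def select_marginal(xs):
    """Fit every candidate family and return the fit with the lowest AIC."""
    fits = [f(xs) for f in (fit_normal, fit_exponential, fit_lognormal)]
    fits = [f for f in fits if f is not None]
    return min(fits, key=aic)


if __name__ == "__main__":
    random.seed(0)
    sample = [random.gauss(0.0, 1.0) for _ in range(1000)]
    best = select_marginal(sample)
    print(best["name"], best["params"], aic(best))
```

In this sketch a family whose support does not contain the data is simply skipped, so the comparison only involves admissible fits. The same criterion could in principle be applied to compare Copula models once their log-likelihoods are available, which is how the abstract describes the choice between the Gaussian Copula and the T-Copula.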
Domains
Computer Science [cs]
Origin: Files produced by the author(s)