Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data

Maria del Mar Muniz Moreno; Claire Gaveriaux-Ruff; Yann Hérault

doi:10.1186/s12859-022-05111-0

Journal article, BMC Bioinformatics, Year: 2023

Maria del Mar Muniz Moreno

Role: Author
PersonId: 1103125

Institut de Génétique et de Biologie Moléculaire et Cellulaire

Claire Gaveriaux-Ruff

Role: Author
PersonId: 1066528

Institut de Génétique et de Biologie Moléculaire et Cellulaire

Yann Hérault

Role: Author
PersonId: 741744
IdHAL: yann-herault
ORCID: 0000-0001-7049-6900
IdRef: 077172639

Institut de Génétique et de Biologie Moléculaire et Cellulaire

Abstract

BACKGROUND: In individuals or animals suffering from genetic or acquired diseases, it is important to identify which clinical or phenotypic variables can be used to discriminate between disease and non-disease states, responses to treatment, or sexual dimorphism. However, the data often suffer from a low number of samples, a high number of variables, or unbalanced experimental designs. Moreover, several parameters can be recorded in the same test. Thus, correlations should be assessed, and a more complex statistical framework is necessary for the analysis. Packages that provide analysis tools already exist, but they are not gathered in one place, which makes choosing and implementing a method difficult for non-statisticians.
RESULTS: We present Gdaphen, a fast joint pipeline for identifying the most important qualitative and quantitative predictor variables to discriminate between genotypes, treatments, or sexes. Gdaphen takes behavioral/clinical data as input and uses a Multiple Factor Analysis (MFA) to deal with groups of variables recorded from the same individuals or anonymize genotype-based recordings. As optimized input, Gdaphen uses the non-correlated variables with a correlation of 30% or higher on the MFA-Principal Component Analysis (PCA), which increases the discriminative power and the efficiency of the classifier's predictive model. Gdaphen can determine the strongest variables that predict gene dosage effects using General Linear Model (GLM)-based classifiers, or determine the most discriminative non-linearly distributed variables using a Random Forest (RF) implementation. Moreover, Gdaphen reports the efficacy of each classifier and offers several visualization options that support the results with easily readable plots ready to be included in publications. We demonstrate Gdaphen's capabilities on several datasets and provide easy-to-follow vignettes.
CONCLUSIONS: Gdaphen makes the analysis of phenotypic data much easier for medical or preclinical behavioral researchers, providing an integrated framework to perform: (1) pre-processing steps such as data imputation or anonymization; (2) a full statistical assessment to identify which variables are the most important discriminators; and (3) state-of-the-art visualizations ready for publication to support the conclusions of the analyses. Gdaphen is open-source and freely available at https://github.com/munizmom/gdaphen , together with vignettes, documentation for the functions, and examples to guide users in their own implementation.
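
To make the idea of selecting non-correlated variables more concrete, here is a minimal Python sketch of a greedy pairwise-correlation filter. It is an illustration only and not Gdaphen's code or API (Gdaphen is an R package). The function names `pearson` and `drop_correlated`, the cutoff `max_abs_r=0.75`, and the toy values are all assumptions made for this example. The sketch also leaves out the MFA-PCA step and the 30% criterion described in the abstract.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant variable carries no correlation information
    return cov / (sx * sy)

def drop_correlated(data, max_abs_r=0.75):
    """Greedily keep variables whose |r| with every kept variable is below max_abs_r.

    data: dict mapping variable name -> list of numeric values (same length).
    max_abs_r: hypothetical cutoff chosen for this sketch, not taken from Gdaphen.
    """
    kept = []
    for name, values in data.items():
        if all(abs(pearson(values, data[k])) < max_abs_r for k in kept):
            kept.append(name)
    return kept

# Toy phenotypic table (made-up numbers, for illustration only).
toy = {
    "distance": [1.0, 2.0, 3.0, 4.0, 5.0],
    "speed": [2.1, 4.0, 6.2, 7.9, 10.0],  # nearly proportional to distance
    "rearing": [3.0, 1.0, 4.0, 1.0, 5.0],
}
selected = drop_correlated(toy)
print(selected)
```

In this toy table, `speed` is almost a linear function of `distance`, so the filter drops it, while `rearing` is kept. Because the filter is greedy, the result depends on the order of the variables. A real analysis would need a principled rule for deciding which member of a correlated pair to retain.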

Keywords

R package; Phenotypic data; Clinical data; Discrimination; Generalized linear models; Random forest; Imputation; Model; Prediction; Machine learning; Bootstrapping

Domains

Genetics; Bioinformatics, Systems Biology [q-bio.QM]

Main file

islandora_167399.pdf (2.95 MB)

Origin: Publisher files allowed on an open archive

Archive ouverte univOAK

https://hal.science/hal-04219165

Submitted on: Tuesday, September 26, 2023, 23:05:20

Last modified on: Monday, March 11, 2024, 10:38:22

Long-term archiving on: Wednesday, December 27, 2023, 19:36:42

Dates and versions

hal-04219165, version 1 (26-09-2023)

Identifiers

HAL Id: hal-04219165, version 1
DOI: 10.1186/s12859-022-05111-0

Cite

Maria del Mar Muniz Moreno, Claire Gaveriaux-Ruff, Yann Hérault. Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data. BMC Bioinformatics, 2023, 24 (1), pp.28. ⟨10.1186/s12859-022-05111-0⟩. ⟨hal-04219165⟩

Collections

CNRS IGBMC SITE-ALSACE ANR
