Latent Forests to Model Genetical Data for the Purpose of Multilocus Genome-wide Association Studies. Which clustering should be chosen?

Abstract : The aim of genetic association studies, and in particular genome-wide association studies (GWASs), is to unravel the genetics of complex diseases. In this domain, machine learning offers an attractive alternative to classical statistical approaches. The seminal work of Mourad et al. (2011) led to the proposal of a novel class of probabilistic graphical models, the forest of latent trees (FLTM). The design of this model was motivated by the need to model genetical data at the genome scale, prior to a multilocus GWAS. A multilocus GWAS fully exploits information about the complex dependences existing within genetical data, to help detect the loci associated with the studied pathology. The FLTM framework also allows data dimension reduction. The FLTM model is a hierarchical Bayesian network with latent variables. Central to the FLTM construction is the recursive clustering of variables, in a bottom-up subsuming process. This article analyzes the impact of the choice of the clustering method used in the FLTM learning algorithm, in a GWAS context. We rely on a real GWAS data set describing 41400 variables for each of 3004 controls and 2005 cases affected by Crohn’s disease, and compare the impact of three clustering methods. We compare statistics about data dimension reduction, as well as trends in the ability to split or group putative causal SNPs in agreement with the underlying biological reality. To assess the risk of missing significant association results due to subsumption, we also compare the clustering methods through the corresponding FLTM-based GWASs. In the GWAS context and in this framework, the choice of the clustering method does not influence the satisfactory performance of the GWAS.
Document type :
Book sections
https://hal.archives-ouvertes.fr/hal-01204956
Contributor : Christine Sinoquet
Submitted on : Thursday, September 24, 2015 - 5:32:02 PM
Last modification on : Tuesday, December 8, 2020 - 9:47:03 AM

Identifiers

  • HAL Id : hal-01204956, version 1

Citation

Duc-Thanh Phan, Philippe Leray, Christine Sinoquet. Latent Forests to Model Genetical Data for the Purpose of Multilocus Genome-wide Association Studies. Which clustering should be chosen? Communications in Computer and Information Science, Springer, pp.17, 2015, BIOSTEC2015. ⟨hal-01204956⟩
