Abstract : Introduction : Complex diseases are known to be highly heterogeneous. This heterogeneity can arise from various sources, including genetic heterogeneity (e.g., population stratification), phenotypic heterogeneity (e.g., the clinical diagnosis of schizophrenia), heterogeneous exposure to environmental factors (e.g., alcohol, drugs, pollution), and recruitment heterogeneity over time (the so-called "cohort effect"). In case-control studies, detecting and accounting for this heterogeneity can help identify high-risk subgroups in the population and provide a better understanding of the disease.
Method : We introduce a new way to detect and account for any source of heterogeneity: a breakpoint model for logistic regression.
Our model is a constrained hidden Markov model, in which a constrained Markov chain describes the hidden segmentation and a logistic regression model describes the observed part. Parameters are estimated by combining the forward-backward recursions with the Expectation-Maximization algorithm. The model output includes both the regression estimates in each segment and the full posterior distribution of the breakpoints.
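As an illustration of the forward-backward step only, the following is a minimal sketch under stated assumptions and not the authors' implementation. It assumes a left-to-right chain over K segments that must start in the first segment and end in the last, a single hypothetical `stay` weight shared by all segments, and per-segment logistic coefficients `betas` that are already known. In a full EM procedure, an M-step would refit each segment's logistic regression using the columns of `post` as observation weights; that step is omitted here. All function and variable names are hypothetical.

```python
import numpy as np

def log_emission(X, y, betas):
    """Log-probability of each outcome y[i] under each segment's logistic model.

    X: (n, p) covariates, y: (n,) outcomes in {0, 1}, betas: (K, p) coefficients.
    Returns an (n, K) array.
    """
    eta = X @ betas.T                                  # linear predictors, (n, K)
    return y[:, None] * eta - np.logaddexp(0.0, eta)   # y*eta - log(1 + exp(eta))

def breakpoint_posteriors(X, y, betas, stay=0.99):
    """Forward-backward pass for a left-to-right segmentation with K segments."""
    n, K = X.shape[0], betas.shape[0]
    le = log_emission(X, y, betas)

    # Constrained transitions: stay in segment k or move to segment k + 1.
    # Using the same weight for every "stay" move (assumption) gives all valid
    # segmentations the same prior weight.
    logA = np.full((K, K), -np.inf)
    for k in range(K):
        logA[k, k] = np.log(stay)
        if k + 1 < K:
            logA[k, k + 1] = np.log1p(-stay)

    # Forward recursion; the chain is forced to start in segment 0.
    F = np.full((n, K), -np.inf)
    F[0, 0] = le[0, 0]
    for i in range(1, n):
        F[i] = le[i] + np.logaddexp.reduce(F[i - 1][:, None] + logA, axis=0)

    # Backward recursion; the chain is forced to end in segment K - 1.
    B = np.full((n, K), -np.inf)
    B[n - 1, K - 1] = 0.0
    for i in range(n - 2, -1, -1):
        B[i] = np.logaddexp.reduce(logA + (le[i + 1] + B[i + 1])[None, :], axis=1)

    log_norm = F[n - 1, K - 1]          # log normalising constant of the constrained model
    post = np.exp(F + B - log_norm)     # post[i, k] = P(individual i is in segment k | data)

    # bp[i, k] = P(the k-th breakpoint falls between individuals i and i + 1 | data)
    jump = logA[np.arange(K - 1), np.arange(1, K)]
    bp = np.exp(F[:-1, :-1] + jump[None, :] + le[1:, 1:] + B[1:, 1:] - log_norm)
    return post, bp, log_norm
```

Each row of `post` sums to one, and each column of `bp` is a probability distribution over the possible positions of one breakpoint, which corresponds to the "full posterior distribution of the breakpoints" mentioned above.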
Results : We validate the model and illustrate its usefulness on both simulated and realistic datasets. In particular, we show that if individuals are ordered according to some proximity space (e.g., by increasing body mass index, BMI), the model can detect interactions between genes and latent exposures within a simple likelihood ratio testing framework. This last result seems particularly promising, since it provides a unique way to distinguish between confounding factors (e.g., sex for smoking) and genuine unobserved causal exposures.