Habilitation à diriger des recherches

Machine learning tools for biomarker discovery

Abstract : My research focuses on the development of machine learning tools for therapeutic research. In particular, my goal is to propose computational tools that can exploit data sets to extract biological hypotheses that explain, at a genomic or molecular level, the differences between samples that can be observed at a macroscopic scale. Such tools are necessary for the development of precision medicine, which requires identifying the characteristics, genomic or otherwise, that explain the differences in prognosis or therapeutic response between patients who exhibit the same symptoms. These questions can often be formulated as feature selection problems. However, typical data sets contain many more features than samples, which poses statistical challenges. To address these challenges, my work is organized along three axes. First, knowledge accumulated on biological systems can often be represented as biological networks. Under the hypothesis that features connected on these networks are likely to work together towards a phenotype, we propose to use biological networks to guide feature selection algorithms. The idea here is to define constraints that encourage the selected features to be connected on a given network. The formulation we proposed, which can be seen as a special case of what I call regularized relevance, allows us to efficiently select features on data sets containing hundreds of thousands of variables. Second, to compensate for the small number of available samples, so-called multitask methods solve several related problems, or tasks, simultaneously. We have generalized regularized relevance to this context. I have also worked on the case where one can define a similarity between tasks, to impose that the more similar two tasks are, the more similar the two sets of features selected for them are.
Such approaches can be used to study the response to different drug treatments: one can then use the similarity between the molecular structures of the drugs, a topic I studied in the course of my PhD. Finally, most feature selection methods used in genomics can only explain the phenomenon of interest by linear effects. However, a large body of literature indicates that regions of the genome can interact nonlinearly. Modeling such interactions, which are called epistatic, exacerbates the aforementioned statistical challenges and creates computational issues: evaluating all possible combinations of variables becomes intractable. My work in this domain addresses these computational issues, as well as the statistical challenges one encounters when modeling quadratic interactions between pairs of regions of the genome. More recently, we have also developed approaches that allow us to model more complex interactions thanks to kernel methods.
Document type : Habilitation à diriger des recherches

Cited literature : 290 references
Contributor : Chloé-Agathe Azencott
Submitted on : Friday, January 24, 2020 - 5:34:52 PM
Last modification on : Monday, December 14, 2020 - 9:55:24 AM
Long-term archiving on : Saturday, April 25, 2020 - 12:47:44 PM


  • HAL Id : tel-02354924, version 2


Chloé-Agathe Azencott. Machine learning tools for biomarker discovery. Machine Learning [stat.ML]. Sorbonne Université UPMC, 2020. ⟨tel-02354924v2⟩


