
Statistical modeling of protein sequences beyond structural prediction : high dimensional inference with correlated data

Abstract : Over the past decades, genomic databases have grown exponentially in size thanks to the constant progress of modern DNA sequencing. A large variety of statistical tools have been developed, at the interface between bioinformatics, machine learning, and statistical physics, to extract information from these ever-increasing datasets. In the specific context of protein sequence data, several approaches have recently been introduced by statistical physicists, such as direct-coupling analysis, a global statistical inference method based on the maximum-entropy principle, which has proven to be extremely effective in predicting the three-dimensional structure of proteins from purely statistical considerations. In this dissertation, we review the relevant inference methods and, encouraged by their success, discuss their extension to other challenging fields, such as sequence folding prediction and homology detection. Contrary to residue-residue contact prediction, which relies on intrinsically topological information about the network of interactions, these fields require global energetic considerations and therefore a more quantitative and detailed model. Through an extensive study on both artificial and biological data, we provide a better interpretation of the central inferred parameters, up to now poorly understood, especially in the limited sampling regime. Finally, we present a new and more precise procedure for the inference of generative models, which leads to further improvements on real, finitely sampled data.
Contributor : Abes Star
Submitted on : Monday, March 19, 2018 - 9:25:08 AM
Last modification on : Monday, December 14, 2020 - 9:50:24 AM
Long-term archiving on : Tuesday, September 11, 2018 - 8:10:20 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01736980, version 1


Alice Coucke. Statistical modeling of protein sequences beyond structural prediction : high dimensional inference with correlated data. Mathematical Physics [math-ph]. Université Paris sciences et lettres, 2016. English. ⟨NNT : 2016PSLEE034⟩. ⟨tel-01736980⟩


