Automatic Discovery of Hidden Associations Using Vector Similarity : Application to Biological Annotation Prediction

Seyed Ziaeddin Alborzi 1, 2
1 CAPSID - Computational Algorithms for Protein Structures and Interactions
Inria Nancy - Grand Est, LORIA - AIS - Department of Complex Systems, Artificial Intelligence & Robotics
Abstract : This thesis presents: 1) the development of a novel approach for finding direct associations between pairs of elements that are linked indirectly through various common features; 2) the use of this approach to directly associate biological functions with protein domains (ECDomainMiner and GODomainMiner) and to discover domain-domain interactions; and 3) the extension of this approach to comprehensively annotate protein structures and sequences. ECDomainMiner and GODomainMiner are two applications that discover new associations between protein domains and EC numbers and GO terms, respectively. They find a total of 20,728 and 20,318 non-redundant EC-Pfam and GO-Pfam associations, respectively, with F-measures of more than 0.95 with respect to a “Gold Standard” test set extracted from InterPro. Compared to around 1,500 manually curated associations in InterPro, ECDomainMiner and GODomainMiner provide a 13-fold increase in the number of available EC-Pfam and GO-Pfam associations. These function-domain associations are then used to annotate thousands of protein structures and millions of protein sequences whose domain composition is known but which currently lack experimental functional annotations. Using the inferred function-domain associations together with taxonomy information, thousands of annotation rules were generated automatically. These rules were then used to annotate millions of protein sequences in the TrEMBL database.
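The general idea of scoring indirectly linked pairs by vector similarity, and of checking the result against a gold standard with an F-measure, can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not specify the similarity measure or the features, so the sketch assumes cosine similarity over binary protein-membership vectors, and the protein names, EC and Pfam labels, threshold, and gold set are all hypothetical toy values.

```python
import math

# Hypothetical toy data (not from the thesis): the proteins that carry
# each EC number and each Pfam domain. Proteins are the shared features.
PROTEINS = ["P1", "P2", "P3", "P4"]
EC_TO_PROTEINS = {"EC_A": {"P1", "P2"}, "EC_B": {"P3", "P4"}}
DOMAIN_TO_PROTEINS = {"PF_X": {"P1", "P2"}, "PF_Y": {"P2", "P3", "P4"}}


def indicator(members):
    """Binary vector over PROTEINS: 1.0 where the protein is a member."""
    return [1.0 if p in members else 0.0 for p in PROTEINS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_pairs(left, right):
    """Score every (left, right) pair by the similarity of their vectors."""
    return {
        (l_name, r_name): cosine(indicator(l_set), indicator(r_set))
        for l_name, l_set in left.items()
        for r_name, r_set in right.items()
    }


def f_measure(predicted, gold):
    """Harmonic mean of precision and recall of predicted pairs vs. gold pairs."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


scores = score_pairs(EC_TO_PROTEINS, DOMAIN_TO_PROTEINS)

THRESHOLD = 0.5  # assumed cut-off, chosen only for this toy example
predicted = {pair for pair, score in scores.items() if score >= THRESHOLD}

# Hypothetical gold-standard pairs used to evaluate the toy predictions.
GOLD = {("EC_A", "PF_X"), ("EC_B", "PF_Y"), ("EC_B", "PF_X")}

for pair in sorted(scores):
    print(pair, round(scores[pair], 3))
print("F-measure:", round(f_measure(predicted, GOLD), 3))
```

In this toy setting, a pair scores highly when the EC number and the domain occur in largely the same proteins, which is one simple way an indirect link through shared proteins can be turned into a direct association score.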
Contributor : Abes Star
Submitted on : Tuesday, May 15, 2018 - 1:04:29 PM
Last modification on : Tuesday, December 18, 2018 - 4:40:22 PM
Long-term archiving on : Tuesday, September 25, 2018 - 12:20:46 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01792299, version 1


Seyed Ziaeddin Alborzi. Automatic Discovery of Hidden Associations Using Vector Similarity : Application to Biological Annotation Prediction. Bioinformatics [q-bio.QM]. Université de Lorraine, 2018. English. ⟨NNT : 2018LORR0035⟩. ⟨tel-01792299⟩


