Skip to Main content Skip to Navigation
Journal articles

High-quality sequence clustering guided by network topology and multiple alignment likelihood.

Abstract : MOTIVATION: Proteins can be naturally classified into families of homologous sequences that derive from a common ancestor. The comparison of homologous sequences and the analysis of their phylogenetic relationships provide useful information regarding the function and evolution of genes. One important difficulty of clustering methods is to distinguish highly divergent homologous sequences from sequences that only share partial homology due to evolution by protein domain rearrangements. Existing clustering methods require parameters that have to be set a priori. Given the variability in the evolution pattern among proteins, these parameters cannot be optimal for all gene families. RESULTS: We propose a strategy that aims at clustering sequences homologous over their entire length, and that takes into account the pattern of substitution specific to each gene family. Sequences are first all compared with each other and clustered into pre-families, based on pairwise similarity criteria, with permissive parameters to optimize sensitivity. Pre-families are then divided into homogeneous clusters, based on the topology of the similarity network. Finally, clusters are progressively merged into families, for which we compute multiple alignments, and we use a model selection technique to find the optimal tradeoff between the number of families and multiple alignment likelihood. To evaluate this method, called HiFiX, we analyzed simulated sequences and manually curated datasets. These tests showed that HiFiX is the only method robust to both sequence divergence and domain rearrangements. HiFiX is fast enough to be used on very large datasets. AVAILABILITY AND IMPLEMENTATION: The Python software HiFiX is freely available at http://lbbe.univ-lyon1.fr/hifix.
Document type :
Journal articles
Complete list of metadata

https://hal.archives-ouvertes.fr/hal-00965711
Contributor : Laurence Naiglin <>
Submitted on : Tuesday, March 25, 2014 - 3:29:39 PM
Last modification on : Wednesday, July 21, 2021 - 3:49:50 AM

Links full text

Identifiers

Collections

Citation

Vincent Miele, Simon Penel, Vincent Daubin, Franck Picard, Daniel Kahn, et al.. High-quality sequence clustering guided by network topology and multiple alignment likelihood.. Bioinformatics, Oxford University Press (OUP), 2012, 28 (8), pp.1078-85. ⟨10.1093/bioinformatics/bts098⟩. ⟨hal-00965711⟩

Share

Metrics

Record views

356