Counting Patterns in Degenerated Sequences

Abstract: Biological sequences such as DNA or proteins are always obtained through a sequencing process that may introduce some uncertainty. As a result, such sequences are usually written in a degenerated alphabet in which some symbols may correspond to several possible letters (e.g., the IUPAC DNA alphabet). When counting patterns in such degenerated sequences, a question naturally arises: how should degenerated positions be handled? Since most positions (usually 99%) are not degenerated, it is considered harmless to discard the degenerated positions in order to get an observation, but the exact consequences of this practice are unclear. In this paper, we introduce a rigorous method to take into account the sequencing uncertainty of biological sequences (DNA, proteins). We first introduce a Forward-Backward approach to compute the marginal distribution of the constrained sequence, and use it both to perform an Expectation-Maximization estimation of the parameters and to derive a heterogeneous Markov distribution for the constrained sequence. This distribution is then used along with known DFA-based pattern approaches to obtain the exact distribution of the pattern count under the constraints. As an illustration, we consider an EST dataset from the EMBL database. Although only 1% of the positions in this dataset are degenerated, we show that not taking these positions into account might lead to erroneous observations, further demonstrating the interest of our approach.
Keywords: Forward-Backward algorithm - Expectation-Maximization algorithm - Markov chain embedding - Deterministic Finite state Automaton
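
To make the Forward-Backward step mentioned in the abstract more concrete, here is a minimal Python sketch. It is not the paper's implementation: it assumes a first-order homogeneous Markov chain over A, C, G, T, and the function name `posterior_marginals`, the parameters `init` and `trans`, and the example model are all hypothetical. It computes, for each position, the probability of each letter given that every position is restricted to the letters allowed by its IUPAC symbol.

```python
# Illustrative sketch only; names and model parameters are assumptions,
# not taken from the paper.
LETTERS = "ACGT"

# Standard IUPAC nucleotide codes mapped to the letters they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def posterior_marginals(seq, init, trans):
    """Return, per position, P(X_i = a | all positions respect seq).

    seq   : string of IUPAC symbols (the degenerated sequence)
    init  : list of 4 initial probabilities, ordered as LETTERS
    trans : 4x4 transition matrix, trans[a][b] = P(X_{i+1}=b | X_i=a)
    """
    n = len(seq)
    k = len(LETTERS)
    # allowed[i][a] is 1.0 if letter a is compatible with symbol seq[i].
    allowed = [[1.0 if a in IUPAC[s] else 0.0 for a in LETTERS] for s in seq]

    # Forward pass: fwd[i][a] = P(X_1..X_i compatible, X_i = a).
    fwd = [[init[a] * allowed[0][a] for a in range(k)]]
    for i in range(1, n):
        prev = fwd[-1]
        fwd.append([
            allowed[i][b] * sum(prev[a] * trans[a][b] for a in range(k))
            for b in range(k)
        ])

    # Backward pass: bwd[i][a] = P(X_{i+1}..X_n compatible | X_i = a).
    bwd = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = [
            sum(trans[a][b] * allowed[i + 1][b] * bwd[i + 1][b] for b in range(k))
            for a in range(k)
        ]

    # Combine and normalise; z is the probability of the constraints.
    marginals = []
    for i in range(n):
        w = [fwd[i][a] * bwd[i][a] for a in range(k)]
        z = sum(w)
        marginals.append({LETTERS[a]: w[a] / z for a in range(k)})
    return marginals


if __name__ == "__main__":
    uniform = [[0.25] * 4 for _ in range(4)]
    for pos, dist in enumerate(posterior_marginals("ANR", [0.25] * 4, uniform)):
        print(pos, dist)
```

This sketch omits the rescaling that long sequences would need to avoid numerical underflow, and it raises a division error if the constraints have probability zero under the model. Under the uniform model in the usage example, the `N` position keeps probability 0.25 for each letter, while `R` splits evenly between A and G.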
Document type:
Journal article
Lecture Notes in Computer Science: Pattern Recognition in Bioinformatics, 2009, pp.222-232

https://hal.archives-ouvertes.fr/hal-00539552
Contributor: Grégory Nuel
Submitted on: Wednesday, 24 November 2010 - 15:49:49
Last modified on: Tuesday, 10 October 2017 - 11:22:03

Identifiers

  • HAL Id: hal-00539552, version 1

Citation

Gregory Nuel. Counting Patterns in Degenerated Sequences. Lecture Notes in Computer Science: Pattern Recognition in Bioinformatics, 2009, pp.222-232. 〈hal-00539552〉
