Counting Patterns in Degenerated Sequences

Abstract : Biological sequences such as DNA or proteins are always obtained through a sequencing process that may introduce some uncertainty. As a result, such sequences are usually written in a degenerated alphabet in which some symbols may correspond to several possible letters (e.g., the IUPAC DNA alphabet). When counting patterns in such degenerated sequences, a question naturally arises: how should degenerated positions be handled? Since most positions (usually 99%) are not degenerated, it is considered harmless to discard the degenerated positions in order to get an observation, but the exact consequences of this practice are unclear. In this paper, we introduce a rigorous method to take into account the sequencing uncertainty of biological sequences (DNA, proteins). We first introduce a Forward-Backward approach to compute the marginal distribution of the constrained sequence, and use it both to perform an Expectation-Maximization estimation of the parameters and to derive a heterogeneous Markov distribution for the constrained sequence. This distribution is then used along with known DFA-based pattern approaches to obtain the exact distribution of the pattern count under the constraints. As an illustration, we consider an EST dataset from the EMBL database. Although only 1% of the positions in this dataset are degenerated, we show that ignoring these positions may lead to erroneous observations, which further demonstrates the value of our approach.
Keywords : Forward-Backward algorithm - Expectation-Maximization algorithm - Markov chain embedding - Deterministic Finite state Automaton
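
The Forward-Backward step mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a homogeneous order-1 Markov model over A, C, G, T, a non-empty sequence written in IUPAC codes, and a sequence that has non-zero probability under the model. The names `posterior_marginals`, `init` and `trans` are hypothetical and chosen for this example only. The function returns, for each position, the probability of each letter given that every position is compatible with its IUPAC symbol.

```python
# Illustrative sketch only; names and model order are assumptions, not the paper's code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
LETTERS = "ACGT"


def _normalize(v):
    # Rescale to sum 1; assumes the total is non-zero (sequence is possible).
    total = sum(v)
    return [x / total for x in v]


def posterior_marginals(seq, init, trans):
    """Return P(X_i = a | all positions compatible with seq) for each i.

    init[a] is the initial probability of letter a (indexed as in LETTERS),
    trans[a][b] is the transition probability from letter a to letter b.
    """
    n, k = len(seq), len(LETTERS)
    # allowed[i][a] is 1.0 if letter a is compatible with symbol seq[i].
    allowed = [[1.0 if a in IUPAC[s] else 0.0 for a in LETTERS] for s in seq]

    # Forward pass (rescaled at each step to avoid numerical underflow).
    fwd = [_normalize([init[a] * allowed[0][a] for a in range(k)])]
    for i in range(1, n):
        prev = fwd[-1]
        fwd.append(_normalize([
            allowed[i][b] * sum(prev[a] * trans[a][b] for a in range(k))
            for b in range(k)
        ]))

    # Backward pass (also rescaled; the scale cancels in the final step).
    bwd = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        nxt = bwd[i + 1]
        bwd[i] = _normalize([
            sum(trans[a][b] * allowed[i + 1][b] * nxt[b] for b in range(k))
            for a in range(k)
        ])

    # Posterior marginal is proportional to forward times backward.
    return [_normalize([f * b for f, b in zip(fwd[i], bwd[i])]) for i in range(n)]
```

For example, calling `posterior_marginals("ANR", [0.25] * 4, [[0.25] * 4 for _ in range(4)])` with uniform parameters gives a point mass on A at the first position, a uniform distribution at the N position, and equal weight on A and G at the R position. With non-uniform transitions, the neighbouring letters shift these marginals, which is the information lost when degenerated positions are simply discarded.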
Document type :
Journal articles
Contributor : Grégory Nuel
Submitted on : Wednesday, November 24, 2010 - 3:49:49 PM
Last modification on : Friday, September 20, 2019 - 4:34:02 PM


  • HAL Id : hal-00539552, version 1



Gregory Nuel. Counting Patterns in Degenerated Sequences. Lecture Notes in Computer Science: Pattern Recognition in Bioinformatics, 2009, pp.222-232. ⟨hal-00539552⟩


