A highly scalable algorithm for the extraction of cis-regulatory regions
Résumé
In this paper we propose a new algorithm for identifying cis-regulatory modules in genomic sequences. In particular, the algorithm extracts structured motifs, defined as a collection of highly conserved regions with pre-specified sizes and spacings between them. This type of motifs is extremely relevant in the research of gene regulatory mechanisms since it can effectively represent promoter models. The proposed algorithm uses a new data structure, called box-link, to store the information about conserved regions that occur in a well-ordered and regularly spaced manner in the dataset sequences. The complexity analysis shows a time and space gain over previous algorithms that is exponential on the spacings between binding sites. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than two orders of magnitude. The application of the method to biological datasets shows its ability to extract relevant consensi.
Domaines
Autre [q-bio.OT]
Origine : Fichiers produits par l'(les) auteur(s)