Algorithms for ab initio and large-scale prediction and classification of ncRNAs

Ludovic Platon

Abstract

The analysis of the very large volumes of data generated by next-generation sequencing (NGS) requires efficient bioinformatics tools. One aspect of this analysis is the identification of non-coding RNAs (ncRNAs), which play important roles in many biological processes. Identifying ncRNAs computationally raises two challenges: (i) ab initio prediction and classification of the different types of ncRNAs, and (ii) large-scale processing of the data. Most existing ncRNA prediction tools are specialized for a single type of ncRNA, the largest number being dedicated to microRNAs (miRNAs). This is notably the case for the tools we developed previously, which are available on our software platform EvryRNA (http://EvryRNA.ibisc.univ-evry.fr). Some tools in the literature can also identify other types of ncRNAs by comparison with sequences listed in databases dedicated to ncRNAs (homology-based approach). There are also tools that predict several types of ncRNAs, but either without classifying them or with a classification based on homology. The very few ab initio methods, all published very recently, are clearly insufficient in terms of both prediction performance and running time. The goal of this project is to develop an ab initio algorithm for the large-scale prediction and classification of several classes of ncRNAs from NGS data, using both combinatorial optimization and machine learning methods and considering different types of ncRNA features: sequence, secondary structure, genomic position, neighborhood, etc. One of the principal characteristics of ncRNAs is their structure, notably the secondary structure, when it exists and is known. It is therefore important to take structure into account in ncRNA prediction algorithms, and the challenge is to develop algorithms fast enough to handle huge volumes of NGS data.
The developed algorithms will be applied to the identification of ncRNAs involved in sex determination in plants, particularly in cucurbits (melon, cucumber, …), for which large volumes of data are available at IPS2.
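
As an illustration of what sequence-level features can look like, the following is a minimal Python sketch, not the feature set or method of this work. The function name `sequence_features`, the choice of GC content and dinucleotide frequencies, and the default `k = 2` are all assumptions made for the example. Structural, positional and neighborhood features would require additional tools and data and are not covered here.

```python
from collections import Counter
from itertools import product


def sequence_features(seq, k=2):
    """Hypothetical helper: length, GC content and k-mer frequencies of a sequence.

    Illustrative only; the feature choice is an assumption, not the
    feature set described in the abstract.
    """
    seq = seq.upper().replace("T", "U")  # treat DNA reads as RNA
    n = len(seq)
    features = {"length": float(n)}
    features["gc_content"] = (seq.count("G") + seq.count("C")) / n if n else 0.0

    # Count overlapping k-mers along the sequence.
    total = max(n - k + 1, 0)
    counts = Counter(seq[i:i + k] for i in range(total))

    # Report one frequency per k-mer over the alphabet ACGU, in a fixed order.
    # k-mers containing other symbols (e.g. N) still count toward `total`
    # but get no feature of their own.
    for kmer in ("".join(p) for p in product("ACGU", repeat=k)):
        features["freq_" + kmer] = counts[kmer] / total if total else 0.0
    return features


if __name__ == "__main__":
    for name, value in sequence_features("GGCCAUAU").items():
        print(name, round(value, 3))
```

A fixed-length vector of this kind could serve as one input to a machine learning classifier. Keeping the k-mer order fixed means every sequence maps to the same feature layout regardless of its length.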


