Solving Data Mismatches in Bioinformatics Workflows by Generating Data Converters

Abstract: Heterogeneity of data and data formats in bioinformatics entails mismatches between the inputs and outputs of different services, making it difficult to compose them into workflows. To reduce those mismatches, bioinformatics platforms offer ad hoc converters, called shims. When shims are written by hand, they are time-consuming to develop and cannot anticipate all needs. When shims are automatically generated, they miss transformations, for example data composition from multiple parts or parallel conversion of list elements. This article proposes to systematically detect convertibility from output types to input types. Convertibility detection relies on a rule system based on abstract types, close to XML Schema. Types make it possible to abstract data while precisely accounting for their composite structure. Detection is accompanied by automatic generation of converters between input and output XML data. We show the applicability of our approach by abstracting concrete bioinformatics types (e.g., complex biosequences) for a number of bioinformatics services (e.g., blast). We illustrate how our automatically generated converters help to resolve data mismatches when composing workflows. We conducted an experiment on bioinformatics services and datatypes, using an implementation of our approach, as well as a survey with domain experts. The detected convertibilities and produced converters were validated as relevant from a biological point of view. Furthermore, the automatically produced graph of potentially compatible services exhibited higher connectivity than with the ad hoc approaches. Indeed, the experts discovered previously unknown possible connections.
Document type:
Journal article
Transactions on Large-Scale Data- and Knowledge-Centered Systems, Springer Berlin / Heidelberg, 2016, LNCS, 9510, pp.88-115
https://hal.archives-ouvertes.fr/hal-01485059
Contributor: Mireille Ducassé
Submitted on: Wednesday, March 8, 2017 - 10:31:09
Last modified on: Thursday, November 15, 2018 - 11:57:45

Identifiers

  • HAL Id : hal-01485059, version 1

Citation

Mouhamadou Ba, Sébastien Ferré, Mireille Ducassé. Solving Data Mismatches in Bioinformatics Workflows by Generating Data Converters. Transactions on Large-Scale Data- and Knowledge-Centered Systems, Springer Berlin / Heidelberg, 2016, LNCS, 9510, pp.88-115. 〈hal-01485059〉
