Journal articles

Solving Data Mismatches in Bioinformatics Workflows by Generating Data Converters

Abstract : Heterogeneity of data and data formats in bioinformatics entails mismatches between the inputs and outputs of different services, making it difficult to compose them into workflows. To reduce those mismatches, bioinformatics platforms provide ad hoc converters, called shims. When shims are written by hand, they are time-consuming to develop and cannot anticipate all needs. When shims are automatically generated, they miss transformations, for example data composition from multiple parts, or parallel conversion of list elements. This article proposes to systematically detect convertibility from output types to input types. Convertibility detection relies on a rule system based on abstract types, close to XML Schema. Types make it possible to abstract data while precisely accounting for their composite structure. Detection is accompanied by the automatic generation of converters between input and output XML data. We show the applicability of our approach by abstracting concrete bioinformatics types (e.g., complex biosequences) for a number of bioinformatics services (e.g., blast). We illustrate how our automatically generated converters help to resolve data mismatches when composing workflows. We conducted an experiment on bioinformatics services and datatypes, using an implementation of our approach, as well as a survey with domain experts. The detected convertibilities and produced converters were validated as relevant from a biological point of view. Furthermore, the automatically produced graph of potentially compatible services exhibited a higher connectivity than with the ad hoc approaches. Indeed, the experts discovered previously unknown possible connections.
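
The idea of rule-based convertibility detection over composite types can be illustrated with a minimal sketch. This is not the rule system of the article: the type constructors (`Atom`, `ListOf`, `Record`), the function `find_converter`, and the four rules below are hypothetical and chosen only for illustration, and the sketch works on plain Python lists and dictionaries rather than XML data. It shows how detecting convertibility and producing a converter can go hand in hand, including element-wise conversion of lists and record conversion by field.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Atom:
    name: str  # hypothetical atomic type, e.g. "dna" or "string"


@dataclass(frozen=True)
class ListOf:
    elem: object  # type of the list elements


@dataclass(frozen=True)
class Record:
    fields: Tuple[Tuple[str, object], ...]  # ordered (field name, field type) pairs


Converter = Callable[[object], object]


def find_converter(src, dst) -> Optional[Converter]:
    """Return a function converting values of type src to type dst, or None."""
    # Rule 1 (illustrative): identical types convert by identity.
    if src == dst:
        return lambda v: v
    # Rule 2 (illustrative): list to list, converting each element independently.
    if isinstance(src, ListOf) and isinstance(dst, ListOf):
        conv = find_converter(src.elem, dst.elem)
        if conv is not None:
            return lambda vs: [conv(v) for v in vs]
    # Rule 3 (illustrative): record to record, each target field is built
    # from the source field of the same name; extra source fields are dropped.
    if isinstance(src, Record) and isinstance(dst, Record):
        src_fields = dict(src.fields)
        parts = {}
        for name, ftype in dst.fields:
            if name not in src_fields:
                return None
            conv = find_converter(src_fields[name], ftype)
            if conv is None:
                return None
            parts[name] = conv
        return lambda rec: {name: conv(rec[name]) for name, conv in parts.items()}
    # Rule 4 (illustrative): a single value can be wrapped into a one-element list.
    if isinstance(dst, ListOf):
        conv = find_converter(src, dst.elem)
        if conv is not None:
            return lambda v: [conv(v)]
    # No rule applies: the types are not convertible in this sketch.
    return None


# Usage: a list of annotated sequences converted to a list of plain sequences.
annotated = ListOf(Record((("id", Atom("string")),
                           ("sequence", Atom("dna")),
                           ("quality", Atom("string")))))
plain = ListOf(Record((("id", Atom("string")), ("sequence", Atom("dna")))))
to_plain = find_converter(annotated, plain)
```

In this sketch a return value of `None` means that no rule detects convertibility, while a non-`None` result is itself the generated converter, so detection and generation share the same recursive traversal of the type structure.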

Cited literature [37 references]

https://hal.archives-ouvertes.fr/hal-01485059
Contributor : Mireille Ducassé
Submitted on : Monday, January 13, 2020 - 6:30:41 PM
Last modification on : Tuesday, March 10, 2020 - 10:10:03 AM
Document(s) archived on : Tuesday, April 14, 2020 - 7:16:48 PM

File

paper Ba.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01485059, version 1

Citation

Mouhamadou Ba, Sébastien Ferré, Mireille Ducassé. Solving Data Mismatches in Bioinformatics Workflows by Generating Data Converters. Transactions on Large-Scale Data- and Knowledge-Centered Systems, Springer Berlin / Heidelberg, 2016, LNCS, 9510, pp.88-115. ⟨hal-01485059⟩
