Alignment of Noisy Unstructured Text Data

Abstract : This paper describes a textual aligner named MEDITE whose distinguishing feature is the detection of moves. It was developed to solve a problem from textual genetic criticism, a humanities discipline that compares different versions of authors’ texts in order to highlight the invariants and differences between them. Our aligner handles this task and is general enough to handle others. The algorithm, based on the edit distance with moves, aligns duplicated character blocks with an A* heuristic algorithm. We present an experimental evaluation of our algorithm, comparing it with similar ones in four experiments. The first deals with the alignment of texts containing a large number of repetitions; we show that this is a very difficult problem. Two further experiments address duplicate linkage and text reuse detection. Finally, the algorithm is tested on synthetic data.
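
To make the notion of a "move" concrete, the following minimal sketch may help. It is not MEDITE's algorithm, which the abstract describes as based on the edit distance with moves and an A* heuristic. Instead, it assumes Python's standard `difflib.SequenceMatcher` as a stand-in aligner and then relabels any deleted block that reappears verbatim as an inserted block. The function name `align_with_moves` and the word-level tokenisation are illustrative assumptions, not taken from the paper.

```python
from difflib import SequenceMatcher


def align_with_moves(a, b):
    """Naive illustration (not MEDITE): align two token lists and report
    invariant, moved, deleted and inserted blocks as tuples of tokens."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    invariant, deleted, inserted = [], [], []

    # First pass: a classical alignment without moves.
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            invariant.append(tuple(a[i1:i2]))
        else:
            # "replace" contributes to both lists; "delete"/"insert" to one.
            if i2 > i1:
                deleted.append(tuple(a[i1:i2]))
            if j2 > j1:
                inserted.append(tuple(b[j1:j2]))

    # Second pass: a block deleted in one place and inserted verbatim
    # elsewhere is relabelled as a move.
    moved = []
    for block in list(deleted):
        if block in inserted:
            deleted.remove(block)
            inserted.remove(block)
            moved.append(block)

    return invariant, moved, deleted, inserted


old_version = "A B C D E".split()
new_version = "D E A B C".split()
invariant, moved, deleted, inserted = align_with_moves(old_version, new_version)
print("invariant:", invariant)
print("moved:", moved)
```

In this toy case the block `D E` is reported as moved rather than as one deletion plus one unrelated insertion. A sketch like this only recognises moves of blocks that the underlying aligner happens to isolate exactly, which hints at why texts with many repetitions are hard to align.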
Document type :
Conference papers
https://hal.archives-ouvertes.fr/hal-01306265
Contributor : Lip6 Publications
Submitted on : Friday, April 22, 2016 - 4:12:01 PM
Last modification on : Thursday, March 21, 2019 - 1:19:14 PM

Identifiers

  • HAL Id : hal-01306265, version 1

Citation

Julien Bourdaillet, Jean-Gabriel Ganascia. Alignment of Noisy Unstructured Text Data. 20th International Joint Conference on Artificial Intelligence (IJCAI). Workshop on Analytics for Noisy Unstructured Text Data (AND 2007), Jan 2007, Hyderabad, India. pp.139-146. ⟨hal-01306265⟩
