Conference paper. Revue TAL : traitement automatique des langues. Year: 2011

Alignment of Monolingual Corpus by Reduction of the Search Space

Abstract

Monolingual comparable corpora annotated with similarity-based alignments between text segments (paragraphs, sentences, etc.) are used in a wide range of natural language processing applications, such as plagiarism detection, information retrieval, and summarization. Their main drawback is that few standard aligned corpora exist, so such corpora must be created manually, which is time-consuming and costly. In this paper, we propose a method that significantly reduces the search space for manual alignment of a monolingual comparable corpus, which in turn makes the alignment process faster and easier. The method can be used to produce alignments at different levels of text segment. Using it, we create our own gold corpus aligned at the paragraph level, which will be used for building and testing our algorithms for automatic alignment. We also present experiments on reducing the search space using stem overlap, word overlap, and the cosine similarity measure, which help us automate the process to some extent and reduce the human effort required for alignment.
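
The abstract names word overlap and cosine similarity as measures for reducing the search space but does not give the procedure, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes raw term-frequency vectors, a simple regex tokenizer, and an arbitrary cosine threshold of 0.2; the function names (`tokenize`, `word_overlap`, `cosine_similarity`, `candidate_pairs`) and the example sentences are hypothetical. Stem overlap is left out because it would need a stemmer, which the abstract does not specify.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text segment and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_overlap(a, b):
    """Count the distinct word types shared by two segments."""
    return len(set(tokenize(a)) & set(tokenize(b)))


def cosine_similarity(a, b):
    """Cosine between the raw term-frequency vectors of two segments."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(count * vb[term] for term, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    # An empty segment has a zero vector; treat its similarity as 0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_pairs(source, target, min_cosine=0.2):
    """Return the (i, j) index pairs whose cosine similarity reaches min_cosine.

    Pairs below the threshold (an assumed, tunable value) are dropped,
    so an annotator only has to inspect the remaining candidates.
    """
    return [
        (i, j)
        for i, s in enumerate(source)
        for j, t in enumerate(target)
        if cosine_similarity(s, t) >= min_cosine
    ]


# Hypothetical segments, for illustration only.
source = ["The cat sat on the mat.", "Stocks fell sharply on Monday."]
target = [
    "A cat was sitting on the mat.",
    "Markets dropped on Monday as stocks fell.",
    "Rain is expected tomorrow.",
]
pairs = candidate_pairs(source, target)
print(f"{len(pairs)} of {len(source) * len(target)} pairs kept: {pairs}")
```

On these toy segments, two of the six possible pairs survive the filter. A higher threshold shrinks the candidate set further but risks discarding true alignments, so in practice the value would have to be tuned against manually aligned data.
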
Main file
taln11_submission_101.pdf (83.21 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00609902, version 1 (20-07-2011)

Identifiers

  • HAL Id: hal-00609902, version 1

Cite

Prajol Shrestha. Alignment of Monolingual Corpus by Reduction of the Search Space. Traitement Automatique des Langues Naturelles, Jun 2011, Montpellier, France. pp. 543. ⟨hal-00609902⟩
