Automatic Compilation of Comparable Corpora
Abstract
Comparable corpora have proven to be a valuable alternative to scarce parallel corpora in various Natural Language Processing tasks. Many researchers have therefore stressed both the need for large quantities of such corpora and the scarcity of work on their compilation. Our purpose in this paper is to address this issue by using a cross-language information retrieval (CLIR) based method for the automatic acquisition of French-English comparable documents. At the start of the process, source documents are translated and their most representative terms are extracted. The resulting keyword list is then enlarged with synonyms, on the assumption that keyword expansion might improve the retrieval of comparable documents. Retrieval is performed on the indexed target collection, followed by a filtering step based mainly on temporal information and document length. The results are fair and suggest that using an ontology may improve the performance of the system.
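The stages named in the abstract (translation, keyword extraction, synonym expansion, retrieval, filtering on date and length) can be illustrated with a minimal Python sketch. This is not the authors' system: the word-by-word dictionary lookup, the frequency-based keyword selection, the overlap score, and every name and threshold (`comparable_documents`, `k`, `max_days`, `max_len_ratio`) are assumptions made only for illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: str
    text: str
    published: date


# Tiny illustrative stopword list (assumption, not from the paper).
STOPWORDS = {"the", "a", "of", "in", "and", "to", "for", "were"}


def tokenize(text):
    # Lowercase and keep runs of letters only (Unicode-aware).
    return re.findall(r"[^\W\d_]+", text.lower())


def translate(tokens, bilingual):
    # Word-by-word dictionary lookup; unknown words are dropped.
    return [bilingual[t] for t in tokens if t in bilingual]


def top_keywords(tokens, k):
    # Most frequent non-stopword terms stand in for "most representative".
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]


def expand(keywords, synonyms):
    # Enlarge the keyword list with synonyms from a hypothetical resource.
    expanded = set(keywords)
    for word in keywords:
        expanded.update(synonyms.get(word, []))
    return expanded


def retrieve(query, collection):
    # Rank target documents by how many query terms they contain.
    scored = []
    for doc in collection:
        overlap = len(query & set(tokenize(doc.text)))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


def keep(source, candidate, max_days, max_len_ratio):
    # Filter on publication-date distance and on document length ratio.
    days = abs((source.published - candidate.published).days)
    len_s = len(tokenize(source.text))
    len_c = len(tokenize(candidate.text))
    ratio = max(len_s, len_c) / max(1, min(len_s, len_c))
    return days <= max_days and ratio <= max_len_ratio


def comparable_documents(source, collection, bilingual, synonyms,
                         k=5, max_days=7, max_len_ratio=3.0):
    translated = translate(tokenize(source.text), bilingual)
    query = expand(top_keywords(translated, k), synonyms)
    return [doc for doc in retrieve(query, collection)
            if keep(source, doc, max_days, max_len_ratio)]
```

In this sketch the synonym step only widens the query; whether that helps or hurts precision depends on the synonym resource, which is the question the abstract raises.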
Origin: Files produced by the author(s)