Compression de textes en langue naturelle

Abstract : In this Ph.D. thesis we investigate several data compression methods for natural-language text. Our study focuses on algorithms that use the word as the basic unit, usually called word-based text compression algorithms. We have developed algorithms that reduce the original size of the text by an average factor of 3.5 and keep, by means of an index, direct access to the compressed form of the text. The set of words of a text (the lexicon) is not known a priori. Efficient compression of the text therefore requires efficient compression of its lexicon. For this purpose, we have developed a compact representation of the lexicon that, through the application of Markov-chain-based compression algorithms, yields very high compression rates. Early algorithms dedicated to compressing natural-language text were designed to process very large text databases, in which the lexicon is very small relative to the data. Our algorithms can also be applied to everyday text sizes (from about fifty KB up to a few MB), for which the lexicon accounts for a significant part of the size of the text.
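
To make the word-based idea in the abstract concrete, here is a minimal Python sketch. It is not the thesis's algorithm: the function names (`tokenize`, `encode`, `decode`), the regular-expression tokenizer, and the use of plain integer identifiers are all assumptions chosen for illustration. It only shows how a text can be split into words and separators, how a lexicon is built as the text is read (since it is not known a priori), and how a sequence of identifiers supports access to a token by position.

```python
import re


def tokenize(text):
    # Split the text into alternating runs of word and non-word characters,
    # so that concatenating the tokens gives back the original text exactly.
    return re.findall(r"\w+|\W+", text)


def encode(text):
    # Build the lexicon on the fly and replace every token by its
    # integer identifier (assumed scheme, not the thesis's coding).
    lexicon = {}  # token -> identifier, in order of first appearance
    ids = []
    for token in tokenize(text):
        if token not in lexicon:
            lexicon[token] = len(lexicon)
        ids.append(lexicon[token])
    return list(lexicon), ids


def decode(lexicon, ids):
    # Look every identifier up in the lexicon and concatenate the tokens.
    return "".join(lexicon[i] for i in ids)


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    lexicon, ids = encode(sample)
    assert decode(lexicon, ids) == sample
    # Access by position: the token at index 4 is read without decoding the rest.
    print(lexicon[ids[4]])
```

In this sketch the output has two parts, the lexicon and the identifier sequence. For a short text the lexicon can be a large share of the total, which is the situation the abstract describes for everyday text sizes.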
Contributor : Claude Martineau
Submitted on : Friday, March 22, 2019 - 11:45:06 AM
Last modification on : Friday, May 10, 2019 - 5:46:48 PM




  • HAL Id : tel-02076650, version 1


Claude Martineau. Compression de textes en langue naturelle. Informatique et langage [cs.CL]. Université de Marne-la-Vallée, 2001. Français. ⟨NNT : 2001MARN0123⟩. ⟨tel-02076650⟩


