Theses

Unsupervised Word Segmentation and Wordhood Assessment: The case for Mandarin Chinese

Abstract : This dissertation addresses the question of wordhood and unsupervised word identification in written corpora, with a focus on Modern Standard Chinese (MSC). The first part discusses the linguistic aspects of the question. It reviews previous work on the notion of "word" in MSC and in the Chinese script and relates it to issues in general linguistics, especially that of Multiword Expressions. It then sketches the development of Chinese Word Segmentation in NLP and its traditional evaluation procedures. We argue that a degree of arbitrariness in corpus annotation biases evaluations in favor of supervised machine learning methods, which are less relevant to linguistic studies than unsupervised ones. This first part advocates a corpus-based definition of minimal units, based on a measure of the combinatoric autonomy of a form and its degree of membership in a distributional class. The second part presents a new unsupervised learning method for estimating this autonomy, inspired by Harris's theories. With a simple and fast segmentation algorithm based solely on this measure, we already achieve near state-of-the-art performance on the task of Unsupervised Chinese Word Segmentation. We discuss the importance of pre-processing and report experiments on the use of the Minimum Description Length (MDL) paradigm in unsupervised segmentation. Finally, we provide a refined methodology and tools for a qualitative evaluation of our output, and results on languages other than MSC.
Cited literature [115 references]

https://hal.archives-ouvertes.fr/tel-01573561
Contributor : Pierre Magistry
Submitted on : Wednesday, August 9, 2017 - 8:36:57 PM
Last modification on : Friday, March 27, 2020 - 3:14:56 AM

Licence


Distributed under a Creative Commons Attribution - NoDerivatives 4.0 International License

Identifiers

  • HAL Id : tel-01573561, version 1

Citation

Pierre Magistry. Unsupervised Word Segmentation and Wordhood Assessment: The case for Mandarin Chinese. Linguistics. Paris Diderot; Inria, 2013. English. ⟨NNT : 2013PA070077⟩. ⟨tel-01573561⟩

Metrics

  • Record views : 409
  • File downloads : 355