Grouping conversational markers across languages by exploiting large comparable corpora and unsupervised segmentation

Abstract : This work approaches Conversational and Discourse Markers (hereafter DM) from a radically data-driven perspective grounded in large comparable corpora of French, English and Taiwan Mandarin conversations. The key features of our approach are (i) to account for lexicalization as a by-product of unsupervised segmentation applied to our corpora, and (ii) to exploit simple metrics for clustering DM, both within a language and in multilingual clusters. We explore the benefits and drawbacks of such a radical approach to DM. In particular, we compare the DM clusters obtained from traditional segmentation into tokens (as given by the manual transcription of the corpora) with those obtained from unsupervised segmentation. The metrics on which we ground the clustering experiments are based on (i) the contrast between a form's distribution in short vs. longer utterances and (ii) its position within longer utterances.
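
The two metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `marker_features`, the `short_max_len` threshold, and the normalisation of position to the range 0 to 1 are assumptions made here for illustration, since the abstract does not specify them. Each utterance is assumed to be a list of already segmented units, whether these come from the manual transcription or from an unsupervised segmenter.

```python
from collections import defaultdict


def marker_features(utterances, short_max_len=2):
    """Compute two simple features per form (hypothetical sketch).

    utterances: list of utterances, each a list of segmented units.
    short_max_len: assumed cutoff; utterances with at most this many
        units count as "short", all others as "longer".

    Returns {form: (short_ratio, mean_position)} where
      short_ratio   = share of the form's occurrences in short utterances
      mean_position = mean relative position (0.0 = first unit,
                      1.0 = last unit) within longer utterances,
                      or None if the form never occurs in one.
    """
    short_count = defaultdict(int)
    total_count = defaultdict(int)
    positions = defaultdict(list)

    for units in utterances:
        n = len(units)
        if n == 0:
            continue
        is_short = n <= short_max_len
        for i, form in enumerate(units):
            total_count[form] += 1
            if is_short:
                short_count[form] += 1
            else:
                # Guard against a one-unit "longer" utterance when
                # short_max_len is set to 0.
                positions[form].append(i / (n - 1) if n > 1 else 0.0)

    features = {}
    for form, total in total_count.items():
        pos = positions[form]
        mean_pos = sum(pos) / len(pos) if pos else None
        features[form] = (short_count[form] / total, mean_pos)
    return features
```

Under these assumptions, a form that often stands alone as a short utterance and otherwise sits at the edge of longer ones would receive a high `short_ratio` and a `mean_position` near 0.0 or 1.0. Feature pairs of this kind could then be passed to any standard clustering method.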
Document type :
Conference papers

https://hal.archives-ouvertes.fr/hal-01807804
Contributor : Laurent Prévot
Submitted on : Tuesday, June 5, 2018 - 11:04:17 AM
Last modification on : Wednesday, April 3, 2019 - 2:06:24 AM
Long-term archiving on : Thursday, September 6, 2018 - 2:01:12 PM

File

BUCC2018(5).pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01807804, version 1

Citation

Laurent Prevot, Matthieu Stali, Shu-Chuan Tseng. Grouping conversational markers across languages by exploiting large comparable corpora and unsupervised segmentation. 11th Workshop on Building and Using Comparable Corpora, May 2018, Miyazaki, Japan. ⟨hal-01807804⟩
