Leveraging lexical cohesion and disruption for topic segmentation

Anca-Roxana Simon 1, Guillaume Gravier 1, Pascale Sébillot 1
1 TEXMEX - Multimedia content-based indexing
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract : Topic segmentation classically relies on one of two criteria: finding areas with coherent vocabulary use, or detecting discontinuities. In this paper, we propose a segmentation criterion that combines lexical cohesion and disruption and enables a trade-off between the two. We provide the mathematical formulation of the criterion and an efficient graph-based decoding algorithm for topic segmentation. Experimental results on standard textual data sets and on a more challenging corpus of automatically transcribed broadcast news shows demonstrate the benefit of such a combination. Gains were observed in all conditions, with segments of either regular or varying length and abrupt or smooth topic shifts. Long segments benefit more than short segments. However, the algorithm has proven robust on automatic transcripts with short segments and limited vocabulary reoccurrences.
Document type :
Conference papers

Cited literature [26 references]

https://hal.archives-ouvertes.fr/hal-00867011
Contributor : Pascale Sébillot
Submitted on : Friday, September 27, 2013 - 3:11:50 PM
Last modification on : Friday, November 16, 2018 - 1:25:11 AM
Long-term archiving on : Saturday, December 28, 2013 - 4:31:46 AM

File

emnlp.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00867011, version 1

Citation

Anca-Roxana Simon, Guillaume Gravier, Pascale Sébillot. Leveraging lexical cohesion and disruption for topic segmentation. International Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, Oct 2013, Seattle, United States. pp. 1314-1324. ⟨hal-00867011⟩
