Topic segmentation: application of mathematical morphology to textual data

Sébastien Lefèvre 1 Vincent Claveau 2
2 TEXMEX - Multimedia content-based indexing
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract: Mathematical Morphology (MM) offers a generic theoretical framework for data processing and analysis. Nevertheless, it remains essentially used in the context of image analysis and processing, and attempts to use MM on other kinds of data are still quite rare. We believe MM can provide relevant solutions for data analysis and processing in a far broader range of application fields. To illustrate, we focus here on textual data and show how morphological operators (here, morphological segmentation using the watershed transform) may be applied to these data. We thus provide an original MM-based solution to the thematic segmentation problem, which is a typical problem in the fields of natural language processing and information retrieval (IR). More precisely, we consider TV broadcasts through their transcriptions obtained by automatic speech recognition. To perform topic segmentation, we compute the similarity between successive segments using a technique called vectorization, which has recently been introduced in the IR field. We then apply a gradient operator to build a topographic surface, which is segmented using the watershed transform. This new topic segmentation technique is evaluated on two corpora of TV broadcasts, on which it outperforms other existing approaches. Despite using very common morphological operators (i.e., the standard Watershed Transform), we thus show the potential interest of applying MM to non-image data.
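The pipeline summarized in the abstract (similarity between successive segments, then a topographic surface, then a watershed) can be illustrated with a minimal one-dimensional sketch. This is not the authors' implementation and makes several assumptions: a plain bag-of-words cosine similarity stands in for the vectorization technique, the dissimilarity `1 - cosine` between neighbouring sentences stands in for the gradient operator, and the watershed is reduced to its 1D form, where each watershed point is the highest point between two consecutive regional minima. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_signal(sentences):
    """Similarity between each sentence and the next one (assumed stand-in
    for the vectorization-based similarity mentioned in the abstract)."""
    bags = [Counter(s.lower().split()) for s in sentences]
    return [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]


def regional_minima(surface):
    """Return (start, end) index pairs of plateaus lower than both neighbours."""
    minima, i, n = [], 0, len(surface)
    while i < n:
        j = i
        while j + 1 < n and surface[j + 1] == surface[i]:
            j += 1  # extend the plateau of equal values
        left_higher = i == 0 or surface[i - 1] > surface[i]
        right_higher = j == n - 1 or surface[j + 1] > surface[i]
        if left_higher and right_higher:
            minima.append((i, j))
        i = j + 1
    return minima


def watershed_1d(surface):
    """1D watershed: highest point between consecutive regional minima."""
    minima = regional_minima(surface)
    lines = []
    for (_, end_a), (start_b, _) in zip(minima, minima[1:]):
        between = range(end_a + 1, start_b)
        lines.append(max(between, key=lambda k: surface[k]))
    return lines


def segment(sentences):
    """Split sentences into topic segments at the watershed points."""
    # Rounding keeps equal dissimilarities on the same plateau.
    surface = [round(1 - s, 6) for s in similarity_signal(sentences)]
    # Surface index k is the gap between sentence k and sentence k + 1.
    cuts = [k + 1 for k in watershed_1d(surface)]
    bounds = [0] + cuts + [len(sentences)]
    return [sentences[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    toy = [
        "goal match team",
        "team goal striker",
        "match striker goal",
        "rain storm forecast",
        "storm wind forecast",
        "wind rain storm",
    ]
    for topic in segment(toy):
        print(topic)
```

On the toy input, the only peak of the dissimilarity surface lies between the third and fourth sentences, so the sketch returns two segments of three sentences each. The sketch places a boundary at every peak of the surface, however shallow, and applies no filtering step.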
Document type:
Conference paper
ISMM, International Symposium on Mathematical Morphology, 2011, Italy. 2011
https://hal.archives-ouvertes.fr/hal-00643913
Contributor: Vincent Claveau
Submitted on: Wednesday, November 23, 2011 - 11:34:45
Last modified on: Friday, January 13, 2017 - 14:21:31

Identifiers

  • HAL Id : hal-00643913, version 1

Citation

Sébastien Lefèvre, Vincent Claveau. Topic segmentation: application of mathematical morphology to textual data. ISMM, International Symposium on Mathematical Morphology, 2011, Italy. 2011. <hal-00643913>
