Conference paper, Year: 2013

Any Language Early Detection of Epidemic Diseases from Web News Streams

Abstract

In this paper, we introduce a multilingual epidemiological news surveillance system. Its main contribution is its ability to extract epidemic events in any language, thereby succeeding where state-of-the-art surveillance systems usually fail: reactivity. Most systems focus on a selected list of languages deemed important. However, evidence shows that events are first described in the local language and translated into other languages later, if and only if they contain important information. Hence, while systems handling only a sample of human languages may succeed at extracting epidemic events, they will only do so after someone else has detected the importance of the news and decided to translate it. When events are first described in other languages, such automated systems can only detect events that humans have already detected, and are therefore essentially irrelevant for early detection. To overcome this weakness of the state of the art in terms of reactivity, we designed a system that can detect epidemiological events in any language without requiring any translation, whether automated or human-written. The solution presented in this paper relies on properties that may be called language universals. First, we observe and exploit properties of the news genre that remain unchanged whatever the language of writing. Second, we handle language variations, such as declensions, by processing text at the character level rather than at the word level. This additionally makes it possible to handle various writing systems in a similar fashion. We present experiments with 5 languages, stereotypical of different language families and writing systems: English, Chinese, Greek, Polish and Russian. Our system, DAnIEL, achieves an average F-measure of around 85%, slightly below top-performing systems for the languages that such systems are able to handle. However, its performance is superior for morphologically rich languages. It also, of course, performs infinitely better for the languages that other systems cannot handle: the richest system in the state of the art handles around 10 languages, while there are about 6,000 languages in the world, 300 of which are spoken by more than one million people. The DAnIEL system is able to process each of them.
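
To illustrate why character-level processing can cope with declensions and with writing systems that do not separate words, here is a minimal Python sketch. It is not the DAnIEL implementation, which the abstract does not detail. The function names, the longest-common-substring heuristic and the 0.8 threshold are all assumptions chosen for illustration.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous run of characters shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def char_match_ratio(term: str, text: str) -> float:
    """Share of the term's characters covered by its best match in the text."""
    if not term:
        return 0.0
    common = longest_common_substring(term.lower(), text.lower())
    return len(common) / len(term)


def mentions(term: str, text: str, threshold: float = 0.8) -> bool:
    """Hypothetical decision rule: the threshold is an illustrative assumption."""
    return char_match_ratio(term, text) >= threshold


# Polish: the lexicon holds the nominative "grypa" (flu); the text uses the genitive "grypy".
polish_text = "nowe przypadki grypy"
print("grypa" in polish_text.split())   # word-level exact match misses the inflected form
print(mentions("grypa", polish_text))   # character-level match finds the shared stem "gryp"

# Chinese: no whitespace between words, so no tokenizer is assumed at all.
print(mentions("流感", "中国报告新的流感病例"))
```

The same function handles both examples without any language-specific tokenizer or stemmer, which is the property the abstract attributes to working at the character level.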

Dates and versions

hal-01073195, version 1 (09-10-2014)

Identifiers

Cite

Romain Brixtel, Gaël Lejeune, Antoine Doucet, Nadine Lucas. Any Language Early Detection of Epidemic Diseases from Web News Streams. Healthcare Informatics (ICHI), 2013 IEEE International Conference on, Sep 2013, Philadelphia, United States. pp. 159-168, ⟨10.1109/ICHI.2013.94⟩. ⟨hal-01073195⟩