Journal articles

The NLP4NLP Corpus (II): 50 Years of Research in Speech and Language Processing

Abstract : The NLP4NLP corpus contains articles published in 34 major conferences and journals in the field of speech and natural language processing over a period of 50 years (1965–2015), comprising 65,000 documents, gathering 50,000 authors, including 325,000 references and representing ~270 million words. This paper presents an analysis of this corpus regarding the evolution of the research topics, with the identification of the authors who introduced them and of the publication where they were first presented, and the detection of epistemological ruptures. Linking the metadata, the paper content and the references allowed us to propose a measure of innovation for the research topics, the authors and the publications. In addition, it allowed us to study the use of language resources, in the framework of the paradigm shift between knowledge-based approaches and content-based approaches, and the reuse of articles and plagiarism between sources over time. Numerous manual corrections were necessary, which demonstrated the importance of establishing standards for uniquely identifying authors, articles, resources or publications.
Contributor : Limsi Publications
Submitted on : Monday, December 16, 2019 - 12:41:44 PM
Last modification on : Monday, February 10, 2020 - 6:14:09 PM


  • HAL Id : hal-02413749, version 1


Joseph Mariani, Gil Francopoulo, Patrick Paroubek, Frédéric Vernier. The NLP4NLP Corpus (II): 50 Years of Research in Speech and Language Processing. Frontiers in Research Metrics and Analytics, Frontiers Media, 2019, 3, pp.1-30. ⟨hal-02413749⟩


