Detection of computer generated papers in scientific literature

Abstract : Meaningless computer-generated scientific texts can be used in several ways. For example, they have allowed Ike Antkare to become one of the most highly cited scientists of the modern world. Such fake publications also appear in real scientific conferences and, as a result, in bibliographic services (Scopus, ISI-Web of Knowledge, Google Scholar, ...). Recently, more than 120 papers have been withdrawn from the subscription databases of two high-profile publishers, IEEE and Springer, because they were computer generated with the SCIgen software. This software, based on a Probabilistic Context Free Grammar (PCFG), was designed to randomly generate computer science research papers. Together with PCFG, Markov Chains (MC) are the main ways to generate meaningless texts. This paper presents the main characteristics of texts generated by PCFG and MC. For the time being, PCFG generators are quite easy to spot automatically, using intertextual distance combined with automatic clustering, because these generators behave like authors with specific features such as a very low vocabulary richness and unusual sentence structures. This shows that quantitative tools are effective at characterizing the originality (or banality) of authors' language.
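The intertextual distance mentioned in the abstract can be illustrated with a minimal sketch. It assumes a simplified form of the measure: word frequencies of the longer text are scaled down to the length of the shorter one, and the absolute frequency differences are summed and normalized to the range 0 to 1. The function name `intertextual_distance`, the whitespace tokenization, and the lowercasing are assumptions made for illustration; they are not taken from the paper, which may apply further refinements.

```python
from collections import Counter


def intertextual_distance(text_a: str, text_b: str) -> float:
    """Simplified intertextual distance between two texts, in [0, 1].

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    if not tokens_a or not tokens_b:
        raise ValueError("both texts must contain at least one token")

    # Make A the shorter text so that B is the one scaled down.
    if len(tokens_a) > len(tokens_b):
        tokens_a, tokens_b = tokens_b, tokens_a

    freq_a = Counter(tokens_a)
    freq_b = Counter(tokens_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    scale = n_a / n_b  # shrink B's frequencies to the size of A

    vocabulary = set(freq_a) | set(freq_b)
    total = sum(abs(freq_a[w] - freq_b[w] * scale) for w in vocabulary)

    # After scaling, both texts have n_a tokens, hence the 2 * n_a.
    return total / (2 * n_a)
```

Under these assumptions, two identical texts are at distance 0 and two texts with no word in common are at distance 1. A matrix of such pairwise distances could then be passed to a clustering procedure, which is the combination the abstract describes.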
Document type :
Book sections
Contributor : Cyril Labbé
Submitted on : Tuesday, March 24, 2015 - 8:44:30 AM
Last modification on : Monday, February 11, 2019 - 4:36:02 PM
Document(s) archived on : Monday, April 17, 2017 - 10:55:05 PM


Files produced by the author(s)


  • HAL Id : hal-01134598, version 1


Cyril Labbé, Dominique Labbé, François Portet. Detection of computer generated papers in scientific literature. Mirko Degli Esposti; Eduardo G. Altmann; François Pachet. Creativity and Universality in Language, 2016. ⟨hal-01134598⟩


