Ten Years of Knowledge Harvesting: Lessons and Challenges
Abstract
This article is a retrospective on the theme of knowledge harvesting: automatically constructing large high-quality knowledge bases from Internet sources. We draw on our experience in the Yago-Naga project over the last decade, but consider other projects as well. The article discusses lessons learned on the architecture of a knowledge harvesting system, and points out open challenges and research opportunities.

1 Large High-Quality Knowledge Bases

Turning Internet content, with its wealth of latent-value but noisy text and data sources, into crisp "machine knowledge" that can power intelligent applications is a long-standing goal of computer science. Over the last ten years, knowledge harvesting has made tremendous progress, leveraging advances in scalable information extraction and the availability of curated knowledge-sharing sources such as Wikipedia. Unlike the seminal projects on manually crafted knowledge bases and ontologies, such as Cyc [27] and WordNet [14], knowledge harvesting is automated and operates at Web scale. Automatically constructed knowledge bases – KBs for short – have become a powerful asset for search, analytics, recommendations, and data integration, with intensive use at big industrial stakeholders. Prominent examples are the Google Knowledge Graph, Facebook's Graph Search, and Microsoft Satori, as well as domain-specific knowledge bases in business, finance, life sciences, and more. These achievements are rooted in academic research and community projects started ten years ago, most notably DBpedia [2], Freebase [5], KnowItAll [13], WikiTaxonomy [34], and Yago [41]. More recent major projects along these lines include BabelNet [31], ConceptNet [40], DeepDive [39], EntityCube (a.k.a. Renlifang) [33], Knowledge Vault [9], NELL [6], Probase [50], Wikidata [47], and XLore [48]. The largest of the KBs from these projects contain many millions of entities (i.e., people, places, products, etc.) and billions of facts about them (i.e., attribute values and relationships with other entities). Moreover, entities are organized into a taxonomy of semantic classes, sometimes with hundreds of thousands of fine-grained types. All this is often represented in the form of subject-predicate-object (SPO) triples, following the RDF data model, and some of the KBs – most notably DBpedia – are central to the Web of Linked Open Data [18]. For illustration, here are some examples of SPO triples about Steve Jobs:
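The kind of SPO triples meant here can be sketched in a few lines of Python; the entity and predicate names below are illustrative assumptions in the spirit of the RDF data model, not identifiers from any particular KB.

```python
# Illustrative SPO (subject-predicate-object) triples about Steve Jobs.
# Entity and predicate names are made up for illustration only.
triples = [
    ("SteveJobs", "type", "entrepreneur"),
    ("SteveJobs", "bornOn", "1955-02-24"),
    ("SteveJobs", "founded", "AppleInc"),
    ("SteveJobs", "diedIn", "PaloAlto"),
]

def query(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Triple-pattern lookup, the basic access primitive over such a KB:
print(query(triples, s="SteveJobs", p="founded"))
# -> [('SteveJobs', 'founded', 'AppleInc')]
```

In a real KB the subjects, predicates, and objects would be IRIs (with literals for attribute values such as dates), and pattern queries like the one above are what query languages such as SPARQL generalize.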