Ten Years of Knowledge Harvesting: Lessons and Challenges - Archive ouverte HAL
Journal Article, Bulletin of the Technical Committee on Data Engineering, Year: 2016

Ten Years of Knowledge Harvesting: Lessons and Challenges

Gerhard Weikum
  • Role: Author
  • PersonId : 1023122
Johannes Hoffart
Fabian Suchanek

Abstract

This article is a retrospective on the theme of knowledge harvesting: automatically constructing large high-quality knowledge bases from Internet sources. We draw on our experience in the Yago-Naga project over the last decade, but consider other projects as well. The article discusses lessons learned on the architecture of a knowledge harvesting system, and points out open challenges and research opportunities.

1 Large High-Quality Knowledge Bases

Turning Internet content, with its wealth of latent-value but noisy text and data sources, into crisp "machine knowledge" that can power intelligent applications is a long-standing goal of computer science. Over the last ten years, knowledge harvesting has made tremendous progress, leveraging advances in scalable information extraction and the availability of curated knowledge-sharing sources such as Wikipedia. Unlike the seminal projects on manually crafted knowledge bases and ontologies, such as Cyc [27] and WordNet [14], knowledge harvesting is automated and operates at Web scale. Automatically constructed knowledge bases – KB's for short – have become a powerful asset for search, analytics, recommendations, and data integration, with intensive use by major industrial stakeholders. Prominent examples are the Google Knowledge Graph, Facebook's Graph Search, and Microsoft Satori, as well as domain-specific knowledge bases in business, finance, life sciences, and more. These achievements are rooted in academic research and community projects starting ten years ago, most notably DBpedia [2], Freebase [5], KnowItAll [13], WikiTaxonomy [34], and Yago [41]. More recent major projects along these lines include BabelNet [31], ConceptNet [40], DeepDive [39], EntityCube (a.k.a. Renlifang) [33], KnowledgeVault [9], Nell [6], Probase [50], Wikidata [47], and XLore [48]. The largest of the KB's from these projects contain many millions of entities (i.e., people, places, products, etc.) and billions of facts about them (i.e., attribute values and relationships with other entities). Moreover, entities are organized into a taxonomy of semantic classes, sometimes with hundreds of thousands of fine-grained types. All this is often represented in the form of subject-predicate-object (SPO) triples, following the RDF data model, and some of the KB's – most notably DBpedia – are central to the Web of Linked Open Data [18]. For illustration, here are some examples of SPO triples about Steve Jobs:
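The paper's own example triples appear in the full PDF (debull-2016.pdf); as a minimal sketch of the SPO representation and how such triples can be queried, the Python snippet below encodes a few well-known facts about Steve Jobs. The entity and predicate names (SteveJobs, foundedCompany, bornIn) and the objects() helper are illustrative assumptions for this sketch, not the actual vocabulary of Yago, DBpedia, or any other KB.

# Illustrative SPO triples about Steve Jobs, stored as Python tuples.
# Entity and predicate names are made up for this sketch and do not
# follow the vocabulary of any particular knowledge base.
triples = [
    ("SteveJobs", "type",           "entrepreneur"),
    ("SteveJobs", "bornIn",         "SanFrancisco"),
    ("SteveJobs", "foundedCompany", "AppleInc"),
    ("SteveJobs", "foundedCompany", "Pixar"),
]

def objects(subject, predicate):
    """Return all objects O such that (subject, predicate, O) is a known fact."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("SteveJobs", "foundedCompany"))  # ['AppleInc', 'Pixar']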
Main file
debull-2016.pdf (295.54 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01699054, version 1 (01-02-2018)

Identifiers

  • HAL Id: hal-01699054, version 1

Cite

Gerhard Weikum, Johannes Hoffart, Fabian Suchanek. Ten Years of Knowledge Harvesting: Lessons and Challenges. Bulletin of the Technical Committee on Data Engineering, 2016. ⟨hal-01699054⟩
