Ten Years of Knowledge Harvesting: Lessons and Challenges
Abstract
This article is a retrospective on the theme of knowledge harvesting: automatically constructing large high-quality knowledge bases from Internet sources. We draw on our experience in the Yago-Naga project over the last decade, but consider other projects as well. The article discusses lessons learned on the architecture of a knowledge harvesting system, and points out open challenges and research opportunities.

1 Large High-Quality Knowledge Bases

Turning Internet content, with its wealth of latent-value but noisy text and data sources, into crisp "machine knowledge" that can power intelligent applications is a long-standing goal of computer science. Over the last ten years, knowledge harvesting has made tremendous progress, leveraging advances in scalable information extraction and the availability of curated knowledge-sharing sources such as Wikipedia. Unlike the seminal projects on manually crafted knowledge bases and ontologies, such as Cyc [27] and WordNet [14], knowledge harvesting is automated and operates at Web scale. Automatically constructed knowledge bases – KBs for short – have become a powerful asset for search, analytics, recommendations, and data integration, with intensive use at big industrial stakeholders. Prominent examples are the Google Knowledge Graph, Facebook's Graph Search, and Microsoft Satori, as well as domain-specific knowledge bases in business, finance, life sciences, and more. These achievements are rooted in academic research and community projects started ten years ago, most notably DBpedia [2], Freebase [5], KnowItAll [13], WikiTaxonomy [34], and Yago [41]. More recent major projects along these lines include BabelNet [31], ConceptNet [40], DeepDive [39], EntityCube (a.k.a. Renlifang) [33], Knowledge Vault [9], NELL [6], Probase [50], Wikidata [47], and XLore [48]. The largest of the KBs from these projects contain many millions of entities (i.e., people, places, products, etc.) and billions of facts about them (i.e., attribute values and relationships with other entities). Moreover, entities are organized into a taxonomy of semantic classes, sometimes with hundreds of thousands of fine-grained types. All this is often represented in the form of subject-predicate-object (SPO) triples, following the RDF data model, and some of the KBs – most notably DBpedia – are central to the Web of Linked Open Data [18]. For illustration, here are some examples of SPO triples about Steve Jobs:
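The kind of SPO triples meant here can be sketched in a few lines of Python; the entity and predicate names below are illustrative assumptions in the spirit of the RDF data model, not identifiers from any particular KB.

```python
# Illustrative SPO (subject-predicate-object) triples about Steve Jobs.
# Entity and predicate names are made up for illustration only.
triples = [
    ("SteveJobs", "type", "entrepreneur"),
    ("SteveJobs", "bornOn", "1955-02-24"),
    ("SteveJobs", "founded", "AppleInc"),
    ("SteveJobs", "diedIn", "PaloAlto"),
]

def query(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Triple-pattern lookup, the basic access primitive over such a KB:
print(query(triples, s="SteveJobs", p="founded"))
# -> [('SteveJobs', 'founded', 'AppleInc')]
```

In a real KB the subjects, predicates, and objects would be IRIs (with literals for attribute values such as dates), and pattern queries like the one above are what query languages such as SPARQL generalize.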