Availability of Cultural Heritage Structured Metadata in the World Wide Web

We would like to acknowledge the supporting work by Antoine Isaac and Valentine Charles, from the Europeana Foundation, for their reviews and discussions regarding our work. This work was partially supported by Portuguese national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013, and by the European Commission under contract number 30-CE-0885387/00-80.


Introduction
On the World Wide Web, a very large number of online cultural heritage (CH) resources are made available through digital library websites. The discoverability of these resources through Internet search engines is still a challenge. Many CH resources are not textual (e.g., images, video or sound). Those that are textual often lack the machine-readable full text on which search engines depend heavily, because they consist of digitized images to which optical character recognition (OCR) was not applied, either for lack of funding or because no mature OCR technology is available (e.g., for manuscripts or early printed materials). For discoverability, CH institutions have always relied on the creation of data records describing the resources.

These descriptive records are the basis for accessing and retrieving the resources through each institution's digital library website, which is specifically built for retrieval of this kind of data. The existence of many individual digital libraries, maintained by different organizations, brings challenges to the discoverability and usage of the resources by potential users, making the adequate indexing of cultural heritage metadata in Internet search engines even more relevant.
Across institutions, the discoverability problem is addressed by an organizational architecture based on a central organization (a role often, but not always, fulfilled by a CH institution). These organizations approach the discoverability of the resources by collecting their associated descriptive metadata records. The central organization can further promote the usage of the resources by means that cannot be efficiently undertaken by each digital library in isolation. It typically provides a Web portal with a CH-focused search engine, also specifically built for this kind of data record [1].
In the particular domain of CH, the data aggregation technologies in use are not the same as those used by Internet search engines. OAI-PMH [2] has been the aggregation solution embraced by the domain, since it is highly specialized in fulfilling the requirements for the aggregation of metadata datasets. However, the technological landscape around our domain has changed. Nowadays, with the improvements achieved in network communications, computational capacity, Internet search engines, and semantic data interoperability, the motivation for adopting OAI-PMH is not as clear as it used to be in CH [3].
In recent years, the CH domain has been able to create sustainable aggregation initiatives with self-sustaining business models. Examples are Europeana, DPLA, DigitalNZ, Trove and Digital Library of India, which collect and provide access to the public digitized cultural assets from Europe, United States of America, New Zealand, Australia and India, respectively. However, the costs of implementing the technical solution for aggregation are high for data providers. For these aggregation initiatives, reducing the effort required from data providers would bring more participants to their networks and lower the overall costs, therefore increasing the sustainability of the whole network [7]. In this context, if cultural heritage aggregators were able to re-use the technological solutions used for indexing by Internet search engines, data providers could benefit from several advantages. In particular, it would give data providers the following motivations:
• For those already implementing these technologies in their digital libraries, the process for sharing their data with CH aggregators would become extremely simple.
• For those that do not yet have these technologies in use, implementing the technical requirements for CH aggregation would be more rewarding, since discoverability through Internet search engines would come as a valuable extra benefit.

This paper presents a study of the current application by CH data providers of technological solutions in use for making structured data (or metadata, in the CH context) available for re-use on the Internet. We investigated the use of both linked data and technologies related to the indexing of resources by Internet search engines. We conducted a harvesting experiment on the landing pages of the websites of CH digital libraries that participate in Europeana, and collected statistics about the usage of these particular technologies. These technologies allow structured data to be represented within HTML, or to be referred to by links within HTML or through HTTP headers. An analysis and discussion of the collected statistics is also presented.
We conclude with a discussion, based on the outcomes of this study, regarding future work for establishing a solution for CH aggregation based on the current CH scenario and the available technologies.
Although the use of linked data in CH has been the focus of much research, most of the published literature mainly addresses the publication of linked data [11] [12] [13] and does not fully address how the common aggregation approach of CH can be based on the existing published CH linked data.
The work most similar to ours is that of the Dutch Digital Heritage Network (NDE) [9] and the Research and Education Space project (RES). NDE is a Dutch national-level program aiming to increase the social value of the collections maintained by the libraries, archives and museums in the Netherlands. NDE is still an ongoing project, and its initial proposals are based on specific APIs that enable data providers to centrally register the linked data URIs of their resources [10]. Because the current NDE proposal is based on its own API, it does not yet provide a solution based purely on linked data.
The Research and Education Space project ended in 2017, but its results are still available. It successfully aggregated a considerable number of linked data resources from CH sources. The resulting aggregated dataset can be accessed online, but an evaluation of its aggregation procedures and results was not published. From the available technical documentation [19], we can see that RES made significant progress in the specification of key tasks to enable the aggregation of linked data. Some tasks, however, were not fully specified by the end of the project, and no further information has been published since.
Generic technical solutions have been proposed by others for enabling aggregation of linked data (for example [14]). However, a standards-based approach has not yet been put into practice within CH.
The work presented in this paper was done in the context of the research activities being carried out within the Europeana Network to improve the network's efficiency and sustainability [7]. Linked data was identified in our past work as one of the technical solutions with potential for application [1]. The work described in this paper is part of a series of experiments addressing several Internet technologies for this purpose [15] [16].

The experimental setup
In our harvesting experiment on the landing pages of resources from Europeana data providers, we harvested samples from 31 different sources. To set up this test sample, we used the Europeana Search and Record APIs. The Search API was used first to discover the existing data providers of Europeana and their collections. Afterwards, in a second set of requests, the Search API was used to discover a list of records from each collection. In subsequent requests, made to the Record API, we requested the complete metadata records of a sample of records per collection of each data provider. At most 100 records per collection were obtained. From these records we collected the URLs of the landing pages on the data providers' digital libraries. The metadata records were obtained in the Europeana Data Model (EDM) [8] format, and the URLs were obtained from the EDM isShownAt property of the ORE Aggregation element. In total, the sample comprised URLs from 31 data providers, 609 collections and 52,866 resources.
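The last step of this setup, collecting the landing-page URL from each EDM record, can be illustrated with a minimal sketch. It assumes the record is available as EDM RDF/XML text; the function name and the sample record are hypothetical and only serve as an illustration, not as the code used in the experiment.

```python
import xml.etree.ElementTree as ET

# Published namespace URIs for RDF, ORE and EDM.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ore": "http://www.openarchives.org/ore/terms/",
    "edm": "http://www.europeana.eu/schemas/edm/",
}
RDF_RESOURCE = "{%s}resource" % NS["rdf"]


def landing_page_urls(edm_rdf_xml):
    """Return the edm:isShownAt URLs found in ore:Aggregation elements."""
    root = ET.fromstring(edm_rdf_xml)
    urls = []
    for aggregation in root.iter("{%s}Aggregation" % NS["ore"]):
        for shown_at in aggregation.findall("edm:isShownAt", NS):
            url = shown_at.get(RDF_RESOURCE)
            if url:
                urls.append(url)
    return urls


# Hypothetical record, for illustration only.
SAMPLE_RECORD = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ore="http://www.openarchives.org/ore/terms/"
    xmlns:edm="http://www.europeana.eu/schemas/edm/">
  <ore:Aggregation rdf:about="http://example.org/aggregation/1">
    <edm:isShownAt rdf:resource="http://example.org/library/item/1"/>
  </ore:Aggregation>
</rdf:RDF>"""

print(landing_page_urls(SAMPLE_RECORD))
```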
We issued two requests to each of the 52,866 landing pages: one request for the human-readable version in HTML, and a second request for a machine-readable representation of the resource using HTTP content negotiation [5]. We then processed the responses and collected statistics on the usage of three possible ways in which these digital libraries could be encoding the metadata descriptions of the cultural heritage objects: HTML5 meta tags, RDFa/RDFa Lite, and RDF data (in any of the commonly used serialization formats). For the analysis of the HTML5 meta tags, we excluded the standard HTTP tags, since they are not meant to provide any descriptive data regarding the content of HTML pages.
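The two requests per landing page can be sketched as follows. This is an illustration under stated assumptions rather than the harvester used in the experiment: the list of RDF media types, the function names and the example URL are our own choices, and no request is actually sent.

```python
import urllib.request

# Media types of commonly used RDF serializations (an illustrative list).
RDF_MEDIA_TYPES = {
    "application/rdf+xml",
    "text/turtle",
    "application/ld+json",
    "application/n-triples",
}


def build_requests(url):
    """Build two requests: one for HTML, one negotiating for RDF."""
    html_req = urllib.request.Request(url, headers={"Accept": "text/html"})
    rdf_accept = ", ".join(sorted(RDF_MEDIA_TYPES))
    rdf_req = urllib.request.Request(url, headers={"Accept": rdf_accept})
    return html_req, rdf_req


def classify_content_type(header_value):
    """Map a Content-Type header value to 'rdf', 'html', 'json' or 'other'."""
    media_type = header_value.split(";")[0].strip().lower()
    if media_type in RDF_MEDIA_TYPES:
        return "rdf"
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html"
    if media_type == "application/json":
        return "json"
    return "other"


# Hypothetical landing page URL; the requests are built but not sent.
html_req, rdf_req = build_requests("http://example.org/library/item/1")
print(rdf_req.get_header("Accept"))
print(classify_content_type("text/html; charset=UTF-8"))
```

A server that ignores the Accept header and returns the HTML page for the second request would be classified as "html" here, which is one way to detect that content negotiation is not supported.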
An additional aspect is addressed in the experiment: the data model or namespaces of the structured data encoded in the landing pages. In particular, we are interested in gathering statistics regarding the use of two data models: Dublin Core Metadata Elements [4] and Schema.org (used by Google and several other companies). These are the data models most likely to be in use nowadays by CH institutions to represent the metadata of their resources.
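A minimal sketch of how such statistics could be gathered from one HTML page is given below. It is not the tool used in the experiment. The class name, the "(none)" label, the prefix-splitting rule (text before the first "." or ":" in the meta name) and the RDFa test (presence of a `vocab` or `typeof` attribute) are simplifying assumptions, and the sample page is hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class StructuredDataScanner(HTMLParser):
    """Tally meta tag prefixes and flag RDFa and JSON-LD usage in a page."""

    def __init__(self):
        super().__init__()
        self.meta_prefixes = Counter()
        self.has_rdfa = False
        self.has_json_ld = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Skip http-equiv meta tags: they carry no descriptive metadata.
        if tag == "meta" and "http-equiv" not in attrs:
            key = attrs.get("name") or attrs.get("property") or ""
            if key:
                match = re.match(r"([A-Za-z][\w-]*)[.:]", key)
                prefix = match.group(1).lower() if match else "(none)"
                self.meta_prefixes[prefix] += 1
        # Assumption: vocab/typeof signal RDFa; "property" alone is not
        # counted, since Open Graph meta tags also use it.
        if "vocab" in attrs or "typeof" in attrs:
            self.has_rdfa = True
        if tag == "script" and (attrs.get("type") or "").lower() == "application/ld+json":
            self.has_json_ld = True


def scan(html_text):
    scanner = StructuredDataScanner()
    scanner.feed(html_text)
    scanner.close()
    return scanner


# Hypothetical landing page, for illustration only.
SAMPLE_HTML = """<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="description" content="A digitized map">
<meta name="DC.title" content="Map of Lisbon">
<meta name="dcterms.creator" content="Unknown">
<meta property="og:title" content="Map of Lisbon">
<script type="application/ld+json">{"@context": "https://schema.org"}</script>
</head><body vocab="https://schema.org/" typeof="CreativeWork"></body></html>"""

result = scan(SAMPLE_HTML)
print(dict(result.meta_prefixes), result.has_rdfa, result.has_json_ld)
```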

Results
The total responses obtained from the requests issued to the sample of 52,866 URLs are shown in Table 1. None of the responses for linked data resulted in valid RDF. The most frequent response was the HTML page instead of RDF, which suggests that HTTP content negotiation was not supported at all. In some cases an "Unsupported content-type" error was received, and in a few sporadic cases a JSON response was received, but it was not JSON-LD and therefore no RDF triples could be obtained from it. Our visual inspection of some of these cases found that the JSON data followed a specific format, probably defined by the JSON API of a particular digital library system.
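The distinction between plain JSON and JSON-LD responses can be approximated with a simple heuristic, sketched below as an assumption of ours and not as the check applied in the experiment: a body is treated as JSON-LD only if a top-level object carries an `@context` key. A complete check would use a JSON-LD processor, since a context may also be supplied through an HTTP Link header.

```python
import json


def looks_like_json_ld(body):
    """Heuristic: True if a top-level JSON object declares an @context."""
    try:
        doc = json.loads(body)
    except ValueError:
        return False  # not JSON at all
    nodes = doc if isinstance(doc, list) else [doc]
    return any(isinstance(node, dict) and "@context" in node for node in nodes)


# Hypothetical response bodies, for illustration only.
api_style = '{"id": 42, "title": "Map of Lisbon"}'
json_ld = '{"@context": "https://schema.org", "@type": "CreativeWork", "name": "Map of Lisbon"}'

print(looks_like_json_ld(api_style), looks_like_json_ld(json_ld))
```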
In the responses to the HTML requests, we detected a total of 25,276 HTML pages containing some form of structured data: 14,407 pages with HTML5 meta tags, and 10,869 with Schema.org data encoded in RDFa, RDFa Lite or JSON-LD (Table 1). Table 2 summarizes the usage of the HTML5 meta tags. Whenever meta tags were present on the HTML pages, at least one of the standard HTML5 meta tags was in use. In some cases, meta tags with properties using prefixes were also present. Although none of the HTML pages specified the namespace of the prefixes in use (which can be done by using RDFa), some of the prefixes are well known and typically refer to the following namespaces:
• "dc" - Dublin Core Metadata Element Set
• "dcterms" - DCMI Metadata Terms
• "og" - The Open Graph Protocol
The prefixes "eprints" and "egms" were found as well, but we cannot be certain which namespaces they refer to.

In order to make use of CH linked data for metadata aggregation, less automated approaches need to be employed to discover, link and adapt the aggregation systems to each dataset of the participating CH institutions' data sources (SPARQL endpoints, data dumps, etc.). Alternatively, aggregators may start to define the technical mechanisms for making linked data automatically discoverable, accessible and usable for aggregation.
We also conclude that the experiment supports the value of CH aggregation initiatives, such as Europeana and DPLA, for promoting the discoverability of CH objects through both the WWW and linked data. The open linked data published by aggregators, such as [17], is likely to be the most interoperable source of CH linked data currently available. The results of the experiment provide further motivation for the development of Europeana's activities towards Schema.org publication of its dataset and of CH metadata in general [18].
The next step of our work will be to survey technologies of the Semantic Web, linked data and vocabularies for the description of datasets. We will analyze these technologies in search of a solution that enables the aggregation of linked data in a fully automated way, or with very little human intervention. Table 4 lists the technologies that we have identified at this stage of our work.

Linked Data Platform [20]: "Linked Data Platform (LDP) defines a set of rules for HTTP operations on web resources, some based on RDF, to provide an architecture for read-write Linked Data on the web" [20].

VoID - Vocabulary of Interlinked Datasets [21]: "VoID is an RDF Schema vocabulary for expressing metadata about RDF datasets. It is intended as a bridge between the publishers and users of RDF data, with applications ranging from data discovery to cataloging and archiving of datasets." [21]

DCAT - Data Catalogue Vocabulary [22]: "DCAT is an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. Publishers increase discoverability and enable applications easily to consume metadata from multiple catalogs. It further enables decentralized publishing of catalogs and facilitates federated dataset search across sites." [22]

Schema.org: The Schema.org vocabulary defines classes representing Datasets and their distribution.

EDM Datasets Profile [23]: This profile defines the elements used to represent datasets ingested by Europeana. The profile is mainly intended to be used to disseminate dataset-level information via the Europeana API.
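As an illustration of a dataset-level description, the sketch below builds a Schema.org Dataset with one distribution and serializes it as JSON-LD. The `Dataset` and `DataDownload` classes and the `distribution`, `contentUrl` and `encodingFormat` properties belong to the Schema.org vocabulary; the function name, dataset name and URLs are hypothetical, and this is only one possible shape for such a description.

```python
import json


def dataset_description(name, landing_url, dump_url, media_type):
    """Build a Schema.org Dataset description with a single distribution."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "url": landing_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": dump_url,
                "encodingFormat": media_type,
            }
        ],
    }


# Hypothetical dataset of a data provider, for illustration only.
description = dataset_description(
    name="Digitized maps collection",
    landing_url="http://example.org/library/datasets/maps",
    dump_url="http://example.org/library/datasets/maps.ttl",
    media_type="text/turtle",
)
print(json.dumps(description, indent=2))
```

An aggregator that discovers such a description could, in principle, follow `contentUrl` to obtain the data dump without a provider-specific integration step.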

Figure 1. The experimental setup

Table 1. Structured data obtained from the requests issued to the sample of 52,866 URLs from Europeana providers

Table 2. The rdf:type of Schema.org RDF resources present in the HTML pages

Table 4. Technologies of the Semantic Web, linked data, and vocabularies for the description of datasets, which may enable the aggregation of linked data in fully automated ways or with little human intervention

On the World Wide Web, a very large number of resources are made available through digital libraries. The existence of many individual digital libraries, maintained by different organizations, brings challenges to the discoverability, sharing and reuse of the resources. A widely used approach is metadata aggregation, where centralized efforts like Europeana facilitate the discoverability and use of the resources by collecting their associated metadata. The cultural heritage domain embraced the aggregation approach while, at the same time, the technological landscape kept evolving. Nowadays, cultural heritage institutions are increasingly applying technologies designed for wider interoperability on the Web. This paper presents a study of the current application by cultural heritage data providers of technological solutions in use for making structured metadata available for re-use on the Internet. We investigated the use of both linked data and technologies related to the indexing of resources by Internet search engines. We conducted a harvesting experiment on the landing pages of the websites of digital libraries that participate in Europeana, and collected statistics about the usage of these particular technologies. These technologies allow structured data to be represented within HTML, or to be referred to by links within HTML or through HTTP headers. We conclude with a discussion of future work for establishing a solution for cultural heritage aggregation based on the current situation and the available technologies.