ShExStatements: Simplifying Shape Expressions for Wikidata

Wikidata recently added support for entity schemas based on shape expressions (ShEx). They play an important role in the validation of items belonging to a multitude of domains on Wikidata. However, the number of entity schemas created by contributors is relatively low compared to the number of WikiProjects. The past couple of years have seen attempts at simplifying shape expressions and building tools for creating them. In this article, ShExStatements is presented with the goal of simplifying the writing of shape expressions for Wikidata.


INTRODUCTION
Entity schemas based on ShEx (Shape Expressions) [10,13,15] were recently introduced on Wikidata [17]. One of the main advantages of Shape Expressions is that they can be used for RDF validation [7,14,16]. Several tools and scripts currently exist that can be used to visualize and validate a subset of data on Wikidata using ShEx. One such tool is shex.js [4], which lets Wikidata contributors easily check entities against any particular schema. A SPARQL query is used to select a subset of relevant data from Wikidata, and the validation is run on this prefetched data. Thus users can both test and explore the current state of the data related to the SPARQL query. They may propose modifications to the entity schema or even correct the data items.
In the case of Wikidata, WikiProjects are used to identify and discuss relevant properties for the items belonging to a particular domain. For example, WikiProject Informatics 1 identifies properties for software, hardware, programming languages, file systems, algorithms, etc. The number of WikiProjects is an interesting indicator for measuring the use of entity schemas since WikiProjects are managed by dedicated contributors interested in a particular domain. At the time of writing, there are fewer than 300 shape expressions 2 on Wikidata. This number is quite low compared to the number of WikiProjects on Wikipedia [9] and Wikidata [6].
Therefore, any tool for shape expressions must take such WikiProjects into consideration and propose ways to integrate the information present in these projects to build shape expressions. One possible approach is to propose a smaller subset of shape expressions that can be used to build simple shape expressions in a manner that closely resembles some of the existing templates 3 . These simple expressions can take into account the WikiProjects for validating whether the items belonging to a given domain have all the necessary statements. Considering the multilingual nature of Wikidata, another important aspect is to let the communities describe relevant domains in their local languages. ShExStatements 4 [12] was developed to answer these requirements. It was developed in a manner similar to QuickStatements 5 and OpenRefine 6 [3] to ensure a simpler interface using tabular formats or CSV files.
In this article, ShExStatements [12] is presented, explaining how a tabular or CSV file format was developed to simplify writing shape expressions, especially for newcomers. Section 2 presents the state of the art. Taking an example, the grammar of ShExStatements is described in section 3. Section 4 presents the development and use of ShExStatements. Section 6 concludes the article.

RELATED WORKS
Several WikiProjects 7 are currently available on Wikidata related to open government data, culture, history, sports, birds, agriculture, tourism, etc. Some of these WikiProjects take into consideration the infoboxes of Wikipedia [8] belonging to different languages to identify the different properties used to describe the objects belonging to a certain class. These infobox properties are then mapped to appropriate Wikidata properties. WikiProjects, therefore, play an important role in identifying the key Wikidata properties. However, WikiProjects alone cannot be used to automatically validate the existing Wikidata items.
Though Wikidata supports property constraints 8 , their usage is limited to specifying how properties can be used. Wikidata items use multiple properties, and schemas are needed to describe and validate the items belonging to different classes. This is very important in a multilingual and multi-domain context. Therefore, validation of RDF [7,14,16] is important to ensure that the data present in a semantic knowledge base follows the proposed ontology or schema.
Several tools have been proposed that take into consideration the expressivity [15] of shape expressions. These tools can be classified by the way in which shape expressions are created. The first approach is to automatically generate shape expressions from existing RDF data. Designer [1], Wikidata Shape Expressions Inference 9 , and sheXer 10 are some examples. Visual interfaces have also been suggested to understand, modify, and create new shape expressions. YASHE 11 and ShExAuthor 12 are examples of visual tools for creating Shape Expressions. Finally, there are approaches that propose a smaller subset of the ShEx language. Shex-Lite 13 [2], ShExML [5], and ShExStatements belong to this category. ShExML is a language developed to integrate multiple heterogeneous data sources. Shex-Lite is meant to be an independent language, maintaining compatibility with ShEx, and can be used to generate object models in object-oriented programming languages. ShExStatements, on the other hand, is a language developed to generate ShEx from CSV files and tabular formats.

SHEXSTATEMENTS
To explain the grammar of ShExStatements, an example is given below in Figure 6. It describes the ShExStatements of a human language on Wikidata.
A statement has up to five columns. If all five columns are present in the CSV file, column 1 specifies the node name, column 2 the property, column 3 the possible values, column 4 the cardinality (for example, + or *), and column 5 the comments. Comments start with #. Columns 1, 2, and 3 are mandatory for statements. Column 3 can be a special value such as . (a period, meaning 'any' value). Columns 3, 4, and 5 are empty for prefixes.
Consider the first statement in the second part. It states that a language must be an instance of (wdt:P31) a language (wd:Q34770). The fourth value, the cardinality, is intentionally left blank. The fifth value starts with a #, indicating a comment.
Cardinality can be any one of the following values: (1) *: zero or more values; (2) ?: zero or one value; (3) +: one or more values; (4) m: exactly m values; (5) m,n: any number of values between m and n (including m and n). Take the fifth statement, which states that a language can have one or more writing systems, hence the use of + in the fourth column.
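The cardinality values listed above map naturally onto ShEx cardinality suffixes. The following minimal Python sketch shows one way such a mapping could be written. The function name cardinality_to_shex is hypothetical and is not part of ShExStatements; the sketch assumes that a blank cell produces no suffix and that m and m,n are rendered with the ShExC repeat notation {m} and {m,n}.

```python
import re


def cardinality_to_shex(cardinality: str) -> str:
    """Translate a cardinality cell into a ShExC cardinality suffix.

    Hypothetical helper for illustration only; not the tool's own code.
    """
    card = cardinality.strip()
    if card == "":
        # Assumption: a blank cell means that no suffix is emitted.
        return ""
    if card in {"*", "?", "+"}:
        # These symbols are written the same way in ShExC.
        return card
    if re.fullmatch(r"\d+", card):
        # "m": exactly m values.
        return "{" + card + "}"
    match = re.fullmatch(r"(\d+)\s*,\s*(\d+)", card)
    if match:
        # "m,n": between m and n values, both bounds included.
        low, high = int(match.group(1)), int(match.group(2))
        if low > high:
            raise ValueError("lower bound exceeds upper bound: " + card)
        return "{%d,%d}" % (low, high)
    raise ValueError("unsupported cardinality: " + card)


for cell in ["", "*", "?", "+", "3", "1,4"]:
    print(repr(cell), "->", repr(cardinality_to_shex(cell)))
```

Rejecting unknown cells with an exception, rather than passing them through, keeps malformed rows from silently producing an invalid shape expression.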
The third column can also refer to another node. A ShExStatements file can also use other delimiters, such as the vertical bar (|) or the semicolon (;). The example in Figure 3 shows these two cases.
    @tvseries|wdt:P31|wd:Q5398426||# instance of a tvseries
    @tvseries|wdt:P136|@genre|*|# genre
    @genre|wdt:P31|wd:Q201658,wd:Q15961987|#instance of genre
This example is a ShExStatements of a TV series. The first statement describes that a TV series is an instance of wd:Q5398426 (television series). The second statement states that a TV series has zero or more genres (wdt:P136). However, to describe a genre, we need additional statements. The third statement describes a genre to be an instance of wd:Q201658 (film genre) or wd:Q15961987 (television genre). This statement is interesting since it demonstrates the use of different separators. The above example uses the vertical bar (|) for separating the columns. The multiple possible values in the third column are separated by commas (,).
Now, the grammar of ShExStatements can be formalized. A simplified version of the grammar of ShExStatements is given below. For the complete grammar (for example, optional comments, shape constraints, import statements, etc.), readers can take a look at shexstatementsparser.py in [12].
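Before the formal grammar, the column conventions of the TV series example can be illustrated with a small Python sketch that splits each line into its columns. This is not the parser in shexstatementsparser.py; the names Statement and parse_statements are hypothetical. The sketch assumes that node names start with @, that any other non-blank line is a prefix declaration, that a final field starting with # is the comment, and that comments do not contain the delimiter. The prefix line in the sample input is also an assumption added for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class Statement(NamedTuple):
    node: str
    prop: str
    values: List[str]
    cardinality: str
    comment: str


def parse_statements(text: str, delimiter: str = "|") -> Tuple[Dict[str, str], List[Statement]]:
    """Split delimited lines into prefix declarations and statements."""
    prefixes: Dict[str, str] = {}
    statements: List[Statement] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are allowed between statements
        fields = [field.strip() for field in line.split(delimiter)]
        comment = ""
        if fields[-1].startswith("#"):
            # Assumption: a trailing field that starts with # is the comment.
            comment = fields.pop()[1:].strip()
        if not fields[0].startswith("@"):
            # Assumption: lines that do not start with a node are prefixes.
            if len(fields) < 2:
                raise ValueError("malformed prefix line: " + line)
            prefixes[fields[0]] = fields[1]
            continue
        if len(fields) < 3:
            raise ValueError("columns 1 to 3 are mandatory: " + line)
        node, prop, value_cell = fields[0], fields[1], fields[2]
        cardinality = fields[3] if len(fields) > 3 else ""
        # Multiple possible values in the third column are comma-separated.
        values = [value.strip() for value in value_cell.split(",") if value.strip()]
        statements.append(Statement(node, prop, values, cardinality, comment))
    return prefixes, statements


EXAMPLE = """wd|<http://www.wikidata.org/entity/>

@tvseries|wdt:P31|wd:Q5398426||# instance of a tvseries
@tvseries|wdt:P136|@genre|*|# genre
@genre|wdt:P31|wd:Q201658,wd:Q15961987|#instance of genre
"""

prefixes, statements = parse_statements(EXAMPLE)
for statement in statements:
    print(statement)
```

The third statement has no cardinality column before its comment, so the sketch treats the cardinality as blank in that case instead of rejecting the line.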
ShExStatements consists of one or more statements, often preceded by prefix statements. There may be blank lines (NEWLINE) between the statements. Each statement gives a node, a property, and a property value, along with the cardinality. For example, a language has a property value of . (period) with the cardinality + for the property wdt:P282 (writing system). These are detailed below in the grammar. Table 1 can be used as a reference for understanding the terms in upper case.

SEP comment | nodeproperty propertyvalue SEP cardinality SEP comment
The simplified grammar given in this article shows every statement ending with a comment. However, a comment can be omitted. A nodeproperty is a combination of a node and a property, separated by SEP (|).

Listing 5: ShExStatements: prefix
prefix : STRING SEP STRING
A propertyvalue in the third column may be a value, a node, a type, or a special term (e.g., LITERAL above). To specify types other than LITERAL, we need a special case to distinguish values from types.
For example, suppose we want to specify that a painting must have a creation date of type xsd:dateTime. Unlike values, this is a special case: we do not know the possible values, but we know the type of those values. The type is therefore prefixed with @@, as in the following statement.
    @painting,wdt:P571,@@xsd:dateTime,#date of creation
A prop is just a value or a value followed by ∧. This is useful for cases where we wish to specify that the statement with the given property must not hold.
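The conventions for the third column described in this section (a list of values, a node written with @, a type written with @@, the special term LITERAL, and the period for any value) can be summarized in a short Python sketch. The function render_value is hypothetical and only illustrates the distinctions; it assumes the cell has already been separated from the other columns and that the result is written in ShExC notation, with a value set in square brackets and a shape reference as @<name>.

```python
def render_value(cell: str) -> str:
    """Render a third-column cell as a ShExC value expression.

    Hypothetical helper for illustration only; not the tool's own code.
    """
    token = cell.strip()
    if token == ".":
        # A period stands for any value.
        return "."
    if token == "LITERAL":
        # Special term: kept as the ShExC node kind of the same name.
        return "LITERAL"
    if token.startswith("@@"):
        # Two @ signs mark a type, such as xsd:dateTime.
        return token[2:]
    if token.startswith("@"):
        # One @ sign marks a reference to another node (shape).
        return "@<" + token[1:] + ">"
    # Otherwise the cell holds one or more comma-separated values.
    values = [value.strip() for value in token.split(",") if value.strip()]
    if not values:
        raise ValueError("the third column must not be empty")
    return "[ " + " ".join(values) + " ]"


for cell in [".", "@genre", "@@xsd:dateTime", "wd:Q201658,wd:Q15961987"]:
    print(cell, "->", render_value(cell))
```

Checking for @@ before @ matters, since every type token also starts with a single @.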

DEVELOPMENT
ShExStatements is developed in Python and has multiple interfaces. It can be executed from the command line. There is also a web interface as shown in Figure 5 and an API that allows users to generate shape expressions from CSV files.
It uses the library ply 14 to implement the grammar described above and to build the parser for CSV files or other input. The web interface is built using Flask 15 and pyshex 16 is used to generate ShExJ 17 from ShExStatements.

ShEx generation
ShExStatements is also available on the Python Package Index 18 and can therefore be installed using pip. Once ShExStatements is installed, run the following command with the above example written in a file (for example, language.csv). This file contains an example description of a language on Wikidata and uses the comma as a delimiter to separate the values.
    $ ./shexstatements.sh language.csv
ShExStatements then generates the corresponding shape expression (ShEx).
It is also possible to use shexstatements in Python programs. The library provides a method that takes as input a CSV file containing ShExStatements and a delimiter. In this example, we use "," as the delimiter.
ShExStatements also has a public API that can be accessed both on a local installation and on the public interface. It has one operation that takes as input a JSON array with two elements, one of which is the delimiter. It returns a JSON array with one element containing the ShEx (shape expression).

RESULTS
ShExStatements is also available on Toolforge 19 along with detailed documentation 20 . A number of shape expressions were created using ShExStatements during the COVID-19 Biohackathon of April 5-11, 2020 21 . Examples include pandemic (EntitySchema:E184 22 ), hospital, preprint, and lockdown. The primary goal was to identify the key properties for these entities, which could later be improved and extended. During this hackathon, the possibility of using such simple shape expressions for selecting data from Wikidata was also discussed.
Wikidata is a multilingual knowledge base. One of the main objectives of ShExStatements is to ensure that it can be used by multilingual users. For this reason, ShExStatements mainly makes use of symbols and positions for specifying prefixes or even imports. EntitySchema:E210 23 is one such example, which was generated from examples/hospital.csv in [12]. Other ShExStatements related to the Biohackathon can also be found in the folder examples/ [12].

Limitations and Future Works
Currently, ShExStatements only works with CSV files or tabular formats. Future work includes support for formats such as Office Open XML and Excel files. User evaluation tests are required to understand the challenges associated with writing shape expressions. Even though ShExStatements was tested mainly on Wikidata data and for creating new entity schemas, it can also be used for generating ShEx for other RDF data sources. This needs to be further explored.
Another possible major work is to integrate ShExStatements in such a manner that users can directly create new entity schemas from the ShExStatements application. Currently, users need to manually copy the generated shape expression from ShExStatements and then create a new entity schema on Wikidata. ShExStatements applications can also be integrated with other works that support tabular formats for generating shape expressions. Finally, WikiProjects can also play an important role in the greater use of shape expressions for data validation. Contributors can develop simple shape expressions and link them to the appropriate WikiProject page.

CONCLUSION
Validation of data is important, especially for Wikidata, considering the multilingual and multi-domain nature of the knowledge base. The recent introduction of shape expressions (ShEx) is a major step in this direction. To promote their use, more tools may be required. Tabular formats, especially CSV files, are commonly used by Wikidata contributors when working with tools like OpenRefine and QuickStatements. In this article, the ShExStatements tool was presented to simplify writing shape expressions using CSV files. A subset of ShEx was used for building ShExStatements. With a command-line interface, a Python library, and a web interface, ShExStatements provides a wide variety of ways to generate shape expressions using this simpler subset.