A License-Based Search Engine

Introduction and motivation
To facilitate reuse on the Web, resource producers should systematically associate licenses with resources before sharing or publishing them [3]. Licenses specify precisely the conditions of reuse of resources, i.e., what actions are permitted, obliged and prohibited when using the resource.
For a resource producer, choosing the appropriate license for a combined resource or choosing the appropriate licensed resources for a combination involves choosing a license compliant with all the licenses of combined resources as well as analysing the reusability of the resulting resource through the compatibility of its license.
We consider simplified definitions of compliance and compatibility [1]: a license $l_j$ is compliant with a license $l_i$ if a resource licensed under $l_i$ can be licensed under $l_j$ without violating $l_i$. If a license $l_j$ is compliant with $l_i$, then we consider that $l_i$ is compatible with $l_j$ and that resources licensed under $l_i$ are reusable with resources licensed under $l_j$. In general, if $l_i$ is compatible with $l_j$, then $l_j$ is more (or equally) restrictive than $l_i$. We also consider that a license $l_j$ is more (or equally) restrictive than a license $l_i$ if $l_j$ allows at most the same permissions and has at least the same prohibitions/obligations as $l_i$. However, producing resources whose licenses are compliant with all reused resource licenses is difficult. It is necessary to know (1) the set of licenses with which the license of the produced resource is compliant and (2) which pertinent and available resources have licenses in this set.
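The restrictiveness definition above can be illustrated with a minimal Python sketch. The `License` class, the function name and the action sets are assumptions made for illustration only; they are simplified stand-ins, not faithful encodings of the actual license texts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class License:
    """Hypothetical, simplified license: three sets of action names."""
    name: str
    permissions: frozenset = frozenset()
    obligations: frozenset = frozenset()
    prohibitions: frozenset = frozenset()


def at_least_as_restrictive(lj: License, li: License) -> bool:
    """True if lj allows at most the permissions of li and keeps
    at least the obligations and prohibitions of li."""
    return (lj.permissions <= li.permissions
            and lj.obligations >= li.obligations
            and lj.prohibitions >= li.prohibitions)


# Illustrative action sets (assumptions, not the real license contents).
cc_by = License(
    "CC BY",
    permissions=frozenset({"reproduce", "distribute", "derive", "commercial_use"}),
    obligations=frozenset({"attribution"}),
)
cc_by_nc = License(
    "CC BY-NC",
    permissions=frozenset({"reproduce", "distribute", "derive"}),
    obligations=frozenset({"attribution"}),
    prohibitions=frozenset({"commercial_use"}),
)

print(at_least_as_restrictive(cc_by_nc, cc_by))  # CC BY-NC vs. CC BY
print(at_least_as_restrictive(cc_by, cc_by_nc))  # the reverse direction
```

Under these assumed action sets, the first check holds and the second does not, which matches the intuition that a resource under the less restrictive license may be reused under the more restrictive one.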
With CaLi [1], we provide an answer to the first concern. CaLi is a lattice-based model to define compatibility and compliance relations among licenses. It is based on a restrictiveness relation that is refined with constraints to take into account the semantics of actions existing in licenses.
For the second concern, imagine a license-based search engine that can answer questions such as "find all resources that can be reused under the CC BY-NC license". The answer must contain resources under licenses such as CC BY, and CC BY-NC itself, that are less restrictive than or as restrictive as CC BY-NC and compatible with it.
Search engines in services such as GitHub, APISearch, CC search, LODAtlas, DataHub, Google Dataset Search or OpenDataSoft can find resources licensed under a particular license. However, they cannot find resources whose licenses are compatible or compliant with a particular license.
We illustrate the usability of CaLi by answering the second concern. We developed a prototype of a search engine based on a CaLi ordering of licenses, ODRL_CaLi. The goal is to be able to find resources whose licenses are compatible or compliant with a particular license. Our prototype can answer questions such as: "find licensed resources that can be reused under a given license" or "find licensed resources that can reuse a resource that has a particular license".
In our search engine, resources (linked data and source code) are associated with licenses. Licenses are described in RDF with the ODRL vocabulary and ordered in terms of compatibility according to the ODRL_CaLi ordering. In addition to the licenses, the title, description and URI of each licensed resource are indexed to enable full-text search. Note that we are not interested in implementing ODRL. We use the ODRL vocabulary because it is the most complete vocabulary for licenses and it is well accepted by the community.
In the following, Section 2 overviews the CaLi model and the ODRL_CaLi ordering used in our search engine, and Section 3 describes the demonstration.

Modelling the compatibility of licenses
Inspired by lattice-based access control models, we propose a CaLi model as a tuple $\langle A, LS, C_L, C_\rightarrow \rangle$ that partially orders licenses, such that [1]:

1. $A$ is a set of actions (e.g., read, modify, distribute, etc.);
2. $LS$ is a restrictiveness lattice of status that defines (i) all possible statuses (e.g., permissions, obligations, prohibitions, etc.) of an action in a license and (ii) the restrictiveness relation among statuses, denoted by $\leq_S$;
3. $C_\rightarrow$ is a set of compatibility constraints to identify if a restrictiveness relation between two licenses is also a compatibility relation; and
4. $C_L$ is a set of license constraints to identify non-valid licenses.
In CaLi, $L_{A,LS}$ denotes the set of all licenses that can be expressed with $A$ and $LS$. $(L_{A,LS}, \leq_R)$ is the restrictiveness lattice of licenses, which defines the restrictiveness relation $\leq_R$ over $L_{A,LS}$. Non-valid licenses are identified with $C_L$: we consider a license $l_i$ non-valid if a resource cannot be licensed under $l_i$. If two valid licenses have a restrictiveness relation, then it is possible that they have a compatibility relation too. To identify the compatibility among licenses, CaLi refines the restrictiveness relation with the compatibility constraints $C_\rightarrow$.
ODRL_CaLi is a CaLi ordering $\langle A, LS, C_L, C_\rightarrow \rangle$ such that:
- $A$ is the set of 72 actions considered by ODRL;
- $LS$ is the restrictiveness lattice of status where (i) the possible statuses are Permission, Duty, Prohibition or Undefined (for actions that do not appear in the license), and (ii) the restrictiveness relation is $Undefined \leq_S Permission \leq_S Duty \leq_S Prohibition$; and
- $C_L$ and $C_\rightarrow$ are the sets of constraints, inspired from the ODRL information model, defined below.
$C_\rightarrow = \{\omega_{\rightarrow 1}, \omega_{\rightarrow 2}\}$ makes it possible to identify (1) when cc:ShareAlike is required and (2) when cc:DerivativeWorks is prohibited. This is because cc:ShareAlike requires that derivative works be distributed under the same license only, and cc:DerivativeWorks, when prohibited, does not allow the distribution of a derivative resource, regardless of the license. Other constraints could be defined to be closer to the ODRL information model, but these constraints are enough for the purposes of this demonstration.
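To make the status ordering and the two compatibility constraints more concrete, here is a minimal Python sketch. It is one possible reading of the text, not the implementation from [1]: the function names are hypothetical, licenses are modelled as plain dictionaries from action name to status, only a handful of actions is used instead of 72, and the license encodings are simplified for illustration. Licenses with identical statuses are treated as compatible, in line with the reflexivity of the relation.

```python
from enum import IntEnum


class Status(IntEnum):
    """Statuses in increasing order of restrictiveness."""
    UNDEFINED = 0
    PERMISSION = 1
    DUTY = 2
    PROHIBITION = 3


def status_of(lic, action):
    # Actions that do not appear in a license are Undefined.
    return lic.get(action, Status.UNDEFINED)


def less_or_equally_restrictive(li, lj):
    """li <= lj: every action is at most as restrictive in li as in lj."""
    return all(status_of(li, a) <= status_of(lj, a) for a in set(li) | set(lj))


def is_compatible(li, lj):
    """Assumed reading: li is compatible with lj if li <= lj and neither
    constraint (ShareAlike required, DerivativeWorks prohibited) applies."""
    if not less_or_equally_restrictive(li, lj):
        return False
    if all(status_of(li, a) == status_of(lj, a) for a in set(li) | set(lj)):
        return True  # same statuses: reflexive case
    share_alike = status_of(li, "cc:ShareAlike") == Status.DUTY
    no_derivatives = status_of(li, "cc:DerivativeWorks") == Status.PROHIBITION
    return not (share_alike or no_derivatives)


# Simplified, illustrative encodings (assumptions, not the full licenses).
cc_by = {"cc:Distribution": Status.PERMISSION,
         "cc:DerivativeWorks": Status.PERMISSION,
         "cc:Attribution": Status.DUTY}
cc_by_nc = {**cc_by, "cc:CommercialUse": Status.PROHIBITION}
cc_by_sa = {**cc_by, "cc:ShareAlike": Status.DUTY}
cc_by_nc_sa = {**cc_by_sa, "cc:CommercialUse": Status.PROHIBITION}

print(is_compatible(cc_by, cc_by_nc))        # restrictiveness holds, no constraint applies
print(is_compatible(cc_by_sa, cc_by_nc_sa))  # blocked by the ShareAlike constraint
```

The second call shows why restrictiveness alone is not enough: the ShareAlike license is less restrictive than its non-commercial variant, yet the constraint prevents the compatibility relation.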
The size of a CaLi ordering grows exponentially, as $|LS|^{|A|}$, so the size of ODRL_CaLi is $4^{72}$, which makes it impossible to build explicitly. Nevertheless, it is not necessary to explicitly build a lattice to use it. Our search engine uses a sorting algorithm that can sort any set of licenses according to the $LS$ defined above in approximately $n^2/2$ comparisons of restrictiveness, $n$ being the number of licenses to sort, i.e., $O(n^2)$. This algorithm can insert a license into a graph in linear time $O(n)$ without sorting the graph again (see [1] for more details). Thus, our algorithm produces compatibility graphs of licenses that conform to the ODRL_CaLi ordering of licenses. This algorithm is available on GitHub under the MIT license.
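The orders of magnitude above can be checked with a short Python sketch. The graph class below is a naive stand-in written for illustration, not the algorithm of [1]: it compares each inserted license once with every license already present, so inserting $n$ licenses one by one costs $n(n-1)/2$ comparisons in total. All names and the toy licenses are hypothetical.

```python
ORDER = {"Undefined": 0, "Permission": 1, "Duty": 2, "Prohibition": 3}

# Size of the full ODRL_CaLi lattice: 4 statuses, 72 actions (about 10**43).
LATTICE_SIZE = len(ORDER) ** 72


def compare(a, b):
    """One restrictiveness comparison between two licenses.
    Returns 0 if equal, -1 if a <= b, 1 if b <= a, None if incomparable."""
    diffs = [ORDER[a.get(x, "Undefined")] - ORDER[b.get(x, "Undefined")]
             for x in set(a) | set(b)]
    if all(d == 0 for d in diffs):
        return 0
    if all(d <= 0 for d in diffs):
        return -1
    if all(d >= 0 for d in diffs):
        return 1
    return None


class RestrictivenessGraph:
    """Naive graph: stores every (less restrictive, more restrictive) pair."""

    def __init__(self):
        self.licenses = {}    # name -> {action: status}
        self.edges = set()    # (lower, upper) pairs
        self.comparisons = 0

    def insert(self, name, statuses):
        # Linear in the number of licenses already in the graph.
        for other, other_statuses in self.licenses.items():
            self.comparisons += 1
            result = compare(statuses, other_statuses)
            if result in (-1, 0):
                self.edges.add((name, other))
            if result in (1, 0):
                self.edges.add((other, name))
        self.licenses[name] = statuses


graph = RestrictivenessGraph()
graph.insert("A", {"read": "Permission"})
graph.insert("B", {"read": "Permission", "distribute": "Prohibition"})
graph.insert("C", {"read": "Duty"})
graph.insert("D", {"read": "Prohibition", "distribute": "Prohibition"})
print(LATTICE_SIZE)
print(graph.comparisons, sorted(graph.edges))
```

With four toy licenses the sketch performs six comparisons, i.e. $4 \cdot 3 / 2$, and it never needs to enumerate the lattice itself.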

Demonstration
Using ODRL_CaLi and the sorting algorithm described in the previous section, we generated two compatibility graphs of licenses: one for the most used licenses in DataHub and another for the most used licenses in GitHub. Licenses are described in RDF. We use the dataset of licenses proposed by [2].
Resources associated with licenses are licensed RDF datasets from DataHub and OpenDataSoft, and licensed repositories from GitHub.
The source code of the search engine is available on GitHub under the MIT license. Our demonstration is available online at http://cali.priloo.univ-nantes.fr.
Both compatibility graphs of licenses can be explored visually. Figure 1a shows the compatibility graph of the CaLi ordering for some licensed RDF datasets. Blue nodes are licenses, grey arrows are compatibility relations among licenses and orange nodes are RDF datasets associated with licenses. Licenses that have the same actions in the same status are represented in the same node. In the graph, licenses that are compatible with a particular license $l_i$ are below $l_i$ and licenses that are compliant with $l_i$ are above $l_i$. We recall that the ordering relations of compatibility and compliance that we define are reflexive, transitive and antisymmetric.
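The "below" and "above" reading of the graph can be sketched as a simple traversal in Python. The graph here is a hypothetical toy example with abstract license names (L1 to L5), and the function names are assumptions; each license is included in its own result, reflecting reflexivity.

```python
def invert(graph):
    """Reverse every edge of a graph given as {node: [successors]}."""
    inverted = {node: [] for node in graph}
    for node, targets in graph.items():
        for target in targets:
            inverted.setdefault(target, []).append(node)
    return inverted


def reachable(graph, start):
    """All nodes reachable from start (start included), depth-first."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Toy compatibility graph: an edge li -> lj means "li is compatible with lj",
# so lj sits directly above li.
ABOVE = {"L1": ["L2", "L3"], "L2": ["L4"], "L3": ["L4"], "L4": [], "L5": []}
BELOW = invert(ABOVE)


def compliant_with(li):
    """Licenses above li: resources under li can be reused under them."""
    return reachable(ABOVE, li)


def compatible_with(li):
    """Licenses below li: their resources can be reused under li."""
    return reachable(BELOW, li)


print(sorted(compatible_with("L4")))
print(sorted(compliant_with("L1")))
```

Because the relation is transitive, following direct edges is enough: L1 is found below L4 even though the toy graph has no direct edge between them.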
During the demonstration, attendees will be able to search for resources licensed under licenses compliant or compatible with a particular license. Figure 1b shows the search bar of our search engine. It enables full-text and license-compliant searches over each graph, for RDF datasets or repositories. For example, users can search for datasets about 'bikes' whose licenses are compatible with the CC BY-NC license (i.e., datasets about 'bikes' that can be reused under the CC BY-NC license). The result contains all RDF datasets indexed in the search engine whose title or description contains the word 'bikes' and whose license is compatible with CC BY-NC (e.g., CC BY, MIT, CC-Zero, etc.).
Both compatibility graphs of licenses are available online through a documented API. Finally, these graphs are also accessible through a TPF server or can be exported in RDF (Turtle, XML, N3 and JSON-LD).
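The 'bikes' query described above combines a full-text filter with a license filter. The following Python sketch shows the idea under stated assumptions: the `Resource` class, the `search` function and the sample records (with example.org URIs) are hypothetical, and the set of compatible licenses is hard-coded from the example in the text rather than computed from a compatibility graph.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical indexed record for a licensed resource."""
    uri: str
    title: str
    description: str
    license: str


def search(resources, keyword, allowed_licenses):
    """Keep resources whose license is allowed and whose title or
    description contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [r for r in resources
            if r.license in allowed_licenses
            and (needle in r.title.lower() or needle in r.description.lower())]


# Licenses assumed compatible with CC BY-NC, taken from the example in the text.
COMPATIBLE_WITH_CC_BY_NC = {"CC BY", "MIT", "CC BY-NC"}

# Made-up sample data for illustration.
INDEX = [
    Resource("http://example.org/d1", "City bikes", "Bike stations", "CC BY"),
    Resource("http://example.org/d2", "Bus stops", "Public transport", "CC BY"),
    Resource("http://example.org/d3", "Shared bikes", "Trips per day", "Some-Other-License"),
    Resource("http://example.org/d4", "Cycling", "Counts of bikes on lanes", "MIT"),
]

for hit in search(INDEX, "bikes", COMPATIBLE_WITH_CC_BY_NC):
    print(hit.uri, hit.license)
```

In a real deployment the allowed set would come from the compatibility graph, so that adding a license to the graph immediately changes which resources a query returns.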
A possible extension of our search engine is to allow the collaborative addition of licenses and licensed resources, that is, to let users add new licenses and resources to increase the size, and therefore the interest, of these two graphs.