Gossiping correspondences to reduce semantic heterogeneity of unstructured P2P systems

In this paper we consider P2P data sharing systems in which each participant uses an ontology to represent its data. If the participants do not all use the same ontology, the system is said to be semantically heterogeneous. This heterogeneity prevents perfect interoperability: participants may be unable to process queries that contain concepts they do not understand. Intuitively, the more heterogeneous a system is, the harder it is to communicate. We first define several measures to characterize the semantic heterogeneity of P2P systems according to different facets. Then, we propose a solution, called CorDis, to reduce the heterogeneity by decreasing the gap between peers. The idea is to gossip correspondences through the system so that peers become less disparate from each other. The experiments use the PeerSim simulator and ontologies from OntoFarm. The results show that CorDis significantly reduces some facets of semantic heterogeneity while the network traffic and the storage space are bounded.


Introduction
We consider peer-to-peer (P2P) data sharing systems where semantic meta-data are used to represent information and to enhance search. This general setting can be instantiated in different ways depending on the kind of meta-data used. We focus on applications where each peer uses an ontology to represent the information it stores. Typical examples are indexing documents or data sets with respect to the concepts of the ontology, or annotating the elements of a database schema with entities of the ontology.
The use of different ontologies results in semantic heterogeneity of the system. Because some peers are unable to precisely understand each other, some semantic interoperability has to be reached in some way. It is generally assumed that neighbour peers use alignments between their ontologies [7]. Then, knowing correspondences between entities of ontologies, each peer translates incoming queries before forwarding them. This kind of approach works well in some cases, although it suffers from information losses due to successive translations [8,3]. Moreover, it mainly focuses on leveraging interoperability without trying to reduce semantic heterogeneity. Our goal is to define a class of algorithms that reduce the semantic heterogeneity of the P2P system, thus leveraging interoperability as a consequence. We proceed in two steps.
The first step consists in characterizing semantic heterogeneity. Apart from some intuitions like "the more different ontologies are used in the system, the higher heterogeneity is", or "the more alignments are known, the lower heterogeneity is", no definition of semantic heterogeneity exists (at least to our knowledge). Based on the observation that the concept of heterogeneity has several dimensions (or facets), we propose several definitions to capture them.
The goal of the second step is to define algorithms that make semantic heterogeneity decrease along some dimensions. Of course, a simple way to decrease heterogeneity is to have the peers use exactly the same ontology. We believe that this is not realistic when peers are numerous and have different backgrounds. Hence we focus on solutions that have the peers increase their knowledge of alignments. Assuming that they join the system with already known alignments, the probability that all of them know exactly the same alignments is very low. Thus the idea is to make the peers share their knowledge by disseminating correspondences between entities of different ontologies. We consider a case where peers trust each other: no correspondence should be disregarded because it has been forwarded by an untrusted peer.
In order to implement the dissemination of correspondences we use a gossiping algorithm in the sense of [11]: each peer regularly picks some other peer for a two-way information exchange. In our case, each peer selects some correspondences to send to another peer. The latter also selects correspondences and sends them to the former. After several rounds, correspondences disseminate across the system. The CorDis protocol is based on this idea. In addition, because peers generally have limited local storage, a scoring function is used to order the correspondences and store the most relevant ones. Relevance is computed from a history of the incoming queries. We propose to favour the correspondences that involve entities that appeared in recent queries and, to an extent chosen by the programmer, those involving entities belonging to ontologies referred to in recent queries. The scores of the correspondences are regularly updated, so that the CorDis protocol adapts the information exchange to the current queries.
In this paper, we bring several contributions. After presenting our formal model (section 2), we first propose several definitions of semantic heterogeneity measures, corresponding to different facets of this notion (section 3). Second, we propose the CorDis gossip-based protocol to disseminate correspondences across the system (section 4). It considers a history of queries to score the correspondences. Thus it ensures some flexibility with respect to current queries. Third, we report on several experiments conducted with the PeerSim simulator and fifteen ontologies from OntoFarm (section 5). The CorDis protocol is evaluated with respect to the proposed measures of semantic heterogeneity. The results show that CorDis significantly reduces several facets of heterogeneity while the network traffic and the storage space are bounded. This work builds on previous results concerning ontology mapping, ontology distances and gossiping algorithms. However, it does not have any equivalent among the previously proposed solutions to improve semantic interoperability (section 6).

The P2P system
We assume that each peer $p$ has a unique identifier, denoted by $id(p)$. To maintain relationships with other peers, peer $p$ keeps a routing table $table(p)$, composed of a set of peer identifiers which are called $p$'s neighbours.
Definition 1 (Unstructured P2P system) An unstructured P2P system is defined by a graph $S = \langle P, N \rangle$, where $P$ is a set of peers and $N$ represents a neighbourhood relation defined by: $N = \{(p_i, p_j) \in P^2 : p_j \in table(p_i)\}$.
In the system presented in Fig. 1 the neighbourhood of $p_1$ within a radius equal to 2 is composed of $p_2$, $p_3$, $p_4$ and $p_5$.

Ontologies and alignments
We consider that an ontology $o$ is composed of a set of concepts $C_o$, a set of relations $R_o$ (linking concepts) and a set of properties $P_o$ (assigned to concepts). The union of these three sets of entities is denoted by $E_o$. In practice, OWL [13] represents ontologies by defining classes, datatype properties and object properties. We assume that each ontology is uniquely identified by a URI. Thus two ontologies are equal if and only if their URIs are the same. We assume that a peer uses the same ontology during its lifetime.
An alignment process aims at identifying a set of correspondences between the entities of two ontologies [7].
Definition 2 (Correspondence) A correspondence is a 4-tuple $\langle e, e', r, n \rangle$ such that $e$ (resp. $e'$) is an entity from $o$ (resp. $o'$), $r$ is a relation between $e$ and $e'$, and $n$ is a confidence value.
An alignment between the ontologies of Fig. 2 could contain the correspondences $\langle Thing_1, Thing_2, \equiv, 1 \rangle$, $\langle Flower_1, Flower_2, \equiv, 1 \rangle$, $\langle odour_1, fragrance_2, \equiv, 1 \rangle$ and $\langle Edelweiss_1, Flower_2, isA, 1 \rangle$. Notice that an alignment is not necessarily perfect, in the sense that some correct correspondences may be missing and others may be incorrect. Here, we assume that an alignment does not contain incorrect correspondences.
Definition 3 (Peer-to-ontology mapping) Given a P2P system $S = \langle P, N \rangle$ and a set of ontologies $O$, a peer-to-ontology mapping is a function $\mu : P \to O$, mapping each peer to one ontology.
In order to understand incoming queries each peer must know correspondences. We denote by $\kappa_p$ the set of correspondences stored by a peer $p$, and by $\kappa_p(o, o')$ the subset of $\kappa_p$ concerning ontologies $o$ and $o'$.
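As a minimal sketch of these definitions, a correspondence and the subset $\kappa_p(o, o')$ could be represented as follows. The `ontology#entity` naming scheme, the helper names and the directional reading of $\kappa_p(o, o')$ (from $o$ to $o'$) are assumptions made for illustration; the sample correspondences follow the Fig. 2 example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correspondence:
    source: str        # entity e, written "ontology#entity" (naming is an assumption)
    target: str        # entity e'
    relation: str      # relation r, e.g. "≡" or "isA"
    confidence: float  # confidence value n


def ontology_of(entity: str) -> str:
    """Return the ontology part of an 'ontology#entity' identifier."""
    return entity.split("#", 1)[0]


def kappa(stored, o, o_prime):
    """Subset of the stored correspondences going from ontology o to o_prime."""
    return {
        c for c in stored
        if ontology_of(c.source) == o and ontology_of(c.target) == o_prime
    }


# Correspondences known by a hypothetical peer p.
kappa_p = {
    Correspondence("o1#Thing", "o2#Thing", "≡", 1.0),
    Correspondence("o1#Flower", "o2#Flower", "≡", 1.0),
    Correspondence("o1#odour", "o2#fragrance", "≡", 1.0),
    Correspondence("o1#Edelweiss", "o2#Flower", "isA", 1.0),
}

subset = kappa(kappa_p, "o1", "o2")
```

Making the class frozen keeps correspondences hashable, so they can be stored in sets and used as dictionary keys.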

Disparity between two peers
We introduce the notion of disparity function to quantify the difference between two peers.
Definition 4 A disparity function $d : P \times P \to [0, 1]$ is a function that assigns a real value in $[0, 1]$ to a pair $(p, p')$, representing how much $p'$ differs from $p$. It satisfies the minimality property: $\forall p \in P, d(p, p) = 0$, but we do not assume it is a mathematical distance.
There are different ways to define disparity and several proposals exist [12,5].Some consider the alignments between the peers' ontologies [5].

Semantic heterogeneity of a system
The following definition states what a semantic heterogeneity function is. It does not mean that heterogeneity can be captured by a single measure. Rather, depending on the application, several complementary measures could be used.
Definition 5 Let $SM$ be a set of models $M = \langle S, O, \mu, d \rangle$ where $S$ is a P2P system, $O$ is a set of ontologies, $\mu$ is a peer-to-ontology mapping, and $d$ is a disparity function between peers. A semantic heterogeneity measure is a function $H : SM \to [0, 1]$ such that (i) $H(M) = 0$ if $|\mu(P)| = 1$ (minimality) and (ii) $H(M) = 1$ if $d(p, p') = 1$ for all $p \neq p' \in P$ (maximality). The conditions express that (i) homogeneity occurs when the same ontology is used by all the peers and that (ii) maximal heterogeneity occurs when all the disparities between peers are maximal.
In this section, we propose measures which are general enough to be used in many application domains while still being meaningful.

Disparity unaware measure
The notion of diversity is commonly used to measure the heterogeneity of a population (e.g. in biology). Richness partly characterizes the diversity of a population. In our context it depends on the number of different ontologies used in the system. If all the peers use the same ontology, then the system is completely homogeneous. Conversely, the more ontologies there are, the more heterogeneous it is. This idea can be expressed by the following measure: $H_{Rich}(M) = \frac{|o_S| - 1}{|P| - 1}$, where $|o_S|$ is the number of different ontologies used in the system $S$, and $|P|$ the number of peers. In the system presented in Fig. 1, four different ontologies are used by the ten participants: $H_{Rich}(M) = \frac{4 - 1}{10 - 1} \approx 0.33$. Measuring richness allows preliminary conclusions to be drawn. In particular it gives information about the need for alignments to reach interoperability. A richness value equal to 0 means that heterogeneity is null: no alignment is needed to ensure interoperability in the system. A value equal to 1 means that heterogeneity is total: alignments are needed between each pair of participants to communicate.
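The richness computation can be illustrated with a short sketch. The function name `richness`, the dictionary encoding of the peer-to-ontology mapping, and the convention of returning 0 for a system with fewer than two peers are assumptions of this example; the assignment of ontologies to peers is illustrative and only reproduces the counts of the Fig. 1 example (ten peers, four ontologies).

```python
def richness(peer_ontology: dict) -> float:
    """Richness: (number of distinct ontologies - 1) / (number of peers - 1)."""
    n_peers = len(peer_ontology)
    n_ontologies = len(set(peer_ontology.values()))
    if n_peers < 2:
        # Assumption: a system with at most one peer is treated as homogeneous.
        return 0.0
    return (n_ontologies - 1) / (n_peers - 1)


# Ten peers sharing four ontologies (the exact assignment is made up).
mu = {f"p{i}": f"o{i % 4}" for i in range(10)}
print(round(richness(mu), 2))  # 0.33
```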

Disparity aware measures
Topology unaware measure We propose to consider the disparity between peers rather than only the ontologies they use. If the disparity between peers is globally high, it means that peers have important knowledge differences. The more different their knowledge, the harder it is to communicate (i.e. to answer queries). Indeed, an important loss of information will occur during query translation. As we do not take into account the system topology, we consider the disparity between each pair of peers: $H_{Disp}(M) = \frac{1}{|P|^2 - |P|} \sum_{p \neq p' \in P} d(p, p')$. The $H_{Disp}$ measure determines whether peers are globally disparate from each other.
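A minimal sketch of a topology unaware measure follows, assuming that $H_{Disp}$ is the mean disparity over all ordered pairs of distinct peers. The function name `h_disp` and the toy 0/1 disparity (0 when two peers share an ontology, 1 otherwise) are assumptions used only to make the example runnable; any disparity function with values in $[0, 1]$ could be plugged in.

```python
from itertools import permutations


def h_disp(peers, d):
    """Mean disparity d(p, q) over all ordered pairs of distinct peers."""
    pairs = list(permutations(list(peers), 2))
    if not pairs:
        return 0.0  # assumption: fewer than two peers means no heterogeneity
    return sum(d(p, q) for p, q in pairs) / len(pairs)


# Toy setting: three peers, two ontologies.
mu = {"p1": "o1", "p2": "o1", "p3": "o2"}


def toy_disparity(p, q):
    """Hypothetical disparity: 0 if both peers use the same ontology, else 1."""
    return 0.0 if mu[p] == mu[q] else 1.0


value = h_disp(mu, toy_disparity)
```

Ordered pairs are used because a disparity function is not assumed to be symmetric.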
Topology aware measure We propose to take into account how disparate peers are with regard to their neighbourhoods. If peers are globally far (semantically speaking) from their respective neighbourhoods, the system is highly heterogeneous. Conversely, if peers are close to their neighbourhoods, the system is weakly heterogeneous, even if the diversity of the system is not null.
We denote by $N_r(p)$ the neighbourhood of a peer $p$ within a radius $r$. It is the set of peers accessible from $p$ with $l$ hops, where $1 \leq l \leq r$. We consider that $p$ does not belong to $N_r(p)$. We first propose a measure that focuses on a given peer and determines how the latter is understood by its neighbours: $H_{Dap}(M, p, r) = \frac{1}{|N_r(p)|} \sum_{p' \in N_r(p)} d(p, p')$. A global measure can be obtained by averaging over all the peers: $H_{DapAvg}(M, r) = \frac{1}{|P|} \sum_{p \in P} H_{Dap}(M, p, r)$. If the value of $H_{DapAvg}$ is low, it means that peers are globally close to their neighbours: each peer is surrounded by peers able to "understand" it.
Proposition 1. All the measures introduced in this section satisfy both properties of minimality and maximality (proof is trivial).
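The neighbourhood $N_r(p)$ and a topology aware measure can be sketched with a breadth-first traversal of the routing tables. This is an illustration under assumptions: the local measure is taken to be the mean disparity between a peer and the members of its neighbourhood, the global one the mean of the local values, and the function names, the toy topology and the toy 0/1 disparity are all hypothetical.

```python
from collections import deque


def neighbourhood(table, p, r):
    """Peers reachable from p in 1..r hops, p itself excluded."""
    seen = {p}
    result = set()
    frontier = deque([(p, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == r:
            continue  # do not expand beyond radius r
        for q in table.get(node, ()):
            if q not in seen:
                seen.add(q)
                result.add(q)
                frontier.append((q, dist + 1))
    return result


def h_dap(table, d, p, r):
    """Mean disparity between p and the peers of its neighbourhood."""
    nbrs = neighbourhood(table, p, r)
    if not nbrs:
        return 0.0  # assumption: an isolated peer contributes no heterogeneity
    return sum(d(p, q) for q in nbrs) / len(nbrs)


def h_dap_avg(table, d, r):
    """Mean of the local measure over all peers listed in the routing tables."""
    peers = list(table)
    if not peers:
        return 0.0
    return sum(h_dap(table, d, p, r) for p in peers) / len(peers)


# Hypothetical routing tables (directed edges) and ontology assignment.
table = {"p1": ["p2", "p3"], "p2": ["p4"], "p3": ["p5"],
         "p4": ["p6"], "p5": [], "p6": []}
mu = {"p1": "o1", "p2": "o1", "p3": "o2", "p4": "o1", "p5": "o2", "p6": "o1"}


def toy_disparity(p, q):
    """Hypothetical disparity: 0 if both peers use the same ontology, else 1."""
    return 0.0 if mu[p] == mu[q] else 1.0


n2 = neighbourhood(table, "p1", 2)
local = h_dap(table, toy_disparity, "p1", 2)
```

With this toy topology, the neighbourhood of `p1` within radius 2 contains `p2`, `p3`, `p4` and `p5`, while `p6` is three hops away and is excluded.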

Principles of gossip-based protocols
Our approach is based on a gossip-based protocol that disseminates data [11].
In such a protocol, each peer runs two threads: an active one and a passive one. The active thread is used to initiate communications with another peer. We assume that peer selection is provided by a peer sampling service, allowing peers to select another peer uniformly at random [9]. Thus, each peer regularly contacts another peer to exchange information. We consider that the size of a message does not exceed $m_{max}$. When a peer is contacted by another one (through the passive thread), it has to answer by sending some information. Both peers then process the information they received. This principle is made explicit by algorithms 1 and 2. In these algorithms, peers have to perform two crucial tasks: data selection and data processing.
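The active and passive behaviours described above can be sketched as a sequential simulation. This is not the authors' algorithms 1 and 2: the class and method names are hypothetical, peer sampling is approximated by a uniform random choice among the other peers, threads are replaced by direct method calls, and the contacted peer is assumed to select its reply before merging the received message.

```python
import random


class Peer:
    def __init__(self, name, data, m_max=3):
        self.name = name
        self.data = set(data)
        self.m_max = m_max  # maximum number of items per message

    def select_data(self, rng):
        """Data selection: pick at most m_max stored items at random."""
        k = min(self.m_max, len(self.data))
        return set(rng.sample(sorted(self.data), k))

    def process_data(self, received):
        """Data processing: here, simply merge what was received."""
        self.data |= received

    def active_step(self, peers, rng):
        """Active behaviour: contact a random peer and exchange data."""
        other = rng.choice([q for q in peers if q is not self])
        msg = self.select_data(rng)
        reply = other.passive_step(msg, rng)
        self.process_data(reply)

    def passive_step(self, msg, rng):
        """Passive behaviour: answer with a selection, then process msg."""
        reply = self.select_data(rng)  # assumption: select before merging
        self.process_data(msg)
        return reply


rng = random.Random(42)
peers = [Peer(f"p{i}", {f"c{i}"}) for i in range(5)]
for _cycle in range(10):
    for p in peers:
        p.active_step(peers, rng)
```

In this sketch the store is unbounded; the storage limits and scoring of CorDis would replace the plain set union in `process_data`.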

The CorDis protocol
The main idea of this protocol is to disseminate information over the network in order to share correspondences known by some peers but unknown to others, and thereby to reduce some facets of the semantic heterogeneity of the system. In the remainder of this work, we do not make any assumption about the way queries are transmitted in the system, but we consider that they are unchanged during the propagation: each peer receives the same query, and is responsible for translating it if necessary. When the process starts, each peer $p$ knows some correspondences, a subset of which involves its own ontology (denoted by $init_p$). This subset of $o_p$-correspondences (correspondences that involve its own ontology) should always be recorded by the peer. The purpose of dissemination is that each peer learns additional correspondences that might help it translate the queries it receives into its own ontology. We disseminate the correspondences by gossiping: each peer $p$ regularly initiates an exchange of correspondences with another peer $p'$. It selects some correspondences it knows and sends them to $p'$. In turn, $p'$ chooses among the correspondences it stores and sends them to $p$.

Storage of correspondences
Each peer must store the correspondences it has been informed of in a cache of limited size, which prevents the peer from storing all the correspondences. The choice of the correspondences to keep relies on a scoring function which orders the correspondences: only the best ones are kept. In theory, the scoring function could be specific to each peer. Here we propose that each of them considers a history of the received queries.
A history of received queries is made of two lists $L_1$ and $L_2$. List $L_1$ contains the entities used in the last $k$ received queries, while $L_2$ contains the ontologies used to express the last $k'$ received queries. Notice that an item can appear several times in a list if it has been involved in several queries. The intuition of the scoring function is that peers favour the correspondences that might be useful for translating queries (it can be useful locally, or for others). The coefficient $\omega \in [0, 1]$ is used for giving more or less importance to a correspondence involving entities that do not appear in recent queries, but that belong to ontologies used recently. If the focus of interest of the queries changes, the scoring values of the correspondences will change, giving more importance to relevant correspondences. Scores are regularly recalculated to take dynamicity into account.
Because the correspondences involving its own ontology are of prime importance for the peer, we propose that it tries to store as many of them as possible (or all of them if possible) in a specific repository, including $init_p$, distinct from the cache which is then devoted to the other correspondences. If the repository is too small for storing all the $o_p$-correspondences, the peer can use the scoring function to eliminate some of them. We denote by $repository(p)$ the repository of a peer $p$, and by $cache(p)$ its cache (respectively limited to $r_{max}$ and $c_{max}$ entries).
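To make the role of the two lists and of the coefficient ω concrete, here is a hypothetical scoring sketch. It is not the scoring function of CorDis: the additive form (occurrences of the two entities in $L_1$, plus ω times the occurrences of their ontologies in $L_2$), the class and function names, and the `ontology#entity` naming scheme are all assumptions made for illustration.

```python
from collections import Counter, deque


def ontology_of(entity):
    """Ontology part of an 'ontology#entity' identifier (assumed naming)."""
    return entity.split("#", 1)[0]


class QueryHistory:
    def __init__(self, k, k_prime):
        self.entities = deque(maxlen=k)          # L1: entities of the last k queries
        self.ontologies = deque(maxlen=k_prime)  # L2: ontologies of the last k' queries

    def record(self, query_entities, ontology):
        """Store one received query; the oldest one is dropped when full."""
        self.entities.append(list(query_entities))
        self.ontologies.append(ontology)


def score(corr, history, omega):
    """Hypothetical score of a correspondence (e, e') given the history."""
    e, e_prime = corr
    l1 = Counter(x for query in history.entities for x in query)
    l2 = Counter(history.ontologies)
    entity_hits = l1[e] + l1[e_prime]
    ontology_hits = l2[ontology_of(e)] + l2[ontology_of(e_prime)]
    return entity_hits + omega * ontology_hits


history = QueryHistory(k=2, k_prime=2)
history.record(["o1#Flower"], "o1")
history.record(["o1#Flower", "o1#odour"], "o1")

s_entity = score(("o1#Flower", "o2#Flower"), history, omega=0.5)
s_ontology_only = score(("o1#Edelweiss", "o2#Flower"), history, omega=0.5)
s_unrelated = score(("o3#X", "o4#Y"), history, omega=0.5)
```

With ω = 0 only entities seen in recent queries count; larger values of ω also reward correspondences whose entities merely belong to recently used ontologies. Because both lists are bounded, recording new queries shifts the scores towards the current focus of interest.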
Data selection When a peer has to send correspondences, it selects them from both the cache and the repository. We introduce the number $\pi \in [0, 1]$ to represent the ratio of correspondences to select in both sets. Thus, a peer randomly selects $[\pi \cdot m_{max}]$ correspondences in its repository, and $[(1 - \pi) \cdot m_{max}]$ in its cache. Fig. 3 summarizes this process. Random selection is used to ensure that two correspondences of the repository (resp. the cache) have the same probability of being sent.
Data processing When a peer $p$ receives a message $msg$, it executes two main tasks. First it computes the score of the correspondences in $msg$, and then it merges them with its local data. Merging only consists in adding the $o_p$-correspondences to $repository(p)$ and the others to $cache(p)$, and in re-ordering the correspondences. If a correspondence is already stored, the newest score is used. Then, the best $r_{max}$ (resp. $c_{max}$) correspondences are kept in the repository (resp. in the cache).
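Data selection and data processing can be sketched as follows. Several choices are assumptions of this example: the bracket notation is read as rounding down (so that a message never exceeds $m_{max}$ entries), correspondences are simplified to pairs of `ontology#entity` strings, stores are dictionaries mapping a correspondence to its score, and the function names and the constant scoring function are hypothetical.

```python
import math
import random


def ontology_of(entity):
    """Ontology part of an 'ontology#entity' identifier (assumed naming)."""
    return entity.split("#", 1)[0]


def select_for_sending(repository, cache, m_max, pi, rng):
    """Randomly pick about pi*m_max repository entries and (1-pi)*m_max cache entries."""
    n_repo = min(math.floor(pi * m_max), len(repository))
    n_cache = min(math.floor((1 - pi) * m_max), len(cache))
    return rng.sample(sorted(repository), n_repo) + rng.sample(sorted(cache), n_cache)


def keep_best(store, limit):
    """Keep only the `limit` best-scored correspondences of a store."""
    best = set(sorted(store, key=store.get, reverse=True)[:limit])
    for corr in list(store):
        if corr not in best:
            del store[corr]


def merge(repository, cache, msg, own_ontology, score, r_max, c_max):
    """Score received correspondences, file them, then truncate both stores."""
    for corr in msg:
        involves_own = own_ontology in (ontology_of(corr[0]), ontology_of(corr[1]))
        target = repository if involves_own else cache
        target[corr] = score(corr)  # the newest score replaces a stored one
    keep_best(repository, r_max)
    keep_best(cache, c_max)


# Hypothetical state of a peer using ontology o1.
repository = {("o1#A", "o2#A"): 1.0}
cache = {("o2#B", "o3#B"): 0.2, ("o3#C", "o4#C"): 0.9}
msg = [("o1#D", "o3#D"), ("o2#E", "o4#E")]

merge(repository, cache, msg, "o1", score=lambda corr: 0.5, r_max=2, c_max=2)
outgoing = select_for_sending(repository, cache, m_max=4, pi=0.5, rng=random.Random(0))
```

After the merge, the lowest-scored cache entry is evicted because the cache is limited to two entries, while the repository still has room for the new $o_p$-correspondence.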

Preliminary experiments
In this section we study the performance of our protocol w.r.t. application parameters, initial heterogeneity, and dynamicity of the system. We used the PeerSim simulator [10] to generate P2P systems as directed graphs. In order to simulate real-world situations we use the OntoFarm dataset [16,5]. It is composed of fifteen ontologies, expressed in OWL, dealing with the conference organization domain. Ontologies are composed of 51 concepts on average (between 14 and 141) and their average volume is 41.3 KB (between 7.2 KB and 100.7 KB). We use a Poisson law to distribute ontologies in the system. Thus some ontologies are more used than others. We consider this to be a realistic situation. As we only have fifteen ontologies, we consider relatively small systems (i.e. with 100 peers) to ensure a sufficient degree of heterogeneity. Moreover, each peer has three other peers as neighbours.
We set $\pi = 0.5$ so that correspondences are selected evenly from the repository and the cache. Furthermore we consider that histories constantly change over time: scoring function values vary continually. We regard this as a critical situation.
We exploit the alignments used in [5] as reference alignments between ontologies. On average 98 correspondences are available from one ontology to the others (altogether 1470 correspondences). As each correspondence is an equivalence between two concepts (with $n = 1$), we adapt the coverage measure presented in [5] as the measure of disparity between two peers. It is defined as: where $o$ and $o'$ are the ontologies of $p$ and $p'$, and $\kappa_{p'}(o, o')$ is the set of correspondences that $p'$ knows between $o$ and $o'$. This definition expresses how well $p'$ can understand $p$'s queries. In all experiments we measure the extent and speed of the heterogeneity decrease enabled by CorDis considering $H_{Disp}$ and $H_{DapAvg}$. Because of space limitations we only report on $H_{Disp}$, as $H_{DapAvg}$ behaves the same way.

Impact of application parameters
In these experiments we study the impact of the volume of stored data ($r_{max}$ and $c_{max}$), the network traffic ($m_{max}$) and the initial knowledge of peers (we set different quantities of known correspondences: $init_p$). We consider five configurations (see Table 4). The configuration $c_{ref}$ serves as a reference. For these experiments we consider that fifteen different ontologies are used. Consequently $H_{Rich}$ equals 0.14.
Given the reference alignments, the heterogeneity $H_{Disp}$ cannot be reduced below a theoretical limit equal to 0.704 (cf. the solid black line on Fig. 5). This limit can be reached if the storage capacity of peers is unlimited, and if each peer $p$ knows all the $o_p$-correspondences available in the system. We anticipate that CorDis will not reduce the heterogeneity below this limit.
The graph of Fig. 5 shows that the CorDis protocol reduces $H_{Disp}$ in all the configurations we set. These results allow us to draw (predictable) conclusions: (i) the less peers know initially, the harder it is to reduce the heterogeneity (cf. $c_1$), (ii) the more useful information peers store, the less heterogeneous the system becomes (cf. $c_2$), and (iii) the less information peers share, the slower the heterogeneity decreases (cf. $c_4$). Nevertheless we can see that increasing the peers' cache (cf. $c_3$) does not have an important impact on the heterogeneity decrease. After 50 cycles, $H_{Disp}$ does not vary significantly anymore.

Impact of semantic richness
In these experiments we study the impact of richness heterogeneity. We vary the number of ontologies used in the system from 1 (homogeneous system) to 15 (number of available ontologies in OntoFarm). As a consequence, the richness value $H_{Rich}$ varies between 0 and 0.14. We set the other parameters as in the configuration $c_{ref}$ of section 5.1.
Fig. 6 shows that CorDis is efficient for all the situations considered in these experiments. We plan to conduct additional experiments to show that CorDis is also efficient in highly heterogeneous systems.

Impact of new arrivals
In these experiments we study the impact of peer arrivals in an existing system. We consider four configurations. The first one ($ref_1$) represents a system of 100 peers (using 10 different ontologies: $H_{Rich} = 0.09$) in which CorDis is running. The second configuration ($ref_2$) is similar to the first one but represents a system of 110 peers (using 15 different ontologies: $H_{Rich} = 0.14$). They both serve as references. In the other scenarios, 10 peers join the system simultaneously at the 50th cycle ($c_1$) or one after the other between the 50th cycle and the 95th cycle: every 5 cycles a new peer joins the system ($c_2$). In both configurations, arriving peers use ontologies that are not already used, so $H_{Rich}$ grows up to 0.14. Fig. 7 shows that when a group of peers joins the system, an important disruption occurs. But after 40 cycles, arriving peers are integrated in the system as if they had been in it from the beginning. When peers join the system progressively, they are quickly integrated (20 cycles). In conclusion, CorDis is robust to new arrivals.

Related work
Our measures of semantic heterogeneity assume the existence of a disparity measure between peers. Distance measures proposed in the field of ontology matching [7,12] can be adapted, even if they do not take into account alignments between ontologies. In [5] distances between ontologies are defined in the alignment space. They can be used if we consider that queries are translated at each hop. In [3], the authors define criteria to characterize the interoperability of a P2P system, but no measure is proposed to define the semantic heterogeneity of P2P systems.
CorDis aims to improve the interoperability of the system by reducing some facets of the heterogeneity. Other methods have been proposed to improve interoperability. For instance, in [1] the authors aim to achieve a form of semantic agreement to enable queries to be forwarded to the peers that understand them best, i.e. with a good degree of comprehension and with correct mappings. In order to build such a system, queries are enriched with the translations used during the propagation. This enables peers to assign confidence values to the mappings. In [1,2], the term semantic gossiping refers to the action of "propagating queries toward nodes for which no direct translation link exists". This is a very specific approach to gossiping which mixes the propagation of queries with the dissemination of their translations. On the contrary, our approach is independent of query propagation and only focuses on the dissemination of correspondences. In [4] the authors propose a system ensuring interoperability by offering several functionalities to automatically organize the network of mappings at a mediation layer. Again this work can be considered as complementary to ours in the sense that the mechanism to detect the condition of strong connectivity [3] could also be put in place in the systems we consider. Others try to improve interoperability by creating a global ontology that serves as an intermediary between all peers of the system [6]. Pires et al. [15] present a semantic matcher which identifies correspondences between ontologies used in a PDMS. This method could be used in our context to discover correspondences, i.e. to initialize peers' alignments or to enrich them. In [14] the authors propose to group related peers in SONs to improve interoperability. This approach is complementary to ours because they can be combined: one aims to reduce heterogeneity, and the other one aims to improve information retrieval performance.

Conclusion
With the aim of improving semantic interoperability in P2P data sharing systems we presented a new approach that consists in decreasing semantic heterogeneity. As none existed before, at least to our knowledge, we defined several measures to characterize different facets of the semantic heterogeneity of a P2P system. These measures are general enough to be used in several application domains. We proposed a new protocol, called CorDis, which relies on a gossip-based dissemination of correspondences across the system. It ensures some flexibility with respect to current queries. We conducted preliminary experiments which show that CorDis significantly reduces several facets of semantic heterogeneity. Finally, CorDis does not have any equivalent among the previously proposed solutions to improve semantic interoperability.
As future work, we first plan to conduct additional experiments with real query sets and more ontologies since, in some way, the number of ontologies limits the number of peers in the simulations. In addition, our proposal provides a basis that may be extended in several complementary directions. First, we could add a deduction mechanism to discover new correspondences. Second, knowing correspondences might incite some peers to change their neighbourhood, thus leading to a dynamic evolution of connections. Finally, a good knowledge of alignments between its own ontology and another one might lead a peer to adopt an additional ontology, or to change its ontology. All these directions may help to reduce some facets of heterogeneity further and faster.

Tab. 4. Configurations studied in section 5.1 (parameters $init_p$, $r_{max}$, $c_{max}$, $m_{max}$), and theoretical analysis of the local storage (LS) per peer and the network traffic (NT) per cycle.
Fig. 5. Decrease of $H_{Disp}$ heterogeneity.