Distance Induction in First Order Logic

A distance on the problem domain allows one to tackle some typical goals of machine learning, e.g. classification or conceptual clustering, via robust data analysis algorithms (e.g. k-nearest neighbors or k-means).


Introduction
The expert indeed knows to what extent any two examples or hypotheses on a problem domain are similar: a relevant distance represents a powerful, even if implicit, form of background knowledge. Distances can support many machine learning tasks:
• A distance or similarity function is needed to cluster the examples, which is the core of unsupervised learning [9,4]. Clustering also constitutes a main stage of knowledge discovery in databases (KDD) [8]: one must somehow divide the enormous amount of available data, in order for knowledge to be conquered. Inductive logic programming (ILP) [15] can benefit from clustering, too: e.g. KBG uses a similarity function specifically designed for first-order languages, and gradually constructs hypotheses by generalizing the most similar examples and/or hypotheses [2].
• A distance allows the retrieval of the examples or hypotheses most similar to the instance at hand. In case-based reasoning (CBR), the retrieval stage commands the success of the whole process; hence much attention has been paid in CBR to developing flexible distances or similarity functions on structured domains [1]. Retrieving the nearest neighbors of the instance at hand also constitutes the core of instance-based learning. The ILP system RIBL [7] consists of a k-NN classifier relying on an extended version of the first-order distance of KBG. A fruitful combination of inductive learning and k-NN classification in attribute-value domains is described in [5]: RISE uses as default rule the majority vote of the k rules whose hypotheses are the closest to the instance at hand.
• In the field of analogy, one looks for "optimal" mappings from the source onto the target context; the optimality criterion most often refers to a relational or structural distance [10,3].
In this paper, we first compare the respective advantages and weaknesses of rules and distances with regard to supervised learning. We then discuss previous work devoted to constructing distances on first-order languages [2,7]. Section 3 presents an alternative to distances based on syntax and weights, namely hypothesis-driven distances (HDD). We show that a set of d hypotheses induces a mapping $\pi$ from the problem domain $\mathcal{L}_h$ onto the space of integer vectors $\mathbb{N}^d$. A distance on $\mathcal{L}_h$ then follows, by defining the distance between any two examples or further hypotheses E and F as the Euclidean distance between $\pi(E)$ and $\pi(F)$. The properties and biases of HDDs are studied.
DISTILL (for Distance Induction with STILL) uses the ILP system STILL [18] to construct rather blindly d hypotheses, where d is supplied by the user. These hypotheses only serve here as a system of coordinates: further examples or hypotheses are given a numerical description within this system. DISTILL finally computes the distance between any two examples with the same polynomial complexity as in STILL (section 4).
This approach is validated on the mutagenesis problem: the 1-NN classifier based on the distance constructed by DISTILL proves to be quite competitive with prominent ILP learners such as FOIL [16] and PROGOL [14] on this problem. DISTILL also improves on STILL [18]: it involves one less parameter and shows little sensitivity with respect to parameter d for $d \geq 30$. We conclude with some perspectives for further research.

State of the art
This section first presents our motivation for constructing distances on first-order logic spaces, and briefly recalls some previous work devoted to this aim.

Rules versus Distances
The main advantages of instance-based (e.g. k-NN) classifiers versus standard rule learning are extensively discussed in [7]: simply put, k-NN classifiers accurately deal with both symbolic and numerical data, on one hand, and with noisy data, on the other hand. Further, the predictive accuracy obtained by a k-NN classifier (in leave-one-out evaluation mode) gives hints about the quality of the data, and yields lower bounds on the optimal predictive accuracy [6]. Practically, a k-NN classifier allows for a flexible modeling of the target concept, more easily than rules or even oblique decision trees [11]. This can be exemplified as follows: in the bidimensional space $\mathbb{R}^2$, a set of n rules characterizes the target concept as the union of n rectangles; an oblique decision tree with n leaves characterizes it as the union of n polygons. In contrast, a set of N examples, plus a distance, induces a fine-grained partition of the problem domain into N cells (the Voronoi cells); the target concept is characterized as the union of those cells that are centered on a positive example.
Compared to rules, instance-based classifiers suffer from their low intelligibility: the classification of an instance is justified by exhibiting the most similar example(s), rather than a high-level hypothesis.
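The Voronoi-cell characterization above can be sketched with a minimal 1-NN classifier. The training points, labels and the choice of the Euclidean distance on the plane are illustrative assumptions, not data from this paper; any distance function could be passed in place of `math.dist`.

```python
import math

def nearest_neighbor_label(x, examples, labels, dist):
    """Return the label of the stored example closest to x under dist."""
    best = min(range(len(examples)), key=lambda i: dist(x, examples[i]))
    return labels[best]

# Hypothetical training set in the plane (assumption for illustration).
examples = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0), (5.0, 3.0)]
labels = ["pos", "pos", "neg", "neg"]

# A query point inherits the class of the Voronoi cell it falls into.
print(nearest_neighbor_label((1.5, 0.5), examples, labels, math.dist))  # pos
print(nearest_neighbor_label((4.5, 3.0), examples, labels, math.dist))  # neg
```

The target concept is never represented explicitly: it is the union of the cells whose centers are labeled positive.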

Related work
Most distances on attribute-value languages are computed as the weighted sum of the elementary distances $d_i$ defined on the attribute domains:

$$\mathrm{dist}(E, F) = \sum_i w_i \, d_i(E_i, F_i)$$

where $E_i$ and $F_i$ denote the values of the i-th attribute in E and F. The accuracy of the distance (evaluated as the predictive accuracy of the corresponding k-NN classifier) critically depends on the weights $w_i$, usually adjusted by trial and error. These can also be determined by an optimization algorithm [12].
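A weighted attribute-value distance of this kind can be sketched as follows. The two elementary distances (0/1 mismatch for symbols, absolute gap for numbers), the attribute values and the weights are all hypothetical choices made for the example.

```python
def weighted_distance(e, f, elementary, weights):
    """Weighted sum of per-attribute elementary distances."""
    return sum(w * d(x, y) for w, d, x, y in zip(weights, elementary, e, f))

# Hypothetical elementary distances (assumptions for illustration).
def symbolic(x, y):
    return 0.0 if x == y else 1.0

def numeric(x, y):
    return abs(x - y)

e = ("carbon", 0.25)
f = ("oxygen", 0.75)
d = weighted_distance(e, f, [symbolic, numeric], [2.0, 1.0])
print(d)  # 2.5
```

Changing the weight vector reshapes the whole similarity map, which is why the weights act as a non-declarative bias.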
Weight-based distances were first extended to first-order logic languages in [2] and later refined in [7]. In both cases, the distance between any two conjunctive formulae is basically computed from that of their literals; the distance between two literals (built on the same predicate symbol) is computed from the distance between their arguments, the weight of the predicate, and the weights of the predicate arguments. A global perspective on the examples, accounting for the semantics of the domain, is offered by computing the distance between two terms from the distance between the literals where they both appear. (Combinatorial explosion is prevented via syntactic restrictions on the literals examined.) In KBG [2], the distances between terms are computed via a fixed point method, whereas RIBL [7] uses an iterative resolution. The resulting similarity map critically depends on both the syntax and the weights. This limitation is partly addressed by RIBL, which iteratively refines the weights proposed by the expert.
To sum up, these distances combine built-in knowledge (the elementary distances on the domains of attributes or predicate arguments) with weights, i.e. non-declarative biases adjusted either manually or automatically.

Principle
Let $\mathcal{L}_h$ denote the language of hypotheses (including the language of instances via the single representation trick). Let $\mathcal{H} = \{h_1, \ldots, h_d\}$ denote a set of d hypotheses.
One notices [19] that $\mathcal{H}$ induces a mapping $\pi$ from $\mathcal{L}_h$ onto the boolean space of dimension d, by associating to any example or hypothesis E the vector of booleans coding whether E is subsumed by $h_i$, noted $E \prec h_i$:

$$\pi(E) = (\pi_1(E), \ldots, \pi_d(E)), \qquad \pi_i(E) = 1 \text{ if } E \prec h_i, \; 0 \text{ otherwise}$$

Note that this projection onto $\{0,1\}^d$ does not make any assumption on $\mathcal{L}_h$: besides $\mathcal{H}$, it only invokes the covering test (checking whether $E \prec h_i$). And $\{0,1\}^d$ is a metric space; a distance on $\mathcal{L}_h$ thus naturally follows, by setting:

$$\mathrm{dist}(E, F) = \sum_{i=1}^{d} |\pi_i(E) - \pi_i(F)|$$

By construction, dist is symmetrical and satisfies the triangular inequality. Still, it does not satisfy the identity relation: $(\mathrm{dist}(E, F) = 0) \not\Rightarrow (E = F)$.
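The boolean projection can be sketched as follows, under two assumptions made only for illustration: examples and hypotheses are modeled as sets of features, with a conjunctive hypothesis covering an example when all its features appear in it, and the distance on the boolean space is taken to be the Hamming distance. All feature names are hypothetical.

```python
def covers(hypothesis, example):
    """Covering test for a conjunctive hypothesis modeled as a feature set."""
    return hypothesis <= example

def project(example, hypotheses):
    """Boolean vector coding which hypotheses cover the example."""
    return [1 if covers(h, example) else 0 for h in hypotheses]

def hdd(e, f, hypotheses):
    """Hamming distance between the two projections (assumed metric)."""
    pe, pf = project(e, hypotheses), project(f, hypotheses)
    return sum(abs(a - b) for a, b in zip(pe, pf))

hypotheses = [frozenset({"carbon"}), frozenset({"carbon", "ring"}), frozenset({"charged"})]
E = frozenset({"carbon", "ring"})
F = frozenset({"carbon", "charged"})
G = frozenset({"carbon", "ring", "methyl"})

print(project(E, hypotheses), project(F, hypotheses))  # [1, 1, 0] [1, 0, 1]
print(hdd(E, F, hypotheses))                           # 2
print(hdd(E, G, hypotheses), E == G)                   # 0 False
```

The last line illustrates the failure of the identity relation: two distinct examples that are covered by exactly the same hypotheses are at distance zero.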

Local behavior of HDD
Hypotheses-based distances locally depend upon the context. Consider examples E and F, together with the single hypothesis h (Table 1). As E is covered by h ($\pi(E) = 1$), and F is not ($\pi(F) = 0$), one has $\mathrm{dist}(E, F) = 1$. Consider examples E' and F' constructed from E and F by replacing a common feature (Atom = carbon) with another feature (say Atom = oxygen). Any weight-based distance $\mathrm{dist}_w$ would give $\mathrm{dist}_w(E, F) = \mathrm{dist}_w(E', F')$. More generally, weight-based distances are invariant by translation (consistently modifying a feature shared by any two examples does not modify their distance). This is not necessarily the case for hypotheses-based distances, because $\pi(E)$ globally depends on E (here, since $\pi(E') = \pi(F') = 0$, $\mathrm{dist}(E', F') = 0$). A modification of any given feature of E may or may not have an effect on $\pi(E)$, depending on the other features. A hypothesis-driven distance thereby encodes local discontinuities of the problem domain, corresponding to the frontiers of hypotheses $h_i$.
The property of non-invariance by translation is desirable as it makes it possible to emulate the "versatile similarities" of experts. An expert may consider two devices manufactured by a given firm as very similar; what s/he really means is that the same failures are likely to be observed on these devices. But (rather unexpectedly for the naive knowledge engineer) the same devices manufactured by another firm happen to be judged quite dissimilar...
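The contrast with weight-based distances can be checked on a toy case. The hypothesis h, the attribute names and the unit weights below are hypothetical stand-ins (they do not reproduce Table 1); the point is only that swapping a shared feature leaves the weighted distance unchanged while it changes the hypothesis-driven one.

```python
def h(x):
    """Hypothetical single hypothesis: a carbon atom with a high charge."""
    return x["atom"] == "carbon" and x["charge"] == "high"

def hdd(x, y):
    """Hypothesis-driven distance for the single hypothesis h."""
    return abs(int(h(x)) - int(h(y)))

def weighted(x, y, weights):
    """Weight-based distance: weighted count of mismatching attributes."""
    return sum(w * (x[a] != y[a]) for a, w in weights.items())

E = {"atom": "carbon", "charge": "high"}
F = {"atom": "carbon", "charge": "low"}
# Translate both examples by replacing the shared feature.
E2 = dict(E, atom="oxygen")
F2 = dict(F, atom="oxygen")
weights = {"atom": 1.0, "charge": 1.0}

print(weighted(E, F, weights), weighted(E2, F2, weights))  # 1.0 1.0
print(hdd(E, F), hdd(E2, F2))                              # 1 0
```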

Limitations of HDDs
HDDs are of little interest whenever they are based on a concise set of hypotheses $\mathcal{H}$: e.g. dist gets rather coarse if any example is covered by a single hypothesis, as happens if $\mathcal{H}$ is a decision tree (either E and F are covered by the same hypothesis, and $\mathrm{dist}(E, F) = 0$, or $\mathrm{dist}(E, F) = 2$).
The granularity of a HDD increases with the redundancy of $\mathcal{H}$ (i.e. the average number of $h_i$ covering any example) and more precisely with the number and diversity of hypotheses $h_i$. Still, a HDD does not involve in any way the conclusions associated with hypotheses $h_i$; this suggests that the relevance of a HDD is potentially independent of the relevance of $\mathcal{H}$ (see section 4.3).
Moreover, the structure of the boolean space does not reflect the structure of the problem domain. A hypothesis $h_i$ usually covers less than half the problem space: $\pi_i(E) = 1$ is thus less frequent than $\pi_i(E) = 0$, whilst 1 and 0 play equivalent roles in the boolean space.
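The coarseness of a HDD built on a partition can be observed on a small sketch. The three "leaves" below are hypothetical intervals on a single numeric attribute, standing in for the leaves of a decision tree, and the Hamming distance on the boolean projection is an assumption.

```python
from itertools import combinations

# Hypothetical leaves of a decision tree: each example falls in exactly one.
leaves = [lambda x: x < 1, lambda x: 1 <= x < 2, lambda x: x >= 2]

def project(x):
    """Boolean projection: exactly one coordinate is set for a partition."""
    return [int(leaf(x)) for leaf in leaves]

def hdd(x, y):
    """Hamming distance between projections (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(project(x), project(y)))

examples = [0.5, 0.7, 1.5, 2.5]
distances = {hdd(x, y) for x, y in combinations(examples, 2)}
print(sorted(distances))  # [0, 2]
```

Whatever the examples, only two distance values can occur, so the distance carries no more information than leaf membership.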

Projection onto $\mathbb{N}^d$
We therefore consider more complex hypotheses. Let $h_i$ now be a disjunction of formulae in $\mathcal{L}_h$, with $h_i = s_{i,1} \vee \ldots \vee s_{i,n_i}$, and let $\pi_i(E)$ (section 3.1) now be defined as the number of formulae $s_{i,j}$ covering E. This allows $\pi$ to map the problem domain $\mathcal{L}_h$ onto a richer metric space, that of integer vectors $\mathbb{N}^d$. The corresponding HDD is naturally defined as:

$$\mathrm{dist}(E, F) = \sqrt{\sum_{i=1}^{d} \left(\pi_i(E) - \pi_i(F)\right)^2}$$

The ordered structure of $\mathbb{N}$ reflects a logical structure on the problem domain. Let $h_i^M$ denote the M-of-N hypothesis constructed from the disjunctive $h_i$, defined as: $E \prec h_i^M$ iff E is covered by at least M formulae $s_{i,j}$. One easily shows that $h_i^{M+1}$ is covered by $h_i^M$. The set of hypotheses $\{h_i^M, M = 1, \ldots, n_i\}$ is a sequence of nested hypotheses which can be viewed as neighborhoods, or balls, of increasing specificity; $\pi_i$ thereby corresponds to a "dimension" of the problem domain, and the coordinate $\pi_i(E)$ of E on this dimension precisely gives the rank of the most specific ball E belongs to.
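The integer projection and the nested M-of-N reading can be sketched as follows. The disjuncts are written as hypothetical Python predicates over attribute dictionaries, and the Euclidean distance on the integer vectors follows the definition announced in the introduction; none of the attribute names or thresholds come from the paper.

```python
import math

def coordinate(example, disjuncts):
    """Number of disjuncts of one hypothesis that cover the example."""
    return sum(1 for s in disjuncts if s(example))

def project(example, hypotheses):
    """Integer vector of per-hypothesis coordinates."""
    return [coordinate(example, h) for h in hypotheses]

def hdd(e, f, hypotheses):
    """Euclidean distance between the two integer projections."""
    return math.dist(project(e, hypotheses), project(f, hypotheses))

def covers_m_of_n(example, disjuncts, m):
    """M-of-N reading: at least m disjuncts cover the example."""
    return coordinate(example, disjuncts) >= m

# Hypothetical disjunctive hypotheses (assumptions for illustration).
h1 = [lambda x: x["a"] > 0, lambda x: x["b"] > 0, lambda x: x["c"] > 0]
h2 = [lambda x: x["a"] > 5, lambda x: x["b"] < 3]
hypotheses = [h1, h2]

E = {"a": 6, "b": 1, "c": 2}
F = {"a": -1, "b": 4, "c": -2}
print(project(E, hypotheses), project(F, hypotheses))  # [3, 2] [1, 0]
print(hdd(E, F, hypotheses))
```

Since `covers_m_of_n` is a threshold on the coordinate, covering at level M+1 implies covering at level M, which is the nesting property stated above.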

Distance Induction based on Disjunctive Version Space
This section is devoted to learning a HDD from examples expressed as definite or constrained clauses.

Principle
The presented mechanism relies on the disjunctive version space (DiVS) approach; more details on DiVS in attribute-value and first-order logic languages are respectively found in [17] and [18]. The elementary step in DiVS consists of characterizing the most general hypothesis D(E, F) covering example E and discriminating example F, where E and F satisfy distinct target concepts.
In attribute-value languages, D(E, F) simply is the disjunction of the maximally general selectors covering E and rejecting F.
Given the user-supplied number d of dimensions, $\mathcal{H}$ is iteratively constructed by setting $h_i = D(E_i, F_i)$, where $E_i$ and $F_i$ are randomly selected in the training set such that they satisfy distinct target concepts.
Construction of $\mathcal{H} = \{h_1, \ldots, h_d\}$
For i = 1 to d,
    Randomly select $E_i$ and $F_i$ in the training set with $Class(E_i) \neq Class(F_i)$
    Construct $h_i$ discriminating $E_i$ from $F_i$.
For any further example I, the coordinate $\pi_i(I)$ on dimension $D(E_i, F_i)$ is computed as the number of selectors in $D(E_i, F_i)$ satisfied by I. $E_i$ and $F_i$ respectively get the highest and lowest coordinates on this dimension.
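The attribute-value construction can be sketched as follows. The selector language is an assumption made for the example: on a numeric attribute the selector is a strict threshold at F's value, and on a symbolic attribute it excludes F's value. The function names and the toy molecules are hypothetical, and the loop assumes the training set contains at least two classes.

```python
import random

def discriminate(e, f):
    """Selectors (attribute, operator, value) satisfied by e and violated by f."""
    selectors = []
    for attr, ve in e.items():
        vf = f[attr]
        if ve == vf:
            continue  # no selector on this attribute can separate e from f
        if isinstance(ve, (int, float)):
            selectors.append((attr, ">" if ve > vf else "<", vf))
        else:
            selectors.append((attr, "!=", vf))
    return selectors

def satisfies(x, selector):
    attr, op, value = selector
    if op == ">":
        return x[attr] > value
    if op == "<":
        return x[attr] < value
    return x[attr] != value

def coordinate(x, selectors):
    """Number of selectors of one dimension satisfied by x."""
    return sum(satisfies(x, s) for s in selectors)

def build_hypotheses(examples, labels, d, rng):
    """Draw random pairs of examples from distinct classes, one per dimension."""
    hypotheses = []
    while len(hypotheses) < d:
        i, j = rng.sample(range(len(examples)), 2)
        if labels[i] != labels[j]:
            hypotheses.append(discriminate(examples[i], examples[j]))
    return hypotheses

E = {"atom": "carbon", "charge": 0.5, "size": 3}
F = {"atom": "oxygen", "charge": 0.1, "size": 3}
I = {"atom": "nitrogen", "charge": 0.0, "size": 1}
D = discriminate(E, F)
print(coordinate(E, D), coordinate(I, D), coordinate(F, D))  # 2 1 0
```

As stated above, E satisfies every selector of its own dimension and F satisfies none, so any further example falls between the two.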

DISTILL
DiVS has been extended and adapted to first-order logic via the STILL algorithm [18]. Due to space limitations, STILL will only be illustrated on a short example. Let E and F be definite clauses; let C be constructed from E by turning any occurrence of a term $t_i$ in E into a distinct variable $X_j$, and let the substitution $\theta$ map each variable $X_j$ back to the term it replaces, so that $E = C\theta$.
A constrained clause $G\gamma$ in the chosen language belongs to the set D(E, F) iff either G or $\gamma$ discriminates F. G is discriminant iff it includes a discriminant predicate (e.g. cc). Otherwise, G subsumes F and the set of substitutions mapping G onto F is denoted $\Sigma$; then, $\gamma$ is discriminant iff it is incompatible with all substitutions in $\Sigma$, or equivalently belongs to all $D(\theta, \sigma)$, for $\sigma$ in $\Sigma$, within an equivalent attribute-value representation.
Since exploring $\Sigma$ is intractable in the general case, the approximation $D_\eta(E, F)$ considers only $\eta$ substitutions $\sigma_1, \ldots, \sigma_\eta$ randomly selected in $\Sigma$. Deciding whether $D_\eta(E, F)$ covers a further instance I is similarly intractable, as it requires exploring the set $\Sigma'$ of substitutions mapping C onto I. A polynomial approximation of the covering test is similarly provided by considering only K substitutions randomly selected in $\Sigma'$.
The coordinate of I on dimension $D_\eta(E, F)$ is the number of discriminant predicates involved in I, augmented with the maximal score obtained over K substitutions $\tau$ randomly selected in $\Sigma'$; the score of $\tau$ is the minimum number of selectors in $D(\theta, \sigma_j)$ satisfied by $\tau$, for $j = 1, \ldots, \eta$. Finally, computing the distance between any two examples has complexity $O(d \times K \times \eta \times V^2)$.

Experimentation
This approach is evaluated on the well-studied mutagenesis problem [13,21]. Table 4(a) reports the best results obtained by FOIL, PROGOL and STILL [20,18]. FOIL and PROGOL have been evaluated via 10-fold cross-validation; STILL was evaluated in a similar way, only including 25 runs (with different random seeds) instead of 10, as recommended for evaluating stochastic processes. Run times (in seconds) are measured on HP-735 workstations.
DISTILL is evaluated from the average predictive accuracy of the 1-NN classifier based on dist, via the same protocol as STILL. The experiments focus on the influence of the number d of constructed hypotheses, varied in 10..100. The two other parameters of DISTILL, inherited from STILL, are set to their default values ($\eta = 300$ and $K = 3$).
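Evaluating a distance through the accuracy of its 1-NN classifier can be sketched as below. For brevity this sketch uses leave-one-out evaluation rather than the cross-validation protocol described above, and the integer coordinates and class names are hypothetical toy data, not mutagenesis results.

```python
import math

def loo_1nn_accuracy(examples, labels, dist):
    """Leave-one-out accuracy of the 1-NN classifier induced by dist."""
    correct = 0
    for i, x in enumerate(examples):
        others = [j for j in range(len(examples)) if j != i]
        nearest = min(others, key=lambda j: dist(x, examples[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(examples)

# Hypothetical integer coordinates of six examples in a 2-hypothesis system.
points = [(3, 2), (3, 1), (2, 2), (0, 0), (1, 0), (0, 1)]
labels = ["active"] * 3 + ["inactive"] * 3
print(loo_1nn_accuracy(points, labels, math.dist))  # 1.0
```

Any distance function with the same signature, including one computed from a set of induced hypotheses, can be passed in place of `math.dist`.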
Another experimentation goal is to study what happens if the provided examples are not classified at all, by removing the test $Class(E) \neq Class(F)$ in the construction of $\mathcal{H}$ (section 4.1). The corresponding algorithm is termed UNDISTILL, for Unsupervised Distance Induction.
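The difference between the two variants lies only in how the pairs of examples are drawn, which can be sketched as follows. The function name, its `supervised` flag and the toy labels are hypothetical; the supervised branch assumes at least two classes are present.

```python
import random

def select_pairs(labels, d, rng, supervised=True):
    """Draw d pairs of example indices; the class test is skipped when unsupervised."""
    pairs = []
    while len(pairs) < d:
        i, j = rng.sample(range(len(labels)), 2)
        if not supervised or labels[i] != labels[j]:
            pairs.append((i, j))
    return pairs

labels = ["active", "inactive", "active", "inactive", "active"]
rng = random.Random(0)
supervised_pairs = select_pairs(labels, 10, rng, supervised=True)
unsupervised_pairs = select_pairs([None] * 5, 10, rng, supervised=False)
print(len(supervised_pairs), len(unsupervised_pairs))  # 10 10
```

In the unsupervised case the labels are never inspected, so the same procedure applies to unclassified data.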
Tables 4(b) and 4(c) respectively give the results obtained by DISTILL and UNDISTILL (with run times in seconds on a HP-710). It was conjectured that the relevance of $\mathcal{H}$ was not a necessary condition to derive a relevant HDD (section 3.3); one is nevertheless surprised that DISTILL and UNDISTILL obtain comparable results. In retrospect, it appears that hypotheses are used to make distinctions on the problem domain: the soundness of these distinctions does not matter provided they allow for a sufficiently precise scattering of the problem domain.
Practically, the good performance of UNDISTILL suggests that distance induction does not depend on the noise of the data, and can be employed for unsupervised learning.

Conclusion
Rather than syntactically comparing two examples, we propose to compare the way they respectively behave with respect to a set of hypotheses. Hypothesis-driven distances strongly depend on the selection of the hypotheses: HDDs typically bring no further information if these hypotheses are concise and intelligible (section 3.3). We therefore used a disjunctive version space approach: a set of d hypotheses is constructed as the maximally general hypotheses discriminating d pairs of examples $(E_i, F_i)$. $E_i$ and $F_i$ are randomly selected in UNDISTILL, and they are further required to satisfy distinct target concepts in DISTILL. Experimental validation shows that both DISTILL and UNDISTILL supersede other ILP learners on the mutagenesis dataset, for $d \geq 30$. Incidentally, this confirms that a stochastic bias (meant as the selection of $E_i$ and $F_i$) can be a sound alternative to knowledge-demanding biases.
Further work will consider how the set of hypotheses can be pruned or augmented. Other perspectives are offered by coupling this distance with standard data analysis algorithms (e.g. k-means or factorial analysis) to achieve conceptual clustering or graphical representation of the data.