Towards a MWE-driven A* parsing with LTAGs [WG2,WG3]

Jakub Waszczuk; Agata Savary

Abstract

Natural language parsing is known to potentially produce a high number of syntactic interpretations for a sentence. Some of them may contain multiword expressions (MWEs), and reaching them earlier than compositional alternatives has proved effective in symbolic parsing (see below). We propose to apply this strategy to symbolic LTAG (Lexicalized Tree Adjoining Grammar) parsing, using an architecture adaptable to probabilistic parsing.

We are particularly interested in LTAGs because, according to (Abeillé and Schabes 1989), they show several advantages with respect to parsing MWEs. Firstly, unification constraints on feature structures attached to tree nodes allow one to naturally express dependencies between arguments at different depths in the elementary trees (as in NP0 vider DET sac 'to express one's secret thoughts', where the determiner DET embedded in the direct object must agree in person and number with the subject NP0). Secondly, the so-called extended domain of locality offers a natural framework for representing two different kinds of discontinuities. Discontinuities coming from the internal structure of a MWE are directly visible in elementary trees and are handled in parsing mostly by substitution. Discontinuities coming from the insertion of modifiers (e.g. a bunch of NP, a whole bunch of NP) are invisible in elementary trees but are handled in parsing by adjunction.

Consider the sentence in example (1).

(1) Acid rains in Ghana are equally grim.

When it is scanned by a left-to-right parser, two competing interpretations are syntactically valid for the first 4 words. One of them considers rains as a verb whose subject is acid while, according to the other, rains is the head noun of the NN compound acid rains. Our objective is to propose a parsing strategy which would promote the latter interpretation due to the fact that it contains a known MWE.
More precisely, the parser should: (i) trivially, admit only grammar-compliant analyses of a sentence, (ii) reach MWE-oriented interpretations more rapidly than potential compositional interpretations, (iii) eliminate no grammar-compliant interpretations. Note that all these conditions could rather easily be met for sentence (1) in a pre-processing-based approach in which potential MWEs are identified prior to parsing and conflated into word-with-spaces tokens. Such an approach might however lead to a parsing failure in the case of sentence (2) if the two initial tokens are wrongly merged into a nominal compound in the pre-parsing step.

(2) Hunger strikes the civilians since 2001.

In order to avoid errors of this kind, MWE identification and parsing should be performed jointly. Seminal works, such as (Finkel and Manning 2009, Green et al. 2011, 2013, Constant et al. 2013), show that the results of probabilistic MWE identification and/or parsing are improved when both tasks are performed simultaneously. (Wehrli et al. 2010) point out that such an improvement (also within further parsing-based applications, e.g. machine translation) occurs in symbolic parsing (here: in a Chomskian grammar-based approach) when the knowledge about a potential occurrence of MWEs guides the parsing process.

Our goal is to apply a strategy similar to the one in (Wehrli et al. 2010), i.e. to systematically promote MWE-oriented interpretations, within LTAG parsing.¹ We additionally wish to design the parser architecture in such a way that corpus-based probabilities about MWE contexts can be

¹ The parsing algorithm should of course abstract away from the way the input LTAG grammar was obtained (manually crafted, generated from a metagrammar, or learned from a treebank).
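Requirements (ii) and (iii) can be illustrated with a deliberately simplified sketch. It is not an LTAG parser and not the authors' algorithm: it only enumerates segmentations of a sentence into lexical units with a best-first agenda (uniform-cost search, i.e. A* with a zero heuristic). The cost function (one point per lexical unit, an MWE counting as a single unit), the toy lexicon `MWE_LEXICON` and the function name `segmentations_best_first` are all assumptions made for illustration.

```python
import heapq

# Hypothetical toy MWE lexicon (assumption): each entry is a tuple of tokens.
MWE_LEXICON = {("acid", "rains"), ("hunger", "strikes")}


def segmentations_best_first(tokens, mwes=MWE_LEXICON):
    """Yield every segmentation of `tokens` into lexical units, fewest units first.

    Assumed cost: the number of units used so far, so a segmentation that
    contains an MWE is cheaper and leaves the agenda earlier. Nothing is pruned:
    every segmentation is eventually yielded.
    """
    counter = 0                       # unique tie-breaker for heap entries
    agenda = [(0, counter, 0, ())]    # (cost, tie-breaker, position, units)
    while agenda:
        cost, _, pos, units = heapq.heappop(agenda)
        if pos == len(tokens):
            yield list(units)
            continue
        # Compositional reading: consume a single token.
        counter += 1
        heapq.heappush(agenda, (cost + 1, counter, pos + 1, units + ((tokens[pos],),)))
        # MWE reading: consume a whole lexicon entry if it matches at `pos`.
        for mwe in mwes:
            end = pos + len(mwe)
            if tuple(tokens[pos:end]) == mwe:
                counter += 1
                heapq.heappush(agenda, (cost + 1, counter, end, units + (mwe,)))


if __name__ == "__main__":
    sentence = "acid rains in ghana are equally grim".split()
    for seg in segmentations_best_first(sentence):
        print(seg)
```

For sentence (1), the segmentation containing the unit `("acid", "rains")` is yielded before the fully compositional one. For sentence (2), the compositional segmentation is still produced after the one containing `("hunger", "strikes")`, so a downstream grammar check could reject the compound reading without causing a parsing failure.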
