<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0209">
  <Title>APPORTIONING DEVELOPMENT EFFORT IN A PROBABILISTIC LR PARSING SYSTEM THROUGH EVALUATION</Title>
  <Section position="3" start_page="0" end_page="92" type="intro">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> This work is part of an effort to develop a robust, domain-independent syntactic parser capable of yielding the unique correct analysis for unrestricted naturally-occurring input. Our goal is to develop a system with performance comparable to extant part-of-speech taggers, returning a syntactic analysis from which predicate-argument structure can be recovered, and which can support semantic interpretation.</Paragraph>
    <Paragraph position="1"> [...] second author was visiting Rank Xerox, Grenoble. The work was also supported by UK DTI/SALT project 41/5808 'Integrated Language Database', and by SERC/EPSRC Advanced Fellowships to both authors. Geoff Nunberg provided encouragement and much advice on the analysis of punctuation, and Greg Grefenstette undertook the original corpus tokenisation and segmentation for the punctuation experiments. Bernie Jones and Kiku Ribas made helpful comments on an earlier draft. We are responsible for any mistakes.</Paragraph>
    <Paragraph position="2"> The requirement for a domain-independent analyser favours statistical techniques to resolve ambiguities, whilst the latter goal favours a more sophisticated grammatical formalism than is typical in statistical approaches to robust analysis of corpus material.</Paragraph>
    <Paragraph position="3"> Briscoe &amp; Carroll (1993) describe a probabilistic parser using a wide-coverage unification-based grammar of English written in the Alvey Natural Language Tools (ANLT) metagrammatical formalism (Briscoe et al., 1987), generating around 800 rules in a syntactic variant of the Definite Clause Grammar formalism (DCG, Pereira &amp; Warren, 1980) extended with iterative (Kleene) operators. The ANLT grammar is linked to a lexicon containing about 64K entries for 40K lexemes, including detailed subcategorisation information appropriate for the grammar, built semi-automatically from a learners' dictionary (Carroll &amp; Grover, 1989). The resulting parser is efficient: it constructs a parse forest in roughly quadratic time (empirically) and returns the n most likely analyses in ranked order (Carroll, 1993, 1994). The probabilistic model is a refinement of probabilistic context-free grammar (PCFG), conditioning CF 'backbone' rule application on LR state and lookahead item. Unification of the 'residue' of features not incorporated into the backbone is performed at parse time in conjunction with reduce operations. Unification failure results in the associated derivation being assigned a probability of zero. Probabilities are assigned to transitions in the LALR(1) action table via a process of supervised training based on computing the frequency with which transitions are traversed in a corpus of parse histories. The result is a probabilistic parser which, unlike a PCFG, can discriminate probabilistically between derivations that differ only in the order in which the same set of CF backbone rules is applied, because the LR table defines the parse context.</Paragraph>
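The training step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the representation of a parse history as a list of (state, lookahead, action) triples, the function names, and the choice to normalise counts per (state, lookahead) pair are all assumptions made for illustration, and the normalisation used by Briscoe and Carroll (1993) may differ in detail.

```python
from collections import Counter, defaultdict


def estimate_action_probs(histories):
    """Relative-frequency estimate of LR action probabilities.

    `histories` is an iterable of parse histories; each history is a list of
    (state, lookahead, action) triples. This data layout is hypothetical.
    """
    counts = defaultdict(Counter)
    for history in histories:
        for state, lookahead, action in history:
            counts[(state, lookahead)][action] += 1

    probs = {}
    for context, action_counts in counts.items():
        total = sum(action_counts.values())
        # Assumption: normalise over all actions seen in the same
        # (state, lookahead) context.
        probs[context] = {a: c / total for a, c in action_counts.items()}
    return probs


def derivation_probability(history, probs, unification_ok=True):
    """Product of action probabilities along one derivation.

    A derivation whose unification fails is assigned probability zero,
    as described in the text. Unseen actions also score zero here
    (no smoothing, which is a simplification).
    """
    if not unification_ok:
        return 0.0
    p = 1.0
    for state, lookahead, action in history:
        p *= probs.get((state, lookahead), {}).get(action, 0.0)
    return p
```

Because each action is scored in the context of the LR state and lookahead, two derivations that apply the same backbone rules in a different order pass through different contexts and can therefore receive different probabilities, which a plain PCFG cannot express.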
    <Paragraph position="4"> Experiments with this system revealed three major problems which our current research is addressing. Firstly, improvements in probabilistic parse selection will require a 'lexicalised' grammar/parser in which (minimally) probabilities are associated with alternative subcategorisation possibilities of individual lexical items. Currently, the relative frequency of subcategorisation possibilities for individual lexical items is not recorded in wide-coverage lexicons, such as ANLT or COMLEX (Grishman et al., 1994). Secondly, removal of punctuation from the input (after segmentation into text sentences) worsens performance, as punctuation both reduces syntactic ambiguity (Jones, 1994) and signals non-syntactic (discourse) relations between text units (Nunberg, 1990). Thirdly, the largest source of error on unseen input is the omission of appropriate subcategorisation values for lexical items (mostly verbs), preventing the system from finding the correct analysis.</Paragraph>
    <Paragraph position="5"> The current coverage--the proportion of sentences for which at least one analysis was found--of this system on a general corpus (e.g. Brown or LOB) is estimated to be around 20% by Briscoe (1994). Therefore, we have developed a variant probabilistic LR parser which does not rely on subcategorisation and uses punctuation to reduce ambiguity. The analyses produced by this parser can be utilised for phrase-finding applications, recovery of subcategorisation frames, and other 'intermediate' level parsing problems.</Paragraph>
  </Section>
</Paper>