<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1070">
  <Title>An alternative method of training probabilistic LR parsers</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The LR parsing strategy was originally devised for programming languages (Sippu and Soisalon-Soininen, 1990), but has been used in a wide range of other areas as well, such as for natural language processing (Lavie and Tomita, 1993; Briscoe and Carroll, 1993; Ruland, 2000). The main difference between the application to programming languages and the application to natural languages is that in the latter case the parsers should be nondeterministic, in order to deal with ambiguous context-free grammars (CFGs). Nondeterminism can be handled in a number of ways, but the most efficient is tabulation, which allows processing in polynomial time. Tabular LR parsing is known from the work by (Tomita, 1986), but can also be achieved by the generic tabulation technique due to (Lang, 1974; Billot and Lang, 1989), which assumes an input pushdown transducer (PDT). In this context, the LR parsing strategy can be seen as a particular mapping from context-free grammars to PDTs.</Paragraph>
    <Paragraph position="1"> The acronym 'LR' stands for 'Left-to-right processing of the input, producing a Right-most derivation (in reverse)'. When we construct a PDTAfrom a CFGGby the LR parsing strategy and apply it on an input sentence, then the set of output strings ofA represents the set of all right-most derivations thatG allows for that sentence. Such an output string enumerates the rules (or labels that identify the rules uniquely) that occur in the corresponding right-most derivation, in reversed order.</Paragraph>
    <Paragraph position="2"> If LR parsers do not use lookahead to decide between alternative transitions, they are called LR(0) parsers. More generally, if LR parsers look ahead k symbols, they are called LR(k) parsers; some simplified LR parsing models that use lookahead are called SLR(k) and LALR(k) parsing (Sippu and Soisalon-Soininen, 1990). In order to simplify the discussion, we abstain from using lookahead in this article, and 'LR parsing' can further be read as 'LR(0) parsing'. We would like to point out however that our observations carry over to LR parsing with lookahead.</Paragraph>
    <Paragraph position="3"> The theory of probabilistic pushdown automata (Santos, 1972) can be easily applied to LR parsing.</Paragraph>
    <Paragraph position="4"> A probability is then assigned to each transition, by a function that we will call the probability function pA, and the probability of an accepting computation of A is the product of the probabilities of the applied transitions. As each accepting computation produces a right-most derivation as output string, a probabilistic LR parser defines a probability distribution on the set of parses, and thereby also a probability distribution on the set of sentences generated by grammar G. Disambiguation of an ambiguous sentence can be achieved on the basis of a comparison between the probabilities assigned to the respective parses by the probabilistic LR model.</Paragraph>
    <Paragraph position="5"> The probability function can be obtained on the basis of a treebank, as proposed by (Briscoe and Carroll, 1993) (see also (Su et al., 1991)). The model by (Briscoe and Carroll, 1993) however incorporated a mistake involving lookahead, which was corrected by (Inui et al., 2000). As we will not discuss lookahead here, this matter does not play a significant role in the current study. Noteworthy is that (Sornlertlamvanich et al., 1999) showed empirically that an LR parser may be more accurate than the original CFG, if both are trained on the basis of the same treebank. In other words, the resulting probability function pA on transitions of the PDT allows better disambiguation than the corresponding function pG on rules of the original grammar. A plausible explanation of this is that stack symbols of an LR parser encode some amount of left context, i.e. information on rules applied earlier, so that the probability function on transitions may encode dependencies between rules that cannot be encoded in terms of the original CFG extended with rule probabilities. The explicit use of left context in probabilistic context-free models was investigated by e.g. (Chitrao and Grishman, 1990; Johnson, 1998), who also demonstrated that this may significantly improve accuracy. Note that the probability distributions of language may be beyond the reach of a given context-free grammar, as pointed out by e.g. (Collins, 2001). Therefore, the use of left context, and the resulting increase in the number of parameters of the model, may narrow the gap between the given grammar and ill-understood mechanisms underlying actual language.</Paragraph>
    <Paragraph position="6"> One important assumption that is made by (Briscoe and Carroll, 1993) and (Inui et al., 2000) is that trained probabilistic LR parsers should be proper, i.e. if several transitions are applicable for a given stack, then the sum of probabilities assigned to those transitions by probability function pA should be 1. This assumption may be motivated by pragmatic considerations, as such a proper model is easy to train by relative frequency estimation: count the number of times a transition is applied with respect to a treebank, and divide it by the number of times the relevant stack symbol (or pair of stack symbols) occurs at the top of the stack.</Paragraph>
    <Paragraph position="7"> Let us call the resulting probability function prfe.</Paragraph>
    <Paragraph position="8"> This function is provably optimal in the sense that the likelihood it assigns to the training corpus is maximal among all probability functionspAthat are proper in the above sense.</Paragraph>
    <Paragraph position="9"> However, properness restricts the space of probability distributions that a PDT allows. This means that a (consistent) probability function pA may exist that is not proper and that assigns a higher likelihood to the training corpus than prfe does. (By 'consistent' we mean that the probabilities of all strings that are accepted sum to 1.) It may even be the case that a (proper and consistent) probability function pG on the rules of the input grammarG exists that assigns a higher likelihood to the corpus than prfe, and therefore it is not guaranteed that LR parsers allow better probability estimates than the CFGs from which they were constructed, if we constrain probability functions pA to be proper. In this respect, LR parsing differs from at least one other well-known parsing strategy, viz. left-corner parsing. See (Nederhof and Satta, 2004) for a discussion of a property that is shared by left-corner parsing but not by LR parsing, and which explains the above difference.</Paragraph>
    <Paragraph position="10"> As main contribution of this paper we establish that this restriction on expressible probability distributions can be dispensed with, without losing the ability to perform training by relative frequency estimation. What comes in place of properness is reverse-properness, which can be seen as properness of the reversed pushdown automaton that processes input from right to left instead of from left to right, interpreting the transitions of A backwards.</Paragraph>
    <Paragraph position="11"> As we will show, reverse-properness does not restrict the space of probability distributions expressible by an LR automaton. More precisely, assume some probability distribution on the set of derivations is specified by a probability function pA on transitions of PDT A that realizes the LR strategy for a given grammar G. Then the same probability distribution can be specified by an alternative such function p0A that is reverse-proper. In addition, for each probability distribution on derivations expressible by a probability function pG forG, there is a reverse-proper probability function pA for A that expresses the same probability distribution.</Paragraph>
    <Paragraph position="12"> Thereby we ensure that LR parsers become at least as powerful as the original CFGs in terms of allowable probability distributions.</Paragraph>
    <Paragraph position="13"> This article is organized as follows. In Section 2 we outline our formalization of LR parsing as a construction of PDTs from CFGs, making some superficial changes with respect to standard formulations. Properness and reverse-properness are discussed in Section 3, where we will show that reverse-properness does not restrict the space of probability distributions. Section 4 reports on experiments, and Section 5 concludes this article.</Paragraph>
  </Section>
class="xml-element"></Paper>