<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1066">
  <Title>Automatic Compensation for Parser Figure-of-Merit Flaws*</Title>
  <Section position="2" start_page="0" end_page="513" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Sentence parsing is a task which is traditionally rather computationally intensive.</Paragraph>
    <Paragraph position="1"> The best known practical methods are still roughly cubic in the length of the sentence, which is less than ideal when dealing with nontrivial sentences of 30 or 40 words in length, as frequently found in the Penn Wall Street Journal treebank corpus.</Paragraph>
    <Paragraph position="2"> Fortunately, there is now a body of literature on methods to reduce parse time so that the exhaustive limit is never reached in practice (see footnote 1). For much of the work, the chosen vehicle is chart parsing. In this technique, the parser begins at the word or tag level and uses the rules of a context-free grammar to build larger and larger constituents. Completed constituents are stored in the cells of a chart according to their location and length.</Paragraph>
    <Paragraph position="3"> Footnote *: This research was funded in part by NSF Grant IRI-9319516 and ONR Grant N0014-96-1-0549. Footnote 1: An exhaustive parse always &quot;overgenerates&quot; because the grammar contains thousands of extremely rarely applied rules; these are (correctly) rejected even by the simplest parsers, eventually, but it would be better to avoid them entirely.</Paragraph>
    <Paragraph position="4"> Incomplete constituents (&quot;edges&quot;) are stored in an agenda. The exhaustion of the agenda definitively marks the completion of the parsing algorithm, but the parse need not take that long; already in the early work on chart parsing, (Kay, 1970) suggests that by ordering the agenda one can find a parse without resorting to an exhaustive search. The introduction of statistical parsing brought with it an obvious tactic for ranking the agenda: (Bobrow, 1990) and (Chitrao and Grishman, 1990) first used probabilistic context-free grammars (PCFGs) to generate probabilities for use in a figure of merit (FOM). Later work introduced other FOMs formed from PCFG data (Kochman and Kupin, 1991; Magerman and Marcus, 1991; Miller and Fox, 1994).</Paragraph>
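    <Paragraph> The idea of ordering the agenda so that a parse is found before the agenda is exhausted can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy grammar, the lexicon, the function name best_first_parse, and the choice of plain inside probability as the figure of merit. It also ranks completed constituents rather than incomplete edges, so it is a simplified stand-in and not any of the parsers or FOMs cited here.

```python
import heapq
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form; none of these rules or
# probabilities come from the paper.
LEXICON = {            # word -> list of (tag, probability)
    "dogs": [("NP", 0.5)],
    "cats": [("NP", 0.5)],
    "chase": [("V", 1.0)],
}
BINARY = {             # (left child, right child) -> list of (parent, probability)
    ("V", "NP"): [("VP", 1.0)],
    ("NP", "VP"): [("S", 1.0)],
}


def best_first_parse(words, goal="S"):
    """Pop constituents in figure-of-merit order until the goal spans the input.

    The figure of merit is simply the constituent's inside probability,
    an assumed stand-in for the more refined FOMs discussed in the text.
    Returns (probability of the first goal found, number of agenda pops),
    or (None, pops) if the agenda empties without a parse.
    """
    n = len(words)
    agenda = []                       # min-heap keyed on the negated FOM
    chart = {}                        # (label, start, end) -> probability
    by_start = defaultdict(list)      # start -> [(label, end)]
    by_end = defaultdict(list)        # end -> [(label, start)]

    # Seed the agenda at the word/tag level.
    for i, word in enumerate(words):
        for tag, p in LEXICON.get(word, []):
            heapq.heappush(agenda, (-p, tag, i, i + 1))

    pops = 0
    while agenda:
        neg_p, label, start, end = heapq.heappop(agenda)
        pops += 1
        key = (label, start, end)
        if key in chart:              # already completed with an equal or better score
            continue
        p = -neg_p
        chart[key] = p
        if key == (goal, 0, n):
            return p, pops            # stop early: the agenda need not be exhausted
        by_start[start].append((label, end))
        by_end[end].append((label, start))
        # Combine with completed neighbours to the right ...
        for right, r_end in by_start[end]:
            for parent, rule_p in BINARY.get((label, right), []):
                q = p * rule_p * chart[(right, end, r_end)]
                heapq.heappush(agenda, (-q, parent, start, r_end))
        # ... and to the left.
        for left, l_start in by_end[start]:
            for parent, rule_p in BINARY.get((left, label), []):
                q = p * rule_p * chart[(left, l_start, start)]
                heapq.heappush(agenda, (-q, parent, l_start, end))
    return None, pops
```

Calling best_first_parse(["dogs", "chase", "cats"]) returns the probability of the first complete S found together with the number of agenda pops that were needed; that pop count plays the role of the edge counts used as the efficiency measure in this line of work. A better figure of merit would aim to reach the goal with fewer pops on realistic grammars.</Paragraph>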
    <Paragraph position="5"> More recently, we have seen parse times lowered by several orders of magnitude. The (Caraballo and Charniak, 1998) article considers a number of different figures of merit for ordering the agenda, and ultimately recommends one that reduces the number of edges required for a full parse into the thousands. (Goldwater et al., 1998) (henceforth [Gold98]) introduces an edge-based technique (instead of a constituent-based one), which drops the average edge count into the hundreds. However, if we establish &quot;perfection&quot; as the minimum number of edges needed to generate the correct parse (47.5 edges on average in our corpus), we can hope for still more improvement. This paper looks at two new figures of merit, both of which take the [Gold98] figure (of &quot;independent&quot; merit) as a starting point in calculating a new figure of merit for each edge, taking into account some additional information. Our work further lowers the average edge count, bringing it from the hundreds into the dozens.</Paragraph>
  </Section>
</Paper>