<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1044">
  <Title>Parsing with Treebank Grammars: Empirical Bounds, Theoretical Models, and the Structure of the Penn Treebank</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Parameters
</SectionTitle>
    <Paragraph position="0"> The parameters we varied were:  The default settings are shown above in bold face. We do not discuss all possible combinations of these settings. Rather, we take the bottom-up parser using an untransformed grammar with trie rule encodings to be the basic form of the parser. Except where noted, we will discuss how each factor affects this baseline, as most of the effects are orthogonal. When we name a setting, any omitted parameters are assumed to be the defaults.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Tree Transforms
</SectionTitle>
      <Paragraph position="0"> In all cases, the grammar was directly induced from (transformed) Penn treebank trees. The transforms used are shown in figure 1. For all settings, functional tags and cross-referencing annotations were stripped. For NOTRANSFORM, no other modification was made. In particular, empty nodes (represented as -NONE- in the treebank) were turned into rules that generated the empty string (ε), and there was no collapsing of categories (such as PRT and ADVP) as is often [...] [Figure caption fragment: [...] the rules for the category NP. Non-black states are active, non-white states are accepting, and bold transitions are phrasal.]</Paragraph>
      <Paragraph position="1"> For NOEMPTIES, empties were removed by pruning nonterminals which covered no overt words. For NOUNARIESHIGH and NOUNARIESLOW, unary nodes were removed as well, by keeping only the tops and the bottoms of unary chains, respectively.2</Paragraph>
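      <Paragraph position="2"> The NOEMPTIES and NOUNARIES transforms described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the experiments. It assumes that a tree is a nested Python tuple, with nonterminals written as (label, [children]) and preterminals as (tag, word). The function names remove_empties and collapse_unaries are hypothetical, and treating a preterminal tag as the bottom of a unary chain is an assumption of this sketch.

```python
EMPTY_TAG = "-NONE-"  # treebank label for empty nodes


def remove_empties(tree):
    """NOEMPTIES-style pruning: drop every node that covers no overt word.

    Returns None when the whole subtree is empty.
    """
    label, rest = tree
    if isinstance(rest, str):  # preterminal: (tag, word)
        return None if label == EMPTY_TAG else tree
    kids = [k for k in (remove_empties(c) for c in rest) if k is not None]
    return (label, kids) if kids else None


def collapse_unaries(tree, keep="high"):
    """Collapse each unary chain to a single node.

    keep="high" keeps the top label of the chain (NOUNARIESHIGH-like),
    keep="low" keeps the bottom label (NOUNARIESLOW-like).
    Assumption: a chain may end in a preterminal tag.
    """
    top_label, rest = tree
    label = top_label
    # Follow the chain downward while there is exactly one child.
    while not isinstance(rest, str) and len(rest) == 1:
        label, rest = rest[0]
    new_label = top_label if keep == "high" else label
    if isinstance(rest, str):  # the chain ended at a word
        return (new_label, rest)
    return (new_label, [collapse_unaries(k, keep) for k in rest])


if __name__ == "__main__":
    # Toy tree, made up for illustration.
    tree = ("TOP", [("S", [("NP", [("-NONE-", "*")]),
                           ("VP", [("VB", "Atone")])])])
    pruned = remove_empties(tree)
    print(pruned)
    print(collapse_unaries(pruned, "high"))
    print(collapse_unaries(pruned, "low"))
```

 On the toy tree, pruning removes the NP that dominates only an empty node, after which the remaining tree is a single unary chain that collapses to one node under either setting.</Paragraph>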
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Grammar Rule Encodings
</SectionTitle>
      <Paragraph position="0"> The parser operates on Finite State Automata (FSA) grammar representations. We compiled grammar rules into FSAs in three ways: LISTs, TRIEs, and MINimized FSAs. An example of each representation is given in figure 2. For LIST encodings, each local tree type was encoded in its own, linearly structured FSA, corresponding to Earley (1970)-style dotted rules. For TRIE, there was one FSA per category, encoding together all rule types producing that category. For MIN, state-minimized FSAs were constructed from the trie FSAs. Note that while the rule encoding may dramatically affect the efficiency of a parser, it does not change the actual set of parses for a given sentence in any way.3 [Footnote fragment: [...]pacting grammars. For example, the prefix-compacted tries we use are the same as the common practice of ignoring items before the dot in a dotted rule (Moore, 2000). Another [...]]</Paragraph>
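      <Paragraph position="1"> The difference between the LIST and TRIE encodings can be illustrated with a small sketch. It is not the parser's actual data structure: the dictionary layout, the function names list_encoding, trie_encoding and accepts, and the three toy NP rules are all assumptions made for illustration. The MIN encoding is omitted, since it would additionally require a state-minimization step.

```python
def list_encoding(rules):
    """LIST: one linear FSA per rule, one state per dot position."""
    fsas = []
    for lhs, rhs in rules:
        # State i means "the first i symbols of rhs have been matched".
        transitions = {(i, sym): i + 1 for i, sym in enumerate(rhs)}
        fsas.append({"category": lhs, "transitions": transitions,
                     "num_states": len(rhs) + 1, "accepting": {len(rhs)}})
    return fsas


def trie_encoding(rules):
    """TRIE: one FSA per category; rules with a common prefix share states."""
    fsas = {}
    for lhs, rhs in rules:
        fsa = fsas.setdefault(lhs, {"category": lhs, "transitions": {},
                                    "num_states": 1, "accepting": set()})
        state = 0  # state 0 is the shared start state of this category
        for sym in rhs:
            key = (state, sym)
            if key not in fsa["transitions"]:
                # Unseen prefix: allocate a fresh state.
                fsa["transitions"][key] = fsa["num_states"]
                fsa["num_states"] += 1
            state = fsa["transitions"][key]
        fsa["accepting"].add(state)
    return fsas


def accepts(fsa, symbols):
    """Return True if the FSA accepts the given right-hand side."""
    state = 0
    for sym in symbols:
        state = fsa["transitions"].get((state, sym))
        if state is None:
            return False
    return state in fsa["accepting"]


if __name__ == "__main__":
    # Toy rules, made up for illustration.
    rules = [("NP", ("DT", "NN")), ("NP", ("DT", "JJ", "NN")), ("NP", ("NNS",))]
    lists = list_encoding(rules)
    tries = trie_encoding(rules)
    print(sum(f["num_states"] for f in lists), tries["NP"]["num_states"])
```

 In this toy grammar the two rules beginning with DT share their first transition in the trie, so the trie needs fewer states than the three separate list FSAs, while both encodings accept exactly the same right-hand sides.</Paragraph>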
    </Section>
  </Section>
</Paper>