<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1303">
  <Title>Does tagging help parsing? A case study on finite state parsing</Title>
  <Section position="3" start_page="27" end_page="30" type="metho">
    <SectionTitle>
2 The finite state parser
</SectionTitle>
    <Paragraph position="0"> The finite state parser outlined in this section is described in greater detail in Tapanainen [11] and Voutilainen [14].</Paragraph>
    <Section position="1" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
2.1 Grammatical representation
</SectionTitle>
      <Paragraph position="0"> Let us describe the syntactic representation with an example. The parser produces the following analysis for the sentence The man who is fond of singing this aria killed his father (some morphological information is deleted for readability):  @@ the DET @&gt;N @ man N @SUBJ @&lt; who PRON @SUBJ @ be V @MV N&lt;@ @ fond A @SC @ of PREP @N&lt; @ sing PCP1 @mv P&lt;&lt;@ @ this DET @&gt;N @ aria N @obj @&gt; kill V @MV MAINC@ @ he PRON @&gt;N @ father N @OBJ @ @fullstop @@</Paragraph>
      <Paragraph position="2"> The representation consists of base forms and various kinds of tags. "@@" indicates sentence boundaries; the centre-embedded finite clause "who is fond of singing this aria" is flanked by the clause boundary tags @&lt; and @&gt;, and its function is postmodifying, as indicated with the second tag N&lt;@ of "be", the main verb (@MV) of the clause. The pronoun "who" is the subject (@SUBJ) of this clause, and the adjective "fond" is the subject complement (@SC) that is followed by the postmodifying (@N&lt;) prepositional phrase starting with "of", whose complement is the nonfinite main verb (@mv) "sing" that has the noun "aria" as its object (@obj) (note that lower case is reserved for functions in nonfinite clauses).</Paragraph>
      <Paragraph position="3"> The matrix clause "The man killed his father" is a finite main clause (MAINC@) whose main verb (@MV) is "kill". The subject (@SUBJ) of the finite clause is the noun "man", while the noun "father" is the object in the finite clause (@OBJ). The word "father" has one premodifier (@&gt;N), namely the genitive pronoun "he".</Paragraph>
      <Paragraph position="4"> This representation is designed to follow the principle of surface-syntacticity: distinctions not motivated by surface grammatical phenomena, e.g. many attachment and coordination problems, are avoided by making the syntactic representation sufficiently underspecific in the description of grammatically (if not semantically) unresolvable distinctions.</Paragraph>
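      <Paragraph> The flat string above can be unpacked mechanically. The following is a minimal reading sketch, not part of the described system; the token classes are assumptions read off the example (the tokens "@", "@@", "@&lt;", "@&gt;" standing alone are word/sentence/clause boundaries, any other token containing "@" is a function tag, and the remaining tokens come in base-form/POS pairs):</Paragraph>
```python
# Hedged sketch: collect (base form, POS, function tags) per word
# from the flat analysis string shown above.
BOUNDARIES = {"@", "@@", "@<", "@>"}

def read_analysis(flat):
    words, expect_pos = [], False
    for token in flat.split():
        if token in BOUNDARIES:
            continue                       # boundary marker, no word content
        if "@" in token:                   # function tag, e.g. @SUBJ or N<@
            words[-1][2].append(token)     # (assumes a word has been seen)
            expect_pos = False
        elif expect_pos:                   # the POS label follows the base form
            words[-1][1] = token
            expect_pos = False
        else:                              # a new word's base form
            words.append([token, None, []])
            expect_pos = True
    return words

analysis = read_analysis(
    "@@ the DET @>N @ man N @SUBJ @< who PRON @SUBJ @ be V @MV N<@ @ "
    "fond A @SC @ of PREP @N< @ sing PCP1 @mv P<<@ @ this DET @>N @ "
    "aria N @obj @> kill V @MV MAINC@ @ he PRON @>N @ father N @OBJ @ @fullstop @@")
# Punctuation tokens such as @fullstop end up as trailing tags in this sketch.
print([w[0] for w in analysis if "@SUBJ" in w[2]])   # ['man', 'who']
```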
    </Section>
    <Section position="2" start_page="28" end_page="29" type="sub_section">
      <SectionTitle>
2.2 Analysis routine
</SectionTitle>
      <Paragraph position="0"> The tokeniser identifies words and punctuation marks. The morphological analyser contains a rule-based lexicon and a guesser that assign one or more morphological analyses to each word, cf. the analysis of the word-form "tries". [Example analysis lost in extraction.] The compact representation contains 16 × 4 × 14 × 16 × 4 = 57,344 different sentence readings. Long sentences easily get 10^50–10^100 different sentence readings at this stage, i.e. the ambiguity problem with this syntactic representation is considerable.</Paragraph>
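      <Paragraph> To spell out the arithmetic: the readings of a sentence are the product of the per-word alternatives, which is why ambiguity grows multiplicatively with sentence length. A quick check, assuming the five factors quoted above are per-word reading counts:</Paragraph>
```python
# Sentence readings multiply across words: five words carrying
# 16, 4, 14, 16 and 4 alternative analyses give the quoted total.
from math import prod

per_word = [16, 4, 14, 16, 4]
print(prod(per_word))        # 57344
```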
      <Paragraph position="1"> The final stage in this setup is resolution of syntactic ambiguities: those sentence readings that violate even one syntactic rule in the grammar are discarded; the rest are proposed as parses of the sentence.</Paragraph>
    </Section>
    <Section position="3" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
2.3 Rule formalism
</SectionTitle>
      <Paragraph position="0"> Grammar rules are basically extended regular expressions. A typical rule is the implication rule whereby contextual requirements can be expressed for a distributional (or functional) category.</Paragraph>
      <Paragraph position="1"> For instance the following partial rule (taken from Voutilainen [12]) about a syntactic form category, namely prepositional phrases,</Paragraph>
      <Paragraph position="3"> states a number of alternative contexts in which the expression (given left of the arrow) occurs.</Paragraph>
      <Paragraph position="4"> The underscore shows the position of the expression with regard to the required alternative contexts, expressed as regular expressions. The parser interprets this kind of rule in the following way: whenever a string satisfying the expression left of the arrow is detected, the parser checks whether any of the required contextual expressions are found in the input sentence reading.</Paragraph>
      <Paragraph position="5"> If a contextual licence is found, the sentence reading is accepted by the rule; otherwise the sentence reading is rejected.</Paragraph>
      <Paragraph position="6"> Another typical rule is the "nowhere" predicate, with which the occurrence of a given regular expression can be forbidden. For instance, the predicate nowhere(VFIN .. VFIN); forbids the occurrence of two finite verbs in the same finite clause.</Paragraph>
      <Paragraph position="7"> These finite-state rules express partial facts about the language, and they are independent of each other in the sense that no particular application order is expected. A sentence reading is accepted by the parser only if it is accepted by each individual rule.</Paragraph>
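      <Paragraph> For illustration only: the real parser compiles such rules into finite automata, but their effect can be approximated with ordinary regular expressions over a flat sentence-reading string. In this hedged sketch the ".." of the nowhere example is read as "any material not crossing a clause boundary" (an assumption), and the helper names are hypothetical:</Paragraph>
```python
import re

def nowhere(pattern):
    """Reject any sentence reading in which `pattern` occurs."""
    rx = re.compile(pattern)
    return lambda reading: rx.search(reading) is None

def implication(target, contexts):
    """Coarse per-reading approximation of an implication rule: a
    reading containing `target` must also match at least one of the
    alternative licensing context expressions."""
    tgt = re.compile(target)
    ctx = [re.compile(c) for c in contexts]
    return lambda r: not tgt.search(r) or any(c.search(r) for c in ctx)

# Two finite verbs with no clause boundary (@< or @>) between them
# are forbidden, i.e. nowhere(VFIN .. VFIN) within one clause.
no_two_vfin = nowhere(r"VFIN(?:(?!@<|@>).)*VFIN")

def parse(sentence_readings, rules):
    """Rules are order-independent: a reading survives only if every
    individual rule accepts it."""
    return [r for r in sentence_readings if all(rule(r) for rule in rules)]

readings = ["@@ ... VFIN ... @< ... VFIN ... @@",   # two clauses: accepted
            "@@ ... VFIN ... VFIN ... @@"]          # same clause: rejected
print(parse(readings, [no_two_vfin]))
```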
    </Section>
    <Section position="4" start_page="29" end_page="30" type="sub_section">
      <SectionTitle>
2.4 The grammar
</SectionTitle>
      <Paragraph position="0"> The syntactic grammar contains some 2,600 finite-state rules, each of which has been tested and corrected against a manually parsed corpus of about 250,000 words (over 10,000 unambiguously parsed sentences). Each rule in the grammar accepts virtually all parses in this corpus (i.e. a rule may disagree with at most one or two sentences in the corpus, usually when the sentence contains a little-used construction).</Paragraph>
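      <Paragraph> A sketch of this testing regime, under the same assumptions as the rule sketch in Section 2.3 (rules as predicates over readings; all names hypothetical): run every candidate rule over the attested parses and send back any rule that rejects more than a couple of them:</Paragraph>
```python
def vet_rules(rules, gold_readings, tolerance=2):
    """rules: (name, predicate) pairs; gold_readings: the attested
    parses of the manually parsed benchmark corpus."""
    accepted, needs_fixing = [], []
    for name, rule in rules:
        disagreements = sum(not rule(r) for r in gold_readings)
        (accepted if disagreements <= tolerance else needs_fixing).append(name)
    return accepted, needs_fixing
```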
      <Paragraph position="1"> The rules are not much restricted by engineering considerations; linguistic truth has been more important. This shows e.g. in the non-locality of many of the rules: the description of many syntactic phenomena seems to require reference to contextual elements in the scope of a finite clause, often even in the scope of the whole sentence. This kind of globality has been practised even though it probably results in bigger processing requirements for the finite state disambiguator: many disambiguating decisions have to be delayed, e.g. until the end of the clause or sentence, so more alternatives have to be kept 'alive' longer than would be the case with very local rules.</Paragraph>
      <Paragraph position="2"> Many rules are lexicalised in the sense that some element in the rule is a word (rather than a tag). Though a small, purely feature-based grammar may seem more appealing aesthetically or computationally, many useful lexico-grammatical generalisations would be lost if reference to words were not allowed.</Paragraph>
      <Paragraph position="3"> To sum up: the finite state disambiguator's task is facilitated by using a reasonably resolvable surface-syntactic grammatical representation, but the parser's task remains computationally rather demanding because of (i) the high initial ambiguity of the input, especially in the case of long sentences, (ii) the considerably high number of rules and rule automata, and (iii) the non-locality of the rules. The finite state syntactic disambiguator is clearly faced with a computationally and linguistically very demanding task.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="30" end_page="30" type="metho">
    <SectionTitle>
3 Morphological disambiguators
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="30" end_page="30" type="sub_section">
      <SectionTitle>
3.1 Mature disambiguator
</SectionTitle>
      <Paragraph position="0"> The mature disambiguator is an early version of a system presently known as EngCG-2 (Samuelsson and Voutilainen [8]). EngCG-2 uses a grammar of 3,500 rules according to the Constraint Grammar framework (Karlsson et al., eds., [4]). The rules are pattern-action statements that, depending on rule type, select a morphological reading as correct (by discarding the other readings) or discard a morphological reading as incorrect, when the ambiguity-forming morphological analysis occurs in a context specified by the context-conditions of the constraint. Context-conditions can refer to tags and words in any sentence position; also certain types of word/tag sequences can be used in context-conditions.</Paragraph>
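      <Paragraph> A minimal sketch of the two rule actions just described, not EngCG-2's actual formalism (the rule encoding and names here are hypothetical): a SELECT rule keeps the matching readings and discards the rest, a REMOVE rule discards the matching readings, and neither may strip a word of its last reading:</Paragraph>
```python
def apply_constraint(sentence, i, kind, pattern, context_holds):
    """sentence: per-word lists of morphological readings;
    i: index of the target word. Never removes the last reading."""
    readings = sentence[i]
    if len(readings) < 2 or not context_holds(sentence, i):
        return
    hits = [r for r in readings if pattern in r]
    if kind == "SELECT" and hits:
        sentence[i] = hits                           # keep matches, drop the rest
    elif kind == "REMOVE" and 0 < len(hits) < len(readings):
        sentence[i] = [r for r in readings if r not in hits]

# e.g. select the verbal reading of "tries" after a pronoun:
sent = [["he PRON"], ["try V PRES SG3", "try N NOM PL"]]
apply_constraint(sent, 1, "SELECT", "V PRES",
                 lambda s, i: any("PRON" in r for r in s[i - 1]))
print(sent[1])    # ['try V PRES SG3']
```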
      <Paragraph position="1"> An evaluation and comparison of EngCG-2 to a state-of-the-art statistical tagger is reported in (Samuelsson and Voutilainen [8]). In similar circumstances, the error rate of EngCG-2 was an order of magnitude smaller than that of the statistical tagger. On a 266 MHz Pentium running Linux, EngCG-2 tags around 4,000 words per second.1</Paragraph>
    </Section>
    <Section position="2" start_page="30" end_page="30" type="sub_section">
      <SectionTitle>
3.2 Small disambiguator
</SectionTitle>
      <Paragraph position="0"> To determine the benefit of using a rule set developed in a short time, one long day was spent on writing a constraint grammar of 149 rules for disambiguating frequent and obviously resolvable ambiguities. As the grammarian's empirical basis, a manually disambiguated benchmark corpus of about 300,000 words was used.</Paragraph>
      <Paragraph position="1"> The small grammar was tested against a held-out, manually disambiguated (and several times proofread) corpus of 114,388 words with 87,495 superfluous morphological analyses. After the 149 rules were applied to this corpus, there were still 24,458 superfluous analyses, i.e. about 72% of all extra readings were discarded, and each word in the output contained an average of 1.21 alternative morphological analyses. Of the 63,037 discarded readings, 79 were analysed as contextually legitimate, i.e. almost 99.9% of the predictions made by the new tagger were correct.</Paragraph>
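      <Paragraph> The reported figures are internally consistent; a quick back-of-the-envelope check:</Paragraph>
```python
words        = 114_388
extra_before = 87_495      # superfluous analyses before the 149 rules
extra_after  = 24_458      # superfluous analyses remaining afterwards
discarded    = extra_before - extra_after
print(discarded)                                        # 63037
print(round(100 * discarded / extra_before))            # 72  (% of extras removed)
print(round((words + extra_after) / words, 2))          # 1.21 analyses per word
mistakes = 79              # discarded readings that were in fact legitimate
print(round(100 * (discarded - mistakes) / discarded, 1))   # 99.9 (% correct)
```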
    </Section>
  </Section>
  <Section position="5" start_page="30" end_page="33" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> This section reports the application of the following three setups to new text data: (i) Nodis: the finite state parser is used as such.</Paragraph>
    <Paragraph position="1"> 1 Information about testing and licensing the present version of the EngCG-2 tagger is given at the following URL: http://www.conexor.fi/analysers.html.</Paragraph>
    <Paragraph position="2">  (ii) Small: a morphological disambiguation module with 149 rules is used before the finite state parser.</Paragraph>
    <Paragraph position="3"> (iii) Eng: a morphological disambiguation module with 3,500 rules is used before the finite state parser.</Paragraph>
    <Paragraph position="4"> Three text corpora were used as test data: (i) Data 1: 200 10-word sentences from The Wall Street Journal; (ii) Data 2: 200 15-word sentences from The Wall Street Journal; (iii) Data 3: 200 20-word sentences from The Wall Street Journal. In the word count, punctuation marks were excluded. The data is new to the system. The machine used in the tests is a Sun SparcStation 10/30 with 64 MB of RAM. In the statistics below, the term 'recognition rate' is used: it indicates the percentage of sentences that get at least one analysis, correct or incorrect, from the parser. The parser's correctness rate remains to be determined later (but cf. Section 4.2 below).</Paragraph>
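    <Paragraph> As a concrete restatement of the definition, a trivial sketch:</Paragraph>
```python
def recognition_rate(parse_counts):
    """parse_counts: number of parses per sentence (0 = no analysis).
    Returns the share of sentences that got at least one parse."""
    return 100.0 * sum(n > 0 for n in parse_counts) / len(parse_counts)

print(recognition_rate([3, 0, 1, 12]))   # 75.0
```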
    <Section position="1" start_page="31" end_page="32" type="sub_section">
      <SectionTitle>
4.1 Statistics on input ambiguity
</SectionTitle>
      <Paragraph position="0"> Before going to detailed examinations, some statistics on input ambiguity are given. The following table indicates how many readings each word received on average after possible morphological disambiguation and introduction of syntactic ambiguities. The ambiguity rates are given for morphology and syntax separately.</Paragraph>
      <Paragraph position="1"> [Per-word ambiguity table lost in extraction; only its closing fragment survives:] ... morphological analyses and 14.33 syntactic analyses.</Paragraph>
      <Paragraph position="2"> At the word level, syntactic ambiguity decreases quite considerably even using the small disambiguator, from about 23 syntactic readings per word to some 16.5 syntactic readings per word. Use of the EngCG-2 disambiguator does not contribute much to a further decrease of syntactic ambiguity. Overall, syntactic ambiguity at the word level remains quite large, about 14 analyses per word.</Paragraph>
      <Paragraph position="3"> However, if we consider the ambiguity rate of the finite state parser's input at the sentence level (which is the more common way of looking at ambiguity at the level of syntax), things look more worrying. The following table (next page) presents syntactic ambiguity rates at the sentence level for Data 1 (the 10-word sentences).</Paragraph>
      <Paragraph position="4"> When no morphological disambiguation is done, a typical ambiguity rate is 10^17 sentence readings per input sentence; even after EngCG-2 disambiguation, the typical ambiguity rate is still on the order of 10^11–10^12 sentence readings, consistent with the roughly 14 analyses per word reported above.</Paragraph>
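      <Paragraph> The word-level and sentence-level views are linked by exponentiation: r readings per word compound to roughly r^n readings over an n-word sentence. A hedged order-of-magnitude sketch for the 10-word sentences of Data 1, treating the per-word averages as uniform counts (the sentence-level rates quoted above are higher still, since morphological ambiguity multiplies in on top of these syntactic rates):</Paragraph>
```python
import math

# r syntactic readings per word compound to r**10 over a ten-word
# sentence; printed as a power of ten for comparison with the text.
for label, r in [("no disambiguation", 23),
                 ("small grammar", 16.5),
                 ("EngCG-2", 14)]:
    print(f"{label:>18}: ~10^{10 * math.log10(r):.0f} sentence readings")
```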
    </Section>
    <Section position="2" start_page="32" end_page="32" type="sub_section">
      <SectionTitle>
4.2 Analysis of data 1
</SectionTitle>
      <Paragraph position="0"> Without morphological disambiguation, the parser gave analyses for 98% of all sentences; the use of the EngCG-2 disambiguator decreased the recognition rate by only 1.5%. Considering that the known strength of the EngCG-2 disambiguator is high recall, the small loss in the number of parses does not seem particularly surprising.</Paragraph>
      <Paragraph position="1"> The number of parses decreased even when the small disambiguator was used. The decrease was considerable with EngCG-2, e.g. the rate of sentences receiving 1-5 parses rose from about 60% to 80%. The somewhat unexpected syntactic disambiguating power of the morphological disambiguators is probably due to the lexical nature of the disambiguation grammar (many constraints refer to words, not only to tags). Lexical information has been argued to be an important part of a successful POS tagger (cf. e.g. Church [2]).</Paragraph>
      <Paragraph position="2"> Generally, parsing was rather slow, considering the shortness of the sentences. Disambiguation certainly had a positive impact on parsing time, e.g. the proportion of sentences parsed in less than ten seconds rose from 6% to about 40%.</Paragraph>
    </Section>
    <Section position="3" start_page="32" end_page="33" type="sub_section">
      <SectionTitle>
4.3 Analysis of data 2
</SectionTitle>
      <Paragraph position="0"> Nodis and Small were in trouble due to excessively slow parsing. The first 9 sentences were parsed by all three setups. Here are the relevant statistics.</Paragraph>
      <Paragraph position="1"> [Per-sentence statistics table lost in extraction.] The general trend seems to agree with experiences from Data 1: the number of parses as well as parsing time generally decreases when more morphological disambiguation is carried out (note, however, the curious exception of sentence 3: parsing was faster with no disambiguation than with small disambiguation). Because of the scarcity of the data, more specific comparisons cannot be made.</Paragraph>
      <Paragraph position="2"> The setup with EngCG-2 disambiguation parsed all 200 sentences of Data 2. Because the other setups did not do this in the time allowed, no comparisons could be made. It may however be interesting to make two observations about the number of parses received. Consider the following table.</Paragraph>
      <Paragraph position="3"> [Table garbled in extraction; recoverable cells, out of 200 sentences: 3.5% (7), 19% (38), 58.5% (117), and, for sentences with 1-20 parses, 86% (172).] Of all sentences, 96.5% got at least one parse, i.e. the slightly greater length of the input sentences does not seem to considerably affect the parser's coverage (the recognition rate was the same in Data 1).</Paragraph>
      <Paragraph position="4"> The ambiguity rate increases considerably. For instance, only 28.5% of all sentences in Data 2 (compared to 80% in Data 1) received 1-5 parses.</Paragraph>
    </Section>
    <Section position="4" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
4.4 Analysis of data 3
</SectionTitle>
      <Paragraph position="0"> In the analysis of the 20-word sentences, even the setup using the EngCG-2 disambiguator was in trouble: within the time allowed, the system analysed only 25 sentences. All of them received at least one parse.</Paragraph>
    </Section>
  </Section>
</Paper>