<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1013">
  <Title>Developing a hybrid NP parser</Title>
  <Section position="4" start_page="81" end_page="82" type="metho">
    <SectionTitle>
3 Grammatical representation
</SectionTitle>
    <Paragraph position="0"> The input of our parser is morphologically analyzed and disambiguated text enriched with alternative syntactic tags, e.g.</Paragraph>
    <Paragraph position="1"> &amp;quot;&lt;others&gt;&amp;quot; &amp;quot;other&amp;quot; PRON N0M PL @&gt;N @NH  &amp;quot;&lt;moved&gt;&amp;quot; &amp;quot;move&amp;quot; &lt;SV&gt; &lt;SV0&gt; V PAST VFIN @V &amp;quot;&lt;away&gt;&amp;quot; &amp;quot;away&amp;quot; ADV ADVL @&gt;A @AH &amp;quot;&lt;from&gt;&amp;quot; &amp;quot;from&amp;quot; PREP @DUMMY &amp;quot;&lt;tradit ional&gt;&amp;quot; &amp;quot;traditional&amp;quot; A ABS @&gt;N @N&lt; @NH &amp;quot;&lt;jazz&gt;&amp;quot; &amp;quot;jazz&amp;quot; &lt;-Indef&gt; N NOM SG @&gt;N @NH &amp;quot;&lt;practice&gt;&amp;quot; &amp;quot;practice&amp;quot; N N0M SG @&gt;N @NH &amp;quot;practice&amp;quot; &lt;SV0&gt; V PRES -SG3 VFIN @V  Every indented line represents a morphological reading; the sample shows that some morphological ambiguities are not resolved by the rule-based morphological disambiguator, known as the EngCG tagger (Voutilainen et al., 1992; Karlsson et al., 1995). Our syntactic tags start with the &amp;quot;@&amp;quot; sign. A word is syntactically ambiguous if it has more than one syntactic tags (e.g. practice above has three alternative syntactic tags). Syntactic tags are added to the morphological analysis with a simple lookup module. The syntactic parser's main task is disambiguating (rather than adding new information to the input sentence): contextuMly illegitimate alternatives should be discarded, while legitimate tags should be retained (note that also morphological ambiguities may be resolved as a side effect). Next we describe the syntactic tags:  (intensify) adjectives (including adjectival ING-forms and non-finite ED-forms), adverbs and various kinds of quantifiers (certain determiners, pronouns and numerals).</Paragraph>
    <Paragraph position="2">  parser does not address the attachment of prepositional phrases.</Paragraph>
  </Section>
  <Section position="5" start_page="82" end_page="84" type="metho">
    <SectionTitle>
4 Syntactic rules
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="82" end_page="82" type="sub_section">
      <SectionTitle>
4.1 Rule formalism
</SectionTitle>
      <Paragraph position="0"> The rules follow the Constraint Grammar formalism, and they were applied using the recent parsercompiler CG-2 (Tapanainen, 1996). The parser reads a sentence at a time and discards those ambiguity-forming readings that are disallowed by a constraint.</Paragraph>
      <Paragraph position="1"> Next we describe some basic features of the rule formalism. The rule R~.HOV~. ((c)&gt;hi) (,ic &lt;&lt;&lt; OR ((c)V) OR (~CS) BARRIER (@NH)); removes the premodifier tag @&gt;N from an ambiguous reading if somewhere to the right (*1) there is an unambiguous (C) occurrence of a member of the set &lt;&lt;&lt; (sentence boundary symbols) or the verb tag @V or the subordinating conjunction tag @CS, and there are no intervening tags for nominal heads (@NH).</Paragraph>
      <Paragraph position="2"> This is a partial rule about coordination:</Paragraph>
      <Paragraph position="4"> It removes the premodifier tag if all three context-conditions are satisfied: * the word to be disambiguated (0) is not a determiner, numeral or adjective,  * the first word to the right (1) is an unambiguous coordinating conjunction, and * the second word to the right is an unambiguous determiner.</Paragraph>
      <Paragraph position="5">  In addition to REMOVing, also SELECTing a reading is possible: when all context-conditions are satisfied, all readings but the one the rule was expressly about are discarded.</Paragraph>
      <Paragraph position="6"> The rules can refer to words and tags directly or by means of predefined sets. They can refer not only to any fixed context positions; also reference to contextual patterns is possible. The rules never discard a last reading, so every word retains at least one analysis. On the other hand, an ambiguity remains unresolved if there are no rules for that particular type of ambiguity.</Paragraph>
    </Section>
    <Section position="2" start_page="82" end_page="82" type="sub_section">
      <SectionTitle>
4.2 Grammar development
</SectionTitle>
      <Paragraph position="0"> A day was spent on writing 107 constraints; about 15,000 words of the parser's output were proofread during the process. The routine was the following:  1. The current grammar (containing e.g. 2 rules) is applied to the ambiguous input in a 'trace' mode in which the parser also indicates, which rule discarded which analysis, 2. The grammarian observes remaining ambiguities and proposes new rules for disambiguating them, and 3. He also tries to identify misanalyses (cases  where the correct tag is discarded) and, using the trace information, corrects the faulty rule This routine is useful if the development time is very restricted, and only the most common ambiguity types have to be resolved with reasonable success. However, if the grammar should be of a very high quality (extremely few mispredictions, high degree of ambiguity resolution), a large test corpus, formally similar to the input except for the manually added extra information about the correct analysis, should be used. This kind of test corpus would enable the automatic identification of mispredictions as well as counting of various performance statistics for the rules. However, manually disambiguating a test corpus of a few hundred thousand words would probably require a human effort of at least a month.</Paragraph>
    </Section>
    <Section position="3" start_page="82" end_page="84" type="sub_section">
      <SectionTitle>
4.3 Sample output
</SectionTitle>
      <Paragraph position="0"> The following is genuine output of the linguistic (CG-2) parser using the 107 syntactic disambiguation rules. The traces starting with &amp;quot;S:&amp;quot; indicate the line on which the applied rule is in the grammar file. One syntactic (and morphological) ambiguity remains unresolved: until remains ambiguous due to preposition and subordinating conjunction readings.</Paragraph>
      <Paragraph position="1">  To solve shallow parsing with the relaxation labelling algorithm we model each word in the sentence as a variable, and each of its possible readings as a label for that variable. We start with a uniform weight distribution.</Paragraph>
      <Paragraph position="2"> We will use the algorithm to select the right syntactic tag for every word. Each iteration will increase the weight for the tag which is currently most compatible with the context and decrease the weights for the others.</Paragraph>
      <Paragraph position="3"> Since constraints are used to decide how compatible a tag is with its context, they have to assess the compatibility of a combination of readings. We adapt CG constraints described above.</Paragraph>
      <Paragraph position="4"> The REMOVE constraints express total incompatibility 5 and SELECT constraints express total compatibility (actually, they express incompatibility of all other possibilities).</Paragraph>
      <Paragraph position="5"> The compatibility value for these should be at least as strong as the strongest value for a statistically obtained constraint (see below). This produces a value of about -4-10.</Paragraph>
      <Paragraph position="6"> But because we want the linguistic part of the model to be more important than the statistical part and because a given label will receive the influence SWe model compatibility values using mutual information (Cover and Thomas, 1991), which enables us to use negative numbers to state incompatibility. See (PadrS, 1996) for a performance comparison between M.I. and other measures when applying relaxation labelling to NLP.</Paragraph>
      <Paragraph position="7">  of about two bigrams and three trigrams 6, a single linguistic constraint might have to override five statistical constraints. So we will make the compatibility values six times stronger, that is, =h60. Since in our implementation of the CG parser (Tapanainen, 1996) constraints tend to be applied in a certain order - e.g. SELECT constraints are usually applied before REMOVE constraints - we adjust the compatibility values to get a similar effect: if the value for SELECT constraints is +60, the value for REMOVE constraints will be lower in absolute value, (i.e. -50). With this we ensure that two contradictory constraints (if there are any) do not cancel each other. The SELECT constraint will win, as if it had been applied before.</Paragraph>
      <Paragraph position="8"> This enables using any Constraint Grammar with this algorithm although we are applying it more flexibly: we do not decide whether a constraint is applied or not. It is always applied with an influence (perhaps zero) that depends on the weights of the labels.</Paragraph>
      <Paragraph position="9"> If the algorithm should apply the constraints in a more strict way, we can introduce an influence threshold under which a constraint does not have enough influence, i.e. is not applied.</Paragraph>
      <Paragraph position="10"> We can add more information to our model in the form of statistically derived constraints. Here we use bigrams and trigrams as constraints.</Paragraph>
      <Paragraph position="11"> The 218,000-word corpus of journalese from which these constraints were extracted was analysed using the following modules:  ten in a day No human effort was spent on creating this training corpus. The training corpus is partly ambiguous, so the bi/trigram information acquired will be slightly noisy, but accurate enough to provide an almost supervised statistical model.</Paragraph>
      <Paragraph position="12"> For instance, the following constraints have been statistically extracted from bi/trigram occurrences in the training corpus.</Paragraph>
      <Paragraph position="14"> so there is always a bi/trigram which is applied more significantly than the others.</Paragraph>
      <Paragraph position="16"> The compatibility value is the mutual information, computed from the probabilities estimated from a training corpus. We do not need to assign the compatibility values here, since we can estimate them from the corpus.</Paragraph>
      <Paragraph position="17"> The compatibility values assigned to the hand-written constraints express the strength of these constraints compared to the statistical ones. Modifying those values means changing the relative weights of the linguistic and statistical parts of the model.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="84" end_page="85" type="metho">
    <SectionTitle>
6 Preparation of the benchmark corpus
</SectionTitle>
    <Paragraph position="0"> corpus For evaluating the systems, five roughly equal-sized benchmark corpora not used in the development of our parsers and taggers were prepared. The texts, totaling 6,500 words, were copied from the Gutenberg e-text archive, and they represent present-day American English. One text is from an article about AIDS; another concerns brainwashing techniques; the third describes guerilla warfare tactics; the fourth addresses the assassination of J. F. Kennedy; the last is an extract from a speech by Noam Chomsky. null The texts were first analysed by a recent version of the morphological analyser and rule-based disambiguator EngCG, then the syntactic ambiguities were added with a simple lookup module. The ambiguous text was then manually disambiguated. The disambiguated texts were also proofread afterwards. Usually, this practice resulted in one analysis per word. However, there were two types of exception: 1. The input did not contain the desired alternative (due to a morphological disambiguation error). In these cases, no reading was marked as correct. Two such words were found in the corpora; they detract from the performance figures. null 2. The input contained more than one analyses all of which seemed equally legitimate, even when semantic and textual criteria were consulted.</Paragraph>
    <Paragraph position="1"> In these cases, all the equal alternatives were marked as correct. The benchmark corpus contains 18 words (mainly ING-forms and nonfinite ED-forms) with two correct syntactic analyses.</Paragraph>
    <Paragraph position="2"> The number of multiple analyses could probably be made even smaller by specifying the grammatical representation (usage principles of the syn- null tactic tags) in more detail, in particular incorporating some analysis conventions for certain apparent borderline cases (for a discussion of specifying a parser's linguistic task, see (Voutilainen and J~rvinen, 1995)).</Paragraph>
    <Paragraph position="3"> To improve the objectivity of the evaluation, the benchmark corpus (as well as parser outputs) have</Paragraph>
  </Section>
class="xml-element"></Paper>