<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1067">
  <Title>Automatic Acquisition of Subcategorization Frames from Tagged Text</Title>
  <Section position="3" start_page="0" end_page="342" type="metho">
    <SectionTitle>
RESULTS
</SectionTitle>
    <Paragraph position="0"> So far, we have concentrated on the five subcategorization frames shown in Table 1. Table 2 shows the results  hit them tell him he's a fool want him to attend  obtained on a 2.6 million-word Wall Street Journal corpus</Paragraph>
  </Section>
  <Section position="4" start_page="342" end_page="343" type="metho">
    <SectionTitle>
METHODOLOGY
</SectionTitle>
    <Paragraph position="0"> Our program uses a finite-state grammar for recognizing the auxiliary, and determining subcategorization frames. The English auxiliary system is known to be finite state and our treatment of it is standard, so the first subsection discusses the determination of subcategorization frames. The second subsection describes a planned statistical approach to the one to three percent error rates described above.</Paragraph>
    <Section position="1" start_page="342" end_page="342" type="sub_section">
      <SectionTitle>
Complement Grammar
</SectionTitle>
      <Paragraph position="0"> The obvious approach to finding an SF like &amp;quot;V NP to V&amp;quot; is to look for occurrences of just that pattern in the training corpus, but the obvious approach fads to address the bootstrapping problem, as shown by (1) above. Our solution is based on the following insights:  each SF using the tagged mode. Error rates for verb detection are estimated separately below.</Paragraph>
      <Paragraph position="1"> Rather than take the obvious approach of looking for &amp;quot;V NP to V&amp;quot;, we look for clear cases like &amp;quot;V PRONOUN to V'. The advantages can be seen by contrasting (2) with (1)  (page 1).</Paragraph>
      <Paragraph position="2"> (1) a. oK I expected him to eat ice-cream b. * I doubted him to eat ice-cream  More generally, our system recognizes linguistic structure using a small finite-state grammar that describes only that fragment of English that is most useful for recognizing SFs. The grammar relies exclusively on closed-class lexlcal items such as pronouns, prepositions, determiners, and auxiliary verbs.</Paragraph>
      <Paragraph position="3"> The complement grammar needs to distinguish three types of complements: direct objects, infinitives, and clauses. Figure 1 shows a substantial part of the grammar responsible for detecting these complements. Any verb followed im- null &lt;DO&gt; &lt;infinitive&gt; is assigned the corresponding SF. mediately by matches for &lt;DO&gt;, &lt;clause&gt;, &lt;infinitive&gt;, &lt;DO&gt;&lt;clause&gt;, or &lt;DO&gt;&lt;inf&gt; is assigned the corresponding SF. Adverbs are ignored for purposes of adjacency. The notation &amp;quot;?&amp;quot; follows optional expressions, and D0 is specified in context-sensitive notation for convenience.</Paragraph>
    </Section>
    <Section position="2" start_page="342" end_page="343" type="sub_section">
      <SectionTitle>
Robust Classification
</SectionTitle>
      <Paragraph position="0"> Our system, like any other, occasionally makes mistakes. Error rates of one to three percent are a substantial accomplishment, but if a word occurs enough times in a corpus it is bound to show up eventually in some construetion that fools the system. For that reason any learning  system that gets only positive examples and makes a permanent judgment on a single example will always degrade as the number of occurrences increases. In fact, making a judgment based on any fixed number of examples with any finite error rate will always lead to degradation with corpus-size. A better approach is to require a fixed percentage of the total occurrences of any given verb to appear with a given SF before concluding that random error is not responsible for these observations. Unfortunately, the cutoff percentage is arbitrary and sampling error makes classification unstable for verbs with few occurrences in the input. The sampling error can be dealt with (\[1\]) but the arbitrary cutoffpercentage can't, z Rather than using fixed cutoffs, we are developing an approach that will automatically generate statistical models of the sources of noise using standard regression techniques. For example, purposive adjuncts like &amp;quot;Jon quit to pursue a career in finance&amp;quot; are quite rare, accounting for only two percent of the apparent infinitival complements. Furthermore, they are distributed across a much larger set of matrix verbs than the true infinitival complements, so any given verb occurs very rarely indeed with purposive adjuncts. In a histogram sorting verbs by their apparent frequency of occurrence with infinitival complements, those that in fact have appeared with purposive adjuncts and not true infinitival complements will be clustered at the low frequencies. The distributions of such clusters can be modeled automatically and the models used for identifying false positives.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML