File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/w00-1327_intro.xml

Size: 4,587 bytes

Last Modified: 2025-10-06 14:01:05

<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1327">
  <Title>Using Semantically Motivated Estimates to Help Subcategorization Acquisition</Title>
  <Section position="3" start_page="0" end_page="216" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Manual development of large subcategorised lexicons has proved difficult because predicates change behaviour across sublanguages and domains, and over time. Yet parsers depend crucially on such information, and probabilistic parsers would greatly benefit from accurate information concerning the relative frequency of different subcategorization frames (SCFs) for a given predicate.</Paragraph>
    <Paragraph position="1"> Over the past years, acquiring subcategorization dictionaries from textual corpora has become increasingly popular (e.g. Brent, 1991, 1993; Ushioda et al., 1993; Briscoe and Carroll, 1997; Manning, 1993; Carroll and Rooth, 1998; Gahl, 1998; Lapata, 1999; Sarkar and Zeman, 2000). The different approaches vary according to the methods used and the number of SCFs being extracted. Regardless of these differences, the performance of these systems reaches a ceiling at around 80% token recall*.</Paragraph>
    <Paragraph position="2"> *Token recall is the percentage of SCF tokens in a sample of manually analysed text that were correctly acquired by the system.</Paragraph>
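The token recall definition above can be illustrated with a minimal sketch. The function name, the frame labels, and the toy data are hypothetical and serve only to make the definition concrete; each gold token is assumed to be a (verb, frame) occurrence from the manually analysed sample, and the acquired lexicon is assumed to be a set of (verb, frame) pairs.

```python
def token_recall(gold_tokens, acquired):
    """Percentage of gold SCF tokens whose (verb, frame) pair the system acquired."""
    if not gold_tokens:
        raise ValueError("gold sample is empty")
    # A gold token counts as a hit if the lexicon contains its (verb, frame) pair.
    hits = sum(1 for token in gold_tokens if token in acquired)
    return 100.0 * hits / len(gold_tokens)


# Hypothetical toy data, not taken from the paper.
gold = [
    ("give", "NP_NP"),
    ("give", "NP_PP"),
    ("give", "NP_NP"),
    ("sleep", "INTRANS"),
    ("sleep", "PP"),
]
acquired = {("give", "NP_NP"), ("sleep", "INTRANS")}

print(token_recall(gold, acquired))  # 3 of 5 gold tokens are covered: 60.0
```

Because the measure counts tokens rather than types, a frequent frame that the system misses lowers the score more than a rare one.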
    <Paragraph position="3"> One significant source of error lies in the statistical filtering methods frequently used to remove noise from automatically acquired SCFs. These methods are reported to be particularly unreliable for low frequency SCFs (Brent, 1991, 1993; Briscoe and Carroll, 1997; Manning, 1993; Manning and Schütze, 1999; Korhonen, Gorrell and McCarthy, 2000), resulting in a poor overall performance.</Paragraph>
    <Paragraph position="4"> According to Korhonen, Gorrell and McCarthy (2000), the poor performance of statistical filtering can be largely explained by the Zipfian nature of the data, coupled with the fact that many statistical tests are based on the assumption of two Zipfian distributions correlating: the conditional SCF distribution of an individual verb (p(scf_i|verb_j)) and the unconditional SCF distribution of all verbs in general (p(scf_i)). Contrary to this assumption, however, there is no significant correlation between the two distributions.</Paragraph>
    <Paragraph position="5"> Korhonen, Gorrell and McCarthy (2000) have shown that a simple method of filtering SCFs on the basis of their relative frequency performs more accurately than statistical filtering. Because this method is sensitive to the sparse data problem, it is best integrated with smoothing. Yet the performance of the sophisticated smoothing techniques which back off to an unconditional distribution also suffers from the lack of correlation between p(scf_i|verb_j) and p(scf_i). In this paper, we propose a method for obtaining more accurate back-off estimates for SCF acquisition. Taking Levin's verb classification (Levin, 1993) as a starting point, we show that in terms of SCF distributions, individual verbs correlate better with other semantically similar verbs than with all verbs in general. On the basis of this observation, we propose classifying verbs according to their semantic class and using the conditional SCF distributions of a few other members in the same class as back-off estimates of the class (p(scf_i|semantic class_j)).</Paragraph>
    <Paragraph position="6"> Adopting the SCF acquisition system of Briscoe and Carroll (1997), we report an experiment which demonstrates how these estimates can be used in filtering. This is done by acquiring the conditional SCF distributions for selected test verbs, smoothing these distributions with the unconditional distribution of the respective verb class, and applying a simple method for filtering the resulting set of SCFs. Our results show that the proposed method improves the acquisition of SCFs significantly. We discuss how this method can be used to benefit large-scale SCF acquisition.</Paragraph>
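The three steps just described (acquire a verb's conditional SCF distribution, smooth it with a class-level back-off estimate, filter by relative frequency) can be sketched as follows. This is an illustration under stated assumptions, not the system of Briscoe and Carroll (1997): averaging the member distributions, smoothing by linear interpolation, the weight `lam`, the threshold value, and all verbs, frame labels and numbers are hypothetical choices made for the example.

```python
def normalise(counts):
    """Turn raw SCF counts for one verb into a conditional distribution."""
    total = sum(counts.values())
    return {scf: n / total for scf, n in counts.items()}


def class_backoff(member_distributions):
    """Average the SCF distributions of a few class members (assumed scheme)."""
    frames = set().union(*member_distributions)
    k = len(member_distributions)
    return {f: sum(d.get(f, 0.0) for d in member_distributions) / k for f in frames}


def smooth(verb_dist, backoff, lam=0.8):
    """Linear interpolation with the class back-off estimate (assumed technique)."""
    frames = set(verb_dist) | set(backoff)
    return {
        f: lam * verb_dist.get(f, 0.0) + (1 - lam) * backoff.get(f, 0.0)
        for f in frames
    }


def filter_scfs(dist, threshold=0.05):
    """Keep frames whose smoothed relative frequency reaches the threshold."""
    return {f for f, p in dist.items() if p >= threshold}


# Hypothetical counts for a test verb and distributions for two class members.
fly = normalise({"INTRANS": 6, "PP": 3, "NP": 1})
move = {"INTRANS": 0.5, "NP": 0.3, "PP": 0.2}
slide = {"INTRANS": 0.5, "PP": 0.3, "NP_PP": 0.2}

backoff = class_backoff([move, slide])
smoothed = smooth(fly, backoff)
print(sorted(filter_scfs(smoothed)))  # ['INTRANS', 'NP', 'PP']
```

In this toy run the frame NP_PP, unseen for the test verb, receives a small probability from the class estimate but stays below the threshold and is filtered out; with a larger back-off weight or a lower threshold it would be retained.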
    <Paragraph position="7"> We begin by reporting our findings that the SCF distributions of semantically similar verbs correlate well (section 2). We then introduce the method we adopted for constructing the back-off estimates for the data used in our experiment (section 3.1), summarise the main features of the SCF acquisition approach (section 3.2), and describe the smoothing techniques adopted (section 3.3). Finally, we review the empirical evaluation (section 4) and discuss directions for future work (section 5).</Paragraph>
  </Section>
</Paper>