<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2100">
  <Title>Automatic Extraction of Subcategorization Frames for Czech*</Title>
  <Section position="4" start_page="0" end_page="694" type="metho">
    <SectionTitle>
2 Task Description
</SectionTitle>
    <Paragraph position="0"> In this section we describe precisely the proposed task. We also describe the input training material and the output produced by our algorithms.</Paragraph>
    <Section position="1" start_page="0" end_page="691" type="sub_section">
      <SectionTitle>
2.1 Identifying subcategorization frames
</SectionTitle>
      <Paragraph position="0"> In general, the problem of identifying subcategorization frames is to distinguish between arguments and adjuncts among the constituents modifying a verb. One of the anonymous reviewers pointed out that (Basili and Vindigni, 1998) presents a corpus-driven acquisition of subcategorization frames for Italian.</Paragraph>
      <Paragraph position="2"> given within braces {}. In this example, the frames N4 R2(od), N4 R6(v) and N4 R6(po) have been observed with other verbs in the corpus. Note that the counts in this figure do not correspond to the real counts for the verb absolvovat in the training corpus.</Paragraph>
      <Paragraph position="3"> where c(.) are counts in the training data. Using the values computed above:</Paragraph>
      <Paragraph position="5"> Taking these probabilities to be binomially distributed, the log likelihood statistic (Dunning, 1993) is given by: -2 log λ = 2[log L(p1, k1, n1) + log L(p2, k2, n2) - log L(p, k1, n1) - log L(p, k2, n2)] where log L(p, n, k) = k log p + (n - k) log(1 - p). According to this statistic, the greater the value of -2 log λ for a particular pair of observed frame and verb, the more likely that frame is to be a valid SF of the verb.</Paragraph>
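      <Paragraph position="6"> As an illustration, the statistic above can be sketched in a few lines of Python. This is our own minimal sketch, not code from the paper; the function names log_l and llr are assumptions, and the counts k1, n1, k2, n2 are those defined in Section 3.1:

```python
import math

def log_l(p, n, k):
    # log L(p, n, k) = k log p + (n - k) log(1 - p)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(k1, n1, k2, n2):
    # -2 log lambda: compares the hypothesis of two separate
    # binomial rates p1, p2 against a single pooled rate p
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2 * (log_l(p1, n1, k1) + log_l(p2, n2, k2)
                - log_l(p, n1, k1) - log_l(p, n2, k2))
```

A larger value of llr(...) indicates a stronger verb-frame association; when the two observed rates coincide the statistic is zero. (The sketch assumes the observed rates are strictly between 0 and 1, to avoid taking log of zero.)</Paragraph>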
    </Section>
    <Section position="2" start_page="691" end_page="691" type="sub_section">
      <SectionTitle>
3.2 T-scores
</SectionTitle>
      <Paragraph position="0"> Another statistic that has been used for hypothesis testing is the t-score. Using the definitions from Section 3.1 we can compute t-scores using the equation below and use their value to measure the association between a verb and a frame observed with it.</Paragraph>
      <Paragraph position="2"> In particular, the hypothesis being tested using the t-score is whether the distributions p1 and p2 are not independent. If the value of T is greater than some threshold then the verb v should take the frame f as a SF.</Paragraph>
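      <Paragraph position="3"> For illustration, here is a minimal Python sketch (our own, not from the paper) of one common form of the two-sample t statistic for binomial proportions, using the counts k1, n1, k2, n2 from Section 3.1; the paper's exact equation may differ in detail:

```python
import math

def t_score(k1, n1, k2, n2):
    # two-sample t statistic comparing the proportions p1 = k1/n1
    # and p2 = k2/n2 under a binomial variance assumption
    p1, p2 = k1 / n1, k2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se
```

If t_score(...) exceeds the chosen threshold, the frame f is accepted as a SF of the verb v.</Paragraph>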
    </Section>
    <Section position="3" start_page="691" end_page="694" type="sub_section">
      <SectionTitle>
3.3 Binomial Models of Miscue Probabilities
</SectionTitle>
      <Paragraph position="0"> Once again assuming that the data is binomially distributed, we can look for frames that co-occur with a verb by exploiting the miscue probability: the probability of a frame co-occurring with a verb when it is not a valid SF. This is the method used by several earlier papers on SF extraction starting with (Brent, 1991; Brent, 1993; Brent, 1994).</Paragraph>
      <Paragraph position="1"> Let us consider the probability p_E, which is the probability that a given verb is observed with a frame that is not a valid SF for this verb; p_E is the error probability on identifying a SF for a verb. Let us consider a verb v which does not have the frame f as one of its valid SFs. How likely is it that v will be seen m or more times in the training data with frame f? If v has been seen a total of n times in the data, then H*(p_E; m, n) gives us this likelihood.</Paragraph>
      <Paragraph position="3"> threshold value then it is extremely unlikely that the hypothesis is true, and hence the frame f must be a SF of the verb v. Setting the threshold value to 0.05 gives us a 95% or better confidence value that the verb v has been observed often enough with a frame f for it to be a valid SF. Initially, we consider only the observed frames (OFs) from the treebank. There is a chance that some are subsets of some others, but for now we count only the cases when the OFs were seen themselves.</Paragraph>
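      <Paragraph position="3a"> The tail probability H*(p_E; m, n) is simply a binomial sum, which can be sketched in Python as follows (the function name binomial_tail is our own; the paper does not give code):

```python
import math

def binomial_tail(p, m, n):
    # H*(p; m, n): probability of m or more successes in n trials,
    # i.e. of seeing the frame at least m times with the verb
    # purely by miscue, when each trial succeeds with probability p
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(m, n + 1))
```

A frame is accepted when this tail probability falls below the 0.05 threshold, i.e. when chance co-occurrence is very unlikely.</Paragraph>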
      <Paragraph position="4"> Let us assume the test statistic rejected the frame.</Paragraph>
      <Paragraph position="5"> Then it is not a real SF, but there probably is a subset of it that is a real SF. So we select exactly one of the subsets whose length is one member less: this is the successor of the rejected frame, and it inherits the rejected frame's frequency. Of course, one frame may be the successor of several longer frames, and it can also have its own count as an OF. This is how frequencies accumulate and frames become more likely to survive. The example shown in Figure 2 illustrates how the subsets and successors are selected.</Paragraph>
      <Paragraph position="6"> An important point is the selection of the successor. We have to select only one of the n possible successors of a frame of length n; otherwise we would break the total frequency of the verb. Suppose there are m rejected frames of length n. This yields m × n possible modifications to consider before selecting the successor. We implemented two methods for choosing a single successor frame: 1. Choose the one that results in the strongest preference for some frame (that is, the successor frame results in the lowest entropy across the corpus). This measure is sensitive to the frequency of this frame in the rest of the corpus.</Paragraph>
      <Paragraph position="7"> 2. Random selection of the successor frame from the alternatives.</Paragraph>
      <Paragraph position="8"> Random selection resulted in better precision (88% instead of 86%). It is not clear why a method that is sensitive to the frequency of each proposed successor frame does not perform better than random selection.</Paragraph>
      <Paragraph position="9"> The technique described here may sometimes result in a subset of a correct SF, discarding one or more of its members. Such a frame can still help parsers because they can at least look for the dependents that have survived.</Paragraph>
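      <Paragraph position="10"> The subset/successor bookkeeping described above can be sketched as follows. This is our own minimal Python illustration, not the paper's implementation; the frame labels follow the N4/R2(od)/R6(v) notation used earlier, and the random method is the one that performed better:

```python
import random
from collections import Counter

def successors(frame):
    # all subsets of a frame that are exactly one member shorter
    return [frame[:i] + frame[i + 1:] for i in range(len(frame))]

def reject_frame(counts, frame, rng=random.Random(0)):
    # fold a rejected frame's count into exactly one successor,
    # chosen at random, so the verb's total frequency is preserved
    freq = counts.pop(frame)
    succ = rng.choice(successors(frame))
    counts[succ] += freq
    return succ

counts = Counter({("N4", "R2(od)", "R6(v)"): 3, ("N4", "R2(od)"): 2})
reject_frame(counts, ("N4", "R2(od)", "R6(v)"))
```

After the rejection, the three occurrences of the long frame have been inherited by a single shorter successor, and the verb's total of five occurrences is unchanged.</Paragraph>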
    </Section>
  </Section>
  <Section position="5" start_page="694" end_page="694" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
      <Paragraph position="0"> For the evaluation of the methods described above we used the Prague Dependency Treebank (PDT).</Paragraph>
      <Paragraph position="1"> We used 19,126 sentences of training data from the PDT (about 300K words). In this training set, there were 33,641 verb tokens with 2,993 verb types.</Paragraph>
      <Paragraph position="2"> There were a total of 28,765 observed frames (see Section 2.1 for an explanation of these terms). There were 914 verb types seen 5 or more times.</Paragraph>
      <Paragraph position="3"> Since there is no electronic valence dictionary for Czech, we evaluated our filtering technique on a set of 500 test sentences which were unseen and separate from the training data. These test sentences were used as a gold standard by distinguishing the arguments and adjuncts manually. We then compared the accuracy of our output set of items marked as either arguments or adjuncts against this gold standard.</Paragraph>
      <Paragraph position="4"> First we describe the baseline methods. Baseline method 1: consider each dependent of a verb an adjunct. Baseline method 2: use just the longest known observed frame matching the test pattern. If no matching OF is known, find the longest partial match in the OFs seen in the training data. We exploit the functional and morphological tags while matching. No statistical filtering is applied in either baseline method.</Paragraph>
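      <Paragraph position="4a"> A minimal Python sketch of baseline method 2 (our own illustration; it treats frames and test patterns simply as sets of tags, which abstracts away from the actual functional/morphological matching):

```python
def baseline2(dependents, observed_frames):
    # mark a dependent as an argument if it belongs to the longest
    # observed frame (OF) contained in the test pattern; otherwise
    # fall back to the longest partial (intersection) match
    matches = [f for f in observed_frames if f <= dependents]
    if not matches:
        matches = [f & dependents for f in observed_frames]
    best = max(matches, key=len, default=frozenset())
    return {d: d in best for d in dependents}
```

For example, with OFs {N4} and {N4, R2(od)} and a test pattern {N4, R2(od), R6(v)}, the two matched slots are marked as arguments and R6(v) is left as an adjunct.</Paragraph>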
    <Paragraph position="5"> A comparison between all three methods that were proposed in this paper is shown in Table 1.</Paragraph>
      <Paragraph position="6"> The experiments showed that the method improved the precision of this distinction from 57% to 88%. We were able to classify as many as 914 verbs, which is a number outperformed only by Manning, with 10x more data (note that our results are for a different language).</Paragraph>
      <Paragraph position="7"> Also, our method discovered 137 subcategorization frames from the data. The known upper bound on the number of frames that the algorithm could have found (the total number of observed frame types) was 450.</Paragraph>
  </Section>
  <Section position="6" start_page="694" end_page="695" type="metho">
    <SectionTitle>
5 Comparison with related work
</SectionTitle>
    <Paragraph position="0"> Preliminary work on SF extraction from corpora was done by (Brent, 1991; Brent, 1993; Brent, 1994) and (Webster and Marcus, 1989; Ushioda et al., 1993). Brent (Brent, 1993; Brent, 1994) uses the standard method of testing miscue probabilities for filtering frames observed with a verb. (Brent, 1994) presents a method for estimating the miscue probability. Brent applied his method to a small number of verbs and associated SF types. (Manning, 1993) applies Brent's method to parsed data and obtains a subcategorization dictionary for a larger set of verbs. (Briscoe and Carroll, 1997; Carroll and Minnen, 1998) differ from earlier work in that a substantially larger set of SF types is considered; (Carroll and Rooth, 1998) use an EM algorithm to learn subcategorization as a result of learning rule probabilities and, in turn, to improve parsing accuracy by applying the verb SFs obtained. (Basili and Vindigni, 1998) use a conceptual clustering algorithm for acquiring subcategorization frames for Italian. They establish a partial order on partially overlapping OFs (similar to our OF subsets) which is then used to suggest a potential SF. A complete comparison of all the previous approaches with the current work is given in Table 2. While these approaches differ in the size and quality of training data, the number of SF types (e.g. intransitive verbs, transitive verbs) and the number of verbs processed, there are properties that all have in common. They all assume that they know the set of possible SF types in advance. Their task can be viewed as assigning one or more of the (known) SF types to a given verb. In addition, except for (Briscoe and Carroll, 1997; Carroll and Minnen, 1998), only a small number of SF types is considered.</Paragraph>
    <Paragraph position="1">  the values are not integers since for some difficult cases in the test data, the value for each argument/adjunct decision was set to a value between [0, 1]. Recall is computed as the number of known verb complements divided by the total number of complements. Precision is computed as the number of correct suggestions divided by the number of known verb complements. Fβ=1 = (2 × p × r)/(p + r). % unknown represents the percent of test data not considered by a particular method. Using a dependency treebank as input to our learning algorithm has both advantages and drawbacks. There are two main advantages of using a treebank: * Access to more accurate data. Data is less noisy when compared with tagged or parsed input data. We can expect correct identification of verbs and their dependents.</Paragraph>
    <Paragraph position="2"> * We can explore techniques (as we have done in this paper) that try to learn the set of SFs from the data itself, unlike other approaches where the set of SFs has to be set in advance.</Paragraph>
    <Paragraph position="3"> Also, by using a treebank we can use verbs in different contexts which are problematic for previous approaches; e.g. we can use verbs that appear in relative clauses. However, there are two main drawbacks: * Treebanks are expensive to build, and so the techniques presented here have to work with less data.</Paragraph>
    <Paragraph position="4"> * All the dependents of each verb are visible to the learning algorithm. This is in contrast with previous techniques that rely on finite-state extraction rules which ignore many dependents of the verb. Thus our technique has to deal with a different kind of data compared to previous approaches.</Paragraph>
    <Paragraph position="5"> We tackle the second problem by using the method of observed frame subsets described in Section 3.3.</Paragraph>
  </Section>
</Paper>