<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1007">
  <Title>Maximum Entropy Models for FrameNet Classification</Title>
  <Section position="3" start_page="0" end_page="5" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> To our knowledge, Gildea and Jurafsky (2002) is the only work to use FrameNet to build a statistically based semantic classifier. They split the problem into two distinct sub-tasks: frame element identification and frame element classification. In the identification phase, syntactic information is extracted from a parse tree to learn the boundaries of the frame elements in a sentence. In the classification phase, similar syntactic information is used to classify those elements into their semantic roles.</Paragraph>
    <Paragraph position="1"> In both phases Gildea and Jurafsky (2002) build a model of the conditional probabilities of the classification given a vector of syntactic features.</Paragraph>
    <Paragraph position="2"> The full conditional probability is decomposed into simpler conditional probabilities that are then interpolated to make the classification. Their best performance on held out test data is achieved using a linear interpolation model: where r is the class to be predicted, x is the vector of syntactic features, x i is a subset of those features, a i is the weight given to that subset conditional probability (as determined using the EM algorithm), and m is the total number of subsets used. Using this method, they report a test set accuracy of 78.5% on classifying semantic roles and precision/recall scores of .726/.631 on frame element identification.</Paragraph>
    <Paragraph position="3"> We extend Gildea and Jurafsky (2002)'s initial effort in three ways. First, we adopt a maximum entropy (ME) framework in order to learn a more accurate classification model. Second, we include features that look at previous tags and use previous tag information to find the highest probability semantic role sequence for a given sentence. Finally, we examine sentence-level patterns that exploit more global information in order to classify frame elements. We compare the results of our classifier to that of Gildea and Jurafsky (2002) on matched test sets of both human annotated and automatically identified frame elements.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
Semantic Role Classification
</SectionTitle>
      <Paragraph position="0"> Training (36,993 sentences / 75,548 frame elements), development (4,000 sentences / 8,167 frame elements), and held out test sets (3,865 sentences / 7,899 frame elements) were obtained in order to exactly match those used in Gildea and</Paragraph>
      <Paragraph position="2"> . In the experiments presented below, features are extracted for each frame element in a sentence and used to classify that element into one of 120 semantic role categories. The boundaries of each frame element are given based on the human annotations in FrameNet. In Section 4, experiments are performed using automatically identified frame elements.</Paragraph>
      <Paragraph position="3">  Data sets (including parse trees) were obtained from Dan Gildea via personal communication.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
Features
</SectionTitle>
      <Paragraph position="0"> For each frame element, features are extracted from the surface text of the sentence and from an automatically generated syntactic parse tree (Collins, 1997). The features used are described below:</Paragraph>
      <Paragraph position="2"> * Target predicate (tar): Although there may be many predicates in a sentence with associated frame elements, classification operates on only one target predicate at a time. The target predicate is the only feature that is not extracted from the sentence itself and must be given by the user. Note that the frame which the target predicate instantiates is not given, leaving any word sense ambiguities to be handled implicitly by the classifier.</Paragraph>
      <Paragraph position="3">  * Phrase type (pt): The syntactic phrase type of the frame element (e.g. NP, PP) is extracted from the parse tree of the sentence by finding the constituent in the tree whose boundaries match the human annotated boundaries of the element. In cases where there exists no constituent that perfectly matches the element, the constituent is chosen which matches the largest text span of the element and has the same left-most boundary.</Paragraph>
      <Paragraph position="4"> * Syntactic head (head): The syntactic heads of the frame elements are extracted from the frame element's matching constituent (as described above) using a heuristic method described by Michael Collins.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
This method
</SectionTitle>
      <Paragraph position="0"> extracts the syntactic heads of constituents; thus, for example, the second frame element in Figure 1 has head &amp;quot;hands,&amp;quot; while the third frame element has head &amp;quot;in.&amp;quot; * Logical Function (lf): A simplification of the grammatical function annotation (see section 1) is extracted from the parse tree. Unlike the  Because of the interaction of head word features with the target predicate, we suspect that ambiguous lexical items do not account for much error. This question, however, will be addressed explicitly in future work.</Paragraph>
      <Paragraph position="1">  function from that set, and total number of feature functions in the set. Examples taken from frame element &amp;quot;in inspiration,&amp;quot; shown in Figure 1.</Paragraph>
      <Paragraph position="2">  full grammatical function, the lf can have only one of three values: external argument, object argument, other. A node is considered an external argument if it is an ancestor of an S node, an object argument if it is an ancestor of a VP node, and other for all other cases. This feature is only applied to frame elements whose phrase type is NP.</Paragraph>
      <Paragraph position="3"> * Position (pos): The position of the frame element relative to the target (before, after) is extracted based on the surface text of the sentence.</Paragraph>
      <Paragraph position="4"> * Voice (voice): The voice of the sentence (active, passive) is determined using a simple regular expression passed over the surface text of the sentence.</Paragraph>
      <Paragraph position="5"> * Order (order): The position of the frame element relative to the other frame elements in the sentence. For example, in the sentence from Figure 1, the element &amp;quot;She&amp;quot; has order=0, while &amp;quot;in inspiration&amp;quot; has order=2.</Paragraph>
      <Paragraph position="6"> * Syntactic pattern (pat): The sentence level syntactic pattern of the sentence is generated by looking at the phrase types and logical functions of each frame element in the sentence. For example, in the sentence: &amp;quot;Alexandra bent her head;&amp;quot; &amp;quot;Alexandra&amp;quot; is an external argument Noun Phrase, &amp;quot;bent&amp;quot; is a target predicate, and &amp;quot;her head&amp;quot; is an object argument Noun Phrase. Thus, the syntactic pattern associated with the sentence is [NP-ext, target, NP-obj].</Paragraph>
      <Paragraph position="7"> These syntactic patterns can be highly informative for classification. For example, in the training data, a syntactic pattern of [NPext, target, NP-obj] given the predicate bend was associated 100% of the time with the Frame Element pattern: &amp;quot;AGENT TARGET BODYPART.&amp;quot; * Previous role (r_n): Frame elements do not occur in isolation, but rather, depend very much on the other elements in a sentence.</Paragraph>
      <Paragraph position="8"> This dependency can be exploited in classification by using the semantic roles of previously classified frame elements as features in the classification of a current element. This strategy takes advantage of the fact that, for example, if a frame element is tagged as an AGENT it is highly unlikely that the next element will also be an AGENT.</Paragraph>
      <Paragraph position="9"> The previous role feature indicates the classification that the n-previous frame element received. During training, this information is provided by simply looking at the true classes of the frame element occurring n positions before the target element. During testing, hypothesized classes of the n elements are used and Viterbi search is performed to find the most probable tag sequence for a sentence.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.2 Maximum Entropy
</SectionTitle>
      <Paragraph position="0"> ME models implement the intuition that the best model will be the one that is consistent with the set of constrains imposed by the evidence, but otherwise is as uniform as possible (Berger et al., 1996). We model the probability of a semantic role r given a vector of features x according to the ME formulation below:</Paragraph>
    </Section>
    <Section position="5" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
Experiments
</SectionTitle>
      <Paragraph position="0"> We present three experiments in which different feature sets are used to train the ME classifier. The first experiment uses only those feature combinations described in Gildea and Jurafsky (2002) (feature sets 0-7 from Table 1). The second experiment uses a super set of the first and incorporates the syntactic pattern features described above (feature sets 0-9). The final experiment uses the previous tags and implements Viterbi search to find the best tag sequence (feature sets 0-11).</Paragraph>
      <Paragraph position="2"> (r,x) is a feature function which maps each role and vector element (or combination of elements) to a binary value, n is the total number of feature functions, and l</Paragraph>
      <Paragraph position="4"> is the weight for a given feature function.</Paragraph>
      <Paragraph position="5"> The final classification is just the role with highest probability given its feature vector and the model.</Paragraph>
      <Paragraph position="6"> We further investigate the effect of varying two aspects of classifier training: the standard deviation of the Gaussian priors used for smoothing, and the number of sentences used for training. To examine the effect of optimizing the standard deviation, a range of values was chosen and a classifier was trained using each value until performance on a development set ceased to improve.</Paragraph>
      <Paragraph position="7"> The feature functions that we employ can be divided into feature sets based upon the types and combinations of features on which they operate.</Paragraph>
      <Paragraph position="8"> Table 1 lists the feature sets that we use, as well as the number of individual feature functions they contain. The feature combinations were chosen based both on previous work and trial and error. In future work we will examine more principled feature selection techniques.</Paragraph>
      <Paragraph position="9"> To examine the effect of training set size on performance, five data sets were generated from the original set with 36, 367, 3674, 7349, and 24496 sentences, respectively. These data sets were created by going through the original set and selecting every thousandth, hundredth, tenth, fifth, and every second and third sentence, respectively.</Paragraph>
      <Paragraph position="10"> It is important to note that the feature functions described here are not equivalent to the subset conditional distributions that are used in the Gildea and Jurafsky model. ME models are log-linear models in which feature functions map specific instances of syntactic features and classes to binary values (e.g., if a training element has head=&amp;quot;in&amp;quot; and role=CAUSE, then, for that element, the feature function f(CAUSE, &amp;quot;in&amp;quot;) will equal 1). Thus, ME is not here being used as another way to find weights for an interpolated model. Rather, the ME approach provides an overarching framework in which the full distribution of semantic roles given syntactic features can be modeled.</Paragraph>
      <Paragraph position="11"> We train the ME models using the GIS algorithm (Darroch and Ratcliff, 1972) as implemented in the YASMET ME package (Och, 2002). We use the YASMET MEtagger (Bender et al., 2003) to perform the Viterbi search. The classifier was trained until performance on the development set ceased to improve. Feature weights were smoothed using Gaussian priors with mean 0 (Chen and Rosenfeld, 1999). The standard deviation of this distribution was optimized on the development set for each experiment.</Paragraph>
      <Paragraph position="12">  hand annotated frame element boundaries. G&amp;J refers to the results of Gildea and Jurafsky (2002). Exp 1 incorporates feature sets 0-7 from Table 1; Exp 2 feature sets 0-9; Exp 3 features 0-11.</Paragraph>
    </Section>
    <Section position="6" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
Results
</SectionTitle>
      <Paragraph position="0"> Figure 2 shows the results of our experiments alongside those of (Gildea and Jurafsky, 2002) on identical held out test sets. The difference in performance between each classifier is statistically significant at (p&lt;0.01) (Mitchell, 1997), with the exception of Exp 2 and Exp 3, whose difference is  Table 2 shows the effect of varying the standard deviation of the Gaussian priors used for smoothing in Experiment 1. The difference in performance between the classifiers trained using standard deviation 1 and 2 is statistically signifi- null function of training set size. Classifiers were trained using the full set of features described for</Paragraph>
    </Section>
    <Section position="7" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
Experiment 3.
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the confusion matrix for a subset of semantic roles. Five roles were chosen for presentation based upon their high contribution to classifier error. Confusion between these five account for 27% of all errors made amongst the 120 possible roles. The tenth role, other, represents the sum of the remaining 115 roles. Table 4 presents example errors for five of the most confused roles.</Paragraph>
    </Section>
    <Section position="8" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
3.5 Discussion
</SectionTitle>
      <Paragraph position="0"> It is clear that the ME models improve performance on frame element classification. There are a number of reasons for this improvement.</Paragraph>
      <Paragraph position="1"> First, for this task the log-linear model employed in the ME framework is better than the linear interpolation model used by Gildea and Jurafsky.</Paragraph>
      <Paragraph position="2"> One possible reason for this is that semantic role classification benefits from the ME model's bias for more uniform probability distributions that satisfy the constraints placed on the model by the training data.</Paragraph>
      <Paragraph position="3"> Another reason for improved performance comes from ME's simpler design. Instead of having to worry about finding proper backoff strategies amongst distributions of features subsets, ME allows one to include many features in a single model and automatically adjusts the weights of these features appropriately.</Paragraph>
      <Paragraph position="4">  ute most to overall system error. Columns refer to actual role. Rows refer to the model's hypothesis. Other refers to combination of all other roles.</Paragraph>
      <Paragraph position="5">  Also, because the ME models find weights for many thousands of features, they have many more degrees of freedom than the linear interpolated models of Gildea and Jurafsky. Although many degrees of freedom can lead to overfitting of the training data, the smoothing procedure employed in our experiments helps to counteract this problem. As evidenced in Table 2, by optimizing the standard deviation used in smoothing the ME models are able to show significant increases in performance on held out test data.</Paragraph>
      <Paragraph position="6"> Finally, by including in our model sentence-level pattern features and information about previous classes, global information can be exploited for improved classification. The accuracy gained by including such global information confirms the intuition that the semantic role of an element is much related to the entire sentence of which it is a part.</Paragraph>
      <Paragraph position="7"> Having discussed the advantages of the models presented here, it is interesting to look at the errors that the system makes. It is clear from the confusion matrix in Table 3 that a great deal of the system error comes from relatively few semantic roles.</Paragraph>
      <Paragraph position="8">  Table 4 offers some insight into why these errors occur. For example, the confusions exemplified in 1 and 2 are both due to the fact that the particular phrases employed can be used in multiple roles (including the roles hypothesized by the system). Thus, while &amp;quot;across the counter&amp;quot; may be considered a goal when one is talking about a per-son and their head, the same phrase would be considered a path if one were talking about a mouse who is running.</Paragraph>
      <Paragraph position="9">  across the counter.</Paragraph>
      <Paragraph position="10"> 2 Area Path Mr. Glass began hallucinating, throwing books around the classroom.</Paragraph>
      <Paragraph position="11"> 3 Message Speaker Debate lasted until 20 September, opposition being voiced by a number of Italian and Spanish prelates.</Paragraph>
      <Paragraph position="12"> 4 Addressee Speaker Furious staff claim they were even called in from holiday to be grilled by a specialist security firm 5 Reason Evaluee We cannot but admire the  efficiency with which she took control of her own life.</Paragraph>
      <Paragraph position="13"> Examples 3 and 4, while showing phrases with similar confusions, stand out as being errors caused by an inability to deal with passive sentences. Such errors are not unexpected; for, even though the voice of the sentence is an explicit feature, the system suffers from the paucity of passive sentences in the data (approximately 5%).</Paragraph>
      <Paragraph position="14"> Finally, example 5 shows an error that is based on the difficult nature of the decision itself (i.e., it is unclear whether &amp;quot;the efficiency&amp;quot; is the reason for admiration, or what is being admired). Often times, phrases are assigned semantic roles that are not obvious even to human evaluators. In such cases it is difficult to determine what information might be useful for the system.</Paragraph>
      <Paragraph position="15"> Having looked at the types of errors that are common for the system, it becomes interesting to examine what strategy may be best to overcome such errors. Aside from new features, one solution is obvious: more data. The curve in Figure 2 shows that there is still a great deal of performance to be gained by training the current ME models on more data. The slope of the curve indicates that we are far from a plateau, and that even constant increases in the amount of available training data may push classifier performance above 90% accuracy. null Having demonstrated the effectiveness of the ME approach on frame element classification given hand annotated frame element boundaries, we next examine the value of the approach given automatically identified boundaries.</Paragraph>
    </Section>
    <Section position="9" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.1 Frame Element Identification
</SectionTitle>
      <Paragraph position="0"> Gildea and Jurafsky equate the task of locating frame element boundaries to one of identifying frame elements amongst the parse tree constituents of a given sentence. Because not all frame element boundaries exactly match constituent boundaries, this approach can perform no better than 86.9% (i.e. the number of elements that match constituents (6864) divided by the total number of elements (7899)) on the test set.</Paragraph>
    </Section>
    <Section position="10" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
Features
</SectionTitle>
      <Paragraph position="0"> Frame element identification is a binary classification problem in which each constituent in a parse tree is described by a feature vector and, based on that vector, tagged as either a frame element or not.</Paragraph>
      <Paragraph position="1"> In generating feature vectors we use a subset of the features described for role tagging as well as an  spiration&amp;quot; to the target predicate &amp;quot;clapped&amp;quot; is represented as the string PP|VP|VBD.</Paragraph>
      <Paragraph position="2"> Gildea and Jurafsky introduce the path feature in order to capture the structural relationship between a constituent and the target predicate. The  path of a constituent is represented by the nodes through which one passes while traveling up the tree from the constituent and then down through the governing category to the target. Figure 4 shows an example of this feature for a frame element from the sentence presented in Figure 1.</Paragraph>
    </Section>
    <Section position="11" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
Experiments
</SectionTitle>
      <Paragraph position="0"> We use the ME formulation described in Section 3.2 to build a binary classifier. The classifier features follow closely those used in Gildea and Jurafsky. We model the data using the feature sets: f(fe, path), f(fe, path, tar), and f(fe, head, tar), where fe represents the binary classification of the constituent. While this experiment only uses three feature sets, the heterogeneity of the path feature is so great that the classifier itself uses 1,119,331 unique binary features.</Paragraph>
      <Paragraph position="1"> With the constituents having been labeled, we apply the ME frame element classifier described above. Results are presented using the classifier of Experiment 1, described in section 3.3. We then investigate the effect of varying the number of constituents used for training on identification performance. Five data sets of approximately 100,000 10,000, 1,000, and 100 constituents were generated from the original set by random selection and used to train ME models as described above.</Paragraph>
    </Section>
    <Section position="12" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
Results
</SectionTitle>
      <Paragraph position="0"> Table 5 compares the results of Gildea and Jurafsky (2002) and the ME frame element identifier on both the task of frame element identification alone, and the combined task of frame element identification and classification. In order to be counted correct on the combined task, the constituent must have been correctly identified as a frame element, and then must have been correctly classified into one of the 120 semantic categories.</Paragraph>
      <Paragraph position="1"> Recall is calculated based on the total number of frame elements in the test set, not on the total number of elements that have matching parse constituents. Thus, the upper limit is 86.9%, not 100%. Precision is calculated as the number of correct positive classifications divided by the number of total positive classifications.</Paragraph>
      <Paragraph position="2"> The difference in the F-scores on the identification task alone and on the combined task are statistically significant at the (p&lt;0.01) level</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>