<?xml version="1.0" standalone="yes"?> <Paper uid="J05-1004"> <Title>The Proposition Bank: An Annotated Corpus of Semantic Roles</Title>
<Section position="6" start_page="87" end_page="101" type="evalu"> <SectionTitle> 5. FrameNet and PropBank </SectionTitle>
<Paragraph position="0"> The PropBank project and the FrameNet project at the International Computer Science Institute (Baker, Fillmore, and Lowe 1998) share the goal of documenting the syntactic realization of arguments of the predicates of the general English lexicon by annotating a corpus with semantic roles. Despite the two projects' similarities, their methodologies are quite different. FrameNet is focused on semantic frames, which are defined as schematic representations of situations involving various participants, props, and other conceptual roles (Fillmore 1976). The project methodology has proceeded on a frame-by-frame basis, that is, by first choosing a semantic frame (e.g., Commerce), defining the frame and its participants or frame elements (BUYER, GOODS, SELLER, MONEY), listing the various lexical predicates which invoke the frame (buy, sell, etc.), and then finding example sentences of each predicate in a corpus (the British National Corpus was used) and annotating each frame element in each sentence. The example sentences were chosen primarily to ensure coverage of all the syntactic realizations of the frame elements, and simple examples of these realizations were preferred over those involving complex syntactic structure not immediately relevant to the lexical predicate itself. Only sentences in which the lexical predicate was used ''in frame'' were annotated. A word with multiple distinct senses would generally be analyzed as belonging to a different frame in each sense but may be found in the FrameNet corpus only in the sense for which a frame has been defined. It is interesting to note that the semantic frames are a helpful way of generalizing between predicates; words in the same frame have frequently been found to share the same syntactic argument structure (Gildea and Jurafsky 2002). A more complete description of the FrameNet project can be found in Baker, Fillmore, and Lowe (1998) and Johnson et al. (2002), and the ramifications for automatic classification are discussed more thoroughly in Gildea and Jurafsky (2002).</Paragraph>
<Paragraph position="1"> In contrast with FrameNet, PropBank is aimed at providing data for training statistical systems and has to provide an annotation for every clause in the Penn Treebank, no matter how complex or unexpected. Like FrameNet, PropBank attempts to label semantically related verbs consistently, relying primarily on VerbNet classes for determining semantic relatedness. However, there is much less emphasis on defining the semantics of the class that the verbs are associated with, although for the relevant verbs additional semantic information is provided through the mapping to VerbNet. The PropBank semantic roles for a given VerbNet class may not correspond to the semantic elements highlighted by a particular FrameNet frame, as shown by the examples of Table 5.
In this case, FrameNet's COMMERCE frame includes roles for Buyer (the receiver of the goods) and Seller (the receiver of the money) and assigns these roles consistently to two sentences describing the same event, regardless of whether the sentence's verb is buy or sell. PropBank annotation also differs in that it takes place with reference to the Penn Treebank trees; not only are annotators shown the trees when analyzing a sentence, but they are also constrained to assign the semantic labels to portions of the sentence corresponding to nodes in the tree. Parse trees are not used in FrameNet; annotators mark the beginning and end points of frame elements in the text and add grammatical function tags.
6. A Quantitative Analysis of the Semantic-Role Labels
The stated aim of PropBank is the training of statistical systems. It also provides a rich resource for a distributional analysis of semantic features of language that have hitherto been somewhat inaccessible. We begin this section with an overview of general characteristics of the syntactic realization of the different semantic-role labels and then attempt to measure the frequency of syntactic alternations with respect to verb class membership. We base this analysis on previous work by Merlo and Stevenson (2001). In the following section we discuss the performance of a system trained to automatically assign the semantic-role labels.</Paragraph>
<Section position="1" start_page="87" end_page="91" type="sub_section"> <SectionTitle> 6.1 Associating Role Labels with Specific Syntactic Constructions </SectionTitle>
<Paragraph position="0"> We begin by simply counting the frequency of occurrence of roles in specific syntactic positions. In all the statistics given in this section, we do not consider past- or present-participle uses of the predicates, thus excluding any passive-voice sentences. The syntactic positions used are based on a few heuristic rules: Any NP under an S node in the treebank is considered a syntactic subject, and any NP under a VP is considered an object. In all other cases, we use the syntactic category of the argument's node in the treebank tree: for example, SBAR for sentential complements and PP for prepositional phrases. For prepositional phrases, as well as for noun phrases that are the object of a preposition, we include the preposition as part of our syntactic role: for example, PP-in, PP-with. Table 6 shows the most frequent semantic roles associated with various syntactic positions, while Table 7 shows the most frequent syntactic positions for various roles.</Paragraph>
<Paragraph position="1"> Tables 6 and 7 show overall statistics for the corpus, and some caution is needed in interpreting the results, as the semantic-role labels are defined on a per-frameset basis and do not necessarily have corpus-wide definitions. Nonetheless, a number of trends are apparent. Arg0, when present, is almost always a syntactic subject, while the subject is Arg0 only 79% of the time. This provides evidence for the notion of a thematic hierarchy in which the highest-ranking role present in a sentence is given the honor of subjecthood. Going from syntactic position to semantic role, the numbered arguments are more predictable than the non-predicate-specific adjunct roles. The two exceptions are the roles of ''modal'' (MOD) and ''negative'' (NEG), which as previously discussed are not syntactic adjuncts at all but were simply marked as ArgMs as the best means of tracking their important semantic contributions. They are almost always realized as auxiliary verbs and as the single adverb (part-of-speech tag RB) not, respectively.</Paragraph> </Section>
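The following is a small illustrative sketch, not part of the PropBank distribution, of the heuristics just described; NLTK's Tree class and the toy parse are assumed purely for demonstration (the case of an NP that is the object of a preposition is omitted for brevity).

```python
# Heuristic syntactic position of an argument node, given its parent:
# NP under S -> subject, NP under VP -> object, PP -> PP-<preposition>,
# otherwise the node's own syntactic category (e.g., SBAR).

from nltk import Tree  # assumed here only as a convenient constituent-tree type

def syntactic_position(parent: Tree, node: Tree) -> str:
    label = node.label().split("-")[0]          # strip function tags such as NP-SBJ
    parent_label = parent.label().split("-")[0]
    if label == "NP" and parent_label == "S":
        return "subject"
    if label == "NP" and parent_label == "VP":
        return "object"
    if label == "PP":
        # include the preposition itself, e.g. PP-in, PP-with
        prep = node[0].leaves()[0].lower() if len(node) > 0 else ""
        return f"PP-{prep}" if prep else "PP"
    return label                                # e.g. SBAR for sentential complements

# Toy example: "He ate pancakes in the kitchen"
s = Tree.fromstring(
    "(S (NP (PRP He)) (VP (VBD ate) (NP (NNS pancakes)) "
    "(PP (IN in) (NP (DT the) (NN kitchen)))))")
vp = s[1]
print(syntactic_position(s, s[0]))    # subject
print(syntactic_position(vp, vp[1]))  # object
print(syntactic_position(vp, vp[2]))  # PP-in
```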
<Section position="2" start_page="91" end_page="95" type="sub_section"> <SectionTitle> 6.2 Associating Verb Classes with Specific Syntactic Constructions </SectionTitle>
<Paragraph position="0"> Turning to the behavior of individual verbs in the PropBank data, it is interesting to see how much correspondence there is between verb classes proposed in the literature and the annotations in the corpus. Table 8 shows the PropBank semantic role labels for the subjects of each verb in each class. Merlo and Stevenson (2001) aim to automatically classify verbs into one of three categories: unergative, unaccusative, and object-drop. These three categories, more coarse-grained than the classes of Levin or VerbNet, are defined by the semantic roles they assign to a verb's subjects and objects in both transitive and intransitive sentences. 6.2.1 Predictions. In our data, the closest analogs to Merlo and Stevenson's three roles of Causal Agent, Agent, and Theme are ArgA, Arg0, and Arg1, respectively. We hypothesize that the PropBank data will confirm:
1. that the subject can take one of two roles (Arg0 or Arg1) for unaccusative and unergative verbs but only one role (Arg0) for object-drop verbs;
2. that Arg1s appear more frequently as subjects for intransitive unaccusatives than they do for intransitive unergatives.</Paragraph>
<Paragraph position="1"> In Table 8 we show counts for the semantic roles of the subjects of the Merlo and Stevenson verbs which appear in PropBank (80% of their verbs), regardless of transitivity, in order to measure whether the data in fact reflect the alternations between syntactic and semantic roles that the verb classes predict. For each verb, we show counts only for occurrences tagged as belonging to the first frameset, reflecting the predominant or unmarked sense.</Paragraph>
<Paragraph position="2"> The object-drop verbs show little variability in our corpus, with the subject almost always being Arg0. The unergative and unaccusative verbs show much more variability in the roles that can appear in the subject position, as predicted, although some individual verbs always have Arg0 as subject, presumably as a result of the small number of occurrences. As predicted, we also see a higher proportion of Arg1 subjects for unaccusatives than for unergatives, with the striking exception of a few unergative verbs, such as jump and rush, whose subjects are almost always Arg1. Jump is affected by the predominance of a financial-subcorpus sense used for stock reportage (79 out of 82 sentences), which takes jump to mean rise dramatically: Jaguar shares jumped 23 before easing to close at 654, up 6. (wsj_1957) Rush is affected by a framing decision, currently being reconsidered, wherein rush was taken to mean cause to move quickly. Thus the entity in motion is tagged Arg1, as in Congress in Congress would have rushed to pass a private relief bill. (wsj_0946) The distinction between unergatives and unaccusatives is not apparent from the PropBank data in this table, since we are not distinguishing between transitives and intransitives, which is left for future experiments.</Paragraph>
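The counts in Table 8 can be thought of as the output of a tally like the following hypothetical sketch (the instance dictionaries and field names are invented for illustration; this is not the released PropBank tooling): for each verb, count which role label fills the subject position, restricted to frameset 1 and to non-participial uses.

```python
from collections import Counter, defaultdict

def subject_role_counts(instances):
    """instances: iterable of dicts with keys 'verb', 'frameset',
    'is_participle', and 'subject_role' (e.g. 'Arg0', 'Arg1', or None)."""
    counts = defaultdict(Counter)
    for inst in instances:
        if inst["frameset"] != 1 or inst["is_participle"]:
            continue                      # predominant sense, non-participial uses only
        if inst["subject_role"] is not None:
            counts[inst["verb"]][inst["subject_role"]] += 1
    return counts

# Toy data in the spirit of the jump discussion above:
toy = [
    {"verb": "jump", "frameset": 1, "is_participle": False, "subject_role": "Arg1"},
    {"verb": "jump", "frameset": 1, "is_participle": False, "subject_role": "Arg1"},
    {"verb": "play", "frameset": 1, "is_participle": False, "subject_role": "Arg0"},
]
for verb, dist in subject_role_counts(toy).items():
    print(verb, dict(dist))   # jump {'Arg1': 2}, play {'Arg0': 1}
```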
<Paragraph position="3"> In most cases, the first frameset (numbered 1 in the PropBank frames files) is the most common, but in a few cases another frameset dominates because of the domain of the text. For example, the second frameset for kick, corresponding to the phrasal usage kick in, meaning begin, accounted for seven instances versus the five instances for frameset 1. The phrasal frameset has a very different pattern, with the subject always corresponding to Arg1, as in ... Friday's one-hour collapse and worked as expected, even though they didn't prevent a stunning plunge. (wsj_2417) Statistics for all framesets of kick are shown in Table 9; the first row in Table 9 corresponds to the entry for kick in the ''Object-Drop'' section of Table 8. Overall, these results support our hypotheses and also highlight the important role played by even the relatively coarse-grained sense tagging exemplified by the framesets.</Paragraph>
<Paragraph position="4">
7. Automatic Determination of Semantic-Role Labels
The stated goal of the PropBank is to provide training data for supervised automatic role labelers, and the project description cannot be considered complete without a discussion of PropBank's suitability for this purpose. One of PropBank's important features as a practical resource is that the sentences chosen for annotation are from the same Wall Street Journal corpus used for the original Penn Treebank project, and thus hand-checked syntactic parse trees are available for the entire data set. In this section, we examine the importance of syntactic information for semantic-role labeling by comparing the performance of a system based on gold-standard parses with one using automatically generated parser output. We then examine whether the additional information contained in a full parse tree is negated by the errors present in automatic parser output, by testing a role-labeling system based on a flat or ''chunked'' representation of the input.</Paragraph>
<Paragraph position="5"> Gildea and Jurafsky (2002) describe a statistical system trained on the data from the FrameNet project to automatically assign semantic roles. The system first passed sentences through an automatic parser (Collins 1999), extracted syntactic features from the parses, and estimated probabilities for semantic roles from the syntactic and lexical features. Both training and test sentences were automatically parsed, as no hand-annotated parse trees were available for the corpus. While the errors introduced by the parser no doubt negatively affected the results obtained, there was no direct way of quantifying this effect. One of the systems evaluated for the Message Understanding Conference task (Miller et al. 1998) made use of an integrated syntactic and semantic model producing a full parse tree and achieved results comparable to those of other systems that did not make use of a complete parse. As in the FrameNet case, the parser was not trained on the corpus for which semantic annotations were available, and the effect of better, or even perfect, parses could not be measured.</Paragraph>
<Paragraph position="6"> In our first set of experiments, the features and probability model of the Gildea and Jurafsky (2002) system were applied to the PropBank corpus. The existence of the hand-annotated treebank parses for the corpus allowed us to measure the improvement in performance offered by gold-standard parses.</Paragraph> </Section>
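To make the overall flow concrete before the detailed description in Section 7.1, here is a hedged, schematic sketch of the kind of decision rule involved: each candidate constituent is described by a handful of features, and the role (or NONE) with the highest estimated probability is chosen. The feature tuples and probability table below are toy stand-ins, not the trained model of Gildea and Jurafsky (2002).

```python
# Toy stand-in for a trained model: P(role | features, predicate).
# In the real system these probabilities are estimated from the annotated
# corpus; here they are invented purely for illustration.
P_ROLE = {
    ("NP", "before", "active", "buy"): {"Arg0": 0.85, "NONE": 0.15},
    ("NP", "after", "active", "buy"):  {"Arg1": 0.80, "NONE": 0.20},
    ("PP", "after", "active", "buy"):  {"ArgM": 0.40, "NONE": 0.60},
}

def label_constituents(constituents, predicate):
    """Pick the highest-probability role (possibly NONE) for each constituent."""
    labels = {}
    for name, features in constituents.items():
        dist = P_ROLE.get(features + (predicate,), {"NONE": 1.0})
        labels[name] = max(dist, key=dist.get)
    return labels

constituents = {
    "Chuck":      ("NP", "before", "active"),   # subject
    "a car":      ("NP", "after", "active"),    # direct object
    "from Jerry": ("PP", "after", "active"),    # prepositional phrase
}
print(label_constituents(constituents, "buy"))
# {'Chuck': 'Arg0', 'a car': 'Arg1', 'from Jerry': 'NONE'}
```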
<Section position="3" start_page="95" end_page="99" type="sub_section"> <SectionTitle> 7.1 System Description </SectionTitle>
<Paragraph position="0"> Probabilities of a parse constituent belonging to a given semantic role are calculated from the following features: The phrase type feature indicates the syntactic type of the phrase expressing the semantic role: Examples include noun phrase (NP), verb phrase (VP), and clause (S). The parse tree path feature is designed to capture the syntactic relation of a constituent to the predicate.</Paragraph>
<Paragraph position="1"> It is defined as the path from the predicate through the parse tree to the constituent in question, represented as a string of parse tree nonterminals linked by symbols indicating upward or downward movement through the tree, as shown in Figure 2. Although the path is composed as a string of symbols, our systems treat the string as an atomic value. The path includes, as the first element of the string, the part of speech of the predicate and, as the last element, the phrase type or syntactic category of the sentence constituent marked as an argument. The position feature simply indicates whether the constituent to be labeled occurs before or after the predicate. This feature is highly correlated with grammatical function, since subjects will generally appear before a verb and objects after. This feature may overcome the shortcomings of reading grammatical function from the parse tree, as well as errors in the parser output.</Paragraph>
<Paragraph position="2"> The voice feature distinguishes between active and passive verbs and is important in predicting semantic roles, because direct objects of active verbs correspond to subjects of passive verbs. An instance of a verb is considered passive if it is tagged as a past participle (e.g., taken), unless it occurs as a descendant of a verb phrase headed by any form of have (e.g., has taken) without an intervening verb phrase headed by any form of be (e.g., has been taken).</Paragraph>
<Paragraph position="3"> 11 While the treebank has a ''subject'' marker on noun phrases, this is the only such grammatical function tag. The treebank does not explicitly represent which verb's subject the node is, and the subject tag is not typically present in automatic parser output.</Paragraph>
<Paragraph position="4"> Figure 2 In this example, the path from the predicate ate to the argument NP He can be represented as VB↑VP↑S↓NP, with ↑ indicating upward movement in the parse tree and ↓ downward movement.</Paragraph>
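As an illustrative sketch (again assuming NLTK's Tree type for convenience; this is not the system's own code), the path feature can be computed by climbing from the predicate's part-of-speech node to the lowest common ancestor and then descending to the argument constituent:

```python
from nltk import Tree

def tree_path(tree, pred_pos, arg_pos, up="↑", down="↓"):
    """Path feature between a predicate preterminal and an argument node.
    pred_pos and arg_pos are NLTK tree positions (tuples of child indices)."""
    # Length of the longest common prefix = position of the lowest common ancestor.
    common = 0
    while (common < min(len(pred_pos), len(arg_pos))
           and pred_pos[common] == arg_pos[common]):
        common += 1
    label = lambda pos: tree[pos].label()
    # From the predicate up to (and including) the common ancestor ...
    ups = [label(pred_pos[:i]) for i in range(len(pred_pos), common - 1, -1)]
    # ... then down to the argument constituent (excluding the ancestor itself).
    downs = [label(arg_pos[:i]) for i in range(common + 1, len(arg_pos) + 1)]
    return up.join(ups) + "".join(down + d for d in downs)

t = Tree.fromstring("(S (NP (PRP He)) (VP (VB ate) (NP (NNS pancakes))))")
print(tree_path(t, pred_pos=(1, 0), arg_pos=(0,)))     # VB↑VP↑S↓NP, as in Figure 2
print(tree_path(t, pred_pos=(1, 0), arg_pos=(1, 1)))   # VB↑VP↓NP (direct-object position)
```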
<Paragraph position="5"> The headword is a lexical feature and provides information about the semantic type of the role filler. Headwords of nodes in the parse tree are determined using the same deterministic set of headword rules used by Collins (1999).</Paragraph>
<Paragraph position="6"> The system attempts to predict argument roles in new data, looking for the highest-probability assignment of roles r_i to all constituents i in the sentence, given the features F_i of each constituent and the predicate p. We break the probability estimation into two parts, the first being the probability P(r_i | F_i, p) of a constituent's role given our five features for the constituent and the predicate p. Because of the sparsity of the data, it is not possible to estimate this probability directly from the counts in the training data. Instead, probabilities are estimated from various subsets of the features and interpolated as a linear combination of the resulting distributions. The interpolation is performed over the most specific distributions for which data are available, which can be thought of as choosing the topmost distributions available from a back-off lattice, shown in Figure 3.</Paragraph>
<Paragraph position="9"> Next, the probabilities P({r_1, ..., r_n} | p) of the set of roles appearing in a sentence given the predicate are used to capture which arguments tend to co-occur. This approach, described in more detail in Gildea and Jurafsky (2002), allows interaction among the role assignments for individual constituents while making certain independence assumptions necessary for efficient probability estimation. In particular, we assume that sets of roles appear independent of their linear order and that the features F of a constituent are independent of other constituents' features given the constituent's role.</Paragraph>
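The back-off idea can be sketched roughly as follows; the particular feature subsets and the geometric weighting scheme below are simplifications chosen for illustration and do not reproduce the exact lattice of Figure 3 or the weights used in the actual system.

```python
from collections import Counter, defaultdict

# Feature subsets ordered from most to least specific. This is an illustrative
# back-off chain, not the exact lattice of Figure 3.
BACKOFF = [
    ("head", "phrase_type", "path", "position", "voice", "predicate"),
    ("phrase_type", "path", "predicate"),
    ("phrase_type", "position", "voice", "predicate"),
    ("predicate",),
]

class RoleModel:
    def __init__(self):
        # counts[subset][projected feature values] -> Counter over role labels
        self.counts = {subset: defaultdict(Counter) for subset in BACKOFF}

    def observe(self, features, role):
        for subset in BACKOFF:
            key = tuple(features[f] for f in subset)
            self.counts[subset][key][role] += 1

    def role_distribution(self, features, weight=0.5):
        """Linearly interpolate the role distributions of those feature subsets
        that have training counts, weighting more specific subsets more heavily
        (a simplification of the interpolation described above)."""
        dist, remaining = Counter(), 1.0
        for subset in BACKOFF:
            key = tuple(features[f] for f in subset)
            seen = self.counts[subset][key]
            if not seen:
                continue
            total = sum(seen.values())
            share = remaining * weight
            for role, count in seen.items():
                dist[role] += share * count / total
            remaining -= share
        norm = sum(dist.values())
        return {role: value / norm for role, value in dist.items()} if norm else {}

model = RoleModel()
model.observe({"head": "he", "phrase_type": "NP", "path": "VB↑VP↑S↓NP",
               "position": "before", "voice": "active", "predicate": "eat"}, "Arg0")
print(model.role_distribution({"head": "she", "phrase_type": "NP", "path": "VB↑VP↑S↓NP",
                               "position": "before", "voice": "active", "predicate": "eat"}))
# {'Arg0': 1.0} -- the unseen headword backs off to less specific distributions
```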
<Paragraph position="12"> 7.1.1 Results. We applied the same system, using the same features, to a preliminary release of the PropBank data. The data set used contained annotations for 72,109 predicate-argument structures containing 190,815 individual arguments and examples from 2,462 lexical predicates (types). In order to provide results comparable with the statistical parsing literature, annotations from section 23 of the treebank were used as the test set; all other sections were included in the training set. The preliminary version of the data used in these experiments was not tagged for WordNet word sense or PropBank frameset. Thus, the system neither predicts the frameset nor uses it as a feature.</Paragraph>
<Paragraph position="13"> The system was tested under two conditions, one in which it is given the constituents which are arguments to the predicate and merely has to predict the correct role, and one in which it has to both find the arguments in the sentence and label them correctly. Results are shown in Tables 10 and 11. Results for FrameNet are based on a test set of 8,167 individual labels from 4,000 predicate-argument structures. As a guideline for interpreting these results, with 8,167 observations, the threshold for statistical significance with p < .05 is a 1.0% absolute difference in performance (Gildea and Jurafsky 2002). For the PropBank data, with a test set of 8,625 individual labels, the threshold for significance is similar. There are 7,574 labels for which the predicate has been seen 10 or more times in training (third column of the tables).</Paragraph>
<Paragraph position="14"> Results for PropBank are similar to those for FrameNet, despite the smaller number of training examples for many of the predicates. The FrameNet data contained at least 10 examples from each predicate, while 12% of the PropBank data had fewer than 10 training examples. Removing these examples from the test set gives 82.8% accuracy with gold-standard parses and 80.9% accuracy with automatic parses.</Paragraph>
<Paragraph position="15"> 7.1.2 Adding Traces. The gold-standard parses of the Penn Treebank include several types of information not typically produced by statistical parsers or included in their evaluation. Of particular importance are traces, empty syntactic categories which generally occupy the syntactic position in which another constituent could be interpreted and include a link to the relevant constituent. Traces are used to indicate cases of wh-extraction, antecedents of relative clauses, and control verbs exhibiting the syntactic phenomena of raising and ''equi.'' Traces are intended to provide hints as to the semantics of individual clauses, and the results in Table 11 show that they do so effectively. When annotating syntactic trees, the PropBank annotators marked the traces along with their antecedents as arguments of the relevant verbs. In line 2 of Table 11, along with all our experiments with automatic parser output, traces were ignored, and the semantic-role label was assigned to the antecedent in both training and test data. In line 3 of Table 11, we assume that the system is given trace information, and in cases of trace chains, the semantic-role label is assigned to the trace in training and test conditions. Trace information boosts the performance of the system by roughly 5%. This indicates that systems capable of recovering traces (Johnson 2002; Dienes and Dubey 2003) could improve semantic-role labeling.</Paragraph>
<Paragraph position="16"> Table 10 Accuracy of semantic-role prediction (in percentages) for known boundaries (the system is given the constituents to classify).</Paragraph>
<Paragraph position="17"> As our path feature is a somewhat unusual way of looking at parse trees, its behavior in the system warrants a closer look. The path feature is most useful as a way of finding arguments in the unknown-boundary condition. Removing the path feature from the known-boundary system results in only a small degradation in performance, from 82.0% to 80.1%. One reason for the relatively small impact may be the sparseness of the feature: 7% of paths in the test set are unseen in training data. The most common values of the feature are shown in Table 12, in which the first two rows correspond to standard subject and object positions. One reason for sparsity is seen in the third row: In the treebank, the adjunction of an adverbial phrase or modal verb can cause an additional VP node to appear in our path feature. We tried two variations of the path feature to address this problem. The first collapses sequences of nodes with the same label, for example, combining rows 2 and 3 of Table 12. The second variation uses only two values for the feature: NP under S (subject position) and NP under VP (object position). Neither variation improved performance in the known-boundary condition.</Paragraph>
<Paragraph position="18"> As a gauge of how closely the PropBank semantic-role labels correspond to the path feature overall, we note that by always assigning the most common role for each path (for example, always assigning Arg0 to the subject position) and using no other features, we obtain the correct role 64.0% of the time, versus 82.0% for the complete system. Conditioning on the path and predicate, which allows the subjects of different verbs to receive different labels but does not allow for alternation behavior within a verb's argument structure, yields an accuracy rate of 76.6%.</Paragraph>
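A minimal sketch of the path-only baseline just described (with toy training tuples rather than the experimental data) simply memorizes the most common role for each path, optionally conditioning on the predicate as well:

```python
from collections import Counter, defaultdict

def train_baseline(examples, use_predicate=False):
    """examples: iterable of (path, predicate, role) tuples."""
    counts = defaultdict(Counter)
    for path, predicate, role in examples:
        key = (path, predicate) if use_predicate else path
        counts[key][role] += 1
    # Keep only the most common role seen for each key.
    return {key: roles.most_common(1)[0][0] for key, roles in counts.items()}

def predict(baseline, path, predicate, use_predicate=False):
    key = (path, predicate) if use_predicate else path
    return baseline.get(key)   # None when the path was never seen in training

train = [("VB↑VP↑S↓NP", "eat", "Arg0"),
         ("VB↑VP↑S↓NP", "give", "Arg0"),
         ("VB↑VP↓NP", "eat", "Arg1")]
baseline = train_baseline(train)
print(predict(baseline, "VB↑VP↑S↓NP", "devour"))   # Arg0 (subject position)
```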
<Paragraph position="19"> Table 13 shows the performance of the system broken down by the argument types in the gold standard. Results are shown for the unknown-boundaries condition, using gold-standard parses and traces (last row, middle two columns of Table 11). The ''Labeled Recall'' column shows how often the semantic-role label is correctly identified, while the ''Unlabeled Recall'' column shows how often a constituent with the given role is correctly identified as filling some semantic role, even if it is labeled with the wrong role. The more central, numbered roles are consistently easier to identify than the adjunct-like ArgM roles, even when the ArgM roles have preexisting Treebank function tags.</Paragraph> </Section>
<Section position="4" start_page="99" end_page="101" type="sub_section"> <SectionTitle> 7.2 The Relation of Syntactic Parsing and Semantic-Role Labeling </SectionTitle>
<Paragraph position="0"> Many recent information extraction systems for limited domains have relied on finite-state systems that do not build a full parse tree for the sentence being analyzed.</Paragraph>
<Paragraph position="1"> Among such systems, Hobbs et al. (1997) built finite-state recognizers for various entities, which were then cascaded to form recognizers for higher-level relations, while Ray and Craven (2001) used low-level ''chunks'' from a general-purpose syntactic analyzer as observations in a trained hidden Markov model. Such an approach has a large advantage in speed, as the extensive search of modern statistical parsers is avoided. It is also possible that this approach is more robust to error than full parsing.</Paragraph>
<Paragraph position="2"> Our experiments working with a flat, ''chunked'' representation of the input sentence, described in more detail in Gildea and Palmer (2002), test this finite-state hypothesis.</Paragraph>
<Paragraph position="3"> In the chunked representation, base-level constituent boundaries and labels are present, but there are no dependencies between constituents. We use gold-standard rather than automatically derived chunk boundaries, which we believe will provide an upper bound on the performance of a chunk-based system. Distance in chunks from the predicate was used in place of the parser-based path feature.</Paragraph>
<Paragraph position="4"> The results in Table 14 show that full parse trees are much more effective than the chunked representation for labeling semantic roles. This is the case even if we relax the scoring criteria to count as correct all cases in which the system correctly identifies the first chunk belonging to an argument (last row of Table 14).</Paragraph>
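To illustrate what the chunk-based system has to work with, the following toy sketch (the chunk list and feature names are invented for illustration) derives per-chunk features, with signed distance in chunks standing in for the tree path; note that attachment information is simply unavailable in this representation.

```python
def chunk_features(chunks, pred_index):
    """chunks: list of (label, text) base chunks; pred_index: index of the
    chunk containing the predicate. Returns one feature dict per other chunk."""
    feats = []
    for i, (label, text) in enumerate(chunks):
        if i == pred_index:
            continue
        feats.append({
            "text": text,
            "phrase_type": label,
            "position": "before" if i < pred_index else "after",
            "distance": i - pred_index,      # signed distance in chunks
        })
    return feats

# "He ate pancakes in the kitchen" as base-level chunks:
chunks = [("NP", "He"), ("VP", "ate"), ("NP", "pancakes"),
          ("PP", "in"), ("NP", "the kitchen")]
for f in chunk_features(chunks, pred_index=1):
    print(f)
# Nothing here says whether the PP attaches to the verb or to the object NP;
# the attachment information carried by the tree path feature is lost.
```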
<Paragraph position="5"> As an example for comparing the behavior of the tree-based and chunk-based systems, consider the following sentence, with the human annotation showing the arguments of the predicate support: [Arg0 Big investment banks] refused to step up to the plate to support [Arg1 the beleaguered floor traders] [ArgM-MNR by buying big blocks of stock], traders say.</Paragraph>
<Paragraph position="6"> In this case, the system failed to find the predicate's Arg0 relation, because it is syntactically distant from the verb support. The original treebank syntactic tree contains a trace which would allow one to recover this relation, coindexing the empty subject position of support with the noun phrase Big investment banks. However, our automatic parser output does not include such traces. The system based on gold-standard trees and incorporating trace information produced exactly the correct labels.</Paragraph>
<Paragraph position="7"> In the chunk-based system's output, the true Arg0 relation is again not found, and it would be difficult to imagine identifying it without building a complete syntactic parse of the sentence. But now, unlike in the tree-based output, the Arg0 label is mistakenly attached to a noun phrase immediately before the predicate. The Arg1 relation in direct-object position is fairly easily identifiable in the chunked representation as a noun phrase directly following the verb. The prepositional phrase expressing the Manner relation, however, is not identified by the chunk-based system. The tree-based system's path feature for this constituent is VB↑VP↓PP, which identifies the prepositional phrase as attaching to the verb and increases its probability of being assigned an argument label. The chunk-based system sees this as a prepositional phrase appearing as the second chunk after the predicate. Although this may be a typical position for the Manner relation, the fact that the preposition attaches to the predicate rather than to its direct object is not represented.</Paragraph>
<Paragraph position="8"> Participants in the 2004 CoNLL semantic-labeling shared task (Carreras and Màrquez 2004) have reported higher results for chunk-based systems, but to date chunk-based systems have not closed the gap with the state-of-the-art results based on parser output.</Paragraph>
<Paragraph position="9"> While statistical parsers such as Collins (1999) return much richer representations than a chunker, they do not include a great deal of the information present in the original Penn Treebank. Specifically, long-distance dependencies indicated by traces in the treebank are crucial for semantic interpretation, but they do not affect the constituent recall and precision metrics most often used to evaluate parsers and are not included in the output of the standard parsers.</Paragraph>
<Paragraph position="10"> Gildea and Hockenmaier (2003) present a system for labeling PropBank's semantic roles based on a statistical parser for combinatory categorial grammar (CCG) (Steedman 2000). The parser, described in detail in Hockenmaier and Steedman (2002), is trained on a version of the Penn Treebank automatically converted to CCG representations. The conversion process uses the treebank's trace information to make underlying syntactic relations explicit. For example, the same CCG-level relation appears between a verb and its direct object whether the verb is used in a simple transitive clause, a relative clause, or a question with wh-extraction. Using the CCG-based parser, Gildea and Hockenmaier (2003) find a 2% absolute improvement over the Collins parser in identifying core or numbered PropBank arguments. This points to the shortcomings of evaluating parsers purely on constituent precision and recall; we feel that a dependency-based evaluation (e.g., Carroll, Briscoe, and Sanfilippo 1998) is more relevant to real-world applications.</Paragraph> </Section> </Section> </Paper>