<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1010">
  <Title>Automatic Predicate Argument Analysis of the Penn TreeBank</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Arg1 REL
</SectionTitle>
    <Paragraph position="0"> the building rocked.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Arg1 REL
</SectionTitle>
    <Paragraph position="0"> VerbNet. In a related project funded by NSF (NSF-IIS98-00658), we are currently constructing a lexicon, VerbNet, that is intended to overcome some of the limitations of WordNet, an on-line lexical database of English [Miller, 90], by specifically addressing the needs of natural language processing applications. This lexicon exploits the systematic link between syntax and semantics that motivates the Levin classes, and thus provides a clear and regular association between syntactic and semantic properties of verbs and verb classes [Dang, et al., 98, 00; Kipper, et al., 00]. Specific sets of syntactic configurations and appropriate selectional restrictions on arguments are associated with individual senses. This lexicon gives us a first approximation of sense distinctions that are reflected in varying predicate argument structures. As such, these entries provide a suitable foundation for directing consistent predicate-argument labeling of training data.</Paragraph>
    <Paragraph position="1"> The senses in VerbNet are in turn linked to one or more WordNet senses. Since our focus is predicate-argument structure, we can rely on rigorous and objective sense distinction criteria based on syntax. Purely semantic distinctions, such as those made in WordNet, are subjective and potentially unlimited. Our senses are therefore much more coarse-grained than WordNet's, since WordNet senses are purely semantically motivated and often cannot be distinguished syntactically. However, some senses that share syntactic properties can still be distinguished clearly by virtue of different selectional restrictions, which we will also be exploring in the NSF project.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. AUTOMATIC EXTRACTION OF
PREDICATE-ARGUMENT
RELATIONS FROM PARSED
CORPORA
</SectionTitle>
    <Paragraph position="0"> The predicate-argument analysis of a parse tree from a corpus such as the Treebank corpus is performed in three main phases. First, root forms of inflected words are identified using a morphological analyzer derived from the WordNet stemmer and from inflectional information in machine-readable dictionaries such as the Project Gutenberg version of Webster. Also in this phase, phrasal items such as verb-particle constructions, idioms and compound nominals are identified. An efficient matching algorithm recognizes both continuous and discontinuous phrases, as well as phrases where the order of words is not fixed. The matching algorithm makes use of hierarchical declarative constraints on the possible realizations of phrases in the lexicon, and can exploit syntactic contextual cues if a syntactic analysis of the input, such as the parse tree structure of the Treebank, is present. In the next phase, the explicit antecedents of empty constituents are read off from the Treebank annotation, and gaps are filled where implicit linkages have been left unmarked. This is done by heuristic examination of the local syntactic context of traces and relative clause heads. If no explicit markings are present (for automatically generated parses or old-style Treebank parses), they are inferred. The estimated accuracy of this phase of the algorithm is upwards of 90 percent.</Paragraph>
    <Paragraph position="1"> Finally, an efficient tree-template pattern matcher is run on the Treebank parse trees, to identify syntactic relations that signal a predicate-argument relationship between lexical items. The patterns used are fragmentary tree templates similar to the elementary and auxiliary trees of a Tree Adjoining Grammar [XTAG, 95]. Each template typically corresponds to a predication over one or more arguments. There are approximately 200 templates for: transitive, intransitive and ditransitive verbs operating on their subjects, objects and indirect objects; prenominal and predicate adjectives, operating on the nouns they modify; subordinating conjunctions operating on the two clauses that they link; prepositions; determiners; and so on. The templates are organized into a compact network in which shared substructures need to be listed only once, even when they are present in many templates.</Paragraph>
    <Paragraph position="2"> Templates are matched even if they are not contiguous in the tree, as long as the intervening material is well-formed. This allows a transitive template, for example, to match a sentence where an auxiliary verb intervenes between the subject and the main transitive verb, as in "He was dropping it". The mechanism for handling such cases resembles the adjunction mechanism in Tree Adjoining Grammar.</Paragraph>
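The idea of matching a template across intervening material can be illustrated with a deliberately simplified sketch. It is not the tree-template matcher described here: it assumes a clause flattened into (category, word) pairs instead of a parse tree, and it approximates "well-formed intervening material" with a hypothetical whitelist of categories named ADJOINABLE. The category names and the function match_template are illustrative assumptions only.

```python
from typing import Optional

# Hypothetical whitelist of categories allowed to intervene between template slots.
ADJOINABLE = {"AUX", "ADV"}


def match_template(
    template: list[str], clause: list[tuple[str, str]]
) -> Optional[list[str]]:
    """Return the words filling the template slots in order, or None if no match."""
    filled: list[str] = []
    slot = 0
    for category, word in clause:
        if slot != len(template) and category == template[slot]:
            # This constituent fills the next open slot of the template.
            filled.append(word)
            slot += 1
        elif category in ADJOINABLE:
            # Intervening material that is treated as well-formed is skipped.
            continue
        else:
            # Any other material blocks the match.
            return None
    return filled if slot == len(template) else None


clause = [("NP", "He"), ("AUX", "was"), ("V", "dropping"), ("NP", "it")]
transitive = ["NP", "V", "NP"]
print(match_template(transitive, clause))
```

In this toy setting the transitive template still matches "He was dropping it" because the auxiliary is skipped, while an intransitive template would fail on the unfilled object.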
    <Paragraph position="3"> [Figure: Tree grammar template for progressive auxiliary verb, licensing discontinuity in main verb tree.] When a template has been identified, it is instantiated with the lexical items that occur in its predicate and argument positions. Each template is associated with one or more annotated template sets, by means of which it is linked to a bundle of thematic or semantic features, and to a class of lexical items that license the template's occurrence with those features. For instance, if the template is an intransitive verb tree, it will be associated both with an unergative feature bundle, indicating that its subject should have the label Arg0, and also with an unaccusative bundle where the subject is marked as Arg1. Which of the feature bundles gets used depends on the semantic class of the word that appears in the predicate position of the template. [Figure: Recognition of progressive auxiliary tree which modifies and splits transitive-verb tree for drop in Treebank corpus.] If the predicate is a causative verb that takes the unaccusative alternation, the subject will be assigned the Arg1 label. If, however, it is a verb of creation, for example, the subject will be an Arg0. The verb semantics that inform the predicate-argument extractor are theoretically motivated by the Levin classes [Levin, 93], but the actual lexical information it uses is not derived from Levin's work. Rather, it draws on information available in the WordNet 1.6 database [Miller, 90] and on frame codes derived from the annotation scheme used in the Susanne corpus [Sampson, 95].</Paragraph>
    <Paragraph position="4"> For example, one entry for the verb develop specifies its WordNet synset membership, and indicates its participation in the unaccusative alternation with the code o_can_become_s: "develop SF:so_N_N+W:svJ3W_W:svIM2+o_can_become_s". The prefix SF: signifies that this is a frame code derived from the Susanne corpus. Each frame code picks out a lexical class of the words that take it, and the frame codes are organized into an inheritance network as well. The frame codes in turn are linked to annotated template sets, which describe how these frames can actually appear in the syntactic bracketing format of the TreeBank. In the case of the above frame code for an alternating transitive verb, two template sets are linked: TG:V_so_N_N for the frame with a subject and an object (here notated with s and o); and TG:V_s_N+causative, for the unaccusative frame. Each of the template sets lists tree-grammar templates for all the variations of syntactic structure that its corresponding frame may take on. A template for the canonical structure of a simple declarative sentence involving that frame will be present in the set, but additional templates will be added for the forms the frame takes in relative clauses, questions, or passive constructions.</Paragraph>
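As a rough illustration of how such an entry might be read programmatically, the sketch below splits the example entry into a lemma and a list of codes. The entry grammar is not specified in the text, so two things are assumptions: that whitespace separates the lemma from the codes, and that "+" separates the individual components. The function name parse_entry is hypothetical.

```python
def parse_entry(line: str) -> dict:
    """Split a lexicon entry into its lemma and its '+'-separated components.

    Assumption: the lemma comes first, followed by whitespace, and '+' joins
    the remaining components. The source does not define this grammar.
    """
    lemma, codes = line.split(maxsplit=1)
    parts = codes.split("+")
    return {
        "lemma": lemma,
        "codes": parts,
        # The code named in the text as marking the unaccusative alternation.
        "unaccusative": "o_can_become_s" in parts,
    }


entry = parse_entry("develop SF:so_N_N+W:svJ3W_W:svIM2+o_can_become_s")
print(entry["lemma"], entry["codes"], entry["unaccusative"])
```

Under these assumptions the first component carries the SF: prefix and the last one is the alternation code, which is what a lookup of the linked template sets would key on.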
    <Paragraph position="5"> The features for each set are listed separately from the templates, with indications of where they should be interpreted within the various template structures. Hence the template set TG:V_s_N+causative includes the feature TGC:subject+print_as=TGPL:arg1 as part of its feature bundle. This serves to associate the label Arg1 with the subject node in each template in the set. When the predicate-argument extractor is able to instantiate such a template, thereby connecting its subject node with a piece of a TreeBank tree, it knows to print that piece of the tree as Arg1 of the predicate for that template. If another annotated feature set were active instead, for instance in a case where the predicate of the template does not belong to a verb class which licenses the unaccusative frame code and its associated annotated template set (TG:V_s_N+causative), the label of the subject might be different.</Paragraph>
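The label selection described above can be sketched as a small lookup. The template set name TG:V_s_N+causative and the feature string TGC:subject+print_as=TGPL:arg1 are taken from the text. Everything else is an assumption made for illustration: the name TG:V_s_N for an unergative template set, its arg0 feature string, the boolean flag standing in for verb class membership, and the function names.

```python
from typing import Optional

# Feature bundles keyed by annotated template set.
# The first entry is quoted from the text; the second is a hypothetical
# stand-in for an unergative set whose subject is labeled Arg0.
FEATURES = {
    "TG:V_s_N+causative": ["TGC:subject+print_as=TGPL:arg1"],
    "TG:V_s_N": ["TGC:subject+print_as=TGPL:arg0"],
}


def label_for(template_set: str, node: str) -> Optional[str]:
    """Return the print label attached to a node in a template set, if any."""
    prefix = f"TGC:{node}+print_as=TGPL:"
    for feature in FEATURES.get(template_set, []):
        if feature.startswith(prefix):
            # "arg1" becomes "Arg1" to mirror the labels used in the text.
            return feature[len(prefix):].capitalize()
    return None


def subject_label(licenses_unaccusative: bool) -> Optional[str]:
    """Pick the active template set from the verb class, then read its label."""
    chosen = "TG:V_s_N+causative" if licenses_unaccusative else "TG:V_s_N"
    return label_for(chosen, "subject")


print(subject_label(True), subject_label(False))
```

Keeping the features in a table separate from the templates mirrors the arrangement described in the text, where one feature bundle applies to every template in its set.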
  </Section>
</Paper>