<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1020"> <Title>THE PENN TREEBANK: ANNOTATING PREDICATE ARGUMENT STRUCTURE</Title> <Section position="4" start_page="0" end_page="114" type="metho"> <SectionTitle> 2. A NEW ANNOTATION SCHEME </SectionTitle> <Paragraph position="0"> We have recently completed a detailed stylebook for this new level of analysis, with consensus across annotators about the particulars of the analysis. This project has taken about eight months of ten-hour-a-week effort across a significant subset of all the personnel of the Penn Treebank. Such a stylebook, much larger and much more fully specified than our initial stylebook, is a prerequisite for high levels of inter-annotator agreement. It is our hope that such a stylebook will also alleviate much of the need for extensive cross-talk between annotators during the annotation task, thereby increasing throughput as well. To ensure that the rules of this new stylebook remain in force, we are now giving annotators about 10% overlapped material to evaluate inter-annotator consistency throughout this new project.</Paragraph> <Paragraph position="1"> We have now begun to annotate this level of structure by editing the present Penn Treebank; from the resulting annotated corpus we intend to automatically extract a bank of predicate-argument structures, intended at the very least for parser evaluation.</Paragraph> <Paragraph position="2"> The remainder of this paper will discuss the implementation of each of four crucial aspects of the new annotation scheme, as well as notational devices to allow predicate-argument structure to be recovered in the face of conjoined structure involving gapping, where redundant syntactic structure within a conjoined structure is deleted. In particular, the new scheme: 1. Incorporates a consistent treatment of related grammatical phenomena. 
The issue here is not that the representation be &quot;correct&quot; given some theoretical analysis or other, but merely that instances of what are descriptively the same phenomenon be represented similarly. In particular, the notation should make it easy to automatically recover predicate-argument structure.</Paragraph> <Paragraph position="3"> 2. Provides a set of null elements in what can be thought of as &quot;underlying&quot; position for phenomena such as wh-movement, passive, and the subjects of infinitival constructions. These null elements must be co-indexed with the appropriate lexical material.</Paragraph> <Paragraph position="4"> 3. Provides some non-context-free annotational mechanism to allow the structure of discontinuous constituents to be easily recovered.</Paragraph> <Paragraph position="5"> 4. Allows for a clear, concise distinction between verb arguments and adjuncts where such distinctions are clear, with some easy-to-use notational device to indicate where such a distinction is somewhat murky.</Paragraph> <Paragraph position="6"> Our first step, just now complete, has been to produce a detailed stylebook for this new level of analysis, with consensus across annotators about the particulars of the analysis. This project has taken about eight months of ten-hour-a-week effort across a significant subset of all the personnel of the Penn Treebank. It has become clear during the first stage of the project that a much larger, much more fully specified stylebook than our initial stylebook is a prerequisite for high levels of inter-annotator agreement. It is our hope that such a stylebook will also alleviate much of the need for extensive cross-talk between annotators during the annotation task, thereby increasing throughput as well. 
To ensure that the rules of this new stylebook remain in force, we intend to give annotators about 10% overlapped material to evaluate inter-annotator consistency throughout this new project.</Paragraph> <Paragraph position="7"> The remainder of this paper discusses the implementation of each of the four points above, as well as notational devices to allow predicate-argument structure to be recovered in the face of conjoined structure involving gapping, where redundant syntactic structure within a conjoined structure is deleted.</Paragraph> </Section> <Section position="5" start_page="114" end_page="114" type="metho"> <SectionTitle> 3. CONSISTENT GRAMMATICAL ANALYSES </SectionTitle> <Paragraph position="0"> The current treebank materials suffer from the fact that differing annotation regimes are used across differing syntactic categories. To allow easy automatic extraction of predicate-argument structure in particular, these differing analyses must be unified. In the original annotation scheme, adjective phrases that serve as sentential predicates have a different structure than VPs, causing sentential adverbs which occur after auxiliaries introducing the ADJP to attach under VP, while sentential adverbs occurring after auxiliaries introducing VPs occur under S. In the current treebank, copular be is treated as a main verb, with predicate adjective or prepositional phrases treated as complements to that verb.</Paragraph> <Paragraph position="1"> In the new stylebook, the predicate is either the lowest (rightmost branching) VP or the phrasal structure immediately under copular BE. In cases when the predicate cannot be identified by those criteria (e.g. in &quot;small clauses&quot; and some inversion structures), the predicate phrase is tagged -PRD (PReDicate).</Paragraph> <Paragraph position="3"> Note that the surface subject is always tagged -SBJ (SuBJect), even though this is usually redundant because the subject can be recognized purely structurally. 
The -TMP tag here marks time (TeMPoral) phrases. Our use of &quot;small clauses&quot; follows one simple rule: every S maps into a single predication, so here the predicate-argument structure would be something like consider(I, fool(Kris)).</Paragraph> </Section> <Section position="6" start_page="114" end_page="115" type="metho"> <SectionTitle> 4. ARGUMENT-ADJUNCT STRUCTURE </SectionTitle> <Paragraph position="0"> In a well-developed predicate-argument scheme, it would seem desirable to label each argument of a predicate with an appropriate semantic label to identify its role with respect to that predicate. It would also seem desirable to distinguish between the arguments of a predicate and adjuncts of the predication. Unfortunately, while it is easy to distinguish arguments and adjuncts in simple cases, it turns out to be very difficult to consistently distinguish these two categories for many verbs in actual contexts. It also turns out to be very difficult to determine a set of underlying semantic roles that holds up in the face of a few paragraphs of text. In our new annotation scheme, we have tried to come up with a middle ground which allows annotation of those distinctions that seem to hold up across a wide body of material.</Paragraph> <Paragraph position="1"> After many attempts to find a reliable test to distinguish between arguments and adjuncts, we have abandoned structurally marking this difference. Instead, we now label a small set of clearly distinguishable roles, building upon syntactic distinctions only when the semantic intuitions are clear cut.</Paragraph> <Paragraph position="2"> Getting annotators to consistently apply even the small set of distinctions we will discuss here is fairly difficult.</Paragraph> <Paragraph position="3"> In the earlier corpus annotation scheme, we originally used only standard syntactic labels (e.g. NP, ADVP, PP, etc.) for our constituents; in other words, every bracket had just one label. 
The limitations of this became apparent when a word belonging to one syntactic category is used for another function or when it plays a role which we want to be able to identify easily. In the present scheme, each constituent has at least one label but as many as four tags, including numerical indices. We have adopted the set of functional tags shown in Figure 2 for use within the current annotation scheme. NPs and Ss which are clearly arguments of the verb are unmarked by any tag. We allow an open class of other cases that individual annotators feel strongly should be part of the VP.</Paragraph> <Paragraph position="4"> These cases are tagged as -CLR (for CLosely Related); they are to be semantically analyzed as adjuncts. This class is an experiment in the current tagging; constituents marked -CLR typically correspond to Quirk et al.'s [11] class of predication adjuncts. At the moment, we distinguish a handful of semantic roles: direction, location, manner, purpose, and time, as well as the syntactic roles of surface subject, logical subject, and (implicit in the syntactic structure) first and second verbal objects.</Paragraph> <Paragraph position="5"> 5. NULL ELEMENTS One important way in which the level of annotation of the current Penn Treebank exceeds that of the Lancaster project is that we have annotated null elements in a wide range of cases. In the new annotation scheme, we co-index these null elements with the lexical material for which the null element stands. The current scheme happens to use two symbols for null elements: *T*, which marks WH-movement and topicalization, and *, which is used for all other null elements, but this distinction is not very important. Co-indexing of null elements is done by suffixing an integer to non-terminal categories (e.g. NP-10, VP-25). This integer serves as an id number for the constituent. A null element itself is followed by the id number of the constituent with which it is co-indexed. 
We use SBARQ to mark WH-questions, and SQ to mark auxiliary-inverted structures. We use the WH-prefixed labels, WHNP, WHADVP, WHPP, etc., only when there is WH-movement; they always leave a co-indexed trace. Crucially, the predicate argument structure can be recovered by simply replacing the null element with the lexical material that it is</Paragraph> <Paragraph position="7"> In passives, the surface subject is tagged -SBJ, a passive trace is inserted after the verb, indicated by (NP *), and co-indexed to the surface subject (i.e. logical object). The logical subject by-phrase, if present, is a child of VP, and is tagged</Paragraph> </Section> <Section position="7" start_page="115" end_page="115" type="metho"> <SectionTitle> -LGS (LoGical Subject). For passives, the predicate argu- </SectionTitle> <Paragraph position="0"> believe (*someone*, shoot (*someone*, Who)) A null element is also used to indicate which lexical NP is to be interpreted as the null subject of an infinitive complement clause; it is co-indexed with the controlling NP, based upon the lexical properties of the verb.</Paragraph> <Paragraph position="1"> We also use null elements to allow the interpretation of other grammatical structures where constituents do not appear in their default positions. Null elements are used in most cases to mark the fronting (or &quot;topicalization&quot;) of any element of an S before the subject (except in inversion). If an adjunct is topicalized, the fronted element does not leave a trace, since the level of attachment is the same; only the word order is different. 
Topicalized arguments, on the other hand, are always marked by a null element:</Paragraph> <Paragraph position="3"> Again, this makes predicate argument interpretation straightforward, if the null element is simply replaced by the constituent to which it is co-indexed.</Paragraph> <Paragraph position="4"> Similarly, if the predicate has moved out of VP, it leaves a null element *T* in the VP node.</Paragraph> <Paragraph position="5"> Here, the SINV node marks an inverted S structure, and the -TPC tag (ToPiC) marks a fronted (topicalized) constituent; the -CLR tag is discussed below.</Paragraph> </Section> <Section position="8" start_page="115" end_page="117" type="metho"> <SectionTitle> 6. DISCONTINUOUS CONSTITUENTS </SectionTitle> <Paragraph position="0"> Many otherwise clear argument/adjunct relations in the current corpus cannot be recovered due to the essentially context-free representation of the current Treebank. For example, currently there is no good representation for sentences in which constituents which serve as complements to the verb occur after a sentential-level adverb. Either the adverb is trapped within the VP, so that the complement can occur within the VP, where it belongs, or else the adverb is attached to the S, closing off the VP and forcing the complement to attach to the S. 
This &quot;trapping&quot; problem serves as a limitation for groups that currently use Treebank material to semiautomatically derive lexicons for particular applications.</Paragraph> <Paragraph position="1"> To solve &quot;trapping&quot; problems and to annotate non-contiguous structure, a wide range of phenomena of the kind discussed above can be handled by simple notational devices that use co-indexing to indicate discontinuous structures.</Paragraph> <Paragraph position="2"> Again, an index number added to the label of the original constituent is incorporated into the null element which shows where that constituent should be interpreted within the predicate argument structure.</Paragraph> <Paragraph position="3"> We use a variety of null elements to show how non-adjacent constituents are related; we refer to such constituents as &quot;pseudo-attached&quot;. There are four different types of pseudo-attach, as shown in Figure 1; the use of each will be explained below: The *ICH* pseudo-attach is used for simple extraposition, solving the most common case of &quot;trapping&quot;: Here, the clause that Terry would catch the ball is to be interpreted as an argument of knew.</Paragraph> <Paragraph position="4"> The *PPA* tag is reserved for so-called &quot;permanent predictable ambiguity&quot;, those cases in which one cannot tell where a constituent should be attached, even given context. Here, annotators attach the constituent at the more likely site (or if that is impossible to determine, at the higher site) and pseudo-attach it at all other plausible sites using the *PPA* null element. 
Within the annotator workstation, this is done with a single mouse click, using pseudo-move and pseudo-promote operations.</Paragraph> <Paragraph position="5"> The *RNR* tag is used for so-called &quot;right-node raising&quot; conjunctions, where the same constituent appears to have been shifted out of both conjuncts.</Paragraph> <Paragraph position="6"> So that certain kinds of constructions can be found reliably within the corpus, we have adopted special marking for some constructions. For example, extraposed sentences which leave behind a semantically null &quot;it&quot; are parsed as follows, using the *EXP* tag: pleasure(teach(*someone*, her)) Note that &quot;It&quot; is recognized as the surface subject, and that the extraposed clause is attached at S level and adjoined to &quot;it&quot; with what we call *EXP*-attach. The *EXP* is automatically co-indexed by our annotator workstation software to the postposed clause. The extraposed clause is interpreted as the subject of a pleasure here; the word it is to be ignored during predicate argument interpretation; this is flagged by the use of a special tag.</Paragraph> </Section> </Paper>