File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/01/w01-1511_metho.xml
Size: 23,625 bytes
Last Modified: 2025-10-06 14:07:43
<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1511"> <Title>Covering Treebanks with GLARF</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Previous Treebanks </SectionTitle> <Paragraph position="0"> There are several corpora annotated with PRED-ARG information, but each encode some distinctions that are different. The Susanne Corpus (Sampson, 1995) consists of about 1/6 of the Brown Corpus annotated with detailed syntactic information. Unlike GLARF, the Susanne framework does not guarantee that each constituent be assigned a grammatical role. Some grammatical roles (e.g., subject, object) are marked explicitly, others are implied by phrasetags (Fr corresponds to the GLARF node label SBAR under a REL-ATIVE arc label) and other constituents are not assigned roles (e.g., constituents of NPs). Apart from this concern, it is reasonable to ask why we did not adapt this scheme for our use. Susanne's granularity surpasses PTB-based GLARF in many areas with about 350 wordtags (part of speech) and 100 phrasetags (phrase node labels).</Paragraph> <Paragraph position="1"> However, GLARF would express many of the details in other ways, using fewer node and part of speech (POS) labels and more attributes and role labels. In the feature structure tradition, GLARF can represent varying levels of detail by adding or subtracting attributes or defining subsumption hierarchies. Thus both Susanne's NP1p word-tag and Penn's NNP wordtag would correspond to GLARF's NNP POS tag. A GLARF-style Susanne analysis of &quot;Ontario, Canada&quot; is (NP</Paragraph> <Paragraph position="3"> style PTB analysis uses the roles NAME1 and NAME2 instead of PROVINCE and COUNTRY, where name roles (NAME1, NAME2) are more general than PROVINCE and COUNTRY in a subsumption hierarchy. In contrast, attempts to convert PTB into Susanne would fail because detail would be unavailable. Similarly, attempts to convert Susanne into the PTB framework would lose information. In summary, GLARF's ability to represent varying levels of detail allows different types of treebank formats to be converted into GLARF, even if they cannot be converted into each other. Perhaps, GLARF can become a lingua franca among annotated treebanks.</Paragraph> <Paragraph position="4"> The Negra Corpus (Brants et al., 1997) provides PRED-ARG information for German, similar in granularity to GLARF. The most significant difference is that GLARF regularizes some phenomena which a Negra version of English would probably not, e.g., control phenomena. Another novel feature of GLARF is the ability to represent paraphrases (in the Harrisian sense) that are not entirely syntactic, e.g., nominalizations as sentences. Other schemes seem to only regularize strictly syntactic phenomena.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Structure of GLARF </SectionTitle> <Paragraph position="0"> In GLARF, each sentence is represented by a typed feature structure. As is standard, we model feature structures as single-rooted directed acyclic graphs (DAGs). Each nonterminal is labeled with a phrase category, and each leaf is labeled with either: (a) a (PTB) POS label and a word (eat, fish, etc.) or (b) an attribute value (e.g., singular, passive, etc.). Types are based on non-terminal node labels, POSs and other attributes (Carpenter, 1992). Each arc bears a feature label which represents either a grammatical role (SBJ, OBJ, etc.) 
<Paragraph position="1"> A constituent involved in multiple surface or logical relations may be at the head of multiple arcs. For example, the surface subject (S-SBJ) of a passive verb is also the logical object (L-OBJ). These two roles are represented as two arcs which share the same head. This sort of structure-sharing analysis originates with Relational Grammar and related frameworks (Perlmutter, 1984; Johnson and Postal, 1980) and is common in feature structure frameworks (LFG, HPSG, etc.). Following (Johnson et al., 1993)2, arcs are typed. There are five different types of role labels; intermediate arcs, for example, represent neither surface nor logical positions.</Paragraph>
<Paragraph position="2"> In &quot;John seemed to be kidnapped by aliens&quot;, &quot;John&quot; is the surface subject of &quot;seem&quot;, the logical object of &quot;kidnapped&quot;, and the intermediate subject of &quot;to be&quot;. Intermediate arcs are helpful for modeling the way sentences conform to constraints. The intermediate subject arc obeys lexical constraints and connects the surface subject of &quot;seem&quot; (COMLEX Syntax class TO-INF-RS (Macleod et al., 1998a)) to the subject of the infinitive. However, the subject of the infinitive in this case is not a logical subject due to the passive. In some cases, intermediate arcs are subject to number agreement, e.g., in &quot;Which aliens did you say were seen?&quot;, the I-SBJ of &quot;were seen&quot; agrees with &quot;were&quot;.</Paragraph>
<Paragraph position="4"> Logical relations, encoded with SL- and L- arcs, are defined more broadly in GLARF than in most frameworks. Any regularization from a non-canonical linguistic structure to a canonical one results in logical relations. Following (Harris, 1968) and others, our model of canonical linguistic structure is the tensed active indicative sentence with no missing arguments. The following argument types will be at the head of logical (L-) arcs based on counterparts in canonical sentences which are at the head of SL- arcs: logical arguments of passives, understood subjects of infinitives, understood fillers of gaps, and interpreted arguments of nominalizations (in &quot;Rome's destruction of Carthage&quot;, &quot;Rome&quot; is the logical subject and &quot;Carthage&quot; is the logical object). While canonical sentence structure provides one level of regularization, canonical verb argument structures provide another. In the case of argument alternations (Levin, 1993), the same role marks an alternating argument regardless of where it occurs in a sentence. Thus &quot;the man&quot; is the indirect object (IND-OBJ) and &quot;a dollar&quot; is the direct object (OBJ) in both &quot;She gave the man a dollar&quot; and &quot;She gave a dollar to the man&quot; (the dative alternation). Similarly, &quot;the people&quot; is the logical object (L-OBJ) of both &quot;The people evacuated from the town&quot; and &quot;The troops evacuated the people from the town&quot;, when we assume the appropriate regularization. Encoding this information allows applications to generalize. For example, a single Information Extraction pattern that recognizes the IND-OBJ/OBJ distinction would be able to handle these two examples. Without this distinction, two patterns would be needed.</Paragraph>
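As a concrete illustration of that point about generalization, here is a small hypothetical Python sketch (ours, not the authors' system): because GLARF assigns the regularized IND-OBJ and OBJ roles in both orders of the dative alternation, one extraction pattern covers both word orders. The dict encoding and the "giving_event" pattern are assumptions made for illustration.

```python
# Regularized roles for the two surface orders of the dative alternation.
gave_1 = {"SBJ": "She", "IND-OBJ": "the man", "OBJ": "a dollar"}  # "She gave the man a dollar"
gave_2 = {"SBJ": "She", "IND-OBJ": "the man", "OBJ": "a dollar"}  # "She gave a dollar to the man"

def giving_event(args):
    """A single IE pattern keyed on the regularized IND-OBJ/OBJ roles."""
    if "IND-OBJ" in args and "OBJ" in args:
        return {"giver": args.get("SBJ"), "recipient": args["IND-OBJ"], "given": args["OBJ"]}
    return None

# One pattern handles both sentences; with surface positions alone, two would be needed.
assert giving_event(gave_1) == giving_event(gave_2)
```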
<Paragraph position="5"> Due to the diverse types of logical roles, we subtype roles according to the type of regularization that they reflect. Depending on the application, one can apply different filters to a detailed GLARF representation, looking only at certain types of arcs. For example, one might choose all logical (L- and SL-) roles for an application that is trying to acquire selection restrictions, or all surface (S- and SL-) roles if one was interested in obtaining a surface parse. For other applications, one might want to choose between subtypes of logical arcs.</Paragraph>
<Paragraph position="7"> Given a trilingual treebank, suppose that a Spanish treebank sentence corresponds to a Japanese nominalization phrase and an English nominalization phrase, e.g., &quot;Disney ha comprado Apple Computers&quot; and &quot;Disney's acquisition of Apple Computers&quot;. Furthermore, suppose that the English treebank analyzes the nominalization phrase both as an NP (Disney = possessive, Apple Computers = object of preposition) and as a paraphrase of a sentence (Disney = subject, Apple Computers = object). For an MT system that aligns the Spanish and English graph representations, it may be useful to view the nominalization phrase in terms of the clausal arguments. However, in a Japanese/English system, we may only want to look at the structure of the English nominalization phrase as an NP.</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 GLARF and the Penn Treebank </SectionTitle>
<Paragraph position="0"> This section focuses on some characteristics of English GLARF and how we map PTB into GLARF, as exemplified by mapping the PTB representation in Figure 1 to the GLARF representation in Figure 2. In the process, we will discuss how some of the more interesting linguistic phenomena are represented in GLARF.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Mapping into GLARF </SectionTitle>
<Paragraph position="0"> Our procedure for mapping PTB into GLARF uses a sequence of transformations. The first transformation applies to PTB, and the output of each transformation i is the input of transformation i+1. As many of these transformations are trivial, we focus on the most interesting set of problems. In addition, we explain how GLARF is used to represent some of the more difficult phenomena.</Paragraph>
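The cascade itself can be pictured as a simple function pipeline. The sketch below is hypothetical: the paper does not enumerate or name the individual transformations, so the three placeholder functions are assumptions. It only illustrates the architecture in which the output of transformation i is the input of transformation i+1.

```python
from functools import reduce

# Hypothetical stand-ins for individual transformations; the real ones
# bracket conjuncts, consult lexical resources, add logical arcs, and so on.
def mark_conjuncts(tree):
    return tree

def add_lexical_information(tree):
    return tree

def add_logical_arcs(tree):
    return tree

TRANSFORMATIONS = [mark_conjuncts, add_lexical_information, add_logical_arcs]

def ptb_to_glarf(ptb_tree):
    """Apply each transformation to the previous transformation's output."""
    return reduce(lambda tree, transform: transform(tree), TRANSFORMATIONS, ptb_tree)
```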
<Paragraph position="1"> (Brants et al., 1997) describes an effort to minimize human effort in the annotation of raw text with comparable PRED-ARG information. In contrast, we are starting with an annotated corpus and want to add as much detail as possible automatically. We are as much concerned with finding good procedures for handling PTB-based parser output as we are with minimizing the effort of future human taggers. The procedures are designed to get the right answer most of the time; human taggers will correct the results when they are wrong.</Paragraph>
<Paragraph position="2"> The treatment of coordinate conjunction in PTB is not uniform. Words labeled CC and phrases labeled CONJP usually function as coordinate conjunctions in PTB. However, a number of problems arise when one attempts to unambiguously identify the phrases which are conjoined. Most significantly, given a phrase XP with conjunctions and commas and some set of other constituents X1, ..., Xn, it is not always clear which Xi are conjuncts and which are not, i.e., Penn does not explicitly mark items as conjuncts, and one cannot assume that all Xi are conjuncts. In GLARF, conjoined phrases are clearly identified and conjuncts in those phrases are distinguished from non-conjuncts. We will discuss each problematic case that we observed in turn.</Paragraph>
<Paragraph position="3"> Instances of words that are marked CC in Penn do not always function as conjunctions. They may play the role of a sentential adverb, a preposition, or the head of a parenthetical constituent. In GLARF, conjoined phrases are explicitly marked with the attribute value (CONJOINED T). The mapping procedures recognize that phrases beginning with CCs and PRN phrases containing CCs, among others, are not conjoined phrases.</Paragraph>
<Paragraph position="4"> A sister of a conjunction (other than a conjunction) need not be a conjunct. There are two cases. First of all, a sister of a conjunction can be a shared modifier, e.g., the right-node-raised &quot;there&quot; in &quot;[[...] or [marketing their best products] there]&quot;. In addition, the boundaries of the conjoined phrase and/or the conjuncts that they contain are omitted in some environments, particularly when single words are conjoined and/or when the phrases occur before the head of a noun phrase or quantifier phrase. Some phrases which are under a single nonterminal node in the treebank (and are not further broken down) include the following: &quot;between $190 million and $195 million&quot;, &quot;Hollingsworth & Vose Co.&quot;, &quot;cotton and acetate fibers&quot;, &quot;those workers and managers&quot;, &quot;this U.S. sales and marketing arm&quot;, and &quot;Messrs. Cray and Barnum&quot;. To overcome this sort of problem, our procedures introduce brackets and mark constituents as conjuncts. Considerations include POS categories, similarity measures, and construction type (e.g., & is typically part of a name), among other factors.</Paragraph>
<Paragraph position="5"> CONJPs have a different distribution than CCs, and different considerations are needed for identifying the conjuncts. CONJPs, unlike CCs, can occur initially, e.g., &quot;[Not only] [was Fred a good doctor], [he was a good friend as well].&quot; Secondly, they can be embedded in the first conjunct, e.g., &quot;[Fred, not only, liked to play doctor], [he was good at it as well.]&quot;</Paragraph>
<Paragraph position="7"> In Figure 2, the conjuncts are labeled explicitly with their roles CONJ1 and CONJ2, the conjunction is labeled CONJUNCTION1, and the top-most VP is explicitly marked as a conjoined phrase with the attribute/value (CONJOINED T).</Paragraph>
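To illustrate the kind of heuristic involved, here is a rough, hypothetical Python sketch (ours, not the paper's actual procedure) of bracketing a flat coordination: it treats the sisters immediately flanking a CC as conjuncts when their POS tags match, and it declines to treat a phrase-initial CC as a conjunction, echoing two of the considerations mentioned above. The real procedures weigh more evidence (similarity measures, construction type such as "&" in names, etc.).

```python
def bracket_conjuncts(children):
    """children: list of (word, POS) pairs for a flat PTB phrase."""
    cc_positions = [i for i, (_, pos) in enumerate(children) if pos == "CC"]
    if not cc_positions:
        return None                  # no coordination at all
    i = cc_positions[0]
    if i == 0 or i == len(children) - 1:
        return None                  # e.g., a phrase-initial CC is not treated as a conjunction here
    left, right = children[i - 1], children[i + 1]
    if left[1] != right[1]:
        return None                  # dissimilar sisters: leave the decision to other evidence
    return {
        "CONJ1": [left],
        "CONJUNCTION1": children[i],
        "CONJ2": [right],
        "REMAINDER": children[:i - 1] + children[i + 2:],  # e.g., a shared head noun
    }

# "cotton and acetate fibers": conjoin the two modifiers, leaving "fibers" outside.
print(bracket_conjuncts([("cotton", "NN"), ("and", "CC"),
                         ("acetate", "NN"), ("fibers", "NNS")]))
```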
<Paragraph position="8"> We merged two lexical resources, NOMLEX (Macleod et al., 1998b) and COMLEX Syntax 3.1 (Macleod et al., 1998a), deriving PP complements of nouns from NOMLEX and using COMLEX for other types of lexical information. We use these resources to help add additional brackets, make additional role distinctions, and fill a gap when its filler is not marked in PTB. Although Penn's -CLR tags are good indicators of complementhood, they only apply to verbal complements. Thus procedures for making adjunct/complement distinctions benefited from the dictionary classes. Similarly, COMLEX's NP-FOR-NP class helped identify those -BNF constituents which were indirect objects (&quot;John baked Mary a cake&quot;, &quot;John baked a cake [for Mary]&quot;). The class PRE-ADJ identified those adverbial modifiers within NPs which really modify the adjective. Thus we could add the following brackets to the NP: &quot;[even brief] exposures&quot;. NTITLE and NUNIT were useful for the analysis of pattern-type noun phrases, e.g., &quot;President Bill Clinton&quot;, &quot;five million dollars&quot;. Our procedures for identifying the logical subjects of infinitives make extensive use of the control/raising properties of COMLEX classes. For example, X is the subject of the infinitives in &quot;X appeared to leave&quot; and &quot;X was likely to bring attention to the problem&quot;. Over the past few years, there has been a lot of interest in automatically recognizing named entities, time phrases, and quantities, among other special types of noun phrases. These phrases have a number of things in common, including: (1) their internal structure can have idiosyncratic properties relative to other types of noun phrases, e.g., person names typically consist of optional titles plus one or more names (first, middle, last) plus an optional post-honorific; and (2) externally, they can occur wherever some more typical phrasal constituent (usually NP) occurs. Identifying these patterns makes it possible to describe these differences in structure, e.g., instead of identifying a head for &quot;John Smith, Esq.&quot;, we identify two names and a post-honorific. If this named entity went unrecognized, we would incorrectly assume that &quot;Esq.&quot; was the head. Currently, we merge the output of a named entity tagger into the Penn Treebank prior to processing. In addition to NE tagger output, we use procedures based on Penn's proper noun wordtags.</Paragraph>
<Paragraph position="9"> In Figure 2, there are four patterns: two NUMBER and two TIME patterns. The TIME patterns are very simple, each consisting of just YEAR elements, although MONTH, DAY, HOUR, MINUTE, etc. elements are possible.</Paragraph>
<Paragraph position="10"> The NUMBER patterns each consist of a single NUMBER (although multiple NUMBER constituents are possible, e.g., &quot;one thousand&quot;) and one UNIT constituent. The types of these patterns are indicated by the PATTERN attribute.</Paragraph>
<Paragraph position="11"> Figures 1 and 2 are corresponding PTB and GLARF representations of gapping. Penn represents gapping via &quot;parallel&quot; indices for corresponding arguments. In GLARF, the shared verb is at the head of two HEAD arcs. GLARF overcomes some problems with structure-sharing analyses of gapping constructions. The verb gap is a &quot;sloppy&quot; (Ross, 1967) copy of the original verb: two separate spending events are represented by one verb. Intuitively, structure sharing implies token identity, whereas type identity would be more appropriate. In addition, the copied verb need not agree with the subject in the second conjunct, e.g., &quot;was&quot;, not &quot;were&quot;, would agree with the second conjunct in &quot;the risks [were] too high and the potential payoff [*gap*] too far in the future&quot;. It is thus problematic to view the gap as identical in every way to the filler in this case. In GLARF, we can thus distinguish the gapping sort of logical arc (L-GAPPING-HEAD) from the other types of L-HEAD arcs. We can stipulate that a gapping logical arc represents an appropriately inflected copy of the phrase at the head of that arc.</Paragraph>
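The following is a minimal, hypothetical sketch of that gapping analysis; the sentence is invented (it is not the Figure 1/2 example) and the dict encoding is only illustrative. The second conjunct has no verb of its own, so its L-GAPPING-HEAD arc points back at the first conjunct's verb, understood as an appropriately inflected copy (type identity) rather than as the very same token.

```python
# Hypothetical gapping example: "Mary spent five dollars, and John ten."
spent = {"pos": "VBD", "word": "spent"}

conj1 = {"label": "S",
         "HEAD": spent,              # ordinary head arc to the overt verb
         "SBJ": {"word": "Mary"},
         "OBJ": {"word": "five dollars"}}

conj2 = {"label": "S",
         "L-GAPPING-HEAD": spent,    # read as an inflected *copy* of the shared verb, not the token itself
         "SBJ": {"word": "John"},
         "OBJ": {"word": "ten"}}

sentence = {"label": "S", "CONJOINED": True,
            "CONJ1": conj1, "CONJUNCTION1": {"word": "and"}, "CONJ2": conj2}
```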
<Paragraph position="12"> In GLARF, the predicate is always explicit.</Paragraph>
<Paragraph position="13"> However, Penn's representation (H. Koti, pc) provides an easy way to represent complex cases, e.g., &quot;John wanted to buy gold, and Mary *gap* silver&quot;. In GLARF, the gap would be filled by the nonconstituent &quot;wanted to buy&quot;. Unfortunately, we believe that this is a necessary burden. A goal of GLARF is to explicitly mark all PRED-ARG relations; given parallel indices, the user must extract the predicate from the text by (imperfect) automatic means. The current solution for GLARF is to provide multiple gaps. The second conjunct of the example in question would have an analysis along the lines of (S (SBJ ...) (PRD ...)), in which one gap is filled by &quot;wanted&quot;, a second gap is filled by &quot;to buy&quot;, and the subject is bound to &quot;Mary&quot;.</Paragraph> </Section> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Japanese GLARF </SectionTitle>
<Paragraph position="0"> Japanese GLARF will have many of the same specifications described above. To illustrate how we will extend GLARF to Japanese, we discuss two difficult-to-represent phenomena: elision and stacked postpositions.</Paragraph>
<Paragraph position="1"> Grammatical analyses of Japanese are often dependency trees which use postpositions as arc labels. Arguments, when elided, are omitted from the analysis. In GLARF, however, we use role labels like SBJ, OBJ, IND-OBJ and COMP, and we mark elided constituents as zeroed arguments. In the case of stacked postpositions, we represent the different roles via different arcs. We also reanalyze certain postpositions as complementizers (subordinators) or adverbs, thus excluding them from canonical roles. Reanalyzing in this way, we arrived at two types of true stacked postpositions: nominalization and topicalization. For example, in Figure 3, the topicalized NP is at the head of two arcs, labeled S-TOP and L-COMP, and the associated postpositions are analyzed as morphological case attributes.</Paragraph> </Section>
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Testing the Procedures </SectionTitle>
<Paragraph position="0"> To test our mapping procedures, we apply them to some PTB files and then correct the resulting representation using ANNOTATE (Brants and Plaehn, 2000), a program for annotating edge-labeled trees and DAGs, originally created for the NEGRA corpus. We chose both files that we have used extensively to tune the mapping procedures (training) and other files. We then convert the resulting GLARF feature structures into triples of the form ⟨Role-Name Pivot Non-Pivot⟩ for all logical arcs (cf. (Caroll et al., 1998)), using some automatic procedures. The &quot;pivot&quot; is the head of headed structures, but may be some other constituent in non-headed structures. For example, in a conjoined phrase, the pivot is the conjunction, and the head would be the list of heads of the conjuncts. Rather than listing the whole pivot and non-pivot phrases in the triples, we simply list the heads of these phrases, which is usually a single word. Finally, we compute precision and recall by comparing the triples generated by our procedures to triples generated from the corrected GLARF.3 An exact match is a correct answer and anything else is incorrect.4</Paragraph>
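A minimal sketch of this scoring scheme, with hypothetical helper names and invented triples: both the system output and the corrected GLARF are reduced to ⟨role, pivot head, non-pivot head⟩ triples, only exact matches count as correct, and precision and recall are computed per sentence (and, as Section 6.1 notes, averaged over sentences).

```python
def score_sentence(system_triples, gold_triples):
    """Per-sentence precision/recall over logical-arc triples; exact match only."""
    system, gold = set(system_triples), set(gold_triples)
    correct = len(system.intersection(gold))
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Invented triples for one sentence: one of the two system triples matches the key exactly.
system = [("L-OBJ", "evacuated", "people"), ("L-SBJ", "evacuated", "town")]
gold   = [("L-OBJ", "evacuated", "people"), ("L-SBJ", "evacuated", "troops")]
print(score_sentence(system, gold))   # (0.5, 0.5)
```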
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 The Test and the Results </SectionTitle>
<Paragraph position="0"> We developed our mapping procedures in two stages. We implemented some mapping procedures based on the PTB manuals, related papers, and actual usage of labels in PTB. After our initial implementation, we tuned the procedures based on a training set of 64 sentences from two PTB files, wsj 0003 and wsj 0051, yielding 1285+ triples.</Paragraph>
<Paragraph position="1"> Then we tested these procedures against a test set consisting of 65 sentences from wsj 0089 (1369 triples). Our results are provided in Figure 4. Precision and recall are calculated on a per-sentence basis and then averaged. The precision for a sentence is the number of correct triples divided by the total number of triples generated. The recall is the total number of correct triples divided by the total number of triples in the answer key.</Paragraph>
<Paragraph position="2"> Out of 187 incorrect triples in the test corpus, 31 reflected the incorrect role being selected (e.g., the adjunct/complement distinction), 139 reflected errors or omissions in our procedures, and 7 triples related to other factors. We expect a sizable improvement as we increase the size of our training corpus and expand the coverage of our procedures, particularly since one omission often resulted in several incorrect triples.</Paragraph>
<Paragraph position="3"> 3 We admit a bias towards our output in a small number of cases (less than 1%). For example, it is unimportant whether &quot;exposed to it&quot; modifies &quot;the group&quot; or &quot;workers&quot; in &quot;a group of workers exposed to it&quot;. The output will get full credit for this example regardless of where the reduced relative is attached.</Paragraph>
<Paragraph position="4"> 4 (Caroll et al., 1998) report about 88% precision and recall for similar triples derived from parser output. However, they allow triples to match in some cases when the roles are different and they do not mark modifier relations.</Paragraph> </Section> </Section> </Paper>