File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/01/w01-1511_intro.xml
Size: 4,739 bytes
Last Modified: 2025-10-06 14:01:17
<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1511"> <Title>Covering Treebanks with GLARF</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Applications using annotated corpora are often, by design, limited by the information found in those corpora. Since most English treebanks provide limited predicate-argument (PRED-ARG) information, parsers based on these treebanks do not produce more detailed predicate-argument structures (PRED-ARG structures). The Penn Treebank II (Marcus et al., 1994) marks subjects (SBJ), logical objects of passives (LGS), some reduced relative clauses (RRC), as well as other grammatical information, but does not mark each constituent with a grammatical role. In our view, a full PRED-ARG description of a sentence would do just that: assign each constituent a grammatical role that relates that constituent to one or more other constituents in the sentence.</Paragraph> <Paragraph position="1"> For example, the role HEAD relates a constituent to its parent and the role OBJ relates a constituent to the HEAD of its parent. We believe that the absence of this detail limits the range of applications for treebank-based parsers. In particular, it limits the extent to which it is possible to generalize; e.g., marking IND-OBJ and OBJ roles allows one to generalize a single pattern to cover two related examples (&quot;John gave Mary a book&quot; = &quot;John gave a book to Mary&quot;). Distinguishing complement PPs (COMP) from adjunct PPs (ADV) is useful because the former are likely to have an idiosyncratic interpretation, e.g., the object of &quot;at&quot; in &quot;John is angry at Mary&quot; is not a locative and should be distinguished from the locative case by many applications.</Paragraph> <Paragraph position="2"> In an attempt to fill this gap, we have begun a project to add this information using both automatic procedures and hand-annotation.
We are implementing automatic procedures for mapping the Penn Treebank II (PTB) into a PRED-ARG representation, and we are then correcting the output of these procedures manually. In particular, we are hoping to encode information that will enable a greater level of regularization across linguistic structures than is possible with PTB.</Paragraph> <Paragraph position="3"> This paper introduces GLARF, the Grammatical and Logical Argument Representation Framework. We designed GLARF with four objectives in mind: (1) capturing regularizations -- noncanonical constructions (e.g., passives, filler-gap constructions, etc.) are represented in terms of their canonical counterparts (simple declarative clauses); (2) representing all phenomena using one simple data structure: the typed feature structure; (3) consistently labeling all arguments and adjuncts for phrases with clear heads; and (4) producing clear and consistent PRED-ARGs for phrases that do not have heads, e.g., conjoined structures, named entities, etc. -- rather than trying to squeeze these phrases into an X-bar mold, we customized our representations to reflect their head-less properties. We believe that a framework for PRED-ARG needs to satisfy these objectives to adequately cover a corpus like PTB.</Paragraph> <Paragraph position="4"> We believe that GLARF, because of its uniform treatment of PRED-ARG relations, will be valuable for many applications, including question answering, information extraction, and machine translation. In particular, for MT, we expect it will benefit procedures which learn translation rules from syntactically analyzed parallel corpora, such as (Matsumoto et al., 1993; Meyers et al., 1996). Much closer alignments will be possible using GLARF, because of its multiple levels of representation, than would be possible with surface structure alone (an example is provided at the end of Section 2).
For this reason, we are currently investigating the extension of our mapping procedure to treebanks of Japanese (the Kyoto Corpus) and Spanish (the UAM Treebank (Moreno et al., 2000)). Ultimately, we intend to create a parallel trilingual treebank using a combination of automatic methods and human correction. Such a treebank would be a valuable resource for corpus-trained MT systems.</Paragraph> <Paragraph position="5"> The primary goal of this paper is to discuss the considerations for adding PRED-ARG information to PTB, and to report on the performance of our mapping procedure. We intend to wait until these procedures are mature before beginning annotation on a larger scale. We also describe our initial research on covering the Kyoto Corpus of Japanese with GLARF.</Paragraph> </Section> </Paper>