<?xml version="1.0" standalone="yes"?>
<Paper uid="J05-3003">
  <Title>Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks</Title>
  <Section position="2" start_page="0" end_page="330" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> In modern syntactic theories (e.g., lexical-functional grammar [LFG] [Kaplan and Bresnan 1982; Bresnan 2001; Dalrymple 2001], head-driven phrase structure grammar [HPSG] [Pollard and Sag 1994], tree-adjoining grammar [TAG] [Joshi 1988], and combinatory categorial grammar [CCG] [Ades and Steedman 1982]), the lexicon is the central repository for much morphological, syntactic, and semantic information.</Paragraph>
    <Paragraph position="1"> [?] National Centre for Language Technology, School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland. E-mail: {rodonovan,mburke,acahill,josef,away}@computing.dcu.ie.</Paragraph>
    <Paragraph position="2"> + Centre for Advanced Studies, IBM, Dublin, Ireland.</Paragraph>
    <Paragraph position="3"> Submission received: 19 March 2004; revised submission received: 18 December 2004; accepted for publication: 2 March 2005.</Paragraph>
    <Paragraph position="4"> (c) 2005 Association for Computational Linguistics Computational Linguistics Volume 31, Number 3 Extensive lexical resources, therefore, are crucial in the construction of wide-coverage computational systems based on such theories.</Paragraph>
    <Paragraph position="5"> One important type of lexical information is the subcategorization requirements of an entry (i.e., the arguments a predicate must take in order to form a grammatical construction). Lexicons, including subcategorization details, were traditionally produced by hand. However, as the manual construction of lexical resources is time consuming, error prone, expensive, and rarely ever complete, it is often the case that the limitations of NLP systems based on lexicalized approaches are due to bottlenecks in the lexicon component. In addition, subcategorization requirements may vary across linguistic domain or genre (Carroll and Rooth 1998). Manning (1993) argues that, aside from missing domain-specific complementation trends, dictionaries produced by hand will tend to lag behind real language use because of their static nature. Given these facts, research on automating acquisition of dictionaries for lexically based NLP systems is a particularly important issue.</Paragraph>
    <Paragraph position="6"> Aside from the extraction of theory-neutral subcategorization lexicons, there has also been work in the automatic construction of lexical resources which comply with the principles of particular linguistic theories such as LTAG, CCG, and HPSG (Chen and Vijay-Shanker 2000; Xia 1999; Hockenmaier, Bierner, and Baldridge 2004; Nakanishi, Miyao, and Tsujii 2004). In this article we present an approach to automating the process of lexical acquisition for LFG (i.e., grammatical-function-based systems). However, our approach also generalizes to CFG category-based approaches. In LFG, subcategorization requirements are enforced through semantic forms specifying which grammatical functions are required by a particular predicate. Our approach is based on earlier work on LFG semantic form extraction (van Genabith, Sadler, and Way 1999) and recent progress in automatically annotating the Penn-II and Penn-III Treebanks with LFG f-structures (Cahill et al. 2002; Cahill, McCarthy, et al. 2004). Our technique requires a treebank annotated with LFG functional schemata. In the early approach of van Genabith, Sadler, and Way (1999), this was provided by manually annotating the rules extracted from the publicly available subset of the AP Treebank to automatically produce corresponding f-structures. If the f-structures are of high quality, reliable LFG semantic forms can be generated quite simply by recursively reading off the subcategorizable grammatical functions for each local PRED value at each level of embedding in the f-structures. The work reported in van Genabith, Sadler, and Way (1999) was small scale (100 trees) and proof of concept and required considerable manual annotation work. It did not associate frames with probabilities, discriminate between frames for active and passive constructions, properly reflect the effects of long-distance dependencies (LDDs), or include CFG category information. In this article we show how the extraction process can be scaled to the complete Wall Street Journal (WSJ) section of the Penn-II Treebank, with about one million words in 50,000 sentences, based on the automatic LFG f-structure annotation algorithm described in Cahill et al. (2002) and Cahill, McCarthy, et al. (2004). More recently we have extended the extraction approach to the larger, domain-diverse Penn-III Treebank. Aside from the parsed WSJ section, this version of the treebank contains parses for a subsection of the Brown corpus (almost 385,000 words in 24,000 trees) taken from a variety of text genres.</Paragraph>
    <Paragraph position="7">  In addition to extracting grammatical-function1 For the remainder of this work, when we refer to the Penn-II Treebank, we mean the parse-annotated WSJ, and when we refer to the Penn-III Treebank, we mean the parse-annotated WSJ and Brown corpus combined.</Paragraph>
    <Paragraph position="8">  O'Donovan et al. Large-Scale Induction and Evaluation of Lexical Resources based subcategorization frames, we also include the syntactic categories of the predicate and its subcategorized arguments, as well as additional details such as the prepositions required by obliques and particles accompanying particle verbs. Our method discriminates between active and passive frames, properly reflects LDDs in the source data structures, assigns conditional probabilities to the semantic forms associated with each predicate, and does not predefine the subcategorization frames extracted.</Paragraph>
    <Paragraph position="9"> In Section 2 of this article, we briefly outline LFG, presenting typical lexical entries and the encoding of subcategorization information. Section 3 reviews related work in the area of automatic subcategorization frame extraction. Our methodology and its implementation are presented in Section 4. In Section 5 we present results from the extraction process. We evaluate the complete induced lexicon against the COMLEX resource (Grishman, MacLeod, and Meyers 1994) and present the results in Section 6.</Paragraph>
    <Paragraph position="10"> To our knowledge, this is by far the largest and most complete evaluation of subcategorization frames automatically acquired for English. In Section 7, we examine the coverage of our lexicon in regard to unseen data and the rate at which new lexical entries are learned. Finally, in Section 8 we conclude and give suggestions for future work.</Paragraph>
  </Section>
class="xml-element"></Paper>