<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1047"> <Title>Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Lexical resources are crucial in the construction of wide-coverage computational systems based on modern syntactic theories (e.g. LFG, HPSG, CCG, LTAG). However, as manual construction of such lexical resources is time-consuming, error-prone, expensive and rarely ever complete, the limitations of NLP systems based on lexicalised approaches are often due to bottlenecks in the lexicon component.</Paragraph> <Paragraph position="1"> Given this, automating lexical acquisition for lexically-based NLP systems is a particularly important research issue. In this paper we present an approach to automating subcategorisation frame acquisition for LFG (Kaplan and Bresnan, 1982), i.e.</Paragraph> <Paragraph position="2"> grammatical function-based systems. LFG has two levels of structural representation: c(onstituent)-structure and f(unctional)-structure. LFG differentiates between governable (argument) and non-governable (adjunct) grammatical functions. Subcategorisation requirements are enforced through semantic forms specifying the governable grammatical functions required by a particular predicate (e.g. FOCUS&lt;(↑SUBJ)(↑OBLon)&gt;). Our approach is based on earlier work on LFG semantic form extraction (van Genabith et al., 1999) and recent progress in automatically annotating the Penn-II treebank with LFG f-structures (Cahill et al., 2004b).
Depending on the quality of the f-structures, reliable LFG semantic forms can then be generated quite simply by recursively reading off the subcategorisable grammatical functions for each local pred value at each level of embedding in the f-structures.</Paragraph> <Paragraph position="3"> The work reported in (van Genabith et al., 1999) was a small-scale (100 trees) proof of concept and required considerable manual annotation. In this paper we show how the extraction process can be scaled to the complete Wall Street Journal (WSJ) section of the Penn-II treebank, with about 1 million words in 50,000 sentences, based on the automatic LFG f-structure annotation algorithm described in (Cahill et al., 2004b). In addition to extracting grammatical function-based subcategorisation frames, we also include the syntactic categories of the predicate and its subcategorised arguments, as well as additional details such as the prepositions required by obliques and the particles accompanying particle verbs. Our method does not predefine the frames to be extracted. In contrast to many other approaches, it discriminates between active and passive frames, properly reflects long-distance dependencies and assigns conditional probabilities to the semantic forms associated with each predicate.</Paragraph> <Paragraph position="4"> Section 2 reviews related work in the area of automatic subcategorisation frame extraction. Our methodology and its implementation are presented in Section 3. Section 4 presents the results of our lexical extraction. In Section 5 we evaluate the complete extracted lexicon against the COMLEX resource (MacLeod et al., 1994). To our knowledge, this is the largest evaluation of subcategorisation frames for English. In Section 6, we conclude and give suggestions for future work.</Paragraph> </Section></Paper>