<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2100">
  <Title>Automatic Extraction of Subcategorization Frames for Czech*</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The subcategorization of verbs is an essential issue in parsing, because it helps a parser disambiguate the attachment of arguments and recover the correct predicate-argument relations. Carroll and Minnen (1998) and Carroll and Rooth (1998) give several reasons why subcategorization information is important for a natural language parser. Machine-readable dictionaries are not comprehensive enough to provide this lexical information (Manning, 1993; Briscoe and Carroll, 1997). Furthermore, such dictionaries are available only for very few languages.</Paragraph>
    <Paragraph position="1"> We need some general method for the automatic extraction of subcategorization information from text corpora.</Paragraph>
    <Paragraph position="2"> Several techniques and results have been reported on learning subcategorization frames (SFs) from text corpora (Webster and Marcus, 1989; Brent, 1991; Brent, 1993; Brent, 1994; Ushioda et al., 1993; Manning, 1993; Ersan and Charniak, 1996; Briscoe and Carroll, 1997; Carroll and Minnen, 1998; Carroll and Rooth, 1998). All of this work deals with English. In this paper we report on techniques that automatically extract SFs for Czech, which is a free word-order language where verb complements have visible case marking. Apart from the choice of target language, this work also differs from previous work in other ways.</Paragraph>
    <Paragraph position="3"> * This work was done during the second author's visit to the University of Pennsylvania. We would like to thank Prof. Aravind Joshi, David Chiang, Mark Dras and the anonymous reviewers for their comments. The first author's work is partially supported by NSF Grant SBR 8920230. Many tools used in this work are the results of project No. VS96151 of the Ministry of Education of the Czech Republic. The data (PDT) is thanks to grant No. 405/96/K214 of the Grant Agency of the Czech Republic. Both grants were given to the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague.</Paragraph>
    <Paragraph position="4"> Unlike all other previous work in this area, we do not assume that the set of SFs is known to us in advance. Also in contrast, we work with syntactically annotated data (the Prague Dependency Treebank, PDT (Hajič, 1998)) where the subcategorization information is not given; although this might be considered a simpler problem as compared to using raw text, we have discovered interesting problems that a user of a raw or tagged corpus is unlikely to face.</Paragraph>
    <Paragraph position="5"> We first give a detailed description of the task of uncovering SFs and also point out those properties of Czech that have to be taken into account when searching for SFs. Then we discuss some differences from the other research efforts.</Paragraph>
    <Paragraph position="6"> We then present the three techniques that we use to learn SFs from the input data.</Paragraph>
    <Paragraph position="7"> In the input data, many observed dependents of the verb are adjuncts. To treat this problem effectively, we describe a novel addition to the hypothesis testing technique that uses subsets of observed frames to permit the learning algorithm to better distinguish arguments from adjuncts.</Paragraph>
    <Paragraph position="8"> Using our techniques, we are able to achieve 88% precision in distinguishing arguments from adjuncts on unseen parsed text.</Paragraph>
  </Section>
</Paper>