<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1047"> <Title>Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank</Title> <Section position="3" start_page="0" end_page="0" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> Creating a (subcategorisation) lexicon by hand is time-consuming, error-prone, requires considerable linguistic expertise and is rarely, if ever, complete.</Paragraph> <Paragraph position="1"> In addition, a system incorporating a manually constructed lexicon cannot easily be adapted to specific domains. Accordingly, many researchers have attempted to construct lexicons automatically, especially for English.</Paragraph> <Paragraph position="2"> (Brent, 1993) relies on local morphosyntactic cues (such as the -ing suffix, except where such a word follows a determiner or a preposition other than to) in the untagged Brown Corpus as probabilistic indicators of six different predefined subcategorisation frames. The frames do not include details of specific prepositions. (Manning, 1993) observes that Brent's recognition technique is a &quot;rather simplistic and inadequate approach to verb detection, with a very high error rate&quot;. Manning feeds the output from a stochastic tagger into a finite state parser, and applies statistical filtering to the parsing results. He predefines 19 different subcategorisation frames, including details of prepositions. Applying this technique to approx. 4 million words of New York Times newswire, Manning acquires 4900 subcategorisation frames for 3104 verbs, an average of 1.6 per verb. (Ushioda et al., 1993) run a finite state NP parser on a POS-tagged corpus to calculate the relative frequency of just six subcategorisation verb classes. In addition, all prepositional phrases are treated as adjuncts. For 1565 tokens of 33 selected verbs, they report an accuracy rate of 83%.</Paragraph> <Paragraph position="3"> (Briscoe and Carroll, 1997) observe that in the work of (Brent, 1993), (Manning, 1993) and (Ushioda et al., 1993), &quot;the maximum number of distinct subcategorization classes recognized is sixteen, and only Ushioda et al. attempt to derive relative subcategorization frequency for individual predicates&quot;. In contrast, the system of (Briscoe and Carroll, 1997) distinguishes 163 verbal subcategorisation classes by means of a statistical shallow parser, a classifier of subcategorisation classes, and a priori estimates of the probability that any verb will be a member of those classes. More recent work by Korhonen (2002) on the filtering phase of this approach has improved results. Korhonen experiments with the use of linguistic verb classes for obtaining more accurate back-off estimates for use in hypothesis selection. Using this extended approach, the average results for 45 semantically classified test verbs evaluated against hand judgements are precision 87.1% and recall 71.2%. By comparison, the average results for 30 verbs not classified semantically are precision 78.2% and recall 58.7%.</Paragraph> <Paragraph position="4"> Carroll and Rooth (1998) use a hand-written head-lexicalised context-free grammar and a text corpus to compute the probability of particular subcategorisation scenarios. The extracted frames do not contain details of prepositions.</Paragraph> <Paragraph position="5"> More recently, a number of researchers have applied similar techniques to derive resources for other languages, especially German. 
One of these, (Schulte im Walde, 2002), induces a computational subcategorisation lexicon for over 14,000 German verbs. Using sentences of limited length, she extracts 38 distinct frame types, which contain maximally three arguments each. The frames may optionally contain details of particular prepositional use. Her evaluation on over 3000 frequently occurring verbs against the German dictionary Duden Das Stilw&quot;orterbuch is similar in scale to ours and is discussed further in Section 5.</Paragraph> <Paragraph position="6"> There has also been some work on extracting subcategorisation details from the Penn Treebank.</Paragraph> <Paragraph position="7"> (Kinyon and Prolo, 2002) introduce a tool which uses fine-grained rules to identify the arguments, including optional arguments, of each verb occurrence in the Penn Treebank, along with their syntactic functions. They manually examined the 150+ possible sequences of tags, both functional and categorial, in Penn-II and determined whether the sequence in question denoted a modifier, argument or optional argument. Arguments were then mapped to traditional syntactic functions. As they do not include an evaluation, currently it is impossible to say how effective this technique is.</Paragraph> <Paragraph position="8"> (Xia et al., 2000) and (Chen and Vijay-Shanker, 2000) extract lexicalised TAGs from the Penn Treebank. Both techniques implement variations on the approaches of (Magerman, 1994) and (Collins, 1997) for the purpose of differentiating between complement and adjunct. In the case of (Xia et al., 2000), invalid elementary trees produced as a result of annotation errors in the treebank are filtered out using linguistic heuristics.</Paragraph> <Paragraph position="9"> (Hockenmaier et al., 2002) outline a method for the automatic extraction of a large syntactic CCG lexicon from Penn-II. For each tree, the algorithm annotates the nodes with CCG categories in a top-down recursive manner. In order to examine the coverage of the extracted lexicon in a manner similar to (Xia et al., 2000), (Hockenmaier et al., 2002) compared the reference lexicon acquired from Sections 02-21 with a test lexicon extracted from Section 23 of the WSJ. It was found that the reference CCG lexicon contained 95.09% of the entries in the test lexicon, while 94.03% of the entries in the test TAG lexicon also occurred in the reference lexicon.</Paragraph> <Paragraph position="10"> Both approaches involve extensive correction and clean-up of the treebank prior to lexical extraction.</Paragraph> </Section> class="xml-element"></Paper>
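The lexicon coverage comparison described for (Xia et al., 2000) and (Hockenmaier et al., 2002) reduces to measuring what proportion of the entries extracted from a held-out section (the test lexicon) already appear in the lexicon extracted from the training sections (the reference lexicon). The following is a minimal sketch of that computation, assuming a hypothetical entry representation of (lemma, frame-or-category) pairs; the function name and the toy entries are illustrative and not taken from either cited system.

```python
# Illustrative sketch: coverage of a test lexicon by a reference lexicon,
# in the style of the comparisons reported for the extracted TAG and CCG
# lexicons. Entries here are hypothetical (lemma, frame/category) pairs,
# not the representation used in either cited paper.

def coverage(reference_entries, test_entries):
    """Percentage of test-lexicon entries that also occur in the reference lexicon."""
    reference = set(reference_entries)
    test = set(test_entries)
    if not test:
        return 0.0
    covered = sum(1 for entry in test if entry in reference)
    return 100.0 * covered / len(test)

# Toy example with invented entries (lemma, subcategorisation frame):
reference_lexicon = [
    ("give", "NP-NP"), ("give", "NP-PP(to)"), ("sleep", "intrans"),
    ("put", "NP-PP(on)"),
]
test_lexicon = [
    ("give", "NP-PP(to)"), ("sleep", "intrans"), ("put", "NP-PP(in)"),
]

print(f"coverage: {coverage(reference_lexicon, test_lexicon):.2f}%")  # 66.67%
```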