<?xml version="1.0" standalone="yes"?> <Paper uid="J05-3003"> <Title>Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks</Title> <Section position="5" start_page="336" end_page="342" type="metho"> <SectionTitle> 4. Methodology </SectionTitle>
<Paragraph position="0"> The first step in the application of our methodology is the production of a treebank annotated with LFG f-structure information. F-structures are attribute-value structures which represent abstract syntactic information, approximating to basic predicate-argument-modifier structures. Most of the early work on automatic f-structure annotation (e.g., van Genabith, Way, and Sadler 1999; Frank 2000; Sadler, van Genabith, and Way 2000) was applied only to small data sets (fewer than 200 sentences) and was largely proof of concept.</Paragraph>
<Paragraph position="1"> More recent work (Cahill et al. 2002; Cahill, McCarthy, et al. 2004), however, has presented efforts in evolving and scaling up annotation techniques to the Penn-II Treebank (Marcus et al. 1994), which contains more than 1,000,000 words and 49,000 sentences.</Paragraph>
<Paragraph position="2"> We utilize the automatic annotation algorithm of Cahill et al. (2002) and Cahill, McCarthy, et al. (2004) to derive a version of Penn-II in which each node in each tree is annotated with LFG functional annotations in the form of attribute-value structure equations. The algorithm uses categorial, configurational, local head, and Penn-II functional and trace information. The annotation procedure depends on locating the head daughter, for which an amended version of Magerman (1994) is used. The head is annotated with the LFG equation ↑=↓. Linguistic generalizations are provided over the left (prefix) and right (suffix) context of the head for each syntactic category occurring as the mother node of such heads. To give a simple example, the rightmost NP to the left of a VP head under an S is likely to be the subject of the sentence ((↑ SUBJ)=↓), while the leftmost NP to the right of the V head of a VP is most probably the verb's object ((↑ OBJ)=↓). Cahill, McCarthy, et al. (2004) provide four classes of annotation principles: one for noncoordinate configurations, one for coordinate configurations, one for traces (long-distance dependencies), and a final &quot;catch all and clean up&quot; phase.</Paragraph>
<Paragraph position="3"> The satisfactory treatment of long-distance dependencies by the annotation algorithm is imperative for the extraction of accurate semantic forms. The Penn Treebank employs a rich arsenal of traces and empty productions (nodes which do not realize any lexical material) to coindex displaced material with the position where it should be interpreted semantically. The algorithm of Cahill, McCarthy, et al. (2004) translates these traces into corresponding reentrancies in the f-structure representation by treating null constituents as full nodes and recording the traces in terms of index=i f-structure annotations (Figure 3). Passive movement is captured and expressed at f-structure level using a passive:+ annotation. Once a treebank tree is annotated with feature structure equations by the annotation algorithm, the equations are collected, and a constraint solver produces an f-structure.</Paragraph>
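<Paragraph> As an illustration of these annotation principles, the following minimal Python sketch annotates the daughters of a local tree relative to its head. It is a toy reconstruction, not the published algorithm: the head table is an invented fragment of a Magerman-style table, and the rules implement only the SUBJ/OBJ example just given, with a catch-all ADJUNCT default.
# Toy fragment of a Magerman-style head table; the real table covers
# all Penn-II categories and is considerably more refined.
HEAD_TABLE = {"S": ["VP"], "VP": ["VBD", "VB"], "NP": ["NN", "NNS"]}

def find_head(mother, daughters):
    """Return the index of the head daughter (toy rule: rightmost match)."""
    for cat in HEAD_TABLE.get(mother, []):
        for i in range(len(daughters) - 1, -1, -1):
            if daughters[i] == cat:
                return i
    return len(daughters) - 1  # fallback: rightmost daughter

def annotate(mother, daughters):
    """Annotate daughters with LFG equations based on their position
    relative to the head: left context (prefix) vs. right context (suffix)."""
    h = find_head(mother, daughters)
    out = []
    for i, cat in enumerate(daughters):
        if i == h:
            out.append((cat, "up=down"))             # the head itself
        elif mother == "S" and cat == "NP" and h > i:
            out.append((cat, "(up SUBJ)=down"))      # NP left of VP head under S
        elif mother == "VP" and cat == "NP" and i > h:
            out.append((cat, "(up OBJ)=down"))       # NP right of V head under VP
        else:
            out.append((cat, "(up ADJUNCT)=down"))   # catch-all for this sketch
    return out

print(annotate("S", ["NP", "VP"]))    # the NP receives the SUBJ equation
print(annotate("VP", ["VBD", "NP"]))  # the NP receives the OBJ equation
</Paragraph>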
<Paragraph position="4"> In order to ensure the quality of the semantic forms extracted by our method, we must first ensure the quality of the f-structure annotations. The results of two different evaluations of the automatically generated f-structures are presented in Table 2. Both use the evaluation software and triple encoding presented in Crouch et al. (2002). The first of these is against the DCU 105, a gold-standard set of 105 hand-coded f-structures from Section 23 of the Penn Treebank, as described in Cahill, McCarthy, et al. (2004). For the full set of annotations, they achieve precision of over 96.5% and recall of over 96.6%. There is, however, a risk of overfitting when evaluation is limited to a gold standard of this size. More recently, Burke, Cahill, et al. (2004a) carried out an evaluation of the automatic annotation algorithm against the publicly available PARC 700 Dependency Bank (King et al. 2003), a set of 700 randomly selected sentences from Section 23 which have been parsed, converted to dependency format, and manually corrected and extended by human validators. They report precision of over 88.5% and recall of over 86% (Table 2). The PARC 700 Dependency Bank differs substantially from both the DCU 105 f-structure bank and the automatically generated f-structures with regard to the style of linguistic analysis, feature nomenclature, and feature geometry. Some, but not all, of these differences are captured by automatic conversion software. A detailed discussion of the issues inherent in this process and a full analysis of results are presented in Burke, Cahill, et al. (2004a). Results broken down by grammatical function for the DCU 105 evaluation are presented in Table 3. OBL (prepositional phrase) arguments are traditionally difficult to annotate reliably. The results show, however, that with respect to obliques the annotation algorithm, while slightly conservative (recall of 82%), is very accurate: 96% of the time it annotates an oblique, the annotation is correct.</Paragraph>
<Paragraph position="5"> Figure 3: Use of reentrancy between TOPIC and COMP to capture the long-distance dependency in Penn Treebank sentence wsj 0008 2, Until Congress acts, the government hasn't any authority to issue new debt obligations of any kind, the Treasury said.</Paragraph>
<Paragraph position="6"> Once a high-quality set of f-structures has been produced, the semantic form extraction methodology is applied. This is based on, and substantially extends both the granularity and coverage of, an idea in van Genabith, Sadler, and Way (1999): For each f-structure generated, for each level of embedding we determine the local PRED value and collect the subcategorisable grammatical functions present at that level of embedding. (page 72) Consider the automatically generated f-structure in Figure 4 for tree wsj 0003 22 in the Penn-II and Penn-III Treebanks. It is crucial to note that in the automatically generated f-structures the value of the PRED feature is a lemma and not a semantic form. Exploiting the information contained in the f-structure and applying the method described above, we recursively extract the following nonempty semantic forms: impose([subj, obj, obl:on]), in([obj]), of([obj]), and on([obj]). In effect, in both the approach of van Genabith, Sadler, and Way (1999) and our approach, semantic forms are reverse-engineered from automatically generated f-structures for treebank trees.</Paragraph>
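<Paragraph> The extraction idea quoted above can be sketched in a few lines of Python. The following is a minimal illustration over a hand-built toy f-structure (represented as a nested dict; the real f-structures are produced by the annotation algorithm and constraint solver): at each level of embedding it reads off the local PRED and the subcategorisable grammatical functions, keeping only nonempty semantic forms.
SUBCAT = {"subj", "obj", "obj2", "obl", "comp", "xcomp", "part"}

def extract(fstr, forms=None):
    """For each level of embedding, read off the local PRED value and the
    subcategorisable grammatical functions present at that level."""
    if forms is None:
        forms = []
    args = []
    for attr, value in fstr.items():
        base = attr.split(":")[0]          # obl:on has base function obl
        if base in SUBCAT:
            args.append(attr)
            if isinstance(value, dict):    # recurse into embedded f-structures
                extract(value, forms)
    if "pred" in fstr and args:            # keep nonempty semantic forms only
        forms.append((fstr["pred"], args))
    return forms

# Hand-built toy f-structure for "... imposed a gradual ban on ... uses ..."
fs = {"pred": "impose",
      "subj": {"pred": "agency"},
      "obj": {"pred": "ban"},
      "obl:on": {"pred": "on", "obj": {"pred": "use"}}}

for pred, args in extract(fs):
    print(f"{pred}([{', '.join(args)}])")  # on([obj]), impose([subj, obj, obl:on])
</Paragraph>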
<Paragraph> The automatically induced semantic forms contain the following subcategorizable syntactic functions: SUBJ, OBJ, OBJ2, OBL:prep, OBL2, COMP, XCOMP, and PART. PART is not a syntactic function in the strict sense, but we decided to capture the relevant co-occurrence patterns of verbs and particles in the semantic forms. Just as OBL:prep includes the prepositional head of the PP, PART includes the actual particle which occurs, for example, add([subj, obj, part:up]).</Paragraph>
<Paragraph> Table 3: Precision and recall on automatically generated f-structures by feature against the DCU 105.
Feature     Precision          Recall             F-score
ADJUNCT     892/968 = 92       892/950 = 94       93
COMP        88/92   = 96       88/102  = 86       91
COORD       153/184 = 83       153/167 = 92       87
DET         265/267 = 99       265/269 = 99       99
OBJ         442/459 = 96       442/461 = 96       96
OBL         50/52   = 96       50/61   = 82       88
OBL:ag      12/12   = 100      12/12   = 100      100
PASSIVE     76/79   = 96       76/80   = 95       96
RELMOD      46/48   = 96       46/50   = 92       94
SUBJ        396/412 = 96       396/414 = 96       96
TOPIC       13/13   = 100      13/13   = 100      100
TOPICREL    46/49   = 94       46/52   = 88       91
XCOMP       145/153 = 95       145/146 = 99       97</Paragraph>
<Paragraph position="7"> Figure 4: Automatically generated f-structure and extracted semantic forms for the Penn-II Treebank string wsj 0003 22, In July, the Environmental Protection Agency imposed a gradual ban on virtually all uses of asbestos.</Paragraph>
<Paragraph position="8"> In the work presented here, we substantially extend and scale the approach of van Genabith, Sadler, and Way (1999) in regard to coverage, granularity, and evaluation. First, we scale the approach to the full WSJ section of the Penn-II Treebank and the parsed Brown corpus section of Penn-III, with a combined total of approximately 75,000 trees; van Genabith, Sadler, and Way (1999) was a proof of concept on 100 trees. Second, in contrast to the approach of van Genabith, Sadler, and Way (1999) (and many other approaches), our approach fully reflects long-distance dependencies, indicated in terms of traces in the Penn-II and Penn-III Treebanks and corresponding reentrancies at f-structure. Third, in addition to abstract syntactic-function-based subcategorization frames, we also compute frames for syntactic function-CFG category pairs, for both the verbal heads and their arguments, and also generate pure CFG-based subcategorization frames. Fourth, in contrast to the approach of van Genabith, Sadler, and Way (1999) (and many other approaches), our method differentiates between frames for active and passive constructions. Fifth, in contrast to that of van Genabith, Sadler, and Way (1999), our method associates conditional probabilities with frames. Sixth, we evaluate the complete set of semantic forms extracted (not just a selection) against the manually constructed COMLEX (MacLeod, Grishman, and Meyers 1994) resource.</Paragraph>
<Paragraph position="9"> In order to capture CFG-based categorial information, we add a CAT feature to the f-structures automatically generated from the Penn-II and Penn-III Treebanks. Its value is the syntactic category of the lexical item whose lemma gives rise to the PRED value at that particular level of embedding. This makes it possible to classify words and their semantic forms based on their syntactic category and reduces the risk of inaccurate assignment of subcategorization frame frequencies due to POS ambiguity, distinguishing, for example, between the nominal and verbal occurrences of the lemma fight.</Paragraph>
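<Paragraph> A minimal Python sketch of how the CAT feature separates frame counts, using invented observations: keying counts by (lemma, CAT, frame) rather than by lemma alone keeps the nominal and verbal occurrences of a lemma such as fight apart.
from collections import Counter

frame_counts = Counter()

def record(lemma, cat, args):
    """Key frame counts by (lemma, CAT, frame) rather than lemma alone."""
    frame_counts[(lemma, cat, tuple(args))] += 1

# Invented observations for illustration
record("fight", "v", ["subj", "obj"])
record("fight", "v", ["subj"])
record("fight", "n", [])                      # nominal occurrence: no verbal frame
record("impose", "v", ["subj", "obj", "obl:on"])

for (lemma, cat, args), n in sorted(frame_counts.items()):
    print(f"{lemma}({cat},[{', '.join(args)}]): {n}")
</Paragraph>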
<Paragraph> With this, the output for the verb impose in Figure 4 is impose(v,[subj, obj, obl:on]). For some of our experiments, we conflate the different verbal (and other) tags used in the Penn Treebanks to a single verbal marker (Table 4). As a further extension, the extraction procedure reads off the syntactic category of the head of each of the subcategorized syntactic functions: impose(v,[subj(n),obj(n),obl:on]). In this way, our methodology is able to produce surface syntactic as well as abstract functional subcategorization details. Dalrymple (2001) argues that there are cases, albeit exceptional ones, in which constraints on syntactic category are an issue in subcategorization. In contrast to much of the work reviewed in Section 3, which limits itself to the extraction of surface syntactic subcategorization details, our system can provide this information as well as details of grammatical function.</Paragraph>
<Paragraph position="10"> For the passive construction in Figure 5, the extraction algorithm extracts outlaw([subj]). This is incorrect, as outlaw is a transitive verb and therefore requires both a subject and an object to form a grammatical sentence in the active voice. To cope with this problem, the extraction algorithm uses the feature-value pair passive:+, which appears in the f-structure at the level of embedding of the verb in question, to mark that predicate as occurring in the passive: outlaw([subj],p). The annotation algorithm's accuracy in recognizing passive constructions is reflected by the f-score of 96% reported in Table 3 for the PASSIVE feature.</Paragraph>
<Paragraph> Figure 5: Automatically generated f-structure for the Penn-II Treebank string wsj 0003 23, By 1997, almost all remaining uses of cancer-causing asbestos will be outlawed.</Paragraph>
<Paragraph position="11"> The syntactic functions COMP and XCOMP refer to clausal complements with different predicate control patterns, as described in Section 2. However, as it stands, neither of these functions betrays anything about the syntactic nature of the constructs in question. Many lexicons, both automatically acquired and manually created, are more fine grained in their approaches to subcategorized clausal arguments, differentiating, for example, between a that-clause and a to + infinitive clause (Ushioda et al. 1993). With only a slight modification, our system, along with the details provided by the automatically generated f-structures, allows us to extract frames with an equivalent level of detail. For example, to identify a that-clause, we use the feature-value pair that:+ at f-structure level to read off the following subcategorization frame for the verb add: add([subj,comp(that)]). Using the feature-value pair to_inf:+, we can identify to + infinitive clauses, resulting in the following frame for the verb want: want([subj,xcomp(to_inf)]). We can also derive control information about open complements. In Figure 5, the reentrant XCOMP subject is identical to the subject of will in the matrix clause, which allows us to induce information about the nature of the external control of the XCOMP (i.e., whether it is subject or object control).</Paragraph>
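<Paragraph> The refinements described in the last two paragraphs can be sketched as a single frame-printing function in Python. The fragment below is illustrative only (hand-built toy f-structures, simplified feature names such as to_inf): it marks passive frames with ,p and annotates clausal complements with that or to_inf.
def refine_frame(lemma, fstr):
    """Read off a frame, refining COMP/XCOMP with that/to_inf markers and
    flagging passive frames, as in outlaw([subj],p)."""
    subcat = {"subj", "obj", "obj2", "comp", "xcomp"}
    args = []
    for attr, value in fstr.items():
        if attr == "comp" and isinstance(value, dict) and value.get("that"):
            args.append("comp(that)")       # that-clause complement
        elif attr == "xcomp" and isinstance(value, dict) and value.get("to_inf"):
            args.append("xcomp(to_inf)")    # to + infinitive complement
        elif attr in subcat or attr.startswith("obl") or attr.startswith("part"):
            args.append(attr)
    suffix = ",p" if fstr.get("passive") else ""
    return f"{lemma}([{', '.join(args)}]{suffix})"

# Hand-built toy f-structures for illustration
print(refine_frame("outlaw", {"subj": {"pred": "use"}, "passive": True}))
# outlaw([subj],p)
print(refine_frame("add", {"subj": {"pred": "pro"}, "comp": {"that": True}}))
# add([subj, comp(that)])
print(refine_frame("want", {"subj": {"pred": "pro"}, "xcomp": {"to_inf": True}}))
# want([subj, xcomp(to_inf)])
</Paragraph>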
<Paragraph position="14"> In order to estimate the likelihood of the co-occurrence of a predicate with a particular argument list, we compute conditional probabilities for subcategorization frames based on the number of token occurrences in the corpus: P(args_i | P) = count(P, args_i) / Σ_{j=1}^{n} count(P, args_j), where args_1, ..., args_n are the possible argument lists which can occur for P. Given these counts, it is also possible to condition frames on both lemma (P) and voice (v: active/passive): P(args_i | P, v) = count(P, v, args_i) / Σ_{j=1}^{n} count(P, v, args_j). Because of variations in verbal subcategorization across domains, probabilities are also useful for predicting the way in which verbs behave in certain contexts. In Section 6, we use the conditional probabilities to filter possible error judgments by our system.</Paragraph>
<Paragraph position="16"> Tables 5-7 show, with varying levels of analysis, the attested semantic forms for the verb accept with their associated conditional probabilities. The effect of differentiating between the active and passive occurrences of verbs can be seen in the different conditional probabilities associated with the intransitive frame ([subj]) of the verb accept (shown in boldface type) in Tables 5 and 6.</Paragraph>
<Paragraph> Table 5: Semantic forms for the verb accept.</Paragraph>
</Section>
<Section position="6" start_page="342" end_page="343" type="metho"> <SectionTitle> 5. Results </SectionTitle>
<Paragraph position="0"> We extract semantic forms for 4,362 verb lemmas from Penn-III. Table 8 shows the number of distinct semantic form types (i.e., lemma and argument list combinations) extracted. Discriminating obliques by associated preposition and recording particle information, the algorithm finds a total of 21,005 semantic form types, 16,000 occurring in the active voice and 5,005 in the passive voice. When the obliques are parameterized for prepositions and particles are included for particle verbs, we find an average of 4.82 semantic form types per verb. Without the inclusion of details for individual prepositions or particles, the average is 3.45 semantic form types per verb. Unlike many of the researchers whose work is reviewed in Section 3, we do not predefine the frames extracted by our system. Table 9 shows the numbers of distinct frame types extracted from Penn-II, ignoring PRED values.</Paragraph>
<Paragraph position="3"> We provide two columns of statistics: one in which all oblique (PP) arguments are condensed into one OBL function and all particle arguments are condensed into PART, and another in which we differentiate among obl:to (e.g., give), obl:on (e.g., rely), obl:for (e.g., compensate), etc., and likewise for particles. Collapsing obliques and particles into simple functions, we extract 38 frame types. Discriminating particles and obliques by preposition, we extract 577 frame types. Table 10 shows the same results for Penn-III, with 50 simple frame types and 1,084 types when parameterized for prepositions and particles. We also show the result of applying absolute thresholding techniques to the semantic forms induced. Applying an absolute threshold of five occurrences, we still generate 162 frame types</Paragraph>
</Section>
<Section position="7" start_page="343" end_page="344" type="metho"> <SectionTitle> Footnote 5 </SectionTitle> <Paragraph position="0"> To recap, if two verbs have the same subcategorization requirements (e.g., give([subj, obj, obj2]), send([subj, obj, obj2])), then that frame [subj, obj, obj2] is counted only once.</Paragraph> </Section> </Paper>