<?xml version="1.0" standalone="yes"?> <Paper uid="J05-3003"> <Title>Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks</Title> <Section position="8" start_page="344" end_page="361" type="evalu"> <SectionTitle> 6. Evaluation </SectionTitle> <Paragraph position="0"> Most of the previous approaches discussed in Section 3 have been evaluated to different degrees. In general, a small number of frequently occurring verbs is selected, and the subcategorization frames extracted for these verbs (from some quantity of unseen test data) are compared to a gold standard. The gold standard is either manually custom-made on the basis of the test data or adapted from an existing external resource such as the OALD (Hornby 1980) or COMLEX (MacLeod, Grishman, and Meyers 1994). There are advantages and disadvantages to both types of gold standard. While it is time-consuming to construct a custom-made standard manually, the resulting standard has the advantage of containing only the subcategorization frames exhibited in the test data. Using an existing externally produced resource is quicker, but the gold standard may contain many more frames than those which occur in the data from which the test lexicon is induced or, indeed, may omit relevant correct frames contained in the data. As a result, systems generally score better against custom-made, manually established gold standards.</Paragraph> <Paragraph position="1"> Carroll and Rooth (1998) achieve an F-score of 77% against the OALD when they evaluate a selection of 100 verbs, each with an absolute frequency greater than 500. Their system recognizes 15 frames, and these do not contain details of subcategorized-for prepositions. Still, to date this is the largest number of verbs used in any of the evaluations of the systems for English described in Section 3. 
Sarkar and Zeman (2000) evaluate 914 Czech verbs against a custom-made gold standard and record a token recall of 88%. However, their evaluation does not examine the extracted subcategorization frames but rather the argument-adjunct distinctions posited by their system. The largest lexical evaluation we know of is that of Schulte im Walde (2002b) for German. She evaluates 3,000 German verbs with a token frequency between 10 and 2,000 against the Duden (Dudenredaktion 2001). We will refer to this work and the methods and results presented by Schulte im Walde again in Sections 6.2 and 6.3.</Paragraph> <Paragraph position="2"> We carried out a large-scale evaluation of our automatically induced lexicon (2,993 active verb lemmas for Penn-II and 3,529 for Penn-III, as well as 1,422 passive verb lemmas from Penn-II) against the COMLEX resource. To our knowledge this is the most extensive evaluation ever carried out for English lexical extraction. We conducted a number of experiments on the subcategorization frames extracted from Penn-II and Penn-III which are described and discussed in Sections 6.2, 6.3, and 6.4. Finding a common format for the gold standard and induced lexical entries is a nontrivial task. To ensure that we did not bias the evaluation in favor of either resource, we carried out two different mappings for the frames from Penn-II and Penn-III: COMLEX-LFG Mapping I and COMLEX-LFG Mapping II. For each mapping we carried out six basic experiments (and two additional ones for COMLEX-LFG Mapping II) for the active subcategorization frames extracted. Within each experiment, the following factors were varied: level of prepositional phrase detail, level of particle detail, relative threshold (1% or 5%), and incorporation of an expanded set of directional prepositions. Using the second mapping we also evaluated the automatically extracted passive frames and experimented with absolute thresholds. 
Direct comparison of subcategorization frame acquisition systems is difficult because of variations in the number of frames extracted, the number of test verbs, the gold standards used, the size of the test data, and the level of detail in the subcategorization frames (e.g., whether they are parameterized for specific prepositions). Therefore, in order to establish a baseline against which to compare our results, following Schulte im Walde (2002b), we assigned the two most frequent frame types (transitive and intransitive) by default to each verb and compared this &quot;artificial&quot; lexicon to the gold standard. The section concludes with a full discussion of the reported results.</Paragraph> <Section position="1" start_page="345" end_page="347" type="sub_section"> <SectionTitle> 6.1 COMLEX </SectionTitle> <Paragraph position="0"> We evaluate our induced semantic forms against COMLEX (MacLeod, Grishman, and Meyers 1994), a computational machine-readable lexicon containing syntactic information for approximately 38,000 English headwords. Its creators paid particular attention to the encoding of more detailed subcategorization information than is available in either the OALD or the LDOCE (Proctor 1978), both for verbs and for nouns and adjectives which take complements (Grishman, MacLeod, and Meyers 1994). Figure 6 shows the intersection between active-verb lemma types in COMLEX and the Penn-II-induced lexicon. By choosing to evaluate against COMLEX, we set our sights high: Our extracted semantic forms are fine-grained, and COMLEX is considerably more detailed than the OALD or LDOCE used for earlier evaluations. While our system can generate semantic forms for any lemma (regardless of part of speech) which induces a PRED value, we have thus far evaluated the automatic generation of subcategorization frames for verbs only. 
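As a concrete illustration of the &quot;artificial&quot; baseline described at the start of this section, the following Python sketch assigns the transitive and intransitive frames to every verb by default and scores the result against a toy gold standard. This is an illustrative sketch only; the bracketed frame notation and the example verbs are invented, not drawn from COMLEX.

```python
# Baseline sketch: every verb receives the two most frequent frame types
# (transitive and intransitive) by default, regardless of its behavior.

def baseline_lexicon(verbs):
    """Assign the transitive and intransitive frames to every verb."""
    return {v: {"[subj,obj]", "[subj]"} for v in verbs}

def evaluate(induced, gold):
    """Count true positives, false positives, and false negatives."""
    tps = fps = fns = 0
    for verb, gold_frames in gold.items():
        found = induced.get(verb, set())
        tps += len(found & gold_frames)
        fps += len(found - gold_frames)
        fns += len(gold_frames - found)
    return tps, fps, fns

# Toy gold standard (invented for illustration).
gold = {
    "reimburse": {"[subj,obj]", "[subj,obj,obj2]"},
    "sleep": {"[subj]"},
}
base = baseline_lexicon(gold)
print(evaluate(base, gold))  # (2, 2, 1)
```

The point of the baseline is that a frequent-frame default already matches many gold entries, so an induced lexicon must beat these counts to demonstrate value.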
COMLEX defines 138 distinct verb frame types without the inclusion of specific prepositions or particles.</Paragraph> <Paragraph position="1"> As COMLEX contains information other than subcategorization details, it was necessary for us to extract the subcategorization frames associated with each verbal lexicon entry. The following is a sample entry for the verb reimburse:</Paragraph> <Paragraph position="2"> (verb :orth &quot;reimburse&quot; :subc ((np-np) (np-pp :pval (&quot;for&quot;)) (np)))</Paragraph> <Paragraph position="3"> Each entry is organized as a nested set of typed feature-value lists. The first symbol (i.e., VERB) gives the part of speech. The value of the :ORTH feature is the base form of the verb. Any entry with irregular morphological behavior will also include the features :PLURAL, :PAST, and so on, with the relevant values. All verbs have a :SUBC feature, and for our purposes this is the most interesting feature. In the case of the example above, the subcategorization values specify that reimburse can occur with two object noun phrases (NP-NP), with an object noun phrase followed by a prepositional phrase headed by for (NP-PP :PVAL (&quot;for&quot;)), or with just an object noun phrase (NP). (Note that the details of the subject are not included in COMLEX frames.) What makes the COMLEX resource particularly suitable for our evaluation is that each of the complement types (NP-NP, NP-PP, and NP) which make up the value of the :SUBC feature is associated with a formal frame definition which looks like the following: (vp-frame np-np :cs ((np 2)(np 3)) :gs (:subject 1 :obj 2 :obj2 3) :ex &quot;she asked him his name&quot;) The value of the :cs feature is the constituent structure of the subcategorization frame, which lists the syntactic CF-PSG constituents in sequence (again omitting the subject). The value of the :gs feature is the grammatical structure, which indicates the functional role played by each of the CF-PSG constituents. The elements of the 
constituent structure are indexed, and these indices are referenced in the :gs field. The index 1 always refers to the surface subject of the verb. This mapping between constituent structure and functional structure makes the information contained in COMLEX particularly suitable as an evaluation standard for the LFG semantic forms which we induce.</Paragraph> <Paragraph position="4"> We present the evaluation for the verbs which occur in an active context in the treebank. COMLEX does not provide passive frames. For Penn-II, there are 2,993 verb lemmas (used actively) that both resources have in common. 2,669 verb lemmas appear in COMLEX but not in the induced lexicon, and 416 verb lemmas (used actively) appear in the induced lexicon but not in COMLEX (Figure 6). For Penn-III, COMLEX and the induced lexicon share 3,529 verb lemmas (used actively). This is shown in Figure 7, which gives the intersection between active-verb lemma types in COMLEX and the Penn-III-induced lexicon.</Paragraph> </Section> <Section position="2" start_page="347" end_page="350" type="sub_section"> <SectionTitle> 6.2 COMLEX-LFG Mapping I and Penn-II </SectionTitle> <Paragraph position="0"> In order to carry out the evaluation, we have to find a common format for the expression of subcategorization information between our induced LFG-style subcategorization frames and those contained in COMLEX. The following are the common syntactic functions: SUBJ, OBJ, OBJ2, COMP, and PART. Unlike our system, COMLEX does not distinguish an OBL from an OBJ; we therefore map our OBL functions to OBJi. As in COMLEX, the value of i depends on the number of objects/obliques already present in the semantic form. COMLEX does not differentiate between COMPs and XCOMPs as our system does (control information is expressed in a different way: see Section 6.3), so we conflate our two LFG categories to that of COMP. 
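A COMLEX frame definition of the kind shown above can be read off mechanically: sorting the :gs entries by their constituent indices yields the sequence of grammatical functions, which is the core of an LFG-style semantic form. The sketch below is our own illustrative rendering: the np-np frame table is hand-copied from the example in the text, while the function-name mapping and the `semantic_form` helper are assumptions, not the paper's implementation.

```python
# Illustrative sketch: turning a COMLEX-style frame definition (:cs/:gs)
# into an LFG-style semantic form. Only the np-np frame from the text is
# included; the GF_MAP names are our own simplification.

COMLEX_FRAMES = {
    "np-np": {"cs": [("np", 2), ("np", 3)],
              "gs": [("subject", 1), ("obj", 2), ("obj2", 3)]},
}

GF_MAP = {"subject": "subj", "obj": "obj", "obj2": "obj2", "comp": "comp"}

def semantic_form(lemma, frame_name):
    """Read the grammatical functions off the :gs field, in index order."""
    gs = sorted(COMLEX_FRAMES[frame_name]["gs"], key=lambda pair: pair[1])
    funcs = [GF_MAP[role] for role, _ in gs]
    return "%s([%s])" % (lemma, ",".join(funcs))

print(semantic_form("ask", "np-np"))  # ask([subj,obj,obj2])
```

The index-sorted read-out is what makes COMLEX convenient here: index 1 is always the surface subject, so the functional structure falls out without any parsing of the constituent structure.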
The process is summarized in Table 11.</Paragraph> <Paragraph position="5"> The manually constructed COMLEX entries provide a gold standard against which we evaluate the automatically induced frames. We calculate the number of true positives (tps) (where our semantic forms and those from COMLEX are the same), the number of false negatives (fns) (those frames which appeared in COMLEX but were not produced by our system), and the number of false positives (fps) (those frames produced by our system which do not appear in COMLEX). (Footnote 6: Given these figures, one might begin to wonder about the value of automatic induction. First, COMLEX does not rank frames by probabilities, which are essential in disambiguation. Second, the coverage of COMLEX is not complete: 518 lemmas &quot;discovered&quot; by the induction experiment are not listed in COMLEX; see the error analysis in Section 6.5.)</Paragraph> <Paragraph position="6"> We calculate precision, recall, and F-score using the following standard equations: precision = tps / (tps + fps); recall = tps / (tps + fns); F-score = (2 x precision x recall) / (recall + precision).</Paragraph> <Paragraph position="8"> We use the frequencies associated with each of our semantic forms in order to set a relative threshold to filter the selection of semantic forms. For a threshold of 1%, we disregard any semantic forms with a conditional probability (i.e., given a lemma) of less than or equal to 0.01. As some verbs occur less frequently than others, we think it is important to use a relative rather than an absolute threshold (as in Carroll and Rooth [1998], for instance) in this way. We carried out the evaluation in a way similar to Schulte im Walde's (2002b) for German, the only experiment comparable in scale to ours. Despite the obvious differences in approach and language, this allows us to make some tentative comparisons between our respective results. 
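The three equations can be stated directly in code. A minimal Python rendering follows, sanity-checked against the Experiment 1 figures reported in this section (precision 75.2%, recall 69.1%, F-score 72.0%); nothing beyond those published percentages is assumed.

```python
# The standard evaluation metrics used throughout this section, computed
# from counts of true positives (tps), false positives (fps), and false
# negatives (fns).

def precision(tps, fps):
    return tps / (tps + fps)

def recall(tps, fns):
    return tps / (tps + fns)

def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Sanity check: precision 75.2% and recall 69.1% (Experiment 1, Table 12)
# combine to the reported F-score of 72.0%.
print(round(100 * f_score(0.752, 0.691), 1))  # 72.0
```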
The statistics shown in Table 12 give the results of three different experiments with the relative threshold set to 1%. As for all the results tables, the baseline statistics (simply assigning the most frequent frames, in this case transitive and intransitive, to each lemma by default) are in each case shown in the left column, and the results achieved by our induced lexicon are presented in the right column. Distinguishing between complement and adjunct prepositional phrases is a notoriously difficult aspect of automatic subcategorization frame acquisition. For this reason, following the evaluation setup in Schulte im Walde (2002b), the three experiments vary with respect to the amount of prepositional information contained in the subcategorization frames.</Paragraph> <Paragraph position="9"> Experiment 1. Here we excluded subcategorized prepositional-phrase arguments entirely from the comparison. In a manner similar to that of Schulte im Walde (2002b), any frames containing an OBL were mapped to the same frame type minus that argument.</Paragraph> <Paragraph position="10"> For example, the frame [subj,obl:for] becomes [subj]. Using a relative threshold of 1% (Table 12), our results (precision of 75.2%, recall of 69.1%, and F-score of 72.0%) are remarkably similar to those of Schulte im Walde (2002b), who reports precision of 74.53%, recall of 69.74%, and an F-score of 72.05%.</Paragraph> <Paragraph position="11"> Experiment 2. Here we include subcategorized prepositional-phrase arguments, but only in their simplest form; that is, they are not parameterized for particular prepositions. For example, the frame [subj,obl:for] is rewritten as [subj,obl]. Using a relative threshold of 1% (Table 12), our results (precision of 65.5%, recall of 63.1%, and F-score of 64.3%) compare favorably to those of Schulte im Walde (2002b), who recorded precision of 60.76%, recall of 63.91%, and an F-score of 62.30%.</Paragraph> <Paragraph position="12"> Experiment 3. 
Here we used semantic forms which contain details of specific prepositions for any subcategorized prepositional phrase (e.g., [subj,obl:for]). Using a relative threshold of 1% (Table 12), our precision figure (71.8%) is quite high (in comparison to the 65.52% recorded by Schulte im Walde [2002b]). However, our recall (16.8%) is very low (compared to the 50.83% that Schulte im Walde [2002b] reports). Consequently, our F-score (27.3%) is also low (Schulte im Walde [2002b] records an F-score of 57.24%). The reason for this is discussed in Section 6.2.1.</Paragraph> <Paragraph position="13"> The statistics in Table 13 are the result of the second experiment, in which the relative threshold was increased to 5%. The effect of such an increase is obvious in that precision goes up (by as much as 5%) for each of the three evaluations while recall goes down (by as much as 5.5%). This is to be expected, as a greater threshold means that there are fewer semantic forms associated with each verb in the induced lexicon, but they are more likely to be correct because of their greater frequency of occurrence. The conditional probabilities we associate with each semantic form, together with thresholding, can be used to customize the induced lexicon to the task for which it is required, that is, whether a very precise lexicon is preferred to one with broader coverage. In Tables 12 and 13, the baseline is exceeded in all experiments with the exception of Experiment 2. This can be attributed to Mapping I, in which OBL is mapped to OBJ (Table 11). Experiment 2 includes obliques without the specific preposition, meaning that in this mapping, the frame [subj,obj:with] becomes [subj,obj]. Therefore, the transitive baseline frame scores better than it should against the gold standard. 
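The relative-threshold filter described above, which discards any semantic form whose conditional probability given the lemma is at or below the cutoff, can be sketched as follows. The frame counts are invented for illustration; only the filtering rule itself comes from the text.

```python
# Sketch of relative thresholding: keep only those frames whose
# conditional probability given the lemma exceeds the threshold
# (forms at or below the cutoff are disregarded, as in the text).

def threshold_frames(frame_counts, threshold=0.01):
    total = sum(frame_counts.values())
    return {f for f, c in frame_counts.items() if c / total > threshold}

# Invented counts for one lemma: 95 transitive, 4 intransitive,
# 1 oblique occurrence out of 100 total.
counts = {"[subj,obj]": 95, "[subj]": 4, "[subj,obl:for]": 1}
print(sorted(threshold_frames(counts, 0.01)))  # ['[subj,obj]', '[subj]'] sorted
print(sorted(threshold_frames(counts, 0.05)))  # ['[subj,obj]']
```

Raising the threshold from 1% to 5% prunes the rarer frames, which is exactly the precision/recall trade-off observed between Tables 12 and 13.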
A more fine-grained LFG-COMLEX mapping in which this effect disappears is presented in Section 6.3.</Paragraph> <Paragraph position="16"> 6.2.1 Directional Prepositions. Our recall statistic was particularly low in the case of evaluation using details of prepositions (Experiment 3, Tables 12 and 13). This can be accounted for by the fact that the creators of COMLEX have chosen to err on the side of overgeneration in regard to the list of prepositions which may occur with a verb and a subcategorization frame containing a prepositional phrase. This is particularly true of directional prepositions. For COMLEX, a list of 31 directional prepositions (Table 14) was prepared and assigned in its entirety by default to any verb which can potentially appear with any directional preposition in order to save time and avoid the risk of missing prepositions. Grishman, MacLeod, and Meyers (1994) acknowledge that this can lead to a preposition list which is &quot;a little rich&quot; for a particular verb, but this is the approach they have chosen to take. In a subsequent experiment, we incorporated this list of directional prepositions by default into our semantic form induction process in the same way as the creators of COMLEX have done. Table 15 shows that doing so results in a significant improvement in the recall statistic (45.1%), as would be expected, with the new statistic being almost three times as good as the result reported in Table 12 for Experiment 3 (16.8%). There is also an improvement in the precision figure (from 71.8% to 86.9%). This is due to a substantial increase in the number of true positives (from 5,612 to 14,675) compared with a stationary false positive figure (2,205 in both cases). 
The F-score increases from 27.3% to 59.4%.</Paragraph> </Section> <Section position="3" start_page="350" end_page="353" type="sub_section"> <SectionTitle> 6.3 COMLEX-LFG Mapping II and Penn-II </SectionTitle> <Paragraph position="0"> The COMLEX-LFG Mapping I presented above establishes a &quot;least common denominator&quot; for the COMLEX and our LFG-inspired resources. More fine-grained mappings are possible: in order to ensure that the mapping from our semantic forms to the COMLEX frames did not oversimplify the information in the automatically extracted subcategorization frames, we conducted a further set of experiments in which we converted the information in the COMLEX entries to the format of our extracted semantic forms. We explicitly differentiated between OBLs and OBJs by automatically checking whether each argument was coindexed with an NP or a PP.</Paragraph> <Paragraph position="1"> Table 14 COMLEX directional prepositions: about, across, along, around, behind, below, beneath, between, beyond, by, down, from, in, inside, into, off, on, onto, out, out of, outside, over, past, through, throughout, to, toward, towards, up, up to, via. Furthermore, as can be seen in the following example, COMLEX frame definitions contain details of the control patterns of sentential complements, encoded using the :features attribute. This allows for automatic discrimination between COMPs and XCOMPs. (vp-frame to-inf-sc :cs (vp 2 :mood to-infinitive :subject 1) :features (:control subject) :gs (:subject 1 :comp 2) :ex &quot;I wanted to come&quot;) The mapping is summarized in Table 16. The results of the subsequent evaluation are presented in Tables 17 and 18. We have added Experiments 2a and 3a. These are the same as Experiments 2 and 3, except that they additionally include the specific particle with each PART function. While the recall figures in Tables 17 and 18 are slightly lower than those in Tables 12 and 13, changing the mapping in this way results in an increase in precision in each case (by as much as 11.6%). 
The results of the lexical evaluation are consistently better than the baseline, in some cases by almost 16% (Experiment 2, threshold 5%). Notice that in contrast to Tables 12 and 13, in the more fine-grained COMLEX-LFG Mapping II presented here, all experiments exceed the baseline. 6.3.1 Directional Prepositions. The recall figures for Experiments 3 and 3a in Table 17 (24.0% and 21.5%) and Table 18 (19.7% and 17.4%) drop in a similar fashion to the results seen in Tables 12 and 13. For this reason, we again incorporated the list of 31 directional prepositions (Table 14) by default and reran Experiments 3 and 3a for a threshold of 1%. The results are presented in Table 19. The effect was as expected: The recall scores for the two experiments increased to 40.8% and 35.4% (from 24.0% and 21.5%), and the F-scores increased to 54.4% and 49.7% (from 35.9% and 33.0%).</Paragraph> <Paragraph position="2"> 6.3.2 Passive Frames. Using Mapping II, we also evaluated the passive semantic forms for 1,422 verb lemmas shared by the induced lexicon and COMLEX. We applied lexical-redundancy rules (Kaplan and Bresnan 1982) to automatically convert the active COMLEX frames to their passive counterparts: For example, subjects are demoted to optional oblique by-agents, and direct objects become subjects. The resulting precision was high (from 72.3% to 80.2% across the experiments), and there was the expected drop in recall when prepositional details were included (from 54.7% to 29.3%).</Paragraph> <Paragraph position="3"> 6.3.3 Absolute Thresholds. Many of the previous approaches discussed in Section 3 use a limited number of verbs for evaluation, based on the verbs' absolute frequency in the corpus. We carried out a similar experiment. Table 21 shows the results of Experiment 2 for all verbs, for the verb lemmas with an absolute frequency greater than 100, and for verbs with a frequency greater than 200. 
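The passive conversion rule described above (the active subject demoted to an optional oblique by-agent, the direct object promoted to subject) can be sketched as a frame rewrite. This is our own illustrative rendering of a lexical-redundancy rule, not the paper's implementation; the bracketed notation and the obl_ag:by label are assumptions.

```python
# Hedged sketch of an active-to-passive lexical-redundancy rule:
# [subj,obj,...] -> the object becomes subj, the subject becomes an
# optional by-phrase oblique agent (written obl_ag:by here).

def passivize(frame):
    """Return the passive counterpart of an active frame, or None."""
    funcs = frame.strip("[]").split(",")
    if "obj" not in funcs:
        return None  # no passive counterpart without a direct object
    funcs.remove("subj")                 # demote the active subject ...
    funcs[funcs.index("obj")] = "subj"   # ... and promote the object
    funcs.append("obl_ag:by")            # optional by-phrase agent
    return "[" + ",".join(funcs) + "]"

print(passivize("[subj,obj]"))       # [subj,obl_ag:by]
print(passivize("[subj,obj,obj2]"))  # [subj,obj2,obl_ag:by]
```

Applying such a rewrite to every active COMLEX frame yields the passive gold standard against which the 1,422 passive lemma entries were scored.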
The use of an absolute threshold results in an increase in precision (from 77.1% to 82.3% and 81.7%), an increase in recall (from 50.4% to 60.8% and 58.7%), and an overall increase in F-score (from 61.0% to 69.9% and 68.4%).</Paragraph> </Section> <Section position="4" start_page="353" end_page="354" type="sub_section"> <SectionTitle> 6.4 Penn-III (Mapping-II) </SectionTitle> <Paragraph position="0"> Recently we have applied our methodology to the Penn-III Treebank, a more balanced corpus resource with a number of text genres. Penn-III consists of the WSJ section from Penn-II as well as a parse-annotated subset of the Brown corpus. The Brown corpus comprises 24,242 trees compiled from a variety of text genres including popular lore, general fiction, science fiction, mystery and detective fiction, and humor. It has been shown (Roland and Jurafsky 1998) that the subcategorization tendencies of verbs vary across linguistic domains. Our aim, therefore, is to increase the scope of the induced lexicon not only in terms of the verb lemmas for which there are entries, but also in terms of the frames with which they co-occur. The f-structure annotation algorithm was extended with only minor amendments to cover the parsed Brown corpus. The most important of these was the way in which we distinguish between oblique and adjunct.</Paragraph> <Paragraph position="1"> We noted in Section 4 that our method of assigning an oblique annotation in Penn-II was precise, albeit conservative. Because of a change of annotation policy in Penn-III, the -CLR tag (indicating a close relationship between a PP and the local syntactic head), which we had previously exploited, is no longer used. 
For Penn-III, the algorithm annotates as obliques all PPs which do not carry a Penn adverbial functional tag (such as -TMP or -LOC) and which occur as sisters of the verbal head of a VP.</Paragraph> <Paragraph position="2"> In addition, the algorithm annotates as obliques PPs associated with -PUT (locative complements of the verb put) or -DTV (second object in ditransitives) tags.</Paragraph> <Paragraph position="3"> When evaluating the application of the lexical extraction system on Penn-III, we carried out two sets of experiments, identical in each case to those described for Penn-II in Section 6.3, including the use of relative (1% and 5%) rather than absolute thresholds.</Paragraph> <Paragraph position="4"> For the first set of experiments we evaluated the lexicon induced from the parse-annotated Brown corpus only. This evaluation was performed for 2,713 active-verb lemmas using the more fine-grained Mapping-II. Tables 22 and 23 show that the results generally exceed the baseline, in some cases by almost 10%, similar to those recorded for Penn-II (Tables 17 and 18). While the precision is slightly lower than that reported for the experiments in Tables 17 and 18, in particular for Experiments 2, 2a, 3, and 3a, in which details of obliques are included, the recall in each of these experiments is slightly higher than that recorded for Penn-II. We conjecture that the main reason for this is that the amended approach to the annotation of obliques is slightly less precise and conservative than the largely -CLR-tag-driven approach taken for Penn-II. Consequently, we record an increase in recall and a drop in precision. This trend is repeated in the second set of experiments. In this instance, we combined the lexicon extracted from the WSJ with that extracted from the parse-annotated Brown corpus, and evaluated the resulting resource for 3,529 active-verb lemmas. The results are shown in Tables 24 and 25. 
The results compare favorably with the baseline.</Paragraph> <Paragraph position="5"> The precision scores are lower (by between 1.5% and 9.7%) than those reported for Penn-II (Tables 17 and 18). There has, however, been a significant increase in recall (of up to 8.7%) and an overall increase in F-score (of up to 4.4%).</Paragraph> 
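The Penn-III oblique/adjunct decision described in Section 6.4 can be sketched as a predicate over a PP node. The representation of a PP as a set of functional tags plus a sisterhood flag is our own simplification, and the exact membership of the adverbial-tag set is an assumption beyond the -TMP and -LOC examples given in the text.

```python
# Sketch of the Penn-III oblique heuristic: a PP that is the sister of the
# verbal head of a VP and carries no adverbial functional tag is annotated
# as an oblique; PPs tagged -PUT or -DTV are obliques in any case.

# Assumed adverbial tag set; only -TMP and -LOC are named in the text.
ADVERBIAL_TAGS = {"TMP", "LOC", "ADV", "MNR", "PRP"}

def is_oblique(pp_tags, sister_of_verbal_head):
    """pp_tags: functional tags on the PP node, e.g. {'TMP'} or {'DTV'}."""
    if pp_tags & {"PUT", "DTV"}:
        return True
    return sister_of_verbal_head and not (pp_tags & ADVERBIAL_TAGS)

print(is_oblique(set(), True))    # True  - bare PP sister of the V head
print(is_oblique({"TMP"}, True))  # False - temporal adjunct
print(is_oblique({"DTV"}, False)) # True  - ditransitive second object
```

Because the bare-sister condition fires more often than the old -CLR-driven rule, this heuristic is less conservative, which is consistent with the recall gain and precision drop reported above.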
As noted above, it is well documented (Roland and Jurafsky 1998) that subcategorization frames (and their frequencies) vary across domains. We have extracted frames from two sources (the WSJ and the Brown corpus), whereas COMLEX was built using examples from the San Jose Mercury News, the Brown corpus, several literary works from the Library of America, scientific abstracts from the U.S. Department of Energy, and the WSJ. For this reason, it is likely to contain a greater variety of subcategorization frames than our induced lexicon. It is also possible that because of human error, COMLEX contains subcategorization frames whose validity is in doubt, for example, the overgeneration of subcategorized-for directional prepositional phrases. This is because the aim of the COMLEX project was to construct as complete a set of subcategorization frames as possible, even for infrequent verbs. Lexicographers were allowed to extrapolate from the citations found, a procedure which is bound to be less certain than the assignment of frames based entirely on existing examples. As a generalization, Briscoe (2001) notes that lexicons such as COMLEX tend to demonstrate high precision but low recall. Briscoe and Carroll (1997) report on manually analyzing an open-class vocabulary of 35,000 head words for predicate subcategorization information and comparing the results against the subcategorization details in COMLEX. Precision was quite high (95%), but recall was lower (84%). This has an effect on both the precision and recall scores of our system against COMLEX. In order to ascertain the effect of using COMLEX as a gold standard for our induced lexicon, we carried out some more-detailed error analysis, the results of which are summarized in Table 26. 
We randomly selected 80 false negatives (fns) and 80 false positives (fps) across a range of active frame types containing prepositional and particle detail taken from Penn-III and manually examined them in order to classify them as &quot;correct&quot; or &quot;incorrect.&quot; Of the 80 fps, 33 were manually judged to be legitimate subcategorization frames. For example, as Table 26 shows, there are a number of correct transitive verbs ([subj,obj]) in our automatically induced lexicon which are not included in COMLEX.</Paragraph> <Paragraph position="2"> This examination was also useful in highlighting the frame types on which the lexical extraction procedure was performing poorly, in our case, those containing XCOMPs and those containing OBJ2s. Out of 80 fns, 14 were judged to be incorrect when manually examined. These can be broken down as follows: one intransitive frame, three ditransitive frames, three frames containing a COMP, and seven frames containing an oblique were found to be invalid.</Paragraph> <Paragraph position="3"> 7. Lexical Accession Rates In addition to evaluating the quality of our extracted semantic forms, we also examined the rate at which they are induced. This can be expressed as a measure of the coverage of the induced lexicon on new data. Following Hockenmaier, Bierner, and Baldridge (2002), Xia (1999), and Miyao, Ninomiya, and Tsujii (2004), we extract a reference lexicon from Sections 02-21 of the WSJ. We then compare this to a test lexicon from Section 23. Table 27 shows the results of the evaluation of the coverage of an induced lexicon for verbs only. There is a corresponding semantic form in the reference lexicon for 89.89% of the verbs in Section 23. The remaining 10.11% of the entries in the test lexicon did not appear in the reference lexicon. Within this group, we can distinguish between known words, which have an entry in the reference lexicon, and unknown words, which do not exist at all in the reference lexicon. 
In the same way, we make the distinction between known frames and unknown frames. There are, therefore, four different cases in which an entry may not appear in the reference lexicon. Table 27 shows that the most common case is that of known verbs occurring with a different, although known, subcategorization frame (7.85%).</Paragraph> <Paragraph position="4"> The rate of accession may also be represented graphically. In Charniak (1996) and Krotov et al. (1998), it was observed that treebank grammars (CFGs extracted from treebanks) are very large and grow with the size of the treebank. We were interested in discovering whether the acquisition of lexical material from the same data displayed a similar propensity. Figure 8 graphs the rate of induction of semantic form and CFG rule types from Penn-III (the WSJ and parse-annotated Brown corpus combined). Because of the variation in the size of sections between the Brown and the WSJ, we plotted accession against word count. Figure 8 Comparison of accession rates for semantic form and CFG rule types for Penn-III (nonempty frames) (WSJ followed by Brown).</Paragraph> <Paragraph position="5"> The first part of the graph (up to 1,004,414 words) represents the rate of accession from the WSJ, and the final 384,646 words are those of the Brown corpus. The seven curves represent the following: the acquisition of semantic form types (nonempty) for all syntactic categories with and without specific preposition and particle information, the acquisition of semantic form types (nonempty) for all verbs with and without specific preposition and particle information, the number of lemmas associated with the extracted semantic forms, and the acquisition of CFG rule types. 
The curve representing the growth in the overall size of the lexicon is similar in shape to that of the PCFG, while the rate of increase in the number of verbal semantic forms (particularly when obliques and particles are excluded) appears to slow more quickly. Figure 8 shows the effect of the domain diversity of the Brown section in terms of increased growth rates from 1e+06 words onward. Figure 9 depicts the same information, this time extracted from the Brown section first, followed by the WSJ. The curves differ, but similar trends are represented. This time the effect of the domain diversity of the Brown section is discernible by comparing the absolute accession rates at the 0.4e+06-word mark between Figures 8 and 9.</Paragraph> <Paragraph position="6"> Figure 10 shows the result when we abstract away from semantic forms (verb-frame combinations) to subcategorization frames and plot their rate of accession. The graph represents the growth rate of frame types for Penn-III (WSJ followed by Brown, and Brown followed by WSJ). The curve rises sharply at first but gradually levels off, practically flattening out, despite the increase in the number of words.
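The shape of these accession curves can be reproduced with a simple cumulative count of new types: as each frame token is observed, record how many distinct types have been seen so far and plot that count against the running word count. A minimal sketch with an invented observation stream (not the actual treebank data):

```python
# Cumulative accession: after each observed frame token, how many distinct
# frame types have been seen so far?  Plotting this against word count
# yields curves of the kind shown in Figures 8-11.
def accession_curve(observations):
    seen = set()
    curve = []
    for frame in observations:
        seen.add(frame)
        curve.append(len(seen))
    return curve

# Invented stream: new types arrive early, then repeats dominate,
# so the curve flattens out (compare Figure 10).
stream = ["[subj]", "[subj,obj]", "[subj]", "[subj,obj,obj2]",
          "[subj,obj]", "[subj]"]
print(accession_curve(stream))  # → [1, 2, 2, 3, 3, 3]
```

The early plateau in the toy output mirrors the behavior of the preposition- and particle-free frame types, whereas frames carrying preposition detail keep admitting new types as more text is processed.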
This reflects the figures for Section 23 in Table 27, where we showed that although new verb-frame combinations occur, all of the frame types in Section 23 have already been seen by the lexical extraction program in previous sections.</Paragraph> <Paragraph position="7"> Figure 9 Comparison of accession rates for semantic form and CFG rule types for Penn-III (nonempty frames) (Brown followed by WSJ).</Paragraph> <Paragraph position="8"> Figure 10 Accession rates for frame types (without prepositions and particles) for Penn-III.</Paragraph> <Paragraph position="9"> Figure 11 shows that including information about prepositions and particles in the frames results in an accession rate which continues to grow, albeit ever more slowly, with the increasing size of the extraction data. This emphasizes the advantage of our approach, which extracts frames containing such information without the limitation of predefinition.</Paragraph> <Paragraph position="10"> Figure 11 Accession rates for frame types for Penn-III.</Paragraph> <Paragraph position="11"> 8. Conclusions and Further Work
We have presented an algorithm for the extraction of semantic forms (or subcategorization frames) from the Penn-II and Penn-III Treebanks, automatically annotated with LFG f-structures. In contrast to many other approaches, ours does not predefine the subcategorization frames we extract. We have applied the algorithm to the WSJ sections of Penn-II (50,000 trees) (O'Donovan et al. 2004) and to the parse-annotated Brown corpus of Penn-III (almost 25,000 additional trees). We extract syntactic-function-based subcategorization frames (LFG semantic forms) and traditional CFG category-based frames, as well as mixed-function-category-based frames.
Unlike many other approaches to subcategorization frame extraction, our system properly reflects the effects of long-distance dependencies. Also unlike many approaches, our method distinguishes between active and passive frames. Finally, our system associates conditional probabilities with the frames we extract. Making the distinction between the behavior of verbs in active and passive contexts is particularly important for the accurate assignment of probabilities to semantic forms. We carried out an extensive evaluation of the complete induced lexicon against the full COMLEX resource. To our knowledge, this is the most extensive qualitative evaluation of subcategorization extraction for English; the only evaluation of a similar scale is that carried out by Schulte im Walde (2002b) for German. The results reported here for Penn-II compare favorably against the baseline and, in fact, improve on those reported in O'Donovan et al. (2004). The results for the larger, more domain-diverse Penn-III lexicon are very encouraging, in some cases almost 15% above the baseline. We believe our semantic forms are fine-grained, and by choosing to evaluate against COMLEX, we set our sights high: COMLEX is considerably more detailed than the OALD or LDOCE used in earlier evaluations. Our error analysis also revealed some interesting issues associated with using an external standard such as COMLEX. In the future, we hope to evaluate the automatic annotations and extracted lexicon against Propbank (Kingsbury and Palmer 2002).</Paragraph> <Paragraph position="12"> Apart from the related approach of Miyao, Ninomiya, and Tsujii (2004), which does not distinguish between argument and adjunct prepositional phrases, our treebank- and automatic-f-structure-annotation-based architecture for the automatic acquisition of detailed subcategorization frames is quite unlike any of the architectures presented in the literature.
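The conditional probabilities mentioned above can be estimated, in the simplest case, as relative frequencies conditioned on both the lemma and its voice, so that active and passive occurrences of a verb do not contaminate each other's distributions. The following is a hedged sketch of such maximum-likelihood estimation over invented counts; the paper's actual estimation procedure may differ in detail:

```python
from collections import Counter, defaultdict

# Estimate P(frame | lemma, voice) as a relative frequency, keeping the
# active/passive distinction so that, e.g., passivized transitives do not
# inflate the probability of subject-only frames for the active verb.
def frame_probabilities(occurrences):
    by_key = defaultdict(Counter)
    for lemma, voice, frame in occurrences:
        by_key[(lemma, voice)][frame] += 1
    return {key: {frame: n / sum(counter.values())
                  for frame, n in counter.items()}
            for key, counter in by_key.items()}

# Invented (lemma, voice, frame) occurrences for illustration.
occs = [("give", "active", "[subj,obj,obj2]"),
        ("give", "active", "[subj,obj,obj2]"),
        ("give", "active", "[subj,obj]"),
        ("give", "passive", "[subj,obl:by]")]
p = frame_probabilities(occs)
print(round(p[("give", "active")]["[subj,obj,obj2]"], 3))  # → 0.667
```

Conditioning on voice is exactly what keeps the single passive occurrence above from lowering the probability of the active ditransitive frame.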
Subcategorization frames are reverse-engineered from the treebank and are almost a byproduct of the automatic f-structure annotation algorithm. It is important to realize that the induction of lexical resources is part of a larger project on the acquisition of wide-coverage, robust, probabilistic, deep unification grammar resources from treebanks (Burke, Cahill, et al. 2004b). We are already using the extracted semantic forms in parsing new text with robust, wide-coverage probabilistic LFG grammar approximations automatically acquired from the f-structure-annotated Penn-II treebank, specifically in the resolution of LDDs, as described in Cahill, Burke, et al. (2004). We hope to apply our lexical acquisition methodology beyond existing parse-annotated corpora (Penn-II and Penn-III): new text is parsed by our probabilistic LFG approximations into f-structures, from which we can then extract further semantic forms. The work reported here provides core components for bootstrapping this approach.</Paragraph> <Paragraph position="13"> In the shorter term, we intend to make the extracted subcategorization lexicons from Penn-II and Penn-III available as a downloadable public-domain research resource.</Paragraph> <Paragraph position="14"> We have also applied our more general unification grammar acquisition methodology to the TIGER Treebank (Brants et al. 2002) and the Penn Chinese Treebank (Xue, Chiou, and Palmer 2002), extracting wide-coverage, probabilistic LFG grammar approximations and lexical resources for German (Cahill et al. 2003) and Chinese (Burke, Lam, et al. 2004). The lexical resources, however, have not yet been evaluated. This, and much else, must await further research.</Paragraph> </Section> </Section> class="xml-element"></Paper>