File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/02/w02-2001_metho.xml
Size: 23,990 bytes
Last Modified: 2025-10-06 14:08:07
<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-2001">
<Title>Extracting the Unextractable: A Case Study on Verb-particles</Title>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Method-1: Simple POS-based Extraction </SectionTitle>
<Paragraph position="0"> One obvious method for extracting VPCs is to run a simple regular expression over the output of a part-of-speech (POS) tagger, based on the observation that the Penn Treebank POS tagset, for example, contains a dedicated particle tag (RP). Given that all particles are governed by a verb, extraction consists simply of locating each particle and searching back (to the left of the particle, as particles cannot be passivised or otherwise extraposed) for the head verb of the VPC.</Paragraph>
<Paragraph position="1"> Here and for the subsequent methods, we assume that the maximum word length for NP complements in the split configuration for transitive VPCs is 5 (the same as the maximum span length of 5 used by Smadja (1993), and above the maximum attested NP length of 3 from our corpus study; see Section 2.2), i.e. that an NP "heavier" than this would occur more naturally in the joined configuration. We thus discount all particles which are more than 5 words from their governing verb. Additionally, we extracted a set of 73 canonical particles from the LinGO-ERG, and used this to filter out extraneous particles in the POS data.</Paragraph>
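As a rough illustration (not from the original paper), the extraction step just described might be sketched as follows in Python; the verb tag set, the handling of the 5-word window and the input format are assumptions made here for concreteness.

```python
# Illustrative sketch of Method-1: locate RP-tagged particles and search
# back (allowing at most 5 intervening words) for the governing verb.
# Names and the (lemma, POS) input format are assumptions, not the
# authors' actual code.

VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
MAX_SPAN = 5  # maximum NP length in the split configuration

def extract_vpcs(tagged_sent, canonical_particles):
    """tagged_sent: list of (lemma, pos) pairs for one sentence."""
    vpcs = []
    for i, (lemma, pos) in enumerate(tagged_sent):
        if pos != "RP" or lemma not in canonical_particles:
            continue  # filter out extraneous particles
        # search left for the governing verb, allowing up to MAX_SPAN
        # words to intervene between verb and particle
        for j in range(i - 1, max(i - MAX_SPAN - 2, -1), -1):
            verb_lemma, verb_pos = tagged_sent[j]
            if verb_pos in VERB_TAGS:
                vpcs.append((verb_lemma, lemma))
                break
    return vpcs

print(extract_vpcs(
    [("hand", "VBD"), ("the", "DT"), ("paper", "NN"), ("in", "RP")],
    {"in", "up", "out", "down"}))  # [('hand', 'in')]
```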
<Paragraph position="2"> In line with our assumption of raw text to extract over, we use the Brill tagger (Brill, 1995) to automatically tag the WSJ, rather than making use of the manual POS annotation provided in the Penn Treebank. We further lemmatise the data using morph (Minnen et al., 2001) and extract VPCs based on the Brill tags. This produces a total of 135 VPCs, which we evaluate according to the standard metrics of precision (Prec), recall (Rec) and F-score (Fβ=1).</Paragraph>
<Paragraph position="3"> Note that here and for the remainder of this paper, precision is calculated according to the manual annotation for the combined total of 4,173 VPC candidate types extracted by the various methods described in this paper, whereas recall is relative to the 62 attested VPCs from the Alvey Tools data as described above.</Paragraph>
<Paragraph position="4"> As indicated in the first line of Table 1 ("Brill"), the simple POS-based method results in a precision of 1.000, recall of 0.177 and F-score of 0.301.</Paragraph>
<Paragraph position="5"> In order to determine the upper bound on performance for this method, we ran the extraction method over the original tagging from the Penn Treebank. This resulted in an F-score of 0.774 ("Penn" in Table 1). The primary reason for the large disparity between the Brill tagger output and the original Penn Treebank annotation is that it is notoriously difficult to differentiate between particles, prepositions and adverbs (Toutanova and Manning, 2000). Over the WSJ, the Brill tagger achieves a modest tag recall of 0.103 for particles, and a tag precision of 0.838. That is, it is highly conservative in allocating particle tags, to the extent that it recognises only two particle types for the whole of the WSJ: out and down.</Paragraph>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Method-2: Simple Chunk-based Extraction </SectionTitle>
<Paragraph position="0"> To overcome the shortcomings of the Brill tagger in identifying particles, we next look to full chunk parsing.</Paragraph>
<Paragraph position="1"> Full chunk parsing involves partitioning a text into syntactically-cohesive, head-final segments ("chunks"), without attempting to resolve inter-chunk dependencies. In the chunk inventory devised for the CoNLL-2000 text chunking shared task (Tjong Kim Sang and Buchholz, 2000), a dedicated particle chunk type once again exists. It is therefore possible to adopt an approach analogous to that of Method-1, identifying particle chunks and then working back to locate the verb each particle chunk is associated with.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.1 Chunk parsing method </SectionTitle>
<Paragraph position="0"> In order to chunk parse the WSJ, we first tagged the full WSJ and Brown corpora using the Brill tagger, and then converted them into chunks based on the original Penn Treebank parse trees, with the aid of the conversion script used in preparing the CoNLL-2000 shared task data (the gold standard chunk data for the WSJ was used only in evaluating chunking performance and in establishing upper bounds on the performance of the various extraction methods). We next lemmatised the data using morph (Minnen et al., 2000), and chunk parsed the WSJ with TiMBL 4.1 (Daelemans et al., 2001) using the Brown corpus as training data. TiMBL is a memory-based classification system based on the k-nearest neighbour algorithm, which takes as training data a set of fixed-length feature vectors pre-classified according to an information field. For each test instance described over the same feature vector, it returns the "neighbours" at the k nearest distances to the test instance and classifies the test instance according to the class distribution over those neighbours. TiMBL provides powerful functionality for determining the relative distance between different values of a given feature in the form of MVDM, and also supports weighted voting between neighbours in classifying inputs, e.g. in the form of inverse distance weighting.</Paragraph>
<Paragraph position="2"> We ran TiMBL based on the feature set described in Veenstra and van den Bosch (2000), that is, using the word lemmata and POS tags of the 5 words to the left and the 3 words to the right of each focus word, along with the POS tag and lemma of the focus word itself. We set k to 5, ran MVDM over only the POS tags (based on the results of Veenstra and van den Bosch (2000) and the observation that MVDM is temperamental over sparse data such as word lemmata), and used inverse distance weighting, but otherwise ran TiMBL with the default settings.</Paragraph>
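For concreteness, a minimal sketch of how the windowed feature vectors handed to the memory-based learner might be assembled is given below; the padding symbol, data format and function names are assumptions, not the authors' actual encoding.

```python
# Illustrative sketch: build a fixed-length feature vector for one focus
# word, using the lemma and POS of the 5 words to the left and 3 words to
# the right, plus the focus word itself (after Veenstra and van den
# Bosch, 2000). The chunk tag of the focus word is the class to predict.

PAD = ("_", "_")  # assumed padding for positions outside the sentence

def chunk_features(sent, i, left=5, right=3):
    """sent: list of (lemma, pos) pairs; i: index of the focus word."""
    window = []
    for j in range(i - left, i + right + 1):
        window.append(sent[j] if 0 <= j < len(sent) else PAD)
    # flatten into alternating lemma/POS features
    return [value for pair in window for value in pair]

sent = [("we", "PRP"), ("hand", "VBP"), ("the", "DT"),
        ("paper", "NN"), ("in", "RP")]
print(chunk_features(sent, 1))  # 18 features: 9 positions x (lemma, POS)
```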
<Paragraph position="3"> We evaluated the basic TiMBL method over both the full WSJ data, training on the Brown section of the Penn Treebank, and over the original shared task data from CoNLL-2000, the results for which are presented in Table 2. Note that, as in the CoNLL-2000 shared task, precision, recall and F-score are all evaluated at the chunk rather than the word level. The F-score of 0.919 for the CoNLL-2000 data is roughly the median score attained by systems participating in the original task, and slightly higher than the F-score of 0.915 reported by Veenstra and van den Bosch (2000), due to the use of word lemmata rather than surface forms, and also inverse distance weighting. The drop-off in performance between the CoNLL data and the full WSJ arises because the CoNLL training and test data come from a homogeneous data source, namely a subsection of the WSJ, whereas the Brown corpus is used as the training data in chunking the full extent of the WSJ.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.2 Extraction method </SectionTitle>
<Paragraph position="0"> Having chunk-parsed the WSJ in the manner described above, we next set about extracting VPCs by identifying each particle chunk and searching back for the governing verb. As for Method-1, we allow a maximum of 5 words to intercede between a particle and its governing verb, and we apply the additional stipulation that the only chunks that can occur between the verb and the particle are: (a) noun chunks, (b) preposition chunks adjoining noun chunks, and (c) adverb chunks found in our closed set of particle pre-modifiers (see Section 2.1, and the sketch below). Additionally, we used the gold standard set of 73 particles to filter out extraneous particle chunks, as for Method-1 above. The results for chunk-based extraction are presented in Table 3, evaluated over the chunk parser output ("TiMBL") and also the gold-standard chunk data for the WSJ ("Penn"). These results are significantly better than those for Method-1 over the Brill output and Penn data, respectively, both in terms of the raw number of VPCs extracted and F-score.</Paragraph>
<Paragraph position="1"> One reason for the relative success of extracting over chunker output as compared to tagger output is that our chunker was considerably more successful than the Brill tagger at annotating particles, returning an F-score of 0.737 over particle chunks (precision=0.786, recall=0.693). The stipulations on particle type and on what could occur between a verb and a particle chunk were crucial in maintaining a high VPC extraction precision, relative to both particle chunk precision and the gold standard extraction precision. As can be seen from the upper bound on recall (i.e. recall over the gold standard chunk data), however, this method has limited applicability.</Paragraph>
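A minimal sketch of the chunk-level constraint used in this extraction step (what may occur between the verb chunk and the candidate particle chunk) follows; the chunk-type labels and the members of the pre-modifier set are assumptions, since the closed set itself is defined in Section 2.1 of the paper, which is not reproduced here.

```python
# Illustrative sketch of the Method-2 stipulation: between a verb chunk
# and a candidate particle chunk, only noun chunks, preposition chunks
# adjoining noun chunks, and particle pre-modifier adverb chunks may
# occur. PRE_MODIFIERS is a hypothetical stand-in for the closed set.

PRE_MODIFIERS = {"right", "straight", "back", "well"}

def licensed_intervening(chunks):
    """chunks: list of (chunk_type, head_lemma) between verb and particle."""
    for k, (ctype, head) in enumerate(chunks):
        if ctype == "NP":
            continue
        # preposition chunk is only licensed if it adjoins a noun chunk
        if ctype == "PP" and k + 1 < len(chunks) and chunks[k + 1][0] == "NP":
            continue
        if ctype == "ADVP" and head in PRE_MODIFIERS:
            continue
        return False
    return True

print(licensed_intervening([("NP", "paper")]))  # True
print(licensed_intervening([("VP", "want")]))   # False
```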
</Section>
</Section>
<Section position="6" start_page="0" end_page="0" type="metho">
<SectionTitle> 5 Method-3: Chunk Grammar-based Extraction </SectionTitle>
<Paragraph position="0"> The principal weakness of Method-2 was recall, leading us to implement a rule-based chunk sequencer which searches for particles in prepositional and adverbial chunks as well as particle chunks. In essence, the sequencer takes each verb chunk in turn and searches to the right for a single-word particle, prepositional or adverbial chunk which is contained in the gold standard set of 73 particles. For each such chunk pair, it then analyses: (a) the chunks which occur between them, to ensure that at most an NP and a particle pre-modifier adverb chunk are found; (b) the chunks that occur immediately after the particle/preposition/adverb chunk, to check for a clause boundary or NP; and (c) the clause context of the verb chunk, for possible extraposition of an NP verbal complement through passivisation or relativisation. The objective of this analysis is both to determine the valence of the VPC candidate (intransitive or transitive) and to identify evidence either supporting or rejecting a VPC analysis. Evidence for or against a VPC analysis takes the form of congruence with the known linguistic properties of VPCs, as described in Section 2.1. For example, if a pronominal noun chunk were found to occur immediately after the (possible) particle chunk (e.g. *see off him), a VPC analysis would not be possible. Alternatively, if a punctuation mark (e.g. a full stop) were found to occur immediately after the "particle" chunk and nothing interceded between the verb and particle chunks, then this would be evidence for an intransitive VPC analysis.</Paragraph>
<Paragraph position="1"> The chunk sequencer is not able to furnish positive or negative evidence for a VPC analysis in all cases. Indeed, in a high proportion of instances, a noun chunk (=NP) was found to follow the "particle" chunk, leading to ambiguity between analysis as a VPC, a prepositional verb or a free verb-preposition combination (see Section 2.1), or, in the case that an NP occurs between the verb and the particle, the "particle" being the head of a PP post-modifying an NP. As a case in point, the VP hand the paper in here could take any of the following structures: (1) hand [the paper] [in] [here] (transitive VPC hand in with adjunct NP here), (2) hand [the paper] [in here] (transitive prepositional verb hand in, or simple transitive verb with PP adjunct), and (3) hand [the paper in here] (simple transitive verb). In such cases, we can choose either (a) to avoid committing ourselves to any one analysis, and ignore all such ambiguous cases, or (b) to use some means of resolving the attachment ambiguity (i.e. whether the NP is governed by the verb, resulting in a VPC, or by the preposition, resulting in a prepositional verb or free verb-preposition combination). In the latter case, we use an unsupervised attachment disambiguation method based on the log-likelihood ratio ("LLR", Dunning (1993)). That is, we use the chunker output to enumerate all the verb-preposition, preposition-noun and verb-noun bigrams in the WSJ data, based on chunk heads rather than strict word bigrams. We then use frequency data to pre-calculate the LLR for each such type. In the case that the verb and "particle" are joined (i.e. no NP occurs between them), we simply compare the LLR of the verb-noun and particle-noun pairs, and assume a VPC analysis in the case that the former is strictly larger than the latter. In the case that the verb and "particle" are split (i.e. we have the chunk sequence VC NC1 PC NC2, where VC = verb chunk, NC = noun chunk and PC = (intransitive or transitive) preposition chunk), we calculate three scores: (1) the product of the LLRs for (the heads of) VC-PC and VC-NC2 (analysis as a VPC, with NC2 as an NP adjunct of the verb); (2) the product of the LLRs for NC1-PC and PC-NC2 (transitive verb analysis, with the PP modifying NC1); and (3) the product of the LLRs for VC-PC and PC-NC2 (analysis as a prepositional verb or free verb-preposition combination). Only in the case that the first of these scores is strictly greater than the other two do we favour a (transitive) VPC analysis.</Paragraph>
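A minimal sketch of this split-configuration test is given below, assuming the log-likelihood ratios have already been computed over chunk-head bigrams; the function name and the LLR values in the example are invented for illustration.

```python
# Illustrative sketch of the split-configuration disambiguation: given
# the chunk sequence VC NC1 PC NC2, a (transitive) VPC analysis is
# favoured only if score (1) strictly exceeds scores (2) and (3).
# llr is assumed to be a pre-computed table of log-likelihood ratios
# over chunk-head bigrams (Dunning, 1993).

def favour_vpc_split(vc, nc1, pc, nc2, llr):
    """vc, nc1, pc, nc2: chunk heads; llr: dict mapping head pairs to LLR."""
    s1 = llr[(vc, pc)] * llr[(vc, nc2)]   # (1) VPC, NC2 as adjunct of the verb
    s2 = llr[(nc1, pc)] * llr[(pc, nc2)]  # (2) transitive verb, PP modifying NC1
    s3 = llr[(vc, pc)] * llr[(pc, nc2)]   # (3) prepositional verb / free combination
    return s1 > s2 and s1 > s3

llr = {("hand", "in"): 4.2, ("hand", "here"): 2.1,
       ("paper", "in"): 0.8, ("in", "here"): 1.5}
print(favour_vpc_split("hand", "paper", "in", "here", llr))  # True
```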
<Paragraph position="2"> Based on the positive and negative grammatical evidence from above, for both intransitive and transitive VPC analyses, we generate four frequency-based features. The optional addition of data derived through attachment resolution, again for both intransitive and transitive VPC analyses, provides another two features. These features can be combined in either of two ways: (1) in a rule-based fashion, where a given verb-preposition pair is extracted as a VPC only in the case that there is positive and no negative evidence for either an intransitive or a transitive VPC analysis ("Rule" in Table 4); and (2) according to a classifier, using TiMBL to train over the auto-chunked Brown data, with the same basic settings as for chunking, with the exception that each feature is numeric and MVDM is not used (results presented as "Timbl" in Table 4). We also present upper bound results for the classifier-based method using gold standard chunk data, rather than the chunker output ("Penn"). For each of these three basic methods, we present results with and without the attachment-resolved data ("+att").</Paragraph>
<Paragraph position="3"> Based on the results in Table 4, the classifier-based method ("Timbl") is superior not only to the rule-based method ("Rule"), but also to Method-1 and Method-2. While the rule-based method degrades significantly when the attachment data is factored in, the classifier-based method remains at the same basic F-score value, undergoing a drop in precision but an equivalent gain in recall, and gaining more than 120 correct VPCs in the process. Rule+att returns the highest recall value of all the automatic methods to date at 0.823, at the cost of low precision at 0.304. This points to the attachment disambiguation method having high recall but low precision. Timbl+att and Penn+att are equivalent in terms of precision, but the Penn data leads to considerably better recall.</Paragraph>
<Paragraph position="4"> 6 Improving on the Basic Methods </Paragraph>
<Paragraph position="5"> Comparing the results for the three basic methods, it is apparent that Method-1 and Method-2 offer higher precision while Method-3 offers higher recall. In order to capitalise on the respective strengths of the different methods, in this section we investigate the possibility of combining the outputs of the four methods into a single consolidated classifier. System combination is achieved by taking the union of all VPC outputs from all systems, and constructing for each a vector of frequency-based features, based on the outputs of the different methods for the VPC in question. For each of Method-1 and Method-2, a single feature is used, describing the total number of occurrences of the given VPC detected by that method. For Method-3, we retain the 6 features used as input to Timbl+att, namely the frequency with which positive and negative evidence was detected, and also the frequency of VPCs detected through attachment resolution, for both intransitive and transitive VPCs. Training data comes from the output of the different methods over the Brown corpus; the chunking data for Method-2 and Method-3 was generated using the WSJ gold standard chunk data as training data, analogously to the method used to chunk parse the WSJ.</Paragraph>
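As an illustration, the per-candidate feature vector just described might be represented as follows; the field names are assumptions introduced here for readability, and the example values are invented.

```python
# Illustrative sketch of the per-VPC feature vector used for system
# combination: one count from Method-1, one from Method-2, and the six
# frequency-based features from Method-3 (positive/negative evidence and
# attachment-resolved counts, for intransitive and transitive analyses).

from typing import NamedTuple

class VPCFeatures(NamedTuple):
    m1_count: int        # occurrences detected by Method-1
    m2_count: int        # occurrences detected by Method-2
    m3_pos_intrans: int  # Method-3 positive evidence, intransitive
    m3_neg_intrans: int  # Method-3 negative evidence, intransitive
    m3_pos_trans: int    # Method-3 positive evidence, transitive
    m3_neg_trans: int    # Method-3 negative evidence, transitive
    m3_att_intrans: int  # attachment-resolved count, intransitive
    m3_att_trans: int    # attachment-resolved count, transitive

candidate = VPCFeatures(3, 5, 2, 0, 7, 1, 4, 2)
print(list(candidate))  # the numeric vector handed to the classifier
```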
<Paragraph position="6"> The result of this simple combination process is presented in the first line of Table 5 ("Combine"). Encouragingly, we achieved exactly the same recall as the best of the simple methods (Timbl+att) at 0.710, and a significantly higher F-score than any individual method, at 0.731.</Paragraph>
<Paragraph position="7"> Steeled by this initial success, we further augment the feature space with features describing the frequency of occurrence of: (a) the particle in the corpus, and (b) deverbal noun and adjective forms of the VPC in the corpus (e.g. turnaround, dried-up), determined through a simple concatenation operation optionally inserting a hyphen (sketched below). The first of these is intended to reflect the fact that high-frequency particles (e.g. up, over) are more productive (i.e. are found in novel VPCs more readily) than low-frequency particles; we also experimented with a similar feature describing verb frequency, but found it either to degrade or to have no effect on classifier performance. The deverbal feature is intended to reflect the fact that VPCs have the potential to undergo deverbalisation, whereas prepositional verbs and free verb-preposition combinations do not (note that only a limited number of VPCs can be deverbalised in this manner: of the 62 VPCs attested in the WSJ, only 8 had a deverbal usage). We additionally added features describing: (a) the number of letters in the verb lemma, (b) the verb lemma, and (c) the particle lemma. The first of these is intended to capture the informal observation that shorter verbs tend to be more productive than longer verbs (which offers one possible explanation for the anomalous call/ring/phone/*telephone up). The second and third features are intended to capture this same productivity effect, but at the individual word level. Note that as TiMBL treats all features as fully independent, it is not able to directly pick up on the gold standard verb-particle pairs in the training data to select in the test data.</Paragraph>
<Paragraph position="10"> The expanded set of features was used to re-evaluate each of: Method-2 (M-2' in Table 5); the classifier version of Method-3, with and without attachment-resolved data (M-3+att'); and the simple system combination method (Combine').</Paragraph>
<Paragraph position="11"> Additionally, we calculated an upper bound for the expanded feature set based on the gold standard data for each of the methods (Combine'/Penn in Table 5). The results for these five consolidated methods are presented in Table 5.</Paragraph>
<Paragraph position="12"> The addition of the 7 new features leads to an appreciable gain in both precision and recall for all methods, with the system combination method once again proving to be the best performer, at an F-score of 0.865. The differential between the system combination method trained over auto-generated POS and chunk data (Combine') and that trained over gold standard data (Combine'/Penn) is still tangible, but considerably less than for any of the individual methods. Importantly, Combine' outperforms the gold standard results for each of the individual methods. Examples of false positives (i.e. verb-prepositions misclassified as VPCs) returned by this final system configuration are firm away, base on and very off.</Paragraph>
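The deverbal feature mentioned above relies on a simple string operation; a sketch of how candidate deverbal forms might be generated and counted follows, with a hypothetical corpus-frequency lookup standing in for whatever resource the authors actually used.

```python
# Illustrative sketch: generate candidate deverbal noun/adjective forms
# of a VPC by concatenation, optionally inserting a hyphen (e.g.
# turn + around -> turnaround, dried + up -> dried-up), and sum their
# corpus frequencies. corpus_freq is a hypothetical frequency table.

def deverbal_frequency(verb, particle, corpus_freq):
    candidates = {verb + particle, verb + "-" + particle}
    return sum(corpus_freq.get(form, 0) for form in candidates)

corpus_freq = {"turnaround": 12, "dried-up": 3}
print(deverbal_frequency("turn", "around", corpus_freq))  # 12
print(deverbal_frequency("dried", "up", corpus_freq))     # 3
```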
<Paragraph position="14"> In Section 1, we made the claim that VPCs are highly productive and domain-specific. We validate this claim by comparing the 1000 VPCs correctly extracted by the Combine' method against both the LinGO-ERG and the relatively broad-coverage Alvey Tools VPC inventory. The 28 March 2002 version of the LinGO-ERG contains a total of 300 intransitive and transitive VPC types, of which 195 were contained in the 1000 correctly-extracted VPCs. Feeding the remaining 805 VPCs into the grammar (with a lexical type describing their transitivity) would therefore result in an almost four-fold increase in the total number of VPCs, and increase the chances of the grammar being able to parse WSJ-style text. The Alvey Tools data contains a total of 2254 VPC types. Of the 1000 extracted VPCs, 284, or slightly over 28%, were not contained in the Alvey data, with examples including head down, blend together and bid up. Combining this result with that for the LinGO-ERG, one can see that we are not simply extracting information already at our fingertips, but are accessing significant numbers of novel VPC types.</Paragraph>
</Section>
<Section position="7" start_page="0" end_page="0" type="metho">
<SectionTitle> 7 Related research </SectionTitle>
<Paragraph position="0"> There is a moderate amount of research related to the extraction of VPCs, or more generally phrasal verbs, which we briefly describe here.</Paragraph>
<Paragraph position="1"> One of the earliest attempts at extracting "interrupted collocations" (i.e. non-contiguous collocations, including VPCs) was that of Smadja (1993). Smadja based his method on bigrams, but unlike conventional collocation work, described bigrams by way of the triple ⟨word1, word2, posn⟩, where posn is the number of words occurring between word1 and word2 (up to 4). For VPCs, we can reasonably expect from 0 to 4 words to occur between the verb and the particle, leading to 5 distinct variants of the same VPC and no motivated way of selecting between them. Smadja did not attempt to evaluate his method other than anecdotally, making any comparison with our research impossible.</Paragraph>
<Paragraph position="2"> The work of Blaheta and Johnson (2001) is closer in its objectives to our research, in that it takes a parsed corpus and extracts multiword verbs (i.e. VPCs and prepositional verbs) through the use of log-linear models. Once again, direct comparison with our results is difficult, as Blaheta and Johnson output a ranked list of all verb-preposition pairs and subjectively evaluate the quality of different sections of the list. Additionally, they make no attempt to distinguish VPCs from prepositional verbs.</Paragraph>
<Paragraph position="4"> The method which is perhaps closest to ours is that of Kaalep and Muischnek (2002) in extracting Estonian multiword verbs (which are similar to English VPCs in that the components of the multiword verb can be separated by other words).
Kaalep and Muischnek apply the "mutual expectation" test over a range of "positioned bigrams", similar to those used by Smadja. They test their method over three different corpora, with results ranging from a precision of 0.21 and recall of 0.86 (F-score=0.34) for the smallest corpus, to a precision of 0.03 and recall of 0.85 (F-score=0.06) for the largest corpus. That is, high levels of noise are evident in the system output, and the F-score values are well below those achieved by our method for English VPCs.</Paragraph>
</Section>
</Paper>