<?xml version="1.0" standalone="yes"?> <Paper uid="P93-1042"> <Title>HOW DO WE COUNT? THE PROBLEM OF TAGGING PHRASAL VERBS IN PARTS</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. PHRASAL VERBS </SectionTitle> <Paragraph position="0"> The basic assumption underlying the stochastic process is the notion of independence. Words are defined as units separated by spaces and then undergo statistical approximations. As a result, the elements of a phrasal verb are treated as two individual words, each with its own lexical probability (i.e. the probability of observing part of speech i given word j). An interesting pattern emerges when we examine the errors involving phrasal verbs. A phrasal verb such as sum up will be tagged by PARTS as noun + preposition instead of verb + particle. This error influences the tagging of other words in the sentence as well. One typical error is found in infinitive constructions, where a phrase like to gun down is tagged as INTO NOUN IN (a prepositional 'to' followed by a noun followed by another preposition). Words like gun, back, and sum, in isolation, have a very high probability of being nouns as opposed to verbs, which results in the misclassification described above. However, when these words are followed by a particle, they are usually verbs, and in the infinitive construction, always verbs.</Paragraph> </Section> <Section position="5" start_page="0" end_page="289" type="metho"> <SectionTitle> 2.1. THE HYPOTHESIS </SectionTitle> <Paragraph position="0"> The error appears to follow from the operation of the stochastic process itself. In a trigram model the probability of each word is calculated by taking into consideration two elements: the lexical probability (the probability of the word bearing a certain tag) and the contextual probability (the probability of a word bearing a certain tag given the two previous parts of speech). 
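The interaction of the two probabilities can be sketched as follows. The 99-out-of-102 count for gun is the Brown Corpus figure cited in this paper; the contextual probabilities and the tag labels are invented for illustration and are not PARTS' actual estimates:

```python
# Toy sketch of how a trigram tagger scores one candidate tag for a word.
# Lexical probability P(tag given word); 99/102 is the Brown Corpus count
# for "gun" cited in this paper.
lexical = {
    ("gun", "NN"): 99 / 102,   # "gun" as a noun
    ("gun", "VB"): 3 / 102,    # "gun" as a verb
}

# Contextual probability P(tag given two previous tags); hypothetical
# values. "PP TO" stands for a pronoun followed by infinitival "to".
contextual = {
    ("PP", "TO", "VB"): 0.83,  # infinitival "to" strongly predicts a verb
    ("PP", "TO", "NN"): 0.05,
}

def score(word, tag, prev2, prev1):
    """Combine lexical and contextual probability for a candidate tag."""
    return lexical.get((word, tag), 0.0) * contextual.get((prev2, prev1, tag), 0.0)

# Although the context favors VB by a wide margin, the skewed lexical
# probability makes NN win, reproducing the error pattern described above.
nn = score("gun", "NN", "PP", "TO")   # roughly 0.0485
vb = score("gun", "VB", "PP", "TO")   # roughly 0.0244
assert nn > vb
```

With these illustrative numbers the lexical bias toward noun overrides the contextual preference for verb, which is exactly the failure mode the hypothesis attributes to the model.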
As a result, if an element has a very high lexical probability of being a noun (gun is a noun in 99 out of 102 occurrences in the Brown Corpus), it will not only influence but will actually override the contextual probability, which might suggest a different assignment. In the case of to gun down the ambiguity of to is compounded by the ambiguity of gun, and a mistake in tagging gun will automatically lead to an incorrect tagging of to as a preposition.</Paragraph> <Paragraph position="1"> It follows that the tagger should perform poorly on phrasal verbs in those cases where the ambiguous element occurs much more frequently as a noun (or any other element that is not a verb). The tagger will experience fewer problems handling this construction when the ambiguous element is a verb in the vast majority of instances. If this is true, the model should be changed to take into consideration the dependency between the verb and the particle in order to optimize the performance of the tagger.</Paragraph> </Section> <Section position="6" start_page="289" end_page="289" type="metho"> <SectionTitle> 3. THE EXPERIMENT 3.1. DATA </SectionTitle> <Paragraph position="0"> The first step in testing this hypothesis was to evaluate the current performance of PARTS in handling the phrasal verb construction. To do this, a set of 94 Verb+Particle/Preposition pairs was chosen to represent a range of dominant frequencies, from overwhelmingly noun to overwhelmingly verb. Twenty example sentences were randomly selected for each pair using an on-line corpus called MODERN, which is a collection of several corpora (Brown, WSJ, AP88-92, HANSE, HAROW, WAVER, DOE, NSF, TREEBANK, and DISS) totaling more than 400 million words. These sentences were first tagged manually to provide a baseline and then tagged automatically using PARTS. The a priori option of assuming only a verbal tag for all the pairs in question was also explored, in order to test whether this simple solution would be appropriate in all cases. 
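The three-way comparison in this setup (hand-tagged baseline versus PARTS versus always-verb) can be sketched with an invented mini-dataset; none of these rows are actual items or figures from the study:

```python
# Invented mini-dataset illustrating the evaluation described above.
# Each row: (pair, hand-tagged gold, PARTS output, always-verb output)
# for the pair's first element. Not real study data.
examples = [
    ("sum up",   "VB", "NN", "VB"),  # true phrasal verb that PARTS misses
    ("gun down", "VB", "NN", "VB"),
    ("box in",   "NN", "NN", "VB"),  # adjacent "box in" is usually noun + preposition
]

def accuracy(pred_idx):
    """Fraction of rows where a strategy matches the gold tag (index 1)."""
    hits = sum(1 for row in examples if row[1] == row[pred_idx])
    return hits / len(examples)

parts_acc = accuracy(2)        # PARTS gets only "box in" right here
always_verb_acc = accuracy(3)  # always-verb gets "box in" wrong
```

The toy rows mirror the two error directions discussed in the results: PARTS under-tags true phrasal verbs as nouns, while the always-verb rule fails on noun + preposition sequences like box in.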
The accuracy of the three tagging approaches was evaluated.</Paragraph> </Section> <Section position="7" start_page="289" end_page="290" type="metho"> <SectionTitle> 3.2. RESULTS </SectionTitle> <Paragraph position="0"> Table 2 presents a sample of the pairs examined in the first column, PARTS performance for each pair in the second, and the results of assuming a verbal tag in the third. (The &quot;choice&quot; column is explained below.) The average performance of PARTS for this task is 89%, which is lower than the general average performance of the tagger as claimed in Church 88. Yet we notice that simply assigning a verbal tag to all pairs actually degrades performance, because in some cases the content word is almost always a noun rather than a verb. For example, a phrasal verb like box in generally appears with an intervening object (to box something in), and thus when box and in are adjacent (except for those rare cases involving heavy NP shift) box is a noun.</Paragraph> <Paragraph position="1"> Thus we see that there is a need to distinguish between the cases where the two-element sequence should be considered as one word for the purpose of assigning the lexical probability (i.e., a phrasal verb) and cases where we have a Noun + Preposition combination, where PARTS' analyses will be preferred. The &quot;choice&quot; column, which selects whichever of the two analyses yields the higher performance score, improves the performance of PARTS from 89% to 96% for this task, and reduces the errors in other constructions involving phrasal verbs.</Paragraph> <Paragraph position="2"> When is this alternative needed? In the cases where PARTS had 10% or more errors, most of the verbs occur much more often as nouns or adjectives. This confirms my hypothesis that PARTS will have a problem solving the N/V ambiguity in cases where the lexical probability of the word points to a noun. These are the very cases that should be treated as one unit in the system. 
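A minimal sketch of this one-unit treatment is to merge a flagged pair into a single token before lexical probabilities are assigned; the pair list below is a hypothetical stand-in for the pairs meeting the error criterion, not the actual list from Table 1:

```python
# Sketch of the proposed fix: merge flagged verb+particle pairs into one
# token so the lexical probability applies to the pair as a whole.
# PHRASAL_PAIRS is hypothetical, standing in for the pairs on which
# PARTS had 10% or more errors.
PHRASAL_PAIRS = {("gun", "down"), ("sum", "up")}

def retokenize(words):
    """Merge each flagged adjacent pair into a single token."""
    out = []
    i = 0
    n = len(words)
    while i != n:
        if i + 1 != n and (words[i], words[i + 1]) in PHRASAL_PAIRS:
            out.append(words[i] + " " + words[i + 1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

# "gun down" now reaches the tagger as one unit, so a single,
# overwhelmingly verbal, lexical probability can be stored for the pair.
tokens = retokenize(["to", "gun", "down", "the", "suspect"])
# tokens == ["to", "gun down", "the", "suspect"]
```

Because only the flagged pairs are merged, sequences like box in keep their two-word analysis and PARTS' preferred Noun + Preposition tagging.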
The lexical probability should be assigned to the pair as a whole rather than considering the two elements separately. Table 1 lists the cases where tagging improves by 10% or more when PARTS is given the additional choice of assigning a verbal tag to the whole expression. Frequency distributions of these tokens in the Brown Corpus are presented as well, which reflect why the statistical probabilities err in these cases. In order to tag these expressions correctly, we will have to capture additional information about the pair which is not available from the PARTS statistical model.</Paragraph> </Section> <Section position="8" start_page="290" end_page="290" type="metho"> <SectionTitle> 4. CONCLUSION: LINGUISTIC INTUITIONS </SectionTitle> <Paragraph position="0"> This paper shows that for some cases of phrasal verbs it is not enough to rely on lexical probability alone: we must take into consideration the dependency between the verb and the particle in order to improve the performance of the tagger. The relationship between verbs and particles is deeply rooted in linguistics. Smith (1943) introduced the term phrasal verb, arguing that it should be regarded as a type of idiom because the elements behave as a unit. He claimed that phrasal verbs express a single concept that often has a one-word counterpart in other languages, yet does not always have compositional meaning. Some particles are syntactically more adverbial in nature and some more prepositional, but it is generally agreed that the phrasal verb constitutes a kind of integral functional unit. Perhaps linguistic knowledge can help solve the tagging problem described here and force a redefinition of the boundaries of phrasal verbs. For now, we can redefine the word boundaries for the problematic cases that PARTS doesn't handle well. 
Future research should concentrate on the linguistic characteristics of this problematic construction to determine if there are other cases where the current assumption that one word equals one unit interferes with successful processing.</Paragraph> </Section> </Paper>