<?xml version="1.0" standalone="yes"?>
<Paper uid="E95-1021">
<Title>Tagging French - comparing a statistical and a constraint-based method</Title>
<Section position="5" start_page="0" end_page="149" type="metho"> <SectionTitle> 3 The statistical model </SectionTitle>
<Paragraph position="0"> We use the Xerox part-of-speech tagger (Cutting et al., 1992), a statistical tagger made at the Xerox Palo Alto Research Center.</Paragraph>
<Section position="1" start_page="0" end_page="149" type="sub_section"> <SectionTitle> 3.1 Training </SectionTitle>
<Paragraph position="0"> The Xerox tagger is claimed (Cutting et al., 1992) to be adaptable and easily trained; only a lexicon and a suitable amount of untagged text are required.</Paragraph>
<Paragraph position="1"> A new language-specific tagger can therefore be built with a minimal amount of work. We started our project by doing so. We took our lexicon with the new tagset and a corpus of French text, and trained the tagger. We ran the tagger on another text and counted the errors. The result was not good; 13 % of the words were tagged incorrectly. The tagger does not require a tagged corpus for training, but two types of biases can be set to tell the tagger what is correct and what is not: symbol biases and transition biases. The symbol biases describe what is likely in a given ambiguity class; they represent a kind of lexical probability. The transition biases describe the likelihood of various tag pairs occurring in succession. The biases serve as initial values before training.</Paragraph>
<Paragraph position="2"> We spent approximately one man-month writing biases and tuning the tagger. Our training corpus was rather small, because the training had to be repeated frequently. When it seemed that the results could not be further improved, we tested the tagger on a new corpus. The eventual result was that 96.8 % of the words in the corpus were tagged correctly. This result is about the same as for statistical taggers of English.</Paragraph>
</Section>
<Section position="2" start_page="149" end_page="149" type="sub_section"> <SectionTitle> 3.2 Modifying the biases </SectionTitle>
<Paragraph position="0"> A 4 % error rate is not generally considered a negative result for a statistical tagger, but some of the errors are serious. For example, a sequence of determiner ... noun ... noun/verb ... preposition is frequently disambiguated in the wrong way, e.g. Le train part à cinq heures (The train leaves at 5 o'clock). The word part is ambiguous between a noun and a verb (singular, third person), and it is disambiguated incorrectly. The tagger seems to prefer the noun reading between a singular noun and a preposition.</Paragraph>
<Paragraph position="1"> One way to resolve this is to write new biases.</Paragraph>
<Paragraph position="2"> We added two new ones. The first one says that a singular noun is not likely to be followed by a noun (this is not always true, but we could call it a tendency). The second states that a singular noun is likely to be followed by a singular, third-person verb. The result was that the problematic sentence was disambiguated correctly, but the changes had a bad side effect: the overall error rate of the tagger increased by over 50 %. This illustrates how difficult it is to write good biases.</Paragraph>
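To make the role of the biases concrete, the following Python sketch encodes one symbol bias and the two transition biases discussed above as initial probability estimates for the hidden Markov model. The representation, tag names and numbers are invented for the illustration and do not reflect the actual bias format of the Xerox tagger.

    # Toy encoding of biases as initial HMM parameters (assumed values only;
    # the real Xerox tagger uses its own bias file format and tagset).

    # Symbol bias: lexical preference inside one ambiguity class.
    SYMBOL_BIASES = {
        frozenset({"NOUN-SG", "VERB-P3SG"}): {"NOUN-SG": 0.7, "VERB-P3SG": 0.3},
    }

    # Transition biases, including the two added for "Le train part à cinq heures".
    TRANSITION_BIASES = {
        ("NOUN-SG", "NOUN-SG"): 0.1,    # a singular noun is unlikely to be followed by a noun
        ("NOUN-SG", "VERB-P3SG"): 0.8,  # ... but likely to be followed by a 3rd-person singular verb
    }

    def initial_transition(prev_tag, tag, tagset):
        """Initial transition estimate: biased where a bias exists, uniform otherwise."""
        return TRANSITION_BIASES.get((prev_tag, tag), 1.0 / len(tagset))

    def initial_emission(ambiguity_class, tag):
        """Initial lexical estimate for a tag within an ambiguity class."""
        dist = SYMBOL_BIASES.get(frozenset(ambiguity_class))
        return dist.get(tag, 0.0) if dist else 1.0 / len(ambiguity_class)

    # Training on untagged text (e.g. forward-backward re-estimation) then starts
    # from these values rather than from uniform distributions.
    print(initial_transition("NOUN-SG", "VERB-P3SG", ["NOUN-SG", "VERB-P3SG", "PREP"]))  # 0.8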
<Paragraph position="3"> Getting a correct result for a particular sentence does not necessarily increase the overall success rate.</Paragraph>
</Section> </Section>
<Section position="6" start_page="149" end_page="149" type="metho"> <SectionTitle> 4 The constraint-based model </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="149" end_page="149" type="sub_section"> <SectionTitle> 4.1 A two-level model for tagging </SectionTitle>
<Paragraph position="0"> In the constraint-based tagger, the rules are represented as finite-state transducers. The transducers are composed with the sentence in sequence. Each transducer may remove, or in principle also change, one or more readings of the words. After all the transducers have been applied, each word in the sentence has only one analysis.</Paragraph>
<Paragraph position="1"> Our constraint-based tagger is based on techniques that were originally developed for morphological analysis. The disambiguation rules are similar to phonological rewrite rules (Kaplan and Kay, 1994), and the parsing algorithm is similar to the algorithm for combining the morphological rules with the lexicon (Karttunen, 1994).</Paragraph>
<Paragraph position="2"> The tagger has a close relative in (Koskenniemi, 1990; Koskenniemi et al., 1992; Voutilainen and Tapanainen, 1993), where the rules are represented as finite-state machines that are conceptually intersected with each other. In that tagger the disambiguation rules are applied in the same manner as the morphological rules in (Koskenniemi, 1983). Another relative is presented in (Roche and Schabes, 1994), which uses a single finite-state transducer to transform one tag into another. A constraint-based system is also presented in (Karlsson, 1990; Karlsson et al., 1995). Related work with finite-state machines has been done using local grammars (Roche, 1992; Silberztein, 1993; Laporte, 1994).</Paragraph>
</Section>
<Section position="2" start_page="149" end_page="149" type="sub_section"> <SectionTitle> 4.2 Writing the rules </SectionTitle>
<Paragraph position="0"> One quick experiment that motivated the building of the constraint-based model was the following: we took a million words of newspaper text and ranked ambiguous words by frequency. We found that a very limited set of word forms covers a large part of the total ambiguity. The 16 most frequent ambiguous word forms (namely de, la, le, les, des, en, du, un, a, dans, une, pas, est, plus, Le, son) account for 50 % of all ambiguity. Two thirds of the ambiguity is due to the 97 most frequent ambiguous words. (A similar experiment shows that in the Brown corpus 63 word forms cover 50 % of all the ambiguity, and two thirds of the ambiguity is covered by 220 word forms.)</Paragraph>
<Paragraph position="1"> Another interesting observation is that the most frequent ambiguous words are usually words which are in general corpus-independent, i.e. words that belong to closed classes (determiners, prepositions, pronouns, conjunctions), auxiliaries, common adverbials or common verbs, like faire (to do, to make). The first corpus-specific word is in the 41st position.</Paragraph>
<Paragraph position="2"> For the most frequent ambiguous word forms, one may safely define principled contextual restrictions to resolve ambiguities. This is in particular the case for clitic/determiner ambiguities attached to words like le or la. Our rule says that clitic pronouns are attached to a verb and determiners to a noun with possibly an unrestricted number of premodifiers. This is a good starting point, although some ambiguity remains, as in la place, which can be read as a determiner-noun or a clitic-verb sequence.</Paragraph>
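To show the shape of such a principled rule, here is a minimal Python sketch of the clitic/determiner decision for le and la. The tag names, the premodifier set and the simplification that the following words already carry a single tag are assumptions made for the example; the tagger itself encodes the rule as a finite-state transducer.

    # Sketch of the principled rule for 'le'/'la' (illustrative only):
    # keep the determiner reading when a noun follows, allowing any number
    # of premodifiers in between; keep the clitic reading before a verb.

    PREMODIFIERS = {"ADJ", "ADV", "NUM"}   # assumed premodifier tags

    def disambiguate_le_la(readings, right_context_tags):
        """readings: candidate tags for 'le'/'la';
        right_context_tags: one tag per following word (simplified to
        already-disambiguated tags for the sake of the example)."""
        i = 0
        while i < len(right_context_tags) and right_context_tags[i] in PREMODIFIERS:
            i += 1   # skip an unrestricted number of premodifiers
        nxt = right_context_tags[i] if i < len(right_context_tags) else None
        if nxt == "NOUN":
            return readings & {"DET"}
        if nxt == "VERB":
            return readings & {"PRON-CLITIC"}
        return readings   # leave ambiguous; later rules or heuristics decide

    print(disambiguate_le_la({"DET", "PRON-CLITIC"}, ["ADJ", "NOUN"]))   # {'DET'}
    print(disambiguate_le_la({"DET", "PRON-CLITIC"}, ["VERB"]))          # {'PRON-CLITIC'}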
<Paragraph position="2"> Some of the very frequent words have categories that are rare; for instance, the auxiliary forms a and est can also be nouns, and the pronoun cela is also a very rare verb form. In such a case, we restrict the use of the rarest categories to contexts where the most frequent reading is not at all possible; otherwise the most frequent reading is preferred. For instance, the word avions may be a noun or an auxiliary verb. We prefer the noun reading and accept the verb reading only when the first-person pronoun nous appears in the left context, e.g. as in nous ne les avions pas (we did not have them).</Paragraph>
<Paragraph position="3"> This means that the tagger errs only when a rare reading should be chosen in a context where the most common reading is still acceptable. This may never actually occur, depending on how accurate the contextual restrictions are. It can even be the case that discarding the rare readings would not induce a detectable loss in accuracy, e.g. in the conflict between cela as a pronoun and as a verb; the latter is a rarely used tense of a rather literary verb.</Paragraph>
<Paragraph position="4"> The principled rules do not require any tagged corpus, and should thus be corpus-independent. The rules are based on a short list of extremely common words (fewer than 100 words).</Paragraph>
<Paragraph position="5"> The rules described above are certainly not sufficient to provide full disambiguation, even if one considers only the most ambiguous word forms. We need more rules for cases that the principled rules do not disambiguate.</Paragraph>
<Paragraph position="6"> Some ambiguity is extremely difficult to resolve using the information available. A very problematic case is the word des, which can either be a determiner, as in Jean mange des pommes (Jean eats apples), or an amalgamated preposition-determiner, as in Jean aime le bruit des vagues (Jean likes the sound of waves).</Paragraph>
<Paragraph position="7"> Proper treatment of such an ambiguity would require verb subcategorisation and a description of complex coordinations of noun and prepositional phrases. This goes beyond the scope of both the statistical and the constraint-based taggers. For such cases we introduce ad-hoc heuristics. Some are quite reasonable, e.g. the determiner reading of des is preferred at the beginning of a sentence. Some are more or less arguable, e.g. the prepositional reading is preferred after a noun. One may identify various contexts in which either the noun or the adjective can be preferred. Such contextual restrictions (Chanod, 1993) are not always true, but may be considered reasonable for resolving the ambiguity. For instance, in the case of two successive noun/adjective ambiguities like le franc fort (the strong franc or the frank fort), we favour the noun-adjective sequence except when the first word is a common prenominal adjective such as bon, petit, grand, premier, ... as in le petit fort (the small fort) or even le bon petit (the good little one).</Paragraph>
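A Python sketch of this noun/adjective heuristic follows; the tag names are assumptions and the list of prenominal adjectives is only the truncated list quoted above, so the code is illustrative rather than the tagger's actual rule.

    # Heuristic for two successive noun/adjective ambiguities after a determiner,
    # e.g. 'le franc fort' vs 'le petit fort' (illustrative sketch).

    COMMON_PRENOMINAL_ADJECTIVES = {"bon", "petit", "grand", "premier"}  # list truncated in the paper

    def resolve_noun_adj_pair(word1, word2):
        """Both words are ambiguous between NOUN and ADJ."""
        if word1.lower() in COMMON_PRENOMINAL_ADJECTIVES:
            return ("ADJ", "NOUN")   # 'le petit fort' -> adjective + noun
        return ("NOUN", "ADJ")       # default: 'le franc fort' -> noun + adjective

    print(resolve_noun_adj_pair("franc", "fort"))   # ('NOUN', 'ADJ')
    print(resolve_noun_adj_pair("petit", "fort"))   # ('ADJ', 'NOUN')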
<Paragraph position="10"> Our heuristics do not resolve all the ambiguity. To obtain a fully unambiguous result we make use of non-contextual heuristics. The non-contextual rules may be thought of as lexical probabilities: we guess what the most probable tag is in the remaining ambiguities. For instance, preposition is preferred to adjective, pronoun is preferred to past participle, etc. The rules are obviously not very reliable, but they are needed only when the previous rules fail to fully disambiguate. The current system contains 75 rules, consisting of: * 39 reliable contextual rules dealing mostly with frequent ambiguous words.</Paragraph>
<Paragraph position="11"> * 25 rules describing heuristics with various degrees of linguistic generality.</Paragraph>
<Paragraph position="12"> * 11 non-contextual rules for the remaining ambiguities.</Paragraph>
<Paragraph position="13"> The rules were constructed in less than one month, on the basis of 50 newspaper sentences.</Paragraph>
<Paragraph position="14"> All the rules are currently represented by 11 transducers.</Paragraph>
</Section>
<Section position="8" start_page="150" end_page="151" type="metho"> <SectionTitle> 5 The results </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="150" end_page="151" type="sub_section"> <SectionTitle> 5.1 Test A </SectionTitle>
<Paragraph position="0"> For evaluation, we used a corpus totally unrelated to the development corpus. It contains 255 sentences (5752 words) randomly selected from a corpus of economic reports. About 54 % of the words are ambiguous. The text is first tagged manually without using the disambiguators, and the output of the tagger is then compared to the hand-tagged result.</Paragraph>
<Paragraph position="1"> If we apply all the rules, we get a fully disambiguated result with an error rate of only 1.3 %. This error rate is much lower than the one we get using the hidden Markov model (3.2 %). See Figure 1.</Paragraph>
<Paragraph position="2"> We can also restrict the tagger to using only the most reliable rules. Only 10 words lose the correct tag when almost 2000 out of 3085 ambiguous words are disambiguated. Among the remaining 1136 ambiguous words, about 25 % of the ambiguity is due to determiner/preposition ambiguities (words like du and des), 30 % are adjective/noun ambiguities and 18 % are noun/verb ambiguities.</Paragraph>
<Paragraph position="3"> If we use both the principled and the heuristic rules, the error rate is 0.52 % while 423 words remain ambiguous. The non-contextual rules that eliminate the remaining 423 ambiguities produce an additional 43 errors. Overall, 98.7 % of the words receive the correct tag.</Paragraph>
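These figures fit together; as a rough consistency check (assuming the error rates are percentages of the full 5752-word sample):

    # Back-of-the-envelope check of the Test A figures (assumed to be
    # percentages of all 5752 words).
    total_words = 5752
    errors_before_final_rules = round(total_words * 0.0052)   # ~30 errors, 423 words still ambiguous
    errors_added_by_final_rules = 43
    total_errors = errors_before_final_rules + errors_added_by_final_rules
    print(total_errors, round(100 * total_errors / total_words, 1))   # ~73 errors, ~1.3 % -> 98.7 % correct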
</Section>
<Section position="2" start_page="151" end_page="151" type="sub_section"> <SectionTitle> 5.2 Test B </SectionTitle>
<Paragraph position="0"> We also tested the taggers with more difficult text. The 12 000 word sample of newspaper text has typos and proper names (like Bats, Botta, Ddrnis, Ferrasse, Hersant, ...) that match an existing word in the lexicon. Problems of the latter type are relatively rare, but this sample was exceptional.</Paragraph>
<Paragraph position="1"> Altogether, the lexicon mismatches introduced 0.5 % of errors into the input of the taggers. The results are shown in Figure 2. This text also seems to be generally more difficult to parse than the first one.</Paragraph>
</Section>
<Section position="3" start_page="151" end_page="151" type="sub_section"> <SectionTitle> 5.3 Combination of the taggers </SectionTitle>
<Paragraph position="0"> We also tried combining the taggers, using first the rules and then the statistics (a similar approach was also used in (Tapanainen and Voutilainen, 1994)). We evaluated the results obtained by the following sequence of operations: 1) Running the constraint-based tagger without the final, non-contextual rules.</Paragraph>
<Paragraph position="1"> 2) Using the statistical disambiguator independently. We select the tag proposed by the statistical disambiguator if it is not removed during step 1.</Paragraph>
<Paragraph position="2"> 3) Solving the remaining ambiguities by running the final non-contextual rules of the constraint-based tagger. This last step ensures that one gets a fully disambiguated text. Actually, only about 0.5 % of the words were not fully disambiguated after step 2.</Paragraph>
<Paragraph position="3"> We used test sample B. After the first step, 1400 words out of 12 000 remain ambiguous. The process of combining the three steps described above eventually leads to more errors than running the constraint-based tagger alone. The statistical tagger introduces 220 errors on the 1400 words that remain ambiguous after step 1. In comparison, the final set of non-contextual rules introduces around 150 errors on the same set of 1400 words. We did not expect this result. One possible explanation for the superior performance of the final non-contextual rules is that they are meant to apply after the previous rules have failed to disambiguate the word. This is in itself useful information. The final heuristics favour tags that have survived all the conditions that restrict their use. For instance, the contextual rules define various contexts where the preposition tag for des is preferred. Therefore, the final heuristics favour the determiner reading for des.</Paragraph>
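The sequence of operations can be summarised in the following Python sketch. The three components are passed in as functions because the actual rule transducers and the HMM tagger are not reproduced here; the names and signatures are placeholders.

    # Sketch of the tagger combination described above (illustrative only).

    def combine_taggers(sentence, constraint_rules, statistical_tagger, final_rules):
        """sentence: list of words.
        constraint_rules(sentence)    -> list of surviving tag sets   (step 1)
        statistical_tagger(sentence)  -> list of single tags          (step 2)
        final_rules(sentence, sets)   -> list of single tags          (step 3)
        """
        surviving = constraint_rules(sentence)            # step 1
        hmm_tags = statistical_tagger(sentence)           # step 2
        merged = [
            {tag} if tag in readings else readings
            for readings, tag in zip(surviving, hmm_tags)
        ]
        # step 3: only about 0.5 % of the words are still ambiguous here
        return final_rules(sentence, merged)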
</Section> </Section>
<Section position="9" start_page="151" end_page="153" type="metho"> <SectionTitle> 6 Analysis of errors </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="151" end_page="151" type="sub_section"> <SectionTitle> 6.1 Errors of principled and heuristic rules </SectionTitle>
<Paragraph position="0"> Let us now consider what kind of errors the constraint-based tagger produced. We do not deal with errors produced by the last set of rules, the non-contextual rules, because it is already known that they are not very accurate. To improve the tagger, they should be replaced with more accurate heuristic rules.</Paragraph>
<Paragraph position="1"> We divide the errors into three categories: (1) errors due to multi-word expressions, (2) errors that should or could be resolved, and (3) errors that are hard to resolve with the information that is available.</Paragraph>
<Paragraph position="2"> The first group (15 errors), the multi-word expressions, is difficult for the syntax-based rules because in many cases the expression does not follow any conventional syntactic structure, or the structure may be very rare. In multi-word expressions some words also have categories that may not appear anywhere else. The best way to handle them is to lexicalise these expressions. When a possible expression is recognised we can either collapse it into one unit or leave it otherwise intact except that the most &quot;likely&quot; interpretation is marked.</Paragraph>
<Paragraph position="3"> The biggest group (41 errors) contains errors that could have been resolved correctly but were not. The reason for this is obvious: only a relatively small amount of time was allowed for writing the rules. In addition, the rules were constructed on the basis of a rather small set of example sentences. Therefore, it would be very surprising if such errors did not appear in a test sample taken from a different source. The errors are the following: * The biggest subgroup has 19 errors that require modifications to existing rules. Our rules were meant to handle such cases but fail to do so correctly in some sentences. Often only a minor correction is needed.</Paragraph>
<Paragraph position="4"> * Some syntactic constructions, or word sequences, were omitted. This caused 7 errors which could easily be avoided by writing more rules. For instance, a construction like &quot;preposition / clitic + finite verb&quot; was not forbidden. The phrase à l'est was analysed in this way, while the correct analysis is &quot;preposition / determiner + noun&quot;.</Paragraph>
<Paragraph position="5"> * Sometimes a little extra lexical information is required. Six errors would require more information, or the kind of refinement of the tag inventory that would not have been appropriate for the statistical tagger.</Paragraph>
<Paragraph position="6"> * Nine errors could be avoided by refining existing heuristics, especially by taking into account exceptions for specific words like point, pendant and devant.</Paragraph>
<Paragraph position="7"> The remaining errors (28 errors) constitute the price we pay for using the heuristics. Removing the rules which fail would cause a lot of ambiguity to remain. The errors are the following: * Fifteen errors are due to the heuristics for de and des. There is little room for improvement at this level of description (see Chapter 4.2.3). However, the current, simple heuristics fully disambiguate 850 instances of de and des out of 914, i.e. 92 % of all the occurrences were parsed with less than a 2 % error rate.</Paragraph>
<Paragraph position="8"> * Six errors involve noun-adjective ambiguities that are difficult to solve, for instance in a subject or object predicate position.</Paragraph>
<Paragraph position="9"> * Seven errors seem to be beyond reach for various reasons: long coordination, rare constructions, etc.
An example is les boîtes (the boxes), where les is wrongly tagged in the test sample because the noun form is misspelled as boites, which is identified only as a verb by the lexicon.</Paragraph>
</Section>
<Section position="2" start_page="151" end_page="153" type="sub_section"> <SectionTitle> 6.2 Difference between the taggers </SectionTitle>
<Paragraph position="0"> We also investigated how the errors compare between the two taggers. Here we used the fully disambiguated outputs of the taggers. The errors belong mainly to three classes: * Some errors appear predominantly with the statistical tagger and almost never with the constraint-based tagger. This is particularly the case with the ambiguity between past participles and adjectives.</Paragraph>
<Paragraph position="1"> * Some errors are common to both taggers, the constraint-based tagger generally being more accurate (often with a ratio of 1 to 2). These errors cover ambiguities that are known to be difficult to handle in general, such as the already mentioned determiner/preposition ambiguity. * Finally, there are errors that are specific to the constraint-based tagger. They are often related to errors that could be corrected with some extra work. They are relatively infrequent, thus the global accuracy of the constraint-based tagger remains higher.</Paragraph>
<Paragraph position="2"> The first two classes of errors are generally difficult to correct. The easiest way to improve the constraint-based tagger is to concentrate on the final class. As we mentioned earlier, it is not very easy to change the behaviour of the statistical tagger in one place without some side-effects elsewhere. This means that the errors of the first class are probably easiest to resolve by means other than statistics.</Paragraph>
<Paragraph position="3"> The first class is quite annoying for the statistical tagger because it contains errors that are intuitively very clear and resolvable, but which are far beyond the limits of the current statistical tagger. We can take an easy sentence to demonstrate this: Je ne le pense pas. (I do not think so.)</Paragraph>
<Paragraph position="4"> Tu ne le penses pas. (You do not think so.)</Paragraph>
<Paragraph position="5"> Il ne le pense pas. (He does not think so.)</Paragraph>
<Paragraph position="6"> The verb pense is ambiguous between the first person and the third person (this is not the case with all French verbs, e.g. Je crois and il croit). It is usually easy to determine the person just by checking the personal pronoun nearby. For a human or a constraint-based tagger this is an easy task; for a statistical tagger it is not. There are two words between the pronoun and the verb that do not carry any information about the person. The personal pronoun may thus be too far from the verb, because bi-gram models can see backward no farther than le, and tri-gram models no farther than ne le.</Paragraph>
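The limitation can be made concrete with a small Python sketch: the left context available to a trigram model at pense is just (ne, le), whereas a constraint-style rule may scan arbitrarily far to the left for a personal pronoun. The tag names and the rule itself are simplified for illustration.

    SENTENCE = ["Je", "ne", "le", "pense", "pas"]

    def trigram_context(words, i):
        """The only left context a trigram model conditions on."""
        return tuple(words[max(0, i - 2):i])

    def person_from_left_context(words, i):
        """Constraint-style rule: scan left for a subject pronoun."""
        pronouns = {"je": "P1SG", "tu": "P2SG", "il": "P3SG", "elle": "P3SG"}
        for w in reversed(words[:i]):
            if w.lower() in pronouns:
                return pronouns[w.lower()]
        return None

    print(trigram_context(SENTENCE, 3))           # ('ne', 'le') -- no person information
    print(person_from_left_context(SENTENCE, 3))  # 'P1SG' -> first-person reading of 'pense'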
<Paragraph position="7"> Also, as mentioned earlier, resolving the adjective vs. past participle ambiguity is much harder if the tagger does not know whether there is an auxiliary verb in the sentence or not.</Paragraph>
</Section> </Section>
<Section position="10" start_page="153" end_page="153" type="metho"> <SectionTitle> 7 Conclusion </SectionTitle>
<Paragraph position="0"> We have presented two taggers for French: a statistical one and a constraint-based one.</Paragraph>
<Paragraph position="1"> There are two ways to train the statistical tagger: from a tagged corpus or using a self-organising method that does not need a tagged corpus. We had a strict time limit of one month for building the tagger, and no tagged corpus was available. This is a short time for the manual tagging of a corpus and for the training of the tagger. It would be risky to spend, say, three weeks on tagging a corpus and only one week on training. The size of the corpus would also have to be limited, because it would have to be checked as well.</Paragraph>
<Paragraph position="2"> We selected the Xerox tagger, which learns from an untagged corpus. The task was not as straightforward as we thought. Without human assistance in the training, the result was not impressive, and we had to spend much time tuning the tagger and guiding the learning process. In a month we achieved 95-97 % accuracy.</Paragraph>
<Paragraph position="3"> The training process of a statistical tagger requires some time because the linguistic information has to be incorporated into the tagger one way or another; it cannot be obtained for free, starting from nothing. Because the linguistic information is needed anyway, we decided to encode it in a more straightforward way, as explicit linguistic disambiguation rules. It has been argued that statistical taggers are superior to rule-based/hand-coded ones because of better accuracy and better adaptability (easy to train). In our experiment, both claims turned out to be wrong.</Paragraph>
<Paragraph position="4"> For the constraint-based tagger we set a one-month time limit for writing the constraints by hand. We used only linguistic intuition and a very limited set of sentences to write the 75 constraints. We formulated constraints of different accuracy: some of the constraints are almost 100 % accurate, some of them just describe tendencies.</Paragraph>
<Paragraph position="5"> Finally, when we thought that the rules were good enough, we took two text samples from different sources and tested both taggers. The constraint-based tagger made several naive errors because we had forgotten, miscoded or ignored some linguistic phenomena, but still it made only half of the errors that the statistical one made.</Paragraph>
<Paragraph position="6"> A big difference between the taggers is that the tuning of the statistical tagger is very subtle, i.e. it is hard to predict the effect of tuning the parameters of the system, whereas the constraint-based tagger is very straightforward to correct.</Paragraph>
<Paragraph position="7"> Our general conclusion is that the hand-coded constraints perform better than the statistical tagger and that we can still refine them. The most important of our findings is that writing constraints that contain more linguistic information than the current statistical model does not take much time.</Paragraph>
</Section>
</Paper>