<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1057"> <Title>Error Mining for Wide-Coverage Grammar Engineering</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 A parsability metric for word sequences </SectionTitle> <Paragraph position="0"> The error mining technique assumes we have available a large corpus of sentences. Each sentence is a sequence of words (of course, words might include tokens such as punctuation marks, etc.). We run the parser on all sentences, and we note for which sentences the parser is successful. We define the parsability of a word R(w) as the ratio of the number of times the word occurs in a sentence with a successful parse (C(w|OK)) and the total number of sentences that this word occurs in (C(w)):</Paragraph> <Paragraph position="1"> R(w) = C(w|OK) / C(w) </Paragraph> <Paragraph position="2"> Thus, if a word only occurs in sentences that cannot be parsed successfully, the parsability of that word is 0. On the other hand, if a word only occurs in sentences with a successful parse, its parsability is 1. If we have no reason to believe that a word is particularly easy or difficult, then we expect its parsability to be equal to the coverage of the parser (the proportion of sentences with a successful parse). If its parsability is (much) lower, then this indicates that something is wrong. For the experiments described below, the coverage of the parser lies between 91% and 95%. Yet, for many words we found parsability values that were much lower than that, including quite a number of words with parsability 0. Below we show some typical examples, and discuss the types of problem that are discovered in this way.</Paragraph> <Paragraph position="3"> If a word has a parsability of 0, but its frequency is very low (say 1 or 2), then this might easily be due to chance. We therefore use a frequency cut-off (e.g.
5), and we ignore words which occur less often in sentences without a successful parse.</Paragraph> <Paragraph position="4"> In many cases, the parsability of a word depends on its context. For instance, the Dutch word via is a preposition. Its parsability in a certain experiment was more than 90%. Yet, the parser was unable to parse sentences with the phrase via via, an adverbial expression which means via some complicated route. For this reason, we generalize the parsability of a word to word sequences in a straightforward way. We write C(w_i...w_j) for the number of sentences in which the sequence w_i...w_j occurs. Furthermore, C(w_i...w_j|OK) is the number of sentences with a successful parse which contain the sequence w_i...w_j. The parsability of a sequence is defined as:</Paragraph> <Paragraph position="5"> R(w_i...w_j) = C(w_i...w_j|OK) / C(w_i...w_j) </Paragraph> <Paragraph position="6"> If a word sequence w_i...w_j has a low parsability, then this might be because it is part of a difficult phrase. It might also be that part of the sequence is the culprit. In order to focus on the relevant sequence, we consider a longer sequence w_h...w_i...w_j...w_k only if its parsability is lower than the parsability of each of its substrings:</Paragraph> <Paragraph position="7"> R(w_h...w_k) must be lower than R(w_i...w_j) for every proper substring w_i...w_j of w_h...w_k (h ≤ i ≤ j ≤ k, with (i,j) ≠ (h,k)) </Paragraph> <Paragraph position="8"> This is computed efficiently by considering the parsability of sequences in order of length (shorter sequences before longer ones).</Paragraph> <Paragraph position="9"> We construct a parsability table, which is a list of n-grams sorted with respect to parsability.
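As a concrete sketch of the definitions above, the parsability R, the frequency cut-off, and the substring filter can be put together in a few lines of Python. The function names and the per-sentence counting are my own illustration, not the implementation used for the experiments (that implementation is described in section 4):

```python
from collections import Counter

def ngrams(words, n):
    """All n-grams (as tuples) in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def parsability_table(sentences, parsed_ok, max_n=3, cutoff=2):
    """Toy sketch of the parsability table.

    sentences: tokenized sentences; parsed_ok: parallel booleans
    (True = successful parse).  C(s) and C(s|OK) count *sentences*
    containing the sequence s, as in the text.
    """
    total, ok, fail = Counter(), Counter(), Counter()
    for words, good in zip(sentences, parsed_ok):
        for n in range(1, max_n + 1):
            for g in set(ngrams(words, n)):   # count once per sentence
                total[g] += 1
                (ok if good else fail)[g] += 1

    # R(s) = C(s|OK) / C(s)
    R = {g: ok[g] / total[g] for g in total}

    def proper_substrings(g):
        return [g[i:j] for i in range(len(g))
                for j in range(i + 1, len(g) + 1) if (i, j) != (0, len(g))]

    # keep an n-gram only if it occurs often enough in problematic parses
    # and its parsability is below that of every observed substring
    table = [(g, R[g]) for g in total
             if fail[g] >= cutoff
             and all(R[g] < R[s] for s in proper_substrings(g) if s in R)]
    return sorted(table, key=lambda entry: entry[1])
```

On a toy corpus where via via never parses, the table lists the bigram with parsability 0 ahead of the single word via, which still parses in other contexts.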
An n-gram is included in the parsability table, provided: (i) its frequency in problematic parses is larger than the frequency cut-off, and (ii) its parsability is lower than the parsability of all of its substrings. The claim in this paper is that a parsability table provides a wealth of information about systematic problems in the grammar and lexicon, which is otherwise hard to obtain.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experiments and results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 First experiment </SectionTitle> <Paragraph position="0"> Data. For our experiments, we used the Twente Nieuws Corpus, version pre-release 0.1. This corpus contains, among others, a large collection of news articles from various Dutch newspapers in the period 1994-2001. In addition, we used all news articles from the Volkskrant 1997 (available on CD-ROM). So that this material could be parsed relatively quickly, we discarded all sentences of more than 20 words. Furthermore, a time-out per sentence of twenty CPU-seconds was enforced. The Alpino parser normally exploits a part-of-speech tag filter for efficient parsing (Prins and van Noord, 2003); this filter was switched off, to ensure that the results were not influenced by mistakes due to the filter. In table 1 we list some basic quantitative facts about this material.</Paragraph> <Paragraph position="1"> We exploited a cluster of Linux PCs for parsing.</Paragraph> <Paragraph position="2"> If only a single PC had been available, it would have taken on the order of 100 CPU days to construct the material described in table 1.</Paragraph> <Paragraph position="3"> These experiments were performed in the autumn of 2002, with the Alpino parser available then.
Below, we report on more recent experiments with the latest version of the Alpino parser, which has been improved considerably on the basis of the results of the experiments described here.</Paragraph> <Paragraph position="4"> Results. For the data described above, we computed the parsability table, using a frequency cut-off of 5. In figure 1 the frequencies of parsability scores in the parsability table are presented. [Figure 1: frequencies of parsability scores occurring in the parsability table; frequency cut-off=5; first experiment (Autumn 2002).] From the figure, it is immediately clear that the relatively high number of word sequences with a parsability of (almost) zero cannot be due to chance.</Paragraph> <Paragraph position="5"> Indeed, the parsability table starts with word sequences which constitute systematic problems for the parser. In quite a lot of cases, these word sequences originate from particular types of newspaper text with idiosyncratic syntax, such as announcements of new books, movies, events, television programs, etc., as well as checkers, bridge and chess diagrams. Another category consists of (parts of) English, French and German phrases.</Paragraph> <Paragraph position="6"> We also find frequent spelling mistakes such as de de where only a single de (the definite article) is expected, and heben for hebben (to have), indentiek for identiek (identical), koninging for koningin (queen), etc. Other examples include wordt ik (becomes I), vindt ik (finds I), vind hij (find he), etc. We now describe a number of categories of examples which have been used to improve the parser.</Paragraph> <Paragraph position="7"> Tokenization. A number of n-grams with low parsability scores point towards systematic mistakes during tokenization. Here are a number of examples. The first and second n-grams indicate sentences which start with a full stop or an exclamation mark, due to a mistake in the tokenizer.
The third and fourth n-grams indicate a problem the tokenizer had with a sequence of a single capital letter with a dot, followed by the genitive marker. The grammar assumes that the genitive marking is attached to the proper name. Such phrases occur frequently in reports on criminals, who are referred to in the newspaper only by their initials. Another systematic mistake is reflected by the last n-grams. In reported speech such as Franca yells: You are crazy!, the tokenizer mistakenly introduced a sentence boundary between the exclamation mark and the comma. On the basis of examples such as these, the tokenizer has been improved.</Paragraph> <Paragraph position="8"> Mistakes in the lexicon. Another reason an n-gram receives a low parsability score is a mistake in the lexicon. The following table lists two typical examples:</Paragraph> <Paragraph position="10"> In Dutch, there is a distinction between neuter and non-neuter common nouns. The definite article de combines with non-neuter nouns, whereas neuter nouns select het. The common noun kaft, for example, combines with the definite article de. However, according to the dictionary, it is a neuter common noun (and thus would be expected to combine only with the definite article het). Many similar errors were discovered.</Paragraph> <Paragraph position="11"> Another syntactic distinction that is listed in the dictionary is the distinction between verbs which take the auxiliary hebben (to have) to construct a perfect tense clause vs. those that take the auxiliary zijn (to be). Some verbs allow both possibilities.</Paragraph> <Paragraph position="12"> The last example illustrates an error in the dictionary with respect to this syntactic feature.</Paragraph> <Paragraph position="13"> Incomplete lexical descriptions. The majority of problems that the parsability scores indicate reflect incomplete lexical entries.
A number of examples are provided in the following table:
R    C   n-gram
0.00 11  begunstigden (favoured (N/V))
0.23 10  zich eraan dat (self there-on that)
0.08 12  aan te klikken (on to click)
0.08 12  doodzonde dat (mortal sin that)
0.15 11  zwarts (black's)
0.00 16  dupe van (victim of)
0.00 13  het Turks . (the Turkish)
The word begunstigden is ambiguous between on the one hand the past tense of the verb begunstigen (to favour) and on the other hand the plural nominalization begunstigden (beneficiaries). The dictionary contained only the first reading.</Paragraph> <Paragraph position="14"> The sequence zich eraan dat illustrates a missing valency frame for verbs such as ergeren (to irritate). In Dutch, verbs which take a prepositional complement sometimes also allow the object of the prepositional complement to be realized by a subordinate (finite or infinite) clause. In that case, the prepositional complement is R-pronominalized. Example: He is not irritated by the fact that . . .</Paragraph> <Paragraph position="15"> The sequence aan te klikken is an example of a verb-particle combination which is not licensed in the dictionary. This is a relatively new verb which is used for click in the context of buttons and hyperlinks. The sequence doodzonde dat illustrates a syntactic construction where a copula combines with a predicative complement and a sentential subject, if that predicative complement is of the appropriate type. This type is specified in the dictionary, but was missing in the case of doodzonde. Example: That he is sleeping is a pity. The word zwarts should have been analyzed as a genitive noun (typically in sentences about chess or checkers), whereas the dictionary only assigned the inflected adjectival reading.</Paragraph> <Paragraph position="16"> The sequence dupe van illustrates an example of an R-pronominalization of a PP modifier.
This is generally not possible, except for a (quite large) number of contexts which are determined by the verb and the object: He has to suffer for it. The word Turks can be either an adjective (Turkish) or a noun (the Turkish language). The dictionary contained only the first reading.</Paragraph> <Paragraph position="17"> Very many other examples of incomplete lexical entries were found.</Paragraph> <Paragraph position="18"> Frozen expressions with idiosyncratic syntax. Dutch has many frozen expressions and idioms with archaic inflection and/or word order which break the parser. Examples include: The sequence dan schaadt het is part of the idiom Baat het niet, dan schaadt het niet (meaning: it might be unsure whether something is helpful, but in any case it won't do any harm). The sequence God zij is part of a number of archaic formulas such as God zij dank (Thank God). In such examples, the form zij is the (archaic) subjunctive form of the Dutch verb zijn (to be). The sequence Het zij zo is another fixed formula (English: So be it), containing the same subjunctive. The phrase van goeden huize (of good family) is a frozen expression with archaic inflection. The word berge exhibits archaic inflection of the word berg (mountain); it only occurs in the idiomatic expression de haren rijzen mij te berge (my hair rises to the mountain), which expresses a great deal of surprise. The n-gram hele gedwaald only occurs in the idiom Beter ten halve gekeerd dan ten hele gedwaald: it is better to turn halfway than to go all the way in the wrong direction. Many other (parts of) idiomatic expressions were found in the parsability table.</Paragraph> <Paragraph position="19"> The sequence te weeg only occurs as part of the phrasal verb te weeg brengen (to cause).</Paragraph> <Paragraph position="20"> Incomplete grammatical descriptions.
Although the technique strictly operates at the level of words and word sequences, it is capable of indicating grammatical constructions that are not treated, or not properly treated, in the grammar. (Example: We, the Dutch, often eat potatoes.) The sequence Geeft niet illustrates the syntactic phenomenon of topic-drop (not treated in the grammar): verb-initial sentences in which the topic (typically the subject) is not spelled out. The sequence de alles occurs with present participles (used as prenominal modifiers) such as overheersende, as in de alles overheersende paniek (literally: the all dominating panic, i.e., the panic that dominated everything). The grammar did not allow prenominal modifiers to select an NP complement. The sequence Het laten often occurs in nominalizations with multiple verbs. These were not treated in the grammar. A large number of n-grams also indicate elliptical structures, not treated in that version of the grammar. Another fairly large source of errors are irregular named entities (Gil y Gil, Osama bin Laden). [Table 2: second experiment (January 2004).]</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Later experiment </SectionTitle> <Paragraph position="0"> Many of the errors and omissions that were found on the basis of the parsability table have been corrected. As can be seen in table 2, the coverage obtained by the improved parser increased substantially. In this experiment, we also measured the coverage on additional sets of sentences (all sentences from the Trouw 1999 and Volkskrant 2001 newspapers, available in the TwNC corpus). The results show that coverage is similar on these unseen test sets.
Obviously, coverage only indicates how often the parser found a full parse, but it does not indicate whether that parse actually was the correct parse.</Paragraph> <Paragraph position="1"> For this reason, we also closely monitored the performance of the parser on the Alpino tree-bank (van der Beek et al., 2002a), both in terms of parsing accuracy and in terms of average number of parses per sentence. The average number of parses increased, which is to be expected if the grammar and lexicon are extended. Accuracy has been steadily increasing on the Alpino tree-bank. Accuracy is defined as the proportion of correct named dependency relations of the first parse returned by Alpino. Alpino employs a maximum entropy disambiguation component; the first parse is the most promising parse according to this statistical model. The maximum entropy disambiguation component of Alpino assigns a score S(x) to each parse x:</Paragraph> <Paragraph position="2"> S(x) = SUM_i theta_i f_i(x) (1) </Paragraph> <Paragraph position="3"> where f_i(x) is the frequency of a particular feature i in parse x and theta_i is the corresponding weight of that feature. The probability of a parse x for sentence w is then defined as follows, where Y(w) are all the parses of w:</Paragraph> <Paragraph position="4"> p(x|w) = exp(S(x)) / SUM_{y in Y(w)} exp(S(y)) (2) </Paragraph> <Paragraph position="5"> The disambiguation component is described in detail in Malouf and van Noord (2004). [Figure: development of accuracy over time, up to May 2004.] During this period many of the problems described earlier were solved, but other parts of the system were improved too (in particular, the disambiguation component was improved considerably). The point of the graph is that apparently the increase in coverage has not been obtained at the cost of decreasing accuracy.</Paragraph> <Paragraph position="6"> 4 A note on the implementation
The most demanding part of the implementation consists of the computation of the frequency of n-grams. If the corpus is large, or n increases, simple techniques break down.
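Returning to the disambiguation model above: the score S(x) and the probability p(x|w) can be sketched numerically. The feature names and weights below are invented purely for illustration; they are not Alpino's actual model:

```python
import math

# Hypothetical feature weights theta_i (invented for illustration)
WEIGHTS = {"subj_topic": 1.2, "long_dep": -0.8, "rare_frame": -1.5}

def score(feats):
    """S(x) = sum_i theta_i * f_i(x), with f_i(x) the frequency of feature i in parse x."""
    return sum(WEIGHTS.get(name, 0.0) * freq for name, freq in feats.items())

def parse_probs(parses):
    """p(x|w) = exp(S(x)) / sum over all parses y in Y(w) of exp(S(y))."""
    scores = [score(f) for f in parses]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

The first parse returned would be the one with the highest probability under this model.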
For example, an approach in which a hash data-structure is used to maintain the counts of each n-gram, and which increments the counts of each n-gram that is encountered, requires excessive amounts of memory for large n and/or for large corpora. On the other hand, if a more compact data-structure is used, speed becomes an issue. Church (1995) shows that suffix arrays can be used for efficiently computing the frequency of n-grams, in particular for larger n. If the corpus size increases, the memory required for the suffix array may become problematic. We propose a new combination of suffix arrays with perfect hash finite automata, which reduces typical memory requirements by a factor of five, in combination with a modest increase in processing efficiency.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Suffix arrays </SectionTitle> <Paragraph position="0"> Suffix arrays (Manber and Myers, 1990; Yamamoto and Church, 2001) are a simple but useful data-structure for various text-processing tasks. A corpus is a sequence of characters. A suffix array s is an array consisting of all suffixes of the corpus, sorted alphabetically. For example, if the corpus is the string abba, the suffix array is ⟨a, abba, ba, bba⟩.</Paragraph> <Paragraph position="1"> Rather than writing out each suffix, we use integers i to refer to the suffix starting at position i in the corpus. Thus, in this case the suffix array consists of the integers ⟨3, 0, 2, 1⟩.</Paragraph> <Paragraph position="2"> It is straightforward to compute the suffix array.</Paragraph> <Paragraph position="3"> For a corpus of k + 1 characters, we initialize the suffix array with the integers 0...k. The suffix array is sorted, using a specialized comparison routine which takes integers i and j, and alphabetically compares the strings starting at i and j in the corpus.</Paragraph> <Paragraph position="4"> Once we have the suffix array, it is simple to compute the frequency of n-grams.
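The construction and counting steps just described can be sketched naively in Python; a real implementation would avoid materializing a string per suffix during comparison:

```python
from collections import Counter

def suffix_array(corpus):
    """Suffix array: indices of all suffixes of the corpus, sorted
    alphabetically.  Naive version: the sort key materializes each suffix."""
    return sorted(range(len(corpus)), key=lambda i: corpus[i:])

def ngram_counts(corpus, n):
    """Read all n-grams off the sorted suffixes and count them.
    Identical n-grams are adjacent in suffix-array order, which is what
    makes counting via Unix sort | uniq -c possible in the text."""
    counts = Counter()
    for i in suffix_array(corpus):
        gram = corpus[i:i + n]
        if len(gram) == n:   # suffixes shorter than n contribute no n-gram
            counts[gram] += 1
    return counts
```

For the corpus abba, `suffix_array("abba")` returns `[3, 0, 2, 1]`, matching the example in the text.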
Suppose we are interested in the frequency of all n-grams for n = 10. We simply iterate over the elements of the suffix array: for each element, we print the first ten words of the corresponding suffix. This gives us all occurrences of all 10-grams in the corpus, sorted alphabetically. We now count each 10-gram, e.g. by piping the result to the Unix uniq -c command.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Perfect hash finite automata </SectionTitle> <Paragraph position="0"> Suffix arrays can be used more efficiently to compute frequencies of n-grams for larger n, with the help of an additional data-structure, known as the perfect hash finite automaton (Lucchiesi and Kowaltowski, 1993; Roche, 1995; Revuz, 1991).</Paragraph> <Paragraph position="1"> The perfect hash automaton for an alphabetically sorted finite set of words w_0...w_n is a weighted minimal deterministic finite automaton which maps w_i → i for each 0 ≤ i ≤ n. We call i the word code of w_i. An example is given in figure 3.</Paragraph> <Paragraph position="2"> Note that perfect hash automata implement an order-preserving, minimal perfect hash function. The function is minimal, in the sense that n keys are mapped into the range 0...n−1, and the function is order-preserving, in the sense that the alphabetic order of words is reflected in the numeric order of word codes.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Suffix arrays with words </SectionTitle> <Paragraph position="0"> In the approach of Church (1995), the corpus is a sequence of characters (represented by integers reflecting the alphabetic order).
A more space-efficient approach takes the corpus as a sequence of words, represented by word codes reflecting the alphabetic order.</Paragraph> <Paragraph position="1"> To compute frequencies of n-grams for larger n, we first compute the perfect hash finite automaton for all words which occur in the corpus, and map the corpus to a sequence of integers, by mapping each word to its word code. [Figure 3: perfect hash automaton for the words clock, dock, dog, duck, dust, rock, rocker, stock; summing the weights along an accepting path in the automaton yields the rank of the word in alphabetic ordering.] Suffix array construction then proceeds on the basis of word codes, rather than character codes.</Paragraph> <Paragraph position="3"> This approach has several advantages. The representation of both the corpus and the suffix array is more compact. If the average word length is k, then the corresponding arrays are k times smaller (but we need some additional space for the perfect hash automaton). In Dutch, the average word length k is about 5, and we obtained space savings in that order.</Paragraph> <Paragraph position="4"> If the suffix array is shorter, sorting should be faster too (but we need some additional time to compute the perfect hash automaton). In our experience, sorting is about twice as fast for word codes.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Computing parsability table </SectionTitle> <Paragraph position="0"> To compute parsability scores, we assume there are two corpora c_m and c_a, where the first is a sub-corpus of the second. c_m contains all sentences for which parsing was not successful; c_a contains all sentences overall. For both corpora, we compute the frequency of all n-grams for all n; n-grams with a frequency below a specified frequency cut-off are ignored.
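The word-code representation of section 4.3 can be sketched with an ordinary Python dict standing in for the perfect hash automaton; the automaton computes the same order-preserving minimal mapping, but in far less memory than a dict:

```python
def encode_corpus(words):
    """Map each distinct word to its rank in alphabetic order (an
    order-preserving minimal perfect hash, here just a dict), and encode
    the corpus as a sequence of word codes."""
    vocab = sorted(set(words))
    code = {w: i for i, w in enumerate(vocab)}
    return [code[w] for w in words], code
```

Suffix-array construction then runs on the integer sequence exactly as it would on character codes, since comparing code sequences respects alphabetic word order.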
Note that we need not impose an a priori maximum value for n; since there is a frequency cut-off, for some n there simply aren't any sequences which occur more frequently than this cut-off. The two n-gram frequency files are organized in such a way that shorter n-grams precede longer n-grams.</Paragraph> <Paragraph position="1"> The two frequency files are then combined as follows. Since the frequency file corresponding to c_m is (much) smaller than the file corresponding to c_a, we read the first file into memory (into a hash data structure). We then iteratively read an n-gram frequency from the second file, and compute the parsability of that n-gram. In doing so, we keep track of the parsability scores assigned to previous (hence shorter) n-grams, in order to ensure that longer n-grams are only reported in case the parsability scores decrease. The final step consists of sorting all remaining n-grams with respect to their parsability.</Paragraph> <Paragraph position="2"> To give an idea of the practicality of the approach, consider the following data for one of the experiments described above. For a corpus of 2,927,016 sentences (38,846,604 words, 209Mb), it takes about 150 seconds to construct the perfect hash automaton (mostly sorting). The automaton is about 5Mb in size, representing 677,488 distinct words. To compute the suffix array and frequencies of all n-grams (cut-off=5), about 15 minutes of CPU-time are required. Maximum runtime memory requirements are about 400Mb. The result contains frequencies for 1,641,608 distinct n-grams. Constructing the parsability scores on the basis of the n-gram files takes only 10 seconds of CPU-time, resulting in parsability scores for 64,998 n-grams (since there are much fewer n-grams which actually occur in problematic sentences). The experiment was performed on an Intel Pentium III, 1266MHz machine running Linux.
The software is freely available from http://www.let.rug.nl/~vannoord/software.html.</Paragraph> </Section> </Section> </Paper>