<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1016"> <Title>The Web as a Baseline: Evaluating the Performance of Unsupervised Web-based Models for a Range of NLP Tasks</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Candidate Selection for Machine Translation </SectionTitle> <Paragraph position="0"> Target word selection is a generation task that occurs in machine translation (MT). A word in a source language can often be translated into different words in the target language and the choice of the appropriate translation depends on a variety of semantic and pragmatic factors. The task is illustrated in (1) where there are five translation alternatives for the German noun Geschichte listed in curly brackets, the first being the correct one.</Paragraph> <Paragraph position="1"> (1) a. Die Geschichte &quot;andert sich, nicht jedoch die Geographie.</Paragraph> <Paragraph position="2"> b. fHistory, story, tale, saga, stripg changes but geography does not.</Paragraph> <Paragraph position="3"> Statistical approaches to target word selection rely on bilingual lexica to provide all possible translations of words in the source language. Once the set of translation candidates is generated, statistical information gathered from target language corpora is used to select the most appropriate alternative (Dagan and Itai, 1994). The task is somewhat simplified by Grefenstette (1998) and Prescher et al. (2000) who do not produce a translation of the entire sentence. Instead, they focus on specific syntactic relations. Grefenstette translates compounds from German and Spanish into English, and uses BNC frequencies as a filter for candidate translations. He observes that this approach suffers from an acute data sparseness problem and goes on to obtain counts for candidate compounds through web searches, thus achieving a translation accuracy of 86-87%.</Paragraph> <Paragraph position="4"> Prescher et al. (2000) concentrate on verbs and their objects. Assuming that the target language translation of the verb is known, they select from the candidate translations the noun that is semantically most compatible with the verb. The semantic fit between a verb and its argument is modeled using a class-based lexicon that is derived from unlabeled data using the expectation maximization algorithm (verb-argument model). Prescher et al. also propose a refined version of this approach that only models the fit between a verb and its object (verb-object model), disregarding other arguments of the verb. The two models are trained on the BNC and evaluated against two corpora of 1,340 and 814 bilingual sentence pairs, with an average of 8.63 and 2.83 translations for the object noun, respectively. Table 4 lists Prescher et al.'s results for the two corpora and for both models together with a random baseline (select a target noun at random) and a frequency baseline (select the most frequent target noun).</Paragraph> <Paragraph position="5"> Grefenstette's (1998) evaluation was restricted to compounds that are listed in a dictionary. These compounds are presumably well-established and fairly frequent, which makes it easy to obtain reliable web frequencies. We wanted to test if the web-based approach extends from lexicalized compounds to productive syntactic units for which dictionary entries do not exist. We therefore performed our evaluation using Prescher et al.'s (2000) test set of verb-object pairs. 
<Paragraph position="6"> Table 3 compares the web-based models against the BNC models. For both the high ambiguity and the low ambiguity data set, we find that the performance of the best Altavista model is not significantly different from that of the best BNC model. Table 4 compares our simple, unsupervised methods with the two sophisticated class-based models discussed above. The results show that there is no significant difference in performance between the best model reported in the literature and the best Altavista or the best BNC model. However, both models significantly outperform the baseline. This holds for both the high and low ambiguity data sets.</Paragraph> <Paragraph position="7"> [Table 3: accuracy of the Altavista and BNC models on the high and low ambiguity data sets; the cell values were not recovered during extraction.] </Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Context-sensitive Spelling Correction </SectionTitle> <Paragraph position="0"> Context-sensitive spelling correction is the task of correcting spelling errors that result in valid words. Such a spelling error is illustrated in (2), where principal was typed when principle was intended.</Paragraph> <Paragraph position="1"> (2) Introduction of the dialogue principal proved strikingly effective.</Paragraph> <Paragraph position="2"> The task can be viewed as a generation task, as it consists of choosing between alternative surface realizations of a word. This choice is typically modeled by confusion sets such as {principal, principle} or {then, than}, under the assumption that each word in the set could be mistakenly typed when another word in the set was intended. The task is to infer which word in a confusion set is the correct one in a given context. This choice can be either syntactic (as for {then, than}) or semantic (as for {principal, principle}).</Paragraph> <Paragraph position="3"> A number of machine learning methods have been proposed for context-sensitive spelling correction. These include a variety of Bayesian classifiers (Golding, 1995; Golding and Schabes, 1996), decision lists (Golding, 1995), transformation-based learning (Mangu and Brill, 1997), Latent Semantic Analysis (LSA) (Jones and Martin, 1997), multiplicative weight update algorithms (Golding and Roth, 1999), and augmented mixture models (Cucerzan and Yarowsky, 2002). Despite their differences, most approaches use two types of features: context words and collocations. Context word features record the presence of a word within a fixed window around the target word (bag of words); collocational features capture the syntactic environment of the target word and are usually represented by a small number of words and/or part-of-speech tags to the left or right of the target word. The results obtained by a variety of classification methods are given in Table 6. All methods use either the full set or a subset of the 18 confusion sets originally gathered by Golding (1995).
Most methods are trained and tested on the Brown corpus, using 80% for training and 20% for testing. (An exception is Golding (1995), who uses the entire Brown corpus for training (1M words) and 3/4 of the Wall Street Journal corpus (Marcus et al., 1993) for testing.) We devised a simple, unsupervised method for performing spelling correction using web counts.</Paragraph> <Paragraph position="4"> The method takes into account collocational features, i.e., words that are adjacent to the target word. For each word in the confusion set, we used the web to estimate how frequently it co-occurs with a word or a pair of words immediately to its left or right. Disambiguation is then performed by selecting the word in the confusion set with the highest co-occurrence frequency or probability. The web counts were retrieved using literal queries (see Section 2). Ties are resolved by comparing the unigram frequencies of the words in the confusion set and defaulting to the word with the highest one. Table 5 shows the types of collocations we considered and their corresponding accuracy. The baseline f(t) in Table 5 was obtained by always choosing the most frequent unigram in the confusion set. We used the same test set (2,056 tokens from the Brown corpus) and confusion sets as Golding and Schabes (1996), Mangu and Brill (1997), and Cucerzan and Yarowsky (2002).</Paragraph> <Paragraph position="5"> Table 5 shows that the best result (89.24%) for the web-based approach is obtained with a context of one word to the left and one word to the right of the target word (f(w1,t,w2)). The BNC-based models perform consistently worse than the web-based models, with the exception of f(t,w1)/f(t); the best Altavista model performs significantly better than the best BNC model. Table 6 shows that both the best Altavista model and the best BNC model outperform their respective baselines. A comparison with the literature shows that the best Altavista model outperforms Golding (1995) and Jones and Martin (1997), and performs similarly to Golding and Schabes (1996). The highest accuracy on the task is achieved by the class of multiplicative weight-update algorithms such as Winnow (Golding and Roth, 1999). Both the best BNC model and the best Altavista model perform significantly worse than this model. Note that Golding and Roth (1999) use algorithms that can handle large numbers of features and are robust to noise. Our method uses a very small feature set: it relies only on co-occurrence frequencies and does not have access to POS information (the latter has been shown to improve performance on confusion sets whose words belong to different parts of speech). An advantage of our method is that it can be used for a large number of confusion sets without relying on the availability of training data.</Paragraph>
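As an illustration, the sketch below implements the collocational selection just described, under the same caveats as before: web_count is a hypothetical stub with hard-wired counts, and only the f(w1,t,w2) collocation (one word of context on each side) is shown.

```python
ILLUSTRATIVE_COUNTS = {
    '"dialogue principle proved"': 40,
    '"dialogue principal proved"': 5,
    '"principle"': 8_000_000,
    '"principal"': 6_000_000,
}

def web_count(query: str) -> int:
    """Stubbed web hit count for a literal query."""
    return ILLUSTRATIVE_COUNTS.get(query, 0)

def correct_spelling(confusion_set: list[str], left: str, right: str) -> str:
    """Choose the member t of the confusion set maximizing f(w1,t,w2);
    resolve ties by falling back to the unigram frequency f(t)."""
    def colloc(t: str) -> int:
        return web_count(f'"{left} {t} {right}"')

    best = max(colloc(t) for t in confusion_set)
    tied = [t for t in confusion_set if colloc(t) == best]
    if len(tied) > 1:  # tie: default to the most frequent unigram
        return max(tied, key=lambda t: web_count(f'"{t}"'))
    return tied[0]

# Example (2): "Introduction of the dialogue {principal, principle} proved ..."
print(correct_spelling(["principal", "principle"], "dialogue", "proved"))
```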
</Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Ordering of Prenominal Adjectives </SectionTitle> <Paragraph position="0"> The ordering of prenominal modifiers is important for natural language generation systems, where the text must be both fluent and grammatical. For example, the sequence big fat Greek wedding is perfectly acceptable, whereas fat Greek big wedding sounds odd. The ordering of prenominal adjectives has sparked a great deal of theoretical debate (see Shaw and Hatzivassiloglou 1999 for an overview), and efforts have concentrated on defining rules based on semantic criteria that account for different orders (e.g., age ≺ color, value ≺ dimension).</Paragraph> <Paragraph position="1"> Data-intensive approaches to the ordering problem rely on corpora for gathering evidence for the likelihood of different orders. They rest on the hypothesis that the relative order of premodifiers is fixed, and independent of context and the noun being modified. The simplest strategy is what Shaw and Hatzivassiloglou (1999) call direct evidence. Given an adjective pair {a,b}, they count how many times ⟨a,b⟩ and ⟨b,a⟩ appear in the corpus and choose the order with the highest frequency.</Paragraph> <Paragraph position="2"> Unfortunately, the direct evidence method performs poorly when a given order is unseen in the training data. To compensate for this, Shaw and Hatzivassiloglou (1999) propose to compute the transitive closure of the ordering relation: if a ≺ c and c ≺ b, then a ≺ b. Malouf (2000) further proposes a back-off bigram model of adjective pairs for choosing among alternative orders (P(⟨a,b⟩ | {a,b}) vs. P(⟨b,a⟩ | {a,b})). He also proposes positional probabilities as a means of estimating how likely it is for a given adjective a to appear first in a sequence, by looking at each pair in the training data that contains the adjective a and recording its position. Finally, he uses memory-based learning as a means to encode morphological and semantic similarities among different adjective orders. Each adjective pair ab is encoded as a vector of 16 features (the last eight characters of a and the last eight characters of b) and a class (⟨a,b⟩ or ⟨b,a⟩).</Paragraph> <Paragraph position="3"> Malouf (2000) extracted 263,838 individual pairs of adjectives from the BNC, which he randomly partitioned into test (10%) and training data (90%), and evaluated all the above methods for ordering prenominal adjectives. His results showed that a memory-based classifier that uses morphological information as well as positional probabilities as features outperforms all other methods (see Table 7; data from Malouf 2000). For the ordering task we restricted ourselves to the direct evidence strategy, which simply chooses the adjective order with the highest frequency or probability (see Table 7). Web counts were obtained by submitting literal queries to Altavista (see Section 2). We used the same 263,838 adjective pairs that Malouf extracted from the BNC. These were randomly partitioned into a training (90%) and test corpus (10%). The test corpus contained 26,271 adjective pairs. Given that submitting 26,271 queries to Altavista would be fairly time-consuming, a random sample of 1,000 sequences was obtained from the test corpus and the web frequencies of these pairs were retrieved. The best Altavista model significantly outperformed the best BNC model, as indicated in Table 7. We also found that there was no significant difference between the best Altavista model and the best model reported by Malouf, a supervised method using positional probability estimates from the BNC and morphological variants.</Paragraph>
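The direct evidence strategy used above amounts to a single count comparison; a minimal sketch, with the same hypothetical web_count stub and hard-wired counts, might look as follows.

```python
ILLUSTRATIVE_COUNTS = {
    '"big fat"': 9_000,
    '"fat big"': 12,
}

def web_count(query: str) -> int:
    """Stubbed web hit count for a literal query."""
    return ILLUSTRATIVE_COUNTS.get(query, 0)

def order_pair(a: str, b: str) -> tuple[str, str]:
    """Direct evidence: return whichever order, <a,b> or <b,a>, has the
    higher literal-query web count."""
    if web_count(f'"{a} {b}"') >= web_count(f'"{b} {a}"'):
        return (a, b)
    return (b, a)

print(order_pair("fat", "big"))  # -> ('big', 'fat')
```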
</Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Bracketing of Compound Nouns </SectionTitle> <Paragraph position="0"> The first analysis task we consider is the syntactic disambiguation of compound nouns, which has received a fair amount of attention in the NLP literature (Pustejovsky et al., 1993; Resnik, 1993; Lauer, 1995). The task can be summarized as follows: given a three-word compound n1 n2 n3, determine the correct binary bracketing of the word sequence (see (3) for an example).</Paragraph> <Paragraph position="1"> (3) a. [[backup compiler] disk] b. [backup [compiler disk]] </Paragraph> <Paragraph position="2"> Previous approaches typically compare different bracketings and choose the most likely one. The adjacency model compares [n1 n2] against [n2 n3] and adopts a right-branching analysis if [n2 n3] is more likely than [n1 n2]. The dependency model compares [n1 n2] against [n1 n3] and adopts a right-branching analysis if [n1 n3] is more likely than [n1 n2].</Paragraph> <Paragraph position="3"> The simplest model of compound noun disambiguation compares the frequencies of the two competing analyses and opts for the most frequent one (Pustejovsky et al., 1993). Lauer (1995) proposes an unsupervised method for estimating the frequencies of the competing bracketings based on a taxonomy or a thesaurus. He uses a probability ratio to compare the probability of the left-branching analysis to that of the right-branching one (see (4) for the dependency model and (5) for the adjacency model):</Paragraph> <Paragraph position="4"> \[ R_{dep} = \frac{\sum_{t_1 \ni w_1,\, t_2 \ni w_2} P(t_1 \rightarrow t_2)}{\sum_{t_1 \ni w_1,\, t_3 \ni w_3} P(t_1 \rightarrow t_3)} \quad (4) \qquad R_{adj} = \frac{\sum_{t_1 \ni w_1,\, t_2 \ni w_2} P(t_1 \rightarrow t_2)}{\sum_{t_2 \ni w_2,\, t_3 \ni w_3} P(t_2 \rightarrow t_3)} \quad (5) \] </Paragraph> <Paragraph position="5"> Here t1, t2 and t3 are conceptual categories in the taxonomy or thesaurus, and the nouns w1 ... wi are members of these categories. The estimation of probabilities over concepts (rather than words) reduces the number of model parameters and effectively decreases the amount of training data required. The probability P(t1 → t2) denotes the modification of a category t2 by a category t1.</Paragraph> <Paragraph position="6"> Lauer (1995) tested both the adjacency and dependency models on 244 compounds extracted from Grolier's encyclopedia, a corpus of 8 million words. Frequencies for the two models were obtained from the same corpus and from Roget's thesaurus (version 1911) by counting pairs of nouns that are either strictly adjacent or that co-occur within a window of a fixed size (e.g., two, three, fifty, or a hundred words). The majority of the bracketings in our test set were left-branching, yielding a baseline of 63.93% (see Table 9). Lauer's best results (77.50%) were obtained with the dependency model and a training scheme which takes strictly adjacent nouns into account.</Paragraph> <Paragraph position="7"> Performance increased further by 3.2% when POS tags were taken into account. The results for this tuned model are also given in Table 9. Finally, Lauer conducted an experiment with human judges to assess the upper bound for the bracketing task. An average accuracy of 81.50% was obtained.</Paragraph> <Paragraph position="8"> We replicated Lauer's (1995) results for compound noun bracketing using the same test set. We compared the performance of the adjacency and dependency models (see (4) and (5)), but instead of relying on a corpus and a thesaurus, we estimated the relevant probabilities using web counts. The latter were obtained using inflected queries (see Section 2) and Altavista's NEAR operator.</Paragraph>
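A sketch of the word-based version of this comparison, using raw web counts in place of concept probabilities (web_count is again a hypothetical stub; the paper's actual queries are inflected and use the NEAR operator):

```python
ILLUSTRATIVE_COUNTS = {
    '"backup compiler"': 30,
    '"backup disk"': 4_000,
    '"compiler disk"': 120,
}

def web_count(query: str) -> int:
    """Stubbed web hit count for a literal query."""
    return ILLUSTRATIVE_COUNTS.get(query, 0)

def bracket(n1: str, n2: str, n3: str, model: str = "dependency") -> str:
    """Compare the two competing analyses of the compound n1 n2 n3.
    Dependency: [n1 n2] vs. [n1 n3]; adjacency: [n1 n2] vs. [n2 n3].
    Ties default to the left-branching analysis."""
    left = web_count(f'"{n1} {n2}"')
    right = (web_count(f'"{n1} {n3}"') if model == "dependency"
             else web_count(f'"{n2} {n3}"'))
    if left >= right:
        return f"[[{n1} {n2}] {n3}]"
    return f"[{n1} [{n2} {n3}]]"

print(bracket("backup", "compiler", "disk"))  # -> [backup [compiler disk]]
```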
<Paragraph position="9"> Ties were resolved by defaulting to the most frequent analysis (i.e., left-branching). To gauge the performance of the web-based models we compared them against their BNC-based alternatives; the performance of the best Altavista model was significantly higher than that of the best BNC model (see Table 8). A comparison with the literature (see Table 9) shows that the best BNC model fails to significantly outperform the baseline, and it performs significantly worse than the best model in the literature (Lauer's tuned model). The best Altavista model, on the other hand, is not significantly different from Lauer's tuned model and significantly outperforms the baseline.</Paragraph> <Paragraph position="10"> Hence we achieve the same performance as Lauer without recourse to a predefined taxonomy or a thesaurus.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Interpretation of Compound Nouns </SectionTitle> <Paragraph position="0"> The second analysis task we consider is the semantic interpretation of compound nouns. Most previous approaches to this problem have focused on the interpretation of two-word compounds whose nouns are related via a basic set of semantic relations (e.g., CAUSE relates onion tears, FOR relates pet spray). The majority of proposals are symbolic and therefore limited to a specific domain, due to the large effort involved in hand-coding semantic information (see Lauer 1995 for an extensive overview).</Paragraph> <Paragraph position="1"> Lauer (1995) is the first to propose and evaluate an unsupervised probabilistic model of compound noun interpretation for domain-independent text. By recasting the interpretation problem in terms of paraphrasing, Lauer assumes that the semantic relations of compound heads and modifiers can be expressed via prepositions that (in contrast to abstract semantic relations) can be found in a corpus. For example, in order to interpret war story, one needs to find related paraphrases in a corpus: story about the war, story of the war, story in the war, etc. Lauer uses eight prepositions for the paraphrasing task (of, for, in, at, on, from, with, about). A simple model of compound noun paraphrasing is shown in (6):</Paragraph> <Paragraph position="2"> \[ p^{*} = \arg\max_{p} P(p \mid n_1, n_2) \quad (6) \] </Paragraph> <Paragraph position="3"> Lauer (1995) points out that the above model contains one parameter for every triple ⟨p, n1, n2⟩, and as a result hundreds of millions of training instances would be necessary.</Paragraph>
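Web counts make even the data-intensive model (6) feasible to estimate, as the experiments below show. A minimal lexicalized sketch, with the usual hypothetical web_count stub; only the determiner "the" is shown, whereas the actual queries also try "a" and the empty determiner, and inflect the nouns:

```python
PREPOSITIONS = ["of", "for", "in", "at", "on", "from", "with", "about"]

ILLUSTRATIVE_COUNTS = {
    '"story about the war"': 2_500,
    '"story of the war"': 1_800,
    '"story in the war"': 150,
}

def web_count(query: str) -> int:
    """Stubbed web hit count for a literal query."""
    return ILLUSTRATIVE_COUNTS.get(query, 0)

def paraphrase(modifier: str, head: str) -> str:
    """Lexicalized trigram model: choose the preposition p maximizing the
    count of the paraphrase "head p the modifier", e.g. "story about the
    war" for the compound "war story"."""
    return max(PREPOSITIONS,
               key=lambda p: web_count(f'"{head} {p} the {modifier}"'))

print(paraphrase("war", "story"))  # -> about
```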
<Paragraph position="4"> As an alternative to (6), he proposes the model in (7), which combines the probability of the modifier given a certain preposition with the probability of the head given the same preposition, and assumes that these two probabilities are independent:</Paragraph> <Paragraph position="5"> \[ p^{*} = \arg\max_{p} \sum_{t_1 \ni n_1,\, t_2 \ni n_2} P(t_1 \mid p)\, P(t_2 \mid p) \quad (7) \] </Paragraph> <Paragraph position="6"> Here, t1 and t2 represent concepts in Roget's thesaurus.</Paragraph> <Paragraph position="7"> Lauer (1995) also experimented with a lexicalized version of (7) where probabilities are calculated on the basis of word (rather than concept) frequencies, which Lauer obtained from Grolier's encyclopedia heuristically via pattern matching.</Paragraph> <Paragraph position="8"> Lauer (1995) tested the model in (7) on 282 compounds that he selected randomly from Grolier's encyclopedia and annotated with their paraphrasing prepositions. The preposition of accounted for 33% of the paraphrases in this data set (see Baseline in Table 11). The concept-based model (see (7)) achieved an accuracy of 28% on this test set, whereas its lexicalized version reached an accuracy of 40% (see Table 11).</Paragraph> <Paragraph position="9"> We attempted the interpretation task with the lexicalized version of the bigram model (see (7)), but also tried the more data-intensive trigram model (see (6)), again in its lexicalized form. Furthermore, we experimented with several conditional and unconditional variants of (7) and (6). Co-occurrence frequencies were estimated from the web using inflected queries (see Section 2). Determiners were inserted before nouns, resulting in queries of the type story/stories about and about the/a/∅ war/wars for the compound war story.</Paragraph> <Paragraph position="10"> As shown in Table 10, the best performance was obtained using the web-based trigram model f(n1,p,n2); it significantly outperformed the best BNC model. The comparison with the literature in Table 11 showed that the best Altavista model significantly outperformed both the baseline and the best model in the literature (Lauer's word-based model). The BNC model, on the other hand, achieved a performance that is not significantly different from the baseline, and significantly worse than Lauer's best model.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 8 Noun Countability Detection </SectionTitle> <Paragraph position="0"> The next analysis task that we consider is the problem of determining the countability of nouns. Countability is the semantic property that determines whether a noun can occur in singular and plural forms, and it affects the range of permissible modifiers. In English, nouns are typically either countable (e.g., one dog, two dogs) or uncountable (e.g., some peace, *one peace, *two peaces).</Paragraph> <Paragraph position="1"> Baldwin and Bond (2003) propose a method for automatically learning the countability of English nouns from the BNC. They obtain information about noun countability by merging lexical entries from COMLEX (Grishman et al., 1994) and the ALTJ/E Japanese-to-English semantic transfer dictionary (Ikehara et al., 1991). Words are classified into four classes: countable, uncountable, bipartite (e.g., trousers), and plural only (e.g., goods). A memory-based classifier is used to learn the four-way distinction on the basis of several linguistically motivated features, such as the number of the head noun, the number of the modifier, subject-verb agreement, and plural determiners.</Paragraph> <Paragraph position="2"> We devised unsupervised models for the countability learning task and evaluated their performance on Baldwin and Bond's (2003) test data. We concentrated solely on countable and uncountable nouns, as they account for the vast majority of the data. Four models were tested, as sketched below: (a) compare the frequency of the singular and plural forms of the noun; (b) compare the frequency of determiner-noun pairs that are characteristic of countable or uncountable nouns; the determiners used were many for countable nouns and much for uncountable ones; (c) the same as model (b), but the det-noun frequencies are normalized by the frequency of the noun; (d) backoff: try to make a decision using det-noun frequencies; if these are too sparse, back off to singular/plural frequencies.</Paragraph>
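A sketch of models (a), (c), and (d), under the usual assumptions: web_count is a hypothetical stub, plural inflection is handled naively by appending "s", the direction of the singular/plural ratio is our reading of the model, and the threshold and backoff parameter values are illustrative rather than the tuned ones.

```python
ILLUSTRATIVE_COUNTS = {
    '"dog"': 5_000_000, '"dogs"': 3_000_000,
    '"many dogs"': 40_000, '"much dog"': 60,
    '"peace"': 7_000_000, '"peaces"': 900,
    '"many peaces"': 10, '"much peace"': 9_000,
}

def web_count(query: str) -> int:
    """Stubbed web hit count for a literal query."""
    return ILLUSTRATIVE_COUNTS.get(query, 0)

def model_a(noun: str, threshold: float = 0.1) -> str:
    """(a) Singular/plural evidence: countable nouns pluralize freely, so
    classify as countable when the plural is sufficiently frequent
    relative to the singular."""
    ratio = web_count(f'"{noun}s"') / max(web_count(f'"{noun}"'), 1)
    return "countable" if ratio > threshold else "uncountable"

def model_c(noun: str, threshold: float = 1.0) -> str:
    """(c) Conditional det-noun model f(det,n)/f(n): compare normalized
    "many n" (countable cue) against "much n" (uncountable cue)."""
    n = max(web_count(f'"{noun}"'), 1)
    many = web_count(f'"many {noun}s"') / n
    much = web_count(f'"much {noun}"') / n
    return "countable" if many > threshold * much else "uncountable"

def model_d(noun: str, min_count: int = 100) -> str:
    """(d) Backoff: use det-noun evidence when frequent enough, otherwise
    fall back on singular/plural evidence."""
    if web_count(f'"many {noun}s"') + web_count(f'"much {noun}"') >= min_count:
        return model_c(noun)
    return model_a(noun)

print(model_d("dog"), model_d("peace"))  # -> countable uncountable
```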
<Paragraph position="3"> Unigram and bigram frequencies were estimated from the web using literal queries; for models (a)-(c), a threshold parameter was optimized on the development set (this parameter determines the ratio of singular/plural frequencies or det-noun frequencies above which a noun was considered countable). For model (d), an additional backoff parameter was used, specifying the minimum frequency that triggers the backoff.</Paragraph> <Paragraph position="4"> The models and their performance on the test set are listed in Table 12 (data from Baldwin and Bond 2003). The best Altavista model is the conditional det-noun model f(det,n)/f(n), which achieves 88.38% on countable and 91.22% on uncountable nouns.</Paragraph> <Paragraph position="5"> On the BNC, the simple unigram model performs best. Its performance is not statistically different from that of the best Altavista model. Note that for the BNC models, data sparseness means that the det-noun models perform poorly, which is why the backoff model was not attempted here.</Paragraph> <Paragraph position="6"> Table 13 shows that both the Altavista model and the BNC model significantly outperform the baseline (the relative frequency of the majority class on the gold-standard data).</Paragraph> <Paragraph position="7"> The comparison with the literature shows that both the Altavista and the BNC model perform significantly worse than the best model proposed by Baldwin and Bond (2003); this is a supervised model that uses many more features than just singular/plural frequency and det-noun frequency.</Paragraph> </Section> </Paper>