File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-3238_metho.xml
Size: 19,287 bytes
Last Modified: 2025-10-06 14:09:29
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3238"> <Title>Spelling correction as an iterative process that exploits the collective knowledge of web users</Title> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Error Model. String Edit Functions </SectionTitle> <Paragraph position="0"> All formulations of the spelling correction task given in the previous section used a string distance function and a threshold to restrict the space in which alternative spellings are searched. Previous work has addressed the problem of choosing appropriate functions (e.g. Kernighan et al., 1990; Brill and Moore, 2002; Toutanova and Moore, 2003).</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 A trusted lexicon may still be used in the estimation of the </SectionTitle> <Paragraph position="0"> language model probability for the computation of P(t|s).</Paragraph> <Paragraph position="1"> The choice of distance function d and threshold δ can be extremely important for the accuracy of a speller. At one extreme, an overly restrictive function/threshold combination can result in not finding the best correction for a given query. For example, using the vanilla Damerau-Levenshtein edit distance (defined as the minimum number of point changes required to transform a string into another, where a point change is one of the following operations: insertion of a letter, deletion of a letter, and substitution of one letter with another letter) and a threshold δ = 1, the correction donadl duck → donald duck would not be possible. At the other extreme, the use of a less restrictive function may lead to very unlikely corrections being suggested. For example, using the same classical Levenshtein distance and δ = 2 would allow the correction of the string donadl duck, but would also lead to bad corrections such as log wood → dog food (based on the frequency of the queries, as incorporated in P(s)). Nonetheless, large-distance corrections are still desirable in a diversity of situations, for example: platnuin rings → platinum rings, ditroitigers → detroit tigers. The system described in this paper makes use of a modified context-dependent weighted Damerau-Levenshtein edit function which allows insertion, deletion, substitution, immediate transposition, and long-distance movement of letters as point changes, and for which the weights were interactively refined using statistics from query logs.</Paragraph> </Section>
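The context-dependent weights and the long-distance letter movement are specific to the system and are not given here; as an illustration of the underlying dynamic program only, a minimal sketch of a weighted Damerau-Levenshtein distance with uniform, illustrative point-change costs is:

```python
# Minimal sketch of a weighted Damerau-Levenshtein distance.
# The weights below are illustrative defaults, not the context-dependent
# weights refined from query-log statistics, and long-distance letter
# movement is omitted for brevity.

def weighted_damerau_levenshtein(s, t, w_ins=1.0, w_del=1.0, w_sub=1.0, w_transp=1.0):
    m, n = len(s), len(t)
    # d[i][j] = cost of transforming s[:i] into t[:j]
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del
    for j in range(1, n + 1):
        d[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if s[i - 1] == t[j - 1] else w_sub
            d[i][j] = min(
                d[i - 1][j] + w_del,    # delete s[i-1]
                d[i][j - 1] + w_ins,    # insert t[j-1]
                d[i - 1][j - 1] + sub,  # substitute (or match)
            )
            # immediate transposition of two adjacent letters
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + w_transp)
    return d[m][n]

print(weighted_damerau_levenshtein("donadl", "donald"))  # 1.0 (one transposition)
```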
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 The Language Model. Exploiting Large </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Web Query Logs </SectionTitle> <Paragraph position="0"> A misspelling such as ditroitigers is far from the correct alternative and thus it might be extremely difficult to find its correct spelling based solely on edit distance. Nonetheless, the correct alternative could be reached by allowing intermediate valid correction steps, such as ditroitigers → detroittigers → detroit tigers. But what makes detroittigers a valid correction step? Recall that the last formulation of spelling correction in Section 3 did not explicitly use a lexicon of the language. Rather, any string that appears in the query log used for training can be considered a valid correction and can be suggested as an alternative to the current web query, based on the relative frequencies of the query and the alternative spelling. Thus, a spell checker built according to this formulation could suggest the correction detroittigers because this alternative occurs frequently enough in the employed query log. However, detroittigers itself could be corrected to detroit tigers if presented as a stand-alone query to this spell checker, based on similar query-log frequency facts, which naturally leads to the idea of an iterative correction approach. [Table 1: misspellings of Einstein's name in a web query log]</Paragraph> <Paragraph position="1"> Essential to such an approach are three typical properties of the query logs (e.g. see Table 1):
* words in the query logs are misspelled in various ways, from relatively easy-to-correct misspellings to very difficult-to-correct ones that make the user's intent almost impossible to recognize;
* the less malign (difficult to correct) a misspelling is, the more frequent it is;
* the correct spellings tend to be more frequent than misspellings.</Paragraph> <Paragraph position="2"> In this context, the spelling correction problem can be given the following iterative formulation: Given a string s_0 ∈ Σ*, find a sequence of strings s_1, s_2, ..., s_n ∈ Σ* such that each s_{i+1} is a correction of s_i under the base formulation and s_n admits no further correction.</Paragraph> <Paragraph position="4"> An example of a correction that can be made by iteratively applying the base spell checker is anol scwartegger → arnold schwarzenegger:
Misspelled query: anol scwartegger
First iteration: arnold schwartnegger
Second iteration: arnold schwarznegger
Third iteration: arnold schwarzenegger
Fourth iteration: no further correction
Up to this point, we have underspecified the notion of string in the task formulations given. One possibility is to consider whole queries as the strings to be corrected and to iteratively search for better logged queries according to the agreement between their relative frequencies and the character error model. This is equivalent to identifying all queries in the query log that are misspellings of other queries and, for any new query, finding a correction sequence of logged queries. While such an approach exploits the vast information available in web-query logs, it only covers exact matches of the queries that appear in these logs and provides low coverage of infrequent queries. For example, a query such as britnet spear inconcert could not be corrected if the correction britney spears in concert does not appear in the employed query log, although the substring britnet spear could be corrected to britney spears.</Paragraph> <Paragraph position="5"> To address the shortcomings of such an approach, we propose a system based on the following formulation, which uses query substrings: Given s_0 ∈ Σ*, find a sequence s_1, s_2, ..., s_n ∈ Σ* such that for each i ∈ {0, ..., n-1} there exist the decompositions s_i = w_{i,0} w_{i,1} ... w_{i,l_i} and s_{i+1} = w_{i+1,0} w_{i+1,1} ... w_{i+1,l_{i+1}}, where the w_{j,h} are words or groups of words such that each aligned pair of substrings satisfies the correction conditions of the base formulation.</Paragraph> <Paragraph position="7"> Note that the length of the string decomposition may vary from one iteration to the next (for example, when a concatenation such as detroittigers is split into detroit tigers). In the implementation evaluated in this paper, we allowed decompositions of query strings into words and word bigrams. The tokenization process uses space and punctuation delimiters in addition to the information provided about multi-word compounds (e.g. add-on and back-up) by a trusted English lexicon with approximately 200k entries. By using the tokenization process described above, we extracted word unigram and bigram statistics from query logs to be used as the system's language model.</Paragraph> </Section> </Section>
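As an illustration of this extraction step, a minimal sketch of collecting word unigram and bigram counts from a query log is given below. It assumes one query per line in a plain-text file (an assumed format) and uses simple whitespace/punctuation tokenization; the multi-word compound handling supplied by the trusted lexicon is omitted.

```python
import re
from collections import Counter

# Minimal sketch: collect word unigram and bigram counts from a query log
# with one query per line (assumed format). Tokenization here is plain
# whitespace/punctuation splitting; the trusted-lexicon compounds
# (e.g. "add-on", "back-up") are not handled in this sketch.

TOKEN_RE = re.compile(r"[a-z0-9']+")

def tokenize(query):
    return TOKEN_RE.findall(query.lower())

def collect_ngram_counts(log_path):
    unigrams, bigrams = Counter(), Counter()
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            tokens = tokenize(line)
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

# Example usage with a hypothetical log file name:
# unigrams, bigrams = collect_ngram_counts("query_log.txt")
# print(unigrams["detroit"], bigrams[("detroit", "tigers")])
```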
<Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Query Correction </SectionTitle> <Paragraph position="0"> An input query is tokenized using the same space and word-delimiter information, in addition to the available lexical information, as used for processing the query log. For each token, a set of alternatives is computed using the weighted Levenshtein distance function described in Section 3 and two different thresholds for in-lexicon and out-of-lexicon tokens. Matches are searched in the space of word unigrams and bigrams extracted from query logs, in addition to the trusted lexicon. Unigrams and bigrams are stored in the same data structure on which the search for correction alternatives is done. Because of this, the proposed system handles concatenation and splitting of words in exactly the same manner as it handles transformations of words into other words.</Paragraph> <Paragraph position="1"> Once the sets of all possible alternatives are computed for each word form in the query, a modified Viterbi search (in which the transition probabilities are computed using bigram and unigram query-log statistics and the output probabilities are replaced with inverse distances between words) is employed to find the best possible alternative string to the input query under the following constraint: no two adjacent in-vocabulary words are allowed to change simultaneously. This constraint prevents changes such as log wood → dog food. An algorithmic consequence of this constraint is that there is no need to search all the possible paths in the trellis, which makes the modified search procedure much faster, as described below. We assume that the list of alternatives for each word is randomly ordered, but the input word is in the first position of the list when the word is in the trusted lexicon. In this case, the searched paths form what we call fringes.</Paragraph> <Paragraph position="2"> Figure 1 presents an example of a trellis in which w1, w2 and w3 are in-lexicon word forms. Observe that instead of computing the cost of k1·k2 possible paths between the alternatives corresponding to w1 and w2, we only need to compute the cost of k1+k2 paths.</Paragraph> <Paragraph position="4"> Because we use word-bigram statistics, stop words such as prepositions and conjunctions may interfere negatively with the best-path search. For example, in correcting a query such as platunum and rigs, the language model based on word bigrams would not provide a good context for the word form rigs.</Paragraph> <Paragraph position="5"> To avoid this type of problem, stop words and their most likely misspellings are given special treatment. The search is done by first ignoring them, as in Figure 1, where w4 is presumed to be such a word. Once a best path is found by ignoring stop words, the best alternatives for the skipped stop words (or their misspellings) are computed in a second Viterbi search with fringes in which the extremities are fixed, as presented in Figure 2.</Paragraph> <Paragraph position="6"> The approach of search with fringes coupled with an iterative correction process is both very efficient and very effective. In each iteration, the search space is much reduced. Changes such as log wood → dog food are avoided because they cannot be made in one iteration and there are no intermediate corrections conditionally more probable than the left-hand-side query (log wood) and less probable than the right-hand-side query (dog food). An iterative process is prone to other types of problems. Short queries can be iteratively transformed into other, unrelated queries; therefore, changing such queries is additionally restricted in our system. Another restriction we imposed is to not allow changes of in-lexicon words in the first iteration, so that easy-to-fix unknown-word errors are handled before any word-substitution error.</Paragraph> </Section>
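A minimal sketch of the constrained best-path search described in this section follows. It enforces the adjacency constraint by filtering transitions; the fringe optimization and the second pass for stop words are not reproduced, the scoring (log bigram probability minus edit cost) is a simplification of the transition and output terms described above, and all function and parameter names are illustrative rather than the system's actual interface.

```python
import math

# Minimal sketch of the constrained best-path search over per-token
# correction candidates. Fringes and the separate stop-word pass are
# not reproduced; the scoring is a simplification for illustration.

def best_correction(tokens, candidates, in_lexicon, bigram_logp):
    """tokens: original query tokens.
    candidates: one list of (word, edit_cost) pairs per position; the
                original token is assumed to appear in its own list.
    in_lexicon: set of trusted-lexicon words.
    bigram_logp: function(w1, w2) -> log-probability estimate from query logs.
    """
    # best[j] = (score, path) for the best path ending in candidates[0][j]
    best = [(-cost, [word]) for word, cost in candidates[0]]
    for i in range(1, len(tokens)):
        new_best = []
        for word, cost in candidates[i]:
            top_score, top_path = -math.inf, [word]
            for (prev_score, prev_path), (prev_word, _) in zip(best, candidates[i - 1]):
                # Constraint: two adjacent in-lexicon words may not both change.
                if (tokens[i - 1] in in_lexicon and tokens[i] in in_lexicon
                        and prev_word != tokens[i - 1] and word != tokens[i]):
                    continue
                score = prev_score + bigram_logp(prev_word, word) - cost
                if score > top_score:
                    top_score, top_path = score, prev_path + [word]
            new_best.append((top_score, top_path))
        best = new_best
    return max(best, key=lambda entry: entry[0])[1]
```

Because the original token is always among its own candidates, every blocked transition still leaves at least one admissible predecessor, which is what allows the full system to search only fringes instead of all paths.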
<Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"> For this work, we are concerned primarily with recall, because providing good suggestions for misspelled queries can be viewed as more important than abstaining from providing alternative query suggestions for valid queries, as long as these suggestions are reasonable (for example, suggesting cowboy ropes for cowboy robes may not have a major cost to a user). A real system would have a component that decides whether to surface a spelling suggestion based on where we want to be on the ROC curve, thus trading off precision and recall.</Paragraph> <Paragraph position="1"> One problem with evaluating a spell checker designed to correct search queries is that evaluation data is hard to get. Even if the system were used by a search engine and click-through information were available, such information would provide only a crude measure of precision and would not allow us to measure recall, since it captures only cases in which the corrections proposed by that particular speller are clicked on by the users.</Paragraph> <Paragraph position="2"> We performed two different evaluations of the proposed system.4 The first evaluation was done on a test set comprising 1044 unique, randomly sampled queries from a daily query log, which were annotated by two annotators. Their inter-annotator agreement rate was 91.3%. 864 of these queries were considered valid by both annotators; for the other 180, the annotators provided spelling corrections. The overall agreement of our system with the annotators was 81.8%. The system suggested 131 alternative queries for the valid set, counted as false positives, and 156 alternative queries for the misspelled set. Table 2 shows the accuracy obtained by the proposed system and results from an ablation study in which we disabled various components of the system to measure their influence on performance.</Paragraph> </Section> <Section position="10" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 The test data sets can be downloaded from </SectionTitle> <Paragraph position="0"> When the trusted lexicon was removed completely, the accuracy of the system on misspelled queries (61.1%) was higher than when only a trusted lexicon and no query-log data were used (52.8%). It can also be observed that the language model built using query logs is by far more important than the channel model employed: using a poorer character error model by setting all edit weights equal did not have a major impact on performance (66.1% recall), while using a poorer language model that employs only unigram statistics from the query logs crippled the system (41.7% recall). Another interesting aspect is related to the number of iterations. Because the first iteration is more conservative than the following iterations, using only one iteration led to fewer false positives but also to a much lower recall (47.2%). Two iterations were sufficient to correct most of the misspelled queries that the full system could correct. While fringes did not have a major impact on recall, they helped avoid false positives (and had a major impact on speed).</Paragraph>
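For concreteness, the outer loop that produces these iterations can be sketched as follows; correct_once stands for a hypothetical single-pass corrector (such as the constrained search sketched earlier), and the iteration cap is illustrative, not a documented system parameter.

```python
# Minimal sketch of the outer iterative-correction loop. `correct_once`
# is a hypothetical single-pass corrector; the cap of five iterations is
# illustrative rather than a documented system setting.

def correct_iteratively(query, correct_once, max_iterations=5):
    current = query
    for _ in range(max_iterations):
        suggestion = correct_once(current)
        if suggestion == current:   # fixed point: no further correction found
            break
        current = suggestion
    return current

# Illustrative trace for the example in Section 4:
#   anol scwartegger -> arnold schwartnegger -> arnold schwarznegger
#   -> arnold schwarzenegger -> (no further correction)
```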
<Paragraph position="1"> [Figure 3: System performance as a function of the number of monthly query logs used to train the language model] Figure 3 shows the performance of the full system as a function of the number of monthly query logs employed. While both the total accuracy and the recall increased when using 2 months of data instead of 1 month, with more query-log data (3 and 4 months) the recall (or accuracy on misspelled queries) still improves, but at the expense of more false positives for valid queries, which leads to a slightly lower overall accuracy. A post-analysis of the results showed that in many cases the system suggested reasonable corrections that differed from the gold-standard ones. Many false positives could be considered reasonable suggestions, although it is not clear whether they would have been helpful to the users.</Paragraph> <Paragraph position="2"> For example, 2002 kawasaki ninja zx6e → 2002 kawasaki ninja zx6r was counted as an error, although the suggestion represents a more popular motorcycle model. In the case of misspelled queries in which the user's intent was not clear, the suggestion made by the system could be considered valid despite the fact that it disagreed with the annotators' choice (e.g. gogle → google instead of the gold-standard correction goggle).</Paragraph> <Paragraph position="3"> To address the problems generated by the fact that the annotators could only guess the user's intent, we performed a second evaluation on a set of queries randomly extracted from query-log data, by sampling pairs of successive queries (q1, q2) sent by the same users in which the queries differ from one another by an unweighted edit distance of at most 1 + (len(q1) + len(q2))/10 (i.e. allowing a point change for every 5 letters). We then presented the list to human annotators, who had the option to reject a pair, choose one of the queries as a valid correction of the other, or propose a correction for both when neither of them was valid but the intended valid query was easy to guess from the sequence, as in example 3 below:
(audio flie, audio file) → audio file
(bueavista, buena vista) → buena vista
(carrabean nooms, carrabean rooms) → caribbean rooms
Table 3 shows the performance obtained by different instantiations of the system on this set.</Paragraph> <Paragraph position="4"> [Table 3: performance on the second evaluation set, which contains misspelled queries that the users had reformulated] The main system disagreed 99 times with the gold standard, in 80 of these cases suggesting a different correction. 40 of the corrections were not appropriate (e.g. porat was corrected by our system to pirate instead of port in chinese porat also called xiamen), 15 were functionally equivalent corrections given our target search engine (e.g. audio flie → audio files instead of audio file), 17 were different valid suggestions (e.g. bellsouth lphone isting → bellsouth phone listings instead of bellsouth telephone listing), while 8 represented gold-standard errors (e.g. the speller correctly suggested brandy sniffters → brandy snifters instead of brandy sniffers). Out of 19 cases in which the system did not make a suggestion, 13 were genuine errors (e.g. paul waskiewiscz with the correct spelling paul waskiewicz), 4 were cases in which the original input was correct, although different from the user's intent (e.g. cooed instead of coed), and 2 were gold-standard errors (e.g. commandos 3 walkthrough had the wrong correction commando 3 walkthrough, as this query refers to a popular videogame called "commandos 3").</Paragraph>
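The pair-sampling criterion used to build this second evaluation set (successive queries from the same user within an unweighted edit distance of at most 1 + (len(q1) + len(q2))/10) can be sketched as follows; the session representation and function names are assumptions made for illustration.

```python
# Minimal sketch of sampling candidate (q1, q2) pairs from user sessions.
# `sessions` is assumed to be an iterable of lists of successive queries
# issued by the same user; the unweighted edit distance is plain
# Levenshtein distance (no weights or transpositions).

def levenshtein(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (cs != ct))) # substitution
        prev = curr
    return prev[-1]

def sample_candidate_pairs(sessions):
    """Yield successive query pairs whose unweighted edit distance is at most
    1 + (len(q1) + len(q2)) / 10, i.e. roughly one point change per 5 letters."""
    for queries in sessions:
        for q1, q2 in zip(queries, queries[1:]):
            if q1 != q2 and levenshtein(q1, q2) <= 1 + (len(q1) + len(q2)) / 10:
                yield (q1, q2)

# Example: list(sample_candidate_pairs([["audio flie", "audio file"]]))
#          -> [("audio flie", "audio file")]
```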
<Paragraph position="6">
Differences   Gold std errors   Format   Diff. valid   Real errors
80+19         8+2               15+0     17+4          40+13
The above table shows a synthesis of this error analysis on the second evaluation set. The first number in each column refers to a precision error (i.e. the speller suggested something different from the gold standard), while the second refers to a recall error (i.e. no suggestion).</Paragraph> <Paragraph position="7"> As a result of this error analysis, we can argue that while the agreement-with-the-gold-standard experiments are useful for measuring the relative importance of components, they do not give us an absolute measure of system usefulness/accuracy.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Agreement Correctness Precision Recall </SectionTitle> <Paragraph position="0">
Agreement: 73.1; Correctness: 85.5; Precision: 88.4; Recall: 85.4.
In the above table, we consider correctness as the relative number of times the suggestion made by the speller was correct or reasonable; precision measures the proportion of correct suggestions among all spelling suggestions made by the system; and recall is computed as the relative number of correct/reasonable suggestions made when such suggestions were needed.</Paragraph> <Paragraph position="1"> As an additional verification, and to confirm the difficulty of the test queries, we sent a set of them to Google and observed that the Google speller's agreement with the gold standard was slightly lower than our system's agreement.</Paragraph> </Section> </Section> </Paper>