<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1113"> <Title>Dynamic Programming Matching for Large Scale Information Retrieval</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> Keywords: Dynamic programming, Corpus-based, Japanese. 1 Introduction </SectionTitle> <Paragraph position="0"> The dynamic programming method is well known for its ability to calculate the edit distance between strings. The method can also be applied to information retrieval: dynamic programming matching can measure the similarity between documents even if there are partial deletions or insertions. However, there are two problems in applying this method to information retrieval. The first is search effectiveness, which is poor because dynamic programming matching lacks an adequate weighting schema. The second is computational efficiency: the lack of an adequate indexing schema means that dynamic programming matching usually has to process the entire document set.</Paragraph> <Paragraph position="1"> Yamamoto et al. proposed a method of dynamic programming matching with acceptable search effectiveness (Yamamoto et al., 2000; Yamamoto, Takeda, and Umemura, 2003). They report that the effectiveness of dynamic programming matching improves by introducing an IDF (Inverse Document Frequency) weighting schema for all strings that contribute to the similarity. They calculate matching weights not only for words but for all strings.</Paragraph> <Paragraph position="2"> Although they report that effectiveness is improved, their method is slower than conventional dynamic programming matching, and much slower than a typical information retrieval system.</Paragraph> <Paragraph position="3"> In this paper, we aim to improve the retrieval efficiency of the dynamic programming method while keeping its search effectiveness. From a mathematical point of view, we have only changed the definition of the weighting.
The mathematical structure of the similarity remains the same as that of the dynamic programming method proposed in (Yamamoto et al., 2000; Yamamoto, Takeda, and Umemura, 2003).</Paragraph> <Paragraph position="4"> Although it has the same definition, the new weighting method makes it possible to build a more efficient information retrieval system by creating the index in advance. To our surprise, we have observed that our proposed method is not only more efficient but also more effective.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Similarities Based on Dynamic </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Programming Matching </SectionTitle> <Paragraph position="0"> In this section, we introduce several similarities proposed in (Yamamoto et al., 2000; Yamamoto, Takeda, and Umemura, 2003). All of them are forms of dynamic programming matching, and all are derived from the edit distance, which has been described by several authors.</Paragraph> <Paragraph position="1"> We have adopted Korfhage's definition: 'the edit distance is the minimum number of edit operations, such as insertion and deletion, which are required to map one string into the other' (Korfhage, 1997).</Paragraph> <Paragraph position="2"> There are three related similarities. The first is dynamic programming matching, which is a simple conversion of the edit distance.
The second is an extension of the first, introducing a character weight for each contributing character.</Paragraph> <Paragraph position="3"> The third, the proposed similarity, is an extension of the second, using string weights instead of character weights.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Dynamic Programming Matching </SectionTitle> <Paragraph position="0"> As stated above, dynamic programming (DP) matching is a conversion of the edit distance. We call this similarity SIM1. While the edit distance (ED) is a measure of difference, counting differing characters between two strings, SIM1 is a measure of similarity, counting matching characters between two strings. ED and SIM1 are defined as follows:</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Definition 2.1 Edit Distance (Korfhage, 1997) </SectionTitle> <Paragraph position="0"> Let α and β be strings, x and y be characters, and &quot;&quot; be the empty string.</Paragraph> <Paragraph position="1"> + If both strings are empty then</Paragraph> <Paragraph position="3"> Let α and β be strings, x and y be characters, and &quot;&quot; be the empty string.</Paragraph> <Paragraph position="4"> + If both strings are empty then</Paragraph> <Paragraph position="6"/> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Character Weight DP Similarity </SectionTitle> <Paragraph position="0"> SIM1 adds 1.0 to the similarity between two strings for every matching character, and this value is constant regardless of the character. Our assumption for the new function is that different characters make different contributions. For example, in Japanese information retrieval, Hiragana characters are usually used for functional words and make a different contribution from Kanji characters, which are usually used for content words.
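Before weights are introduced, it may help to make the unweighted recursions of Section 2.1 concrete. The following Python sketch is a hypothetical reconstruction, assuming the standard insertion/deletion edit distance and an LCS-style SIM1; the function names are illustrative only:

```python
from functools import lru_cache

# Hypothetical reconstruction of ED and SIM1.
# ED counts the insertions/deletions needed to map one string into the other;
# SIM1 counts matching characters (a longest-common-subsequence similarity).

@lru_cache(maxsize=None)
def ed(a, b):
    if not a:
        return len(b)                      # insert all remaining characters of b
    if not b:
        return len(a)                      # delete all remaining characters of a
    if a[0] == b[0]:
        return ed(a[1:], b[1:])            # matching characters cost nothing
    return 1 + min(ed(a[1:], b), ed(a, b[1:]))  # delete from a or insert from b

@lru_cache(maxsize=None)
def sim1(a, b):
    if not a or not b:
        return 0
    if a[0] == b[0]:
        return 1 + sim1(a[1:], b[1:])      # each matching character adds 1.0
    return max(sim1(a[1:], b), sim1(a, b[1:]))

print(sim1("kitten", "sitting"))           # prints 4
```

With memoization via lru_cache, each similarity is computed in O(mn) time for strings of length m and n.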
Thus, it is natural to assign a different similarity weight according to the nature of the character. The following definition of Character Weight DP Similarity adds not 1.0 but a weight that depends on the matching character. We call this similarity SIM2. It resembles Ukkonen's</Paragraph> <Paragraph position="2"/> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 String Weight DP Similarity </SectionTitle> <Paragraph position="0"> The DP procedure usually considers just a single character at a time, but since some long substrings can receive good scores, it is natural to consider all prefixes of the longest common prefix, not just the next character.</Paragraph> <Paragraph position="1"> While SIM2 uses a character weight whenever a character matches between strings, a single character may not be enough. In some cases, even when each character has a low weight, the string as a whole may be a good clue for information retrieval. For example, &quot;chirimenjyako&quot; is a Japanese word that could be a retrieval key word. This word, which means &quot;boiled and dried baby sardines,&quot; consists only of the Hiragana characters &quot;chi-ri-me-n-jya-ko&quot;, but each individual character would make only a small contribution in SIM2.</Paragraph> <Paragraph position="2"> The proposed similarity, called String Weight DP Similarity, is a generalization of SIM2.</Paragraph> <Paragraph position="3"> We call this similarity SIM3.
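The character-weight recursion of SIM2 can be sketched in the same way; the weight table below is a hypothetical stand-in for the Hiragana/Kanji weights discussed above, not the paper's actual values:

```python
from functools import lru_cache

# Hypothetical sketch of SIM2: identical to SIM1 except that a matching
# character x contributes a weight w(x) instead of the constant 1.0.
# Illustrative stand-in weights: "content" characters (X, Y) get a higher
# weight than "functional" characters (a, b).
WEIGHT = {"a": 0.1, "b": 0.1, "X": 2.0, "Y": 2.0}

def w(ch):
    return WEIGHT.get(ch, 1.0)

@lru_cache(maxsize=None)
def sim2(a, b):
    if not a or not b:
        return 0.0
    if a[0] == b[0]:
        return w(a[0]) + sim2(a[1:], b[1:])    # weighted match
    return max(sim2(a[1:], b), sim2(a, b[1:]))
```

Only the contribution of a matching character changes relative to SIM1; the recursion structure is identical.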
SIM3 considers the weights of all matching strings and is defined as follows: Definition 2.4 SIM3 Let α and β be strings, x and y be characters, and &quot;&quot; be the empty string.</Paragraph> <Paragraph position="4"> + If both strings are empty then</Paragraph> <Paragraph position="6"> where γ is the longest string that matches from the first character.</Paragraph> <Paragraph position="8"/> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Weighting Function </SectionTitle> <Paragraph position="0"> Yamamoto et al. have used IDF (Inverse Document Frequency) as the weight for each string. The weight is computed using a Score function as follows: Definition 2.5 Yamamoto et al.'s Score function Let γ be a string, df(γ) the number of documents containing γ in the document set for retrieval, and N the number of documents in the set.</Paragraph> <Paragraph position="2"> The standard one-character-at-a-time DP method assumes that long matches cannot receive exceptionally good scores. In other words, it regards Score(γ) as 0 if the length of γ is greater than one. If the Score function obeys the inequality Score(βγ) &lt; Score(β) + Score(γ) for all substrings β and γ, the best path consists of a sequence of single characters, and we need not consider long phrases. However, we propose a different Score function. It sometimes assigns good scores to long phrases, and therefore SIM2 has to be extended into SIM3 to establish a DP procedure that considers more than one character at a time.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Proposed Weighting Function </SectionTitle> <Paragraph position="0"> Although SIM3, as shown in Section 2.3, has reasonable effectiveness, its computation is harder than that of the edit distance, and much harder than that of the similarity used in a conventional information retrieval system.
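To make this cost concrete, the following Python gives a naive sketch of SIM3 with an IDF-style Score, following Definitions 2.4 and 2.5; the document-frequency table DF and collection size N are hypothetical stand-ins for a real document set:

```python
import math
from functools import lru_cache

# Hypothetical stand-ins for the collection statistics of Definition 2.5.
N = 1000
DF = {"ab": 10, "abc": 2, "c": 500}

def score(g):
    df = DF.get(g, 0)
    if df == 0:
        return 0.0
    return math.log(N / df)    # IDF weight for the matching string g

@lru_cache(maxsize=None)
def sim3(a, b):
    if not a or not b:
        return 0.0
    best = max(sim3(a[1:], b), sim3(a, b[1:]))   # skip one character
    # try every prefix g of the longest common prefix of a and b
    k = 0
    while k != min(len(a), len(b)) and a[k] == b[k]:
        k += 1
        g = a[:k]
        best = max(best, score(g) + sim3(a[k:], b[k:]))
    return best
```

Because every prefix of each longest common prefix must be scored, the recursion explores far more paths than one-character-at-a-time matching.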
In this paper, we have modified the weighting function so that it keeps its effectiveness while improving efficiency. To achieve this improvement, we use SIM3 with the same definition but with a different Score function.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Proposed String Weighting </SectionTitle> <Paragraph position="0"> We reduce the computational cost by limiting the strings that can have positive scores. First, we select bigrams as such strings; in other words, we assign a score of zero if the length of the string is not equal to 2.</Paragraph> <Paragraph position="1"> Several language systems use Kanji characters (e.g., Chinese and Japanese), and the bigram is an effective indexing unit for information retrieval in these languages (Ogawa and Matsuda, 1997). In addition, we may assume that the contribution of a longer string is approximated by the total weight of its bigrams. We have also restricted our attention to infrequent bigrams. Thus, we have restricted the weighting function Score as follows, where cf(γ) is the corpus frequency of γ and K is a threshold determined by the given query.</Paragraph> <Paragraph position="3"> + If the string length is 2 and cf(γ) &lt; K then</Paragraph> <Paragraph position="5"/> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Using a Suffix Array for Indexing </SectionTitle> <Paragraph position="0"> Since we have restricted the number of matching strings, and all the matching strings appear in a query, we can collect all the positions of such strings. To make this possible, we need an index built in advance. We have used a suffix array for this index. Below we summarize our proposed algorithm using a suffix array: I. Make a suffix array of the document set.</Paragraph> <Paragraph position="1"> II. For each query, A. Make a set of substrings consisting of two characters (bigrams).</Paragraph> <Paragraph position="2"> B.
For a given number n, extract the n least frequent bigrams according to their corpus frequency.</Paragraph> <Paragraph position="3"> C. For each bigram from step B, i. Record all positions at which the bigram appears in the query and the document set, ii. Record all documents that contain the bigram. D. For each document recorded, i. Compute the similarity between the query and the document with SIM3, using the recorded positions of the corresponding bigrams.</Paragraph> <Paragraph position="4"> ii. Assign the similarity to the document.</Paragraph> <Paragraph position="5"> E. Extract the 1000 most similar documents from the recorded documents as the retrieval result for the query.</Paragraph> <Paragraph position="6"> We call the retrieval method described above Fast Dynamic Programming (FDP). In general, retrieval systems use indexes to find documents, and FDP also uses an index. However, unlike conventional methods, FDP requires information not only on document identifiers but also on the positions of bigrams.</Paragraph> <Paragraph position="7"> Manber and Myers proposed a data structure called the &quot;suffix array&quot; (Manber and Myers, 1993). Figure 1 shows an example of a suffix array. Each suffix is represented by one integer corresponding to its position. We use this suffix array to find the positions of the selected bigrams. A suffix array can be created in O(N log(N)) time because we need to sort all suffixes in alphabetical order. We can then obtain the positions of any string in O(log(N)) time by a binary search over the suffixes.</Paragraph> </Section> </Section> </Paper>