<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-3024">
  <Title>The Wild Thing!</Title>
  <Section position="7" start_page="94" end_page="95" type="evalu">
    <SectionTitle>
5 Indexing and Compression
</SectionTitle>
    <Paragraph position="0"> The k-best string matching problem raises a number of interesting technical challenges. We have two types of language models: trigram language models and long lists (for finite languages such as the 7 million most popular web queries).</Paragraph>
    <Paragraph position="1"> The long lists are indexed with a suffix array.</Paragraph>
    <Paragraph position="2"> Suffix arrays generalize very nicely to phone mode, as described below. We treat the list of web queries as a text of N bytes. (Newlines are replaced with end-of-string delimiters.) The suffix array, S, is a sequence of N ints. The array is initialized with the ints from 0 to N−1. Thus, S[i] = i, for 0 ≤ i &lt; N. Each of these ints represents a string, starting at position i in the text and extending to the end of the string. S is then sorted alphabetically.</Paragraph>
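The construction above can be sketched in a few lines of Python (a simplified in-memory version; the sample text is illustrative, not the paper's query log):

```python
# Build a suffix array over a list of queries joined by end-of-string
# delimiters, exactly as described: S[i] = i, then sort alphabetically.
text = "mail\0mailbox\0gmail\0maps\0"   # toy stand-in for the query list
N = len(text)

# Initialize S with 0..N-1, then sort the positions by the suffix each
# one denotes. (A production implementation would compare suffixes in
# place rather than slicing.)
S = sorted(range(N), key=lambda i: text[i:])
```

After sorting, consecutive entries of S point at alphabetically adjacent suffixes, which is what makes the range lookups in the next paragraph possible.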
    <Paragraph position="3"> Suffix arrays make it easy to find the frequency and location of any substring. For example, given the substring &quot;mail,&quot; we find the first and last suffix in S that starts with &quot;mail.&quot; The gap between these two is the frequency. Each suffix in the gap points to a super-string of &quot;mail.&quot; To generalize suffix arrays for phone mode, we replace alphabetical order (strcmp) with phone order (phone-strcmp). Both strcmp and phone-strcmp consider one character at a time. In standard alphabetic ordering, 'a' &lt; 'b' &lt; 'c', but in phone-strcmp, characters that map to the same key on the phone keypad are treated as equivalent.</Paragraph>
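Both ideas, the first/last-suffix frequency lookup and the phone-order comparison, can be sketched as follows (an assumed simplified version; the sample text and keypad map are illustrative):

```python
import bisect

text = "mail\0gmail\0mailbox\0nail\0"
S = sorted(range(len(text)), key=lambda i: text[i:])
suffixes = [text[i:] for i in S]        # materialized here for clarity only

def freq_and_starts(sub):
    # First and last suffix in S starting with `sub`; the gap between
    # them is the frequency, and each entry in the gap is a super-string.
    lo = bisect.bisect_left(suffixes, sub)
    hi = bisect.bisect_right(suffixes, sub + "\x7f")
    return hi - lo, sorted(S[j] for j in range(lo, hi))

count, starts = freq_and_starts("mail")

# Phone mode: letters on the same keypad digit compare as equal, so we
# can reuse the same machinery with a phone-order sort key.
KEYPAD = {c: d for d, ls in {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
                             '6': 'mno', '7': 'pqrs', '8': 'tuv',
                             '9': 'wxyz'}.items() for c in ls}

def phone_key(s):
    # phone-strcmp as a sort key: replace each letter by its keypad digit.
    return ''.join(KEYPAD.get(c, c) for c in s)

S_phone = sorted(range(len(text)), key=lambda i: phone_key(text[i:]))
```

With the toy text above, `freq_and_starts("mail")` finds the three super-strings of "mail" (in "mail", "gmail", and "mailbox").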
    <Paragraph position="4"> We generalize suffix arrays to take advantage of popularity weights. We don't want to find all queries that contain the substring &quot;mail,&quot; but rather just the k-best (most popular). The standard suffix array method will work if we add a filter on the output that searches over the results for the k-best. However, that filter could take O(N) time if there are many matches, as there typically are for short queries.</Paragraph>
    <Paragraph position="5"> An improvement is to sort the suffix array by both popularity and alphabetical order, alternating between the two on even and odd depths in the tree: one level splits by the first order, the next level by the second order, and so on, using a construction vaguely analogous to KD-trees (Bentley, 1975).</Paragraph>
    <Paragraph position="6"> When searching a node split by alphabetical order, we proceed as we would for a standard suffix array. When searching a node split by popularity, we search the more popular half before the less popular half. If there are many matches, as there are for short strings, the index makes it easy to find the top-k quickly, and we rarely need to search the second half. If the prefix is rare, we may have to search both halves; in the worst case, where the input substring matches nothing in the table, the popularity splits (half of all splits) are useless.</Paragraph>
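A toy sketch of this alternating structure (a hypothetical simplification, loosely KD-tree-like, not the paper's implementation; the sample queries and weights are invented):

```python
def build(entries, depth=0):
    # entries: list of (query, popularity) pairs. Even depths split
    # alphabetically around a pivot string; odd depths split by
    # popularity, more popular half on the left.
    if len(entries) <= 2:
        return ('leaf', entries)
    if depth % 2 == 0:
        entries = sorted(entries)                    # alphabetical split
        mid = len(entries) // 2
        return ('alpha', entries[mid][0],
                build(entries[:mid], depth + 1), build(entries[mid:], depth + 1))
    entries = sorted(entries, key=lambda e: -e[1])   # popularity split
    mid = len(entries) // 2
    return ('pop', None,
            build(entries[:mid], depth + 1), build(entries[mid:], depth + 1))

def search(node, prefix, k, out):
    # Collect up to k prefix matches, visiting popular halves first.
    if len(out) >= k:
        return
    if node[0] == 'leaf':
        out.extend(e for e in node[1] if e[0].startswith(prefix))
        return
    kind, pivot, left, right = node
    if kind == 'pop':
        search(left, prefix, k, out)    # more popular half first
        search(right, prefix, k, out)   # skipped once k results are found
    else:
        # Standard prefix pruning on the alphabetical split.
        if prefix <= pivot:
            search(left, prefix, k, out)
        if pivot <= prefix or pivot.startswith(prefix):
            search(right, prefix, k, out)
```

For frequent prefixes the popularity splits let the search stop early; for rare prefixes both halves of a popularity node may need to be visited, matching the worst case described above.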
    <Paragraph position="7"> Lookup is O(sqrt N).3 Wildcard matching is, of course, a different task from substring matching. Finite State Machines (Mohri et al., 2002) are the right way to think about the k-best string matching problem with wildcards. In practice, the input strings often contain long anchors of constants (wildcard-free substrings). Suffix arrays can use these anchors to generate a list of candidates that are then filtered by a regex package.</Paragraph>
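The anchor-then-filter strategy can be sketched as follows (a hypothetical simplification: a plain scan stands in for the suffix-array candidate lookup, `*` is the wildcard, and the function and data names are invented):

```python
import re

def kbest_wildcard(pattern, queries, k):
    # queries: list of (query, popularity); pattern uses '*' as wildcard.
    # 1. Pull the longest wildcard-free anchor out of the pattern.
    anchors = [a for a in pattern.split('*') if a]
    anchor = max(anchors, key=len) if anchors else ''
    # 2. Use the anchor to generate candidates (a suffix-array range
    #    lookup in the real system; a linear scan here).
    candidates = [q for q in queries if anchor in q[0]]
    # 3. Filter the candidates with a regex built from the pattern.
    rx = re.compile(re.escape(pattern).replace(r'\*', '.*'))
    hits = [q for q in candidates if rx.fullmatch(q[0])]
    # 4. Return the k most popular survivors.
    return sorted(hits, key=lambda q: -q[1])[:k]
```

The anchor keeps the candidate set small, so the relatively expensive regex match runs over few strings.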
    <Paragraph position="8"> 3 Let F(N) be the work to process N items on the frequency splits and let A(N) be the work to process N items on the alphabetical splits. In the worst case, F(N) = 2A(N/2) + C1 and A(N) = F(N/2) + C2, where C1 and C2 are two constants. In other words, F(N) = 2F(N/4) + C, where C = C1 + 2C2.</Paragraph>
    <Paragraph position="9"> We guess that F(N) = a sqrt(N) + b, where a and b are constants. Substituting this guess into the recurrence, the dependencies on N cancel. Thus, we conclude, F(N) = O(sqrt N).</Paragraph>
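The closed form can be sanity-checked numerically: iterating the recurrence F(N) = 2F(N/4) + C, the ratio F(4N)/F(N) should approach 2, the signature of sqrt(N) growth (C = 1 below is an arbitrary illustrative constant):

```python
# Iterate the footnote's recurrence F(N) = 2*F(N/4) + C directly and
# check that quadrupling N roughly doubles F(N), i.e. F(N) = O(sqrt N).
def F(n, c=1.0):
    return 1.0 if n <= 1 else 2 * F(n / 4, c) + c

ratios = [F(4 * n) / F(n) for n in (4**6, 4**8, 4**10)]
```

Each ratio is close to 2 and gets closer as N grows, consistent with the O(sqrt N) conclusion.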
    <Paragraph position="10"> Memory is limited in many practical applications, especially in the mobile context. Much has been written about lossless compression of language models. For trigram models, we use a lossy method inspired by the Unix Spell program (McIlroy, 1982). We map each trigram &lt;x, y, z&gt; into a hash code h = (V^2·x + V·y + z) mod P, where V is the size of the vocabulary and P is an appropriate prime. P trades off memory for loss. The cost to store N trigrams is N·[1/ln 2 + log2(P/N)] bits.</Paragraph>
    <Paragraph position="11"> The loss, the probability of a false hit, is 1/P.</Paragraph>
    <Paragraph position="12"> The N trigrams are hashed into codes h, as above.</Paragraph>
    <Paragraph position="13"> The codes are sorted. The differences, x, are encoded with a Golomb code (Witten et al., 1999), which is an optimal Huffman code, assuming that the differences are exponentially distributed, which they will be if the hash codes are Poisson distributed.</Paragraph>
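A toy sketch of the hash-sort-delta-encode pipeline (V, P, and the trigram ids are illustrative, and the Golomb coder below is the simplified fixed-width-remainder variant, an exact Rice code when m is a power of two, not necessarily the paper's coder):

```python
import math

# Hash each trigram <x, y, z> to h = (V^2*x + V*y + z) mod P, sort the
# codes, and encode the gaps between consecutive codes.
V, P = 1000, 999983                     # toy vocabulary size; a prime
trigrams = [(12, 7, 301), (5, 5, 5), (900, 2, 44)]

codes = sorted((V * V * x + V * y + z) % P for (x, y, z) in trigrams)
deltas = [codes[0]] + [b - a for a, b in zip(codes, codes[1:])]

def golomb_encode(x, m):
    # Simplified Golomb code with parameter m: unary-coded quotient,
    # then a fixed-width binary remainder.
    q, r = divmod(x, m)
    width = max(1, math.ceil(math.log2(m)))
    return '1' * q + '0' + format(r, '0{}b'.format(width))

bits = [golomb_encode(d, 4096) for d in deltas]
```

Because the sorted codes reconstruct from the running sum of the deltas, only the (mostly small) gaps need to be stored, which is where the N·[1/ln 2 + log2(P/N)] bit cost comes from.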
  </Section>
</Paper>