<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1104"> <Title>Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus</Title> <Section position="3" start_page="28" end_page="33" type="metho"> <SectionTitle> 2 Computing tf and df for all substrings </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="28" end_page="29" type="sub_section"> <SectionTitle> 2.1 Suffix arrays </SectionTitle> <Paragraph position="0"> A suffix array is a data structure designed to make it convenient to compute term frequencies for all substrings in a corpus. Figure 1 shows an example of a suffix array for a corpus of N=6 words. A suffix array, s, is an array of all N suffixes, pointers to substrings that start at position i and continue to the end of the corpus, sorted alphabetically. The following very simple C function, suffix_array, takes a corpus as input and returns a suffix array.</Paragraph> <Paragraph position="1">

#include <stdlib.h>
#include <string.h>

/* comparator for qsort */
int suffix_compare(const void *a, const void *b){
    return strcmp(*(char **)a, *(char **)b);
}

/* The input is a string, terminated with a null */
char **suffix_array(char *corpus){
    int i, N = strlen(corpus);
    char **result = (char **)malloc(N*sizeof(char *));
    /* initialize result[i] with the ith suffix */
    for(i = 0; i < N; i++) result[i] = corpus + i;
    qsort(result, N, sizeof(char *), suffix_compare);
    return result;
}

Nagao and Mori (1994) describe this procedure, and report that it works well on their corpus, and that it requires O(NlogN) time, assuming that the sort step requires O(NlogN) comparisons, and that each comparison requires O(1) time. We tried this procedure on our two corpora, and it worked well for the Japanese one, but unfortunately, it can go quadratic for a corpus with long repeated substrings, where strcmp takes O(N) time rather than O(1) time. For our English corpus, after 50 hours of cpu time, we gave up and turned to Doug McIlroy's implementation (http://cm.bell-labs.com/cm/cs/who/doug/ssort.c) of Manber and Myers' (1993) algorithm, which took only 2 hours. For a corpus that would otherwise go quadratic, Manber and Myers' algorithm is well worth the effort, but otherwise, the procedure described above is simpler, and often a bit faster.</Paragraph> <Paragraph position="2"> As mentioned above, suffix arrays were designed to make it easy to compute term frequencies (tf). If you want the term frequency of "to be," you can do a binary search to find the first and last positions in the suffix array that start with this phrase, i and j, and then tf("to be") = j-i+1. In this case, i=5 and j=6, and consequently, tf("to be") = 6-5+1 = 2. Similarly, tf("be") = 2-1+1 = 2, and tf("to") = 6-5+1 = 2. This straightforward method of computing tf requires O(logN) string comparisons, though as before, each string comparison could take O(N) time. There are more sophisticated algorithms that take O(logN) time, even for corpora with long repeated substrings.</Paragraph> <Paragraph position="3"> A closely related concept is lcp (longest common prefix). Lcp is a vector of length N, where lcp[i] indicates the length of the common prefix between the ith suffix and the (i+1)st suffix in the suffix array. Manber and Myers (1993) showed how to compute the lcp vector in O(NlogN) time, even for corpora with long repeated substrings, though for many corpora, the complications required to avoid quadratic behavior are unnecessary.</Paragraph>
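For concreteness, the lcp vector can also be filled in the obvious way by comparing each pair of adjacent suffixes directly. The fragment below is an illustrative sketch, not part of the paper; it assumes the same headers and the 0-indexed array returned by suffix_array above, and it is exactly the simple approach that can go quadratic on a corpus with long repeated substrings, which is what Manber and Myers' construction avoids.

/* Illustrative sketch (not from the paper): lcp[i] is the length of the
   common prefix of s[i] and s[i+1].  Worst case O(N^2) when the corpus
   has long repeated substrings. */
int *lcp_vector(char **s, int N){
    int i, k;
    int *lcp = (int *)malloc(N * sizeof(int));
    for(i = 0; i < N - 1; i++){
        for(k = 0; s[i][k] != '\0' && s[i][k] == s[i+1][k]; k++)
            ;
        lcp[i] = k;
    }
    lcp[N-1] = 0; /* no suffix follows the last one */
    return lcp;
}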
<Paragraph position="4"> [Figure 1: the suffix array for the example corpus "to be or not to be".]</Paragraph> </Section> <Section position="2" start_page="29" end_page="31" type="sub_section"> <SectionTitle> 2.2 Classes of substrings </SectionTitle> <Paragraph position="0"> Thus far we have seen how to compute tf for a single ngram, but how do we compute tf and df for all ngrams? There are N(N+1)/2 substrings in a text of size N. If every substring has a different tf and df, the counting algorithm would require at least quadratic time and space. Fortunately, many substrings have the same tf and the same df. We will cluster the N(N+1)/2 substrings into at most 2N-1 classes and compute tf and df over the classes. There will be at most N distinct values of RIDF.</Paragraph> <Paragraph position="1"> Let <i,j> be an interval on the suffix array: {s[i], s[i+1], ..., s[j]}. We call the interval LCP-delimited if the lcp's are larger inside the interval than at its boundary:

min(lcp[i], lcp[i+1], ..., lcp[j-1]) > max(lcp[i-1], lcp[j])    (1)

</Paragraph> <Paragraph position="3"> In Figure 1, for example, the interval <5,6> is LCP-delimited, and as a result, tf("to") = tf("to be") = 2, and df("to") = df("to be").</Paragraph> <Paragraph position="4"> The interval <5,6> is associated with a class of substrings: "to" and "to be." Classes will turn out to be important because all of the substrings in a class have the same tf (property 1) and the same df (property 2). In addition, we will show that classes partition the set of substrings (property 3) so that we can compute tf and df on the classes, rather than the substrings. Doing so is much more efficient because there are many fewer classes than substrings (property 4).</Paragraph> <Paragraph position="5"> Classes of substrings are defined to be the (not necessarily least) common prefixes in an interval. In Figure 1, for example, both "to" and "to be" are common prefixes throughout the interval <5,6>. That is, every suffix in the interval <5,6> starts with "to," and every suffix also starts with "to be". More formally, we define class(<i,j>) as {s[i]_m | LBL < m <= SIL}, where s[i]_m is a substring (the first m characters of s[i]), LBL (longest boundary lcp) is the right hand side of (1), and SIL (shortest interior lcp) is the left hand side of (1). In Figure 1, for example, SIL(<5,6>) is the length of "to be", LBL(<5,6>) = 0, and class(<5,6>) = {"to", "to be"}.</Paragraph> <Paragraph position="8"> Figure 2 shows six LCP-delimited intervals and the LBL and SIL of <2,4>. For <2,4>, the bounding lcp's are lcp[1]=2 and lcp[4]=3 (LBL=3), and the interior lcp's are lcp[2]=4 and lcp[3]=6 (SIL=4). The interval <2,4> is LCP-delimited, because LBL < SIL. Class(<2,4>) = {s[2]_m | 3 < m <= 4} = {"aacc"}. The interval <3,3> is LCP-delimited because SIL is infinite and LBL=6.*)</Paragraph> <Paragraph position="9"> *) SIL(<i,i>) is defined to be infinity, and consequently, all intervals <i,i> are LCP-delimited, for all i.</Paragraph> <Paragraph position="10"> [Figure: Bounding lcps, LBL, SIL, and interior lcps of <2,4>. Vertical lines denote lcps. Gray area denotes endpoints of substrings in class(<2,4>).]</Paragraph> <Paragraph position="11"> The interval <2,3> is not LCP-delimited because SIL is 4 and LBL is 6 (LBL > SIL).</Paragraph>
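The definitions of SIL, LBL and class translate directly into code. The sketch below is illustrative only (it is not from the paper); it assumes a 0-indexed lcp vector in which lcp[k] is the lcp between s[k] and s[k+1], treats boundary lcps that fall outside the vector as 0 and SIL(<i,i>) as infinite, and tests condition (1).

#include <limits.h>

int interval_sil(int *lcp, int i, int j){          /* shortest interior lcp */
    int k, sil = INT_MAX;                          /* <i,i> has SIL = infinity */
    for(k = i; k < j; k++)
        if(lcp[k] < sil) sil = lcp[k];
    return sil;
}

int interval_lbl(int *lcp, int N, int i, int j){   /* longest boundary lcp */
    int left  = (i > 0)     ? lcp[i-1] : 0;
    int right = (j < N - 1) ? lcp[j]   : 0;
    return left > right ? left : right;
}

int is_lcp_delimited(int *lcp, int N, int i, int j){
    /* class(<i,j>) = { s[i]_m | LBL < m <= SIL }; nonempty iff SIL > LBL */
    return interval_sil(lcp, i, j) > interval_lbl(lcp, N, i, j);
}

On the lcp values of Figure 2 (with the figure's 1-based indices shifted down by one), this check confirms, for example, that <2,4> and <3,4> are LCP-delimited while <2,3> is not.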
<Paragraph position="12"> By construction, the suffixes within the interval <i,j> all start with the substrings in class(<i,j>), and no suffixes outside this interval start with these substrings. As a result, if s1 and s2 are two substrings in class(<i,j>), then tf(s1) = tf(s2) = j-i+1 (property 1) and df(s1) = df(s2) (property 2).</Paragraph> <Paragraph position="14"> The calculation of df is more complicated than tf, and will be discussed in section 2.4.</Paragraph> <Paragraph position="15"> It is not uncommon for an LCP-delimited interval to be nested within another. In Figure 2, for example, the interval <3,4> is nested within <2,4>. The computation of df in section 2.4 will take advantage of a very convenient nesting property. Given two LCP-delimited intervals, either one is nested within the other (e.g., <2,4> and <3,4>), or one precedes the other (e.g., <2,2> and <3,4>), but they cannot overlap. Thus, for example, the intervals <1,3> and <2,4> cannot both be LCP-delimited because they overlap.</Paragraph> <Paragraph position="16"> Because of this nesting property, it is possible to express the df of an interval recursively in terms of its constituents or subintervals.</Paragraph> <Paragraph position="17"> As mentioned above, we will use the following partitioning property so that we can compute tf and df on the classes rather than on the substrings.</Paragraph> <Paragraph position="18"> Property 3: the classes partition the set of all substrings in a text.</Paragraph> <Paragraph position="19"> There are two parts to this argument: every substring belongs to at most one class (property 3a), and every substring belongs to at least one class (property 3b).</Paragraph> <Paragraph position="20"> Demonstration of property 3a (proof by contradiction): Suppose there is a substring, s, that is a member of two classes: class(<i,j>) and class(<u,v>). There are three possibilities: one interval precedes the other, they are properly nested, or they overlap. The only interesting case is the nesting case. Suppose without loss of generality that <u,v> is nested within <i,j>. Then there must be a bounding lcp of <u,v> that is smaller than any lcp within <u,v>. This bounding lcp must be within <i,j>, and as a result, class(<i,j>) and class(<u,v>) must be disjoint: the substrings in class(<i,j>) are no longer than this bounding lcp, while those in class(<u,v>) are strictly longer. Therefore, s cannot be in both classes.</Paragraph> <Paragraph position="22"> Demonstration of property 3b (constructive argument): Let s be an arbitrary substring in the corpus. There will be at least one suffix in the suffix array that starts with s. Let i be the first such suffix and let j be the last such suffix. By construction, the interval <i,j> is LCP-delimited (LBL(<i,j>) < |s| and SIL(<i,j>) >= |s|), and s is an element of class(<i,j>).</Paragraph> <Paragraph position="23"> Finally, as mentioned above, computing over classes is much more efficient than computing over the substrings themselves because there are many fewer classes (at most 2N-1) than substrings (N(N+1)/2).</Paragraph> <Paragraph position="24"> Property 4: there are at most N classes with tf = 1 and at most N-1 classes with tf > 1.</Paragraph> <Paragraph position="25"> The first clause is relatively straightforward. There are N intervals <i,i>. These are all and only the intervals with tf=1.
By construction, these intervals are LCP-delimited.</Paragraph> <Paragraph position="26"> To argue the second clause, we will make use of a uniqueness property: an LCP-delimited interval <i,j> can be uniquely determined by its SIL and a representative element k (i <= k < j).</Paragraph> <Paragraph position="27"> Suppose there were two distinct intervals, <i,j> and <u,v>, with the same SIL, SIL(<i,j>) = SIL(<u,v>), and the same representative, i <= k < j and u <= k < v. Since they share a common representative, k, the two intervals must overlap. But since they are distinct, there must be a distinguishing element, d, that is in one but not the other. One of these distinguishing elements, d, would have to be a bounding lcp in one and an interior lcp in the other. But then the two intervals couldn't both be LCP-delimited.</Paragraph> <Paragraph position="28"> Given this uniqueness property, we can determine the N-1 upper bound on the number of LCP-delimited intervals by considering the N-1 elements in the lcp vector. Each of these elements, lcp[k], has the opportunity to become the SIL of an LCP-delimited interval <i,j> with a representative k. Thus there could be as many as N-1 LCP-delimited intervals (though there could be fewer if some of the opportunities don't work out).</Paragraph> <Paragraph position="29"> Moreover, there couldn't be any more intervals with tf > 1, because if there were one, its SIL would have to be in the lcp vector. (Note that this lcp counting argument excludes the intervals with tf = 1 discussed above, because their SILs need not be in the lcp vector.) From property 4, it follows that there are at most N distinct values of RIDF. The N intervals <i,i> have just one RIDF value since tf = df = 1 for these intervals. The other N-1 intervals could have another N-1 RIDF values.</Paragraph> <Paragraph position="30"> In summary, the four properties taken collectively make it practical to compute tf, df and RIDF over a relatively small number of classes; it would have been prohibitively expensive to compute these quantities directly over the N(N+1)/2 substrings.</Paragraph> </Section> <Section position="3" start_page="31" end_page="32" type="sub_section"> <SectionTitle> 2.3 Calculating classes using Suffix Array </SectionTitle> <Paragraph position="0"> This section will describe a single-pass procedure for computing classes. Since LCP-delimited intervals obey a convenient nesting property, the procedure is based on a push-down stack. The procedure outputs 4-tuples, <s[i], LBL, SIL, tf>, one for each LCP-delimited interval. The stack elements are pairs (x,y), where x is an index, typically the left edge of a candidate LCP-delimited interval, and y is the SIL of this candidate interval. Typically, y=lcp[x], though not always, as we will see in Figure 5.</Paragraph> <Paragraph position="1"> The algorithm sweeps over the suffixes in the suffix array s[1..N] and their lcp's lcp[1..N] (lcp[N]=0) successively. While the lcp's of the suffixes are monotonically increasing, the indexes and lcp's of the suffixes are pushed onto a stack. When the algorithm finds the i-th suffix whose lcp[i] is less than the lcp on the top of the stack, the index and lcp on the top are popped off the stack. Popping is repeated until the lcp on the top is no greater than lcp[i].</Paragraph> <Paragraph position="2"> A stack element popped out generates a class. Suppose that a stack element composed of an index i and lcp[i] is popped out by lcp[j]. Lcp[i] is used as the SIL. The LBL is either the lcp of the next element down in the stack or lcp[j]: if that next element will also be popped out by lcp[j], then the algorithm uses its lcp as the LBL, else it uses lcp[j]. The tf is j-i+1, the number of suffixes from index i through index j inclusive.</Paragraph>
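The stack procedure just described can be rendered in a few lines of C. The sketch below is illustrative only (it is not the paper's Figure 4); it assumes 0-indexed arrays in which lcp[k] is the lcp between s[k] and s[k+1], treats the boundary lcp after the last suffix as 0, and prints one <left, right, LBL, SIL, tf> record per class with tf > 1.

#include <stdio.h>
#include <stdlib.h>

/* Rough sketch, not the paper's Figure 4 verbatim: enumerate all
   LCP-delimited intervals with tf > 1 from a 0-indexed lcp vector. */
void enumerate_classes(int *lcp, int N){
    int *idx = (int *)malloc((N + 1) * sizeof(int));  /* candidate left edges */
    int *sil = (int *)malloc((N + 1) * sizeof(int));  /* candidate SILs       */
    int top = 0;
    idx[0] = 0; sil[0] = -1;                          /* dummy bottom element */
    for (int i = 0; i < N; i++) {
        int b = (i < N - 1) ? lcp[i] : 0;             /* bounding lcp after i */
        int left = i;
        while (sil[top] > b) {                        /* close intervals ending at i */
            left = idx[top];
            int s = sil[top]; top--;
            int lbl = (sil[top] > b) ? sil[top] : b;  /* larger bounding lcp  */
            printf("<%d,%d>  LBL=%d  SIL=%d  tf=%d\n", left, i, lbl, s, i - left + 1);
        }
        if (sil[top] < b) {                           /* open a new candidate,     */
            top++;                                    /* starting at the leftmost  */
            idx[top] = left;                          /* popped index (or at i)    */
            sil[top] = b;
        }
    }
    free(idx); free(sil);
}

On the lcp values of Figure 2, the two pops triggered by the small bounding lcp of 3 produce the tuples <s[3], 4, 6, 2> and <s[2], 3, 4, 3> discussed below (modulo the shift between the figure's 1-based indices and the 0-based indices used here).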
<Paragraph position="3"> Figure 4 shows the detailed algorithm for computing all classes with tf > 1. If classes with tf = 1 are needed, a line to output them can easily be added to the algorithm. The expressions push(x,y) and pop(x,y) operate on the stack in the obvious way, but note that x and y are inputs for push and outputs for pop. The expression top(x,y) is equivalent to pop(x,y) followed by push(x,y); it reads the top of the stack without changing the stack pointer.</Paragraph> <Paragraph position="5"> As mentioned above, the stack elements are typically pairs (x,y) where y=lcp[x], but not always. Pairs are typically pushed onto the stack by line 6, push(i, lcp[i]), and consequently y=lcp[x] in many cases, but some pairs are pushed on by line 15. Figure 5 (a) shows the typical case with the suffix array in Figure 2. At this point, i=3 and the stack contains 4 pairs: a dummy element (-1, -1), followed by three pairs generated by line 6: (1, lcp[1]), (2, lcp[2]), (3, lcp[3]). In contrast, Figure 5 (b) shows an atypical case. In between snapshot (a) and snapshot (b), two LCP-delimited intervals were generated, <s[3], 4, 6, 2> and <s[2], 3, 4, 3>, and then the pair (2, 3) was pushed onto the stack by line 15, push(index1, lcp[i]), to capture the fact that there is a candidate LCP-delimited interval starting at index1=2, spanning past the representative element i=4, with an SIL of lcp[i=4].</Paragraph> <Paragraph position="6"> [Figure 5: stack snapshots (a) and (b) of (index, lcp) pairs; in (b), (3, 6) and (2, 4) have been popped out at s[4] and (2, 3) has been pushed above (1, 2).]</Paragraph> </Section> <Section position="4" start_page="32" end_page="33" type="sub_section"> <SectionTitle> 2.4 Computing df for all classes </SectionTitle> <Paragraph position="0"> This section will extend the algorithm in Figure 4 to include the calculation of df. Straightforwardly computing df independently for each class would require at least quadratic time, because the program would have to check the document ids of all suffixes in an interval (at most N) for all classes (at most N-1).</Paragraph> <Paragraph position="1"> Instead of this, we will take advantage of the nesting property of intervals. The df for one interval can be computed recursively in terms of its constituents (nested subintervals), avoiding unnecessary recomputation.</Paragraph> <Paragraph position="2"> The stack elements in Figure 5 are augmented with two additional counters: (1) a df counter for summing the dfs over the nested subintervals, and (2) a duplication counter for adjusting for overcounting of documents that are referenced in multiple subintervals. The df for an interval is simply the difference of these two counters, that is, the sum of the dfs of the subintervals, minus the duplication. A C code implementation can be found at http://www.milab.is.tsukuba.ac.jp/~myama/oedf/tfdf.c. The df counters are relatively straightforward to implement. The crux of the problem is the adjustment for duplication. The adjustment makes use of a document link table, as illustrated in Figure 6. The left two columns indicate that suffixes s[101], s[104] and s[107] are all in document 382, and that several other suffixes are also in the same documents. The third column links together suffixes that are in the same document. Note, for example, that there is a pointer from suffix 104 to 101, indicating that s[104] and s[101] are in the same document. The suffixes in one of these linked lists are kept sorted by their order in the suffix array. When the algorithm is processing s[i], it searches the stack to find the suffix s[k] with the largest k such that k < i and s[i] and s[k] are in the same document. This search can be performed in O(logN) time.</Paragraph>
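As a rough illustration of the duplication bookkeeping (the names below are hypothetical and not taken from the paper's code), the per-suffix step can be sketched as follows: look up the nearest previous suffix in the same document, using the compressed per-document table described at the end of this section, and charge one duplication to the innermost candidate interval on the stack that contains that previous suffix.

/* Hypothetical helper names; a sketch of the duplication step only, not of
   the full df propagation.  last_seen[d] holds the most recent suffix-array
   index whose suffix lies in document d (or -1); stack_idx[1..top] are the
   left edges of the candidate intervals currently on the stack, increasing
   from bottom to top; dup[1..top] are their duplication counters. */

/* Innermost candidate containing position k: binary search over the sorted
   left edges, O(logN).  Returns 0 if only the dummy bottom element qualifies. */
int find_candidate(int *stack_idx, int top, int k){
    int lo = 1, hi = top, ans = 0;
    while(lo <= hi){
        int mid = (lo + hi) / 2;
        if(stack_idx[mid] <= k){ ans = mid; lo = mid + 1; }
        else hi = mid - 1;
    }
    return ans;
}

/* Called once per suffix s[i]; doc_of[i] is its document id. */
void note_duplication(int i, int *doc_of, int *last_seen,
                      int *stack_idx, int *dup, int top){
    int d = doc_of[i];
    int k = last_seen[d];             /* nearest previous suffix in document d */
    if(k >= 0){
        int e = find_candidate(stack_idx, top, k);
        if(e > 0) dup[e]++;           /* document d would otherwise be counted */
    }                                 /* twice within this candidate interval  */
    last_seen[d] = i;                 /* update the compressed link table      */
}

In the example of Figure 2, when s[4] is processed, last_seen for its document points at s[1], so the duplication counter of the stack element whose left edge is 1 is incremented, as described for Figure 8 below.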
<Paragraph position="3"> Consider an interval I1 in a suffix array and four suffixes included in the same document. I1 has four immediate constituent subintervals. S[j] is included in the same document as s[i], so that document would be counted twice when computing the df of I1. At the point of processing s[j], the algorithm therefore increments the duplication counter of I1 to cancel the df count of s[j]. In the same way, the df count of s[k] has to be canceled when computing the df of I1.</Paragraph> <Paragraph position="4"> Figure 8 shows a snapshot of the stack after processing s[4] in Figure 2. Each stack element is a 4-tuple (i, lcp, df, dc) of a suffix array index, an lcp, a df counter and a duplication counter, e.g., (2, 3, 3, 0). Figure 2 shows that s[1] and s[4] are in the same document. Looking up the document link table, the algorithm knows that s[1] is the nearest suffix that is in the same document as s[4]. The duplication counter of the element of s[1] is incremented. The duplicate counting of s[1] and s[4] for the class generated by s[1] will be avoided using this duplication counter.</Paragraph> <Paragraph position="5"> At any point in the processing, the algorithm uses only a part of the document link table: it needs only the nearest index on the link, not the whole of the link. So we can compress the link table into a dynamic one in which an entry for each document holds the nearest index. Figure 9 shows the nearest-index table for each document after processing s[4].</Paragraph> <Paragraph position="6"> The final algorithm to calculate all classes with tf and df takes O(NlogN) time and O(N) space in the worst case.</Paragraph> </Section> </Section> </Paper>