<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1152"> <Title>Efficient Unsupervised Recursive Word Segmentation Using Minimum Description Length</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Overview of the Approach </SectionTitle> <Paragraph position="0"> In this section we provide an overview of our approach to greedy construction of a set of morphs (a dictionary), using a minimum description length (MDL) criterion (Barron et al., 1998); we present three alternative MDL-type criteria below, of varying levels of sophistication. The idea is to initialize the dictionary of morphs to the set of all word types in the corpus, and incrementally refine it by resegmenting affixes (either prefixes or suffixes) from the corpus. Resegmenting on a prefix p (depicted in Figure 1) means adding the prefix as a new morph and removing it from all words where it occurs as a prefix. Some of the morphs thus created may already exist in the corpus (e.g., &quot;cognition&quot; in Fig. 1). We denote the set of morphs starting with p by Vp, and the set of continuations that follow p by Sp (i.e., Vp = pSp). In the example of Fig. 1, Vre = {relic, retire, recognition, relive} and Sre = {lic, tire, cognition, live}.</Paragraph> <Paragraph position="1"> The number of occurrences of a morph m in the corpus (as currently segmented) is denoted by C(m), and the number of tokens in the corpus with prefix p is denoted B(p) = ∑_{vk ∈ Vp} C(vk). The algorithm examines all prefixes of current morphs in the dictionary as resegmentation candidates. The candidate p*
that would give the greatest decrease in description length upon resegmentation is chosen, and the corpus is then resegmented on p*.</Paragraph> <Paragraph position="2"> This is repeated until no candidate can decrease the description length.</Paragraph> <Paragraph position="3"> Key to this process is efficient resegmentation of the corpus, which entails incremental update of the description-length change that each prefix p would give upon resegmentation, denoted ∆CODEp (the change in the coding cost CODE(M,Data) of the corpus plus the model M). This is achieved in two ways. First, we develop (Sec. 3) expressions for ∆CODEp which depend only on simple properties of p, Vp, and Sp, and their occurrences in the corpus.</Paragraph> <Paragraph position="4"> This locality property obviates the need to examine most of the corpus to determine ∆CODEp. Second, we use a novel word/suffix indexing data structure which permits efficient resegmentation and update of the statistics on which ∆CODEp depends (Sec. 4). Initial experimental results for the different models using our algorithm are given in Section 5.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Local Description Length Models </SectionTitle> <Paragraph position="0"> As we show below, the key to efficiency is deriving local expressions for the change in coding length caused by resegmentation on a particular prefix p. That is, this coding-length change, ∆CODEp, should depend only on direct properties of p, of the morphs Vp = {vk = psk} for which it is a prefix, and of the strings Sp = {sk | psk ∈ Vp} (p's continuations). This enables us to efficiently maintain the necessary data about the corpus and to update it on resegmentation, avoiding costly scanning of the entire corpus on each iteration.</Paragraph> <Paragraph position="1"> We now describe three description length models for word segmentation. 
First, we introduce local description length via two simple models, and then derive a local expression for the description-length change of a more realistic description-length measure.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Model 1: Dictionary count </SectionTitle> <Paragraph position="0"> Perhaps the simplest possible model is to find a segmentation which minimizes the number of morphs in the dictionary: CODE1(M,Data) = |M|. Although the global minimum will almost always be the trivial solution in which each morph is an individual letter, this trivial solution may be avoided by enforcing a minimal morph length (of 2, say). Furthermore, when implemented via a greedy prefix (or suffix) resegmenting algorithm, this measure gives surprisingly good results, as we show below.</Paragraph> <Paragraph position="1"> Locality in this model is easily shown, as</Paragraph> <Paragraph position="3"> ∆CODE1,p = 1 + |Sp \ M| − |Vp|, since p is added to M, as are all its continuations not currently in M, while each morph vk ∈ Vp is removed (being resegmented as the 2-morph sequence psk).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Model 1a: Adjusted count </SectionTitle> <Paragraph position="0"> We also found a heuristic modification of Model 1 to work well, based on the intuition that an affix is better the more of its continuations are current morphs, while (to a lesser extent) continuations that are not current morphs indicate lower quality. 
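The Model 1 locality above admits a very direct computation. A minimal sketch, assuming a set-based representation of the dictionary (the function name and representation are illustrative, not the authors' implementation):

```python
def delta_code1(p, dictionary):
    """Model 1 change in dictionary size from resegmenting on prefix p.

    dictionary is the current morph set M; Model 1's cost is |M|.
    """
    # Vp: current morphs starting with p; Sp: their continuations
    Vp = {m for m in dictionary if m.startswith(p) and len(m) > len(p)}
    Sp = {m[len(p):] for m in Vp}
    # p is added, as are continuations not already in the dictionary;
    # every morph in Vp is removed (resegmented as p + continuation)
    added = {p} | (Sp - dictionary)
    return len(added) - len(Vp)
```

For the words of Fig. 1 together with &quot;cognition&quot;, resegmenting on &quot;re&quot; adds {re, lic, tire, live} and removes the four morphs of Vre, for a net change of 0.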
This gives the local heuristic formula:</Paragraph> <Paragraph position="2"> ∆CODE1a,p = a |Sp \ M| − |Sp ∩ M|, where a is a tunable parameter determining the relative weights of the two factors.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Model 2: MDL </SectionTitle> <Paragraph position="0"> A more theoretically motivated model seeks to minimize the combined coding cost of the corpus and the dictionary (Barron et al., 1998):</Paragraph> <Paragraph position="2"> CODE(M,Data) = CODE(M) + CODE(Data|M), where we assume a minimal-length code for the corpus based on the morphs in the dictionary (footnote 1).</Paragraph> <Paragraph position="3"> The coding cost of the dictionary M is CODE(M) = ∑_{m ∈ M} b len(m),</Paragraph> <Paragraph position="5"> [Footnote 1: ... MAP estimation for appropriately chosen prior and conditional data distribution (Barron et al., 1998).]</Paragraph> <Paragraph position="6"> where b is the number of bits needed to represent a character and len(m) is the length of m in characters. The coding cost CODE(Data|M) of the corpus given the dictionary is simply the total number of bits to encode the data using M's code:</Paragraph> <Paragraph position="8"> CODE(Data|M) = −∑_{i=1..N} log P(mi) = −∑_{j=1..|M|} C(mj) log P(mj), where M(Data) is the corpus segmented according to M, N is the number of morph tokens in the segmented corpus, mi is the ith morph token in that segmentation, P(m) is the probability of morph m in the corpus estimated as P(m) = C(m)/N, C(m) is the number of times morph m appears in the corpus, |M| is the total number of morph types in M, and mj is the jth morph type in M.</Paragraph> <Paragraph position="9"> Now suppose we wish to add a new morph to M by resegmenting on a prefix p from all morphs sharing that prefix, as above. First, consider the total change in cost for the dictionary. Note that the addition of the new morph p will cause an increase of b len(p) bits in the total dictionary size. At the same time, each new morph s ∈ Sp \ M will add its coding cost b len(s), while each preexisting morph s' ∈
Sp ∩ M will not change the dictionary length at all. Finally, each vk is removed from the dictionary, giving a change of −b len(vk). The total change in coding cost for the dictionary by resegmenting on p</Paragraph> <Paragraph position="11"> is thus ∆CODEdict,p = b len(p) + ∑_{s ∈ Sp \ M} b len(s) − ∑_{vk ∈ Vp} b len(vk). Now consider the change in coding cost for the corpus after resegmentation. First, consider each preexisting morph type m ∉ Vp, which has the same count after resegmentation (since it does not begin with p). The coding cost of each occurrence of m, however, will change, since the total number of tokens in the corpus will change. Thus the total cost change for such an m is:</Paragraph> <Paragraph position="13"> ∆m = C(m) log P(m) − C(m) log ^P(m) = C(m) log((N + B(p))/N), where ^P(m) = C(m)/(N + B(p)) is the probability of m in the resegmented corpus. The total corpus cost change for unchanged morphs depends only on N and B(p):</Paragraph> <Paragraph position="15"> ∑_{m ∉ Vp} ∆m = (N − B(p)) log((N + B(p))/N). Now, consider explicitly each morph vk ∈ Vp which will be split after resegmentation. First, remove the code for each occurrence of vk from the corpus coding: C(vk) log P(vk). Next, add a code for each occurrence of the new morph created by the prefix: −C(vk) log ^P(p), where ^P(p) = B(p)/(N + B(p)) is the probability of morph p in the resegmented corpus. Finally, code the continuations: −C(vk) log ^P(sk), where ^P(sk) = ^C(sk)/^N is the probability of the 'new' morph sk (with ^C(sk) its count and ^N = N + B(p) the number of tokens in the resegmented corpus). Putting this together, we have the corpus coding cost change for Vp (noting that B(p) = ∑_{vk ∈ Vp} C(vk)):</Paragraph> <Paragraph position="19"> ∆CODEcorpus,Vp = ∑_{vk ∈ Vp} C(vk) (log P(vk) − log ^P(p) − log ^P(sk)). Note that all terms are local to the prefix p, the morphs Vp that contain it as a prefix, and its continuations Sp. This will enable an efficient incremental algorithm for greedy segmentation of all words in the corpus, as described in the next section.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Efficient Greedy Prefix Search </SectionTitle> <Paragraph position="0"> The straightforward greedy algorithm schema for finding an approximately minimal-cost dictionary is to repeatedly find the best prefix p*
= argmin_p ∆CODEp(M,Data) and resegment the corpus on p*, until no p* exists with negative ∆CODE. However, the expense of passing over the entire corpus repeatedly would be prohibitive. Due to lack of space, we sketch here our method for caching corpus statistics in a pair of tries, in such a way that ∆CODEp can be easily computed for any prefix p, and such that the data structures can be efficiently updated when resegmenting on a prefix p.</Paragraph> <Paragraph position="1"> (A heap is also used for efficiently finding the best prefix.) The main data structures consist of two tries. The first, which we term the main suffix trie (MST), is a suffix trie (Gusfield, 1997) for all the words in the corpus. Each node in the MST represents either the prefix of a current morph (initially, a word in the corpus) or the prefix of a potential morph (in case its preceding prefix gets segmented). Each such node is labeled with various statistics of its prefix p (denoted by the path to it from the root) and its suffixes, such as the prefix length len(p), the count B(p), the number of continuations |Sp|, and the collective length of the continuations ∑_{sk ∈ Sp} len(sk), as well as the current value of ∆CODEp(M,Data) (computed from these statistics). Also, each node representing the end of an actual word in the corpus is marked as such.</Paragraph> <Paragraph position="2"> The second trie, the reversed prefix trie (RPT), contains all the words in the corpus in reverse.</Paragraph> <Paragraph position="3"> Hence each node in the RPT corresponds to the suffix of a word in the corpus. We maintain at each node in the RPT a list of pointers to each node in the MST which has an identical suffix. This allows efficient access to all prefixes of a given string. 
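The greedy schema above can be sketched end to end. This toy version recomputes candidate statistics from a plain set on every iteration instead of maintaining the MST/RPT tries and heap, so it illustrates the logic only, not the efficiency; all names are illustrative, and the Model 1 cost from Sec. 3.1 serves as the example cost:

```python
def delta_code1(p, dictionary):
    # Model 1 description-length change for resegmenting on p (Sec. 3.1)
    Vp = {m for m in dictionary if m.startswith(p) and len(m) > len(p)}
    Sp = {m[len(p):] for m in Vp}
    return len({p} | (Sp - dictionary)) - len(Vp)

def greedy_segment(words, delta_fn=delta_code1, min_len=2):
    """Greedy prefix resegmentation: repeatedly split on the best
    prefix until no candidate decreases the description length."""
    dictionary = set(words)
    while True:
        # candidate prefixes of current morphs (minimal prefix length 2)
        candidates = {m[:i] for m in dictionary
                      for i in range(min_len, len(m))}
        if not candidates:
            break
        # the paper keeps candidates in a heap keyed by the change;
        # here a plain min with a deterministic tiebreak suffices
        best_p = min(candidates, key=lambda p: (delta_fn(p, dictionary), p))
        if delta_fn(best_p, dictionary) >= 0:
            break
        # resegment on best_p: add it and its continuations, drop Vp
        Vp = {m for m in dictionary
              if m.startswith(best_p) and len(m) > len(best_p)}
        dictionary = (dictionary - Vp) | {best_p} | {m[len(best_p):] for m in Vp}
    return dictionary
```

On a toy vocabulary such as {unlock, unsafe, relock, resafe, lock, safe}, the loop peels off the shared prefixes and terminates once no candidate has a negative change.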
Also, those nodes corresponding to a complete word in the corpus are marked.</Paragraph> <Paragraph position="4"> Initial construction of the data structures can be done in time linear in the size of the corpus, using straightforward extensions of known suffix-trie construction techniques (Gusfield, 1997). Finding the best prefix p* can be done efficiently by storing pointers to all the prefixes in a heap, keyed by ∆CODEp. To then remove all words prefixed by p* and add all its continuations as new morphs (as well as p* itself), proceed as follows, for each continuation sk: 1. If sk is marked in RPT, then it is a complete word, and only its count needs to be updated.</Paragraph> <Paragraph position="5"> 2. Otherwise (a) Mark sk's node in MST as a complete word, and update its statistics (b) Add the reversal of sk to RPT and mark the corresponding nodes in MST as accepting stems.</Paragraph> <Paragraph position="6"> 3. Update the heap for the changed prefixes.</Paragraph> <Paragraph position="7"> [Table: Prefixes extracted using Model 1a, as above: re-, *ter, un-, im, in-, com, de-, trans, con-, sub, dis-, *se, pre-, en, ex-, *pa, pro-, *pe, over-, *mi. Meaningless morphs are marked by '*'; nonminimal meaningful morphs by '?'.]</Paragraph> <Paragraph position="9"> The complexity of resegmenting on p is O(len(p) + ∑_{sk ∈ Sp} len(sk) + NSUF(Sp) log |M|), where NSUF(Sp) is the number of different morphs in the previous dictionary that have a suffix in Sp (and so need to be updated in the heap).</Paragraph> </Section> </Paper>