<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1085"> <Title>Contextual Dependencies in Unsupervised Word Segmentation[?]</Title> <Section position="5" start_page="674" end_page="677" type="metho"> <SectionTitle> 3 Unigram Model </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="674" end_page="675" type="sub_section"> <SectionTitle> 3.1 The Dirichlet Process Model </SectionTitle> <Paragraph position="0"> Our goal is a model of language that prefers sparse solutions, allows independent modification of components, and is amenable to standard search procedures. We achieve this goal by basing our model on the Dirichlet process (DP), a distribution used in nonparametric Bayesian statistics. Our unigram model of word frequencies is defined as</Paragraph> <Paragraph position="2"> where the concentration parameter a0 and the base distribution P0 are parameters of the model.</Paragraph> <Paragraph position="3"> Each word wi in the corpus is drawn from a distribution G, which consists of a set of possible words (the lexicon) and probabilities associated with those words. G is generated from a DP(a0,P0) distribution, with the items in the lexicon being sampled from P0 and their probabilities being determined by a0, which acts like the parameter of an infinite-dimensional symmetric Dirichlet distribution. We provide some intuition for the roles of a0 and P0 below.</Paragraph> <Paragraph position="4"> Although the DP model makes the distribution G explicit, we never deal with G directly. We take a Bayesian approach and integrate over all possible values of G. The conditional probability of choosing to generate a word from a particular lexical entry is then given by a simple stochastic process known as the Chinese restaurant process (CRP) (Aldous, 1985). Imagine a restaurant with an infinite number of tables, each with infinite seating capacity. Customers enter the restaurant and seat themselves. Let zi be the table chosen by the ith customer. Then</Paragraph> <Paragraph position="6"> (2) where z[?]i = z1 ...zi[?]1, n(z[?]i)k is the number of customers already sitting at table k, and K(z[?]i) is the total number of occupied tables. In our model, the tables correspond to (possibly repeated) lexical entries, having labels generated from the distribution P0. The seating arrangement thus specifies a distribution over word tokens, with each customer representing one token. This model is an instance of the two-stage modeling framework described by Goldwater et al. (2006), with P0 as the generator and the CRP as the adaptor.</Paragraph> <Paragraph position="7"> Our model can be viewed intuitively as a cache model: each word in the corpus is either retrieved from a cache or generated anew. Summing over all the tables labeled with the same word yields the probability distribution for the ith word given previously observed words w[?]i:</Paragraph> <Paragraph position="9"> where n(w[?]i)w is the number of instances of w observed in w[?]i. The first term is the probability of generating w from the cache (i.e., sitting at an occupied table), and the second term is the probability of generating it anew (sitting at an unoccupied table). The actual table assignments z[?]i only become important later, in the bigram model.</Paragraph> <Paragraph position="10"> There are several important points to note about this model. First, the probability of generating a particular word from the cache increases as more instances of that word are observed. 
This rich-get-richer process creates a power-law distribution on word frequencies (Goldwater et al., 2006), the same sort of distribution found empirically in natural language. Second, the parameter α0 can be used to control how sparse the solutions found by the model are. This parameter determines the total probability of generating any novel word, a probability that decreases as more data is observed, but never disappears. Finally, the parameter P0 can be used to encode expectations about the nature of the lexicon, since it defines a probability distribution across different novel words. The fact that this distribution is defined separately from the distribution on word frequencies gives the model additional flexibility, since either distribution can be modified independently of the other.</Paragraph> <Paragraph position="11"> Since the goal of this paper is to investigate the role of context in word segmentation, we chose the simplest possible model for P0, i.e., a unigram phoneme distribution:</Paragraph> <Paragraph position="12"> $P_0(w = m_1 \ldots m_n) = p_\# (1 - p_\#)^{n-1} \prod_{i=1}^{n} P(m_i)$</Paragraph> <Paragraph position="13"> where word w consists of the phonemes m1 ... mn, and p# is the probability of the word boundary #. For simplicity we used a uniform distribution over phonemes, and experimented with different fixed values of p#. (Note, however, that our model could be extended to learn both p# and the distribution over phonemes.) A final detail of our model is the distribution on utterance lengths, which is geometric. That is, we assume a grammar similar to the one shown in Figure 1, with the addition of a symmetric Beta(τ/2) prior over the probability of the U productions (footnote 2), and the substitution of the DP for the standard multinomial distribution over the W productions.</Paragraph> </Section> <Section position="2" start_page="675" end_page="676" type="sub_section"> <SectionTitle> 3.2 Gibbs Sampling </SectionTitle> <Paragraph position="0"> Having defined our generative model, we are left with the problem of inference: we must determine the posterior distribution of hypotheses given our input corpus. To do so, we use Gibbs sampling, a standard Markov chain Monte Carlo method (Gilks et al., 1996). Gibbs sampling is an iterative procedure in which variables are repeatedly sampled from their conditional posterior distribution given the current values of all other variables in the model. The sampler defines a Markov chain whose stationary distribution is P(h|d), so after convergence samples are from this distribution.</Paragraph> <Paragraph position="2"> [Footnote 2, fragment: "... are part of h−."]</Paragraph> <Paragraph position="4"> Our Gibbs sampler considers a single possible boundary point at a time, so each sample is from a set of two hypotheses, h1 and h2. These hypotheses contain all the same boundaries except at the one position under consideration, where h2 has a boundary and h1 does not. The structures are shown in Figure 2. In order to sample a hypothesis, we need only calculate the relative probabilities of h1 and h2. Since h1 and h2 are the same except for a few rules, this is straightforward. Let h− be all of the structure shared by the two hypotheses, including n− words, and let d be the observed data. Then</Paragraph> <Paragraph position="6"> where the second line follows from Equation 3 and the properties of the CRP (in particular, that it is exchangeable, with the probability of a seating configuration not depending on the order in which customers arrive (Aldous, 1985)).
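To illustrate how these relative probabilities drive the sampler, the sketch below (ours, not the authors' code; the variable names are illustrative, the factor for utterance boundaries is omitted, and the indicator-function correction for the two-word hypothesis described just below is included) scores the two hypotheses at a single potential boundary and samples one of them:

```python
import random
from collections import Counter

def p0(word, n_phonemes=50, p_boundary=0.5):
    """Uniform unigram phoneme base distribution, as in the earlier sketch."""
    m = len(word)
    return p_boundary * (1.0 - p_boundary) ** (m - 1) * (1.0 / n_phonemes) ** m

def sample_boundary(left, right, shared_words, alpha0=20.0, temperature=1.0):
    """Schematic Gibbs step for one potential boundary.
    h1 analyses the span as one word (left + right); h2 as two words.
    Scores use the cache model of Equation 3 over the shared structure h-;
    the factor for utterance boundaries (U productions) is omitted here.
    `temperature` implements the annealing used in Section 3.3: probabilities
    are raised to the power 1/gamma before sampling."""
    counts = Counter(shared_words)      # word counts over h-
    n = len(shared_words)               # n-: number of words in h-

    # Hypothesis h1: a single word spanning the position.
    w1 = left + right
    p_h1 = (counts[w1] + alpha0 * p0(w1)) / (n + alpha0)

    # Hypothesis h2: two words. The second factor adds the indicator
    # I(first word == second word) and one extra word in the denominator,
    # because the first word is generated before the second.
    p_a = (counts[left] + alpha0 * p0(left)) / (n + alpha0)
    p_b = (counts[right] + int(left == right) + alpha0 * p0(right)) / (n + 1 + alpha0)
    p_h2 = p_a * p_b

    s1, s2 = p_h1 ** (1.0 / temperature), p_h2 ** (1.0 / temperature)
    return random.random() < s2 / (s1 + s2)   # True: keep the boundary (h2)
```

A full sampler applies this decision at each potential boundary point in turn on every pass through the data.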
Also,</Paragraph> <Paragraph position="8"> where I(·) is an indicator function taking on the value 1 when its argument is true, and 0 otherwise. The last term is derived by integrating over all possible values of p$, and noting that the total number of U productions in h− is n− + 1.</Paragraph> <Paragraph position="9"> Using these equations we can simply proceed through the data, sampling each potential boundary point in turn. Once the Gibbs sampler converges, these samples will be drawn from the posterior distribution P(h|d).</Paragraph> </Section> <Section position="3" start_page="676" end_page="677" type="sub_section"> <SectionTitle> 3.3 Experiments </SectionTitle> <Paragraph position="0"> In our experiments, we used the same corpus that NGS and MBDP were tested on. The corpus, supplied to us by Brent, consists of 9790 transcribed utterances (33399 words) of child-directed speech from the Bernstein-Ratner corpus (Bernstein-Ratner, 1987) in the CHILDES database (MacWhinney and Snow, 1985). The utterances have been converted to a phonemic representation using a phonemic dictionary, so that each occurrence of a word has the same phonemic transcription. Utterance boundaries are given in the input to the system; other word boundaries are not.</Paragraph> <Paragraph position="1"> Because our Gibbs sampler is slow to converge, we used annealing to speed inference. We began with a temperature of γ = 10 and decreased γ in 10 increments to a final value of 1. A temperature of γ corresponds to raising the probabilities of h1 and h2 to the power of 1/γ prior to sampling.</Paragraph> <Paragraph position="2"> We ran our Gibbs sampler for 20,000 iterations through the corpus (with γ = 1 for the final 2000) and evaluated our results on a single sample at that point. We calculated precision (P), recall (R), and F-score (F) on the word tokens in the corpus, where both boundaries of a word must be correct to count the word as correct. The induced lexicon was also scored for accuracy using these metrics (LP, LR, LF).</Paragraph> <Paragraph position="3"> Recall that our DP model has three parameters: τ, p#, and α0. Given the large number of known utterance boundaries, we expect the value of τ to have little effect on our results, so we simply fixed τ = 2 for all experiments. Figure 3 shows the effects of varying p# and α0. [Figure 3: (a) as a function of p#, with α0 = 20; (b) as a function of α0, with p# = .5.] Lower values of p# cause longer words, which tends to improve recall (and thus F-score) in the lexicon, but decrease token accuracy. Higher values of α0 allow more novel words, which also improves lexicon recall, but begins to degrade precision after a point.</Paragraph> <Paragraph position="4"> Due to the negative correlation between token accuracy and lexicon accuracy, there is no single best value for either p# or α0; further discussion refers to the solution for p# = .5, α0 = 20 (though others are qualitatively similar).</Paragraph> <Paragraph position="5"> In Table 1(a), we compare the results of our system to those of MBDP and NGS. Although our system has higher lexicon accuracy than the others, its token accuracy is much worse. This result occurs because our system often mis-analyzes frequently occurring words. In particular, many of these words occur in common collocations such as what's that and do you, which the system interprets as single words.
It turns out that a full 31% of the proposed lexicon and nearly 30% of tokens consist of these kinds of errors.</Paragraph> <Paragraph position="6"> Upon reflection, it is not surprising that a unigram language model would segment words in this way. Collocations violate the unigram assumption in the model, since they exhibit strong word-to-word dependencies. The only way the model can capture these dependencies is by assuming that these collocations are in fact words themselves.</Paragraph> <Paragraph position="7"> Why don't the MBDP and NGS unigram models exhibit these problems? We have already shown that NGS's results are due to its search procedure rather than its model. The same turns out to be true for MBDP. Table 2 shows the probabilities under each model of various segmentations of the corpus. From these figures, we can see that the MBDP model assigns higher probability to the solution found by our Gibbs sampler than to the solution found by Brent's own incremental search algorithm. In other words, Brent's model does prefer the lower-accuracy collocation solution, but his search algorithm instead finds a higher-accuracy but lower-probability solution.</Paragraph> <Paragraph position="8"> [Table 1: ... is shown. DP results are with p# = .5 and α0 = 20. (a) Results on the true corpus. (b) Results on the permuted corpus.]</Paragraph> <Paragraph position="9"> [Table 2: ... under each model of the true solution, the solution with no utterance-internal boundaries, and the solutions found by each algorithm. Best solutions under each model are bold.]</Paragraph> <Paragraph position="10"> We performed two experiments suggesting that our own inference procedure does not suffer from similar problems. First, we initialized our Gibbs sampler in three different ways: with no utterance-internal boundaries, with a boundary after every character, and with random boundaries. Our results were virtually the same regardless of initialization. Second, we created an artificial corpus by randomly permuting the words in the true corpus, leaving the utterance lengths the same. The artificial corpus adheres to the unigram assumption of our model, so if our inference procedure works correctly, we should be able to correctly identify the words in the permuted corpus. This is exactly what we found, as shown in Table 1(b). While all three models perform better on the artificial corpus, the improvements of the DP model are by far the most striking.</Paragraph> </Section> </Section> <Section position="6" start_page="677" end_page="678" type="metho"> <SectionTitle> 4 Bigram Model </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="677" end_page="678" type="sub_section"> <SectionTitle> 4.1 The Hierarchical Dirichlet Process Model </SectionTitle> <Paragraph position="0"> The results of our unigram experiments suggested that word segmentation could be improved by taking into account dependencies between words.</Paragraph> <Paragraph position="1"> To test this hypothesis, we extended our model to incorporate bigram dependencies using a hierarchical Dirichlet process (HDP) (Teh et al., 2005). Our approach is similar to previous n-gram models using hierarchical Pitman-Yor processes (Goldwater et al., 2006; Teh, 2006). The HDP is appropriate for situations in which there are multiple distributions over similar sets of outcomes, and the distributions are believed to be similar.
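Before specializing to word segmentation, it may help to see the kind of two-level sharing involved. The sketch below (ours, not the authors' code; it glosses over utterance boundaries and the full table bookkeeping of the model, and the α0 = 1000, α1 = 10 values are the ones reported later in this section) lets a word-specific bigram distribution back off to a shared base distribution estimated from pooled table labels, which in turn backs off to P0:

```python
from collections import Counter

def p0(word, n_phonemes=50, p_boundary=0.5):
    """Uniform unigram phoneme base distribution, as in the unigram sketches."""
    m = len(word)
    return p_boundary * (1.0 - p_boundary) ** (m - 1) * (1.0 / n_phonemes) ** m

def bigram_predictive(prev, word, bigram_counts, table_counts,
                      alpha0=1000.0, alpha1=10.0):
    """Schematic two-level CRP back-off.
    The restaurant for `prev` is smoothed towards a shared base distribution
    whose counts are the table labels pooled across all restaurants, and that
    base distribution is in turn smoothed towards P0."""
    total_tables = sum(table_counts.values())
    # Shared base level: unigram-like probability estimated from table labels.
    p_base = (table_counts[word] + alpha0 * p0(word)) / (total_tables + alpha0)
    # Word-specific level: bigram counts smoothed towards the base level.
    n_prev = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev)
    return (bigram_counts[(prev, word)] + alpha1 * p_base) / (n_prev + alpha1)

# Toy usage with hand-set counts (purely illustrative).
bigrams = Counter({("want", "tu"): 3, ("want", "D6"): 1})
tables = Counter({"want": 2, "tu": 1, "D6": 1, "si": 1})
print(bigram_predictive("want", "tu", bigrams, tables))
print(bigram_predictive("want", "si", bigrams, tables))  # unseen bigram backs off
```

In the full model this shared level corresponds to P1, the unigram backoff learned from the bigram table labels described below.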
In our case, we define a bigram model by assuming each word has a different distribution over the words that follow it, but all these distributions are linked. The definition of our bigram language model as an HDP is</Paragraph> <Paragraph position="2"> $w_i \mid w_{i-1} = w, H_w \sim H_w \;\;\forall w$, $\qquad H_w \mid \alpha_1, G \sim \mathrm{DP}(\alpha_1, G) \;\;\forall w$, $\qquad G \mid \alpha_0, P_0 \sim \mathrm{DP}(\alpha_0, P_0)$</Paragraph> <Paragraph position="3"> where each word is distributed according to Hw, a DP specific to word w. Hw is linked to the DPs for all other words by the fact that they share a common base distribution G, which is generated from another DP. (Footnote 5: This HDP formulation is an oversimplification, since it does not account for utterance boundaries properly. The grammar formulation, see below, does.) As in the unigram model, we never deal with Hw or G directly. By integrating over them, we get a distribution over bigram frequencies that can be understood in terms of the CRP. Now, each word type w is associated with its own restaurant, which represents the distribution over words that follow w. Different restaurants are not completely independent, however: the labels on the tables in the restaurants are all chosen from a common base distribution, which is another CRP.</Paragraph> <Paragraph position="4"> To understand the HDP model in terms of a grammar, we consider $ as a special word type, so that wi ranges over Σ∗ ∪ {$}. After observing w−i, the HDP grammar is as shown in Figure 4,</Paragraph> <Paragraph position="6"> where h−i = (w−i, z−i); t$, tΣ∗, and twi are the total number of tables (across all words) labeled with $, non-$, and wi, respectively; t = t$ + tΣ∗ is the total number of tables; and n(wi−1, wi) is the number of occurrences of the bigram (wi−1, wi).</Paragraph> <Paragraph position="7"> We have suppressed the superscript (w−i) notation in all cases. The base distribution shared by all bigrams is given by P1, which can be viewed as a unigram backoff where the unigram probabilities are learned from the bigram table labels.</Paragraph> <Paragraph position="8"> We can perform inference on this HDP bigram model using a Gibbs sampler similar to our unigram sampler. Details appear in the Appendix.</Paragraph> </Section> <Section position="2" start_page="678" end_page="678" type="sub_section"> <SectionTitle> 4.2 Experiments </SectionTitle> <Paragraph position="0"> We used the same basic setup for our experiments with the HDP model as we used for the DP model.</Paragraph> <Paragraph position="1"> We experimented with different values of α0 and α1, keeping p# = .5 throughout. Some results of these experiments are plotted in Figure 5. With appropriate parameter settings, both lexicon and token accuracy are higher than in the unigram model (dramatically so, for tokens), and there is no longer a negative correlation between the two.</Paragraph> <Paragraph position="2"> Only a few collocations remain in the lexicon, and most lexicon errors are on low-frequency words.</Paragraph> <Paragraph position="3"> The best values of α0 are much larger than in the unigram model, presumably because all unique word types must be generated via P0, but in the bigram model there is an additional level of discounting (the unigram process) before reaching P0. Smaller values of α0 lead to fewer word types with fewer characters on average.</Paragraph> <Paragraph position="4"> Table 3 compares the optimal results of the HDP model to the only previous model incorporating bigram dependencies, NGS. Due to search, the performance of the bigram NGS model is not much different from that of the unigram model.
In contrast, our HDP model performs far better than our DP model, leading to the highest published accuracy for this corpus on both tokens and lexical items. Overall, these results strongly support our hypothesis that modeling bigram dependencies is important for accurate word segmentation.</Paragraph> <Paragraph position="5"> [Figure 5: (a) as a function of α0, with α1 = 10; (b) as a function of α1, with α0 = 1000.]</Paragraph> <Paragraph position="6"> [Table 3: ... in bold. HDP results are with p# = .5, α0 = 1000, and α1 = 10.]</Paragraph> </Section> </Section> </Paper>