<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1024">
  <Title>Exploring Asymmetric Clustering for Statistical Language Modeling</Title>
  <Section position="6" start_page="3" end_page="4" type="evalu">
    <SectionTitle>
4 Experimental Results and Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.1 Japanese Kana-Kanji Conversion Task
</SectionTitle>
      <Paragraph position="0"> Japanese Kana-Kanji conversion is the standard method of inputting Japanese text by converting a syllabary-based Kana string into the appropriate combination of ideographic Kanji and Kana. This is a similar problem to speech recognition, except that it does not include acoustic ambiguity. The performance is generally measured in terms of character error rate (CER), which is the number of characters wrongly converted from the phonetic string divided by the number of characters in the correct transcript. The role of the language model is, for all possible word strings that match the typed phonetic symbol string, to select the word string with the highest language model probability.</Paragraph>
      <Paragraph position="1"> Current products make about 5-10% errors in conversion of real data in a wide variety of domains.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.2 Settings
</SectionTitle>
      <Paragraph position="0"> In the experiments, we used two Japanese newspaper corpora: the Nikkei Newspaper corpus, and the Yomiuri Newspaper corpus. Both text corpora have been word-segmented using a lexicon containing 167,107 entries.</Paragraph>
      <Paragraph position="1"> We performed two sets of experiments: (1) pilot experiments, in which model performance is measured in terms of perplexity and (2) Japanese Kana-Kanji conversion experiments, in which the performance of which is measured in terms of CER. In the pilot experiments, we used a subset of the Nikkei newspaper corpus: ten million words of the Nikkei corpus for language model training, 10,000 words for held-out data, and 20,000 words for testing data. None of the three data sets overlapped. In the Japanese Kana-Kanji conversion experiments, we built language models on a subset of the Nikkei Newspaper corpus, which contains 36 million words. We performed parameter optimization on a subset of held-out data from the Yomiuri Newspaper corpus, which contains 100,000 words. We performed testing on another subset of the Yomiuri Newspaper corpus, which contains 100,000 words.</Paragraph>
      <Paragraph position="2"> In both sets of experiments, word clusters were derived from bigram counts generated from the training corpora. Out-of-vocabulary words were not included in perplexity and error rate computations.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.3 Impact of asymmetric clustering
</SectionTitle>
      <Paragraph position="0"> As described in Section 3.2, depending on the clustering metrics we chose for generating clusters, we obtained three types of clusters: both clusters (the metric of Equation (2)), conditional clusters (the metric of Equation (3)), and predicted clusters (the metric of Equation (4)). We then performed a series of experiments to investigate the impact of different types of clusters on the ACM. We used three variants of the trigram ACM: (1) the predictive cluster model P(w</Paragraph>
      <Paragraph position="2"> only predicted words are clustered, (2) the conditional cluster model P(w</Paragraph>
      <Paragraph position="4"> conditional words are clustered, and (3) the IBM</Paragraph>
      <Paragraph position="6"> ) which can be treated as a special case of the ACM of Equation (5) by using the same type of cluster for both predicted and conditional words, and setting k = 0, and l = j. For each cluster trigram model, we compared their perplexities and CER results on Japanese Kana-Kanji conversion using different types of clusters. For each cluster type, the number of clusters were fixed to the same value 2^6 just for comparison. The results are shown in Table 1. It turns out that the benefit of using different clusters in different positions is obvious. For each cluster trigram model, the best results were achieved by using the &amp;quot;matched&amp;quot; clusters, e.g. the predictive cluster model</Paragraph>
      <Paragraph position="8"> ) has the best performance when the cluster W</Paragraph>
      <Paragraph position="10"> generated by using the metric of Equation (4). In particular, the IBM model achieved the best results when predicted and conditional clusters were used for predicted and conditional words respectively. That is, the IBM model is of the</Paragraph>
      <Paragraph position="12"/>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.4 Impact of parameter optimization
</SectionTitle>
      <Paragraph position="0"> In this section, we first present our pilot experiments of finding the optimal parameter set of the ACM (l, j,</Paragraph>
      <Paragraph position="2"> ) described in Section 2.3. Then, we compare the ACM to the IBM model, showing that the superiority of the ACM results from its better structure.</Paragraph>
      <Paragraph position="3"> In this section, the performance of LMs was measured in terms of perplexity, and the size was measured as the total number of parameters of the LM: one parameter for each bigram and trigram, one parameter for each normalization parameter a that was needed, and one parameter for each unigram. We first used the conditional cluster model of the</Paragraph>
      <Paragraph position="5"> ) are shown in Figure 1. The performance was consistently improved by increasing the number of clusters j, except at the smallest sizes. The word trigram model was consistently the best model, except at the smallest sizes, and even then was only marginally worse than the conditional cluster models. This is not surprising because the conditional cluster model always discards information for predicting words.</Paragraph>
      <Paragraph position="6"> We then used the predictive cluster model of the</Paragraph>
      <Paragraph position="8"> ), where only predicted words are clustered. Some sample settings of the parameters (l, t</Paragraph>
      <Paragraph position="10"> ) are shown in Figure 2. For simplicity, we assumed t</Paragraph>
      <Paragraph position="12"> , meaning that the same pruning threshold values were used for both sub-models. It turns out that predictive cluster models achieve the best perplexity results at about 2^6 or 2^8 clusters. The models consistently outperform the baseline word trigram models.</Paragraph>
      <Paragraph position="13"> We finally returned to the ACM of Equation (5), where both conditional words and the predicted word are clustered (with different numbers of clusters), and which is referred to as the combined cluster model below. In addition, we allow different values of the threshold for different sub-models.</Paragraph>
      <Paragraph position="14"> Therefore, we need to optimize the model parameter set l, j, k, t</Paragraph>
      <Paragraph position="16"> Based on the pilot experiment results using conditional and predictive cluster models, we tried combined cluster models for values l[?][4, 10], j, k[?][8, 16]. We also allow j, k=all. Rather than plot all points of all models together, we show only the outer envelope of the points. That is, if for a given model type and a given point there is some other point of the same type with both lower perplexity and smaller size than the first point, then we do not plot the first, worse point.</Paragraph>
      <Paragraph position="17"> The results are shown in Figure 3, where the cluster number of IBM models is 2^14 which achieves the best performance for IBM models in our experiments. It turns out that when l[?][6, 8] and j, k&gt;12, combined cluster models yield the best results. We also found that the predictive cluster models give as good performance as the best combined ones while combined models outperformed very slightly only when model sizes are small. This is not difficult to explain. Recall that the predictive cluster model is a special case of the combined model where words are used in conditional positions, i.e. j=k=all. Our experiments show that combined models achieved good performance when large numbers of clusters are used for conditional words, i.e. large j, k&gt;12, which are similar to words.</Paragraph>
      <Paragraph position="18"> The most interesting analysis is to look at some sample settings of the parameters of the combined cluster models in Figure 3. In Table 2, we show the best parameter settings at several levels of model size. Notice that in larger model sizes, predictive cluster models (i.e. j=k=all) perform the best in some cases. The 'prune' columns (i.e. columns 6 and 7) indicate the Stolcke pruning parameter we used.</Paragraph>
      <Paragraph position="19"> First, notice that the two pruning parameters (in columns 6 and 7) tend to be very similar. This is desirable since applying the theory of relative entropy pruning predicts that the two pruning parameters should actually have the same value.</Paragraph>
      <Paragraph position="20"> Next, let us compare the ACM</Paragraph>
      <Paragraph position="22"> applied with different numbers of clusters  cluster model, IBM model, and word trigram model same type of cluster is used for both predictive and conditional words). Our results in Figure 3 show that the performance of IBM models is roughly an order of magnitude worse than that of ACMs. This is because in addition to the use of the symmetric cluster model, the traditional IBM model makes two more assumptions that we consider suboptimal. First, it assumes that j=l. We see that the best results come from unequal settings of j and l. Second, more importantly, IBM clustering assumes that k=0. We see that not only is the optimal setting for k not 0, but also typically the exact opposite is the optimal: k=all in which case P(w</Paragraph>
      <Paragraph position="24"> ), or k=14, 16, which is very similar. That is, we see that words depend on the previous words and that an independence assumption is a poor one. Of course, many of these word dependencies are pruned away - but when a word does depend on something, the previous words are better predictors than the previous clusters. Another important finding here is that for most of these settings, the unpruned model is actually larger than a normal trigram model - whenever k=all or 14, 16, the unpruned model P(PW</Paragraph>
      <Paragraph position="26"> ) is actually larger than an unpruned model P(w</Paragraph>
      <Paragraph position="28"> This analysis of the data is very interesting - it implies that the gains from clustering are not from compression, but rather from capturing structure.</Paragraph>
      <Paragraph position="29"> Factoring the model into two models, in which the cluster is predicted first, and then the word is predicted given the cluster, allows the structure and regularities of the model to be found. This larger, better structured model can be pruned more effectively, and it achieved better performance than a word trigram model at the same model size.</Paragraph>
    </Section>
    <Section position="5" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
4.5 CER results
</SectionTitle>
      <Paragraph position="0"> Before we present CER results of the Japanese Kana-Kanji conversion system, we briefly describe our method for storing the ACM in practice.</Paragraph>
      <Paragraph position="1"> One of the most common methods for storing backoff n-gram models is to store n-gram probabilities (and backoff weights) in a tree structure, which begins with a hypothetical root node that branches out into unigram nodes at the first level of the tree, and each of those unigram nodes in turn branches out into bigram nodes at the second level and so on. To save storage, n-gram probabilities such as P(w</Paragraph>
      <Paragraph position="3"> ) and backoff weights such as a(w</Paragraph>
      <Paragraph position="5"> ) are stored in a single (bigram) node array (Clarkson and Rosenfeld, 1997). Applying the above tree structure to storing the ACM is a bit complicated - there are some representation issues. For example, consider the  ) cannot be stored in a single (bigram) node array, because l [?] j and PW[?]CW. Therefore, we used two separate trees to store probabilities and backoff weights, respectively. As a result, we used four tree structures to store ACMs in practice: two for the cluster  that the effect of the storage structure cannot be ignored in a real application.</Paragraph>
      <Paragraph position="6"> In addition, we used several techniques to compress model parameters (i.e. word id, n-gram probability, and backoff weight, etc.) and reduce the storage space of models significantly. For example, rather than store 4-byte floating point values for all n-gram probabilities and backoff weights, the values are quantized to a small number of quantization levels. Quantization is performed separately on each of the n-gram probability and backoff weight lists, and separate quantization level look-up tables are generated for each of these sets of parameters. We used 8-bit quantization, which shows no performance decline in our experiments.</Paragraph>
      <Paragraph position="7"> Our goal is to achieve the best tradeoff between performance and model size. Therefore, we would like to compare the ACM with the word trigram model at the same model size. Unfortunately, the ACM contains four sub-models and this makes it difficult to be pruned to a specific size. Thus for comparison, we always choose the ACM with smaller size than its competing word trigram model to guarantee that our evaluation is under-estimated. Experiments show that the ACMs achieve statistically significant improvements over word trigram models at even smaller model sizes (p-value =8.0E-9). Some results are shown in Table 3.</Paragraph>
      <Paragraph position="8">  trigram models at different model sizes Now we discuss why the ACM is superior to simple word trigrams. In addition to the better structure as shown in Section 3.3, we assume here that the benefit of our model also comes from its better smoothing. Consider a probability such as P(Tuesday |party on). If we put the word &amp;quot;Tuesday&amp;quot; into the cluster WEEKDAY, we decompose the probability When each word belongs to one class, simple math shows that this decomposition is a strict equality. However, when smoothing is taken into consideration, using the clustered probability will be more accurate than using the non-clustered probability. For instance, even if we have never seen an example of &amp;quot;party on Tuesday&amp;quot;, perhaps we have seen examples of other phrases, such as &amp;quot;party on Wednesday&amp;quot;; thus, the probability P(WEEKDAY | party on) will be relatively high. Furthermore, although we may never have seen an example of &amp;quot;party on WEEKDAY Tuesday&amp;quot;, after we backoff or interpolate with a lower order model, we may able to accurately estimate P(Tuesday  |on WEEKDAY).</Paragraph>
      <Paragraph position="9"> Thus, our smoothed clustered estimate may be a good one.</Paragraph>
      <Paragraph position="10"> Our assumption can be tested empirically by following experiments. We first constructed several test sets with different backoff rates  . The backoff rate of a test set, when presented to a trigram model, is defined as the number of words whose trigram probabilities are estimated by backoff bigram probabilities divided by the number of words in the test set. Then for each test set, we obtained a pair of CER results using the ACM and the word trigram model respectively. As shown in Figure 4, in both cases, CER increases as the backoff rate increases from 28% to 40%. But the curve of the word trigram model has a steeper upward trend. The difference of the upward trends of the two curves can be shown more clearly by plotting the CER difference between them, as shown in Figure 5. The results indicate that because of its better smoothing, when the backoff rate increases, the CER using the ACM does not increase as fast as that using the word trigram model. Therefore, we are reasonably confident that some portion of the benefit of the ACM comes from its better smoothing.</Paragraph>
      <Paragraph position="11">  The backoff rates are estimated using the baseline trigram model, so the choice could be biased against the word trigram model.</Paragraph>
      <Paragraph position="13"/>
    </Section>
  </Section>
class="xml-element"></Paper>