Exploiting Headword Dependency and Predictive Clustering for Language Modeling

2 Using Headwords

2.1 Motivation

Japanese linguists have traditionally distinguished two types of words (or, more correctly, morphemes; strictly speaking, the LMs discussed in this paper are morpheme-based rather than word-based, but we do not make this distinction here): content words (jiritsugo) and function words (fuzokugo), along with the notion of the bunsetsu (phrase). Each bunsetsu typically consists of one content word, called a headword in this paper, and several function words. Figure 1 shows a Japanese example sentence and its English translation. (In Figure 1, square brackets demarcate bunsetsu boundaries and + marks morpheme boundaries; the underlined words are the headwords. ADN indicates an adnominal marker, and PRES indicates a present tense marker.)

In Figure 1, we find that some headwords in the sentence are expected to have a stronger dependency relation with their preceding headwords than with their immediately preceding function words. For example, the three headwords 治療~専念~全快 (chiryou 'treatment' ~ sennen 'concentrate' ~ zenkai 'full recovery') form a trigram with very strong semantic dependency. We can therefore hypothesize (in the trigram context) that headwords may be conditioned not only by the two immediately preceding words, but also by the two previous headwords. This is our first assumption.

We also note that the order of headwords in a sentence is flexible in some sense. From the example in Figure 1, we find that if 治療~専念~全快 (chiryou ~ sennen ~ zenkai) is a meaningful trigram, then its permutations (such as 全快~治療~専念, zenkai ~ chiryou ~ sennen) should also be meaningful, because headword trigrams tend to capture an order-neutral semantic dependency. This reflects a characteristic of Japanese, in which arguments and modifiers of a predicate can freely change their word order, a phenomenon known as "scrambling" in the linguistic literature. We can then introduce our second assumption: headwords in a trigram are permutable. Note that the permutation of headwords should be useful beyond Japanese: for example, in English, "the book Mary bought" and "Mary bought a book" can be captured by the same headword trigram (Mary ~ bought ~ book) if we allow such permutations.

In this subsection, we have stated two assumptions about the structure of Japanese that can be exploited for language modeling. We now turn to how these assumptions can be incorporated into a language model.

2.2 Permuted headword trigram model (PHTM)

A trigram model predicts the next word $w_i$ under the assumption that it depends only on the two preceding words, $w_{i-2}$ and $w_{i-1}$.
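As a point of reference for the extensions below, here is a minimal sketch of the standard trigram estimate. The counting scheme and the absence of smoothing are simplifications for illustration, not the authors' implementation (which uses Katz backoff, as described in Section 2.3).

```python
from collections import defaultdict

def train_trigram(sentences):
    """Collect trigram and bigram counts from word-segmented sentences."""
    tri, bi = defaultdict(int), defaultdict(int)
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            tri[(padded[i - 2], padded[i - 1], padded[i])] += 1
            bi[(padded[i - 2], padded[i - 1])] += 1
    return tri, bi

def p_trigram(tri, bi, w2, w1, w):
    """MLE estimate of P(w | w2 w1); a real system would add backoff smoothing."""
    denom = bi.get((w2, w1), 0)
    return tri.get((w2, w1, w), 0) / denom if denom else 0.0
```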
The PHTM is a simple extension of the trigram model that incorporates dependencies between headwords. If we assume that each word token can be uniquely classified as a headword or a function word, the PHTM can be viewed as a cluster-based language model with two clusters, headword H and function word F. We can then define the conditional probability of $w_i$ given its history as the product of two factors: the probability of the category (H or F), and the probability of $w_i$ given its category.

Let $h_{i-2}$ and $h_{i-1}$ be the two headwords preceding $w_i$ in the sentence, let $H_i$ ($F_i$) denote the event that the category of $w_i$ is headword (function word), and let $\Phi$ be a function that maps the history $(w_1 \ldots w_{i-1})$ onto equivalence classes. The PHTM is then

$$
P(w_i \mid \Phi(w_1 \ldots w_{i-1})) =
\begin{cases}
P(H_i \mid w_{i-2} w_{i-1}) \cdot P_h(w_i \mid h_{i-2} h_{i-1}, w_{i-2} w_{i-1}, H_i) & \text{if } w_i \text{ is a headword} \\
P(F_i \mid w_{i-2} w_{i-1}) \cdot P(w_i \mid w_{i-2} w_{i-1}, F_i) & \text{if } w_i \text{ is a function word}
\end{cases}
\tag{1}
$$

where $P(H_i \mid w_{i-2} w_{i-1})$ and $P(F_i \mid w_{i-2} w_{i-1})$ are the category probabilities and $P(w_i \mid w_{i-2} w_{i-1}, F_i)$ is the word probability given that the category of $w_i$ is a function word. For these probabilities we used the standard trigram estimate. The estimation of the headword probability $P_h$ is slightly more elaborate, reflecting the two assumptions described in Section 2.1:

$$
P_h(w_i \mid h_{i-2} h_{i-1}, w_{i-2} w_{i-1}, H_i)
= \alpha \left[ \lambda \, P(w_i \mid h_{i-2} h_{i-1}) + (1-\lambda) \, P(w_i \mid h_{i-1} h_{i-2}) \right]
+ (1-\alpha) \, P(w_i \mid w_{i-2} w_{i-1}, H_i)
\tag{2}
$$

where $P(w_i \mid h_{i-2} h_{i-1})$ and $P(w_i \mid h_{i-1} h_{i-2})$ are the headword trigram probabilities without and with permutation, and $P(w_i \mid w_{i-2} w_{i-1}, H_i)$ is the probability of $w_i$ given that it is a headword.

The use of $h_{i-2} h_{i-1}$ in Equation (2) is motivated by the first assumption described in Section 2.1: headwords are conditioned not only on the two immediately preceding words, but also on the two previous headwords. In practice, we estimated the headword probability by interpolating the conditional probability based on the two previous headwords, $P(w_i \mid h_{i-2} h_{i-1})$ (with or without permutation), and the conditional probability based on the two preceding words, $P(w_i \mid w_{i-2} w_{i-1}, H_i)$. If the estimated interpolation weight $\alpha$ is around zero, it indicates that this assumption does not hold in real data. Note that we did not estimate the conditional probability $P(w_i \mid h_{i-2} h_{i-1} w_{i-2} w_{i-1})$ directly, because this is in the form of a 5-gram, for which the number of parameters is too large to estimate.

The use of $\lambda$ in Equation (2) comes from the second assumption in Section 2.1: headword trigrams are permutable. This assumption can be formulated as a co-occurrence model for headword prediction: the probability of a headword is determined by the occurrence of other headwords within a window. However, in our experiments we instead used an interpolated probability, for two reasons. First, co-occurrence models do not predict words from left to right, and are thus very difficult to interpolate with trigram models for decoding. Second, if we see n-gram models as one extreme that predicts the next word from a strictly ordered word sequence, co-occurrence models go to the other extreme of predicting the next word from a bag of previous words without taking word order into account at all. We prefer models that lie somewhere between these extremes and treat word order more flexibly. In the PHTM of Equation (2), $\lambda$ represents the impact of word order on headword prediction. When $\lambda = 1$ (i.e., the resulting model is a non-permuted headword trigram model, referred to as HTM), it indicates that the second assumption does not hold in real data. When $\lambda$ is around 0.5, it indicates that a headword bag model is sufficient.
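The following minimal sketch shows how the interpolated headword probability of Equation (2), as reconstructed above, could be computed. The probability callbacks and the default values of alpha and lam are illustrative assumptions; in practice the weights would be tuned on held-out data.

```python
def p_headword(w, h2, h1, w2, w1, p_head_tri, p_word_tri, alpha=0.3, lam=0.7):
    """Interpolated headword probability in the spirit of Equation (2).

    p_head_tri(w, a, b): headword trigram estimate P(w | a b)
    p_word_tri(w, a, b): word trigram estimate P(w | a b, H)
    lam weighs the observed headword order against its permutation;
    alpha weighs the headword context against the ordinary word context.
    """
    ordered = p_head_tri(w, h2, h1)
    permuted = p_head_tri(w, h1, h2)  # second assumption: headwords are permutable
    headword_part = lam * ordered + (1.0 - lam) * permuted
    return alpha * headword_part + (1.0 - alpha) * p_word_tri(w, w2, w1)
```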
2.3 Model parameter estimation

Assume that all conditional probabilities in Equation (1) are estimated using maximum likelihood estimation (MLE). Then the decomposition in Equation (1) is a strict equality when each word token is uniquely classified as a headword or a function word. This can be proven trivially: let $C(w_i)$ denote the category of $w_i$ ($H_i$ or $F_i$ in our case). Since $w_i$ determines its own category, we have

$$
P(w_i \mid w_{i-2} w_{i-1}) = P(C(w_i), w_i \mid w_{i-2} w_{i-1})
= P(C(w_i) \mid w_{i-2} w_{i-1}) \cdot P(w_i \mid w_{i-2} w_{i-1}, C(w_i)).
$$

Writing this out for the two categories, with the headword probability estimated as in Equation (2), gives the model we use:

$$
P(w_i \mid \Phi(w_1 \ldots w_{i-1})) =
\begin{cases}
P(H_i \mid w_{i-2} w_{i-1}) \cdot P_h(w_i \mid h_{i-2} h_{i-1}, w_{i-2} w_{i-1}, H_i) & \text{: headword} \\
P(F_i \mid w_{i-2} w_{i-1}) \cdot P(w_i \mid w_{i-2} w_{i-1}, F_i) & \text{: function word}
\end{cases}
\tag{6}
$$

There are three probabilities to be estimated in Equation (6): the word trigram probability $P(w_i \mid w_{i-2} w_{i-1}, F_i)$, the headword probability $P_h$ of Equation (2), and the category probability $P(H_i \mid w_{i-2} w_{i-1})$. In order to deal with the data sparseness problem of MLE, we used a backoff scheme (Katz, 1987) for parameter estimation. The backoff scheme recursively estimates the probability of an unseen n-gram from (n-1)-gram estimates. To keep the model size manageable, we also removed all n-grams with frequency less than 2.

In order to classify a word uniquely as H or F, we needed a mapping table in which each word in the lexicon corresponds to a category. The table was generated in the following manner. We first assumed that the mapping from part-of-speech (POS) tags to word categories is fixed; the tag set we used includes 1,187 POS tags, of which 102 count as headword tags in our experiments. We then used a POS tagger to generate a POS-tagged corpus, from which we generated the mapping table. (Since the POS tagger does not identify phrases, our implementation does not identify exactly one headword per phrase, but may identify multiple headwords in the case of compounds.) If a word could be mapped to both H and F, we chose the category that is more frequent for that word in the corpus. Using this mapping table, we achieved 98.5% accuracy of headword detection on the test data we used.

Through our experiments, we found that $P(H_i \mid w_{i-2} w_{i-1})$ is a poor estimator of the category probability; in fact, the unigram estimate $P(H_i)$ achieved better results, as shown in Section 6.1. Therefore, we also used the unigram estimate for the word category probability in our experiments. The alternative model that uses the unigram estimate is

$$
P(w_i \mid \Phi(w_1 \ldots w_{i-1})) =
\begin{cases}
P(H_i) \cdot P_h(w_i \mid h_{i-2} h_{i-1}, w_{i-2} w_{i-1}, H_i) & \text{: headword} \\
P(F_i) \cdot P(w_i \mid w_{i-2} w_{i-1}, F_i) & \text{: function word}
\end{cases}
\tag{7}
$$

We denote the models that use the trigram estimate for the category probability, as in Equation (6), as T-PHTM, and the models that use the unigram estimate, as in Equation (7), as U-PHTM.
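The word-to-category table described above can be built with a simple frequency vote. The following sketch assumes a POS-tagged corpus given as (word, tag) pairs and a set of headword POS tags; these inputs are illustrative stand-ins rather than the authors' actual resources.

```python
from collections import Counter, defaultdict

def build_category_table(tagged_corpus, headword_pos_tags):
    """Map each word to 'H' or 'F', picking the more frequent category for ambiguous words."""
    counts = defaultdict(Counter)
    for word, pos in tagged_corpus:
        counts[word]["H" if pos in headword_pos_tags else "F"] += 1
    return {word: cats.most_common(1)[0][0] for word, cats in counts.items()}
```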
3 Using Clusters

3.1 Principle

Clustering techniques attempt to exploit similarities between words to produce better estimates of the probability of word strings (Goodman, 2001). We mentioned in Section 2.2 that the headword trigram model can be thought of as a cluster-based model with two clusters, headwords and function words. In this section, we describe a method for automatically clustering similar words and headwords. We followed the techniques described in Goodman (2001) and Gao et al. (2001), and performed experiments using predictive clustering together with headword trigram models.

3.2 Predictive clustering model

A trigram model predicts the word $w_i$ from the two preceding words $w_{i-2}$ and $w_{i-1}$, called the conditional words. Gao et al. (2001) present a thorough comparative study of various clustering models for Asian languages, concluding that a model that uses clusters for the predicted words, called the predictive clustering model, performed best in most cases. Let $c_i$ denote the cluster that the word $w_i$ belongs to, and $c^h_i$ the cluster that the headword $w_i$ belongs to.

In this study, we performed clustering for words and headwords separately. As a result, we have the following two predictive clustering models, (8) for words and (9) for headwords:

$$
P(w_i \mid w_{i-2} w_{i-1}) = P(c_i \mid w_{i-2} w_{i-1}) \cdot P(w_i \mid w_{i-2} w_{i-1} c_i)
\tag{8}
$$

$$
P(w_i \mid h_{i-2} h_{i-1}) = P(c^h_i \mid h_{i-2} h_{i-1}) \cdot P(w_i \mid h_{i-2} h_{i-1} c^h_i)
\tag{9}
$$

Substituting Equations (8) and (9) into Equation (7), we get the cluster-based PHTM of Equation (10), referred to as C-PHTM:

$$
P(w_i \mid \Phi(w_1 \ldots w_{i-1})) =
\begin{cases}
P(H_i) \cdot P(c^h_i \mid h_{i-2} h_{i-1}) \cdot P(w_i \mid h_{i-2} h_{i-1} c^h_i) & \text{: headword} \\
P(F_i) \cdot P(c_i \mid w_{i-2} w_{i-1}) \cdot P(w_i \mid w_{i-2} w_{i-1} c_i) & \text{: function word}
\end{cases}
\tag{10}
$$

3.3 Finding clusters: model estimation

In constructing the clustering models, two factors were considered: how to find optimal clusters, and the optimal number of clusters.

The clusters were found automatically by attempting to minimize perplexity (Brown et al., 1992). In particular, for the predictive clustering models, we tried to minimize the perplexity of the training data under the model $P(c_i \mid w_{i-1}) \cdot P(w_i \mid c_i)$. Under MLE, $P(w_i \mid c_i) = C(w_i)/C(c_i)$, whose numerator is independent of the clustering used. Therefore, in order to select the best clusters, it is sufficient to try to maximize the remaining cluster-dependent factors.

The clustering technique we used creates a binary branching tree with words at the leaves. By cutting the tree at a certain level, it is possible to obtain a wide variety of numbers of clusters. For instance, if the tree is cut after the sixth level, there will be roughly $2^6 = 64$ clusters. In our experiments, we always tried numbers of clusters that are powers of 2; this seems to produce numbers of clusters that are close enough to optimal. In Equation (10), we used the power-of-2 cluster count that was optimal in this sense.
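Below is a minimal sketch of the two operations just described: reading off cluster identities by cutting a binary branching tree at a given depth, and composing a predictive clustering probability as in Equations (8) and (9). The bit-string path representation and the probability callbacks are assumptions made for illustration, not the authors' data structures.

```python
def cut_tree(word_to_path, depth):
    """Assign cluster ids by truncating each word's root-to-leaf bit string at `depth`.

    Cutting at depth 6 yields at most 2**6 = 64 clusters.
    """
    return {word: path[:depth] for word, path in word_to_path.items()}

def p_predictive_cluster(w, w2, w1, cluster_of, p_cluster, p_word_in_cluster):
    """P(w | w2 w1) decomposed as P(c | w2 w1) * P(w | w2 w1, c), cf. Equation (8)."""
    c = cluster_of[w]
    return p_cluster(c, w2, w1) * p_word_in_cluster(w, w2, w1, c)
```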
4 Relation to Previous Work

Our LMs are similar to a number of existing ones. One such model was proposed by ATR (Isotani and Matsunaga, 1994), which we will refer to as the ATR model below. In the ATR model, the probability of each word in a sentence is determined by the preceding content word and function word pair. Isotani and Matsunaga (1994) reported slightly better results than word bigram models for Japanese speech recognition. Geutner (1996) interpolated the ATR model with word-based trigram models and reported very limited improvements over word trigram models for German speech recognition.

One significant difference between the ATR model and our own lies in the use of predictive clustering. Another difference is that our models use separate probability estimates for headwords and function words, as shown in Equations (6) and (7). In contrast, the ATR model is conceptually more similar to skipping models (Rosenfeld, 1994; Ney et al., 1994; Siu and Ostendorf, 2000), in which a single probability estimate is applied to both content and function words, and the word categories are used only to find the content and function word pairs in the context.

Another model similar to ours is that of Jelinek (1990), in which the headwords of the two phrases immediately preceding a word, as well as the last two words, are used to compute the word probability. The resulting model is similar to a 5-gram model, and a sophisticated interpolation formula had to be used because the number of parameters is too large for direct estimation. Our models are easier to learn because they use trigrams. They also differ from Jelinek's model in that they estimate the probabilities of headwords and function words separately.

A significant number of sophisticated language modeling techniques have recently been proposed to capture more linguistic structure from a larger context. Unfortunately, most of them suffer from either high computational cost or the difficulty of obtaining enough manually parsed corpora for parameter estimation, which makes it difficult to apply them successfully to realistic applications. For example, maximum entropy (ME) models (Rosenfeld, 1994) provide an elegant framework for incorporating arbitrary knowledge sources, but training and using ME models is computationally extremely expensive.

Another interesting approach that exploits linguistic structure is structured language modeling (SLM; Chelba and Jelinek, 2000). The SLM uses a statistical parser trained on an annotated corpus to identify the headwords of each constituent, which are then used as conditioning words in the trigram context. Although SLMs have been shown to significantly improve LM performance measured in perplexity, they also pose practical problems. First, the performance of the SLM is contingent on the amount and quality of syntactically annotated training data, which may not always be available. Second, SLMs are very time-intensive, both in training and in use. Charniak (2001) and Roark (2001) also present language models based on syntactic dependency structure, which use lexicalized PCFGs that sum over derivation probabilities. Both report improvements in perplexity over Chelba and Jelinek (2000) on the Wall Street Journal section of the Penn Treebank, suggesting that syntactic structure can be further exploited for language modeling. The kind of linguistic structure used in our models is significantly more modest than that provided by parser-based models, yet it offers practical benefits for realistic applications, as shown in the next section.

5 Evaluation Methodology

The most common metric for evaluating a language model is perplexity, which can be roughly interpreted as the expected branching factor of the test document when presented to the language model. Perplexity is widely used because of its simplicity and efficiency. However, the ultimate quality of a language model must be measured by its effect on the specific task to which it is applied, such as speech recognition.
Lower perplexities usually result in lower error rates, but there are numerous counterexamples to this in the literature.

In this study, we evaluated our language models on the task of Japanese Kana-Kanji conversion, which is the standard method of inputting Japanese text by converting a syllabary-based Kana string into the appropriate combination of ideographic Kanji and Kana. This problem is similar to speech recognition, except that it involves no acoustic ambiguity. Performance on this task is generally measured in terms of the character error rate (CER), which is the number of characters wrongly converted from the phonetic string divided by the number of characters in the correct transcript. The role of the language model is to select the word string (in a combination of Kanji and Kana) with the highest probability among the candidate strings that match the typed phonetic (Kana) string. Current products make about 5-10% errors in converting real data from a wide variety of domains.

For our experiments, we used two newspaper corpora, the Nikkei and the Yomiuri Newspapers, both of which have been word-segmented. We built language models from a 36-million-word subset of the Nikkei Newspaper corpus, performed parameter optimization on a 100,000-word subset of the Yomiuri Newspaper corpus (held-out data), and tested our models on another 100,000-word subset of the Yomiuri Newspaper corpus. The lexicon we used contains 167,107 entries.

In our experiments, we used the so-called "N-best rescoring" method, in which a list of hypotheses is generated by the baseline language model (a word trigram model in this study; see Gao et al. (2002) for a detailed description) and then rescored using a more sophisticated LM. Due to the limited number of hypotheses in the N-best list, the second pass may be constrained by the first pass. In this study, we used the 100-best list. The "oracle" CER (i.e., the CER of the hypothesis with the minimum number of errors) is presented in Table 1; it is the upper bound on performance in our experiments. The performance of the conversion using the baseline trigram model is much better than the state-of-the-art performance currently available in the marketplace, which may be due to the large amount of training data we used and to the similarity between the training and test data. We also notice that the "oracle" CER is relatively high due to the high out-of-vocabulary rate, which is 1.14%. Because we have only limited room for improvement, the improvements reported in this study may be underestimated.

Table 1. CER of the baseline trigram model and the oracle CER of the 100-best list.
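The CER defined above is commonly computed as a character-level edit distance between the converted string and the correct transcript, divided by the length of the transcript. The sketch below is one such realization, not necessarily the authors' exact scoring procedure.

```python
def cer(hypothesis, reference):
    """Character error rate: edit distance between strings divided by reference length."""
    m, n = len(hypothesis), len(reference)
    d = list(range(n + 1))  # d[j] holds the distance for the first j reference characters
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = min(d[j] + 1,                                         # deletion
                      d[j - 1] + 1,                                     # insertion
                      prev + (hypothesis[i - 1] != reference[j - 1]))   # substitution/match
            prev, d[j] = d[j], cur
    return d[n] / n if n else 0.0
```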