<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0804">
  <Title>Bilingual Word Spectral Clustering for Statistical Machine Translation</Title>
  <Section position="5" start_page="28" end_page="31" type="evalu">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> To test our algorithm, we applied it to the TIDES Chinese-English small data track evaluation test set.</Paragraph>
    <Paragraph position="1"> After preprocessing, such as English tokenization, Chinese word segmentation, and parallel sentence splitting, there are in total 4172 parallel sentence pairs for training. We manually labeled word alignments for 627 test sentence pairs randomly sampled from the dry-run test data in 2001, which has four human translations for each Chinese sentence. The preprocessing for the test data is different from the above, as it is designed for humans to label word alignments correctly by removing ambiguities from tokenization and word segmentation as much as possible. The data statistics are shown in Table 1.</Paragraph>
    <Section position="1" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
4.1 Building Co-occurrence Matrix
</SectionTitle>
      <Paragraph position="0"> Bilingual word co-occurrence counts are collected from the training data for constructing the matrix of C{F,E}. Raw counts are collected without word alignment between the parallel sentences. Practically, we can use word alignment as used in (Och, 1999). Given an initial word alignment inferred by HMM, the counts are collected from the aligned word pair. If the counts are L-1 normalized, then the co-occurrence matrix is essentially the bilingual word-to-word translation lexicon such as P(fj|eaj).</Paragraph>
      <Paragraph position="1"> We can remove very small entries (P(f|e) [?] 1e[?]7), so that the matrix of C{F,E} is more sparse for eigenstructure computation. The proposed algorithm is then carried out to generate the bilingual word clusters for both English and Chinese.</Paragraph>
      <Paragraph position="2"> Figure 1 shows the ranked Eigen values for the</Paragraph>
    </Section>
    <Section position="2" start_page="28" end_page="29" type="sub_section">
      <SectionTitle>
Figure 1: Ranked Eigenvalues of the Co-occurrence Matrix
</SectionTitle>
      <Paragraph position="0"> It is clear, that using the initial HMM word alignment for co-occurrence matrix makes a difference.</Paragraph>
      <Paragraph position="1"> The top Eigen value using word alignment in plot a.</Paragraph>
      <Paragraph position="2"> (the deep blue curve) is 3.1946. The two plateaus indicate how many top K eigen vectors to choose to reduce the feature space. The first one indicates that K is in the range of 50 to 120, and the second plateau indicates K is in the range of 500 to 800. Plot b. is inferred from the raw co-occurrence counts with the top eigen value of 2.7148. There is no clear plateau, which indicates that the feature space is less structured than the one built with initial word alignment. We find 500 top eigen vectors are good enough for bilingual clustering in terms of efficiency and effectiveness. null</Paragraph>
    </Section>
    <Section position="3" start_page="29" end_page="30" type="sub_section">
      <SectionTitle>
4.2 Clustering Results
</SectionTitle>
      <Paragraph position="0"> Clusters built via the two described methods are compared. The first method bil1 is the two-step optimization approach: first optimizing the monolingual clusters for target language (English), and afterwards optimizing clusters for the source language (Chinese). The second method bil2 is our proposed algorithm to compute the eigenstructure of the co-occurrence matrix, which builds the left and right subspaces, and finds clusters in such spaces. Top 500 eigen vectors are used to construct these subspaces. For both methods, 1000 clusters are inferred for English and Chinese respectively. The number of clusters is chosen in a way that the final word alignment accuracy was optimal. Table 2 provides the clustering examples using the two algorithms.</Paragraph>
      <Paragraph position="1">  The monolingual word clusters often contain words with similar syntax functions. This happens with esp. frequent words (eg. mono-E1 and mono-E2). The algorithm tends to put rare words such as &amp;quot;carota, anglophobia&amp;quot; into a very big cluster (eg. mono-E3). In addition, the words within these monolingual clusters rarely share similar translations such as the typical cluster of &amp;quot;week, month, year&amp;quot;. This indicates that the corresponding Chinese clusters inferred by optimizing Eqn. 7 are not close in terms of translational similarity. Overall, the method of bil1 does not give us a good translational correspondence between clusters of two languages.</Paragraph>
      <Paragraph position="2"> The English cluster of mono-E3 and its best aligned candidate of bil1-C3 are not well correlated either.</Paragraph>
      <Paragraph position="3"> Our proposed bilingual cluster algorithm bil2 generates the clusters with stronger semantic meaning within a cluster. The cluster of bil2-E1 relates to the concept of &amp;quot;wine&amp;quot; in English. The mono-lingual word clustering tends to scatter those words into several big noisy clusters. This cluster also has a good translational correspondent in bil2-C1 in Chinese. The clusters of bil2-E2 and bil2-C2 are also correlated very well. We noticed that the Chinese clusters are slightly more noisy than their English corresponding ones. This comes from the noise in the parallel corpus, and sometimes from ambiguities of the word segmentation in the preprocessing steps.</Paragraph>
      <Paragraph position="4"> To measure the quality of the bilingual clusters, we can use the following two kind of metrics: * Average epsilon1-mirror (Wang et al., 1996): The epsilon1-mirror of a class Ei is the set of clusters in Chinese which have a translation probability greater than epsilon1. In our case, epsilon1 is 0.05, the same value used in (Och, 1999).</Paragraph>
      <Paragraph position="5"> * Perplexity: The perplexity is defined as proportional to the negative log likelihood of the HMM model Viterbi alignment path for each sentence pair. We use the bilingual word clusters in two extended HMM models, and measure the perplexities of the unseen test data after seven forward-backward training iterations.</Paragraph>
      <Paragraph position="6"> The two perplexities are defined as PP1 =</Paragraph>
      <Paragraph position="8"> two extended HMM models in Eqn 3 and 4.</Paragraph>
      <Paragraph position="9"> Both metrics measure the extent to which the translation probability is spread out. The smaller the better. The following table summarizes the results on epsilon1-mirror and perplexity using different methods on the unseen test data.</Paragraph>
      <Paragraph position="10">  The baseline uses no word clusters. bil1 and bil2 are defined as above. It is clear that our proposed method gives overall lower perplexity: 1611 from the baseline of 1717 using the extended HMM-1.</Paragraph>
      <Paragraph position="11"> If we use HMM-2, the perplexity goes down even more using bilingual clusters: 352.28 using bil1, and 343.64 using bil2. As stated, the four-dimensional  table of P(aj|aj[?]1,E(eaj[?]1),F(fj[?]1)) is easily subject to overfitting, and usually gives worse perplexities. null Average epsilon1-mirror for the two-step bilingual clustering algorithm is 3.97, and for spectral clustering algorithm is 2.54. This means our proposed algorithm generates more focused clusters of translational equivalence. Figure 2 shows the histogram for the cluster pairs (Fj,Ei), of which the cluster level translation probabilities P(Fj|Ei) [?] [0.05,1]. The interval [0.05,1] is divided into 10 bins, with first bin [0.05,0.1], and 9 bins divides[0.1,1] equally. The percentage for clusters pairs with P(Fj|Ei) falling in each bin is drawn.</Paragraph>
      <Paragraph position="12">  Our algorithm generates much better aligned cluster pairs than the two-step optimization algorithm. There are 120 cluster pairs aligned with P(Fj|Ei) [?] 0.9 using clusters from our algorithm, while there are only 8 such cluster pairs using the two-step approach. Figure 3 compares the epsilon1-mirror at different numbers of clusters using the two approaches. Our algorithm has a much better epsilon1-mirror than the two-step approach over different number of clusters. Overall, the extended HMM-2 is better than HMM-1 in terms of perplexity, and is easier to train.</Paragraph>
    </Section>
    <Section position="4" start_page="30" end_page="30" type="sub_section">
      <SectionTitle>
4.3 Applications in Word Alignment
</SectionTitle>
      <Paragraph position="0"> We also applied our bilingual word clustering in a word alignment setting. The training data is the TIDES small data track. The word alignments are manually labeled for 627 sentences sampled from the dryrun test data in 2001. In this manually aligned data, we include one-to-one, one-to-many, and many-to-many word alignments. Figure 4 summarizes the word alignment accuracy for different e-mirror over different settings  methods. The baseline is the standard HMM translation model defined in Eqn. 2; the HMM1 is defined in Eqn 3, and HMM2 is defined in Eqn 4. The algorithm is applying our proposed bilingual word clustering algorithm to infer 1000 clusters for both languages. As expected, Figure 4 shows that using  Extended HMM-2Figure 4: Word Alignment Over Iterations word clusters is helpful for word alignment. HMM2 gives the best performance in terms of F-measure of word alignment. One quarter of the words in the test vocabulary are unseen as shown in Table 1. These unseen words related alignment links (4778 out of 14769) will be left unaligned by translation models. Thus the oracle (best possible) recall we could get is 67.65%. Our standard t-test showed that significant interval is 0.82% at the 95% confidence level. The improvement at the last iteration of HMM is marginally significant.</Paragraph>
    </Section>
    <Section position="5" start_page="30" end_page="31" type="sub_section">
      <SectionTitle>
4.4 Applications in Phrase-based Translations
</SectionTitle>
      <Paragraph position="0"> Our pilot word alignment on unseen data showed improvements. However, we find it more effective in our phrase extraction, in which three key scores  are computed: phrase level fertilities, distortions, and lexicon scores. These scores are used in a local greedy search to extract phrase pairs (Zhao and Vogel, 2005). This phrase extraction is more sensitive to the differences in P(fj|ei) than the HMM Viterbi word aligner.</Paragraph>
      <Paragraph position="1"> The evaluation conditions are defined in NIST 2003 Small track. Around 247K test set (919 Chinese sentences) specific phrase pairs are extracted with up to 7-gram in source phrase. A trigram language model is trained using Gigaword XinHua news part. With a monotone phrase-based decoder, the translation results are reported in Table 3. The  baseline is using the lexicon P(fj|ei) trained from standard HMM in Eqn. 2, which gives a BLEU score of 0.1558 +/- 0.0113. Bil1 and Bil2 are using P(fj|ei) from HMM in Eqn. 4 with 1000 bilingual word clusters inferred from the two-step algorithm and the proposed one respectively. Using the clusters from the two-step algorithm gives a BLEU score of 0.1575, which is close to the baseline. Using clusters from our algorithm, we observe more improvements with BLEU score of 0.1644 and a NIST score of 6.582.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>