<?xml version="1.0" standalone="yes"?> <Paper uid="E99-1010"> <Title>An Efficient Method for Determining Bilingual Word Classes</Title> <Section position="7" start_page="187" end_page="187" type="evalu"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> The statistical machine-translation method described in (Och and Weber, 1998) makes use of bilingual word classes. The key elements of this approach are the alignment templates (originally referred to as translation rules), which are pairs of phrases together with an alignment between the words of the phrases. Examples of alignment templates are shown in Figure 2. The advantage of the alignment template approach over word-based statistical translation models is that word context and local re-orderings are explicitly taken into account. The alignment templates are trained automatically from a parallel training corpus. The translation of a sentence is carried out by a search process that determines the set of alignment templates that optimally covers the source sentence. The bilingual word classes are used to generalize the applicability of the alignment templates in search. If there exists a class containing all city names in both the source and the target language, an alignment template involving one specific city can be generalized to all cities. More details are given in (Och and Weber, 1998; Och and Ney, 1999).</Paragraph> <Paragraph position="1"> We demonstrate results of our bilingual clustering method for two different bilingual corpora (see Tables 1 and 2). The EUTRANS-I corpus is a subtask of the &quot;Traveller Task&quot; (Vidal, 1997), an artificially generated Spanish-English corpus. The domain of the corpus is a human-to-human communication situation at the reception desk of a hotel. 
The EUTRANS-II corpus is a natural German-English corpus consisting of different text types belonging to the domain of tourism: bilingual Web pages of hotels, bilingual tourist brochures, and business correspondence. The target language of our experiments is English.</Paragraph> <Paragraph position="2"> We compare the three described methods for generating bilingual word classes. The classes MONO are determined by monolingually optimizing source and target language classes with Eq. (4). The classes BIL are determined by bilingually optimizing classes with Eq. (10). The classes BIL-2 are determined by first optimizing classes monolingually for the target language (English) and afterwards optimizing classes for the source language (Eq. (11) and Eq. (12)).</Paragraph> <Paragraph position="3"> For EUTRANS-I we used 60 classes and for EUTRANS-II we used 500 classes. We chose the number of classes such that the final performance of the translation system was optimal. The CPU time for the optimization of bilingual word classes on an Alpha workstation was under 20 seconds for EUTRANS-I and less than two hours for EUTRANS-II.</Paragraph> <Paragraph position="4"> Table 3 provides examples of bilingual word classes for the EUTRANS-I corpus. It can be seen that the resulting classes often contain words that are similar in their syntactic and semantic functions. The grouping of words with different meanings, such as today and tomorrow, does not imply that these words should be translated by the same Spanish word, but it does imply that the translations of these words are likely to be in the same Spanish word class.</Paragraph> <Paragraph position="5"> To measure the quality of our bilingual word classes we applied two different evaluation measures: 1. Average ε-mirror size (Wang et al., 1996): the ε-mirror of a class E is the set of classes that have a translation probability greater than ε. We use ε = 0.05.</Paragraph> <Paragraph position="6"> 2. 
The perplexity of the class transition probability on a bilingual test corpus: $\exp\bigl(-J^{-1}\sum_{j=1}^{J}\max_i \log p(g(f_j) \mid g(e_i))\bigr)$. Both measures determine the extent to which the translation probability is spread out. A small value means that the translation probability is highly focused and that knowledge of the source language class provides much information about the target language class.</Paragraph> <Paragraph position="7"> Table 4 shows the perplexity of the obtained translation lexicon without word classes, with monolingual word classes, and with bilingual word classes. As expected, the bilingually optimized classes (BIL, BIL-2) achieve a significantly lower perplexity and a lower average ε-mirror than the monolingually optimized classes (MONO).</Paragraph> <Paragraph position="8"> Tables 6 and 7 show the translation quality of the statistical machine translation system described in (Och and Weber, 1998) using no classes at all (WORD) and using monolingually and bilingually optimized word classes. The translation system was trained on the bilingual training corpus without any further knowledge sources. Our evaluation criterion is the word error rate (WER): the minimum number of insertions, deletions, and substitutions relative to a reference translation.</Paragraph> <Paragraph position="9"> As expected, translation quality improves when classes are used. For the small EUTRANS-I task the word error rate decreases significantly. The word error rates for the EUTRANS-II task are much larger because that task has a very large vocabulary and is more complex. The bilingual classes yield better results than the monolingual classes MONO. One explanation for the improvement in translation quality is that the bilingually optimized classes result in an increased average size of the alignment templates used. For example, the average length of alignment templates on the EUTRANS-I corpus is 2.85 using WORD and 5.19 using BIL-2. 
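As an illustration, the two evaluation measures can be sketched in Python. This is a minimal toy sketch, not the authors' implementation: the word-to-class maps and the class translation probabilities below are invented for the example.

```python
import math

# Hypothetical word-to-class maps for source (Spanish) and target (English) words.
src_class = {"hotel": "C1", "habitacion": "C2"}
tgt_class = {"hotel": "D1", "room": "D2"}

# Hypothetical class translation probabilities p(source class | target class).
p = {("C1", "D1"): 0.9, ("C1", "D2"): 0.1,
     ("C2", "D1"): 0.2, ("C2", "D2"): 0.8}

def class_transition_perplexity(src_words, tgt_classes):
    """Perplexity exp(-(1/J) * sum_j max_i log p(g(f_j) | g(e_i)))."""
    total = 0.0
    for f in src_words:
        # For each source word, take the best-matching target class.
        best = max(p.get((src_class[f], d), 1e-12) for d in tgt_classes)
        total += math.log(best)
    return math.exp(-total / len(src_words))

def e_mirror_size(tgt_cls, eps=0.05):
    """Size of the eps-mirror: source classes with translation probability above eps."""
    return sum(1 for (c, d), prob in p.items() if d == tgt_cls and prob > eps)
```

A sharply focused translation probability drives the perplexity toward 1 and keeps the ε-mirror small, which is what the bilingually optimized classes are expected to achieve.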
The longer the average alignment template, the more context is used in the translation, and hence the higher the translation quality. An explanation for the superiority of BIL-2 over BIL is that optimizing the English classes monolingually first makes longer sequences of classes occur more often, thereby increasing the average alignment template size.</Paragraph> </Section> </Paper>