<?xml version="1.0" standalone="yes"?> <Paper uid="N03-2016"> <Title>Cognates Can Improve Statistical Translation Models</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In the context of machine translation, the term cognates denotes words in different languages that are similar in their orthographic or phonetic form and are possible translations of each other. The similarity is usually due either to a genetic relationship (e.g. English night and German Nacht) or borrowing from one language to another (e.g. English sprint and Japanese supurinto). In a broad sense, cognates include not only genetically related words and borrowings but also names, numbers, and punctuation. Practically all bitexts (bilingual parallel corpora) contain some kind of cognates. If the languages are represented in different scripts, a phonetic transcription or transliteration of one or both parts of the bitext is a prerequisite for identifying cognates.</Paragraph> <Paragraph position="1"> Cognates have been employed for a number of bitext-related tasks, including sentence alignment (Simard et al., 1992), inducing translation lexicons (Mann and Yarowsky, 2001), and improving statistical machine translation models (Al-Onaizan et al., 1999). Cognates are particularly useful when machine-readable bilingual dictionaries are not available. Al-Onaizan et al. (1999) experimented with using bilingual dictionaries and cognates in the training of Czech-English translation models.
They found that appending probable cognates to the training bitext significantly lowered the perplexity score on the test bitext (in some cases more than when using a bilingual dictionary), and observed an improvement in the word alignments of test sentences.</Paragraph> <Paragraph position="2"> In this paper, we investigate the problem of incorporating the potentially valuable cognate information into the translation models of Brown et al. (1990), which, in their original formulation, consider lexical items in abstraction from their form. For training of the models, we use the GIZA program (Al-Onaizan et al., 1999). A list of likely cognate pairs is extracted from the training corpus on the basis of orthographic similarity, and appended to the corpus itself. The objective is to reinforce the co-occurrence count between cognates in addition to already existing co-occurrences. The results of experiments conducted on a variety of bitexts show that cognate identification can improve word alignments, which leads to better translation models, and, consequently, translations of higher quality. The improvement is achieved without modifying the statistical training algorithm.</Paragraph> </Section> </Paper>