<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3102"> <Title>Initial Explorations in English to Turkish Statistical Machine Translation</Title> <Section position="3" start_page="0" end_page="8" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The availability of large amounts of so-called parallel texts has motivated the application of statistical techniques to the problem of machine translation, starting with the seminal work at IBM in the early 1990s (Brown et al., 1992; Brown et al., 1993). Statistical machine translation views the translation process as a noisy-channel signal recovery process in which one tries to recover the input &quot;signal&quot; e from the observed output signal f. (Footnote 1: e and f denote English and French, as used in the original IBM project, which translated from French to English using the parallel text of the Hansards, the Canadian parliamentary proceedings.) Early statistical machine translation systems used a purely word-based approach without taking into account any of the morphological or syntactic properties of the languages (Brown et al., 1993). Limitations of basic word-based models prompted researchers to exploit morphological and/or syntactic/phrasal structure (Niessen and Ney (2004), Lee (2004), Yamada and Knight (2001), Marcu and Wong (2002), Och and Ney (2004), Koehn et al.</Paragraph> <Paragraph position="1"> (2003), among others). In the context of agglutinative languages similar to Turkish (at least in morphological aspects), there has been some recent work on translating from and to Finnish using the significant amount of data in the Europarl corpus. Although the BLEU (Papineni et al., 2002) score from Finnish to English is 21.8, the score in the reverse direction is reported as 13.0, which is one of the lowest among the scores for the 11 European languages (Koehn, 2005). Also, the reported scores for translation from and to Finnish are the lowest on average, even with the large number of
sentences available. These results suggest that standard alignment models may be poorly equipped to deal with translation from a morphologically poor language like English to a morphologically complex language like Finnish or Turkish.</Paragraph> <Paragraph position="2"> This paper presents results from some very preliminary explorations into developing an English-to-Turkish statistical machine translation system and discusses the various problems encountered. Starting with a baseline word model trained on about 20K aligned sentences, we explore various ways of exploiting morphological structure to improve upon the baseline system. As Turkish is a language with agglutinative word structures, we experiment with morphologically segmented and disambiguated versions of the parallel text in order to also uncover relations between morphemes and function words in one language and morphemes and function words in the other, in addition to relations between open-class content words. This is motivated by a cursory analysis of sentence-aligned Turkish and English texts, which indicates that translations of certain English words are actually morphemes embedded in Turkish words. We choose a morphological segmentation representation on the Turkish side which abstracts from word-internal morphological variations and conflates the statistics from allomorphs so that data sparseness can be alleviated to a certain extent.</Paragraph> <Paragraph position="3"> This paper is organized as follows: we start with some of the issues in building an SMT system into Turkish, followed by a short overview of Turkish morphology to motivate its effect on the word alignment problem with English. 
We then present results from our explorations with a baseline system and with morphologically segmented parallel aligned texts, and conclude after a short discussion.</Paragraph> <Paragraph position="4"> 2 Issues in building a SMT system for Turkish The first step in building an SMT system is the compilation of a large amount of parallel text, which turns out to be a significant problem for the Turkish and English pair. There are not many sources of such texts, and most of what is electronically available consists of parallel texts in the diplomatic or legal domains from NATO, EU, and foreign ministry sources. There is also a limited amount of parallel news data available from certain news sources. Although we have collected about 300K sentences of parallel text, most of these require significant clean-up (from HTML/PDF sources), and we have limited our training data in this paper to a subset of about 22,500 sentences, comprising the sentences of 40 words or less from the 30K sentences that have been cleaned up and sentence aligned.2 The main aspect that would have to be seriously considered first for Turkish in SMT is its productive inflectional and derivational morphology. Turkish word forms consist of morphemes concatenated to a root morpheme or to other morphemes, much like &quot;beads on a string&quot; (Oflazer, 1994). With very few exceptions, the surface realizations of the morphemes are conditioned by various local regular morphophonemic processes such as vowel harmony, consonant assimilation and elision. Further, most morphemes have phrasal scope: although they attach to a particular stem, their syntactic roles extend beyond the stem. The morphotactics of word forms can be quite complex, especially when multiple derivations are involved. 
For instance, the derived modifier sağlamlaştırdığımızdaki (footnote 4) would be broken into surface morphemes as follows: sağlam+laş+tır+dığ+ımız+da+ki. Starting from an adjectival root sağlam, this word form first derives a verbal stem sağlamlaş, meaning &quot;to become strong&quot;. A second suffix, the causative surface morpheme +tır, which we treat as a verbal derivation, forms yet another verbal stem meaning &quot;to cause to become strong&quot; or &quot;to make strong (fortify)&quot;. The immediately following participle suffix +dığ produces a participial nominal, which inflects in the normal pattern for nouns (here, for 1st person plural possessor, which marks agreement with the subject of the verb, and locative case).</Paragraph> <Paragraph position="5"> [...] parallel texts in order not to exceed the maximum number of words recommended for GIZA++ training. Footnote 4: Literally, &quot;(the thing existing) at the time we caused (something) to become strong&quot;. Obviously this is not a word that one would use every day, but it already illustrates the difficulty, as one Turkish &quot;word&quot; would have to be aligned to a possibly discontinuous sequence of English words if we were to attempt a word-level alignment. Turkish words (excluding noninflecting frequent words such as conjunctions, clitics, etc.) found in typical running text average about 10 letters in length. The average number of bound morphemes in such words is about 2.</Paragraph> <Paragraph position="6"> 
The final suffix, +ki, is a relativizer, producing a word which functions as a modifier in a sentence, modifying a noun somewhere to the right.</Paragraph> <Paragraph position="7"> However, if one further abstracts from the morphophonological processes involved, one could get the lexical form sağlam+lAş+DHr+DHk+HmHz+DA+ki. In this representation, the lexical morphemes except the lexical root utilize meta-symbols that stand for a set of graphemes which are selected on the surface by a series of morphographemic processes. These processes are rooted in morphophonological processes, some of which are discussed below, but have nothing whatsoever to do with any of the syntactic and semantic relationships that the word is involved in. For instance, A stands for the low unrounded vowels a and e in orthography, H stands for the high vowels ı, i, u and ü, and D stands for d and t, representing alveolar consonants. Thus, a lexical morpheme represented as +DHr actually represents 8 possible allomorphs, which appear as one of +dır, +dir, +dur, +dür, +tır, +tir, +tur, +tür depending on the local morphophonemic context. Thus, at this level of representation, words that look very different on the surface look very similar. For instance, although the words masasında 'on his table' and defterinde 'in his notebook' in Turkish look quite different, the lexical morphemes except for the root are the same: masasında has the lexical structure masa+sH+ndA, while defterinde has the lexical structure defter+sH+ndA.</Paragraph> <Paragraph position="8"> The use of this representation is particularly important for Turkish for the following reason. Allomorphs which differ because of local word-internal morphographemic and morphotactic constraints almost always correspond to the same words or units in English when translated. When such allomorphs are treated as distinct units in alignment, statistics get fragmented and the model quality suffers. 
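The effect of conflating allomorphs can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system used in the paper: the names META, allomorphs and conflate, as well as the toy counts, are hypothetical. The expansion deliberately ignores vowel harmony and consonant assimilation, so it enumerates all candidate surface forms rather than choosing the one a given context licenses. Mapping surface strings back to lexical morphemes by string lookup is also a simplification; the paper relies on morphological segmentation and disambiguation for that step.

```python
from collections import Counter
from itertools import product

# Meta-symbols as described in the text: each stands for a set of graphemes.
META = {"A": "ae", "H": "ıiuü", "D": "dt"}


def allomorphs(lexical):
    """Enumerate every surface string a lexical morpheme could stand for.

    Over-generates on purpose: no vowel harmony or consonant assimilation
    is applied, so all combinations of the meta-symbol values are listed.
    """
    choices = [META.get(ch, ch) for ch in lexical]
    return ["".join(combo) for combo in product(*choices)]


def conflate(surface_counts, lexical_forms):
    """Pool the counts of surface allomorphs under their lexical morpheme."""
    surface_to_lexical = {}
    for lex in lexical_forms:
        for surf in allomorphs(lex):
            # First lexical form wins; real data needs disambiguation here.
            surface_to_lexical.setdefault(surf, lex)
    pooled = Counter()
    for surf, n in surface_counts.items():
        pooled[surface_to_lexical.get(surf, surf)] += n
    return pooled


# The causative +DHr expands to the eight allomorphs listed in the text.
causative = allomorphs("+DHr")
print(len(causative), causative)

# Hypothetical counts, invented purely for illustration.
toy_counts = {"+dır": 3, "+dir": 5, "+tür": 2, "+ki": 4}
print(conflate(toy_counts, ["+DHr"]))
```

With the toy counts, the three causative allomorphs that were counted separately (3, 5 and 2) are pooled into a single count of 10 for +DHr, which is the kind of sparseness reduction the lexical representation is meant to provide.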
On the other hand, this representation, if directly used in a standard SMT model such as IBM Model 4, will most likely cause problems, since the distortion parameters will now have to take on the task of generating the correct sequence of morphemes in a word (which is really a local word-internal problem to be solved) in addition to generating the correct sequence of words.</Paragraph> </Section> </Paper>