<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3102">
  <Title>Initial Explorations in English to Turkish Statistical Machine Translation</Title>
  <Section position="4" start_page="8" end_page="9" type="metho">
    <SectionTitle>
3 Aligning English-Turkish Sentences
</SectionTitle>
    <Paragraph position="0"> If an alignment between the components of parallel Turkish and English sentences is computed, one obtains an alignment like the one shown in Figure 1, where it is clear that Turkish words may actually correspond to whole phrases in the English sentence.</Paragraph>
    <Paragraph position="1"> One major problem with this situation is that even if a word occurs many times on the English side, the actual Turkish equivalent could either be missing from the Turkish part or occur with a very low frequency, while many inflected variants of the form could be present. For example, Table 1 shows the occurrences of different forms of the root word faaliyet 'activity' in the parallel texts we experimented with. Although many forms of the root word appear, none of them appears very frequently, and one may even have to drop occurrences of frequency 1 depending on the word-level alignment model used, further worsening the sparseness problem.5 To overcome this problem and to get the maximum benefit from the limited amount of parallel text, we decided to perform morphological analysis of both the Turkish and the English texts, to be able to uncover the relationships between root words, suffixes, and function words while aligning them.</Paragraph>
    <Paragraph position="2"> 5A noun root in Turkish may have about a hundred inflected forms, and substantially more if productive derivations are considered, while verbs can have thousands of inflected and derived forms, if not more.</Paragraph>
    <Section position="1" start_page="9" end_page="9" type="sub_section">
      <SectionTitle>
Wordform Count Gloss
</SectionTitle>
      <Paragraph position="0"> faaliyet 3 'activity'
faaliyete 1 'to the activity'
faaliyetinde 1 'in its activity'
faaliyetler 3 'activities'
faaliyetlere 6 'to the activities'
faaliyetlerinde 1 'in their activities'
faaliyetlerine 5 'to their activities'
faaliyetlerini 1 'their activities (acc.)'
faaliyetlerinin 2 'of their activities'
faaliyetleriyle 1 'with their activities'
faaliyette 2 'in (the) activity'
faaliyetteki 1 'that which is in activity'
Total 41
On the Turkish side, we extracted the lexical morphemes of each word using a version of the morphological analyzer (Oflazer, 1994) that segments Turkish words along morpheme boundaries and normalizes root words in cases where they are deformed by a morphographemic process. For example, the word faaliyetleriyle, when segmented into lexical morphemes, becomes faaliyet +lAr +sH +ylA. Ambiguous instances were disambiguated statistically (Külekçi and Oflazer, 2005).</Paragraph>
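The lexical-morpheme representation described above can be illustrated with a small sketch. This is a toy stand-in that assumes a hand-built lookup table in place of the full morphological analyzer (Oflazer, 1994); only the faaliyetleriyle analysis comes from the text, and the other entries are assumptions.

```python
# Toy illustration of the lexical-morpheme representation used on the
# Turkish side. A hand-built lookup table stands in for the real
# morphological analyzer; all entries except "faaliyetleriyle" (which is
# taken from the text) are assumed for illustration.
TOY_ANALYSES = {
    "faaliyetleriyle": ["faaliyet", "+lAr", "+sH", "+ylA"],  # from the paper
    "faaliyetler":     ["faaliyet", "+lAr"],                 # assumed analysis
    "faaliyet":        ["faaliyet"],
}

def segment(word):
    """Return the lexical morphemes of a word, or the word itself if unknown."""
    return TOY_ANALYSES.get(word, [word])

def segment_sentence(words):
    """Segment every word of a sentence into one flat token sequence."""
    tokens = []
    for w in words:
        tokens.extend(segment(w))
    return tokens
```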
      <Paragraph position="1"> Similarly, the English text was tagged using TreeTagger (Schmid, 1994), which provides a lemma and a POS tag for each word. We augmented this process with some additional processing to handle derivational morphology. We then dropped any tags that did not imply an explicit morpheme (or an exceptional form). The complete set of tags used from the Penn Treebank tagset is shown in Table 2.6 To make the representations of the Turkish and English texts similar, all tags are marked with a '+' at the beginning to indicate that such tokens are treated as "morphemes." For instance, the English word activities was segmented as activity +NNS. 6The tagset used by the TreeTagger is a refinement of the Penn Treebank tagset in which the second letter of the verb part-of-speech tags distinguishes between "be" verbs (B), "have" verbs (H), and other verbs (V) (Schmid, 1994).</Paragraph>
      <Paragraph position="2"> The alignments we expected to obtain are depicted in Figure 2 for the example sentence given earlier in Figure 1.</Paragraph>
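The English-side preprocessing can be sketched as follows. Only the activity +NNS example comes from the text; the exact set of tags kept as "morphemes" is an assumption here, and to_morphemic is a hypothetical helper name.

```python
# Sketch of the English-side preprocessing: TreeTagger output
# (word, lemma, POS) triples become lemma tokens plus '+'-marked tag
# "morphemes". Which tags imply an explicit morpheme is only exemplified
# in the text (+NNS); the rest of this kept-tag set is assumed.
MORPHEME_TAGS = {"NNS", "VBD", "VHZ", "JJR"}  # assumed subset

def to_morphemic(tagged):
    """tagged: list of (word, lemma, pos) triples from a POS tagger."""
    out = []
    for word, lemma, pos in tagged:
        out.append(lemma)
        if pos in MORPHEME_TAGS:
            out.append("+" + pos)  # tag treated as a "morpheme" token
    return out
```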
    </Section>
  </Section>
  <Section position="5" start_page="9" end_page="11" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We proceeded with the following sequence of experiments: (1) Baseline: As a baseline system, we used a pure word-based approach: we used the Pharaoh training tool (2004) to train on the 22,500 sentences, and decoded using Pharaoh (Koehn et al., 2003) to obtain translations for a test set of 50 sentences. This gave us a baseline BLEU score of 0.0752.</Paragraph>
    <Paragraph position="1"> (2) Morpheme Concatenation: We then trained the same system with the morphemic representation of the parallel texts as discussed above. The decoder now produced the translations as a sequence of root words and morphemes. The surface words were then obtained by simply concatenating all the morphemes following a root word (until the next root word), taking into account only morphographemic rules but not any morphotactic constraints. As expected, this "morpheme salad" produces a "word salad", as most of the time wrong morphemes are associated with incompatible root words, violating many morphotactic constraints. The BLEU score here was 0.0281, substantially worse than the baseline in (1) above.</Paragraph>
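The blind concatenation of experiment (2) can be sketched as follows. Note this simplification omits the morphographemic adjustments that the paper does apply; morphemes are glued onto the preceding root verbatim.

```python
# Minimal sketch of experiment (2): the decoder output is a flat sequence
# of roots and '+'-marked morphemes, and surface words are formed by
# concatenating every morpheme onto the preceding root word. No
# morphographemic rules (vowel harmony etc.) are modeled here.
def concat_all(tokens):
    words = []
    for tok in tokens:
        if tok.startswith("+") and words:
            words[-1] += tok[1:]   # strip the '+' marker and append
        else:
            words.append(tok)      # a root word starts a new surface word
    return words
```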
    <Paragraph position="2"> (3) Selective Morpheme Concatenation: With a small script, we injected a bit of morphotactic knowledge into the surface form generation process and combined only those morphemes following a root word (in the given sequence) that gave rise to a valid Turkish word form, as checked by a morphological analyzer. Any unused morphemes were ignored. This improved the BLEU score to 0.0424, which was still below the baseline.</Paragraph>
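The selective concatenation of experiment (3) can be sketched like this, with a caller-supplied is_valid_word predicate standing in for the morphological analyzer used in the paper:

```python
# Sketch of experiment (3): morphemes after a root are attached one by
# one, but only while the result is accepted by a validity check; unused
# morphemes are dropped, as in the paper. `is_valid_word` is a stand-in
# for the real morphological analyzer.
def selective_concat(tokens, is_valid_word):
    words = []
    for tok in tokens:
        if tok.startswith("+") and words:
            candidate = words[-1] + tok[1:]
            if is_valid_word(candidate):
                words[-1] = candidate   # keep the extended form
            # else: ignore this morpheme entirely
        else:
            words.append(tok)
    return words
```

For instance, with a toy lexicon containing only "faaliyetler", an invalid trailing morpheme is silently discarded.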
    <Paragraph position="3"> (4) Morpheme Grouping: Observing that certain sequences of morphemes in Turkish texts are translations of contiguous sequences of function words and tags in English texts, and that some morphemes should be aligned differently depending on the other morphemes in their context, we attempted a morpheme grouping. For example, the morpheme sequence +DHr +mA marks the infinitive form of a causative verb, which in Turkish inflects like a noun; and the lexical morpheme sequence +yAcAk +DHr usually maps to "it/he/she will". To find such groups of morphemes and function words, we applied a sequence of morpheme groupings by extracting frequently occurring n-grams of morphemes as follows (much like the grouping used by Chiang (2005)): in a series of iterations, we obtained high-frequency bi-grams from the morphemic representation of the parallel texts, of either morphemes or of previously identified morpheme groups and neighboring morphemes, until up to four morphemes, or one root plus three morphemes, could be combined. During this process we ignored those combinations that contain punctuation or a morpheme preceding a root word. A similar grouping was done on the English side, grouping function words and morphemes before and after root words.</Paragraph>
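The iterative grouping procedure can be sketched as a greedy frequent-bigram merger, similar in spirit to byte-pair encoding. The four-token size cap and the skip of morpheme-before-root pairs follow the description above; the iteration and frequency thresholds are assumptions, and the punctuation check is omitted for brevity.

```python
# Sketch of the morpheme-grouping step of experiment (4): repeatedly
# merge the most frequent adjacent token pair into one unit (groups are
# joined with '_'). Thresholds are assumed; the paper's punctuation
# filter is not modeled here.
from collections import Counter

def group_morphemes(corpus, iterations=10, min_count=2, max_len=4):
    """corpus: list of token-list sentences, modified in place and returned."""
    for _ in range(iterations):
        pairs = Counter()
        for sent in corpus:
            for a, b in zip(sent, sent[1:]):
                # skip pairs where a morpheme would precede a root word
                if a.startswith("+") and not b.startswith("+"):
                    continue
                # cap the size of a merged group at max_len primitive tokens
                if len(a.split("_")) + len(b.split("_")) > max_len:
                    continue
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if min_count > count:
            break
        # merge every occurrence of the winning pair
        for sent in corpus:
            i = 0
            while len(sent) - i > 1:
                if sent[i] == a and sent[i + 1] == b:
                    sent[i:i + 2] = [a + "_" + b]
                i += 1
    return corpus
```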
    <Paragraph position="4"> The aim of this process was two-fold: it let frequent morpheme sequences behave as a single token and helped Pharaoh with the identification of some of the phrases. Also, since the number of tokens on both sides was reduced, it enabled GIZA++ to produce somewhat better alignments.</Paragraph>
    <Paragraph position="5"> The morpheme-level translations obtained from training with these parallel texts were then converted into surface forms by concatenating the morphemes in the sequence produced. This resulted in a BLEU score of 0.0644.</Paragraph>
    <Paragraph position="6"> (5) Morpheme Grouping with Selective Morpheme Concatenation: This was the same as (4), with the morphemes selectively combined as in (3). The BLEU score of 0.0913 with this approach was now above the baseline.</Paragraph>
    <Paragraph position="7"> Table 3 summarizes the results of these five experiments:
Exp. System BLEU
(1) Baseline 0.0752
(2) Morph. Concatenation 0.0281
(3) Selective Morph. Concat. 0.0424
(4) Morph. Grouping and Concat. 0.0644
(5) Morph. Grouping + (3) 0.0913
In an attempt to factor out and see whether the translations were at all successful in getting the root words right, we performed the following: we morphologically analyzed and disambiguated the reference texts, and reduced them to sequences of root words by eliminating all the morphemes. We did the same for the outputs of (1) (after morphological analysis and disambiguation), (2), and (4) above, that is, we threw away the morphemes ((3) is the same as (2), and (5) the same as (4), here). The translation root-word sequences and the reference root-word sequences were then evaluated using BLEU (which we label here BLEU-r, for BLEU root, to avoid any misleading comparison with previous results). These scores are shown in Table 4.</Paragraph>
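The root-word reduction behind BLEU-r amounts to discarding the '+'-marked morpheme tokens before scoring; a real setup would then run an off-the-shelf BLEU implementation on the reduced sequences. A minimal sketch of the reduction step:

```python
# Sketch of the BLEU-r preprocessing: both hypothesis and reference are
# reduced to root-word sequences by discarding all '+'-marked morpheme
# tokens; ordinary BLEU is then computed on the reduced sequences.
def roots_only(tokens):
    """Keep only root-word tokens from a flat root/morpheme sequence."""
    return [t for t in tokens if not t.startswith("+")]
```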
    <Paragraph position="8"> Table 4:
Exp. System BLEU-r
(1) Baseline 0.0955
(2) Morph. Concatenation 0.0787
(4) Morph. Grouping 0.1224
The results in Tables 3 and 4 indicate that with the standard models for SMT, we are still quite far from even identifying the correct root words in the translations into Turkish, let alone getting the morphemes and their sequences right. Although some of this may be due to the (relatively) small amount of parallel text we used, it may also be the case that splitting the sentences into morphemes plays havoc with the alignment process by significantly increasing the number of tokens per sentence, especially when such tokens align to tokens on the other side that are quite far away.</Paragraph>
    <Paragraph position="9"> Nevertheless, the models we used produce quite reasonable translations for a small number of test sentences. Table 5 shows two examples of translations that were obtained using the standard models (containing no Turkish-specific manipulation except for selective morpheme concatenation). We have marked the surface morpheme boundaries in the translated and reference Turkish texts to indicate where morphemes are joined, for exposition purposes only: these markers appear neither in the reference translations nor in the produced translations.</Paragraph>
  </Section>
  <Section position="6" start_page="11" end_page="12" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> Although our work is only an initial exploration into developing a statistical machine translation system from English to Turkish, our experiments at least point out that using standard models to determine the correct sequence of morphemes within words, using more powerful mechanisms meant to determine the (longer) sequence of words in sentences, is probably not a good idea. Morpheme ordering is a very local process, and the correct sequence should be determined locally, though the existence of morphemes could be postulated from sentence-level features during the translation process.</Paragraph>
    <Paragraph position="1"> Furthermore, insisting on generating the exact sequence of morphemes could be overkill. On the other hand, a morphological generator could take a root word and a bag of morphemes and generate possible legitimate surface words by taking into account morphotactic and morphographemic constraints, possibly (and ambiguously) filling in any morphemes missing in the translation but required by the morphotactic paradigm. Any ambiguities arising from the morphological generation could then be filtered by a language model. [Table 5: Some good SMT results]</Paragraph>
    <Paragraph position="2"> Such a bag-of-morphemes approach suggests that we do not actually try to determine exactly where the morphemes go in the translation, but rather determine the root words (including any function words) and then associate the translated morphemes with the (bag of the) right root word. The resulting sequence of root words and their bags of morphemes can be run through a morphological generator, which can handle all the word-internal phenomena such as proper morpheme ordering, filling in missing morphemes or even ignoring spurious morphemes, and handling local morphographemic phenomena such as vowel harmony. However, this approach of not placing morphemes into specific positions in the translated output but just associating them with certain root words requires that significantly different alignment and decoding models be developed.</Paragraph>
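A minimal sketch of the proposed bag-of-morphemes representation, assuming the flat root/morpheme token stream used earlier. The (root, bag) pairing below illustrates only the data structure, not the different alignment and decoding models the discussion says would need to be developed.

```python
# Sketch of the proposed bag-of-morphemes representation: instead of
# placing each morpheme at a position, the translation becomes a sequence
# of (root, bag-of-morphemes) pairs, which a morphological generator
# would later realize as surface words.
from collections import Counter

def to_bags(tokens):
    """Turn a flat root/morpheme sequence into (root, Counter) pairs."""
    bags = []
    for tok in tokens:
        if tok.startswith("+") and bags:
            bags[-1][1][tok] += 1   # attach morpheme to the last root's bag
        else:
            bags.append((tok, Counter()))
    return bags
```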
    <Paragraph position="3"> Another representation option that could be employed is to do away completely with morphemes on the Turkish side and just replace them with morphological feature symbols (much like we did here for English). This has the advantage of better handling allomorphy: all allomorphs, including those that are not just phonological variants, map to the same feature, while homograph morphemes that signal different features map to different features. In a sense, this would provide a more accurate decomposition of the words on the Turkish side, but at the same time it introduces a larger set of features, since default feature symbols are produced for any morphemes that do not appear on the surface. Removing such redundant features from this representation and then using the reduced features could be an interesting avenue to pursue. Generating surface words would not be a problem, since our morphological generator does not care whether its input is morphemes or features.</Paragraph>
  </Section>
</Paper>