<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0806">
  <Title>Augmenting a Small Parallel Text with Morpho-syntactic Language Resources for Serbian-English Statistical Machine Translation</Title>
  <Section position="4" start_page="41" end_page="44" type="metho">
    <SectionTitle>
2 Language Resources
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="41" end_page="41" type="sub_section">
      <SectionTitle>
2.1 Language Characteristics
</SectionTitle>
      <Paragraph position="0"> Serbian, as a Slavic language, has a very rich inflectional morphology for all open word classes. There are six distinct cases affecting not only common nouns but also proper nouns as well as pronouns, adjectives and some numerals. Some nouns and adjectives have two distinct plural forms, depending on whether the number is larger than four. There are also three genders for nouns, pronouns, adjectives and some numerals, leading to differences between the cases and also between the verb participles for past tense and passive voice.</Paragraph>
      <Paragraph position="1"> As for verbs, person and many tenses are expressed by the suffix, and the subject pronoun (e.g. I, we, it) is often omitted (as in Spanish and Italian).</Paragraph>
      <Paragraph position="2"> In addition, negation of three quite important verbs, &quot;biti&quot; (to be, auxiliary verb for past tense, conditional and passive voice), &quot;imati&quot; (to have) and &quot;hteti&quot; (to want, auxiliary verb for the future tense), is done by adding the negative particle to the verb as a prefix.</Paragraph>
      <Paragraph position="3"> As for syntax, Serbian has a fairly free word order and no articles, either definite or indefinite.</Paragraph>
      <Paragraph position="4"> All these characteristics indicate that morpho-syntactic knowledge might be very useful for statistical machine translation involving Serbian, especially when only scarce amounts of parallel text are available.</Paragraph>
    </Section>
    <Section position="2" start_page="41" end_page="43" type="sub_section">
      <SectionTitle>
2.2 Parallel Corpora
</SectionTitle>
      <Paragraph position="0"> Finding high-quality bilingual or multilingual parallel corpora involving Serbian is a difficult task. For example, there are several websites with news in both Serbian and English (some of them in other languages as well), but these texts are only comparable and not parallel at all. To our knowledge, the only currently available Serbian-English parallel text suitable for statistical machine translation is a manually created electronic version of the Assimil language course, which has been used for some preliminary experiments in (Popović et al., 2004; Popović and Ney, 2004). We have used this corpus for the systematic investigations described in this work.</Paragraph>
      <Paragraph position="1"> The electronic form of the Assimil language course contains about 3k sentences and 25k running words of various types of conversations and descriptions as well as a few short newspaper articles. Detailed corpus statistics can be seen in Table 1. Since the domain of the corpus is basically not restricted, the vocabulary size is relatively large. Due to the rich morphology, the vocabulary for Serbian is almost two times larger than for English. The average sentence length for Serbian is about 8.5 words per sentence, and for English about 9.5. This difference is mainly caused by the lack of articles and the omission of some subject pronouns in Serbian.</Paragraph>
      <Paragraph position="2"> The development and test sets (500 sentences) are randomly extracted from the original corpus, and the rest is used for training (referred to as 2.6k).</Paragraph>
      <Paragraph position="3"> In order to investigate the scenario with extremely scarce training material, a reduced training corpus (referred to as 200) has been created by random extraction of 200 sentences from the original training corpus.</Paragraph>
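The random extraction described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' script: the function name `split_corpus`, the fixed random seed, and the choice to draw the held-out sentences before the reduced training corpus are all hypothetical.

```python
import random


def split_corpus(sentence_pairs, held_out=500, reduced=200, seed=0):
    """Randomly split a parallel corpus into three parts.

    Returns (held_out_part, training_part, reduced_training_part).
    The reduced part is sampled from the training part, so it never
    overlaps with the held-out development and test material.
    """
    rng = random.Random(seed)          # fixed seed: an assumption, for repeatability
    shuffled = list(sentence_pairs)
    rng.shuffle(shuffled)
    held = shuffled[:held_out]         # development and test sentences
    train = shuffled[held_out:]        # full training corpus
    reduced_train = rng.sample(train, reduced)
    return held, train, reduced_train
```

Sampling the reduced corpus from the training part only keeps the evaluation sentences unseen in both training scenarios.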
      <Paragraph position="4"> The morpho-syntactic annotation of the English part of the corpus has been done by the constraint grammar parser ENGCG for morphological and syntactic analysis of English. For each word, this tool provides its base form and a sequence of morpho-syntactic tags.</Paragraph>
      <Paragraph position="5"> For the Serbian corpus, to our knowledge there is no available tool for automatic annotation of this language. Therefore, the base forms have been introduced manually and the POS tags have been provided partly manually and partly automatically using a statistical maximum-entropy based POS tagger similar to the one described in (Ratnaparkhi, 1996).</Paragraph>
      <Paragraph position="6"> First, the 200 sentences of the reduced training corpus have been annotated completely manually. Then the first 500 sentences of the rest of the training corpus have been tagged automatically and the errors have been manually corrected. Afterwards, the POS tagger has been trained on the extended corpus (700 sentences), the next 500 sentences of the rest have been annotated, and the procedure has been repeated until the annotation has been finished for the complete corpus.</Paragraph>
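The incremental annotation procedure can be sketched as a simple loop. This is a minimal sketch, not the authors' tooling: the three callables `tag_manually`, `train_tagger` and `correct` are hypothetical placeholders for manual annotation, training of the maximum-entropy tagger, and manual error correction.

```python
def bootstrap_annotation(sentences, tag_manually, train_tagger, correct,
                         seed_size=200, batch_size=500):
    """Annotate a corpus by alternating tagger training and manual correction.

    Hypothetical callables (assumptions, not a real library API):
      tag_manually(sentence)  -> annotated sentence
      train_tagger(annotated) -> function mapping a sentence to an annotated sentence
      correct(annotated)      -> manually corrected annotated sentence
    """
    # Step 1: the seed sentences are annotated completely by hand.
    annotated = [tag_manually(s) for s in sentences[:seed_size]]
    position = seed_size
    # Step 2: retrain on everything annotated so far, tag the next batch,
    # correct it by hand, and repeat until the corpus is exhausted.
    while len(sentences) > position:
        tagger = train_tagger(annotated)
        batch = sentences[position:position + batch_size]
        annotated.extend(correct(tagger(s)) for s in batch)
        position += len(batch)
    return annotated
```

With the sizes given in the text, the tagger is trained on 200 sentences, then on 700, then on 1200, and so on, so each batch is expected to need fewer manual corrections than the previous one.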
      <Paragraph position="7"> The short phrases used as an additional bilingual knowledge source in our experiments have been collected from the web and contain about 350 standard words and short expressions with an average entry length of 1.8 words for Serbian and 2 words for English. Table 2 shows that about 30% of words from the phrase vocabulary are not present in the original Serbian corpus and about 70% of those words are not contained in the reduced corpus. For English those numbers are smaller, about 20% for the original corpus and 60% for the reduced one. These percentages indicate that this parallel text, although very scarce, might be a useful additional training material.</Paragraph>
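The percentages above are out-of-vocabulary rates of the phrase vocabulary with respect to a training corpus. A minimal sketch of such a computation is given below; the function name `oov_rate` and the `by_vocabulary` switch are assumptions made for illustration, and the toy word lists in the usage note are not taken from the corpus.

```python
def oov_rate(test_tokens, training_tokens, by_vocabulary=True):
    """Percentage of test words that do not occur in the training corpus.

    by_vocabulary=True counts distinct words (vocabulary OOV rate);
    by_vocabulary=False counts running words instead.
    """
    known = set(training_tokens)
    items = set(test_tokens) if by_vocabulary else list(test_tokens)
    if not items:
        return 0.0
    unseen = sum(1 for word in items if word not in known)
    return 100.0 * unseen / len(items)
```

For example, with the phrase tokens `["dobar", "dan", "hvala", "lepo", "dobar"]` and the training tokens `["dobar", "dan"]`, two of the four distinct phrase words are unseen, so the vocabulary OOV rate is 50%.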
      <Paragraph position="8"> The phrases have also been morpho-syntactically annotated in the same way as the main corpus.</Paragraph>
      <Paragraph position="9"> In addition to the standard development and test set described in Section 2.2.1, we also tested our translation systems on a short external parallel text collected from the BBC News website containing 22 sentences about relations between the USA and Ukraine after the revolution. As can be seen in Table 1, this text contains a very large portion of out-of-vocabulary words (almost two thirds of Serbian words and almost half of English words are not seen in the training corpus), and has an average sentence length about two times larger than the training corpus. 3 Transformations in the Source Language. Standard SMT systems usually regard only full forms of the words, so that translation of full forms which have not been seen in the training corpus is not possible even if the base form has been seen.</Paragraph>
      <Paragraph position="10"> Since the inflectional morphology of the Serbian language is very rich, as described in Section 2.1, we investigate the use of the base forms instead of the full forms to overcome this problem for the translation into English. We propose two types of transformations of the Serbian corpus: conversion of the full forms into the base forms and additional treatment of the verbs.</Paragraph>
      <Paragraph position="11"> For the other translation direction, we propose removing the articles in the English part of the corpus as the Serbian language does not have any.</Paragraph>
    </Section>
    <Section position="3" start_page="43" end_page="43" type="sub_section">
      <SectionTitle>
3.1 Transformations of the Serbian Text
</SectionTitle>
      <Paragraph position="0"> Serbian full forms of the words usually contain information which is not relevant for translation into English. Therefore, we propose conversion of all Serbian words into their base forms. Although for some other inflected languages like German and Spanish this method did not yield any translation improvement, we still considered it promising because the number of Serbian inflections is considerably higher than in the other two languages. Table 1 shows that this transformation significantly reduces the Serbian vocabulary size so that it becomes comparable to the English one.</Paragraph>
      <Paragraph position="1">  Inflections of Serbian verbs might contain relevant information about the person, which is especially important when the pronoun is omitted.</Paragraph>
      <Paragraph position="2"> Therefore, we apply an additional treatment of the verbs. Whereas all other word classes are still replaced only by their base forms, for each verb the part of the POS tag referring to the person is taken and the verb is converted into a sequence of this tag and its base form. For the three verbs described in Section 2.1, the separation of the negative particle is also applied: each negative full form is transformed into the sequence of the POS tag, negative particle and base form. Detailed statistics for this corpus are not reported since there are no significant changes; only the number of running words and the average sentence length increase, thus becoming closer to the values of the English corpus.</Paragraph>
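The two Serbian transformations can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing code: the `Token` layout, the coarse tag "V", the person tag "1sg" and the `negated` flag are hypothetical stand-ins for whatever the annotated corpus actually provides.

```python
from typing import NamedTuple, Optional

# The three verbs whose negative forms carry the particle as a prefix (Section 2.1).
NEGATABLE = {"biti", "imati", "hteti"}


class Token(NamedTuple):
    full: str                # full (inflected) form as it appears in the text
    base: str                # base form from the annotation
    pos: str                 # hypothetical coarse tag, "V" marks verbs
    person: Optional[str]    # hypothetical person tag such as "1sg", None otherwise
    negated: bool = False    # True for negative full forms such as "nemam"


def to_base(tokens):
    """First transformation: every word is replaced by its base form."""
    return [t.base for t in tokens]


def to_base_with_verb_pos(tokens):
    """Second transformation: verbs become person tag (+ particle) + base form."""
    out = []
    for t in tokens:
        if t.pos == "V" and t.person:
            out.append(t.person)
            if t.negated and t.base in NEGATABLE:
                out.append("ne")   # separated negative particle
        out.append(t.base)
    return out
```

For the sentence "nemam knjigu" (I do not have a book), annotated as `Token("nemam", "imati", "V", "1sg", True)` and `Token("knjigu", "knjiga", "N", None)`, the first function yields `["imati", "knjiga"]` and the second yields `["1sg", "ne", "imati", "knjiga"]`, which shows why the number of running words grows under the verb treatment.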
    </Section>
    <Section position="4" start_page="43" end_page="44" type="sub_section">
      <SectionTitle>
3.2 Transformations of the English Text
</SectionTitle>
      <Paragraph position="0"> Since the articles are one of the most frequent word classes in English, while there are no articles at all in Serbian, we propose removing the articles from the English corpus for translation into Serbian. Each English word which has been detected as an article by means of its POS tag has been removed from the corpus. In Table 1, it can be seen that this method significantly reduces the number of running words and the average sentence length of the English corpus, which thus become comparable to the values of the Serbian corpus.</Paragraph>
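Article removal by POS tag amounts to a one-line filter. The sketch below assumes that each sentence is a list of (word, tag) pairs and that articles carry the tag "ART"; both the data layout and the tag name are hypothetical and do not reproduce the actual ENGCG output format.

```python
def remove_articles(tagged_sentence, article_tag="ART"):
    """Return the words of a tagged sentence with all articles dropped.

    tagged_sentence: list of (word, tag) pairs.
    article_tag: hypothetical tag name that marks articles.
    """
    return [word for word, tag in tagged_sentence if tag != article_tag]
```

Applied to `[("the", "ART"), ("book", "N"), ("is", "V"), ("on", "PREP"), ("a", "ART"), ("table", "N")]`, the function returns `["book", "is", "on", "table"]`. Filtering on the tag rather than on the word form avoids removing a word that merely looks like an article but was tagged differently.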
    </Section>
  </Section>
  <Section position="5" start_page="44" end_page="46" type="metho">
    <SectionTitle>
4 Translation Experiments and Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="44" end_page="44" type="sub_section">
      <SectionTitle>
4.1 Experimental Settings
</SectionTitle>
      <Paragraph position="0"> In order to systematically investigate the impact of the bilingual training corpus size and the effects of the morpho-syntactic information on the translation quality, the translation systems were trained on the full training corpus (2.6k) and on the reduced training corpus (200), both with and without short phrases. The translation is performed in both directions, i.e. from Serbian to English and vice versa. For the Serbian to English translation systems, three versions of the Serbian corpus have been used: original (baseline), base forms only (sr base) and base forms with additional treatment of the verbs (sr base+v-pos). For the translation into Serbian, the systems were trained on two versions of the English corpus: original (baseline) and without articles (en no-article).</Paragraph>
      <Paragraph position="1"> The baseline translation system is the Alignment Templates system with scaling factors (Och and Ney, 2002). Word alignments are produced using the GIZA++ toolkit without symmetrisation (Och and Ney, 2003). Preprocessing of the source data has been done before the training of the system; therefore, modifications of the training and search procedure were not necessary for the translation of the transformed source language corpora.</Paragraph>
      <Paragraph position="2"> Although the development set has been used to optimise the scaling factors, results obtained for this set do not differ from those for the test set. Therefore only the joint error rates (Development+Test) are reported.</Paragraph>
      <Paragraph position="3"> As for the external test set, results for this text are reported only for the full corpus systems, since for the reduced corpus the error rates are higher but the effects of using phrases and morpho-syntactic information are basically the same.</Paragraph>
    </Section>
    <Section position="2" start_page="44" end_page="46" type="sub_section">
      <SectionTitle>
4.2 Translation Results
</SectionTitle>
      <Paragraph position="0"> The evaluation metrics used in our experiments are WER (Word Error Rate), PER (Position-independent word Error Rate) and BLEU (BiLingual Evaluation Understudy) (Papineni et al., 2002). Since BLEU is an accuracy measure, we use 1-BLEU as an error measure.</Paragraph>
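For reference, the error measures can be sketched for a single hypothesis and a single reference. This is a minimal sketch, not the evaluation tool used in the experiments: WER is taken as the word-level edit distance divided by the reference length, PER follows one common bag-of-words formulation, and the function names as well as the single-reference setting are assumptions.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (r != h)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def wer(ref, hyp):
    """Word error rate in percent (reference must be non-empty)."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent, ignoring word order."""
    ref_counts, hyp_counts = Counter(ref), Counter(hyp)
    matches = sum(min(count, hyp_counts[word]) for word, count in ref_counts.items())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)


def bleu_error(bleu_score):
    """Turn a BLEU accuracy in [0, 1] into the error measure 1-BLEU."""
    return 1.0 - bleu_score
```

For the reference "i do not have a book" and the hypothesis "i have not a book", the edit distance is 2 and five words match regardless of position, so WER is about 33.3% and PER about 16.7%. PER can never exceed WER because it does not penalise word order.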
      <Paragraph position="1"> 4.2.1 Translation from Serbian into English. Error rates for the translation from Serbian into English are shown in Table 3 and some examples are shown in Table 6. It can be seen that there is a significant decrease in all error rates when the full forms are replaced with their base forms. Since the redundant information contained in the inflection is removed, the system can better capture the relevant information and is capable of producing correct or approximately correct translations even for unseen full forms of the words (marked by &quot;UNKNOWN&quot; in the baseline result example). The treatment of the verbs yields some additional improvements.</Paragraph>
      <Paragraph position="2"> From the first translation example in Table 6 it can be seen how the problem of some out-of-vocabulary words can be overcome with the use of the base forms. The second and third examples show the advantages of the verb treatment; the third one also illustrates the effect of separating the negative particle. Reduction of the training corpus to only 200 sentences (about 8% of the original corpus) leads to a relative degradation of the error rates of about 45%. However, the degradation is not higher than 35% if phrases and morpho-syntactic information are available in addition to the reduced corpus.</Paragraph>
      <Paragraph position="3"> The use of the phrases can improve the translation quality to some extent, especially for the systems with the reduced training corpus, but these improvements are less remarkable than those obtained by replacing words with the base forms.</Paragraph>
      <Paragraph position="4"> The best system with the complete corpus as well as the best one with the reduced corpus use the phrases and the transformed Serbian corpus where the verb treatment has been applied.</Paragraph>
      <Paragraph position="5">  Table 4 shows results for the translation from English into Serbian. As expected, all error rates are higher than for the other translation direction. Translation into the morphologically richer language always has poorer quality because it is difficult to find the correct inflection.</Paragraph>
      <Paragraph position="6"> The performance of the systems trained on the reduced corpus degrades by about 40% relative for the baseline system and by about 30% when the phrases are used and the transformation of the English corpus has been applied.</Paragraph>
      <Paragraph position="7"> The importance of the phrases seems to be larger for this translation direction. Removing the English articles does not play a significant role for the translation systems with the full corpus, but for the reduced corpus it has basically the same effect as the use of phrases. The best system with the reduced corpus has been built with the use of phrases and removal of the articles.</Paragraph>
      <Paragraph position="8"> Table 7 shows some examples of the translation into Serbian with and without English articles. Although these effects are not directly obvious, it can be seen that removing the redundant information enables better learning of the relevant information, so that the system is better capable of producing semantically correct output. The first example illustrates a syntactically incorrect output with the wrong inflection of the verb (&quot;čitam&quot; means &quot;I read&quot;). The output of the system without articles is still not completely correct, but the meaning is completely preserved. The second example illustrates an output produced by the baseline system which is neither syntactically nor semantically correct (&quot;you have I drink&quot;). The output of the new system still has an error in the verb, the informal form of &quot;you&quot; instead of the formal one, but nevertheless both the syntax and semantics are correct.</Paragraph>
      <Paragraph position="9"> 4.2.3 Translation of the External Text. Translation results for the external test set can be seen in Table 5. As expected, the high number of out-of-vocabulary words results in very high error rates. A certain improvement is achieved with the phrases, but the most significant improvements are yielded by the use of Serbian base forms and removal of English articles. Verb treatment in this case does not outperform the base forms system, probably because there are not as many different verb forms as in the other corpus, and only a small number of pronouns are missing.</Paragraph>
    </Section>
  </Section>
</Paper>