<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1110">
  <Title>Inducing a multilingual dictionary from a parallel multitext in related languages</Title>
  <Section position="7" start_page="879" end_page="881" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> The output of the system so far is a multi-lingual word translation model. We will evaluate it by producing a tri-lingual dictionary (Russian-Ukrainian-Belorussian), picking a highest probability translation for each word, from the corresponding Bibles.</Paragraph>
    <Paragraph position="1"> Unfortunately, we do not have a good hand-built tri-lingual dictionary to compare it to, but only one good bilingual one, Russian-Ukrainian5. We will therefore take the Russian-Ukrainian portion of our dictionary and compare it to the hand-built one.</Paragraph>
    <Paragraph position="2"> Our evaluation metric is the number of entries that match between these dictionaries. If a word has several translations in the hand-built dictionary, match- null ing any of them counts as correct. It is worth noting that for all the dictionaries we generate, the total number of entries is the same, since all the words that occur in the source portion of the corpus have an entry. In other words, precision and recall are proportional to each other and to our evaluation metric.</Paragraph>
    <Paragraph position="3"> Not all of the words that occur in our dictionary occur in the hand-built dictionary and vice versa. An absolute upper limit of performance, therefore, for this evaluation measure is the number of left-hand-side entries that occur in both dictionaries.</Paragraph>
    <Paragraph position="4"> In fact, we cannot hope to achieve this number.</Paragraph>
    <Paragraph position="5"> First, because the dictionary translation of the word in question might never occur in the corpus. Second, even if it does, but never co-occurs in the same sentence as its translation, we will not have any basis to propose it as a translation.6. Therefore we have a &amp;quot;achievable upper limit&amp;quot;, the number of words that have their &amp;quot;correct&amp;quot; translation co-occur at least once. We will compare our performance to this upper limit.</Paragraph>
    <Paragraph position="6"> Since there is no manual tuning involved we do not have a development set, and use the whole bible for training (the dictionary is used as a test set, as described above).</Paragraph>
    <Paragraph position="7"> We evaluate the performance of the model with just the GIZA component as the baseline, and add all the other components in turn. There are two possible models to evaluate at each step. The pairwise model is the model given in equation 1 under the parameter setting given by Algorithm 2, with Belorussian used as a third language. The joint model is the full model over these three languages as estimated by Algorithm 2. In either case we pick a highest probability Ukrainian word as a translation of a given Russian word.</Paragraph>
    <Paragraph position="8"> The results for Russian-Ukrainian bibles are presented in Table 1. The &amp;quot;oracle&amp;quot; setting is the setting obtained by tuning on the test set (the dictionary). We see that using a third language to tune works just as well, obtaining the true global maximum for the model. Moreover, the joint model (which is more flexible than the model in Equation 1) does even better. This was unexpected for us, be6Strictly speaking, we might be able to infer the word's existence in some cases, by performing morphological analysis and proposing a word we have not seen, but this seems too hard at the moment  cause the joint model relies on three pairwise models equally, and Russian-Belorussian and Ukrainian-Belorussian models are bound to be less reliable for Russian-Ukrainian evaluation. It appears, however, that our Belorussian bible is translated directly from Russian rather than original languages, and parallels Russian text more than could be expected.</Paragraph>
    <Paragraph position="9"> To insure our results are not affected by this fact we also try Polish separately and in combination with Belorussian (i.e. a model over 4 languages), as shown in Table 2.</Paragraph>
    <Paragraph position="10"> These results demonstrate that the joint model is not as good for Polish, but it still finds the optimal parameter setting. This leads us to propose the following extension: let us marginalize joint Russian-Ukrainian-Belorussian model into just Russian-Ukrainian, and add this model as yet another component to Equation 1. Now we cannot use Belorussian as a third language, but we can use Polish, which we know works just as well for tuning.</Paragraph>
    <Paragraph position="11"> The resulting performance for the model is 85.7%, our best result to date.</Paragraph>
  </Section>
class="xml-element"></Paper>