<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1063"> <Title>A Word-to-Word Model of Translational Equivalence</Title> <Section position="6" start_page="493" end_page="495" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> A word-to-word model of translational equivalence can be evaluated either over types or over tokens.</Paragraph> <Paragraph position="1"> It is impossible to replicate the experiments used to evaluate other translation models in the literature, because neither the models nor the programs that induce them are generally available. For each kind of evaluation, we have found one case where we can come close.</Paragraph> <Paragraph position="2"> We induced a two-class word-to-word model of translational equivalence from 13 million words of the Canadian Hansards, aligned using the method in (Gale & Church, 1991). One class represented content-word links and the other represented function-word links.4 Link types with negative log-likelihood were discarded after each iteration.</Paragraph> <Paragraph position="3"> Both classes' parameters converged after six iterations. The value of class-based models was demonstrated by the differences between the hidden parameters for the two classes. (λ+, λ−) converged at (.78, .00016) for content-class links and at (.43, .000094) for function-class links.</Paragraph> <Section position="1" start_page="493" end_page="494" type="sub_section"> <SectionTitle> 5.1 Link Types </SectionTitle> <Paragraph position="0"> The most direct way to evaluate the link types in a word-level model of translational equivalence is to treat each link type as a candidate translation lexicon entry, and to measure precision and recall. This evaluation criterion carries much practical import, because many of the applications mentioned in Section 1 depend on accurate broad-coverage translation lexicons. 
Machine readable bilingual dictionaries, even when they are available, have only limited coverage and rarely include domain-specific terms (Resnik & Melamed, 1997).</Paragraph> <Paragraph position="1"> We define the recall of a word-to-word translation model as the fraction of the bitext vocabulary represented in the model. Translation model precision is a more thorny issue, because people disagree about the degree to which context should play a role in judgements of translational equivalence. We hand-evaluated the precision of the link types in our model in the context of the bitext from which the model</Paragraph> <Paragraph position="2"> was induced, using a simple bilingual concordancer. (Footnote 4: Since function words can be identified by table lookup, no POS-tagger was involved.)</Paragraph> <Paragraph position="3"> A link type (u, v) was considered correct if u and v ever co-occurred as direct translations of each other.</Paragraph> <Paragraph position="4"> Where the one-to-one assumption failed, but a link type captured part of a correct translation, it was judged &quot;incomplete.&quot; Whether incomplete links are correct or incorrect depends on the application.</Paragraph> <Paragraph position="5"> (Figure 5: Precision of the link types, with 95% confidence intervals, at varying levels of recall.)</Paragraph> <Paragraph position="6"> We evaluated five random samples of 100 link types each at three levels of recall. For our bitext, recall of 36%, 46% and 90% corresponded to translation lexicons containing 32274, 43075 and 88633 words, respectively. Figure 5 shows the precision of the model with 95% confidence intervals. The upper curve represents precision when incomplete links are considered correct, and the lower when they are considered incorrect. On the former metric, our model can generate translation lexicons with precision and recall both exceeding 90%, as well as dictionary-sized translation lexicons that are over 99% correct. 
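As an aside on how interval estimates like those in Figure 5 can be computed, the following is a minimal Python sketch using a normal approximation over the five per-sample precisions; the judgment counts in the example are hypothetical, not the paper's actual figures.

```python
import math

def precision_ci(correct_counts, sample_size=100, z=1.96):
    """Estimate link-type precision and a 95% confidence interval
    from several random samples of hand-judged link types.

    correct_counts: number of links judged correct in each sample.
    """
    props = [c / sample_size for c in correct_counts]
    n = len(props)
    mean = sum(props) / n
    # Sample variance of the per-sample precisions (n - 1 denominator).
    var = sum((p - mean) ** 2 for p in props) / (n - 1)
    # Normal-approximation half-width for the mean of n samples.
    half_width = z * math.sqrt(var / n)
    return mean, (mean - half_width, mean + half_width)

# Hypothetical judgments for five samples of 100 link types each.
mean, (lo, hi) = precision_ci([92, 95, 93, 90, 94])
print(f"precision = {mean:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

With only five samples, a t-based interval would be slightly wider; the z value above is kept for simplicity.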
Though some have tried, it is not clear how to extract such accurate lexicons from other published translation models. Part of the difficulty stems from the implicit assumption in other models that each word has only one sense. Each word is assigned the same unit of probability mass, which the model distributes over all candidate translations. The correct translations of a word that has several correct translations will be assigned a lower probability than the correct translation of a word that has only one correct translation. This imbalance foils thresholding strategies, clever as they might be (Gale & Church, 1991; Wu & Xia, 1994; Chen, 1996). The likelihoods in the word-to-word model remain unnormalized, so they do not compete.</Paragraph> <Paragraph position="7"> The word-to-word model maintains high precision even given much less training data. Resnik & Melamed (1997) report that the model produced translation lexicons with 94% precision and 30% recall, when trained on French/English software manuals totaling about 400,000 words. The model was also used to induce a translation lexicon from a 6200-word corpus of French/English weather reports. Nasr (1997) reported that the translation lexicon that our model induced from this tiny bitext accounted for 30% of the word types with precision between 84% and 90%. Recall drops when there is less training data, because the model refuses to make predictions that it cannot make with confidence. For many applications, this is the desired behavior.</Paragraph> </Section> <Section position="2" start_page="494" end_page="495" type="sub_section"> <SectionTitle> 5.2 Link Tokens </SectionTitle> <Paragraph position="0"> (Table 1: types of errors made by the word-to-word model and by the IBM model.) The most detailed evaluation of link tokens to date was performed by Macklovitch & Hannan (1996), who trained Brown et al.'s Model 2 on 74 million words of the Canadian Hansards. 
These authors kindly provided us with the links generated by that model in 51 aligned sentences from a held-out test set. We generated links in the same 51 sentences using our two-class word-to-word model, and manually evaluated the content-word links from both models. The IBM models are directional; i.e., they posit the English words that gave rise to each French word, but ignore the distribution of the English words. Therefore, we ignored English words that were linked to nothing.</Paragraph> <Paragraph position="1"> The errors are classified in Table 1. The &quot;wrong link&quot; and &quot;missing link&quot; error categories should be self-explanatory. &quot;Partial links&quot; are those where one French word resulted from multiple English words, but the model only links the French word to one of its English sources. &quot;Class conflict&quot; errors resulted from our model's refusal to link content words with function words. Usually, this is the desired behavior, but words like English auxiliary verbs are sometimes used as content words, giving rise to content words in French. Such errors could be overcome by a model that classifies each word token, for example using a part-of-speech tagger, instead of assigning the same class to all tokens of a given type. The bitext pre-processor for our word-to-word model split hyphenated words, but Macklovitch & Hannan's pre-processor did not. In some cases, hyphenated words were easier to link correctly; in other cases they were more difficult. Both models made some errors because of this tokenization problem, albeit in different places.</Paragraph> <Paragraph position="2"> The &quot;paraphrase&quot; category covers all link errors that resulted from paraphrases in the translation. 
Neither IBM's Model 2 nor our model is capable of linking multi-word sequences to multi-word sequences, and this was the biggest source of error for both models.</Paragraph> <Paragraph position="3"> The test sample contained only about 400 content words,5 and the links for both models were evaluated post-hoc by only one evaluator. Nevertheless, it appears that our word-to-word model with only two link classes does not perform any worse than IBM's Model 2, even though the word-to-word model was trained on less than one fifth the amount of data that was used to train the IBM model. Since it doesn't store indirect associations, our word-to-word model contained an average of 4.5 French words for every English word. Such a compact model requires relatively little computational effort to induce and to apply.</Paragraph> <Paragraph position="4"> (Figure 6: Errors made by the word-to-word model and the IBM Model 2. Solid lines are links made by both models; dashed lines are links made by the IBM model only. Only content-class links are shown. Neither model makes the correct links (déchaînés, screaming) and (démontée, dangerous).)</Paragraph> <Paragraph position="5"> In addition to the quantitative differences between the word-to-word model and the IBM model, there is an important qualitative difference, illustrated in Figure 6. As shown in Table 1, the most common kind of error for the word-to-word model was a missing link, whereas the most common error for IBM's Model 2 was a wrong link. Missing links are more informative: they indicate where the model has failed. The level at which the model trusts its own judgement can be varied directly by changing the likelihood cutoff in Step 1 of the competitive linking algorithm. Each application of the word-to-word model can choose its own balance between link token precision and recall. 
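To make the role of the cutoff concrete, here is a minimal sketch of greedy competitive linking under the one-to-one assumption; the words and likelihood scores in the example are hypothetical, and this illustrates only the linking step, not the paper's full implementation.

```python
def competitive_linking(candidates, cutoff=0.0):
    """Greedy competitive linking: repeatedly accept the highest-scoring
    candidate link whose two words are both still unlinked.

    candidates: iterable of (score, source_word, target_word) triples.
    Candidates scoring below the cutoff are discarded up front, so
    raising the cutoff trades link-token recall for precision.
    """
    linked_src, linked_tgt, links = set(), set(), []
    # Step 1: apply the likelihood cutoff, then rank best-first.
    ranked = sorted((c for c in candidates if c[0] >= cutoff), reverse=True)
    for score, u, v in ranked:
        # One-to-one assumption: each word participates in one link.
        if u not in linked_src and v not in linked_tgt:
            links.append((u, v))
            linked_src.add(u)
            linked_tgt.add(v)
    return links

# Hypothetical likelihood scores for one aligned sentence pair.
cands = [(5.1, "house", "maison"), (4.2, "the", "la"),
         (1.3, "house", "la"), (0.4, "blue", "maison")]
print(competitive_linking(cands, cutoff=1.0))
```

A conservative application would raise the cutoff and accept fewer, more certain links; a coverage-hungry one would lower it.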
An application that calls on the word-to-word model to link words in a bitext could treat unlinked words differently from linked words, and avoid basing subsequent decisions on uncertain inputs. It is not clear how the precision/recall trade-off can be controlled in the IBM models.</Paragraph> <Paragraph position="6"> One advantage that Brown et al.'s Model 1 has over our word-to-word model is that their objective function has no local maxima. By using the EM algorithm (Dempster et al., 1977), they can guarantee convergence towards the globally optimum parameter set. In contrast, the dynamic nature of the competitive linking algorithm changes Pr(data|model) in a non-monotonic fashion. We have adopted the simple heuristic that the model &quot;has converged&quot; when this probability stops increasing.</Paragraph> </Section> </Section> </Paper>