<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1019">
  <Title>Unit Completion for a Computer-aided Translation System</Title>
  <Section position="5" start_page="138" end_page="139" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> We collected completion results on a test corpus of 747 sentences (13,386 English tokens and 14,506 French ones) taken from the Hansard corpus. These sentences were selected at random from among the sentences that had not been used for training.</Paragraph>
    <Paragraph position="1"> Around 18% of the source and target words are unknown to the translation model.</Paragraph>
    <Paragraph position="2"> The baseline models (lines 1 and 2) are obtained without any unit model (i.e. λ = 1 in equation 2).</Paragraph>
    <Paragraph position="3"> The first is obtained with an IBM-like model 1, while the second is an IBM-like model 2. We observe that, for the pair of languages we considered, model 2 improves the proportion of saved keystrokes by almost 3% compared to model 1. We therefore made use of alignment probabilities for the other models.</Paragraph>
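The keystroke-savings measure used throughout this comparison can be sketched as follows. This is a minimal illustration with a toy completion model; the vocabulary and its probabilities are invented for the example, not trained IBM model parameters:

```python
# Sketch: measuring saved keystrokes with a word-completion model.
# The probability table below is illustrative only.

def best_completion(prefix, vocab_probs):
    """Return the most probable vocabulary word starting with prefix."""
    candidates = [(p, w) for w, p in vocab_probs.items() if w.startswith(prefix)]
    if not candidates:
        return None
    return max(candidates)[1]

def saved_keystrokes(target_word, vocab_probs):
    """Simulate typing target_word one character at a time; count the
    characters the user skips by accepting the first correct completion."""
    for i in range(len(target_word) + 1):
        prefix = target_word[:i]
        if best_completion(prefix, vocab_probs) == target_word:
            return len(target_word) - i
    return 0

vocab = {"parole": 0.10, "parlement": 0.03, "prendre": 0.05}
# "prendre" only becomes the top candidate once the prefix "pr" is typed,
# so 5 of its 7 characters are saved.
print(saved_keystrokes("prendre", vocab))
```

Summing this quantity over all target tokens of the test corpus, relative to the total number of characters, gives a saved-keystrokes percentage of the kind compared in table 4.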
    <Paragraph position="4"> The next three blocks in table 4 show how the parameter estimation method affects performance.</Paragraph>
    <Paragraph position="5"> Training models under the C1 method gives the worst results. This is because the word-to-word probabilities trained on the sequence-based corpus (predicted by Mu in equation 2) are less accurate than the ones learned from the token-based corpus. The reason is simply that there are fewer occurrences of each token, especially if many units are identified by the grouping operator.</Paragraph>
    <Paragraph position="6"> In methods C2 and C3, the unit model of equation 2 only makes predictions pu(t|s) when s is a source unit, thus lowering the noise compared to method C1.</Paragraph>
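The kind of combination equation 2 describes can be sketched as a linear interpolation between the word-to-word model and the unit model. The exact formulation in the paper may differ, and both probability tables and the weight here are illustrative assumptions:

```python
# Sketch of interpolating a word-to-word model with a unit model,
# in the spirit of equation 2. All numbers here are made up.

def p_target(t, s, p_word, p_unit, lam):
    """Interpolated probability of target token t given source s.
    With lam = 1.0 the unit model is switched off (the baseline case)."""
    word_prob = p_word.get((t, s), 0.0)
    unit_prob = p_unit.get((t, s), 0.0)
    return lam * word_prob + (1.0 - lam) * unit_prob

p_word = {("parole", "floor"): 0.2}
p_unit = {("parole", "floor"): 0.6}   # "floor" belongs to a source unit

print(p_target("parole", "floor", p_word, p_unit, 1.0))   # 0.2 (baseline)
print(p_target("parole", "floor", p_word, p_unit, 0.5))   # 0.4
```

Restricting the unit model to fire only when s is actually a source unit, as in C2 and C3, amounts to leaving unit_prob at zero everywhere else.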
    <Paragraph position="7"> We also observe in these three blocks the influence of sequence filtering: the more we filter, the better the results. This holds for all the estimation methods we tried. In the fifth block of table 4 we observe the positive influence of NP-filtering, especially when using the third estimation method.</Paragraph>
    <Paragraph position="8"> The best combination we found is reported in line 15. It outperforms the baseline by around 1.5%.</Paragraph>
    <Paragraph position="9"> This model was obtained by retaining all sequences seen at least twice in the training corpus for which the likelihood test value was above 5 and the entropy score above 0.2 (the filter setting (2, 2, 5, 0.2)). In terms of the coverage of this unit model, it is interesting to note that among the 747 sentences of the test session, there were 228 for which the model did not propose any units at all. For 425 of the remaining sentences, the model proposed at least one helpful (good or partially good) unit. The active vocabulary for these sentences contained an average of around 2.5 good units per sentence, of which only half (495) were proposed during the session. The fact that this model outperforms the others despite its relatively poor coverage may be explained by the fact that it also removes part of the noise introduced by decoupling the identification of the salient units from the training procedure. Furthermore, as we mentioned earlier, the more we filter, the less necessary the grouping scheme presented in equation 4 becomes, thus reducing another possible source of noise.</Paragraph>
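The retention rule described above can be sketched as a simple conjunction of threshold tests. The thresholds echo the count, likelihood and entropy cut-offs reported in the text, but the candidate sequences and their statistics below are invented for illustration:

```python
# Sketch of the sequence-filtering step: a candidate unit is retained
# only if its corpus frequency, likelihood test value and entropy score
# all clear their thresholds. The statistics here are made up.

def keep_unit(stats, min_count=2, min_llr=5.0, min_entropy=0.2):
    """stats: dict with 'count', 'llr' and 'entropy' for one sequence."""
    return (stats["count"] >= min_count
            and stats["llr"] > min_llr
            and stats["entropy"] > min_entropy)

candidates = {
    "prendre la parole":  {"count": 14, "llr": 37.2, "entropy": 0.8},
    "de la":              {"count": 90, "llr": 3.1,  "entropy": 0.9},
    "emprunts de l'Etat": {"count": 1,  "llr": 12.0, "entropy": 0.4},
}
kept = [seq for seq, st in candidates.items() if keep_unit(st)]
print(kept)   # only "prendre la parole" passes all three filters
```

Tightening any of the three thresholds shrinks the retained unit vocabulary, which is the coverage/noise trade-off discussed above.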
    <Section position="1" start_page="139" end_page="139" type="sub_section">
      <SectionTitle>
Fichier Options
</SectionTitle>
      <Paragraph position="0"> [Figure: screenshot of the prototype translation interface (menu bar: Fichier, Options). The source text is displayed in the top half; the target text is typed in the bottom half, with completion suggestions given by a menu at the insertion point.]</Paragraph>
    </Section>
  </Section>
</Paper>