<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1024">
  <Title>Evaluation in the ARPA Machine Translation Program:</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. MT EVALUATION IN THE ARPA
MT INITIATIVE
</SectionTitle>
    <Paragraph position="0"> The mission of the ARPA MT initiative is &amp;quot;to make revolutionary advances in machine translation technology&amp;quot; (Doddington, personal communication). The focus of the investigation is the &amp;quot;core MT technology.&amp;quot; This focus tends, ultimately, away from the tools of MT and toward the (fully automatic) central engines. It is well understood that practical MT will always use tools by which humans interact with the algorithms in the translation process. However, the ARPA aim is to concentrate on fully-automatic (FA) output in order to assess the viability of radical new approaches.</Paragraph>
    <Paragraph position="1"> The May-August 1993 evaluation was the second in the continuing series, along with dry runs and pre-tests of particular evaluation methods. In 1992, evaluation methods were built on human testing models. One method employed the same criteria used in the U.S.</Paragraph>
    <Paragraph position="2"> government to determine the competence of human translators. The other method was an &amp;quot;SAT&amp;quot;-type evaluation for determining the comprehensibility of English texts translated manually into the test source languages and then back into English. The methods have been replaced by methods which maintain familiarity in terms of human testing, but which are both more sensitive and more portable to other settings and systems. The Fluency, Adequacy, and Comprehension evaluations developed for the 1993 evaluation are described below; system outputs from 1992 were subjected to 1993 methods, which determined their enhanced sensitivity (White et al., op. cit.).</Paragraph>
    <Paragraph position="3"> The 1993 evaluation included output from the three research systems, five production systems, and translations from novice translators. Professional translators produced reference translations, by which outputs were compared in tile Adequacy evaluation, and which were used as controls in the Comprehension evaluation.</Paragraph>
    <Paragraph position="4"> The research systems were:</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="135" type="metho">
    <SectionTitle>
* CANDIDE (IBM Research: French - English (FE)),
</SectionTitle>
    <Paragraph position="0"> produced both FA and human-assisted (HA) outputs.</Paragraph>
    <Paragraph position="1">  Candide uses a statistics-based, language modeling MT technique.</Paragraph>
    <Paragraph position="2"> * PANGLOSS (Carnegie Mellon, New Mexico State, University of Southern California: Spanish - English (SE)), produced three output types: fully automatic pre-processing, interactive pre-processing, and post-edited (PE). Both pre-processing operations are mapped into one version (XP) for evaluation purposes, though the difference in performance between the operational types was measured. The Pangloss system uses both knowledge-based and linguistic techniques.</Paragraph>
    <Paragraph position="3"> * LINGSTAT (Dragon Systems: Japanese - English (JE)), performed in human-assisted mode. Lingstat is a hybrid MT system, combining statistical and linguistic techniques.</Paragraph>
    <Paragraph position="4"> To provide comparison against state of the art FAMT, production systems ran in fully automatic mode. These systems are in current commercial use and developed over a wide range of subject areas. SPANAM, from the PAN</Paragraph>
  </Section>
  <Section position="5" start_page="135" end_page="135" type="metho">
    <SectionTitle>
AMERICAN HEALTH ORGANIZATION (PAHO)
</SectionTitle>
    <Paragraph position="0"> produced SE. SYSTRAN, a commercial system, produced FE. 'naree unidentified systems based in Japan each contributed JE. Their outputs were made available to the test and evaluation by Professor Makoto Nagao at Kyoto University.</Paragraph>
    <Paragraph position="1"> Manual translations (MA) were provided by novice, usually student, translators at each of the research sites. These persons also developed the haman-assisted outputs, controlled for pre-/post-test bias. Finally, expert manual translation of the same material into English was performed as a reference set as noted above.</Paragraph>
  </Section>
  <Section position="6" start_page="135" end_page="135" type="metho">
    <SectionTitle>
SYSTEM TESTS
</SectionTitle>
    <Paragraph position="0"> The first phase of the ARPA MT Evaluation was tile System Test. The research and production sites each received a set of 22 French, Japanese or Spanish source texts for translation into English. Each set comprised eight general news stories and 14 articles on financial mergers and acquisitions, retrieved from commercial databases. The lexical domain was extended in 1993 to include general news texts to determine whether the training and development of the systems was generalizable to other subject domains. French and Spanish texts ranged between 300 and 500 words; Japanese articles between 600 and 1,000 characters.</Paragraph>
  </Section>
  <Section position="7" start_page="135" end_page="136" type="metho">
    <SectionTitle>
EVALUATION COMPONENTS
</SectionTitle>
    <Paragraph position="0"> The evaluators were eleven highly verbal native speakers of American English. Evaluation books were assembled according to a matrix based on a Latin square, designed to guarantee that each passage was evaluated once and that no evaluator saw more than one translation version of a passage. Because of technical problems, two of the Kyoto system outputs were evaluated in a subsequent evaluation that reproduced as closely as possible the construct of the preceding evaluation. The 1993 series tested the systems with source-only text, measuring the results with a suite of three different evaluations.</Paragraph>
    <Paragraph position="1"> All participants evaluated first for fluency, then adequacy and finally for comprehensibility. Fluency and an adequacy components contained the same 22 texts. The comprehension component included a subset of nine to twelve of these texts. The Comprehension Evaluation was presented to evaluators last, in order to avoid biasing the performance of the fluency and adequacy over the passages that appeared in the comprehension set.</Paragraph>
    <Section position="1" start_page="135" end_page="135" type="sub_section">
      <SectionTitle>
Fluency Evaluation
</SectionTitle>
      <Paragraph position="0"> The Fluency Evaluation assessed intuitive native speaker senses about the well-formedness of the English output on a sentence by sentence basis. Evaluators assigned a score from one to five with five denoting a perfectly formed English sentence.</Paragraph>
      <Paragraph position="1"> Adequacy Evaluation The Adequacy Evaluation measured the extent to which meaning present in expert translations is present in the FAMT, HAMT, PE and MA versions. In order to avoid bias toward any natural language processing approach, passages were broken down into linguistic components corresponding to grammatical units of varying depths, generally confined to clause level constituents between 5 and 20 words in length. Average word count within a unit was 11 for SE and FE, 12 for JE. The average number of fragments for a passage varied: 33 for FE, 41 for JE, 31 for SE. The evaluators viewed parallel texts, an expert translation broken into brackets on the left and the version to be evaluated presented in paragraph form on the right. They were instructed to ascertain the meaning present in each bracketed fragment and rate the degree to which it was present in the right column on a scale of one to five. IF tile meaning was absent or almost incomprehensible, the score was one; if it was completely represented the score was five.</Paragraph>
    </Section>
    <Section position="2" start_page="135" end_page="136" type="sub_section">
      <SectionTitle>
Comprehension Evaluation
</SectionTitle>
      <Paragraph position="0"> The Comprehension Evaluation measured the amount of information that is correctly conveyed, i.e. the degree to which a reader can find integral information in the passage version. This evaluation was in the format of a standardized comprehension test. Questions were developed based on tile expert versions and then applied to all translation versions. Evaluators were instructed to base their answers only on information present in the  translation. The Comprehension Evaluation is probably the most portable evaluation, as it is a common test format for literate English speakers.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="136" end_page="136" type="metho">
    <SectionTitle>
RESULTS OF THE 1993 EVALUATION
</SectionTitle>
    <Paragraph position="0"> The evaluations resulted in a total of over 12,500 decision points. These are in turn represented on two axes: the time ratio (x-axis) and normalized quality (yaxis). Both axes represent results as scores on a 0-1 scale. The time ratio is the ratio of the time taken to produce a system translation compared to the time taken for the novice MA translation. Thus, the novice MA translations all appear at time value 1. Since time taken to translate is not recorded for the FAMT systems, all of these are set at time 0. The quality (that is, fluency, adequacy, or comprehension) axis is the raw score, divided by the scoring scale (5 for fluency/adequacy, 6 for comprehension), in turn divided by the number of decision points (sentences for fluency, fragments for adequacy, or questions for comprehension) in the total passage set for that language pair.</Paragraph>
    <Paragraph position="1"> Common characteristics can be observed in all of the evaluation measurements taken in 1993. First, it is evident that all of the HAMT systems performed better in time than the corresponding MA systems. This is a change from 1992, where one system took more time to operate than it took the same persons to translate manually. Each PE system also performed better in adequacy, and very slightly better in fluency, than the MA translations. While a reasonable and desirable result, this outcome was not necessarily expected at a relatively early stage in the development of the research systems.</Paragraph>
    <Paragraph position="2"> Another general observation is that PE versions scored better in quality than non-post-edited (i.e., raw FAMT or interactively pre-processed) versions. This too is an expected and desirable result. The benchmark FAMT for French and Spanish (SPANAM and SYSTRAN, respectively) scored better in quality than the non-post-edited research systems, except in fluency, where CANDIDE scored .040 higher than SYSTRAN's .540.</Paragraph>
    <Paragraph position="3"> It was expected that comprehension scores would rise with the amount of human intervention. This proved true for FE. At .896, CANDIDE HAMT scored highest for FE comprehension; SYSTRAN (.813) scored above CANDIDE FAMT (.729). PANGLOSS SE scores also demonstrated this trend: FA at .583, HA at .750 and PE at .833, however, the HA and PE are unexpectedly below SPANAM (.854). LINGSTAT HA .771 also scored higher than the JE FAMT: KYOTO A (.479) KYOTO B (.5625) and KYOTO C (.563).</Paragraph>
  </Section>
  <Section position="9" start_page="136" end_page="136" type="metho">
    <SectionTitle>
COMPARISON BETWEEN 1992 AND
1993 SYSTEM PERFORMANCE
</SectionTitle>
    <Paragraph position="0"> Figures 1 and 2 show comparisons and trends between 1992 and 1993 for the elements of data and evaluation that are comparable. These include the fluency and adequacy measures for all of the 1993 test output and that portion of the 1992 data that was based on source-only text. The Comprehension Evaluation was not compared, since the 1992 data involved back-translations, and the numbers of questions per passage was different, thus creating the potential for uncontrolled bias in the comparison.</Paragraph>
    <Paragraph position="1"> In 1993 all systems improved both in time and in fluency / adequacy over 1992. The PANGLOSS system shows the most apparent improvement in time, from 1.403 in 1992 to. 691 in 1993. LINGSTAT also shows a considerable improvement from .721 to .395. All ARPA research systems showed improvement in fluency and adequacy over 1992 scores. CANDIDE FAMT scores increase from .511 to .580 in fluency and .575 to .670 in adequacy. PANGLOSS PE improved from .679 to .712 for fluency and rose from .748 to .801 in adequacy.</Paragraph>
    <Paragraph position="2"> LINGSTAT improved from .790 in 1992 to .859 in fluency and went from .671 to .707 in adequacy.</Paragraph>
    <Paragraph position="3"> It should also be noted that the benchmark systems used in both 1992 and 1993 (SYSTRAN French and SPANAM) showed improved fluency/adequacy scores as well. For fluency, SYSTRAN improved from .466 to .540; for adequacy, SYSTRAN went from .686 to .743.</Paragraph>
    <Paragraph position="4"> SPANAM went from .557 to .634 for fluency and from .674 to .790 for adequacy. It was verified that these are reflections of system improvements.</Paragraph>
    <Paragraph position="5"> 1993 demonsUated a significant increase in sensitivity of the evaluation methodology. Sensitivity is gauged by computing an F ratio, i.e., the correlation between independent values. A high F ratio indicates that the range of values is wide; the wider the range of values the more sensitive the method is. For the Fluency Evaluation, the F ratio rose from 3.158 in 1992 to 12.084 in 1993. In the Adequacy Evaluation, the F ratio rose from 2.753 to 6.696.</Paragraph>
  </Section>
class="xml-element"></Paper>