<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-2067">
  <Title>A New Quantitative Quality Measure for Machine Translation Systems</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Behavior Design Corporation
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Science-based Industrial Park
</SectionTitle>
    <Paragraph position="0"> Hsinchu, Taiwan 300, R.O.C.</Paragraph>
    <Paragraph position="1"> if a quantified cost function is provided, it can be used directly for parameter tuning. Thus, a systematic and standardized approach for performance evaluation and the establishment of a common testing base are urgently required.</Paragraph>
    <Paragraph position="2"> Most conventional approaches evaluate system performance by human inspection and subjective measures. During post-editing, the post-editors can provide feedback on the quality of machine translation, which is then used for dictionary updates and linguistic analyses of errors [Ross 82]. Feedback can also be obtained from professional translators who carefully annotate the printouts of raw translation output [Pigo 82]. With such human feedback, system designers tend to overtune the system.</Paragraph>
    <Paragraph position="3"> Another approach, proposed in [King 90], is to collect the test suites and divide them into two sections: one to look at source language coverage, the other to examine translational problems. Such an approach can avoid the over-tuning problem caused by human feedback. The advantage of human inspection is that humans can pinpoint the real linguistic problems and make corrections. However, there are several disadvantages: (1) Human inspection of the translation output quality is too costly. To get significant statistics on the real system performance, a large volume of text must be provided. The cost of human inspection is thus extremely high. (2) It takes too long for the results to come out. For this reason, and because of the cost, it cannot be repeated frequently.</Paragraph>
    <Paragraph position="4"> Therefore, it cannot provide a quick suggestion to a system designer when the system is changed or when the domain is altered. For a system that must handle a wide variety of text types, it fails to provide immediate help in adapting to the particular domain or field. (3) It is not easy to achieve consistency and objectivity. Even the same person is very likely to judge a translation result differently at different times, especially when the evaluation criteria are loosely defined.</Paragraph>
    <Paragraph position="5"> Based on the above problems with human inspection, some automatic approaches were proposed to evaluate translation output quality. In [Yu 91], for example, a corpus of 3,200 sentences was collected. Then, some test points were selected by linguists based on the sentences in the corpus. The test points are what linguists consider the most important features of the sentences in the corpus. Each test point is assigned a weight according to its importance in translation. The test points are coded in programs, so the testing can be done automatically. The advantage of this approach is that, since the criteria are purely linguistic, it allows a very fine-grained evaluation and finds the real linguistic problems involved. [Actes de COLING-92, Nantes, 23-28 août 1992, p. 433; Proc. of COLING-92, Nantes, Aug. 23-28, 1992]</Paragraph>
    <Paragraph position="6"> However, to acquire significant statistics on the performance, a large corpus is required. Corpus collection and test point selection are very time-consuming. Furthermore, to achieve a high quality grade with respect to these test points, the system might be over-tuned to the particular set of test points, so that they fail to reveal its real performance on a broader domain. The system designer might thus be misled by such a closed-test or training-set performance and obtain an over-optimistic performance figure. (See [Devi 82, Chapter 10] for detailed comments on performance evaluation.) We propose a new quantitative quality measure to evaluate the performance of machine translation systems. The method is to compare the raw translation output of an MT system with the final revised version for the customers, and then compute the editing effort required to convert the raw translation into the final version. Compared with the above proposals, the evaluation process can be done quickly and automatically.</Paragraph>
    <Paragraph position="7"> Moreover, such a measure is easy to apply on-line to improve the performance of a parameterized and feedback-controlled system. Finally, since the revised version is used directly as a reference, the performance measure can reflect the real quality gap between the system performance and customer expectation.</Paragraph>
    <Paragraph position="8">  2. Performance Evaluation Using Bi-Text Corpus 2.1. Criteria for a Good Measure  From the above discussion, it is desirable to have a performance measure and a performance evaluation process with the following properties: [1] low cost: minimal human intervention is involved, and the evaluation can be done automatically.</Paragraph>
    <Paragraph position="9"> [2] high speed: it can give system designers quick response and immediate help (even on-line, for a parameterized system); it can also provide positive psychological stimulation to the system designer. [3] exactness: the difference between customer expectation and real system performance can be reflected.</Paragraph>
    <Paragraph position="10"> Because the design goal of a system is to optimize some gain or minimize some cost, a good performance measure is definitely an important factor in the improvement of the system.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2. A Distance Measure Approach
</SectionTitle>
      <Paragraph position="0"> To achieve the goals outlined in the previous section, a quantitative measure is proposed. In our approach, we first establish a bi-text corpus composed of source language sentences and the corresponding target language sentences. The target sentences are the revised version of the raw translation, post-edited to meet the customers' needs (for publication). Therefore, the target sentences are what customers really want. Then, we employ a distance measure to evaluate the minimum distance between the raw translation output and the target sentences in the bi-text corpus. By distance, we mean the editing effort needed to edit the raw translation output into the revised version. In other words, we would like to know the number of key strokes required for such editing. The smaller the distance, the higher the translation output quality.</Paragraph>
      <Paragraph position="1"> Each sentence pair in the bi-text corpus consists of the source sentence and the target sentence post-edited for the customers. The reason for adopting the revised text as the measure reference is that even if the machine-translated texts are judged error-free by system designers, they may not be the final version customers really want. In general, the system designers, who are aware of the limitations and restrictions of an MT system, tend to apply loose quality criteria, and thus give an over-optimistic evaluation. Human inspection can only achieve correctness and readability, but the acceptability to customers is usually low. We try to offer customers the solution they really need. Thus, every attempt to fine-tune the output quality should be directed to fit customers' needs [Chen 91, Wu 91].</Paragraph>
      <Paragraph position="2"> This approach has several advantages over other methods: (1) Since the final revised version is used for comparison, it reflects the real quality gap between the capability of the system and the expectation of the customers. According to our experience in providing translation services with the ArchTran MTS, for most translation materials, even manuals or announcements, the final versions are intended for publication, not just for information retrieval. Therefore, traditional quality measures, which are graded loosely as 'correct', 'understandable', and so on, provide little information on how the system should be tuned. Thus, it is reasonable to adopt the final revised version as the measure reference. (2) Human power is more expensive than computer power. Since this approach involves no human intervention, the evaluation cost is fairly low. (3) The current system performance can be reported very soon because of high computer speed. With the quick feedback, more performance-improving strategies can be tried out, and thus research efficiency is improved. (4) We can show improvement to raise research morale and excite enthusiasm, for a clear indicator of performance improvement is the strongest incentive for R&amp;D engineers. System problems can be located and solved quickly, so a lot of work is saved. (5) Because the final revised version is used as the measure reference, the text can be classified into different domains and styles. With the quick feedback, it can help adapt the system to different domains and styles.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3. Distance Measure and Weight
Assignment
</SectionTitle>
      <Paragraph position="0"> Four primitive editing operations are defined, namely insertion, deletion, replacement and swap [Wagn 74, Lowr 75]. Since each operation requires a different editing effort, different weights must be assigned. We assign the weights according to the number of key strokes needed for each operation under the most popular editor.</Paragraph>
      <Paragraph position="1"> For Chinese editing, the weights we assign are 5 for insertion, 1 for deletion, 5 for replacement and 6 for swap. The deletion operation is the least costly operation because of its simplicity.</Paragraph>
      <Paragraph position="2"> The insertion and replacement operations take more effort, including cursor addressing and entering and leaving editing mode. The swap operation needs a little more effort than insertion and deletion. (The swap cost is defined here to be the cost of one insertion plus one deletion. For a post-editing facility with a special swap editing function, the swap cost should be a function of the distance between the characters to be swapped. This cost might be less than the cost of one insertion plus one deletion for adjacent words. For the present, the cost is used for ...) The evaluation of the distance between the raw output sentence and the final revised sentence can be formulated as a "shortest path" searching problem. The problem can be solved with the well-known dynamic programming technique [Bell 57]. Figure 1 shows a diagram for the dynamic programming problem. Assume that $R = \{c_{11}, c_{12}, \ldots, c_{1m}\}$ is the raw output sentence composed of $m$ characters $c_{11}$ through $c_{1m}$, and $Q = \{c_{21}, c_{22}, \ldots, c_{2n}\}$ is the final version sentence composed of $n$ characters $c_{21}$ through $c_{2n}$.</Paragraph>
      <Paragraph position="3"> In Figure 1, the big square has $m \times n$ grids, with a weight (cost or distance) associated with each edge and diagonal. Many paths can be picked from the Start to the End. Any path along the edges or diagonals from the Start to the End represents a sequence of editing operations that changes the raw output sentence into the final revised sentence. The cost incurred for each path is the accumulated weight along the path travelled. The cost/distance of a path stands for the editing effort needed to make the changes. Therefore, the minimum distance path stands for the least cost. The goal is to pick the path with the minimum cost, or shortest distance.</Paragraph>
      <Paragraph position="4"> There are three directions to go at each position: right, up or up-right. We can make an analogy between finding the shortest path and performing the fewest editing operations to convert the raw output sentence into the final version sentence. When we are at the Start point, we have the raw output sentence. If we go right, a deletion operation is performed. If we go upward, an insertion is performed. If we go along the diagonal, either one of two cases will happen. When $c_{1i}$ and $c_{2j}$ on the two edges of the diagonal are the same, no operation is performed, and no cost is incurred. If, however, they are different, a replacement is performed.</Paragraph>
      <Paragraph position="5"> When we eventually reach the End point, we have edited the raw output sentence into the final version sentence. A second path traversal is required to compute the number of swap operations. If one character is to be deleted and that character has to be inserted again in the following operations, then one deletion and one insertion are replaced by one swap operation. By the same token, if an inserted character will be deleted later, the insertion and deletion are saved by performing one swap. If the shortest path is picked, then we have edited the sentence with the least effort.</Paragraph>
      <Paragraph position="6"> The distance between the raw output sentence and the revised sentence can be formulated as follows: $D = w_i \times n_i + w_d \times n_d + w_r \times n_r + w_s \times n_s$ where $n_i$, $n_d$, $n_r$ and $n_s$ are the numbers of insertion, deletion, replacement and swap operations performed, respectively; $w_i$, $w_d$, $w_r$ and $w_s$ are the weights for these operations. $D$ is the total distance for one specific editing sequence; that is to say, $D$ is the number of key strokes required to post-edit the raw translation sentence into the final version sentence.</Paragraph>
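The two traversals and the distance formula above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `weighted_distance`, the tie-breaking order on equal-cost paths (deletion first) and the simple rule used to pair insertions with deletions into swaps are all assumptions. The weights are the Chinese-editing values quoted in the text; because the swap weight equals one insertion plus one deletion, merging a pair into a swap leaves $D$ unchanged.

```python
from collections import Counter

# Key-stroke weights quoted in the text for Chinese editing.
WEIGHTS = {"ins": 5, "del": 1, "rep": 5, "swap": 6}


def weighted_distance(raw, rev, w=WEIGHTS):
    """Return (D, counts) for editing raw into rev.

    raw and rev are sequences of tokens (characters or words).
    Assumes w["swap"] == w["ins"] + w["del"], as in the text.
    """
    m, n = len(raw), len(rev)
    # cost[i][j]: least cost of turning raw[:i] into rev[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i * w["del"]
    for j in range(1, n + 1):
        cost[0][j] = j * w["ins"]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = 0 if raw[i - 1] == rev[j - 1] else w["rep"]
            cost[i][j] = min(
                cost[i - 1][j] + w["del"],   # go right: delete raw[i-1]
                cost[i][j - 1] + w["ins"],   # go up: insert rev[j-1]
                cost[i - 1][j - 1] + diag,   # diagonal: keep or replace
            )

    # First traversal: walk back from End to Start along one least-cost path.
    inserted, deleted, n_rep = Counter(), Counter(), 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and cost[i][j] == cost[i - 1][j] + w["del"]:
            deleted[raw[i - 1]] += 1
            i -= 1
        elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            0 if raw[i - 1] == rev[j - 1] else w["rep"]
        ):
            n_rep += raw[i - 1] != rev[j - 1]
            i, j = i - 1, j - 1
        else:
            inserted[rev[j - 1]] += 1
            j -= 1

    # Second traversal: a token both deleted and inserted counts as one swap.
    n_swap = sum(min(count, deleted[tok]) for tok, count in inserted.items())
    counts = {
        "ins": sum(inserted.values()) - n_swap,
        "del": sum(deleted.values()) - n_swap,
        "rep": n_rep,
        "swap": n_swap,
    }
    total = sum(w[op] * counts[op] for op in counts)
    return total, counts
```

On the word-level example of Section 2.5, calling `weighted_distance("This is my own computer".split(), "This computer is mine".split())` returns a distance of 12 with one replacement, one deletion and one swap, which agrees with the figures quoted there. Other equal-cost paths exist for that sentence pair, so the operation counts (though not the distance) depend on the tie-breaking order chosen in this sketch.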
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.5. An Example
</SectionTitle>
      <Paragraph position="0"> The solid path in Figure 2 gives an example to show the steps performed using dynamic programming to find the least cost for editing one sentence. The raw output sentence is "This is my own computer" on the X axis, and the revised version sentence is "This computer is mine" on the Y axis. One insertion and one deletion along the path are marked with an "X" because the word "computer" appears in different locations in the two sentences. Therefore, a swap operation is performed to save one insertion and one deletion. In total, there are one replacement ("my" to "mine"), one deletion ("own") and one swap ("computer"). Table 1 shows the path with the least cost. The first row in Table 1 is the raw output sentence, the second row the revised sentence, and the third the editing operations performed. The least cost is 12 (i.e., 1 x 5 + 1 x 1 + 1 x 6 = 12), and the average cost is 2.4 per word. Note the difference between such a measure and a conventional subjective approach. The two sentences might be judged as equally readable and understandable by human inspection. However, to tailor the output to the final version for publication, we still need 2.4 units of cost per word. If we follow the dotted path in Figure 2, we get a path that is not of the least cost. Table 2 shows this path. In this case, there are one deletion and three replacements. The cost incurred is 16 (i.e., 1 x 1 + 3 x 5 = 16).
 [Figure 2: the steps in dynamic programming. Tables 1 and 2: raw sentence "This is my own computer", revised sentence "This computer is mine", editing operations.]
 3. Application to Performance Evaluation and Improvement
 As discussed above, the most direct application of the performance measure is, of course, to show the current status of the system performance. This function directly serves several purposes.</Paragraph>
      <Paragraph position="1"> [1] With this performance measure applied to a large bi-text corpus, one can show potential customers the current system performance in terms of the editing effort required to get high quality translation. Furthermore, because the performance measure is an objective measure, it can be used to compare the system performance with other systems based on the same testing base.</Paragraph>
      <Paragraph position="2"> [2] The quick response makes it possible for the system designers and the linguists to get a clear idea about the advantages or faults of a particular strategy or formalism. From the quick feedback of the measurement, one can try different approaches in a rather short time. Hence, the research pace will be accelerated rapidly, and the system designers can make sure the system is on the right track.</Paragraph>
      <Paragraph position="3"> [3] Psychologically, a clear indicator of performance improvement is the strongest incentive for R&amp;D teams. According to our working experience, the research team members tend to become upset when their ideas cannot be fully implemented and justified in a reasonable time. With a clear performance indicator and quick response, the team members usually get excited and their morale is raised substantially.</Paragraph>
      <Paragraph position="4"> The following case study shows how the quick and automatic performance evaluation method helps make decisions on design issues and highlights research directions. In a recent evaluation run, a bi-text corpus containing 6,110 English-Chinese sentence pairs was used to evaluate a particular version of the ArchTran English-Chinese MT system. The Chinese sentences are the revised translations of the corresponding English sentences, which are to be published as a Chinese technical manual. The revised Chinese sentences are used as the reference for comparison with the unrevised version. The editing effort required to post-edit the unrevised version is then evaluated using the proposed distance measure. It takes only about 30 seconds to get the required measure. The experimental results are shown in Table 3.</Paragraph>
      <Paragraph position="5"> At first, we thought the editing cost might be too high to get the required high quality, and we suspected that the probabilistic disambiguation mechanism for the analysis module [Chan 92a, Liu 90, Su 88, 89, 91b] might not be properly tuned. So we used an adaptive learning algorithm [Amar 67, Kata 90, Su 91a] to adjust the probabilistic disambiguation modules of the system. Table 4 shows the comparison of the original status of the MTS and its best-tuned case. In the best case, the translation with the least cost is selected.</Paragraph>
      <Paragraph position="6"> [Table 3]
 number of insertions per sentence: 3.85
 number of deletions per sentence: 3.48
 number of replacements per sentence: 3.72
 number of swaps per sentence: 1.63
 [Table 4: comparison of the original status and of the best case]
 Table 4 does show some improvement after tuning the disambiguation module. However, the improvement is not apparent. This implies that the disambiguation part is not the major bottleneck for the quality gap. In fact, most translations are readable and understandable by human judgement. So we examined the other parts of the system. We found that the biggest problem is that the translation style does not fit customers' needs. We thus conclude that more effort should be concentrated on the transfer and generation phases, and that a transfer and generation model capable of adapting the system to different domains and styles [Chan 92b] is required. This case study shows that a quick performance evaluation does play an important role in guiding the research direction.</Paragraph>
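As a rough reading of Table 3, the distance formula $D = w_i n_i + w_d n_d + w_r n_r + w_s n_s$ can be applied to the per-sentence averages. The sketch below is only an illustration: it assumes that the Chinese-editing weights of Section 2.3 were the ones used in this evaluation run and that the four counts in Table 3 are disjoint (swaps not also counted as insertions or deletions). The text states neither point explicitly, and the names `WEIGHTS`, `TABLE_3` and `average_cost` are hypothetical.

```python
# Weights from Section 2.3 (assumed, not confirmed, to apply to this run).
WEIGHTS = {"insertion": 5, "deletion": 1, "replacement": 5, "swap": 6}

# Average operation counts per sentence, as listed in Table 3.
TABLE_3 = {"insertion": 3.85, "deletion": 3.48, "replacement": 3.72, "swap": 1.63}


def average_cost(counts, weights):
    """Apply D = sum of (weight x count) to per-sentence averages."""
    return sum(weights[op] * counts[op] for op in counts)


cost_per_sentence = average_cost(TABLE_3, WEIGHTS)
print(round(cost_per_sentence, 2))
```

Under those assumptions the result is about 51 key strokes per sentence, dominated by the insertion and replacement terms, which carry the largest weights.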
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. Parameterized Feedback Control System
Based on the Performance Measure
</SectionTitle>
    <Paragraph position="0"> Through the quick and automatic quantitative distance measure, the system performance can be reported on-line in terms of an objective cost function. Therefore, it can be applied to guided searching in a parameterized, feedback-controlled system. The following sections show how the quick performance measure helps to construct such a feedback system. Without a quick performance evaluator, these models would not be possible.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1. Ambiguity Resolution and Lexicon
Selection in a Feedback System
</SectionTitle>
      <Paragraph position="0"> [Figure 3: a parameterized feedback-controlled MT system; legible block labels: Reference, Lexical, Syntactic, Semantic, Transfer.]
 A parameterized feedback-controlled MT system can be modeled as in Figure 3. The control of the system is governed by its static knowledge and a dynamically adjustable parameter set, which are used to select the best interpretation among the various ambiguities, or to select the most preferred style in the transfer and generation phases. The probabilistic translation model proposed in [Chen 91] is one such example. In this model, the best analysis is selected to maximize an analysis score or score function [Chan 92a, Liu 90, Su 88, 89, 91b] of the following form: $Score = P(Sem_i, Syn_j, Lex_k \mid Words)$ where $Sem_i$, $Syn_j$, $Lex_k$ is a particular set of semantic annotation, syntactic structure and lexical category corresponding to some ambiguous construct of the sentence $Words$. Furthermore, the best transfer and generation are selected to maximize the following transfer score $S_{txf}$ and generation score $S_{gen}$: $S_{txf} = P(T_t \mid T_s) \approx P(N_t \mid N_s)$ where $T_t$, $T_s$ are the target and source intermediate representations in the form of an annotated syntax tree (AST); $N_t$, $N_s$ are the normalized versions of the AST's, called the normal forms (NF) of the AST's, which are used particularly for transfer and generation, and $t$ is the generated target sentence [Chan 92b].</Paragraph>
      <Paragraph position="1"> In such a system, we are to formulate a model that is as close to the unknown real language model as possible. Thus, the main task is to estimate the parameters which characterize the model. Because it is not always possible to acquire sufficient data, particularly for complex language models, the estimated parameters might not be able to characterize the real model. Under such circumstances, we can adaptively adjust the estimated parameters according to the error feedback.</Paragraph>
      <Paragraph position="2"> In Figure 3, the analysis phase (lexical, syntactic and semantic analyses) of the system is characterized by a set of selection parameters. The input is fed into the lexical analysis phase. The output of the analysis phase acts as the input of the transfer phase. In the feedback-controlled scheme, a set of revised text, such as the 6,110 Chinese sentences in the previous section, can be used as the reference and compared with the translation output of the 6,110 sentences. Under the scoring mechanism, the preferred analysis selected may not correspond to the translation with the shortest distance from the reference. In this case, we can iteratively adjust the selection parameters according to the error (the difference between the reference and the system output). Through this adaptive learning procedure [Amar 67, Chia 92, Kata 90], the estimated parameters will approach the real parameters very closely. In this way, it will help in ambiguity resolution and lexicon selection. Such automatic fine-tuning is made possible because the performance measure proposed in this paper provides an on-line response to the iteratively changed parameters.</Paragraph>
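The feedback loop described above can be illustrated with a small Python sketch. It is not the adaptive learning procedure of the cited works, whose details are not given here. Instead it substitutes a simple perceptron-style update. The candidate representation (token list plus feature dictionary), the function names `distance`, `score` and `feedback_step`, and the learning rate are all hypothetical. The distance omits the swap pass, which does not change the total as long as the swap weight equals one insertion plus one deletion.

```python
def distance(raw, ref, w_ins=5, w_del=1, w_rep=5):
    """Weighted edit distance between two token sequences (no swap pass)."""
    prev = [j * w_ins for j in range(len(ref) + 1)]
    for i, a in enumerate(raw, 1):
        cur = [i * w_del]
        for j, b in enumerate(ref, 1):
            cur.append(min(
                prev[j] + w_del,                          # delete a
                cur[j - 1] + w_ins,                       # insert b
                prev[j - 1] + (0 if a == b else w_rep),   # keep or replace
            ))
        prev = cur
    return prev[-1]


def score(params, features):
    """Linear stand-in for the system's selection score."""
    return sum(params.get(name, 0.0) * value for name, value in features.items())


def feedback_step(params, candidates, reference, rate=0.1):
    """One feedback iteration over a single sentence.

    candidates: list of (translation_tokens, feature_dict) pairs.
    Returns the error: the extra editing cost of the translation the
    system prefers, relative to the closest candidate to the reference.
    """
    chosen = max(candidates, key=lambda c: score(params, c[1]))
    best = min(candidates, key=lambda c: distance(c[0], reference))
    error = distance(chosen[0], reference) - distance(best[0], reference)
    if error > 0:
        # Move the parameters toward the least-cost candidate's features
        # and away from those of the currently preferred candidate.
        for name, value in best[1].items():
            params[name] = params.get(name, 0.0) + rate * value
        for name, value in chosen[1].items():
            params[name] = params.get(name, 0.0) - rate * value
    return error
```

Repeating `feedback_step` over the sentences of the bi-text corpus until the returned error stops shrinking mirrors the iterative adjustment described in the text. The point of the sketch is only that each iteration needs a fast, automatic distance computation to be practical.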
      <Paragraph position="3"> [Figure 4: model for the transfer and generation phases]
 As another application of the quick performance measure, we can construct a feedback-controlled transfer and generation model. Figure 4 shows such a conceptual model for the parameterized transfer and generation phases [Chan 92b], where AST [Chan 92a] is a syntax tree whose nodes are annotated with syntactic and semantic features and NF is a normalized version of AST, which consists of only atomic transfer units. (See also the previous section for the transfer score and generation score.) The Source NF and Target NF are characterized by a set of selection parameters. By jointly considering the parameters characterizing the Source and Target NF, we can adaptively adjust the parameters from the feedback of both directions, just as in the previous section. The feedback control will also tune the parameters to fit the stylistic characteristics of the revised target sentences. Hence, more natural sentences could be generated, and less editing effort could be expected in order to get high quality translation. Again, only a quick performance evaluator can make such a feedback system practical.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. Conclusion
</SectionTitle>
    <Paragraph position="0"> The need for performance evaluation is rising, for both customers and system designers. We proposed a performance evaluation method with which system performance can be evaluated automatically and quickly.</Paragraph>
    <Paragraph position="1"> An approach to improving system performance with a feedback-controlled MT system is proposed based on this measure. Because the revised text is used directly as the reference, the performance measure can indicate the real quality gap between users' expectation and system capability.</Paragraph>
    <Paragraph position="2"> Though we cannot measure very fine-grained features, because not much linguistic knowledge is incorporated, our approach has many advantages over conventional approaches. There is no need for human intervention. The criteria are consistent and objective. Moreover, we are trying to offer the solutions that users really need. Most important of all, the feedback from the measurement makes it fairly easy to fine-tune the system.</Paragraph>
  </Section>
</Paper>