<?xml version="1.0" standalone="yes"?> <Paper uid="H05-2007"> <Title>Pattern Visualization for Machine Translation Output</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Over the last few years, several automatic metrics for machine translation (MT) evaluation have been introduced, largely to reduce the human cost of iterative system evaluation during the development cycle (Papineni et al., 2002; Melamed et al., 2003). All are predicated on the concept of n-gram matching between the sentence hypothesized by the translation system and one or more reference translations--that is, human translations for the test sentence. Although the formulae underlying these metrics vary, each produces a single number representing the &quot;goodness&quot; of the MT system output over a set of reference documents. We can compare the numbers of competing systems to get a coarse estimate of their relative performance. However, this comparison is holistic.</Paragraph> <Paragraph position="1"> It provides no insight into the specific competencies or weaknesses of either system.</Paragraph> <Paragraph position="2"> Ideally, we would like to use automatic methods to provide immediate diagnostic information about the translation output--what the system does well, and what it does poorly. At the most general level, we want to know how our system performs on the two most basic problems in translation--word translation and reordering. Holistic metrics are at odds with day-to-day hypothesis testing on these two problems. For instance, during the development of a new MT system we may wish to compare competing reordering models. We can incorporate each model into the system in turn, and rank the results on a test corpus using BLEU (Papineni et al., 2002). We might then conclude that the model used in the highest-scoring system is best.
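To make the n-gram matching that underlies these metrics concrete, here is a minimal sketch of the clipped (modified) n-gram precision used by BLEU-style scores. The function names and the toy sentence pair are illustrative assumptions, not drawn from the paper; real BLEU additionally combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of the token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision: the fraction of hypothesis n-grams
    that also occur in the reference, with each n-gram's match count
    clipped to its count in the reference (BLEU's modified precision)."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    matches = sum(min(count, ref_counts[g]) for g, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matches / total if total else 0.0

# Hypothetical system output vs. a single reference translation.
hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(ngram_precision(hyp, ref, 1))  # 5 of 6 unigrams match
print(ngram_precision(hyp, ref, 2))  # 3 of 5 bigrams match
```

Note that the two scores collapse all of the system's behavior into single numbers: they say nothing about *which* words or reorderings were handled well, which is precisely the limitation discussed next.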
However, this is merely an implicit test of the hypothesis; it does not tell us anything about the specific strengths and weaknesses of each method, which may be different from our expectations. Furthermore, if we understand the relative strengths of each method, we may be able to devise good ways to combine them, rather than simply using the best one, or combining strictly by trial and error. In order to fine-tune MT systems, we need fine-grained error analysis.</Paragraph> <Paragraph position="3"> What we would really like to know is how well the system is able to capture systematic reordering patterns in the input, which ones it is successful with, and which ones it has difficulty with. Word n-grams are of little help here: they are too many, too sparse, and it is difficult to discern general patterns from them.</Paragraph> </Section></Paper>