File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/06/w06-3114_abstr.xml
Size: 2,587 bytes
Last Modified: 2025-10-06 13:45:39
<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3114">
<Title>Manual and Automatic Evaluation of Machine Translation between European Languages</Title>
<Section position="1" start_page="0" end_page="0" type="abstr">
<SectionTitle> Abstract </SectionTitle>
<Paragraph position="0"> We evaluated machine translation performance for six European language pairs that participated in a shared task: translating French, German, and Spanish texts to English and back. Evaluation was done automatically using the BLEU score and manually on fluency and adequacy.</Paragraph>
<Paragraph position="1"> For the 2006 NAACL/HLT Workshop on Machine Translation, we organized a shared task to evaluate machine translation performance. 14 teams from 11 institutions participated, ranging from commercial companies and industrial research labs to individual graduate students.</Paragraph>
<Paragraph position="2"> The motivation for such a competition is to establish baseline performance numbers for defined training scenarios and test sets. We assembled various forms of data and resources: a baseline MT system, language models, and prepared training and test sets, resulting in actual machine translation output from several state-of-the-art systems and manual evaluations. All of this is available at the workshop website. The shared task is a follow-up to the one we organized in the previous year, at a similar venue (Koehn and Monz, 2005). As then, we concentrated on the translation of European languages and the use of the Europarl corpus for training. Again, most systems that participated could be categorized as statistical phrase-based systems. While there are now a number of competitions -- DARPA/NIST (Li, 2005), IWSLT (Eck and Hori, 2005), TC-Star -- this one focuses on text translation between various European languages.</Paragraph>
<Paragraph position="3"> This year's shared task changed in some aspects from last year's: * We carried out a manual evaluation in addition to the automatic scoring. Manual evaluation was done by the participants. This revealed interesting clues about the properties of automatic and manual scoring.</Paragraph>
<Paragraph position="4"> * We evaluated translation from English, in addition to into English. English was again paired with German, French, and Spanish.</Paragraph>
<Paragraph position="5"> We dropped, however, one of the languages, Finnish, partly to keep the number of tracks manageable, partly because we assumed that it would be hard to find enough Finnish speakers for the manual evaluation.</Paragraph>
<Paragraph position="6"> * We included an out-of-domain test set. This allows us to compare machine translation performance in-domain and out-of-domain.</Paragraph>
</Section>
</Paper>