<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2402">
  <Title>Grouping Multi-word Expressions According to Part-Of-Speech in Statistical Machine Translation</Title>
  <Section position="2" start_page="0" end_page="9" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Statistical machine translation (SMT) originally focused on word-to-word translation and was based on the noisy channel approach (Brown et al., 1993). Present SMT systems differ from the original ones mainly in two respects: first, word-based translation models have been replaced by phrase-based translation models (Zens et al., 2002; Koehn et al., 2003); and second, the noisy channel approach has been generalized to a maximum entropy approach in which a log-linear combination of multiple feature functions is implemented (Och and Ney, 2002).</Paragraph>
    <Paragraph position="1"> Nevertheless, one important fact deserves attention: despite the change from a word-based to a phrase-based translation approach, word-to-word approaches for inferring alignment models from bilingual data (Vogel et al., 1996; Och and Ney, 2003) continue to be widely used.</Paragraph>
    <Paragraph position="2"> On the other hand, observing bilingual data sets makes it evident that in some cases it is simply impossible to perform a word-to-word alignment between two phrases that are translations of each other. For example, certain combinations of words might convey a meaning which is somehow independent of the words they contain. This is the case of bilingual pairs such as &quot;fire engine&quot; and &quot;camión de bomberos&quot;.</Paragraph>
    <Paragraph position="3"> Notice that a word-to-word alignment strategy would most probably provide the following Viterbi alignments for the words contained in the previous example: &quot;camión:truck&quot;, &quot;bomberos:firefighters&quot;, &quot;fuego:fire&quot;, and &quot;máquina:engine&quot;.</Paragraph>
    <Paragraph position="4"> Of course, it cannot be concluded from these examples that an SMT system which uses a word-to-word alignment strategy will be unable to properly handle the kind of expression described above. This is because other models and feature functions are involved, which can help the SMT system to get the right translation.</Paragraph>
    <Paragraph position="5"> However, these observations motivate exploring alternative ways of using multi-word expression information to improve alignment quality and, consequently, translation accuracy. In this sense, our idea of a multi-word expression (hereafter MWE) refers in principle to word sequences which cannot be translated literally word-to-word.</Paragraph>
    <Paragraph position="6"> However, the automatic technique studied in this work for extracting and identifying MWEs does not necessarily follow this definition rigorously.</Paragraph>
    <Paragraph position="7"> In a preliminary study (Lambert and Banchs, 2005), we presented a technique for extracting bilingual multi-word expressions (BMWE) from parallel corpora. In that study, BMWEs identified in a small corpus were grouped as a single token before training alignment models. As a result, both alignment quality and translation accuracy were slightly improved.</Paragraph>
    <Paragraph position="8"> In this paper, we applied the same BMWE extraction technique, with various improvements, to a large corpus (EPPS, described in section 4.1).</Paragraph>
    <Paragraph position="9"> Since this is a statistical technique, and frequencies of multi-word expressions are low (Baldwin and Villavicencio, 2002), the size of the corpus is an important factor. A few very basic rules based on part-of-speech have also been added to filter out noisy entries in the dictionary. Finally, BMWEs have been classified into three categories (nouns, verbs and others). In addition to the impact of the whole set, the impact of each category has been evaluated separately.</Paragraph>
    <Paragraph position="10"> The technique is explained in section 3, after the baseline translation system used is presented (section 2). Experimental results are presented in section 4. Finally, some conclusions are presented and further work in this area is outlined.</Paragraph>
  </Section>
</Paper>