<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1039">
  <Title>Phrasal Cohesion and Statistical Machine Translation</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Statistical machine translation (SMT) seeks to develop mathematical models of the translation process whose parameters can be automatically estimated from a parallel corpus. The first work in SMT, done at IBM (Brown et al., 1993), developed a noisy-channel model, factoring the translation process into two portions: the translation model and the language model. The translation model captures the translation of source language words into the target language and the reordering of those words. The language model ranks the outputs of the translation model by how well they adhere to the syntactic constraints of the target language (though usually a simple word n-gram model is used for the language model).</Paragraph>
    <Paragraph position="1"> The prime deficiency of the IBM model is the reordering component. Even in the most complex of the five IBM models, the reordering operation pays little attention to context and none at all to higher-level syntactic structures. Many attempts have been made to remedy this by incorporating syntactic information into translation models. These have taken several different forms, but all share the basic assumption that phrases in one language tend to stay together (i.e., cohere) during translation and thus the word-reordering operation can move entire phrases, rather than moving each word independently.</Paragraph>
    <Paragraph position="2"> Yarowsky et al. (2001) state that, during their work on noun phrase bracketing, they found strong cohesion among noun phrases, even when comparing English to Czech, a relatively free word order language. Other than this, there is little in the SMT literature to validate the coherence assumption. Several studies have reported alignment or translation performance for syntactically augmented translation models (Wu, 1997; Wang, 1998; Alshawi et al., 2000; Yamada and Knight, 2001; Jones and Havrilla, 1998), and these results have been promising. However, without a focused study of the behavior of phrases across languages, we cannot know how far these models can take us and what specific pitfalls they face.</Paragraph>
    <Paragraph position="3"> The particulars of cohesion will clearly depend upon the pair of languages being compared. Intuitively, we expect that while French and Spanish will have a high degree of cohesion, French and Japanese may not. It is also clear that if the cohesion between two closely related languages is not high enough to be useful, then there is no hope for these methods when applied to distantly related languages. For this reason, we have examined phrasal cohesion for French and English, two languages which are fairly close syntactically but have enough differences to be interesting.</Paragraph>
    <Paragraph position="4"> Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, July 2002, pp. 304-311. Association for Computational Linguistics.</Paragraph>
  </Section>
</Paper>