File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/n06-2051_intro.xml
Size: 2,479 bytes
Last Modified: 2025-10-06 14:03:31
<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2051"> <Title>Bridging the Inflection Morphology Gap for Arabic Statistical Machine Translation</Title> <Section position="4" start_page="0" end_page="201" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The problem of translating from a language exhibiting rich inflectional morphology to a language exhibiting relatively poor inflectional morphology presents several challenges to the existing components of the statistical machine translation (SMT) process. This inflection gap causes an abundance of surface word forms in the source language compared with relatively few forms in the target language. (We use the term surface form to refer to a series of characters separated by whitespace.) This mismatch aggravates several issues found in natural language processing: more unknown word forms in unseen data, more words occurring only once, more distinct words, and lower token-to-type ratios (mean number of occurrences over all distinct words) in the source language than in the target language.</Paragraph> <Paragraph position="1"> Lexical relationships under the standard IBM models (Brown et al., 1993) do not account for many-to-many mappings, and phrase extraction relies heavily on the accuracy of the IBM word-to-word alignment. In this work, we propose an approach to bridge the inflectional gap that addresses the issues described above through a series of pre-processing steps based on the Buckwalter Arabic Morphological Analyzer (BAMA) tool (Buckwalter, 2004). 
While Lee et al. (2003) develop accurate segmentation models of Arabic surface word forms using manually segmented data, we rely instead on the translated context in the target language, leveraging the manually constructed lexical gloss from BAMA to select the appropriate segmented sense for each Arabic source word.</Paragraph> <Paragraph position="2"> Our technique, applied as preprocessing to the source corpus, splits and normalizes surface words based on the target sentence context. In contrast to Popovic and Ney (2004) and Niessen and Ney (2004), we do not modify the IBM models, and we leave reordering effects to the decoder. Statistically significant improvements (Zhang and Vogel, 2004) in BLEU and NIST translation scores over a lightly stemmed baseline are reported on the available and well-known BTEC IWSLT'05 Arabic-English corpus (Eck and Hori, 2005).</Paragraph> </Section> </Paper>