<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2013">
  <Title>Arabic Preprocessing Schemes for Statistical Machine Translation</Title>
  <Section position="4" start_page="49" end_page="49" type="metho">
    <SectionTitle>
3 Arabic Linguistic Issues
</SectionTitle>
    <Paragraph position="0"> Arabic is a morphologically complex language with a large set of morphological features. These features are realized using both concatenative (affixes and stems) and templatic (root and patterns) morphology, with a variety of morphological and phonological adjustments that appear in word orthography and interact with orthographic variations. Certain letters in Arabic script are often spelled inconsistently, which leads to an increase in both sparsity (multiple forms of the same word) and ambiguity (the same form corresponding to multiple words). For example, variants of Hamzated Alif, أ or إ, are often written without their Hamza (ء): ا. Another example is the optionality of diacritics in Arabic script. We assume all of the text we are using is undiacritized.</Paragraph>
    <Paragraph position="1"> Arabic has a set of attachable clitics, to be distinguished from inflectional features such as gender, number, person and voice. These clitics are written attached to the word and thus increase its ambiguity. We distinguish three degrees of cliticization that apply to a word base in a strict order:</Paragraph>
  </Section>
  <Section position="5" start_page="49" end_page="49" type="metho">
    <SectionTitle>
[CONJ+ [PART+ [Al+ BASE +PRON]]]
</SectionTitle>
    <Paragraph position="0"> At the deepest level, the BASE can have a definite article (Al+ 'the'; see footnote 3) or a member of the class of pronominal enclitics, +PRON (e.g. +hm 'their/them'). Next comes the class of particle proclitics (PART+): l+ 'to/for', b+ 'by/with', k+ 'as/such' and s+ 'will/future'. Most shallow is the class of conjunction proclitics (CONJ+): w+ 'and' and f+ 'then'.</Paragraph>
    <Paragraph position="1"> Footnote 3: Arabic transliterations are provided in the Buckwalter transliteration scheme (Buckwalter, 2002).</Paragraph>
    <Paragraph position="2"> These phenomena highlight two issues related to preprocessing. First, ambiguity in Arabic words is an important issue to address. Deciding whether a clitic or feature should be split off or abstracted off requires establishing that the feature is indeed present in the word in its context, not merely that it is possible according to an analyzer or, worse, that a regular expression happens to match.</Paragraph>
    <Paragraph position="3"> Second, once a specific analysis is determined, the process of splitting off or abstracting off a feature must be clear about the form of the resulting word. For example, the word كتبتهم ktbthm has two possible readings (among others): 'their writers' or 'I wrote them'. Splitting off the pronominal clitic +hm without normalizing the t to p in the nominal reading leads to the coexistence of two forms of the noun: ktbp and ktbt. This increased sparsity is made worse by the fact that the second form is also the verbal form, which increases ambiguity as well.</Paragraph>
  </Section>
  <Section position="6" start_page="49" end_page="50" type="metho">
    <SectionTitle>
4 Preprocessing: Schemes and Techniques
</SectionTitle>
    <Paragraph position="0"> A scheme is a specification of the form of the preprocessed output, whereas a technique is the method used to create such output. We examine six different schemes and three techniques.</Paragraph>
    <Section position="1" start_page="49" end_page="50" type="sub_section">
      <SectionTitle>
4.1 Preprocessing Techniques
</SectionTitle>
      <Paragraph position="0"> The techniques chosen illustrate three degrees of dependence on linguistic knowledge. The first is very light and cheap. The second is more expensive, requiring the use of a morphological analyzer.</Paragraph>
      <Paragraph position="1"> The third is more expensive still: it is a disambiguation system that requires both an analyzer and a disambiguated training corpus.</Paragraph>
      <Paragraph position="2"> • REGEX is the baseline technique. It is simply greedy regular expression matching to modify strings and/or split off prefix/suffix substrings that look like the clitics indicated by specific schemes.</Paragraph>
      <Paragraph position="3"> REGEX cannot be used with complex schemes such as EN and MR (see Section 4.2).</Paragraph>
      <Paragraph position="4"> • BAMA, the Buckwalter Arabic Morphological Analyzer (Buckwalter, 2002), is used to obtain possible word analyses. Using BAMA prevents incorrect greedy REGEX matches. Since BAMA produces multiple analyses, we always select one in a consistent, arbitrary manner (the first in a sorted list of analyses).
      • MADA, the Morphological Analysis and Disambiguation for Arabic tool, is an off-the-shelf resource for Arabic disambiguation (Habash and Rambow, 2005). MADA selects among BAMA analyses using a combination of classifiers for 10 orthogonal dimensions, including POS, number, gender, and pronominal clitics.</Paragraph>
      <Paragraph position="5"> For BAMA and MADA, applying a preprocessing scheme involves moving features (as specified by the scheme) out of the chosen word analysis and regenerating the word without the split-off features (Habash, 2004). The regeneration guarantees that the word form is normalized.</Paragraph>
    </Section>
    <Section position="2" start_page="50" end_page="50" type="sub_section">
      <SectionTitle>
4.2 Preprocessing Schemes
</SectionTitle>
      <Paragraph position="0"> Table 1 exemplifies the effect of the different schemes on the same sentence.</Paragraph>
      <Paragraph position="1"> • ST: Simple Tokenization is the baseline preprocessing scheme. It is limited to splitting off punctuation and numbers from words and removing any diacritics that appear in the input. This scheme requires no disambiguation.</Paragraph>
      <Paragraph position="2"> • D1, D2, and D3: Decliticizations. D1 splits off the class of conjunction clitics (w+ and f+). D2 splits off the class of particles (l+, k+, b+ and s+) in addition to what D1 splits off. Finally, D3 splits off everything D2 does, plus the definite article (Al+) and all pronominal clitics.</Paragraph>
      <Paragraph position="3"> • MR: Morphemes. This scheme breaks up words into stem and affixal morphemes.</Paragraph>
      <Paragraph position="4"> • EN: English-like. This scheme is intended to minimize differences between Arabic and English.</Paragraph>
      <Paragraph position="5"> It decliticizes similarly to D3; however, it uses lexemes and English-like POS tags instead of the regenerated word, and it indicates the pro-dropped verb subject explicitly as a separate token.</Paragraph>
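      <Paragraph position="6"> To make the D1, D2 and D3 definitions concrete, the following is a minimal sketch of how the REGEX technique might apply them. It is not the authors' implementation: the function name regex_split, the pattern layout, and the one-element pronoun list are assumptions made for illustration, and the example word wbAlktAb ('and with the book') is not taken from the paper. The pattern follows the nesting CONJ+, PART+, Al+, BASE, +PRON given in Section 3.

```python
import re

# Clitic classes named in the text, in Buckwalter transliteration.
CONJ = "[wf]"      # w+ 'and', f+ 'then'
PART = "[lbks]"    # l+ 'to/for', b+ 'by/with', k+ 'as/such', s+ 'will/future'
ARTICLE = "Al"     # definite article
PRON = "hm"        # assumption: only +hm is listed in the text; the real class is larger

LEVELS = {"D1": 1, "D2": 2, "D3": 3}


def regex_split(word, scheme):
    """Greedily split off whatever looks like a clitic under the given scheme."""
    level = LEVELS[scheme]
    proclitics = [CONJ, PART, ARTICLE][:level]
    pattern = "^" + "".join(f"({p})?" for p in proclitics) + "(.+?)"
    if level == 3:
        pattern += f"({PRON})?"
    match = re.match(pattern + "$", word)
    if match is None:
        return [word]
    groups = list(match.groups())
    enclitic = groups.pop() if level == 3 else None
    base = groups.pop()
    tokens = [g + "+" for g in groups if g]   # proclitics, outermost first
    tokens.append(base)
    if enclitic:
        tokens.append("+" + enclitic)
    return tokens


print(regex_split("wbAlktAb", "D1"))  # ['w+', 'bAlktAb']
print(regex_split("wbAlktAb", "D2"))  # ['w+', 'b+', 'AlktAb']
print(regex_split("wbAlktAb", "D3"))  # ['w+', 'b+', 'Al+', 'ktAb']
print(regex_split("ktbthm", "D3"))    # ['k+', 'tbt', '+hm']
```

      The last call shows the kind of error that pure pattern matching invites: the initial k of ktbthm is part of the stem, yet it is split off because it looks like the particle k+. Even a correct split of +hm would leave ktbt rather than the normalized ktbp, which is the gap that regeneration under BAMA and MADA is meant to close.</Paragraph>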
    </Section>
  </Section>
  <Section position="7" start_page="50" end_page="51" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> We use the phrase-based SMT system Portage (Sadat et al., 2005). For training, Portage uses IBM word alignment models (models 1 and 2) trained in both directions to extract phrase tables. The maximum phrase size is 8. Trigram language models are implemented using the SRILM toolkit (Stolcke, 2002). Decoding weights are optimized using Och's algorithm (Och, 2003) to set weights for the four components of the log-linear model: language model, phrase translation model, distortion model, and word-length feature. The weights are optimized over the BLEU metric (Papineni et al., 2001). The Portage decoder, Canoe, is a dynamic-programming beam search algorithm resembling the one described by Koehn (2004a).</Paragraph>
    <Paragraph position="1"> All of the training data we use is available from the Linguistic Data Consortium (LDC). We use an Arabic-English parallel corpus of about 5 million words for translation model training data. We created the English language model from the English side of the parallel corpus together with 116 million words from the English Gigaword Corpus (LDC2005T12) and 128 million words from the English side of the UN Parallel corpus (LDC2004E13).</Paragraph>
    <Paragraph position="2"> English preprocessing comprised down-casing, separating punctuation from words and splitting off "'s". Arabic preprocessing was varied using the proposed schemes and techniques. Decoding weight optimization was done on 200 sentences from the 2003 NIST MT evaluation test set. We used two different test sets: (a) the 2004 NIST MT evaluation test set (MT04) and (b) the 2005 NIST MT evaluation test set (MT05). MT04 is a mix of news, editorials and speeches, whereas MT05, like the training data, is purely news. We use the evaluation metric BLEU-4 (Papineni et al., 2001).</Paragraph>
    <Paragraph position="3"> We ran experiments with all possible combinations of the schemes and techniques discussed in Section 4, using three training corpus sizes: 1%, 10% and 100%.</Paragraph>
    <Paragraph position="4"> The results of the experiments are summarized in  BLEU-4 difference to be significant at the 95% confidence level for 1% training. For all other training sizes, the difference must be over 1.7% BLEU-4. Error intervals were computed using bootstrap resampling (Koehn, 2004b).</Paragraph>
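    <Paragraph position="5"> As a rough illustration of how bootstrap error intervals of this kind can be computed, the sketch below resamples test sentences with replacement and reads a percentile interval off the resampled scores. It is a simplification under stated assumptions, not the procedure used in the experiments: the function name bootstrap_interval and its parameters are hypothetical, and the mean of per-sentence scores stands in for corpus-level BLEU-4, which would instead be recomputed from the n-gram statistics of each resampled test set.

```python
import random


def bootstrap_interval(sentence_scores, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-sentence scores.

    Assumption: the mean is a stand-in for a corpus-level metric such as BLEU.
    """
    rng = random.Random(seed)
    n = len(sentence_scores)
    means = []
    for _ in range(n_resamples):
        # Draw a test set of the same size, sampling sentences with replacement.
        sample = [sentence_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    tail = (1.0 - confidence) / 2.0
    low = means[int(tail * n_resamples)]
    high = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return low, high


# Made-up per-sentence scores, for illustration only.
scores = [0.10, 0.40, 0.35, 0.80, 0.60, 0.25, 0.55, 0.45]
print(bootstrap_interval(scores))
```

    Two systems would then be considered significantly different when the gap between their scores exceeds what such intervals can explain, which is the role played by the BLEU-4 thresholds quoted above.</Paragraph>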
  </Section>
  <Section position="8" start_page="51" end_page="51" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> Across schemes, EN performs best under scarce-resource conditions, and D2 performs best under large-resource conditions. Across techniques and under scarce-resource conditions, MADA is better than BAMA, which is better than REGEX. Under large-resource conditions, the difference between techniques is statistically insignificant, although it is generally sustained across schemes.</Paragraph>
    <Paragraph position="1"> The baseline for MT05, which is entirely in the news genre like the training data, is considerably higher than that for MT04 (a mix of genres). To investigate the effect of different schemes and techniques on different genres, we isolated the sentences in MT04 that come from the editorial and speech genres. We performed experiments similar to those reported above on this subset of MT04. We found that the effect of the choice of preprocessing technique+scheme was amplified.</Paragraph>
    <Paragraph position="2"> For example, MADA+D2 (with 100% training) on non-news improved the system score by 12% over the baseline ST (statistically significant), compared with 2.4% for news only.</Paragraph>
    <Paragraph position="3"> Further analysis shows that combining the output of all six schemes has large potential for improvement over each of the individual systems, suggesting a high degree of complementarity. For example, an oracle combination, created by selecting for each input sentence the output with the highest sentence-level BLEU score, yields a 19% relative improvement in BLEU score (from 37.1 for D2 to 44.3) for MT04 under MADA with 100% training.</Paragraph>
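    <Paragraph position="4"> The oracle selection described above can be sketched as follows. This is an illustrative outline rather than the authors' code: the function names, the toy sentences, and the scoring function are all assumptions, and a simple unigram F1 stands in for sentence-level BLEU. The last function reproduces the arithmetic behind the reported 19% figure.

```python
from collections import Counter


def unigram_f1(hypothesis, reference):
    """Toy sentence-level score (assumption: a stand-in for sentence-level BLEU)."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    overlap = sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def oracle_select(outputs_by_scheme, references, score=unigram_f1):
    """For each sentence, keep the scheme whose output scores highest."""
    chosen = []
    for i, reference in enumerate(references):
        best = max(outputs_by_scheme,
                   key=lambda scheme: score(outputs_by_scheme[scheme][i], reference))
        chosen.append((best, outputs_by_scheme[best][i]))
    return chosen


def relative_gain(baseline, improved):
    """Relative improvement in percent."""
    return 100.0 * (improved - baseline) / baseline


# Made-up system outputs, for illustration only.
references = ["the president ended his visit", "he went home"]
outputs = {
    "D2": ["the president ended his visit", "he go home"],
    "EN": ["president end visit", "he went home"],
}
print(oracle_select(outputs, references))
print(round(relative_gain(37.1, 44.3), 1))  # 19.4
```

    Because the oracle consults the reference translations, its score is an upper bound on what a practical system combination could achieve, not a result that can be obtained at test time.</Paragraph>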
  </Section>
</Paper>