<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0712">
  <Title>An Integrated Approach for Arabic-English Named Entity Translation</Title>
  <Section position="4" start_page="87" end_page="88" type="metho">
    <SectionTitle>
3 Integrated Approach for Named Entity
Translation
</SectionTitle>
    <Paragraph position="0"> We introduce an integrated approach for Named Entity (NE) translation using phrase based translation, word based translation, and transliteration approaches in a single framework. Our unified approach could, in principle, handle any NE type for any language pair.</Paragraph>
    <Paragraph position="1"> The level of complication in NE translation depends on the NE type, the original source of the name, any de facto standard translation for certain named entities, and the presence of acronyms. For example, person names tend to be phonetically transliterated, but different sources might use different transliteration styles depending on the original source of the name and the idiomatic translation that has been established. Consider the following two names: &amp;quot;jAk $yrAk&amp;quot; &amp;quot;Jacques Chirac&amp;quot; and &amp;quot;jAk strw&amp;quot; &amp;quot;Jack Straw.&amp;quot; Although the first names in both examples are the same in Arabic, their transliterations should be different. One might be able to distinguish between the two by looking at the last names. This example illustrates why transliteration may not be good for frequently used named entities; transliteration is more appropriate for unknown NEs.</Paragraph>
    <Paragraph position="2"> For locations and organizations, the translation can be a mixture of translation and transliteration.</Paragraph>
    <Paragraph position="5"> These examples highlight some of the complications of NE translation that are difficult to overcome using any phrase based, word based or transliteration approach independently. An approach that integrates phrase and word based translation with transliteration in a systematic and flexible framework could provide a more complete solution to the problem.</Paragraph>
    <Paragraph position="6"> Our system utilizes a parallel corpus to separately acquire the phrases for the phrase based system, the translation matrix for the word based system, and training data for the transliteration system. More details about the three systems will be presented in the next section. Initially, the corpus is automatically annotated with NE types in the source and target languages using NE identifiers similar to the systems described in (Florian et al., 2004) for NE detection.</Paragraph>
  </Section>
  <Section position="5" start_page="88" end_page="90" type="metho">
    <SectionTitle>
4 Translation and Transliteration Modules
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="88" end_page="89" type="sub_section">
      <SectionTitle>
4.1 Word Based NE Translation
</SectionTitle>
      <Paragraph position="0"> * Basic multi-cost NE Alignment We introduce a novel NE alignment technique to align NEs from a parallel corpus that has been automatically annotated with NE types for the source and target languages. We use IBM Model 1, as introduced in (Brown et al., 1993), with a modified alignment cost. The cost function has some similarity to the multi-cost aligning approach introduced by Huang (Huang et al., 2003), but it is significantly different. The cost for aligning any source and target NE word is defined as:</Paragraph>
      <Paragraph position="2"> Cost(we, wf) = l1 (-log p(we|wf)) + l2 Ed(we, wf) + l3 Tag(we, wf), where we and wf are the target and source words respectively and l1, l2, and l3 are the cost weighting parameters.</Paragraph>
      <Paragraph position="3"> The first term p(we|wf) represents the translation log probability of the target word (we) given the source word (wf). The second term Ed(we, wf) is the length-normalized phonetic based edit distance between the two words. This phonetic based edit distance employs an Editex-style (Zobel and Dart, 1996) distance measure, which groups letters that can result in similar pronunciations, but does not require that the groups be disjoint, and can thus reflect the correspondences between letters with similar pronunciation more accurately. The Editex distance d between two letters a and b is: d(a, b) = 0 if the letters are identical, 1 if they are in the same group, and 2 otherwise. The Editex distance between two words is the summation of the Editex distances between their letters, and the length-normalized edit distance is:</Paragraph>
      <Paragraph position="5"> Ed(we, wf) = d(we, wf) / max(|we|, |wf|), where d(we, wf) is the &amp;quot;Editex&amp;quot; style edit distance and max(|we|, |wf|), the maximum of the two lengths for the source and target words, normalizes the edit distance.</Paragraph>
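      <Paragraph> As a minimal sketch, the Editex-style, length-normalized distance described above could be implemented as follows. The letter groups shown are illustrative assumptions for romanized Arabic and English, not the paper's actual groups:

```python
# Editex-style edit distance sketch. GROUPS below are ASSUMED illustrative
# letter groups (possibly overlapping), not the groups used in the paper.
VOWELS = {"a", "e", "i", "o", "u", "y", "w"}
GROUPS = [{"s", "z", "$"}, {"k", "q", "c"}, {"t", "d"}, {"f", "v"}, VOWELS]

def letter_cost(a: str, b: str) -> int:
    """Editex letter distance: 0 if identical, 1 if same group, 2 otherwise."""
    if a == b:
        return 0
    if any(a in g and b in g for g in GROUPS):
        return 1
    return 2

def editex(we: str, wf: str) -> int:
    """Dynamic-programming edit distance with Editex letter costs
    (simplified: insertions/deletions cost a flat 2)."""
    n, m = len(we), len(wf)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + 2  # deletion
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + 2  # insertion
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 2,
                          d[i][j - 1] + 2,
                          d[i - 1][j - 1] + letter_cost(we[i - 1], wf[j - 1]))
    return d[n][m]

def normalized_ed(we: str, wf: str) -> float:
    """Ed(we, wf) = d(we, wf) / max(|we|, |wf|)."""
    return editex(we, wf) / max(len(we), len(wf))
```
</Paragraph>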
      <Paragraph position="6"> The Editex edit distance is deployed between English words and &amp;quot;romanized&amp;quot; Arabic words, with a grouping of similar consonants and a grouping of similar vowels. This helps in identifying the correspondence between rare NEs during alignment. For example, consider two rare NE phrases that occur only once in the training data. If a pure Model 1 alignment were used, the model would conclude that all words could be aligned to all others with equal probability. The multi-cost alignment technique, however, can align the two named entities using a single training sample. This approach has a significant effect in correctly aligning rare NEs.</Paragraph>
      <Paragraph position="7"> The term Tag(we, wf) in the alignment cost function is the NE type cost, which increases the alignment cost when the source and target words are annotated with different types and is zero otherwise. The parameters of the cost function (l1, l2, l3) can be tuned according to the NE category and to the frequency of an NE. For example, in the case of person names, it might be advantageous to use a larger l2 (boosting the weight of transliteration).</Paragraph>
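      <Paragraph> Putting the three terms together, the multi-cost alignment score could be sketched as below; trans_prob, norm_edit_distance, and the unit tag penalty are hypothetical stand-ins supplied by the caller:

```python
import math

def align_cost(we, wf, tag_e, tag_f, trans_prob, norm_edit_distance,
               l1=1.0, l2=1.0, l3=1.0):
    """Sketch of Cost(we, wf) = l1*(-log p(we|wf)) + l2*Ed(we, wf)
    + l3*Tag(we, wf). trans_prob and norm_edit_distance are callables
    supplied by the caller; the mismatch penalty of 1.0 is an assumption."""
    p = max(trans_prob(we, wf), 1e-10)          # floor to avoid log(0)
    tag_cost = 0.0 if tag_e == tag_f else 1.0   # Tag term: zero when types match
    return (l1 * -math.log(p)
            + l2 * norm_edit_distance(we, wf)
            + l3 * tag_cost)
```

A larger l2, as suggested for person names, makes the phonetic term dominate, so orthographically close word pairs win even when their translation probability is poorly estimated.
</Paragraph>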
      <Paragraph position="8"> * Multi-cost Named Entity Alignment by Content Words Elimination In the case of organization and location names, many content words, which are words other than the NEs, occur in the NE phrases. These content words might be aligned incorrectly to rare NE words. A two-phase alignment approach is deployed to overcome this problem. The first phase aligns the content words using a content-word-only translation matrix. The successfully aligned content words are removed from both the source and target sentences. In the second phase, the remaining words are aligned using the multi-cost alignment technique described in the previous section. This two-phase approach filters out the words that might be incorrectly aligned by a single-phase alignment technique. Thus the alignment accuracy is enhanced, especially for organization names, since organization names tend to contain many content words.</Paragraph>
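      <Paragraph> The two phases could be sketched as follows; the content-word table, its 0.5 threshold, and the align_ne callable are assumptions for illustration:

```python
def two_phase_align(src_words, tgt_words, content_table, align_ne):
    """Sketch of the two-phase alignment. content_table maps
    (src, tgt) content-word pairs to translation probabilities
    (the 0.5 threshold is an assumption); align_ne is the
    multi-cost NE aligner applied to whatever words remain."""
    content_pairs = []
    src_rest, tgt_rest = list(src_words), list(tgt_words)
    # Phase 1: align and remove content words via the content-word-only table.
    for s in list(src_rest):
        for t in list(tgt_rest):
            if content_table.get((s, t), 0.0) > 0.5:
                content_pairs.append((s, t))
                src_rest.remove(s)
                tgt_rest.remove(t)
                break
    # Phase 2: align the remaining (mostly NE) words with the multi-cost aligner.
    ne_pairs = align_ne(src_rest, tgt_rest)
    return content_pairs, ne_pairs
```
</Paragraph>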
      <Paragraph position="9"> The following example illustrates the technique. Consider two sentences to be aligned; to avoid language confusion, assume symbolic sentences. The example clarifies that the elimination of some content words facilitates the task of NE alignment, since many of the words that might lead to confusion have been eliminated.</Paragraph>
      <Paragraph position="10"> As shown in the above example, different identifiers can produce mismatched identification of NEs. The &amp;quot;Multi-cost Named Entity Alignment by Content Words Elimination&amp;quot; technique helps in reducing alignment errors due to identification errors by reducing the candidate words for alignment and thus reducing the aligner's confusion.</Paragraph>
    </Section>
    <Section position="2" start_page="89" end_page="89" type="sub_section">
      <SectionTitle>
4.2 Phrase Based Named Entity Translation
</SectionTitle>
      <Paragraph position="0"> For phrase-based NE translation, we used an approach similar to that presented by Tillmann (Tillmann, 2003) for block generation, with modifications suitable for NE phrase extraction. A block is defined to be any pair of source and target phrases. This approach starts from a word alignment generated by HMM Viterbi training (Vogel et al., 1996), which is done in both directions between source and target. The intersection of the two alignments is considered a high-precision alignment, and the union is considered a low-precision alignment. The high-precision alignments are used to generate high-precision blocks, which are further expanded using the low-precision alignments.</Paragraph>
      <Paragraph position="1"> The reader is referred to (Tillmann, 2003) for a detailed description of the algorithm.</Paragraph>
      <Paragraph position="2"> In our approach, for extracting NE blocks, we limited high precision alignments to NE phrases of the same NE types. In the expansion phase, the multi-cost function described earlier is used. Thus the blocks are expanded based on a cost depending on the type matching cost, the edit distance cost and the translation probability cost.</Paragraph>
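      <Paragraph> The intersection/union construction and the NE type restriction could be sketched as follows; the link sets and per-position NE type lists are hypothetical inputs, and the expansion step itself is omitted:

```python
def precision_links(a_sf, a_fs):
    """a_sf and a_fs are sets of (src_idx, tgt_idx) links from the two
    HMM Viterbi alignment directions. Intersection gives high-precision
    links; union gives low-precision links."""
    return a_sf & a_fs, a_sf | a_fs

def seed_blocks(high_links, src_types, tgt_types):
    """Keep only high-precision links whose source and target NE types
    match, as required for NE block extraction."""
    return {(i, j) for (i, j) in high_links if src_types[i] == tgt_types[j]}
```
</Paragraph>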
      <Paragraph position="3"> To explain this procedure, consider the following sentence pair:</Paragraph>
      <Paragraph position="5"> &amp;quot;Japanese Foreign Minister Nobutaka Machimura has summoned the Chinese ambassador</Paragraph>
    </Section>
    <Section position="3" start_page="89" end_page="90" type="sub_section">
      <SectionTitle>
Wang Yee
</SectionTitle>
      <Paragraph position="0"> The underlined words are the words that have been identified by the NE identifiers as person names. In the Arabic sentence, the identifier missed the second name of the first Named Entity (mA$ymwrA) and did not identify the word as person name by mistake. The high precision block generation technique will generate the following two blocks:</Paragraph>
      <Paragraph position="2"> The expansion technique will try to expand each block in all four possible dimensions (right and left of the block on the target and source sides). Therefore, the multi-cost expansion technique enables expansions sensitive to the translation probability and the edit distance, and provides a mechanism to overcome NE identifier errors.</Paragraph>
    </Section>
    <Section position="4" start_page="90" end_page="90" type="sub_section">
      <SectionTitle>
4.3 Named Entity Transliteration
</SectionTitle>
      <Paragraph position="0"> NE transliteration is essential for translating Out Of Vocabulary (OOV) words that are not covered by the word or phrase based models. As mentioned earlier, phonetic and orthographic differences between Arabic and English make NE transliteration challenging.</Paragraph>
      <Paragraph position="1"> We used a block based transliteration method, which transliterates a sequence of letters from the source language to a sequence of letters in the target language. These source and target sequences constitute the blocks, which enables the modeling of vowel insertion. For example, consider the Arabic name &amp;quot;$kry,&amp;quot; which is transliterated as &amp;quot;Shoukry.&amp;quot; The system tries to model bi-grams from the source language to n-grams in the target language as follows: $k -&gt; shouk, kr -&gt; kr, ry -&gt; ry. To obtain these block translation probabilities, we use the translation matrix generated in section 4.1 from the word based alignment models. First, the translation matrix is filtered to preserve only highly confident translations; translations with probabilities less than a certain threshold are filtered out. Second, the resulting highly confident translations are further refined by calculating the phonetic based edit distance between the romanized Arabic and English names. Name pairs with an edit distance greater than a predefined threshold are also filtered out. The remaining highly confident name pairs are used to train a letter to letter translation matrix using HMM Viterbi training (Vogel et al., 1996).</Paragraph>
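      <Paragraph> The two filtering steps that select transliteration training pairs could be sketched as below; the threshold values and the edit-distance callable are assumptions, not the paper's settings:

```python
def select_training_pairs(trans_matrix, edit_distance, p_min=0.5, d_max=0.3):
    """Sketch of the two-step filter over the word translation matrix.
    trans_matrix maps (src_name, tgt_name) -> translation probability;
    edit_distance is a normalized phonetic distance callable.
    p_min and d_max are ASSUMED thresholds."""
    pairs = []
    for (src, tgt), p in trans_matrix.items():
        if p < p_min:                        # step 1: drop low-confidence entries
            continue
        if edit_distance(src, tgt) > d_max:  # step 2: drop phonetically distant pairs
            continue
        pairs.append((src, tgt))
    return pairs
```

The surviving pairs would then feed the letter-to-letter HMM Viterbi training described above.
</Paragraph>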
      <Paragraph position="2"> Each bi-gram of letters on the source side is aligned to an n-gram letter sequence on the target side, such that vowels have a very low cost of being aligned to NULL. The block probabilities are calculated and refined iteratively for each source and target sequence. Finally, for a source block s and a target block t, the probability of s being translated as t is the ratio of their co-occurrence count to the total source occurrence count: P(t|s) = N(s, t) / N(s).</Paragraph>
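      <Paragraph> The count-ratio estimate P(t|s) = N(s, t) / N(s) can be sketched directly from a list of aligned block pairs:

```python
from collections import Counter

def block_probs(aligned_blocks):
    """Estimate P(t|s) = N(s, t) / N(s) from a list of aligned
    (source_block, target_block) pairs."""
    pair_counts = Counter(aligned_blocks)           # N(s, t)
    src_counts = Counter(s for s, _ in aligned_blocks)  # N(s)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}
```
</Paragraph>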
      <Paragraph position="3"> The resulting block translation probabilities and the letter to letter translation probabilities are combined to construct a Weighted Finite State Transducer (WFST) for translating any source sequence to a target sequence.</Paragraph>
      <Paragraph position="4"> Furthermore, the constructed translation WFST is composed with two language model (LM) transducers, namely a letter trigram model and a word unigram model. The letter trigram LM provides high recall, while the word unigram LM provides high precision.</Paragraph>
    </Section>
    <Section position="5" start_page="90" end_page="90" type="sub_section">
      <SectionTitle>
4.4 System Integration and Decoding
</SectionTitle>
      <Paragraph position="0"> The three constructed models in the steps above, namely phrase-based NE translation, word-based translation, and transliteration, are used to generate hypotheses for each source NE phrase.</Paragraph>
      <Paragraph position="1"> We used a dynamic programming beam search decoder similar to the decoder described by Tillmann (Tillmann, 2003).</Paragraph>
      <Paragraph position="2"> We employed two language models built from NE phrases extracted from monolingual target data for each NE category under consideration. The first is a trigram language model on NE phrases. The second is a class based language model with a class for unknown NEs. Every NE that exists in the monolingual data but is outside the vocabulary of the phrase and word translation models is considered unknown. This helps in correctly scoring OOV hypotheses produced by the transliteration module.</Paragraph>
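      <Paragraph> A minimal sketch of mapping a hypothesis into the class-based LM vocabulary, assuming hypothetical sets mono_nes (NEs seen in monolingual data) and trans_vocab (phrase/word translation vocabulary) and an assumed class token name:

```python
def to_class_tokens(hypothesis, mono_nes, trans_vocab, unk_class="UNKNOWN_NE"):
    """Map words that appear in the monolingual NE data but not in the
    translation vocabulary to the unknown-NE class token, so the
    class-based LM can score transliteration-produced OOV hypotheses."""
    return [unk_class if (w in mono_nes and w not in trans_vocab) else w
            for w in hypothesis]
```
</Paragraph>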
    </Section>
  </Section>
  <Section position="6" start_page="90" end_page="91" type="metho">
    <SectionTitle>
5 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> We test our system on Arabic to English NE translation for three NE categories, namely names of persons, organizations, and locations. The system was trained on a news domain parallel corpus containing 2.8 million Arabic words and 3.4 million English words. Monolingual English data was annotated with NE types, and the extracted named entities were used to train the various language models described earlier.</Paragraph>
    <Paragraph position="1"> We manually constructed a test set for each category. The BLEU score (Papineni et al., 2002) with a single reference translation was deployed for evaluation. BLEU-3, which uses up to 3-grams, is deployed, since a three word phrase is a reasonable length for various NE types. Table 1 reports the results for person names; results for each category are reported with the same three systems presented before. It is also worth mentioning that evaluating the system using a single reference has limitations; many good translations are considered wrong because they do not exist in the single reference.</Paragraph>
  </Section>
</Paper>