<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1051">
  <Title>Language Model Based Arabic Word Segmentation</Title>
  <Section position="4" start_page="2" end_page="4" type="metho">
    <SectionTitle>
3 Morpheme Segmentation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
3.1 Trigram Language Model
</SectionTitle>
      <Paragraph position="0"> Given an Arabic sentence, we use a trigram language model on morphemes to segment it into a sequence of morphemes {m_1, m_2, ..., m_N}.</Paragraph>
      <Paragraph position="2"> The input to the morpheme segmenter is a sequence of Arabic tokens - we use a tokenizer that looks only at white space and punctuation, e.g. quotation marks, parentheses, periods, commas, etc. A sample of a manually segmented corpus is given below.</Paragraph>
      <Paragraph position="3"> Here multiple occurrences of prefixes and suffixes per word are marked with an underline.</Paragraph>
      <Paragraph position="5"> Awl fy jA}z +p Al# nmsA Al# EAm Al# mADy Ely syAr +p fyrAry $Er b# AlAm fy bTn +h ADTr +t +h Aly Al# AnsHAb mn Al# tjArb w# hw s# y# Ewd Aly lndn l# AjrA' Al# fHwS +At Al# Drwry +p Hsb mA A$Ar fryq
A manually segmented Arabic corpus containing about 140K word tokens has been provided by LDC (http://www.ldc.upenn.edu). We divided this corpus into training and development test sets as described in Section 5.</Paragraph>
      <Paragraph position="6"> jAgwAr. w# s# y# Hl sA}q Al# tjArb fy jAgwAr Al# brAzyly lwsyAnw bwrty mkAn AyrfAyn fy Al# sbAq gdA Al# AHd Al*y s# y# kwn Awly xTw +At +h fy EAlm sbAq +At AlfwrmwlA Many instances of prefixes and suffixes in Arabic are meaning bearing and correspond to a word in English such as pronouns and prepositions. Therefore, we choose a segmentation into multiple prefixes and suffixes. Segmentation into one prefix and one suffix per word, cf. (Darwish 2002), is not very useful for applications like statistical machine translation, (Brown et al. 1993), for which an accurate word-to-word alignment between the source and the target languages is critical for high quality translations.</Paragraph>
      <Paragraph position="7"> The trigram language model probabilities of morpheme sequences, p(m_i | m_{i-2} m_{i-1}), are estimated from the morpheme-segmented corpus. At token boundaries, the morphemes from previous tokens constitute the histories of the current morpheme in the trigram language model. The trigram model is smoothed using deleted interpolation with the bigram and unigram models (Jelinek 1997), as in (1):
(1) p(m_i | m_{i-2} m_{i-1}) = λ3 p3(m_i | m_{i-2} m_{i-1}) + λ2 p2(m_i | m_{i-1}) + λ1 p1(m_i), where λ1 + λ2 + λ3 = 1.</Paragraph>
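The deleted-interpolation smoothing of (1) can be sketched as follows; the lambda weights and the toy probability tables are hypothetical illustrations, not values from the paper:

```python
def interpolated_trigram_prob(tri, bi, uni, lambdas, m, h1, h2):
    """Deleted interpolation, equation (1):
    p(m | h2 h1) = l3*p3(m | h2 h1) + l2*p2(m | h1) + l1*p1(m),
    with l1 + l2 + l3 = 1.  tri/bi/uni map n-gram tuples to
    maximum-likelihood probabilities from the segmented corpus."""
    l1, l2, l3 = lambdas
    return (l3 * tri.get((h2, h1, m), 0.0)
            + l2 * bi.get((h1, m), 0.0)
            + l1 * uni.get(m, 0.0))

# Toy tables over Buckwalter morphemes (hypothetical estimates):
tri = {("w#", "s#", "y#"): 0.5}
bi = {("s#", "y#"): 0.2}
uni = {"y#": 0.1}
p = interpolated_trigram_prob(tri, bi, uni, (0.1, 0.3, 0.6), "y#", "s#", "w#")
```

For an unseen trigram the higher-order terms contribute zero and the unigram term keeps the probability from vanishing entirely.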
      <Paragraph position="10"> A small morpheme-segmented corpus results in a relatively high out of vocabulary rate for the stems. We describe below an unsupervised acquisition of new stems from a large unsegmented Arabic corpus. However, we first describe the segmentation algorithm.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.2 Decoder for Morpheme Segmentation
</SectionTitle>
      <Paragraph position="0"> We take the unit of decoding to be a sentence that has been tokenized using white space and punctuation. The task of the decoder is to find the morpheme sequence which maximizes the trigram probability of the input sentence, as in (2):
(2) {m_1 ... m_N}* = argmax p(m_1 ... m_N) = argmax Π_{i=1..N} p(m_i | m_{i-2} m_{i-1}), N = number of morphemes in the input.</Paragraph>
      <Paragraph position="3"> The search algorithm for (2) is informally described for each word token as follows: Step 1: Compute all possible segmentations of the token (to be elaborated in 3.2.1). Step 2: Compute the trigram language model score of each segmentation. For some segmentations of a token, the stem may be an out-of-vocabulary item. In that case, we use an &amp;quot;UNKNOWN&amp;quot; class in the trigram language model with the model probability given by (3):
(3) p(m_i | m_{i-2} m_{i-1}) = p(UNKNOWN | m_{i-2} m_{i-1}) x UNK_Fraction, where UNK_Fraction is 1e-9, determined on empirical grounds. This allows us to segment new words with high accuracy even with a relatively high number of unknown stems in the language model vocabulary, cf. experimental results in Tables 5 &amp; 6.</Paragraph>
      <Paragraph position="6"> Step 3: Keep the top N highest scored segmentations.</Paragraph>
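Steps 1-3 can be sketched as below; the language model callable, the stem vocabulary, and the affix-marking convention (trailing # for prefixes, leading + for suffixes) are assumptions for illustration:

```python
import math

UNK_FRACTION = 1e-9  # empirically determined constant from the text

def score_segmentation(morphemes, lm_prob, stem_vocab):
    """Step 2: trigram log-score of one candidate segmentation.
    An out-of-vocabulary stem is scored through the UNKNOWN class,
    scaled by UNK_FRACTION as in (3)."""
    logp = 0.0
    h2, h1 = "<s>", "<s>"
    for m in morphemes:
        is_affix = m.endswith("#") or m.startswith("+")
        if is_affix or m in stem_vocab:
            logp += math.log(lm_prob(m, h2, h1))
        else:
            logp += math.log(lm_prob("<UNK>", h2, h1) * UNK_FRACTION)
        h2, h1 = h1, m
    return logp

def top_n(candidates, lm_prob, stem_vocab, n=1):
    """Steps 1-3: score all candidate segmentations, keep the N best."""
    scored = sorted(candidates,
                    key=lambda s: score_segmentation(s, lm_prob, stem_vocab),
                    reverse=True)
    return scored[:n]

uniform = lambda m, h2, h1: 0.1   # stand-in for the real trigram model
best = top_n([["w#", "krr", "+hA"], ["w#", "krrhA"]], uniform, {"krr"})
```

With a uniform stand-in model, the candidate whose stem is in the vocabulary wins because the unknown-stem candidate pays the UNK_FRACTION penalty.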
      <Paragraph position="7">  Possible segmentations of a word token are restricted to those derivable from a table of prefixes and suffixes of the language for decoder speed-up and improved accuracy. Table 2 shows examples of atomic (e.g. l , t ) and multi-component (e.g. lnullnullnullw , nullnull ) prefixes and suffixes, along with their component morphemes in native Arabic.</Paragraph>
      <Paragraph position="8">  We have acquired the prefix/suffix table from a 110K word manually segmented LDC corpus (51 prefixes &amp; 72 suffixes) and from IBM-Egypt (an additional 14 prefixes &amp; 122 suffixes). The performance improvement from the additional prefix/suffix list ranges from 0.07% to 0.54%, depending on the manually segmented training corpus size: the smaller the manually segmented corpus, the bigger the improvement from the additional prefix/suffix list.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
Prefixes Suffixes
</SectionTitle>
      <Paragraph position="0"> l #l t t + lnullnullnull #l #b nullnull +h t + lnullnullnullw #l #b #w nullw + nw+h Table 2 Prefix/Suffix Table Each token is assumed to have the structure prefix*-stem-suffix*, and is compared against the prefix/suffix table for segmentation. Given a word token, (i) identify all of the matching prefixes and suffixes from the table, (ii) further segment each matching prefix/suffix at each character position, and (iii) enumerate all prefix*-stem-suffix* sequences derivable from (i) and (ii).</Paragraph>
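Steps (i)-(iii) can be sketched as follows, using the affixes of the wAkrrhA example discussed next; the function names and the string-set representation of the prefix/suffix table are hypothetical:

```python
def segmentations(token, prefixes, suffixes):
    """Enumerate prefix*-stem-suffix* analyses of `token` (steps i-iii).
    `prefixes`/`suffixes` hold surface strings such as "w", "wA", "hA";
    matching affixes are also sub-segmented at each character position."""
    results = []
    # (i) all matching prefixes/suffixes, including the null affix ""
    pre = [""] + [p for p in prefixes if token.startswith(p) and len(p) < len(token)]
    suf = [""] + [s for s in suffixes if token.endswith(s) and len(s) < len(token)]
    for p in pre:
        for s in suf:
            if len(p) + len(s) >= len(token):
                continue  # the stem must be non-empty
            stem = token[len(p):len(token) - len(s)]
            # (ii) sub-segment each matching affix at every position
            for psplit in _splits(p, prefixes):
                for ssplit in _splits(s, suffixes):
                    seg = [x + "#" for x in psplit] + [stem] + ["+" + x for x in ssplit]
                    if seg not in results:  # (iii) unique sequences only
                        results.append(seg)
    return results

def _splits(affix, table):
    """All ways to split `affix` into a sequence of table entries."""
    if affix == "":
        return [[]]
    out = []
    for i in range(1, len(affix) + 1):
        if affix[:i] in table:
            out += [[affix[:i]] + rest for rest in _splits(affix[i:], table)]
    return out

segs = segmentations("wAkrrhA", {"w", "A", "wA"}, {"A", "hA", "h"})
```

For wAkrrhA this yields, among others, the S12-style analysis w# A# krr +hA derived by sub-segmenting the matching prefix wA.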
      <Paragraph position="1"> Table 3 shows all possible segmentations of the token wAkrrhA ('and I repeat it'), where [?] indicates the null prefix/suffix and the Seg Score is the language model probability of each segmentation S1 ... S12. For this token, there are two matching prefixes w# and wA# from the prefix table, and two matching suffixes +A and +hA from the suffix table. S1, S2, &amp; S3 are the segmentations given the null prefix [?] and the suffixes [?], +A, +hA. S4, S5, &amp; S6 are the segmentations given the prefix w# and the suffixes [?], +A, +hA. S7, S8, &amp; S9 are the segmentations given the prefix wA# and the suffixes [?], +A, +hA. S10, S11, &amp; S12 are the segmentations given the prefix sequence w# A#, derived from the prefix wA#, and the suffixes [?], +A, +hA. As illustrated by S12, deriving sub-segmentations of the matching prefixes/suffixes enables the system to identify possible segmentations which would otherwise have been missed. In this case, the segmentation including the derived prefix sequence, w# A# krr +hA, happens to be the correct one.</Paragraph>
      <Paragraph position="2">  While the number of possible segmentations is maximized by sub-segmenting matching prefixes and suffixes, some illegitimate sub-segmentations are filtered out on the basis of knowledge specific to the manually segmented corpus. For instance, sub-segmentation of the suffix hA into +h +A is ruled out because the suffix sequence +h +A does not occur in the training corpus. Likewise, sub-segmentation of the prefix Al into A# l# is filtered out. Filtering out improbable prefix/suffix sequences improves the segmentation accuracy, as shown in Table 5.</Paragraph>
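A minimal sketch of this filter, assuming the attested prefix and suffix sequences have been extracted from the training corpus beforehand (the sequence sets below are hypothetical):

```python
def filter_segmentations(segs, attested_prefix_seqs, attested_suffix_seqs):
    """Drop candidate analyses whose prefix or suffix sequence is
    unattested in the manually segmented training corpus, e.g. the
    suffix sequence +h +A, or the prefix sequence A# l# from Al."""
    kept = []
    for seg in segs:
        pre = tuple(m for m in seg if m.endswith("#"))
        suf = tuple(m for m in seg if m.startswith("+"))
        if pre in attested_prefix_seqs and suf in attested_suffix_seqs:
            kept.append(seg)
    return kept

# Hypothetical attested sequences:
prefixes_seen = {(), ("w#",), ("w#", "A#")}
suffixes_seen = {(), ("+hA",)}
kept = filter_segmentations(
    [["w#", "A#", "krr", "+hA"], ["w#", "A#", "krr", "+h", "+A"]],
    prefixes_seen, suffixes_seen)
```

The +h +A analysis is discarded because that suffix sequence is absent from the attested set, mirroring the hA example in the text.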
    </Section>
  </Section>
  <Section position="5" start_page="4" end_page="5" type="metho">
    <SectionTitle>
4 Unsupervised Acquisition of New
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
Stems
</SectionTitle>
      <Paragraph position="0"> Once the seed segmenter is developed on the basis of a manually segmented corpus, the performance may be improved by iteratively expanding the stem vocabulary and retraining the language model on a large automatically segmented Arabic corpus.</Paragraph>
      <Paragraph position="1"> Given a small manually segmented corpus and a large unsegmented corpus, segmenter development proceeds as follows.</Paragraph>
      <Paragraph position="2"> Initialization: Develop the seed segmenter from the manually segmented corpus. Iteration: Use the current segmenter to segment a new portion of the unsegmented corpus, acquire new stems from the automatically segmented text, and retrain the language model; repeat until segmentation performance does not improve any more.</Paragraph>
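The iterative development loop can be sketched generically; the training, evaluation, and model-extension callables are placeholders for the components described in this section and in Section 4:

```python
def develop_segmenter(train, evaluate, extend, seed_corpus, raw_chunks, dev_set):
    """Iterative development: start from the seed model, then repeatedly
    segment a new raw chunk, acquire stems, and retrain (`extend`),
    stopping when the development error no longer improves."""
    model = train(seed_corpus)
    best_error = evaluate(model, dev_set)
    for chunk in raw_chunks:
        candidate = extend(model, chunk)
        error = evaluate(candidate, dev_set)
        if error >= best_error:   # no further improvement: stop
            break
        model, best_error = candidate, error
    return model

# Dummy components standing in for the real segmenter pipeline:
train = lambda seed: set(seed)
evaluate = lambda model, dev: sum(1 for w in dev if w not in model)
extend = lambda model, chunk: model | set(chunk)
m = develop_segmenter(train, evaluate, extend,
                      ["a"], [["b"], ["c"], ["z"]], ["a", "b", "c"])
```

The loop keeps the last model that actually lowered the development error, discarding the iteration that failed to improve.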
      <Paragraph position="3"> Unsupervised acquisition of new stems from an automatically segmented new corpus is a three-step process: (i) select new stem candidates on the basis of a frequency threshold; (ii) filter out new stem candidates containing a sub-string with a high likelihood of being a prefix, suffix, or prefix-suffix, where the likelihood of a sub-string being a prefix, suffix, or prefix-suffix of a token is computed as in (5) to (7); and (iii) further filter out new stem candidates on the basis of contextual information, as in (8).</Paragraph>
      <Paragraph position="4"> (5) P score = number of tokens with prefix P / number of tokens starting with sub-string P
(6) S score = number of tokens with suffix S / number of tokens ending with sub-string S
(7) PS score = number of tokens with prefix P and suffix S / number of tokens starting with sub-string P and ending with sub-string S
Stem candidates containing a sub-string with a high prefix, suffix, or prefix-suffix likelihood are filtered out. Example sub-strings with a prefix, suffix, or prefix-suffix likelihood of 0.85 or higher in a 110K word manually segmented corpus are given in Table 4. If a token starts with the sub-string sn and ends with hA, the sub-strings' likelihood of being the prefix-suffix of the token is 1. If a token starts with the sub-string ll, the sub-string's likelihood of being the prefix of the token is 0.945, etc.</Paragraph>
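Equations (5)-(7) can be computed directly from a segmented corpus; here each corpus entry is assumed to be a (surface, prefix, suffix) triple, a hypothetical representation:

```python
def prefix_score(corpus, P):
    """(5): tokens segmented with prefix P / tokens starting with sub-string P."""
    starts = [e for e in corpus if e[0].startswith(P)]
    with_p = [e for e in starts if e[1] == P]
    return len(with_p) / len(starts) if starts else 0.0

def suffix_score(corpus, S):
    """(6): tokens segmented with suffix S / tokens ending with sub-string S."""
    ends = [e for e in corpus if e[0].endswith(S)]
    with_s = [e for e in ends if e[2] == S]
    return len(with_s) / len(ends) if ends else 0.0

def ps_score(corpus, P, S):
    """(7): tokens with prefix P and suffix S / tokens starting with
    sub-string P and ending with sub-string S."""
    both = [e for e in corpus if e[0].startswith(P) and e[0].endswith(S)]
    with_ps = [e for e in both if e[1] == P and e[2] == S]
    return len(with_ps) / len(both) if both else 0.0

# Toy segmented corpus: (surface form, prefix string, suffix string)
corpus = [("llgd", "ll", ""), ("llh", "", ""), ("snhA", "sn", "hA")]
```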
      <Paragraph position="5">  (8) Contextual Filter: (i) Filter out stems co-occurring with prefixes/suffixes not present in the training corpus. (ii) Filter out stems whose prefix/suffix distributions are highly disproportionate to those seen in the training corpus.</Paragraph>
      <Paragraph position="6"> According to (8), if a stem is followed by a potential suffix +m, not present in the training corpus, then it is filtered out as an illegitimate stem. In addition, if a stem is preceded by a prefix and/or followed by a suffix with a significantly higher proportion than that observed in the training corpus, it is filtered out. For instance, the probability for the suffix +A to follow a stem is less than 50% in the training corpus regardless of the stem properties, and therefore, if a candidate stem is followed by +A with the probability of over 70%, e.g. mAnyl +A, then it is filtered out as an illegitimate stem.</Paragraph>
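The suffix side of the Contextual Filter in (8) can be sketched as below; the suffix-count representation and the 70% cut-off echo the mAnyl +A example, but the data structures are hypothetical:

```python
def contextual_filter(candidates, train_suffixes, max_ratio=0.7):
    """(8): reject a stem candidate if (i) it co-occurs with a suffix
    unseen in the training corpus, or (ii) some suffix accounts for a
    disproportionate share of its occurrences (over max_ratio)."""
    kept = []
    for stem, suffix_counts in candidates.items():
        total = sum(suffix_counts.values())
        ok = True
        for suf, n in suffix_counts.items():
            if not suf:
                continue                      # occurrence with no suffix
            if suf not in train_suffixes:     # (i) e.g. a potential +m
                ok = False
                break
            if n / total > max_ratio:         # (ii) e.g. +A over 70%
                ok = False
                break
        if ok:
            kept.append(stem)
    return kept

candidates = {
    "krr":   {"": 3, "+hA": 1},   # balanced distribution: kept
    "mAnyl": {"": 2, "+A": 8},    # +A at 80%: rejected by (ii)
    "xyz":   {"+m": 1},           # unseen suffix +m: rejected by (i)
}
kept = contextual_filter(candidates, {"+hA", "+A"})
```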
    </Section>
  </Section>
  <Section position="6" start_page="5" end_page="7" type="metho">
    <SectionTitle>
5 Performance Evaluations
</SectionTitle>
    <Paragraph position="0"> We present experimental results illustrating the impact of three factors on segmentation error rate: (i) the base algorithm, i.e. language model training and decoding, (ii) language model vocabulary and training corpus size, and (iii) manually segmented training corpus size.</Paragraph>
    <Paragraph position="1"> Segmentation error rate is defined in (9).</Paragraph>
    <Paragraph position="2"> (9) (number of incorrectly segmented tokens / total number of tokens) x 100 Evaluations have been performed on a development test corpus containing 28,449 word tokens. The test set is extracted from 20001115_AFP_ARB.0060.xml.txt through 20001115_AFP_ARB.0236.xml.txt of the LDC Arabic Treebank: Part 1 v 2.0 Corpus.</Paragraph>
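Equation (9) as a small helper; hyp and ref are assumed to be token-aligned lists of segmentations:

```python
def segmentation_error_rate(hyp, ref):
    """(9): incorrectly segmented tokens / total tokens, times 100."""
    wrong = sum(1 for h, r in zip(hyp, ref) if h != r)
    return 100.0 * wrong / len(ref)

ref = [["w#", "A#", "krr", "+hA"], ["Al#", "ywm"]]
hyp = [["w#", "Akrr", "+hA"], ["Al#", "ywm"]]
rate = segmentation_error_rate(hyp, ref)
```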
    <Paragraph position="3"> Impact of the core algorithm and the unsupervised stem acquisition has been measured on segmenters developed from 4 different sizes of manually segmented seed corpora: 10K, 20K, 40K, and 110K words.</Paragraph>
    <Paragraph position="4"> The experimental results are shown in Table 5. The baseline performances are obtained by assigning each token the most frequently occurring segmentation in the manually segmented training corpus. The column headed by '3-gram LM' indicates the impact of the segmenter using only trigram language model probabilities for decoding.</Paragraph>
    <Paragraph position="5"> Regardless of the manually segmented training corpus size, use of trigram language model probabilities reduces the word error rate of the corresponding baseline by approximately 50%.</Paragraph>
    <Paragraph position="6"> The column headed by '3-gram LM + PS Filter' indicates the impact of the core algorithm plus the Prefix-Suffix Filter discussed in Section 3.2.2. The Prefix-Suffix Filter reduces the word error rate by 7.4% for the smallest (10K word) manually segmented corpus and by up to 21.8% for the largest (110K word) manually segmented corpus - around 1% absolute reduction for all segmenters. The column headed by '3-gram LM + PS Filter + New Stems' shows the impact of unsupervised stem acquisition from a 155 million word Arabic corpus. The word error rate reduction due to the unsupervised stem acquisition is 38% for the segmenter developed from the 10K word manually segmented corpus and 32% for the segmenter developed from the 110K word manually segmented corpus.</Paragraph>
    <Paragraph position="7"> The language model vocabulary size (LM VOC Size) and the unknown stem ratio (OOV ratio) of the various segmenters are given in Table 6. For unsupervised stem acquisition, we have set the frequency threshold at 10 for every 10-15 million word corpus, i.e. any new morpheme occurring more than 10 times in a 10-15 million word corpus is considered a new stem candidate. The prefix, suffix, and prefix-suffix likelihood score used to further filter out illegitimate stem candidates was set at 0.5 for the segmenters developed from the 10K, 20K, and 40K word manually segmented corpora, whereas it was set at 0.85 for the segmenter developed from the 110K word manually segmented corpus.</Paragraph>
    <Paragraph position="8"> Both the frequency threshold and the optimal prefix, suffix, prefix-suffix likelihood scores were determined on empirical grounds.</Paragraph>
    <Paragraph position="9"> The Contextual Filter stated in (8) has been applied only to the segmenter developed from the 110K word manually segmented training corpus.</Paragraph>
    <Paragraph position="10">  Comparison of Tables 5 and 6 indicates a high correlation between the segmentation error rate and the unknown stem ratio.</Paragraph>
    <Paragraph position="11">  Without the Contextual Filter, the error rate of the same segmenter is 3.1%.</Paragraph>
    <Paragraph position="12">  We have classified the errors of the segmenters according to three factors: (i) errors due to unknown stems, (ii) errors involving Alywm, and (iii) errors due to other factors. Interestingly, the segmenter developed from the 110K word manually segmented corpus has the lowest percentage of &amp;quot;unknown stem&amp;quot; errors at 39.6%, indicating that our unsupervised acquisition of new stems is working well, and suggesting that a larger unsegmented corpus be used for unsupervised stem acquisition.</Paragraph>
    <Paragraph position="13"> Alywm should be segmented differently depending on its part-of-speech to capture the semantic ambiguities. If it is an adverb or a proper noun, it is segmented as Alywm 'today/Al-Youm', whereas if it is a noun, it is segmented as Al# ywm 'the day.' Proper segmentation of Alywm primarily requires its part-of-speech information, and cannot be easily handled by morpheme trigram models alone.</Paragraph>
    <Paragraph position="14"> Other errors include over-segmentation of foreign words such as bwtyn as b# wtyn and lytr 'litre' as l# y# tr. These errors are attributed to the segmentation ambiguities of these tokens: bwtyn is ambiguous between 'bwtyn (Putin)' and 'b# wtyn (by aorta)'; lytr is ambiguous between 'lytr (litre)' and 'l# y# tr (for him to harm)'. These errors may also be corrected by incorporating part-of-speech information for disambiguation.</Paragraph>
    <Paragraph position="15"> To address the segmentation ambiguity problem, as illustrated by 'bwtyn (Putin)' vs. 'b# wtyn (by aorta)', we have developed a joint model for segmentation and part-of-speech tagging, for which the best segmentation of an input sentence is obtained according to formula (10), where t_i is the part-of-speech tag of morpheme m_i:
(10) {m_1 ... m_N}* = argmax Π_{i=1..N} p(m_i | m_{i-2} m_{i-1}) p(t_i | t_{i-2} t_{i-1}) p(m_i | t_i)
By using the joint model, the segmentation word error rate of the best performing segmenter has been reduced by about 10%, from 2.9% (cf. the last column of Table 5) to 2.6%.</Paragraph>
  </Section>
</Paper>