File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/02/w02-0504_metho.xml

Size: 16,683 bytes

Last Modified: 2025-10-06 14:08:03

<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0504">
  <Title>An HMM Approach to Vowel Restoration in Arabic and Hebrew</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Evaluation Methodology
</SectionTitle>
    <Paragraph position="0"> We compare a baseline approach using a unigram model to a bigram model. We train both models on a corpus of diacritisized text, and then check the models' performance on an unseen test set, by removing the vowel diacritics from part of the corpus. For both Hebrew and Arabic, we evaluate performance by measuring the percentage of words in the test set whose vowel pattern was restored correctly, i.e. the vowel pattern suggested by the system exactly matched the original. We refer to this performance measure as word accuracy. For Hebrew, we also divided the vowel symbols into separate groups, each one corresponding to a specific phonetic value. We then measured the percentage of words whose individual letters were fitted with a vowel diacritic belonging to the same phonetic group as the correct vowel diacritic in the test set. In other words, the restored vowels, while perhaps not agreeing exactly with the original pattern, all belonged to the correct phonetic group. This performance measure, which corresponds to vocalization of non-voweled text, is useful for applications such as text-to-speech systems.2 We refer to this performance measure as phonetic group accuracy.</Paragraph>
    <Paragraph position="1"> There is an unfortunate lack of data for vowel-annotated text in both modern Hebrew and Arabic. The only easily accessible sources are the Hebrew Bible and the Qur'an, for which on-line versions transliterated into Latin characters are available. Ancient Hebrew and Arabic bear enough syntactical and semantic resemblance to their modern language equivalents to justify usage of these ancient texts as corpora. For Hebrew, we used the Westminster Hebrew Morphological Database (1998), a corpus containing a complete transcription of the graphical form of the Massoretic text of the Hebrew Bible containing roughly 300,000 words. For the Qur'an, we used the transliterated version publicly available from the sacred text archive at 2 In modern Hebrew, it is generally sufficient to associate each vowel symbol with its phonetic group in order to vocalize the word correctly.</Paragraph>
    <Paragraph position="2"> ba-ra elo-him missing # www.sacred-texts.com. This corpus contains roughly 90,000 words.</Paragraph>
    <Paragraph position="3"> For both languages, we tested our model on 10% of the corpus. We measured performance by evaluating word accuracy for both Hebrew and Arabic. In addition, we measured phonetic group accuracy for Hebrew.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Baseline: A Unigram Model
</SectionTitle>
    <Paragraph position="0"> To assess the difficulty of the problem, we counted the number of times each diacriticized word appeared in the training set. For each non-voweled word encountered in the test set, we searched through all of the words with the same non-voweled structure and picked the diacriticized word with the highest count in the table. Figure 1 shows the ambiguity distribution in the training set.</Paragraph>
    <Paragraph position="2"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
[Figure 1: ambiguity distribution in the training set, with separate panels for Hebrew and Arabic]
</SectionTitle>
      <Paragraph position="0"> Note that for both languages, only about 30% of the words in the training set were unambiguous, i.e. had a single interpretation.</Paragraph>
      <Paragraph position="1"> For the baseline model, we achieved a word accuracy rate of 68% for Hebrew and 74% for Arabic. We note that even though the size of the Arabic training set was about a third of the size of the Hebrew training set, we still achieved a higher success rate of restoring vowels in Arabic. We attribute this to the fact that there are only three possible missing vowel diacritics in modern Arabic text, compared to twelve in Hebrew.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 A Bigram Model
</SectionTitle>
    <Paragraph position="0"> We constructed a bigram Hidden Markov Model (HMM) where hidden states were vowel-annotated (diacritisized) words, and observations were vowel-less words. One example of a path through the HMM for reconstructing a Hebrew sentence is given in Figure 2; ovals represent hidden states that correspond to diacritisized words; rectangles represent observations of vowel-less words; solid edges link the states that mark the transition through the model for generating the desired sentence; each edge carries with it a probability mass, representing the probability of transitioning between the two hidden states connected by the edge. This technique was used for Arabic in a similar way.</Paragraph>
    <Paragraph position="1"> Our model consists of a set of hidden states nTT ,..,1 where each hidden state corresponds to  pronounced /be-reshit bara elohim/ an observed word in our training corpus. Thus, each hidden state corresponds to a word containing its complete vowel pattern. From each hidden state iT , there is a single emission, which simply consists of the word in its non-voweled form. If we make the assumption that the probability of observing a given word depends only on the previous word, we can compute the probability of observing a sentence nn wwW ,...,1,1 = by summing over all of the possible hidden states that the HMM traversed while generating the sentence, as denoted in the following equation. null</Paragraph>
    <Paragraph position="3"> These probabilities of transitions through the states of the model are approximated by bigram counts, as described below. Note that the symbol &amp;quot;#&amp;quot; in the figure serves to &amp;quot;anchor&amp;quot; the initial state of the HMM and facilitate computation.</Paragraph>
    <Paragraph position="4"> Thereafter, the hidden states actually consist of vowel-annotated bigrams. The probability of any possible path in our model that generates this phrase can be computed as follows: )|()( 1,1 [?][?]= iin wwpiWp This equation decomposes into the following maximum likelihood probability estimations, denoted by p^ , in which c(word) denotes the number of instances that word had occurred in the training set and c(word1, word2) denotes the number of joint occurrences of word1 and  In order to be able to compute the likelihood of each bigram, we kept a look-up table consisting of counts for all individual and joint occurrences in the training set. We implemented the Viterbi algorithm to find the most likely path transitions through the hidden states that correspond to the observations. The likelihood of observing the sentence nW ,1 while traversing the hidden state path nT ,1 is taken to be ),( ,1,1 nn TWp . We ignore the normalizing factor )( ,1 nWp . More formally, the most likely path through the model is defined as</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Dealing with Sparse Data
</SectionTitle>
      <Paragraph position="0"> Because our bigram model is trained from a finite corpus, many words are bound to be missing from it. For example, in the unigram model, we found that as many as 16% of the Hebrew words in the test set were not present. The amount of unseen bigrams was even higher, as much as 20 percent. This is not surprising, as we expect some unseen bigrams to consist of words that were both seen before individually. We did not specifically deal with sparse data in the uni-gram base line model.</Paragraph>
      <Paragraph position="1"> As many of the unseen unigrams were non ambiguous, we would have liked to look up the missing words in a vowel-annotated dictionary and copy the vowel pattern found in the dictionary. However, as noted in Section 2, morphology in both Hebrew and Arabic is nonconcatenative. Since dictionaries contain only the root form of verbs and nouns, without a sound morphological analyzer we could not decipher the root. Therefore, proceeded as follows: We employed a technique proposed by Katz (1987) that combines a discounting method along with a back-off method to obtain a good estimate of unseen bigrams. We use the Good-Turing discounting method (Gale &amp; Sampson 1991) to decide how much total probability mass to set aside for all the events we haven't seen, and a simple back-off algorithm to tell us how to distribute this probability. Formally, we define</Paragraph>
      <Paragraph position="3"> Here, dP is the discounted estimate using the Good-Turing method, p is a probability estimated by the number of occurrences and )1(wa is a normalizing factor that divides the unknown if c(w2,w1)&gt;0 if c(w2,w1)=0 probability mass of unseen bigrams beginning with w1.</Paragraph>
      <Paragraph position="4">  In order to compute Pd we create a separate discounting model for each word in the training set. The reason for this is simple: If we use only one model over all of the bigram counts, we would really be approximating )1,2( wwPd . Because we wish to estimate )1|2( wwPd , we define the discounted frequency counts as follows: null</Paragraph>
      <Paragraph position="6"> where cn is the number of different bigrams in the corpus that have frequency c. Following Katz, we estimate the probability of unseen bi-grams to be p(w2|w1) [?] If the missing bigram is composed of two individually observed words, this technique allows us to estimate the probability mass of the unseen bigram. In some cases, the unseen bigram consists of individual words that have never been seen. In other words, w2 itself is unseen and c(w2) cannot be computed. In this case, we estimate the probability for p(w2|w1) by computing p(unseen|w1). We do this by allocating some probability mass to unseen words, keeping a special count for bigrams that were seen less then k times.3 We allocate a separate hidden state for unseen words, as depicted in Figure 2. In this case, we do not attempt to fit any vowel pattern to the unseen word; the word is left bare of its diacritics. However, we can still assign a probability mass, p(unseen|w1), to prevent the Viterbi algorithm from computing a zero prob3 k was arbitrarily set to three in our experiment. Alternatively, we could get a more exact estimation of the missing probability mass by discounting the unigram probabilities of w2.</Paragraph>
      <Paragraph position="7"> ability. We can compute the probabilities p(w2|unseen) in a similar manner.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Results
</SectionTitle>
      <Paragraph position="0"> Figure 3 presents our results using the bigram HMM model, where &amp;quot;Hebrew 1&amp;quot; measures word accuracy be in Hebrew, &amp;quot;Hebrew 2&amp;quot; measures phonetic group accuracy, and &amp;quot;Arabic&amp;quot; measures word accuracy in Arabic. Using the bigram model for Hebrew, we achieved 81% word accuracy and 87% phonetic group accuracy. For Arabic, we achieved 86% word accuracy. For Hebrew, the system was more successful in restoring the phonetic group vowel pattern than restoring the exact diacritics. This is because the number of possible vowel symbols in Hebrew is larger than in Arabic. However, for text-to-speech systems, it is sufficient to associate each vowel with the correct phonetic group. For word accuracy, most of the errors in Hebrew (11%) and in Arabic (8%) were due to words that were not found in the training corpus.</Paragraph>
      <Paragraph position="1"> Therefore, we believe that acquiring a sufficiently large modern corpus of the language would greatly improve performance. However, the number of parameters for our model is quadratic in the number of word types in the training set. Therefore, we suggest using limited morphological analysis to improve performance of the system by attempting to identify the stem or root of the words in the test set, as well as the conjugation. Since conjugation templates in Semitic languages have fixed vowel patterns, even limited success in morphological analyses would greatly improve performance of the system, while not incurring a blowup in the number of parameters.</Paragraph>
      <Paragraph position="3"/>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Related Work
</SectionTitle>
    <Paragraph position="0"> Performing a full morphological analysis of a Hebrew or Arabic sentence would greatly assist the vowel restoration problem. That is, if we could correctly parse each word in a sentence, we could eliminate ambiguity and restore the correct vowel pattern of the word according to its grammatical form and part of speech.</Paragraph>
    <Paragraph position="1"> For Arabic, a morphological analyzer, developed by the Xerox Research Centre Europe (Beesley 1998) is freely available.4 The system uses finite state transducers, traditionally used for modeling concatenative morphology. Since the system is word based, it cannot disambiguate words in context and outputs all possible analyses for each word. The system relies on handcrafted rules and lexicon that govern Arabic morphology.</Paragraph>
    <Paragraph position="2"> For Hebrew, a morphological analyzer called Nakdan Text exists, as part of the Rav Milim project for the processing of modern Hebrew (Choueka and Neeman 1995). Given a sentence in modern Hebrew, Nakdan Text restores its vowel diacritics by first finding all possible morphological analyses and vowel patterns of every word in the sentence. Then, for every such word, it chooses the correct context-dependent vowel pattern using short-context syntactical rules as well as some probabilistic models. The authors report 95% success rate in restoring vowel patterns. It is not clear if this refers to word accuracy or letter accuracy.5 Segel (1997) devised a statistical Hebrew lexical analyzer that takes contextual dependencies into account. Given a non-voweled Hebrew texts as input and achieves 95% word accuracy on test data extracted from the Israeli daily Ha'aretz. However, this method requires fully analyzed Hebrew text to train on. Segel used a morphological hand-analyzed training set consisting of only 500 sentences. Because there is currently no tree bank of analyzed Hebrew text, this method is not applicable to other domains, such as novels or medical texts.</Paragraph>
    <Paragraph position="3">  fifth Bar Ilan international symposium on Artificial Intelligence, but no summary or article was included in its proceedings, and to the best of our knowledge no article has been published describing the methods of Nakdan text.</Paragraph>
    <Paragraph position="4"> Kontorovich and Lee (2001) use an HMM approach to vocalizing Hebrew text. Their model consists of fourteen hidden states, with emissions for each word of the training set.</Paragraph>
    <Paragraph position="5"> Initially, the parameters of the model are chosen at random and training of the model is done using the EM algorithm. They achieve a success rate of 81%, when unseen words are discarded from the test set.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Future Work
</SectionTitle>
    <Paragraph position="0"> Since most of the errors in the model can be attributed to missing words, we plan to address this problem from two perspectives. First, we plan to include a letter-based HMM to be used for fitting an unseen word with a likely vowel pattern. The model would be trained separately on words from the training set. Its hidden states would correspond to vowels in a language, making this model language dependent. We also plan to use a trigram model for the task of vowel restoration, backing off to a bigram model for sparse trigrams.</Paragraph>
    <Paragraph position="1"> Second, we plan to use some degree of morphological analysis to assist us with the restoration of unseen words. At the very least, we could use a morphological analyzer as a dictionary for words that have unique diacritization, but are missing from the model. Since analyzers for Arabic that are commonly available (Beesley 1998) are word based, they output all possible morphological combinations of the word, and it is still unclear how we could choose the most likely parse given the context.</Paragraph>
    <Paragraph position="2"> Finally, since the size of our corpora is relatively small, we also plan to use cross validation to get a better estimate of the generalization error.</Paragraph>
  </Section>
class="xml-element"></Paper>