<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1612">
  <Title>Automatic diacritization of Arabic for Acoustic Modeling in Speech Recognition</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Motivation and Prior Work
</SectionTitle>
    <Paragraph position="0"> We first describe the Arabic writing system and the problems it poses for speech recognizer training, and then discuss previous attempts at automatic diacritization.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The Arabic Writing System
</SectionTitle>
      <Paragraph position="0"> The Arabic alphabet consists of twenty-eight letters, twenty-five of which represent consonants and three of which represent the long vowels (/i:/, /a:/, /u:/). A distinguishing feature of Arabic-script-based writing systems is that short vowels are not represented by the letters of the alphabet. Instead, they are marked by so-called diacritics, short strokes placed either above or below the preceding consonant. Several other pronunciation phenomena are marked by diacritics, such as consonant doubling (phonemic in Arabic), which is indicated by the "shadda" sign, and the "tanween", i.e. word-final adverbial markers that add /n/ to the pronunciation of the word. These diacritics are listed in Table 1. Arabic texts are almost never fully diacritized; normally, diacritics are used sparingly and only to prevent misunderstandings. Exceptions are important religious and/or political texts or beginners' texts for students of Arabic. The lack of diacritics may lead to considerable lexical ambiguity that must be resolved by contextual information, which in turn presupposes knowledge of the language.</Paragraph>
      <Paragraph position="1"> It was observed in (Debili et al., 2002) that a non-diacritized dictionary word form has 2.9 possible diacritized forms on average and that an Arabic text containing 23,000 word forms showed an average ratio of 1:11.6. The form [Arabic script illegible], for instance, has 21 possible diacritizations. The correspondence between graphemes and phonemes is relatively transparent compared to other languages like English or French: apart from certain special graphemes (e.g. laam alif), the relationship is one to one. Finally, it is worth noting that the writing system described above is that of MSA. Arabic dialects are primarily oral varieties in that they do not have generally agreed-upon writing standards.</Paragraph>
      <Paragraph position="2"> Whenever there is the need to write down dialectal speech, speakers will try to approximate the standard system as far as possible and use a phonetic spelling for non-MSA or foreign words.</Paragraph>
      <Paragraph position="3"> The lack of diacritics in standard Arabic texts makes it difficult to use non-diacritized text for training, since the location and identity of short vowels and other phonetic segments are unknown. One possible approach is to use acoustic models for long vowels and consonants only, where the acoustic signal portions corresponding to unwritten segments are implicitly incorporated into the acoustic models for consonants (Billa et al., 2002). However, this leads to less discriminative acoustic and language models.</Paragraph>
      <Paragraph position="4"> Previous work (Kirchhoff et al., 2002; Lamel, 2003) has compared the word error rates of two CH ECA recognizers: one trained on script transcriptions and another trained on romanized transcriptions. It was shown that the loss of information due to training on script forms results in significantly worse performance: a relative increase in word error rate of almost 10% was observed.</Paragraph>
      <Paragraph position="5"> It seems clear that diacritized data should be used for training Arabic ASR systems whenever possible. As explained above, however, it is very expensive to obtain manually transcribed data in a diacritized form. Therefore, the corpora that do include detailed transcriptions are fairly small and any dialectal data that might become available in the future will also very likely be of limited size. By contrast, it is much easier to collect publicly available data (e.g. broadcast news data) and to transcribe it in script form.</Paragraph>
      <Paragraph position="6"> In order to be able to take advantage of such resources, we need to restore short vowels and other missing diacritics in the transcription.</Paragraph>
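      <Paragraph position="7"> To make the gap between script and diacritized forms concrete, the following minimal Python sketch removes the diacritics from two diacritized words and shows that they collapse onto one script form. It assumes text encoded in Unicode, where the tanween, short-vowel, shadda and sukun marks occupy U+064B to U+0652; the function names and the two example words (kataba, kutiba) are illustrative and do not come from the systems discussed here.

```python
# Assumption: Unicode text; tanween, short vowels, shadda and sukun
# occupy the code points U+064B..U+0652.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Return the script (non-diacritized) form of a diacritized string."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def group_by_script_form(diacritized_words):
    """Map each script form to the set of diacritized forms that share it."""
    groups = {}
    for word in diacritized_words:
        groups.setdefault(strip_diacritics(word), set()).add(word)
    return groups


# Two illustrative diacritizations of the same consonants k-t-b.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf+fatha, ta+fatha, ba+fatha
kutiba = "\u0643\u064F\u062A\u0650\u0628\u064E"  # kaf+damma, ta+kasra, ba+fatha

groups = group_by_script_form([kataba, kutiba])
for script_form, forms in groups.items():
    print(script_form, "has", len(forms), "diacritized forms in this sample")
```

Restoring diacritics is the inverse of strip_diacritics: given a script form, a diacritizer has to choose one member of the corresponding group.</Paragraph>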
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Prior Work
</SectionTitle>
      <Paragraph position="0"> Various software companies have developed automatic diacritization products for Arabic.</Paragraph>
      <Paragraph position="1"> However, all of these are targeted towards MSA; to our knowledge, there are no products for dialectal Arabic. In a previous study (Kirchhoff et al., 2002) one of these products was tested on three different texts, two MSA texts and one ECA text. It was found that the diacritization error rate (percentage of missing and wrongly identified or inserted diacritics) on MSA ranged between 9% and 28%, depending on whether or not case vowel endings were counted. However, on the ECA text, the diacritization software obtained an error rate of 48%.</Paragraph>
      <Paragraph position="2"> A fully automatic approach to diacritization was presented in (Gal, 2002), where an HMM-based bigram model was used for decoding diacritized sentences from non-diacritized sentences. The technique was applied to the Quran and achieved 14% word error (incorrectly diacritized words).</Paragraph>
      <Paragraph position="3"> A first attempt at developing an automatic diacritizer for dialectal speech was reported in (Kirchhoff et al., 2002). The basic approach was to use a small set of parallel script and diacritized data (obtained from the ECA CallHome corpus) and to derive diacritization rules in an example-based way. This entirely knowledge-free approach achieved a 16.6% word error rate. Other studies (El-Imam, 2003) have addressed problems of grapheme-to-phoneme conversion in Arabic, e.g. for the purpose of speech synthesis, but have assumed that a fully diacritized version of the text is already available. Several knowledge sources are available for determining the most appropriate diacritization of a script form: analysis of the morphological structure of the word (including segmentation into stems, prefixes, roots and patterns), consideration of the syntactic context in which the word form appears, and, in the context of speech recognition, the acoustic data that accompanies the transcription. Specific dictionary information could in principle be added (such as information about proper names), but this knowledge source is ignored for the purpose of this study. All of the approaches described above make use of text-based information only and do not attempt to use acoustic information.</Paragraph>
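      <Paragraph position="4"> The error figures quoted in this section can be illustrated with a small Python sketch. It computes a diacritization error rate (missing, wrong or inserted diacritics relative to the number of diacritized letters in the reference) and a word error rate (fraction of words with at least one diacritization error). The cited studies do not spell out their exact scoring procedure, so the normalization, the per-letter alignment and all function names below are assumptions.

```python
# Assumption: Unicode text with diacritics in U+064B..U+0652.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_letters(word):
    """Split a word into (base letter, sorted attached diacritics) pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            pairs[-1][1].append(ch)
        else:
            pairs.append((ch, []))
    return [(base, "".join(sorted(marks))) for base, marks in pairs]


def error_rates(reference_words, hypothesis_words):
    """Return (diacritization error rate, word error rate) as fractions.

    Assumed scoring: one slot per reference letter that carries diacritics;
    a slot mismatch covers missing, wrong and inserted diacritics.
    """
    if len(reference_words) != len(hypothesis_words):
        raise ValueError("reference and hypothesis must align word by word")
    slot_errors = ref_slots = word_errors = 0
    for ref, hyp in zip(reference_words, hypothesis_words):
        ref_pairs, hyp_pairs = split_letters(ref), split_letters(hyp)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError("script forms of aligned words differ")
        for (_, ref_marks), (_, hyp_marks) in zip(ref_pairs, hyp_pairs):
            ref_slots += bool(ref_marks)
            slot_errors += ref_marks != hyp_marks
        word_errors += ref_pairs != hyp_pairs
    der = slot_errors / max(ref_slots, 1)
    wer = word_errors / max(len(reference_words), 1)
    return der, wer


# Illustrative data: the second word is diacritized incorrectly.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
kutiba = "\u0643\u064F\u062A\u0650\u0628\u064E"
der, wer = error_rates([kataba, kutiba], [kataba, kataba])
print(f"diacritization error rate: {der:.1%}, word error rate: {wer:.1%}")
```

In this toy example two of the six reference diacritics are wrong and one of the two words is affected, which shows why the two measures can differ considerably on the same output.</Paragraph>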
    </Section>
  </Section>
</Paper>