<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1004"> <Title>Finite State Automata and Arabic Writing</Title> <Section position="3" start_page="0" end_page="28" type="metho"> <SectionTitle> 1 CONTEXTUAL ANALYSIS </SectionTitle> <Paragraph position="0"> Whatever coding is chosen, from a typesetting or a computational point of view, there must be different codes for the different shapes of a letter. So all arabicized software has to use two coding systems: the reduced code we have just introduced and the extended code in which the different shapes have different codes (using Klaus Lagally's ArabTeX). Until Unicode, no standard existed for the second one. So all arabicized software has to solve the problem of choosing the right shape of every printed or displayed letter.</Paragraph> <Section position="1" start_page="26" end_page="26" type="sub_section"> <SectionTitle> 1.1 Rules for letter shape determination </SectionTitle> <Paragraph position="0"> This determination, frequently known as contextual analysis, can be summarized by the following set of informal rules: 1. At the beginning of a word: * If the letter is a binding letter it takes the INITIAL shape.</Paragraph> <Paragraph position="1"> * If it is a non-binding one it takes the ISOLATED shape.</Paragraph> <Paragraph position="2"> 2. In the middle of a word (there is at least one letter following the current one): (a) If the letter is a binding letter then * If it follows a binding letter it takes the MEDIAL shape.</Paragraph> <Paragraph position="3"> * If it follows a non-binding letter it takes the INITIAL shape.</Paragraph> <Paragraph position="4"> (b) If the letter is a non-binding letter * If it follows a binding letter it takes the FINAL shape.</Paragraph> <Paragraph position="5"> * If it follows a non-binding letter it takes the ISOLATED shape.</Paragraph> <Paragraph position="6"> 3. 
At the end of a word (for both types of letters) * If it follows a binding letter it takes the FINAL shape.</Paragraph> <Paragraph position="7"> * If it follows a non-binding letter it takes the ISOLATED shape.</Paragraph> </Section> <Section position="2" start_page="26" end_page="26" type="sub_section"> <SectionTitle> 1.2 Moore and Mealy automata </SectionTitle> <Paragraph position="0"> Moore automata are state-assigned output machines: the output function assigns an output symbol to each state. They differ from Mealy automata, which are transition-assigned finite state machines, where output symbols are associated with transitions between states. Mealy automata are sometimes called finite transducers.</Paragraph> <Paragraph position="1"> The two machine types have been shown to produce the same input-output mappings 3.</Paragraph> <Paragraph position="2"> 3 See (Aho and Ullman, 1972) and (Hopcroft and Ullman, 1979) for a full account of these matters. Mealy automata are certainly a better choice when bidirectional applications are considered.</Paragraph> <Paragraph position="3"> As the task is to identify successions of symbols of a certain type, we found it clearer to use a Moore automaton.</Paragraph> </Section> <Section position="3" start_page="26" end_page="28" type="sub_section"> <SectionTitle> 1.3 A Moore automaton for contextual analysis </SectionTitle> <Paragraph position="0"> 1.3.1 Source language of the automaton It follows from the determination rules that we need to know which particular letter we are dealing with only at the output stage. Until then, all we have to know is whether it is a binding or a non-binding letter 4. The alphabet of the automaton should be A = {#} ∪ L, where L is the set of Arabic letters present in the reduced code and # denotes the word boundaries. The alphabet will then be partitioned into three sets, A → A' = {{#}, N, B}, N being the set of non-binding letters and B the set of binding letters. 
If n and b denote arbitrary elements of N and B respectively, the source language of the automaton can be reduced to:</Paragraph> <Paragraph position="2"> where V denotes disjunction and * is the Kleene star, or to the simple automaton: initial states = {1}, final states = {5}, transitions 4 As this question has only been taken as an example, the alphabet has been oversimplified. A full working automaton should cope, as far as Arabic is concerned, with two additional problems: the hamza on the line, to which no preceding letter can be bound, and the lām-alif ligature. It should also give a proper treatment of non-Arabic letters and symbols. But this would not affect the method described here.</Paragraph> <Paragraph position="3"> The alphabet for the target language L2, given what has been said before and using the same method of partitioning and then reducing the alphabet, could at first sight be:</Paragraph> <Paragraph position="5"> where I denotes a letter in isolated shape, and i, m and f stand for initial, medial and final shape.</Paragraph> <Paragraph position="6"> But letters from N have only two shapes, final and isolated. Moreover, isolated and final shapes of letters from B can only appear at the end of a word, which is not the case for the corresponding shapes of letters from N. So the following modified version of A2 will be preferred:</Paragraph> <Paragraph position="8"> where In stands for the isolated shape of a letter from N, and so on. 
With these symbols the target language L2 can be described by the regular expression:</Paragraph> <Paragraph position="10"> where ε denotes, as usual, the empty string.</Paragraph> <Paragraph position="11"> The translation of a sequence of L1 into a legal sequence of L2 can be performed by the following automaton: initial states = {1}, final states = {8}</Paragraph> <Paragraph position="13"> This automaton is clearly nondeterministic.</Paragraph> <Paragraph position="14"> This is because a letter from B can appear in final or isolated shape when situated at the end of a word, and in initial or medial shape when another letter follows it. Because of this nondeterminism, every transition should appear as a set. When this set is a singleton, the &quot;only&quot; state has been written without braces for easier reading.</Paragraph> <Paragraph position="15"> It can easily be augmented to take account of occasional short vowels or shadda 5 that could occur: the added transitions would make the automaton loop on the same state, whichever it is, since vowels or shadda can only appear after a consonant and do not influence its shape.</Paragraph> <Paragraph position="16"> This program is a straightforward translation of the grammar and automaton described above.</Paragraph> <Paragraph position="17"> The predicate test makes it possible to limit the generation of inputs to a given length. 
In the results we chose to limit the length of the input to 6, word boundaries included.</Paragraph> <Paragraph position="19"/> </Section> </Section> <Section position="4" start_page="28" end_page="1800000" type="metho"> <SectionTitle> 2 WRITING OF LETTER HAMZA </SectionTitle> <Paragraph position="0"> The hamza can be written in five different manners (أ, إ, ؤ, ئ, ء) depending mainly upon: * its position within the word * the preceding and the following vowel. As the choice made for coding was to adhere to a linguistic point of view, there should have been only one code for all these shapes and carrying consonants. But, as has just been said, to determine the correct writing of hamza, one has to know the surrounding vowels, and it is common knowledge that Arabs do not usually write short vowels. Since these essential data are missing, no algorithm can perform this task for common uses such as displaying a text on a screen. Thus, the ASMO decided to have distinct codes for the different carriers of hamza, but not, of course, for their different shapes, which can be determined as seen before.</Paragraph> <Paragraph position="1"> So why is this question of any interest? If we consider NLP applications for Arabic, it could be worth considering this problem at the generation stage. For instance, many vowel alternations occur in the conjugation of verbs, and when a hamza is present in the verb root, the hamza writing will vary accordingly.</Paragraph> <Paragraph position="2"> For example, the verb qara'a - he (has) read - changes to yaqra'ūna - they read (present) - and to quri'a - it (has) been read. And at the generation stage vowels are known even if we decide not to write them.</Paragraph> <Paragraph position="3"> The only alternative would be to put all the forms in a dictionary. At CERTAL, our philosophy is to use all possible means to reduce the size of dictionaries. 
Hence this question appeared to us worth studying.</Paragraph> <Section position="1" start_page="28" end_page="30" type="sub_section"> <SectionTitle> 2.1 Rules of hamza writing </SectionTitle> <Paragraph position="0"> &quot;i&quot; as in 'i‘lām - information 2. When a hamza is within a word (i.e. preceded and followed by some consonant) it is written * over an alif (أ) when - preceded by a sukūn and followed by an &quot;a&quot; as in yas'alu - he asks - preceded by an &quot;a&quot; and followed by a sukūn as in ya'kulu - he eats - preceded by an &quot;a&quot; and followed by an &quot;a&quot; as in sa'ala - he (has) asked * over a waw (ؤ) when - preceded by a sukūn and followed by a &quot;u&quot; as in yab'usu - he is strong, brave - preceded by an &quot;a&quot; and followed by a &quot;u&quot; or an &quot;ū&quot; as in ya'ūbu - to return or to suffer - preceded by a &quot;u&quot; and followed by a sukūn as in yu'thiru - he prefers - preceded by a &quot;u&quot; and followed by an &quot;a&quot; as in yu'aththiru - he influences - preceded by a &quot;u&quot; and followed by a &quot;u&quot; or an &quot;ū&quot; as in bu'ūsun - distresses - preceded by an &quot;ū&quot; and followed by a &quot;u&quot; * over a ya (ئ) when - preceded by an &quot;i&quot; whatever the following vowel, as in bi'run - well - and bi'ārun - plural of the same word - followed by an &quot;i&quot; whatever the preceding vowel, as in qā'idun - leader, director, commandant,...</Paragraph> <Paragraph position="1"> * without any carrying letter when - preceded by an &quot;ā&quot; and followed by an &quot;a&quot; as in badā'atun - beginning - preceded by an &quot;ū&quot; and followed by an &quot;a&quot; as in yasū'āni - they (both) become bad 3. When a hamza is at the end of a word it is written * without any carrier when - the preceding vowel is a sukūn 6 as in 
juz'un - a part - the preceding vowel is an &quot;ā&quot; as in ajzā'un, plural of the same word - the preceding vowel is an &quot;ū&quot; as in yasū'u - it becomes bad - the preceding vowel is an &quot;ī&quot; as in yajī'u - he arrives * over alif when the preceding vowel is an &quot;a&quot; and the following is one of &quot;a&quot;, &quot;an&quot;, &quot;u&quot;, &quot;un&quot; as in mubtada'un</Paragraph> <Paragraph position="3"> * over waw when the preceding vowel is &quot;u&quot; as in jaru'a - he (has) risked - yajru'u - he risks * over ya when the preceding vowel is &quot;i&quot; as in khati'un, al-khati'a, al-khati'i - wrong. A full account of the rules governing hamza writing has just been given. Usual presentations of hamza writing add to these rules the rules of madda (آ) writing. Madda is a contraction used for a hamza followed by an &quot;ā&quot;, or for a hamza followed by a hamza itself followed by a sukūn. This happens in some derivations or conjugations, thus we consider it as belonging to the whole set of transformations which occur in those cases: ākulu ← 'a'kulu - I eat; ākhad_a ← 'aakhad_a - he blamed. Besides, except in elementary schools and Koranic recitation, nobody cares about final short vowels. So, if the last vowel is not long, it is treated as if it were a sukūn, i.e. no vowel. 
This is always true of modern Arabic, and it reduces the number of rules involved at the end of a word.</Paragraph> </Section> <Section position="2" start_page="30" end_page="1800000" type="sub_section"> <SectionTitle> 2.2 A Moore automaton for hamza writing </SectionTitle> <Paragraph position="0"> With the aforementioned restrictions these rules can also be implemented as a Moore automaton.</Paragraph> <Paragraph position="1"> It follows from the determination rules that we have to know * whether the consonant to be processed is a hamza (whatever its carrier has to be) or not, * whether a vowel is present before or after the hamza, * and if so, what the surrounding vowels are (short or long).</Paragraph> <Paragraph position="2"> Again the presence of a shadda is not relevant and can be treated as mentioned for the contextual analysis. The alphabet for the source language L3 can be, using the same method as before:</Paragraph> <Paragraph position="4"> where hz is a hamza with any carrier, l any consonant other than hamza and su stands for sukūn. The only other constraints for this language are: 1. a sukūn can * neither follow the first consonant * nor follow a consonant already preceded by a sukūn 2. a hamza cannot follow another hamza 7. The regular expression corresponding to L3 would be too complicated to be really clarifying, so we shall go directly to the definition of a generating automaton for this language: initial states = {1}, final states = {21}, transitions. Because of the narrowness of the columns in this style, the transition tables have been divided into two parts. 
The last column of the second table gives the output corresponding to every state.</Paragraph> <Paragraph position="5"> (Transition table columns: hz, l, a, u, i.) The only differences from the source language lie in the distinct carriers for the letter hamza: A4 = {#, l, hwc, hoa, hua, how, hoy, su, a, u, i, ā, ū, ī} where hwc stands for hamza without a carrier, hoa for hamza on alif, hua for hamza under alif, how for hamza on waw and hoy for hamza on ya.</Paragraph> <Paragraph position="6"> A PROLOG program similar to the one used for contextual analysis gives the following results:</Paragraph> </Section> </Section> </Paper>