<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1047"> <Title>LOGIC COMPRESSION OF DICTIONARIES FOR MULTILINGUAL SPELLING CHECKERS</Title> <Section position="4" start_page="0" end_page="292" type="metho"> <SectionTitle> 1. OVERVIEW OF EXISTING METHODS </SectionTitle> <Paragraph position="0"> 1.1. Grammar-based approach These methods were used in the beginning, on early computers, when storage space was expensive. The approach consists in building a small lexicon containing roots and affixes, a grammar of rules that express the morphographemic alternations, and an engine that uses the grammar and the lexicon to see if an input word belongs to the language or not. If the process of recognition fails, some operations (substitution, insertion,...) are performed on the misspelled word to provide a list of candidate words that helps the user to select the correct form.</Paragraph> <Paragraph position="1"> Even though it is a great accomplishment to design a powerful engine \[3\] \[8\] and to express rules in a pseudo-natural way \[9\], even for different languages \[1\] \[2\] \[11\], these systems present some limits: - Multilingualism: This method does not support all languages. To offer a multilingual solution for n languages you have to store n grammars and n lexicons, and generally n different engines, into the host application.</Paragraph> <Paragraph position="2"> - Cost of retrieval: For some languages, the retrieval of words may be long. For instance, a vocalized Arabic spell checker must accept non-vocalized or partially vocalized words, which require more time to be accepted than fully vocalized words.</Paragraph> <Paragraph position="3"> - Cost of guessing alternatives for a misspelled word: To guess a correct word when a misspelled word is found, we have to modify the misspelled word by all possible operations (substitution, insertion, suppression,...) for 1 or 2 characters and then try to check them.
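The guessing step just described can be sketched as follows. This is a minimal illustration, not the paper's implementation; the Latin alphabet and the toy lexicon are invented for the example.

```python
# Sketch of candidate generation for a misspelled word: apply every
# single-character operation (substitution, insertion, suppression),
# then keep only the variants found in the lexicon.

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed; the paper targets Arabic

def one_edit_variants(word):
    """All strings at edit distance 1 from `word`."""
    variants = set()
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])          # suppression
        for c in ALPHABET:
            variants.add(word[:i] + c + word[i + 1:])  # substitution
    for i in range(len(word) + 1):
        for c in ALPHABET:
            variants.add(word[:i] + c + word[i:])      # insertion
    variants.discard(word)
    return variants

def candidates(misspelled, lexicon):
    """Variants of the misspelled word that belong to the lexicon."""
    return sorted(one_edit_variants(misspelled) & set(lexicon))

print(candidates("chekker", {"checker", "check", "checked"}))
```

The variant set grows with word length and alphabet size, and doubles again if distance-2 edits are tried, which is why this step is costly.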
This matter can take a lot of time before displaying the correct forms to end users. - Maintaining the grammars and data: The grammars and lexicon require continuous updating. You need to find a multilingual computational linguist who knows the linguistic theory and the formalism to easily update data and rules \[8\].</Paragraph> <Paragraph position="4"> - Ergonomic features: In some languages, end users want to have some options that let them choose how the spell checker will accept words. In Arabic, for example, different regions have slightly different orthographical conventions. 1.2. Lexical-based approach: The lexical-based approach appeared after the first methods described above, when storage space became less expensive. The first step is to build a complete list of surface forms belonging to the language using morphological generators, SLLP (Specialized Languages for Linguistic Programs), etc., and then to compress the large word-dictionary. These systems are generally used for office applications such as word processors, desktop presentation, etc. Their main advantage is that they cover a complete language, since all the forms can be found in the initial list. Also, they allow efficient retrieval and guessing of misspelled words \[4\]. However, some limits exist in such systems: - Multilingualism: The compression process gives a good ratio for languages with a weak inflexion factor (English,...), where the compression mechanism gives up to 150 KB of storage from around 3 MB of a full list \[4\]. The compression technologies are still powerful for languages with a medium inflexion factor (Russian,...). For example, a list of all surface Russian words of between 10 and 15 MB in size can be reduced to 700 KB \[4\]. For languages with a high inflexion factor (Arabic, Finnish, Hungarian,...), it won't be easy to find compression technologies that give practical results \[4\].
For instance, a full list of completely vocalized words in Arabic is 300 MB in size, and the current compression methods are impractical. - No morphological knowledge: These methods are neutral with respect to the text language; the efficiency of compression techniques may be improved by using specific properties of the language \[4\].</Paragraph> <Paragraph position="5"> II. A FIRST APPROACH: ADAPTING AN EXISTING METHOD FOR ARABIC II.1. Using an existing method As a first step, we take an efficient method used to compress dictionaries for European (English, French,...) spelling checkers \[4\] and try to apply it to Arabic. The first step of our work consists in building a full list of surface forms using a morphological generator \[5\], completed by all irregular forms and an existing corpus. The final large word-dictionary, which covers non-vocalized Arabic, has a size of 75 MB. The compression process yields 18 MB in a compressed format. For an idea of the compression process, readers can refer to \[10\]. Table 1 gives some results of the compression process for a few European languages, to show the efficiency of the method and its inadequacy for the Arabic language.</Paragraph> <Paragraph position="6"> The result for Arabic is impractical for small computers. We must then find other techniques that produce a smaller dictionary, or extend this method to get an exploitable solution. II.2. Extension of the method: The initial idea is applied to the morphological system of Arabic. Since most of the fully inflected forms in Arabic are built by adding prefixes and suffixes to a stem, we propose replacing some words with only one form, beginning with a special code that represents a family of prefixes and finishing with another special code which represents a family of suffixes.
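The replacement scheme proposed above can be sketched as follows, under assumed toy data. The family codes, English affixes and stems are invented for illustration; they are not the paper's Arabic classes.

```python
# Sketch of the extension: whenever every combination prefix+stem+suffix
# for a prefix family PSi and a suffix family SSj appears in the full list,
# the whole family is replaced by the single coded form "<PSi>stem<SSj>".
# Uncompressed words remain in the reduced list unchanged, so the reduced
# lexicon introduces no silence (missing words) and no noise (wrong words).

def logic_compress(words, prefix_families, suffix_families, stems):
    words = set(words)
    reduced = set(words)
    for stem in stems:
        for pi, prefixes in prefix_families.items():
            for sj, suffixes in suffix_families.items():
                family = {p + stem + s for p in prefixes for s in suffixes}
                if family <= words:          # every combination is attested
                    reduced -= family
                    reduced.add(f"<{pi}>{stem}<{sj}>")
    return reduced

prefix_families = {"PS1": ["", "re"]}        # "" stands for the empty prefix
suffix_families = {"SS1": ["", "s", "ed"]}
full_list = ["play", "plays", "played", "replay", "replays", "replayed"]
print(sorted(logic_compress(full_list, prefix_families, suffix_families, ["play"])))
```

Here six surface forms collapse to the single entry `<PS1>play<SS1>`, which is the kind of reduction the coded representation is meant to achieve.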
For this purpose, we wrote a program in MPW-C that processes a full list of inflected forms and, using an existing decomposition of affixes into sub-sets already established, gives the reduced lexicon where many forms are replaced by only one representation (PSi stem SSj), where PSi (with respect to SSj) is the set i (with respect to j) of prefixes (with respect to suffixes). Note that the reduced lexicon represents faithfully the initial list, without any silence (missing words) or noise (incorrect words). Only compressed words are replaced, and the rest remain in the reduced list. Figure 1 gives an example of words, an example of a decomposition, and the obtained result.</Paragraph> <Paragraph position="7"> The next crucial problem to solve is to find the best decomposition, the one that provides the best reduced lexicon. The method must be automatic: it must process the large word-dictionary and, given an initial list of prefixes and suffixes, must give as output the best decomposition and the optimal reduced dictionary. But, before studying the implementation of such an algorithm, we began to see how much space we could gain by this technique, starting from a manual decomposition. Manual method: Starting from different full lists for each category of words (transitive verbs, nouns,...), we chose different decompositions and processed the full list with the compression tool. The best decomposition kept for each category was the decomposition which eliminated the maximum of forms. This method gave many candidate decompositions depending on the grammatical category of the word. To choose the best global one, we took into account the frequency of dictionary entries. This method was tested on different Arabic word lists and some results are described here. Readers can refer to \[10\] or \[11\] for more information.
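The manual selection procedure can be sketched as follows. The candidate decompositions and the word list are invented for illustration, and the frequency weighting used to pick the global winner is omitted for brevity.

```python
# Sketch of the manual method: score each candidate affix decomposition
# by how many fully inflected forms it eliminates from the list, and keep
# the decomposition with the best score.

def eliminated_forms(words, prefixes, suffixes, stem):
    """Number of forms removed if this (prefixes, suffixes) family applies."""
    family = {p + stem + s for p in prefixes for s in suffixes}
    # A family only compresses when every combination is attested.
    return len(family) if family <= set(words) else 0

def best_decomposition(words, stem, decompositions):
    return max(decompositions,
               key=lambda ps: eliminated_forms(words, ps[0], ps[1], stem))

words = ["play", "plays", "played", "replay", "replays", "replayed"]
decompositions = [
    (["", "re"], ["", "s"]),        # eliminates 4 forms
    (["", "re"], ["", "s", "ed"]),  # eliminates 6 forms
    (["", "un"], ["", "s", "ed"]),  # "unplay..." unattested: eliminates 0
]
print(best_decomposition(words, "play", decompositions))
```

As the paper notes, doing this exhaustively over all decompositions is combinatorial, which is why an automatic optimization procedure is left for later work.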
To see some decompositions, consider the following sets: E1 = {wa, fa}, E2 = {la, sa}, E3 = {ha, at}, ...</Paragraph> <Paragraph position="9"> F1 = {tom, ..., ta, ...}, F2 = {ya, ...}, ...</Paragraph> <Paragraph position="10"> F6 = {ha, ..., ya, ka, ..., houma, ...}, F7 = F6 \ {ya, ...} ∪ {ni}, F9 = {wa}, ...</Paragraph> <Paragraph position="11"> Ei (with respect to Fj) is a set of prefixes (with respect to suffixes). We note by Ei.Ej (with respect to Fi.Fj) the set of all strings built by a concatenation of each element of Ei (with respect to Fi) with each element of Ej (with respect to Fj). Using the decomposition already found for Arabic (6 classes of prefixes, 13 classes of suffixes; each class containing an average of 8 affixes), we processed a collection of non-vocalized Arabic dictionaries (17 MB); the result gave a reduced lexicon of 254 KB. Using this in combination with the compression process described in SS 1.2, the final result is 121 KB. Note also that part of this work was implemented in a commercial multilingual word processor (WinText (c)) to offer Arabic spell checking.</Paragraph> <Paragraph position="12"> III. LOGIC COMPRESSION: III.1. Theoretical aspects: Let V be a finite set and V* the set of words built on V, including the null string noted ε.</Paragraph> <Paragraph position="13"> W ∈ V*. W = W1W2...Wn. Wi ∈ V.</Paragraph> <Paragraph position="14"> i ∈ \[1..n\]. Let V+ = V* - {ε}.</Paragraph> <Paragraph position="15"> Let Y be the sub-set of V that contains the vowels.</Paragraph> <Paragraph position="16"> 1. Prefix(W). ∀ W ∈ V+.</Paragraph> <Paragraph position="17"> We call order i prefix the quantity: Prefix_i(W) = W1W2...Wi, 1 ≤ i ≤ n.</Paragraph> <Paragraph position="19"> We call vocalic pattern of W the set: VP(W) = {(i, Wi) | Wi ∈ Y}.</Paragraph> <Paragraph position="21"> We call root the quantity: Root(W) = the sub-word of W obtained by deleting every Wi ∈ Y.</Paragraph> <Paragraph position="23"> 5. Pi: Prefixes class. Pi = {ε, Pi1, Pi2, ..., Pik}. Pij is a prefix.
1 ≤ j ≤ k. Card(Pi) = k + 1 if k ≥ 1,</Paragraph> <Paragraph position="24"> = 1 if Pi = {ε}. 6. Sj: Suffixes class. Sj = {ε, Sj1, Sj2, ..., Sjk}. Sji is a suffix. 1 ≤ i ≤ k. Card(Sj) = k + 1 if k ≥ 1,</Paragraph> <Paragraph position="25"> = 1 if Sj = {ε}.</Paragraph> <Paragraph position="26"> 7. Vl: Vowel class.</Paragraph> <Paragraph position="28"> Let's take the following automaton, which represents some surface vocalized words (fig 2). Pij is a prefix, 1 ≤ j ≤ n.</Paragraph> <Paragraph position="29"> Sji is a suffix, 1 ≤ i ≤ n.</Paragraph> <Paragraph position="30"> Ci are the consonants of the vocabulary, 1 ≤ i ≤ k.</Paragraph> <Paragraph position="31"> Vij is the vowel attached to the consonant Cj, 1 ≤ i ≤ q and 1 ≤ j ≤ k.</Paragraph> <Paragraph position="32"> ε is the null string.</Paragraph> <Paragraph position="33"> This automaton recognizes all words, beginning from an initial state (marked by *) and finishing in a final state (marked by a double circle). The number of arcs of such an automaton is:</Paragraph> </Section> <Section position="5" start_page="292" end_page="292" type="metho"> <SectionTitle> 11 II </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> If we consider, for example, that affixes have a single character, the number of arcs is equal to 2(n+1) + 2q(k-1).</Paragraph> <Paragraph position="3"> The logic compression consists in supplying the classes of prefixes, suffixes and vowels, and replacing each set by only one arc that represents a family of prefixes, suffixes or vowels.</Paragraph> <Paragraph position="4"> Starting from the following sets already established:</Paragraph> <Paragraph position="6"> vocalic pattern stored as z.</Paragraph> <Paragraph position="7"> The logic compression reduces the initial automaton to this new one: the number of arcs kept in the automaton is equal to 3 + k. The set Vl contains a sub-set of k vowels which must be applied to the last k characters.</Paragraph> <Paragraph position="8"> III.3.
Experiments: The logic compression with only an affix decomposition, built by the manual method explained above, has been tested on various lists of words that represent collections of multilingual dictionaries (lists of inflected forms). Three languages are tested: non-vocalized Arabic, which has a great inflexion factor; French, which has a weak inflexion factor; and Russian, which has a medium inflexion factor. Experiments are done in two ways: first by using our logic compression alone and, then, in conjunction with other methods, by supplying the reduced lexicon (the list of compressed words in text format) obtained with our method as input to existing methods. The three other methods tested are the following: - Physical compression: Using a commercial physical process (Stuffit).</Paragraph> <Paragraph position="9"> - Morpho-physical compression: This method was used to compress dictionaries used to build a spell checker \[4\]. It combines morphological properties by taking into account the suffixes of the language, but without any link between them. It also contains some physical features \[7\]. - FSM (Finite-State Machine) compression: Using the Lexc (Finite State Lexicon Compiler) tool, which allows the conversion of a list of surface forms into a transducer which is then minimized \[8\].</Paragraph> <Paragraph position="10"> Results are described in table 2.</Paragraph> <Paragraph position="11"> III.4. Interpretations: The most interesting thing observed in this table is the improvement obtained when we combine our method with a previous one. These results show that the existing methods are not optimal and can be improved by using our logic compression as a first step. These important gains in storage space should not hide other aspects of spell checker systems (retrieval and guessing).
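The two-stage pipeline tested here can be sketched as follows. A trivial shared-suffix factoring stands in for the paper's affix decomposition, and zlib stands in for the commercial physical compressor (Stuffit); both substitutions are assumptions made for the sake of a runnable example.

```python
# Sketch of the combined pipeline: first reduce the word list logically
# (families of inflected forms collapse to one coded entry), then apply a
# physical byte-level compressor to the reduced text.

import zlib

def reduce_lexicon(words, suffixes):
    """Replace stem+suffix families, when fully attested, by 'stem<S1>'."""
    reduced = set(words)
    stems = {w[: len(w) - len(s)] for w in words
             for s in suffixes if w.endswith(s)}
    for stem in stems:
        family = {stem + s for s in suffixes}
        if family <= reduced:
            reduced -= family
            reduced.add(stem + "<S1>")
    return sorted(reduced)

suffixes = ["", "s", "ed", "ing"]
words = sorted(w + s for w in ["talk", "walk", "jump"] for s in suffixes)
reduced = reduce_lexicon(words, suffixes)
raw = "\n".join(words).encode()
packed = zlib.compress("\n".join(reduced).encode())
print(reduced)
print(len(raw), len(packed))
```

Even on this toy list, the physically compressed reduced lexicon is smaller than the raw word list, mirroring the combined gains reported in table 2.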
It would be interesting if the results given in the table were followed by other results showing improvements in the retrieval and guessing of words.</Paragraph> <Paragraph position="12"> IV. A PROPOSED ARCHITECTURE OF A UNIVERSAL SPELLING CHECKER: Figure 3 shows the architecture of our proposed universal spelling checker. Our method is inspired from previous methods (SS 1.2), but presents some new original aspects that allow it to be considered a truly multilingual solution. In summary, our system has the following features: * Multilingualism: this method will ensure the multilingual constraint by using different tools, specific to each language, to create a list of all surface forms. * Storage space: by introducing the logic compression into the compression process, we will be able to get a reduced lexicon for whatever language we have to use. One task that still remains is to improve the logic compression by making the task of finding the best decomposition more automatic. This problem is combinatorial; we must discover how to apply optimization algorithms (genetic algorithms, stochastic algorithms,...) in each case to find an optimal reduced lexicon, starting from the large word-dictionary and primitive morphological knowledge (lists of affixes and vowels).</Paragraph> <Paragraph position="13"> * Retrieval/guessing: even though we haven't any concrete results now, the first experiments show that the process of checking words in an FSM formalism is faster than other existing methods. Furthermore, we are exploring paths to introduce functions (similarity key,...) into the final obtained lexicon to make rapid guessing of replacements for misspelled words possible.</Paragraph> </Section></Paper>