<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1017"> <Title>Arabic Finite-State Morphological Analysis and Generation</Title> <Section position="4" start_page="89" end_page="90" type="metho"> <SectionTitle> 3 History </SectionTitle> <Paragraph position="0"> In 1989 and 1990, with colleagues at ALPNET (Beesley, 1989; Beesley, Buckwalter and Newton, 1989; Buckwalter, 1990; Beesley, 1990), I built a large two-level morphological analyzer for Arabic using a slightly enhanced implementation of KIMMO-style two-level morphology (Koskenniemi, 1983, 1984; Gajek, 1983; Karttunen, 1983). Traditional two-level morphology (see Figure 2), as in the publicly available PC-KIMMO implementation (Antworth, 1990), allows only concatenation of morphemes in the morphotactics. Lexicons are stored and manipulated at runtime as a forest of letter trees, with each tree typically containing a single class of morphemes, with the leaves connected to subsequent morpheme trees via a system of &quot;continuation classes&quot;. A letter path through the lexical trees from a legal starting state to a final leaf defines an abstract or &quot;lexical&quot; string. The various two-level rules, which had to be hand-compiled into finite-state transducers, were run in parallel by code that simulated their intersection. The rules allowed and controlled the variations between the lexical strings and the surface strings being analyzed: thus the Arabic surface word wdrsl could be matched with the lexical string wa+daras+al, among others, via appropriate rules.</Paragraph> <Paragraph position="1"> In the ALPNET Arabic system, roots and patterns were stored in separate trees in the lexical forest, and an algorithm, called Detouring, performed the interdigitation of Semitic roots and patterns into stems at runtime.
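The interdigitation step just described can be sketched in a few lines: each radical (consonant) slot of a pattern is filled, in order, with the next consonant of the root. This is a minimal illustrative reconstruction, not ALPNET's actual Detouring code; the function name is hypothetical.

```python
def interdigitate(root, pattern):
    """Fill each C (radical) slot of the pattern with the next root consonant.

    Illustrative sketch of root-and-pattern interdigitation; vowels and
    other pattern characters pass through unchanged.
    """
    radicals = iter(root)
    return ''.join(next(radicals) if ch == 'C' else ch for ch in pattern)

print(interdigitate('drs', 'CaCaC'))   # daras
print(interdigitate('drs', 'CuCiC'))   # duris
```

Detouring performed essentially this merge at runtime for every root-pattern pair it tried, which is what made unsanctioned combinations so costly.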
The other challenges of Arabic morphological variation and orthography, including varying amounts of diacritical marking, all succumbed to rather complex but completely traditional two-level rules. While the resulting system was successfully sold and is also currently being used as the morphological engine of an Arabic project at the University of Maryland, it suffers from many well-known limitations of traditional two-level morphology.</Paragraph> <Paragraph position="2"> 1. As there was no automatic rule compiler available to us, the rules had to be compiled into finite-state transducers by hand, a tedious task that often influences the linguist to simplify the rules by postulating a rather surfacy lexical level. Hand-compilation of a complex rule, which can easily take hours, is a real disincentive to change and experimentation. 2. Because there was no algorithm to intersect the rule transducers, over 100 of them in the ALPNET system, they are stored separately and must each be consulted separately at each step of the analysis. As the time necessary to move a rule transducer to a new state is usually independent of its size, moving 100 transducers at runtime can be 100 times slower than moving a single intersected transducer.</Paragraph> <Paragraph position="3"> 3. Because the lexical letter trees in a traditional KIMMO-style system are decorated with glosses, features and other miscellaneous information on the leaves, they are not pure finite-state machines, cannot be combined into a single fsm, cannot be composed with the rules, and have to be stored and run as separate data structures.</Paragraph> <Paragraph position="4"> 4. Various diacritical features inserted into the lexical strings to ensure proper analyses made this and other KIMMO-style systems awkward or impractical for generation.</Paragraph> <Paragraph position="5"> 5.
Finally, in the enhanced ALPNET implementation, the storage of almost 5000 roots and hundreds of patterns in separate sublexicons saved memory space, but the Detouring operation that interdigitated them in realtime was inherently inefficient, building and then throwing away many superficially plausible stems that were not sanctioned by the lexicon codings. (Any Arabic root can combine legally with only a small subset of the possible patterns.) With the building of phantom stems and the unavoidable backtracking caused by the overall deficiency and ambiguity of written Arabic words, the resulting system was rather slow, analyzing about 2 words per second on a small IBM mainframe.</Paragraph> </Section> <Section position="5" start_page="90" end_page="92" type="metho"> <SectionTitle> 4 Reimplementation </SectionTitle> <Paragraph position="0"> Work began in 1995 to convert the analysis to the Xerox fst format. The ALPNET lexicons were first converted into the format of lexc, the lexicon compiler (Karttunen and Beesley, 1992). Although lexc by itself is largely limited to concatenative morphotactics, just like traditional two-level morphology, it was noted that the interdigitation of Semitic roots and patterns is nothing more or less than their intersection, an operation supported in the Xerox finite-state calculus. Thus if ? represents any letter, and C represents any radical (consonant), the root drs can be interpreted as ?*d?*r?*s?*.</Paragraph> <Paragraph position="1"> The intersection of this root with the pattern CaCaC yields the stem daras. See Figure 3. In some analyses (e.g. McCarthy, 1981), the voweling of the pattern is also abstracted out, leaving pattern templates like CVCVC and a vocalic element that can be formalized as ?*a?*a?*. If V represents a vowel, then the intersection of the root, template and vocalic elements yields the same result.
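The claim that interdigitation is simply intersection can be checked with a brute-force sketch: enumerate all strings of the pattern's shape over a small alphabet and keep those that also belong to the root's language. The regular expressions below mirror ?*d?*r?*s?* and CaCaC; the tiny consonant inventory is an assumption for illustration, and real finite-state tools compute the intersection directly rather than by enumeration.

```python
import re
from itertools import product

CONS = 'drst'                                    # illustrative mini-alphabet
root = re.compile(r'^.*d.*r.*s.*$')              # the root drs as ?*d?*r?*s?*
pattern = re.compile(r'^[%s]a[%s]a[%s]$' % (CONS, CONS, CONS))  # CaCaC

# Brute-force intersection of the two regular languages: generate every
# CaCaC-shaped candidate and keep those also matching the root language.
stems = [s
         for c1, c2, c3 in product(CONS, repeat=3)
         for s in [f'{c1}a{c2}a{c3}']
         if root.match(s) and pattern.match(s)]
print(stems)   # ['daras']
```

Only daras survives the intersection, exactly the stem produced by interdigitating drs with CaCaC.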
See Figure 4.</Paragraph> <Paragraph position="2"> Using standard operations available through the lexc compiler and other finite-state tools, the analysis can be constructed according to the taste and needs of the linguists.</Paragraph> <Paragraph position="3"> Because the upper-side string is returned as the result of an analysis, it is often more helpful to define the upper-side string as a baseform (here a root) followed by a set of symbol tags designed to represent relevant morphosyntactic features of the analysis. For example, daras happens to be the Form I perfect active stem based on the root drs, with CVCVC being the Form I perfect template and aa the active voweling; duris, with voweling ui, is the parallel passive example. If +FormI, +Perfect, +Active and +Passive are defined as single symbols, and if +FormI+Perfect maps to CVCVC, and if +Active maps to aa and +Passive to ui, the analyses can be constructed as in Figures 5 and 6.</Paragraph> <Paragraph position="4"> After composition of the relevant transducers, the intermediate levels disappear, resulting in a direct mapping between the upper and lower levels shown. The resulting single transducer is called the lexicon transducer.</Paragraph> <Paragraph position="5"> All valid stems, currently about 85,000 of them, are automatically intersected, at compile time, at one level of the analysis. Suitable prefixes and suffixes are also present in the lexicon transducer, added in the normal concatenative ways.</Paragraph> <Paragraph position="6"> Stems like daras and duris, and especially those like banay based on &quot;weak&quot; roots, are still quite abstract and idealized compared to their ultimate surface realizations.</Paragraph> <Paragraph position="7"> Finite-state rules map the idealized strings into surface strings, handling all kinds of epentheses, deletions and assimilations.
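The tag-to-stem construction described above can be sketched as two table lookups followed by the root-template-vocalism intersection, which for these fully specified templates reduces to slot filling. The tag inventory (+FormI+Perfect, +Active, +Passive) comes from the text; the dictionaries and the helper function are hypothetical stand-ins for the composed transducers.

```python
# Tag mappings taken from the text: +FormI+Perfect -> CVCVC template,
# +Active -> aa vocalism, +Passive -> ui vocalism.
TEMPLATES = {'+FormI+Perfect': 'CVCVC'}
VOCALISMS = {'+Active': 'aa', '+Passive': 'ui'}

def stem(root, form_tags, voice_tag):
    """Intersect root, template and vocalism.

    With a fully specified CVCVC template, intersection amounts to filling
    each C slot with the next radical and each V slot with the next vowel.
    """
    radicals = iter(root)
    vowels = iter(VOCALISMS[voice_tag])
    return ''.join(next(radicals) if ch == 'C' else next(vowels)
                   for ch in TEMPLATES[form_tags])

print(stem('drs', '+FormI+Perfect', '+Active'))   # daras
print(stem('drs', '+FormI+Perfect', '+Passive'))  # duris
```

In the real system these mappings are compiled into transducers and composed away, leaving a direct upper-to-lower mapping; the dictionaries here just make the two lookup layers visible.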
The twolc rule compiler (Karttunen and Beesley, 1992) is able not only to recompile the rules automatically but to intersect them into a single rule fst. This rule fst is then composed on the bottom of the lexicon fst, yielding a single Lexical Transducer. The symbol .o. in Figure 7 indicates composition.</Paragraph> <Paragraph position="8"> Another transducer is also composed on top of the lexicon fst to map various rule-triggering features, no longer needed, into epsilon and to enforce various long-distance morphotactic restrictions.</Paragraph> <Paragraph position="9"> All intermediate levels disappear in the compositions, and one is left with a single two-level lexical transducer that contains surface strings on the bottom and lexical strings, including roots and tags, on the top. A typical transduction is shown in Figure 8, where the final t is the surface realization of the third-person feminine singular suffix -at. Fully voweled, the surface string for this reading would be darasat. Because short vowels are seldom written in surface words, drst is also analyzed as the Form I perfect passive third-person singular, which would be fully voweled as durisat, and as several other forms.</Paragraph> <Paragraph position="10"> At runtime, strings being analyzed are simply matched along paths on the bottom side of the lexical transducer, and the solution strings are read off of the matching top side. Like all finite-state transducers, it also generates as easily as it analyzes, literally by running the transducer &quot;backwards&quot;.</Paragraph> <Paragraph position="11"> The Arabic system runs in exactly the same way, using the same runtime code, as the lexical transducers for other languages like English, French and Spanish. The Arabic system is, however, substantially slower than the
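The two directions of the lexical transducer can be sketched with a toy version of the darasat/durisat example: generation deletes short vowels (as in typical unvoweled orthography), and analysis returns every lexical string whose surface realization matches. The dictionary stands in for the compiled lexical transducer, and the extra person/gender/number tags (+3P +Fem +Sg) are an assumed notation, not taken from the paper.

```python
SHORT_VOWELS = set('aui')

def generate(voweled):
    """Surface realization: drop short vowels, keeping consonants."""
    return ''.join(ch for ch in voweled if ch not in SHORT_VOWELS)

# Toy "lexical transducer": lexical string -> fully voweled form.
# The +3P +Fem +Sg tags are hypothetical illustration, not the paper's inventory.
LEXICON = {
    'drs +FormI +Perfect +Active +3P +Fem +Sg':  'darasat',
    'drs +FormI +Perfect +Passive +3P +Fem +Sg': 'durisat',
}

def analyze(surface):
    """Running the mapping "backwards": all lexical strings matching the surface."""
    return [lex for lex, voweled in LEXICON.items()
            if generate(voweled) == surface]

print(analyze('drst'))   # both the active and the passive reading
```

The ambiguity of unvoweled surface strings is visible even here: a single surface form drst maps back to multiple lexical analyses, which is one source of the slowdown discussed next.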
other languages, because the ambiguity of the surface words forces many dead-end analysis paths to be explored and because more valid solutions have to be found and returned. The mismatch between the concatenated root and pattern on the lexical side and the intersected stem on the lower side also creates an Arabic system that is substantially larger than those for the other languages.</Paragraph> </Section> </Paper>