<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1605"> <Title>Systematic Verb Stem Generation for Arabic [?]</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Morphological parsers and analysers for Arabic are required to dissect an input word and analyse its components in order to perform even the simplest of language processing tasks. The letters of the majority of Arabic words undergo transformations rendering their roots unrecognisable. Without the root, it is difficult to identify a word's morphosemantic template, which is necessary for pinpointing its meaning, or its morphosyntactic pattern, which is essential for realising properties of the verb, such as its tense, voice, and mode, and its subject's number, gender, and person. It is fundamental that an analyser be able to reverse the transformations a word undergoes in order to match the separated root and template with the untransformed ones in its database. Unfortunately, defining rules to reverse transformations is not simple.</Paragraph> <Paragraph position="1"> [?] The authors wish to thank the anonymous reviewers of this article as their suggestions have improved it significantly. Research in Arabic morphology has primarily focused on morphological analysis rather than stem generation.</Paragraph> <Paragraph position="2"> Sliding window algorithms (El-Affendi, 1999) use an approximate string matching approach of input words against lists of roots, morphological patterns, prefixes, and suffixes. Algebraic algorithms (El-Affendi, 1991), on the other hand, assign binary values to morphological patterns and input words, then perform some simple algebraic operations to decompose a word into a stem and affixes. 
Permutation algorithms (Al-Shalabi and Evens, 1998) use the input word's letters to generate all possible trilateral or quadrilateral sequences without violation of the original order of the letters, which are then compared with items in a dictionary of roots until a match is found. Linguistic algorithms (Thalouth and Al-Dannan, 1990; Yagi and Harous, 2003) remove letters from an input word that belong to prefixes and suffixes and place the remainder of the word into a list. The members of this list are then tested for a match with a dictionary of morphological patterns.</Paragraph> <Paragraph position="3"> The primary drawback of many of these techniques is that they attempt to analyse using the information found in the letters of the input word. When roots form words, root letters are often transformed by replacement, fusion, inversion, or deletion, and their positions are lost between stem and affix letters. Most attempts use various closest match algorithms, which introduce a high level of uncertainty. In this paper, we define Arabic verb stems such that root radicals, morphological patterns, and transformations are formally specified. When stems are defined this way, input words can be mapped to correct stem definitions, ensuring that transformations match root radicals rather than estimate them.</Paragraph> <Paragraph position="4"> Morphological transformation in our definition is largely built around finite state morphology (Beesley, 2001) which assumes that these transformations can be represented in terms of regular relations between regular language forms. Beesley (2001) uses finite state transducers to encode the intersection between roots, morphological patterns, and the transformation rules that account for morphophonemic phenomena such as assimilation, deletion, epenthesis, metathesis, etc.</Paragraph> <Paragraph position="5"> In this paper, a description of the database required for stem generation is presented, followed by a definition of stem generation. 
Then the database together with the definition are used to implement a stem generation engine. This is followed by a suggestion for optimising stem generation. Finally, a database of generated stems is compiled in a format useful to various applications that the conclusion alludes to. In the course of this paper, roots are represented in terms of their ordered sequence of three or four radicals in a set notation, i.e., {F,M,L,Q}. When the capitalised Roman characters F, M, L, and Q are used, they represent a radical variable or place holder. They stand for First Radical (F), Medial Radical (M), Last Radical in a trilateral root (L), and Last Radical in a quadrilateral root (Q).</Paragraph> <Paragraph position="6"> For readability, all Arabic script used here is followed by an orthographic transliteration between parentheses, using the Buckwalter standard. Buckwalter's orthographic transliteration provides a one-to-one character mapping from Arabic to US-ASCII characters. With the exception of a few characters, this transliteration scheme attempts to match the sounds of the Roman letters to the Arabic ones.</Paragraph> <Paragraph position="7"> The following list is a subset of the less obvious transliterations used here: thaliso (@), alifmaksuraiso (Y), a (a), i (i), u (u), (o), and W (~).</Paragraph> <Paragraph position="9"> Arabic stems can be generated if lists of all roots and all morphological patterns are provided. It is necessary that this data be coupled with a database that links the roots with their morphological patterns (or templates) so that only valid stems are generated for each root. The roots in this database may be moulded with morphosemantic and morphosyntactic patterns to generate intermediate form stems. The stems may then be transformed into final surface forms with a number of specific morphophonemic rules using a finite state transducer compiling language. Figure 1 shows a summary of the stem generation tables and their relations. 
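As a rough illustration of the coupling just described, the following Python sketch holds roots as ordered radical sets, restricts each root to its valid templates through a link table, and moulds the two into intermediate (untransformed) stems. The sample roots, the template strings, the identifiers, and every function name are hypothetical choices made for this sketch; they are not taken from the database or engine described in this paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Root:
    """A root as an ordered radical set {F, M, L, Q}; Q is None for trilateral roots."""
    F: str
    M: str
    L: str
    Q: Optional[str] = None

    def radicals(self) -> Dict[str, str]:
        slots = {"F": self.F, "M": self.M, "L": self.L}
        if self.Q is not None:
            slots["Q"] = self.Q
        return slots


# Hypothetical sample data, written in Buckwalter transliteration.
ROOTS: Dict[int, Root] = {1: Root("k", "t", "b"), 2: Root("d", "H", "r", "j")}
TEMPLATES: Dict[int, str] = {10: "FaMaLa", 11: "FuMiLa", 20: "FaMoLaQa"}
# Link table: root identifier -> identifiers of the templates valid for that root.
LINKS: Dict[int, List[int]] = {1: [10, 11], 2: [20]}


def mould(root: Root, template: str) -> str:
    """Replace each placeholder (F, M, L, Q) in the template with the root's radical."""
    slots = root.radicals()
    return "".join(slots.get(ch, ch) for ch in template)


def generate(root_id: int) -> List[str]:
    """Generate intermediate stems only for the templates linked to this root."""
    root = ROOTS[root_id]
    return [mould(root, TEMPLATES[t]) for t in LINKS.get(root_id, [])]


print(generate(1))
print(generate(2))
```

Because generation is driven by the link table rather than by the full cross product of roots and templates, a root never receives a template that is not recorded as valid for it.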
The RootsList table contains all verb roots from the popular Arabic dictionary, Al-Waseet (Mustapha et al., 1972), with F, M, L, and Q representing the table fields for up to four radicals per root. A root identifier is used to link this table to the Template table. The Template table lists all morphosemantic and morphosyntactic patterns used to generate stems from roots of a certain type. This table also specifies the syntactic properties of stems (voice and tense) generated by using the template entry. The MainDictionary table links the</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> RootsList and Template tables together and specifies </SectionTitle> <Paragraph position="0"> which entries apply to which roots.</Paragraph> <Paragraph position="1"> Stems generated with these tables are unaffixed stems. The affix id field links each entry to a subject pronominal affix table that uses transformation rules generating affixed stems. Although object pronominal affixes are not dealt with in this paper, they are generally agglutinating in nature and therefore cause no morphophonemic alterations to a stem. They can be added for generation or removed for analysis without affecting the stem at all.</Paragraph> <Paragraph position="2"> Affixation and transformation rules are both specified using PERL regular expressions (Friedl, 2002). Regular expressions (Regexp) is an algebraic language that is used for building finite state transducers (FSTs) that accept regular languages. In the next section, Regexp is used to perform morphophonemic transformations and to generate affixed forms of stems. 
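To make the idea of a regular-expression transformation that keeps track of letter origins more concrete, here is a minimal Python sketch. It uses Python's re module as a stand-in for the PERL regular expressions mentioned above. The single rule shown (a medial radical w between two template vowels a surfacing as a long vowel A) is a simplified, hypothetical example, and the tagging scheme and all names are assumptions of this sketch rather than the rules or data structures of the paper.

```python
import re
from typing import Dict, List, Tuple

# Each letter carries an origin tag: F, M, L, Q for a radical, T for the template.
Tagged = List[Tuple[str, str]]
# A rule: letter pattern, required origin tags, replacement letters, their origin tags.
Rule = Tuple[re.Pattern, str, str, str]


def mould_tagged(radicals: Dict[str, str], template: str) -> Tagged:
    """Fill the template, remembering which letters came from which radical."""
    return [(radicals[ch], ch) if ch in radicals else (ch, "T") for ch in template]


def surface(stem: Tagged) -> str:
    return "".join(letter for letter, _ in stem)


# Hypothetical, simplified rule: "awa" whose origins are T, M, T becomes "aA",
# with the new long vowel still attributed to the medial radical M.
HOLLOW_RULE: Rule = (re.compile("awa"), "TMT", "aA", "TM")


def apply_rule(stem: Tagged, rule: Rule) -> Tuple[Tagged, List[str]]:
    """Apply one rule and return the new tagged stem plus a log of what changed."""
    pattern, need_origins, repl, repl_origins = rule
    letters = surface(stem)
    origins = "".join(origin for _, origin in stem)
    out: Tagged = []
    log: List[str] = []
    pos = 0
    for m in pattern.finditer(letters):
        # Skip matches whose letters do not come from the expected sources.
        if origins[m.start():m.end()] != need_origins:
            continue
        out.extend(stem[pos:m.start()])
        out.extend(zip(repl, repl_origins))
        log.append(f"{m.group(0)} -> {repl} at position {m.start()}")
        pos = m.end()
    out.extend(stem[pos:])
    return out, log


intermediate = mould_tagged({"F": "q", "M": "w", "L": "l"}, "FaMaLa")
final, trace = apply_rule(intermediate, HOLLOW_RULE)
print(surface(intermediate), surface(final))
print(final)
print(trace)
```

In this sketch the intermediate stem qawala becomes the surface form qaAla, while the tags still record that the A descends from the medial radical and the log records which rewrite was applied. Checking the origin tags before rewriting keeps the rule from firing on a w that is not the medial radical.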
If generated stems are to be useful for root extraction and morphological analysis, it is essential at every stage of generation to be able to track exactly which letters are members of the root radical set, which belong to the template, and what transformations occur on the untransformed stem producing the final surface form.</Paragraph> </Section> </Section> </Paper>