<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1605">
  <Title>Systematic Verb Stem Generation for Arabic</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Definition of Stem Generation
</SectionTitle>
    <Paragraph position="0"> In order to be useful in analysis applications, Arabic stems need to be in a surface form which will only undergo agglutinating changes for any further morphological modification. Stems should be defined in terms of the root radicals, morphosemantic and morphosyntactic template letters, and morphophonemic alterations. By doing so, inverting stem transformations becomes trivial. We require the automatic stem generator to always be aware of the origin of each of the letters in the stems it generates and to be able to distinguish between letters in the original radical set and letters in the template string. The stem generator may then be used to compile a complete list of all affixed stems from database roots while retaining all transformation information. The resulting list of stems may then be turned into a searchable index that holds the complete morphological analysis and classification for each entry.</Paragraph>
    <Paragraph position="1"> Since words of Arabic origin can have a maximum of four root radicals, a root radical set R is defined in terms of the ordered letters of the root as follows:</Paragraph>
    <Paragraph position="3"> In the database, pattern, root, variant, and voice-tense IDs identify a particular morphological pattern s. Templates are used to generate a stem from a root.</Paragraph>
    <Paragraph position="4"> The text of s is defined in terms of the letters and diacritics of the template in sequence (x1...xl) and the radical position markers or place holders (hF, hM, hL, and hQ), that indicate the positions that letters of the root should be slotted into:</Paragraph>
    <Paragraph position="6"> Stem Generator (SG) uses regular expressions as the language for compiling FSTs for morphophonemic transformations. Transformation rules take into account the context of root radicals in terms of their positions in the template and the nature of the template letters that surround them. Transformations are performed using combinations of regular expression rules applied in sequence, in a manner similar to how humans are subconsciously trained to process the individual transformations. The resulting template between one morphophonemic transformation and the next is an intermediate template. However, in order to aid the next transformation, the transformed radicals are marked by inserting their place holders before them. For example, hF rehisoa hM seenisoa hL meemisoa (FraMsaLma) is an intermediate template formed by the root radical set R = {rehiso, seeniso, meemiso} ({r, s, m}) and the morphological pattern s = hFa hMa hLa (FaMaLa).</Paragraph>
    <Paragraph position="7"> To create the initial intermediate template i0 from the radical set R and morphological pattern s, a function Regexp(String,SrchPat,ReplStr) is defined to compile FSTs from regular expressions.</Paragraph>
    <Paragraph position="8"> The function accepts in its first argument a string that is tested for a match with the search pattern (SrchPat) in its second argument. If SrchPat is found, the matching characters in String are replaced with the replace string (ReplStr). This function is assumed to accept the standard PERL regular expression syntax.</Paragraph>
    <Paragraph position="9"> A function, CompileIntermediate(R,s), accepts the radical set R and morphological pattern s to compile the first intermediate template i0. A regular expression is built to make this transformation. It searches the morphological pattern text for radical place holders and inserts their respective radical values after them. Since Regexp performs substitutions instead of insertions, replacing each marker with itself followed by its radical value is effectively equivalent to inserting its radical value after it. Let p be a search pattern that matches all occurrences of place holders hF, hM, hL, or hQ in the morphological pattern, then an initial intermediate form i0 may be compiled in the following manner:</Paragraph>
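The construction of i0 described above can be sketched in a few lines. This is only an illustration under stated assumptions: Python's re.sub stands in for the Regexp function (no FST is actually compiled), the place holders hF, hM, hL and hQ are written as the single characters F, M, L and Q, and the function names regexp and compile_intermediate are hypothetical spellings of the names in the text.

```python
import re

# Assumption: the place holders hF, hM, hL and hQ are written as the single
# upper-case characters F, M, L and Q, and the transliteration of letters
# and diacritics never uses those four characters.
PLACE_HOLDER = r"[FMLQ]"


def regexp(string, srch_pat, repl_str):
    """Replace every match of srch_pat in string with repl_str."""
    return re.sub(srch_pat, repl_str, string)


def compile_intermediate(radicals, pattern):
    """Build i0 by re-emitting each place holder followed by its radical."""
    def marker_plus_radical(match):
        marker = match.group(0)
        return marker + radicals[marker]
    return regexp(pattern, PLACE_HOLDER, marker_plus_radical)


i0 = compile_intermediate({"F": "r", "M": "s", "L": "m"}, "FaMaLa")
print(i0)  # FraMsaLma
```

With the radical set {r, s, m} and the pattern FaMaLa this reproduces the intermediate template FraMsaLma quoted earlier in the section.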
    <Paragraph position="11"> plied on each intermediate template to create subsequent intermediate templates. Transformation rules are defined as:</Paragraph>
    <Paragraph position="13"> A second function Transform(i,t) is required to perform transformations. A subsequent intermedi-</Paragraph>
    <Paragraph position="15"> At any point in the transformation process, the current transformed state of radicals (R') and template string (s') may be decomposed from the current intermediate template as follows:</Paragraph>
    <Paragraph position="17"> To turn final intermediate template im into a proper stem, a regular expression is built that deletes the place holders from the intermediate template. To do this with a regular expression, the place holders matched are replaced with the null string during the matching process as follows:</Paragraph>
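As a hedged sketch of the two operations just described, the snippet below decomposes an intermediate template into its radicals and template string, and deletes the place holders to obtain a proper stem. It assumes that place holders are written F, M, L, Q and that each one is followed by exactly one radical character; the names decompose and proper_stem are illustrative, not taken from the paper.

```python
import re

# Assumptions: place holders are the characters F, M, L, Q, and every
# place holder in an intermediate template is followed by exactly one
# (possibly transformed) radical character.
MARKED_RADICAL = re.compile(r"([FMLQ])(.)")


def decompose(intermediate):
    """Split an intermediate template into radicals R' and template s'."""
    radicals = {m.group(1): m.group(2)
                for m in MARKED_RADICAL.finditer(intermediate)}
    template = MARKED_RADICAL.sub(r"\1", intermediate)
    return radicals, template


def proper_stem(intermediate):
    """Replace every place holder with the null string."""
    return re.sub(r"[FMLQ]", "", intermediate)


radicals, template = decompose("FraMsaLma")
print(radicals, template)        # {'F': 'r', 'M': 's', 'L': 'm'} FaMaLa
print(proper_stem("FraMsaLma"))  # rasama
```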
    <Paragraph position="19"> Additional morphosyntactic templates or affixation rules further modify proper stems for person, gender, number, and mode. Affixation rules are regular expressions like transformation rules. However, these rules modify final intermediate templates by adding prefixes, infixes, or suffixes, or modifying or deleting stem letters. They require knowledge of the radical positions and occasionally their morphophonemic origins. Adding affixes to a stem operates on the intermediate template which retains the necessary information.</Paragraph>
    <Paragraph position="20"> Let a be the affixation rule that is being applied to a certain intermediate template:</Paragraph>
    <Paragraph position="22"> Now using the function Transform that was defined earlier, affixes are added to im to produce the intermediate affixed template im+1:</Paragraph>
    <Paragraph position="24"> may remove place holders using the following:</Paragraph>
    <Paragraph position="26"> With this definition, generated stems are described by intermediate templates. Intermediate templates retain knowledge of the current state of template and radical letters without losing the ability to recall their origins. This algorithm, therefore, would avoid guesswork in the identification of root radicals. Automatic rule-based stem generation and analysis are both facilitated by this feature of intermediate templates.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Stem Generation Engine
</SectionTitle>
    <Paragraph position="0"> A stem generation engine may be built on the basis of the definition just advanced. The three components, Stem Transformer, Affixer, and Slotter, applied in sequence, make up SG. Stem Transformer applies the appropriate transformation rules to the morphological pattern; Affixer adds specific affixes to the transformed template; and Slotter applies the radicals to the transformed affixed template to produce the final affixed stem.</Paragraph>
    <Paragraph position="1"> SG begins with a stem ID from the MainDictionary table as input to Stem Transformer (see Figure 1). The root and entry associated with the stem ID are used to identify the radicals of the root, the morphological pattern string, a list of transformation rules, and an affix table ID.</Paragraph>
    <Paragraph position="3"> Stem Transformer applies transformation rules that are localised to the root radicals and letters of the template in the contexts of one another. To prepare the template and root for transformation, the engine begins by marking radicals in the template.</Paragraph>
    <Paragraph position="4"> Stem Transformer is applied incrementally using the current radical set, the template string, and one transformation rule per pass, as in Figure 2. The output of each pass is fed back into Stem Transformer in the form of the jth-rule-transformed template string and radicals, along with the (j+1)th transformation rule. When all rules associated with the template are exhausted, the resultant template string and radicals are output to the next phase.</Paragraph>
    <Paragraph position="5"> To illustrate, assume the morphological pattern</Paragraph>
    <Paragraph position="7"> Stem Transformer generates a proper stem using the following steps: Equation 3 above creates the initial intermediate template when passed the radical set and morphological template, thus producing:</Paragraph>
    <Paragraph position="9"> The first transformation rule t1 = 1, t1 ∈ T, is a regular expression that searches for a tehiso (t) following hF and replaces tehiso (t) with a copy of rF. To transform i0 into i1 with rule t1, Equation 5 is used, thus producing:</Paragraph>
    <Paragraph position="11"> to i1. The gemination regular expression searches for an unvowelled letter followed by a vowelled duplicate and replaces it with the geminated vowelled letter. Once more, Equation 5 is used to make the transformation:</Paragraph>
    <Paragraph position="13"> To obtain the proper stem from the intermediate template, the final intermediate template i2 may be substituted into Equation 7:</Paragraph>
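The pass-by-pass behaviour of Stem Transformer on this example can be approximated as follows. The sketch relies on the transliterations that the paper gives for the same example in Section 5 (pattern AiFotaMaLa, intermediate form AiF@o@aMkaLra, surface form Ai@~akara); the two regular expressions are a guess at rules 1 and 12 written for Python's re module, and the function names are hypothetical.

```python
import re

# Assumed transliteration: A = alif, @ = thal, t = teh, k = kaf, r = reh,
# a / i = short vowels, o = sukun (no vowel), ~ = shadda (gemination).
# F, M, L stand for the place holders hF, hM, hL.
RULES = [
    (r"(F(.)o)t", r"\1\2"),  # rule 1: teh after hF becomes a copy of rF
    (r"(.)o\1", r"\1~"),     # rule 12: unvowelled letter + duplicate -> geminate
]


def transform(intermediate, rule):
    """Apply one transformation rule to an intermediate template."""
    search, replace = rule
    return re.sub(search, replace, intermediate)


def stem_transformer(i0, rules):
    """Apply the rules one per pass, keeping every intermediate template."""
    templates = [i0]
    for rule in rules:
        templates.append(transform(templates[-1], rule))
    return templates


i0 = "AiF@otaMkaLra"  # pattern AiFotaMaLa with radicals {@, k, r} marked
templates = stem_transformer(i0, RULES)
stem = re.sub(r"[FMLQ]", "", templates[-1])
print(templates)  # ['AiF@otaMkaLra', 'AiF@o@aMkaLra', 'AiF@~aMkaLra']
print(stem)       # Ai@~akara
```

Keeping the whole list of intermediate templates, rather than only the last one, mirrors the requirement that every transformation step remain recoverable.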
    <Paragraph position="15"> To summarise, the final output of Stem Transformer is a root moulded into a template and a template-transformed radical set. These outputs are used as input to the affixation phase which succeeds stem transformation. Affixer, applied iteratively to the product of Stem Transformer, outputs 14 different subject-pronominally affixed</Paragraph>
    <Paragraph position="17"> morphosyntactic forms for every input except the imperative which only produces 5. There are 9 different tense-voice-mode combinations per subject pronominal affix, so most roots produce 117 affixed stems per dictionary entry. Affixer is run with different replace strings that are specific to the type of affix being produced. It modifies copies of the transformed stem from the previous phase, as in Figure 3. Using the example cited shortly before, Affixer is passed the last intermediate template im and the affix regular expression a. In this example, a is a regular expression that searches for hLrL and replaces it with hLrLa tehiso (LrLato); this corresponds to the past active third person feminine singular</Paragraph>
    <Paragraph position="19"> In the last stage of stem generation, Slotter replaces the place holders in the transformed template with the transformed radical set, producing the final form of the affixed stem. For the example, the result of applying Equation 10 is:</Paragraph>
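A rough sketch of the affixation and slotting steps on the running example is given below. It is not the paper's rule: the exact search pattern is not recoverable from this text, so the sketch assumes the affix rule consumes everything from hL to the end of the template before appending the suffix. It also works on the merged intermediate template, so slotting reduces to deleting the place holders. The transliteration (o for sukun, ~ for shadda, F, M, L for place holders) and the function names are assumptions.

```python
import re

# Assumed affix rule for the past active third person feminine singular:
# keep hL and its radical, drop whatever follows, append "ato".
AFFIX_PAST_3FS = (r"(L.).*$", r"\1ato")


def affixer(intermediate, affix_rule):
    """Apply one affix rule to the final intermediate template."""
    search, replace = affix_rule
    return re.sub(search, replace, intermediate, count=1)


def slotter(intermediate):
    """Remove the place holders, leaving the affixed surface stem."""
    return re.sub(r"[FMLQ]", "", intermediate)


final_template = "AiF@~aMkaLra"  # output of the two transformation rules
affixed_template = affixer(final_template, AFFIX_PAST_3FS)
print(affixed_template)           # AiF@~aMkaLrato
print(slotter(affixed_template))  # Ai@~akarato
```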
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Optimisation
</SectionTitle>
    <Paragraph position="0"> Data produced for the use of SG was designed initially with no knowledge of the actual patterns and repetitions that occur with morphophonemic and affix transformation rules. In fact, SG is made to create stems this way: A root is added to a morphosemantic template, then morphosyntactic templates are applied to it, inducing in some patterns morphophonemic transformation. However, while this may be useful in many language teaching tools, it is extremely inefficient. The original data was used to discover patterns that would allow stems to be created in an optimal manner.</Paragraph>
    <Paragraph position="1"> Following the classification in Yaghi (2004), there are 70 verb root types associated with 44 theoretically possible morphological patterns. There is an element of repetition present in the classification. In addition, the Template table lists sequences of rules that operate on morphological patterns in a manner similar to how native speakers alter patterns phonemically. These rules could be composed into a single FST that yields the surface form.</Paragraph>
    <Paragraph position="2"> For example, in the previous section, the morphophonemic transformation rule set T = {1,12} could have been written into one rule. In its non-optimised form the rule duplicates rF in place of tehiso (t) creating intermediate form alifisoi hF thaliso thalisoa hM kafisoa hL rehisoa (AiF@o@aMkaLra) and then deletes the first of the duplicate letters and replaces it with a gemination diacritic that is placed on the second repeat letter. The resulting surface form is alifisoithalisofathashaddakafiniarehfina (Ai@~akara). Instead, one rule could achieve the surface form by replacing the letter tehiso (t) in the template with a geminated thaliso (@) yielding the same result.</Paragraph>
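The equivalence claimed here can be checked with a small sketch. The regular expressions are assumptions written for Python's re module in the transliteration used by the paper's parenthesised forms (o for sukun, ~ for shadda, @ for thal, F, M, L for place holders); they are not the rules stored in the database.

```python
import re

i0 = "AiF@otaMkaLra"  # initial intermediate template for radicals {@, k, r}

# Two-pass version: copy rF over teh, then collapse the duplicate.
after_rule_1 = re.sub(r"(F(.)o)t", r"\1\2", i0)
two_pass = re.sub(r"(.)o\1", r"\1~", after_rule_1)

# One-pass version: replace sukun + teh directly with a gemination mark
# on the first radical.
one_pass = re.sub(r"(F.)ot", r"\1~", i0)

print(after_rule_1)  # AiF@o@aMkaLra
print(two_pass)      # AiF@~aMkaLra
print(one_pass)      # AiF@~aMkaLra
```

Both routes end in the same intermediate template, and deleting the place holders from it gives the surface form Ai@~akara.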
    <Paragraph position="3"> Compiling separate regular expressions for each transformation rule is costly in terms of processing time especially when used with back-references, as SG does. Back-references group a sub-pattern and refer to it either in the search pattern or substitute string. Such patterns are not constant and are required to be recompiled for every string they are used with. It is desirable, therefore, to minimise the number of times patterns are compiled. To optimise further, the transformation may be made on the morphological pattern itself, thus producing a sound surface form template. This procedure would eliminate the need to perform morphophonemic transformations on stems.</Paragraph>
    <Paragraph position="4"> Each template entry in the Template table (see Figure 1) is given a new field containing the surface form template. This is a copy of the morphological pattern with morphophonemic transformations applied. A coding scheme is adopted that continues to retain letter origins and radical positions in the template so that this will not affect affixation. Any transformations that affect the morphological pattern alone are applied without further consideration. The coding scheme uses the Roman characters F, M, L, and Q to represent place holders in the templates.</Paragraph>
    <Paragraph position="5"> Each place holder is followed by a single digit indicating the type of transformation that occurs to the radical slotted in that position. The codes have the following meanings: 0 = no alteration, 1 = deletion, 2 = substitution, 3 = gemination. If the code used is 2, then the very next letter is used to replace the radical to which the code belongs.</Paragraph>
    <Paragraph position="6"> Take, for example, the Template table entry for the root type 17 (all roots with F = wawiso (w) and L = yehiso (y)), its morphological pattern alifisoi hF tehisoa hMa hLa (AiFotaMaLa), and its variant (ID 0). The morphophonemic transformation rules applied to the template are T = {20, 12, 31, 34, 112}. These rules correspond to the following:  The surface form template can be rewritten as alifisoi hF2tehisofathashadda hMa hL2alifmaksuraiso (AiF2t~aM0aL2Y). This can be used to form stems such as alifisoitehinifathashaddadalfinaalifmaksuraiso (Ait~adaY) by slotting the root {wawiso, daliso, yehiso} ({w,d,y}).</Paragraph>
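Slotting a root into a coded surface form template can be sketched as below, using the transliterated template AiF2t~aM0aL2Y from the example. The sketch assumes that only code 0 keeps the radical, and that for the other codes the letters realising the change are already written in the template; the function name slot is illustrative.

```python
import re

# Assumptions: place holders are F, M, L, Q; each is followed by one code
# digit (0 = no alteration, 1 = deletion, 2 = substitution, 3 = gemination),
# and the transliteration of template letters never uses F, M, L or Q.
CODED_SLOT = re.compile(r"([FMLQ])([0-3])")


def slot(surface_template, radicals):
    """Slot a root into a coded surface form template."""
    def fill(match):
        marker, code = match.group(1), match.group(2)
        # Only code 0 keeps the radical; for 1, 2 and 3 the radical is
        # dropped and the template letters that follow take its place.
        return radicals[marker] if code == "0" else ""
    return CODED_SLOT.sub(fill, surface_template)


root = {"F": "w", "M": "d", "L": "y"}
print(slot("AiF2t~aM0aL2Y", root))  # Ait~adaY
```

No back-references are needed, so a single precompiled pattern serves every root and template.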
    <Paragraph position="7"> The affix tables use a similar notation for coding their rules. Every affix rule indicates a change to be made to the surface form template and begins with a place holder followed by a code 0 or 2 unless the rule redefines the entire template, in which case the entry begins with a 0. Radical place holders in affix rules define changes to the surface form template. These changes affect the template from the given radical position to the very next radical position or the end of the template, whichever is first.</Paragraph>
    <Paragraph position="8"> Affix rules with code 0 following radical place holders signify that no change should be made to that section of the surface form template. However, a code 2 after a place holder modifies the surface form template in that position by replacing the letter that follows the code with the rest of that segment of the rule. Affix rules using code 2 after place holders override any other code for that position in the surface form template because affixation modifies morphophonemically transformed stems.</Paragraph>
    <Paragraph position="9"> Creating affixed stems from templates and affixes formatted in this way becomes far more efficient. If a surface form template was specified as alifisoi rF2tehisofathashadda rM0a rL2alifmaksuraiso (AiF2t~aM0aL2Y) and it was to be combined with the affix rule rL2yehiso tehisou meemisoa alifiso (L2yotumaA) then SG simply needs to align the affix rule with the surface form template using the place holder symbol in the affix rule and replace appropriately as in Table 1.</Paragraph>
    <Paragraph position="10"> With the resulting affixed surface form template SG may retain the radicals of the original root where they are unchanged, delete radicals marked with codes 1 and 3, and substitute letters following code 2 in place of their position holders. If the example above is used with the root {wawiso, daliso, yehiso} ({w, d, y}), the final stem is: alifisoitehinifathashaddadalfinyehinitehmedumeemmedaaliffin (Ait~adayotumaA, meaning &quot;the two of you have accepted compensation for damage&quot;).</Paragraph>
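The alignment and slotting just described can be sketched as follows, using the transliterated template AiF2t~aM0aL2Y and affix rule L2yotumaA. This is a simplified reading: a template segment runs from one place holder to the next (or to the end), a rule segment with code 2 replaces the matching template segment wholesale, and a rule segment with code 0 leaves it untouched. Rules that redefine the entire template are not covered, and the function names are hypothetical.

```python
import re

# A segment is a place holder, its code digit, and the letters up to the
# next place holder. Assumes template letters never use F, M, L or Q.
SEGMENT = re.compile(r"([FMLQ])([0-3])[^FMLQ]*")


def apply_affix(surface_template, affix_rule):
    """Overlay the code-2 segments of an affix rule onto the template."""
    overrides = {m.group(1): m.group(0)
                 for m in SEGMENT.finditer(affix_rule)
                 if m.group(2) == "2"}
    return SEGMENT.sub(lambda m: overrides.get(m.group(1), m.group(0)),
                       surface_template)


def slot(surface_template, radicals):
    """Keep radicals coded 0, drop those coded 1, 2 or 3."""
    return re.sub(r"([FMLQ])([0-3])",
                  lambda m: radicals[m.group(1)] if m.group(2) == "0" else "",
                  surface_template)


affixed = apply_affix("AiF2t~aM0aL2Y", "L2yotumaA")
print(affixed)                                        # AiF2t~aM0aL2yotumaA
print(slot(affixed, {"F": "w", "M": "d", "L": "y"}))  # Ait~adayotumaA
```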
    <Paragraph position="11"> To use the original regular expression transformations would take an average of 18000 seconds to produce a total of 2.2 million valid stems in the database. With the optimised coding scheme, the time taken is reduced to a mere 720 seconds; that is 4% of the original time taken.</Paragraph>
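For reference, the ratio and the throughput implied by these figures work out as below. The stems-per-second values are derived here from the quoted totals and are rounded; they are not reported in the paper.

```latex
\[
\frac{720\ \text{s}}{18000\ \text{s}} = 0.04 = 4\%,
\qquad
\frac{18000\ \text{s}}{720\ \text{s}} = 25
\]
\[
\frac{2.2 \times 10^{6}\ \text{stems}}{18000\ \text{s}} \approx 122\ \text{stems/s},
\qquad
\frac{2.2 \times 10^{6}\ \text{stems}}{720\ \text{s}} \approx 3056\ \text{stems/s}
\]
```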
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Generated Stem Database Compiler
</SectionTitle>
    <Paragraph position="0"> Once the dictionary database has been completed and debugged, an implementation of SG generates for every root, template, and affix the entire list of stems derived from a single root and all the possible template and affix combinations that may apply to that root entry. The average number of dictionary entries that a root can generate is approximately 2.5. Considering that each entry generates 117 different affixed stems, this yields an average of approximately 300 affixed stems per root. However, some roots (e.g., {kafiso,tehiso,behiso} ({k,t,b})) produce 13 different entries, which makes approximately 1,500 affixed stems for each of such roots.</Paragraph>
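The counts quoted here can be reproduced with a short calculation. The breakdown of 117 is one reading of the figures given in Section 4 (nine tense-voice-mode combinations, 14 subject affixes for each except the imperative, which has 5); the variable names are illustrative.

```python
# Counts taken from the text: 9 tense-voice-mode combinations, 14 subject
# affixes for each of them except the imperative, which has only 5.
combinations = 9
stems_per_entry = (combinations - 1) * 14 + 5

average_entries_per_root = 2.5
average_stems_per_root = average_entries_per_root * stems_per_entry

largest_root_entries = 13  # e.g. the root {k, t, b}
largest_root_stems = largest_root_entries * stems_per_entry

print(stems_per_entry)         # 117
print(average_stems_per_root)  # 292.5
print(largest_root_stems)      # 1521
```

The exact products, 292.5 and 1521, are what the text rounds to approximately 300 and approximately 1,500.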
    <Paragraph position="1"> The generated list is later loaded into a B-Tree structured database file that allows fast stem search and entry retrieval.</Paragraph>
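The kind of ordered lookup that such an index supports can be illustrated with a stand-in. The sketch below is not a B-Tree and does not touch a database file: it uses binary search over a sorted in-memory list from Python's bisect module, which gives the same logarithmic search behaviour on a small scale. The stems are transliterations from the examples in this paper, and the analysis strings are placeholders rather than real database fields.

```python
import bisect

# Hypothetical index entries: (stem, analysis) pairs kept sorted by stem.
ENTRIES = sorted([
    ("Ai@~akara", "root {@,k,r}, proper stem"),
    ("Ait~adaY", "root {w,d,y}, proper stem"),
    ("Ait~adayotumaA", "root {w,d,y}, affixed stem"),
])
KEYS = [stem for stem, _ in ENTRIES]


def lookup(stem):
    """Binary-search the sorted keys, as an ordered index would."""
    position = bisect.bisect_left(KEYS, stem)
    if position != len(KEYS) and KEYS[position] == stem:
        return ENTRIES[position][1]
    return None


print(lookup("Ait~adayotumaA"))  # root {w,d,y}, affixed stem
print(lookup("rasama"))          # None
```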
    <Paragraph position="2"> A web CGI was built that uses the Stem Generation Engine to produce all affixed stems of any given root. A section of the results of this appears in Figure 5.</Paragraph>
  </Section>
</Paper>