<?xml version="1.0" standalone="yes"?>
<Paper uid="E93-1035">
  <Title>On Abstract Finite-State Morphology</Title>
  <Section position="3" start_page="297" end_page="297" type="metho">
    <SectionTitle>
2 Representing intercalation
</SectionTitle>
    <Paragraph position="0"> An alternative approach to nonconcatenative morphology uses the idea of prosodic templates [McCarthy, 1981], which describe the underlying patterns of vowels and consonants. For instance, Kay [1987] provides a four-level account of how the Arabic root 'ktb' ('write') is mapped onto the stem 'aktabib' (imperfective active form) by means of the template 'VCCVCVC' (where 'V' stands for vowel and 'C' for consonant) and eight transitions. The first tape contains the root, the second the template, the third the intercalative vowels (vocalism), and the fourth the surface form. State switches are determined by 'frames' of quadruples which specify what each tape symbol must be. Formulating individual templates and quadruples (which represent the mapping rules) carries an overhead, even for a restricted set of lexical entries. More generally, nothing in the templates themselves allows underlying patterns to emerge or be used. This has led to the examination of ways of abstracting over and classifying templates. For instance, inheritance and default-based approaches, as used in artificial intelligence, can be adopted for template and lexical entry representation [DeSmedt, 1984], so that duplicate and redundant information can be deleted from individual entries if the path they are on already contains this information. Research has focused on unification-based formalisms for inheritance network representation (e.g. [Flickinger et al., 1985; Shieber, 1986; Porter, 1987; Evans and Gazdar, 1990; Bird and Blackburn, 1990; Reinhard and Gibbon, 1991]).</Paragraph>
    <Paragraph position="1"> The question arises as to whether it is possible to achieve the generalities obtainable through a prosodic template approach within a multi-level finite-state model. Briefly, we hypothesize, in addition to the lexical and surface levels, an abstract level of automaton representation at which classes of inflectional phenomena are given an abstract representation. These abstract automata are translated into two-level automata for specific morphological phenomena. Concatenative and nonconcatenative patterns of inflection are represented not via the dictionary but at an abstract automaton component level. Applications of abstract automata to Arabic noun stems and verb roots are described below.</Paragraph>
  </Section>
  <Section position="4" start_page="297" end_page="298" type="metho">
    <SectionTitle>
3 Arabic noun structure
</SectionTitle>
    <Paragraph position="0"> A noun stem in Arabic is inflected according to Case (nominative, accusative, genitive), Number (singular, dual, plural), Gender (feminine, masculine), and Definiteness (definite, indefinite). These inflections are mainly suffixes added to the noun stem. The case endings determine the vowelisation of the final letter of the stem.</Paragraph>
    <Paragraph position="1"> The Indefinite Noun Endings are:
      Singular
        Nominative: -/un/ (double damma) (e.g. waladon)
        Accusative: -/an/ (fatha) (e.g. waladan)
        Genitive: -/en/ (kasra) (e.g. waladen)
      Dual
        Nominative: -/ani/ (e.g. waladani)
        Accusative: -/ayni/ (e.g. waladyni)
        Genitive: as for accusative.</Paragraph>
    <Paragraph position="2"> Plural: In Arabic there are three types of plural. These are the Sound Masculine Plural (SMP), the Sound Feminine Plural (SFP), and the Broken Plural (BP). The SMP is for male human beings (an exception is sana, 'year', which can take the SMP). For example, the word for 'engineer' takes one of two SMP forms, depending on the case ending. The SFP is for female human beings, inanimates, and most foreign words that have been incorporated into the language. For example, the word for a female 'scientist' takes one of two SFP forms, again depending on the case ending. Similarly, the word for 'car' (an inanimate object) takes one of two SFP forms. The BP does not follow any regular pattern and is for nouns that do not fall into the above categories, although this division is not strict. For example, the word for 'son' (a male human) is pluralised with a broken plural. [The Arabic-script examples in this paragraph are illegible in the source.]</Paragraph>
    <Section position="1" start_page="298" end_page="298" type="sub_section">
      <SectionTitle>
The SMP Ending
</SectionTitle>
      <Paragraph position="0"> Nominative: -/oon/ (e.g. muhamiyoon)
        Accusative: -/yyn/ (e.g. muhamiyyn)
        Genitive: as for the accusative</Paragraph>
    </Section>
    <Section position="2" start_page="298" end_page="298" type="sub_section">
      <SectionTitle>
The SFP Ending
</SectionTitle>
      <Paragraph position="0"> If the noun carries the feminine ending, that ending must be removed before the SFP ending is added.</Paragraph>
      <Paragraph position="1"> Nominative: -/atun/ (e.g. maktabatun)
        Accusative: -/aten/ (e.g. maktabaten)
        Genitive: as for the accusative
        The definite noun endings are the same as for the indefinite noun, except that al is added to the beginning of the noun. When a noun is made definite, the nunation is lost, so any ending with double fatha, kasra, or damma is reduced to a single fatha, kasra, or damma. For example, the indefinite form of 'boy' loses its nunation when it becomes 'the boy'. [The Arabic-script forms of this example are illegible in the source.]</Paragraph>
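      <Paragraph> The indefinite endings listed in this section can be collected into a small lookup table. The following Python sketch is illustrative only: the names ENDINGS, ending and inflect are hypothetical, the keys 'smp' and 'sfp' are shorthand for the two sound plurals, and plain concatenation of transliterated strings is an assumption that ignores the stem-final changes and feminine-ending removal discussed in the text.

```python
# Transliterated indefinite endings as listed in the text.
# Where the text says "as for the accusative", no genitive entry is stored.
ENDINGS = {
    ("singular", "nominative"): "un",
    ("singular", "accusative"): "an",
    ("singular", "genitive"): "en",
    ("dual", "nominative"): "ani",
    ("dual", "accusative"): "ayni",
    ("smp", "nominative"): "oon",
    ("smp", "accusative"): "yyn",
    ("sfp", "nominative"): "atun",
    ("sfp", "accusative"): "aten",
}


def ending(number, case):
    """Look up an ending, falling back from genitive to accusative."""
    key = (number, case)
    if key not in ENDINGS and case == "genitive":
        key = (number, "accusative")
    return ENDINGS[key]


def inflect(stem, number, case):
    """Naive concatenation of a transliterated stem and its ending (assumption)."""
    return stem + ending(number, case)


print(inflect("walad", "dual", "nominative"))
```

The fallback in ending mirrors the statement that the genitive follows the accusative in the dual and in both sound plurals.</Paragraph>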
    </Section>
  </Section>
  <Section position="5" start_page="298" end_page="299" type="metho">
    <SectionTitle>
4 Network representation
</SectionTitle>
    <Paragraph position="0"> The noun structure system described below produces surface forms from lexical representations and so is a generator of inflected nouns. Generation is achieved by the use of finite-state transition networks (FSTNs). FSTNs realize finite-state tables (FSTs) which can be used for providing the mappings between lexical and surface structure. For instance, consider the FST in Figure 1 and the associated transition network in Figure 2.

      Figure 1: FST (rows are states, columns are input characters, 0 is the error state)
                 Input
      State    h    a    !
        1.     2    0    0
        2.     0    3    0
        3.     2    0    4
        4:     0    0    0

      Figure 2: FSTN for the FST in Figure 1 [network diagram not reproduced]

      According to the tabular representation, if we're in state 1 (first row) and an 'h' is the current input character found (first column), then we switch to state 2 and look at the next character. If we're in state 1 and an 'a' or '!' is found, then we switch to an error state (0). If we're in state 2 and an 'a' is found, we switch to state 3 and read the next character, otherwise we switch to an error state. States 1, 2 and 3 are non-terminal (signified by the full-stops), whereas state 4 is terminal (signified by ':'). This FST specifies the state-switching behaviour of any machine which is to accept strings of the form '{ha}^n!', i.e. one or more occurrences of 'ha' followed by an exclamation mark. The same FST can be interpreted as a generator of such strings if 'Input' is changed to 'Output' in Figure 1. The 'conditions' on arcs are reinterpreted as characters to be output in this case.</Paragraph>
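    <Paragraph> The table-driven behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the names TABLE and accepts are hypothetical, and the entries for '!' in states 2 to 4 are inferred from the prose description of the accepted strings.

```python
# Transition table in the spirit of Figure 1; 0 is the error state.
# Rows are states, columns are input characters.
TABLE = {
    1: {"h": 2, "a": 0, "!": 0},
    2: {"h": 0, "a": 3, "!": 0},
    3: {"h": 2, "a": 0, "!": 4},
    4: {"h": 0, "a": 0, "!": 0},
}
START, TERMINAL, ERROR = 1, {4}, 0


def accepts(text):
    """Return True for one or more occurrences of 'ha' followed by '!'."""
    state = START
    for ch in text:
        # Characters outside the table also lead to the error state.
        state = TABLE[state].get(ch, ERROR)
        if state == ERROR:
            return False
    return state in TERMINAL


for sample in ["ha!", "haha!", "ha", "hah!"]:
    print(sample, accepts(sample))
```

Reading the same table as a generator, as the text suggests, would amount to emitting the column label of each non-zero entry while following a path from state 1 to state 4.</Paragraph>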
    <Paragraph position="1"> The transition network in Figure 2 is constructed directly from the FST: nodes are labeled with state numbers, and arcs specify the input conditions before a state switch can occur. Double-circled nodes in the transition network signify start and terminal nodes. Given such FSTs and equivalent transition networks for Arabic noun and verb structures, Prolog was used to implement the automata. Start and end states are declared with the predicates start_state(X) and end_state(Y) where X and Y represent state numbers, and arc declarations have the form: arc(CurrentState, NextState, [InputString], [OutputString]). The third argument consists of the parameters Input Character, Direction, Offset, and the fourth refers (for nouns) to the characters for the output word. The direction indicates how to move the scanning head across the input. It can take one of two values: r for right, and l for left. The offset indicates by how much to move left or right along the input tape. (Right or left zero is the same as not moving.) The use of directions and offsets (a non-zero offset of n can be regarded as n separate state transitions of one move in the required direction) means that the automata used here are examples of two-way finite automata [Rabin and Scott, 1959; Shepherdson, 1959; Hopcroft and Ullman, 1979].</Paragraph>
    <Paragraph position="2"> The system works in the following way for Singular Nominatives (and similarly for all the other noun inflections). A request for 'bnt' ('girl') to be inflected with Singular Nominative produces the list [b,n,t,+,o,n] which is then fed to the appropriate automaton. The FSTN for the Singular Nominative automaton can be seen in Figure 3 and its associated FST in Figure 4. The first character, 'b', is identified. The current arc statement is matched against the arc facts of the automaton.</Paragraph>
    <Paragraph position="4"> For the first letter we have: arc(1,?,[b,?,?],[?]), i.e. what is the state to be moved to from state 1, and what is to be produced at this stage? This will match against the stored arc(1,1,[Anychr,r,1],[Anychr]), i.e. if in state 1 and any character is found, then stay in state 1 and move one position to the right (offset) after copying the character ('b') to the output. The next character is then scanned. This matching process is repeated until the whole of the input word has been read.</Paragraph>
    <Paragraph position="5"> Figure 5 shows how the output string is built up for input [b,n,t,+,o,n]. For the first four steps the procedure is straightforward: the input is echoed to the output list. The boundary sign (+) is replaced with a null value (''). When the first of the case ending letters is met, nothing is produced until a check is made whether the previous output character needs changing. The automaton therefore moves back to the end of the stem to check the end character (line 7). For this particular example, the character remains the same, and the automaton moves forward again to the first case ending (line 8). The offsets for movement backwards and forwards leave the automaton at the same position as in line 6. The bottom line shows the output list at the end of the traversal of the automaton. (The 'O' in the output list refers to the double damma.) Null values are deleted, and the output list is sent to the Arabic output routines.</Paragraph>
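    <Paragraph> The traversal just described can be approximated with a small two-way automaton in Python. This is a sketch under stated assumptions rather than the Prolog system itself: only the arc (1,1,[Anychr,r,1],[Anychr]) is quoted in the text, so the remaining arcs, the state numbers 2 to 6, and the names ARCS and generate are hypothetical, chosen to reproduce the behaviour described for the input [b,n,t,+,o,n].

```python
ANY = "Anychr"  # wildcard condition, named as in the arc quoted in the text

# (current state, next state, (input condition, direction, offset), output)
# Arcs are tried in order, so specific conditions precede the wildcard.
ARCS = [
    (1, 2, ("+", "r", 1), [""]),    # boundary sign becomes a null value
    (1, 1, (ANY, "r", 1), [ANY]),   # echo a stem character, move right
    (2, 3, ("o", "l", 2), []),      # case ending met: go back to the stem end
    (3, 4, (ANY, "r", 2), []),      # stem-final character checked, unchanged
    (4, 5, ("o", "r", 1), ["O"]),   # 'O' stands for the double damma
    (5, 6, ("n", "r", 1), []),      # second case-ending letter adds nothing
]
START_STATE, END_STATE = 1, 6


def generate(symbols):
    """Run the two-way automaton and return the output with nulls deleted."""
    state, pos, output = START_STATE, 0, []
    while state != END_STATE:
        if pos not in range(len(symbols)):
            raise ValueError("ran off the input tape")
        current = symbols[pos]
        for src, dst, (cond, direction, offset), out in ARCS:
            if src == state and (cond == current or cond == ANY):
                output.extend(current if item == ANY else item for item in out)
                pos += offset if direction == "r" else -offset
                state = dst
                break
        else:
            raise ValueError(f"no arc from state {state} on {current!r}")
    return [item for item in output if item != ""]


print(generate(list("bnt+on")))
```

The leftward offset of 2 followed by a rightward offset of 2 returns the head to the first case-ending letter, which corresponds to the back-and-forth movement described for lines 7 and 8 of Figure 5.</Paragraph>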
    <Paragraph position="6"> Narayanan and Hashem [1992] provide example runs and more detail about the implementation.</Paragraph>
    <Paragraph position="8"/>
  </Section>
  <Section position="6" start_page="299" end_page="300" type="metho">
    <SectionTitle>
5 Inheritance-based derivation
</SectionTitle>
    <Paragraph position="0"> Two-way automata for all nine types of inflection (three Case by three Number) can be constructed from abstract ones. For instance, the noun system used two abstractions on number. Figure 6 represents ... [the remainder of this sentence and the accompanying figures are missing from the source].</Paragraph>
    <Paragraph position="2"> Specific automata, for example for Dual Accusative and Genitive (Figure 8), can be derived from the abstract dual automaton: the specific automaton inherits the basic form of the abstract automaton and adds its own arcs and nodes (specialization), as will be described later.</Paragraph>
  </Section>
  <Section position="7" start_page="300" end_page="300" type="metho">
    <SectionTitle>
6 Verb structure
</SectionTitle>
    <Paragraph position="0"> The major difference between concatenative and nonconcatenative two-way automata for Arabic is that, for nonconcatenation, movement in both directions is required within the output tape rather than the input tape, so that affix information can be inserted between root characters. For concatenative two-way automata (as for the nouns), any moves are to the beginning or ending of the stem on the input tape, and if the last character of the stem needs changing this happens before the affix output is added.</Paragraph>
    <Paragraph position="1"> Arabic verb structure is well-documented (e.g. [McCarthy, 1981; Hudson, 1986]).</Paragraph>
    <Paragraph position="2"> The following table gives the perfect active and perfect passive stems of the first three forms of 'ktb' only, but these are adequate to demonstrate the abstraction principles involved here.</Paragraph>
    <Paragraph position="3"> The input representation is of the form [&lt;root&gt; + &lt;vowels&gt;]. For example, [k,t,b,+,a,a] with a request for Form II results in 'kattab', and [k,t,b,+,u,i] results in 'kuutib' if Form III passive is requested.</Paragraph>
    <Paragraph position="4"> The following six statements describe an automaton (Figure 9) for generating Form I stems.</Paragraph>
    <Paragraph position="6"> The output argument of the arc statement is more complex than for nouns. The output argument [X, D, N] means 'After moving N steps in direction D, write X', where X can be a consonant C or vowel V. Also, the output argument can consist of one or two lists, the first for moving in one direction, the other to return the head to an appropriate location on the output tape for the next state. For instance, given the input [k,t,b,+,a,a] with a request for Form I, arc (1) would produce 'C_' (i.e. the first consonant is output together with a blank space to its right). The same would happen for the second consonant by arc (2). Arc (3) produces only a consonant, so in state 4 the output tape contains 'C_C_C', with the head of the output tape resting on the last C.</Paragraph>
    <Paragraph position="7"> Arc (4) acts as a check that exactly three consonants have been found. Arc (5) makes the output head move left four positions (to the first blank between two Cs) and inserts the V before moving back to its original position (and writing a null value again over the existing null value). Arc (6) works similarly, except that the offset is only two. Throughout, the input is scanned sequentially, one character at a time.</Paragraph>
    <Paragraph position="8"> This automaton also works for perfect passive Form I stems: 'a' and 'a' are replaced by 'u' and 'i'. Also, Form II can inherit the Form I automaton and add two specializations. First, arc (2) is changed so that two copies of the C are written instead of one (i.e. (2a)). Second, arc (5) has offset 5 and not 4 (i.e. (5a)):</Paragraph>
    <Paragraph position="10"> Form III can inherit from Form I and add its two specializations: arc (1) is changed so that two blanks are introduced (i.e. (1b)), and arc (5) is changed so that two Vs are written (i.e. (5b)). The offset when moving left is 5, and when returning 4.</Paragraph>
    <Paragraph position="12"/>
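    <Paragraph> The movement of the output head for Forms I, II and III can be traced with a short Python sketch. It is illustrative only: the arc statements themselves are missing from this section of the source, so the output arguments for arcs (1) to (4) and (6), the starting head position, and the convention that a multi-symbol pattern such as 'C_' or 'VV' is written left to right with the head resting on the last cell written are all assumptions made to match the prose. The names FORM_I, FORM_II, FORM_III and generate are hypothetical.

```python
VOWELS = "aiu"

# Output arguments per arc, in arc order (1) to (6). Each triple (X, D, N)
# means: move N cells in direction D ('l' or 'r'), then write X.
FORM_I = [
    [("C_", "r", 1)],                   # (1) consonant plus a blank
    [("C_", "r", 1)],                   # (2) consonant plus a blank
    [("C", "r", 1)],                    # (3) consonant only
    [("", "r", 1)],                     # (4) boundary '+': write a null
    [("V", "l", 4), ("", "r", 4)],      # (5) fill the first blank, return
    [("V", "l", 2), ("", "r", 2)],      # (6) fill the second blank, return
]
FORM_II = list(FORM_I)
FORM_II[1] = [("CC_", "r", 1)]                   # (2a) doubled consonant
FORM_II[4] = [("V", "l", 5), ("", "r", 5)]       # (5a) offset 5, not 4
FORM_III = list(FORM_I)
FORM_III[0] = [("C__", "r", 1)]                  # (1b) two blanks
FORM_III[4] = [("VV", "l", 5), ("", "r", 4)]     # (5b) two Vs, return 4

EXPECTED = ["C", "C", "C", "+", "V", "V"]


def kind(symbol):
    """Classify an input symbol as boundary, vowel or consonant."""
    if symbol == "+":
        return "+"
    return "V" if symbol in VOWELS else "C"


def generate(arcs, symbols):
    """Scan the input once, left to right, writing to a two-way output tape."""
    if [kind(s) for s in symbols] != EXPECTED:
        raise ValueError("input must be three consonants, '+', two vowels")
    tape, head = [], -1
    for current, outputs in zip(symbols, arcs):
        for pattern, direction, offset in outputs:
            head += offset if direction == "r" else -offset
            # C and V stand for the current input symbol; '_' is a blank.
            cells = [current if p in "CV" else p for p in pattern] or [""]
            tape.extend([""] * (head + len(cells) - len(tape)))
            tape[head:head + len(cells)] = cells
            head += len(cells) - 1
    return "".join(tape)


print(generate(FORM_I, list("ktb+aa")))
print(generate(FORM_II, list("ktb+aa")))
print(generate(FORM_III, list("ktb+ui")))
```

Under these assumptions the left and return offsets quoted in the text (4 and 4, 5 and 5, 5 and 4) each bring the head back to the null cell that follows the last consonant, so arc (6) can use the same offset of 2 in all three Forms.</Paragraph>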
  </Section>
  <Section position="8" start_page="300" end_page="301" type="metho">
    <SectionTitle>
7 Abstract automata and inheritance
</SectionTitle>
    <Paragraph position="0"> The abstract automaton underlying Forms I, II and III is given in Figure 10. The solid lines specify those arcs which are core to all specific automata, and the dashed lines signify arcs which will be specialized.</Paragraph>
  </Section>
  <Section position="9" start_page="301" end_page="301" type="metho">
    <SectionTitle>
III
</SectionTitle>
    <Paragraph position="0"> In the arcs of the automata for Forms I, II and III the pattern of output Cs and Vs has been specialized (as in (1b), (2a) and (5b)) and so have the offsets (as in (5a) and (5b)). Inheritance is multiple, since the automaton for Form III inherits (2) from Form I as well as (i) the right return offset of 4 from (5) of Form I, i.e. arc(5,6,[V,r,1],[[V,l,4],['',r,4]]), and (ii) the move left (before writing) offset from (5a) of Form II, i.e. arc(5,6,[V,r,1],[[V,l,5],['',r,5]]). Form III also specializes its V pattern, i.e. arc(5,6,[V,r,1],[[VV,l,5],['',r,4]]). In all cases, there are seven states and fixed length stems depending on their form. The inheritance structure for these three Forms is given in Figure 11. Form 0 specifies the core arcs which are inherited by all specific automata and cannot be specialized, and subsequent automata can further specialize their behaviour by adding their own arcs or changing the contents of arcs inherited from other automata.</Paragraph>
    <Paragraph position="1"> The inheritance status of an arc is given by another argument in the arc representation. Arcs therefore have the following form in the implemented system: arc(S1, S2, IP, OP, status) where S1 and S2 are state numbers, IP and OP are the sets of input and output parameters, respectively, and 'status' is 0 for core and non-zero for non-core.</Paragraph>
    <Paragraph position="2"> In the case of representing the inheritance relationships between the different Forms, any non-zero status value refers to the Form for which the arc is a specialization. The Form I automaton is therefore fully described by:
      (1) arc(1,2,[C,r, ...
      (2) arc(2,3,[C,r, ...
      (3) arc(3,4,[C,r, ...
      (4) arc(4,5,[+,r, ...
      (6) arc([state numbers illegible],[V,r, ...</Paragraph>
    <Paragraph position="4"> where status 1 refers to Form I specialization. Form II automata are fully described by:</Paragraph>
    <Paragraph position="6"> where status 2 refers to Form II specialization. Similarly for Form III:</Paragraph>
    <Paragraph position="8"> where (5b) has been constructed out of (5) and (5a), i.e. the state numbers, input argument and right return offset of (5), and the move left offset of (5a), respectively. Ideally, these changes to (5) and (5a) will be carried out within the Form III object.</Paragraph>
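    <Paragraph> The role of the status argument and the construction of (5b) can be sketched in Python as follows. This is a hedged illustration, not the implemented Prolog system: the full arc listings are missing from the source, so the choice of (3), (4) and (6) as core arcs, the output lists other than those quoted for arc (5), and the names FORM_0, derive and build_5b are assumptions.

```python
# Arcs are (S1, S2, IP, OP, status), keyed by label. Status 0 marks a core
# arc; a non-zero status names the Form for which the arc is a specialization.
FORM_0 = {
    "3": (3, 4, ("C", "r", 1), [("C", "r", 1)], 0),
    "4": (4, 5, ("+", "r", 1), [("", "r", 1)], 0),
    "6": (6, 7, ("V", "r", 1), [("V", "l", 2), ("", "r", 2)], 0),
}


def derive(parent, specializations):
    """Inherit every arc of parent, then add or replace non-core arcs."""
    child = dict(parent)
    for label, arc in specializations.items():
        if label in child and child[label][4] == 0:
            raise ValueError(f"core arc ({label}) cannot be specialized")
        child[label] = arc
    return child


FORM_I = derive(FORM_0, {
    "1": (1, 2, ("C", "r", 1), [("C_", "r", 1)], 1),
    "2": (2, 3, ("C", "r", 1), [("C_", "r", 1)], 1),
    "5": (5, 6, ("V", "r", 1), [("V", "l", 4), ("", "r", 4)], 1),
})
FORM_II = derive(FORM_I, {
    "2": (2, 3, ("C", "r", 1), [("CC_", "r", 1)], 2),               # (2a)
    "5": (5, 6, ("V", "r", 1), [("V", "l", 5), ("", "r", 5)], 2),   # (5a)
})


def build_5b(arc5, arc5a):
    """Take states, input and return move from (5), left move from (5a)."""
    s1, s2, ip, op5, _ = arc5
    (_, direction, left_offset), _ = arc5a[3]
    return (s1, s2, ip, [("VV", direction, left_offset), op5[1]], 3)


FORM_III = derive(FORM_I, {
    "1": (1, 2, ("C", "r", 1), [("C__", "r", 1)], 3),               # (1b)
    "5": build_5b(FORM_I["5"], FORM_II["5"]),                        # (5b)
})

print(FORM_III["5"])
```

Here Form III takes arc (2) unchanged from Form I and draws on both Form I and Form II for (5b), which is the multiple inheritance the text describes; performing build_5b inside a Form III object, as the text proposes, is left open.</Paragraph>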
  </Section>
</Paper>