<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1016">
  <Title>Automatic Acquisition of Two-Level Morphological Rules</Title>
  <Section position="4" start_page="103" end_page="103" type="metho">
    <SectionTitle>
2 Two-level Rule Formalism
</SectionTitle>
    <Paragraph position="0"> Two-level rules view a word as having a lezical and a surface representation, with a correspondence between them (Antworth, 1990), e.g.:  Lexical: h appy + e r Surface: h appi 0 e r Each pair of lexical and surface characters is called a feasible pair. A feasible pair can be written as lezicabcharac~er:surface-charac~er. Such a pair is called a default pair when the lexicai character and surface character are identical (e.g. h:h). When the lexical and surface character differ, it is called a special pair (e.g. y:i). The null character (0) may appear as either a lexical character (as in +:0) or a surface character, but not as both.</Paragraph>
    <Paragraph position="1"> 1Non-linear operations (such as infixation) are not considered here, since the basic two-level model deals with it in a round-about way. We can note that extensions to the basic two-level model have been proposed to handle non-linear morphology (Kiraz, 1996).</Paragraph>
    <Paragraph position="2"> Two-level rules have the following syntax (Sproat, 1992, p.145):</Paragraph>
    <Paragraph position="4"> ce (correspondence part), LC (le# contezt) and ac (right contez~) are regular expressions over the alphabet of feasible pairs. In most, if not all, implementations based on the two-level model, the correspondence part consists of a single special pair. We also consider only single pair CPs in this paper. The operator op is one of four types:  1. Exclusion rule: a:b /~ LC _ RC 2. Context restriction rule: a:b ::~ LC _ RC 3. Surface coercion rule: a:b ~ LC _ RC 4. Composite rule: a:b C/V LC _ RC  The exclusion rule (/~) is used to prohibit the application of another, too general rule, in a particular subcontext. Since our method does not overgeneralize, we will consider only the ~, ~ and ~:~ rule types.</Paragraph>
  </Section>
  <Section position="5" start_page="103" end_page="104" type="metho">
    <SectionTitle>
3 Acquisition of Morphotactics
</SectionTitle>
    <Paragraph position="0"> The morphotactics of the input words are acquired by (1) computing the string edit difference between each source-target pair and (2) merging the edit sequences as a minimal acyclic finite state automaton. The automaton, viewed as a DAG, is used to segment the target word into its constituent morphemes. null</Paragraph>
    <Section position="1" start_page="103" end_page="104" type="sub_section">
      <SectionTitle>
3.1 Determining String Edit Sequences
</SectionTitle>
      <Paragraph position="0"> A string edit sequence is a sequence of elementary operations which change a source string into a target string (Sankoff and Kruskal, 1983, Chapter 1).</Paragraph>
      <Paragraph position="1"> The elementary operations used in this paper are single character deletion (DELETE), insertion (IN-SERT) and replacement (REPLACE). We indicate the copying of a character by NOCHANGE. A cost is associated with each elementary operation. Typically, INSERT and DELETE have the same (positive) cost and NOCHANGE has a cost of zero. REPLACE could have the same or a higher cost than INSERT or DELETE. Edit sequences can be ranked by the sum of the costs of the elementary operations that appear in them. The interesting edit sequences are those with the lowest total cost. For most word pairs, there are more than one edit sequence (or mapping) possible which have the same minimal total cost. To select a single edit sequence which will most likely result in a correct segmentation, we added a morphology-specific heuristic to a general string edit algorithm (Vidal et al., 1995).</Paragraph>
      <Paragraph position="2"> This heuristic always selects an edit sequence containing two subsequences which identify prefix-root and root-suffix boundaries. The heuristic depends  on the elementary operations being limited only to INSERT, DELETE and NOCHANGE, i.e. no REPLACEs are allowed. We assume that the target word contains more morphemes than the source word. It therefore follows that there are more INSERTs than DELETEs in an edit sequence. Furthermore, the letters forming the morphemes of the target word appear only as the right-hand components of INSERT operations. Consider the edit sequence to change the string happy into the string  Note that the prefix un- as well as the suffix -er consist only of INSERTs. Furthermore, the prefix-root morpheme boundary is associated with an INSERT followed by a NOCHANGE and the root-suffix boundary by a NOCHANGE-DELETE-INSERT sequence. In general, the prefix-root boundary is just the reverse of the root-suffix boundary, i.e. INSERT-DELETE-NOCHANGE, with the DELETE operation being optional. The heuristic resulting from this observation is a bias giving highest precedence to INSERT operations, followed by DELETE and NOCHANGE, in the first half of the edit sequence. In the second half, the precedence is reversed.</Paragraph>
    </Section>
    <Section position="2" start_page="104" end_page="104" type="sub_section">
      <SectionTitle>
3.2 Merging Edit Sequences
</SectionTitle>
      <Paragraph position="0"> A single source-target edit sequence may contain spurious INSERTs which are not considered to form part of a morpheme. For example, the O:i insertion in Example 7 should not contribute to the suffix -er to form -ier, since -ier is an allomorph of -er.</Paragraph>
      <Paragraph position="1"> To combat these spurious INSERTs, all the edit sequences for a set of source-target words are merged as follows: A minimal acyclic finite state automaton (AFSA) is constructed which accepts all and only the edit sequences as input strings. This AFSA is then viewed as a DAG, with the elementary edit operations as edge labels. For each edge a count is kept of the number of different edit sequences which pass through it. A path segment in the DAG consisting of one or more INSERT operations having a similar count, is then considered to be associated with a morpheme in the target word. The O:e O:r INSERT sequence associated with the -er suffix appears more times than the O:i O:e O:r INSERT sequence associated with the -ier suffix, even in a small set of adjectively-related source-target pairs. This means that there is a rise in the edge counts from O:i to O:e (indicating a root-suffix boundary), while O:e and O:r have similar frequency counts. For prefixes a fall in the edge frequency count of an INSERT sequence indicates a prefix-root boundary.</Paragraph>
      <Paragraph position="2"> To extract the morphemes of each target word, every path through the DAG is followed and only the target-side of the elementary operations serving as edge labels, are written out. The null characters (0) on the target-side of DELETEs are ignored while the target-side of INSERTs are only written if their frequency counts indicate that they are not sporadic allomorph INSERT operations. For Example 7, the following morphotactic description results: is\] Target Word -- Prefix + Source + Suffix</Paragraph>
      <Paragraph position="4"> ditions at a time. However, once the morpheme boundary markers (+) have been inserted, phase two should be able to acquire the correct two-level rules for an arbitrary number of affix additions: prefizl +prefiz2+. . . +roo~+suffizl +suffiz2+ ....</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="104" end_page="107" type="metho">
    <SectionTitle>
4 Acquiring Optimal Rules
</SectionTitle>
    <Paragraph position="0"> To acquire the optimal rules, we first determine the full length lexical-sufface representation of each word pair. This representation is required for writing two-level rules (Section 2). The morphotactic descriptions from the previous section provide source-target input pairs from which new string edit sequences are computed: The right-hand side of the morphotactic description is used as the source and the left-hand side as the target string. For instance,  maps the source into the target and provides the lexical and surface representation required by the two-level rules: \[ 11\] Lexical: u n + h a pp y + e r Surface: u n 0 h a ppi 0 e r The REPLACE elementary string edit operations (e.g. y:i) are now allowed since the morpheme boundary markers (+) are already present in the source string. REPLACEs allow shorter edit sequences to be computed, since one REPLACE does  the same work as an adjacent INSERT-DELETE pair. REPLACE, INSERT and DELETE have the same associated cost and NOCHANGE has a cost of zero. The morpheme boundary marker (+) is always mapped to the null character (0), which makes for linguistically more understandable mappings. Under these conditions, the selection of any minimal cost string edit mapping provides an acceptable lexical-surface representation 2.</Paragraph>
    <Paragraph position="1"> To formulate a two-level rule for the source-target pair happy-unhappier, we need a correspondence pair (CP) and a rule type (op), as well as a left context (LC) and a right context (RC) (see Section 2). Rules need only be coded for special pairs, i.e. IN-SERTs, DELETEs or REPLACEs. The only special pair in the above example is y:i, which will be the CP of the rule. Now the question arises as to how large the context of this rule must be? It should be large enough to uniquely specify the positions in the lexical-surface input stream where the rule is applied. On the other hand, the context should not be too large, resulting in an overspecified context which prohibits the application of the rule to unseen, but similar, words. Thus to make a rule as general as possible, its context (LC and RC) should be as short as possible s . By inspecting the edit sequence in Example 10, we see that y changes into i when y is preceded by a p:p, which serves as our first attempt at a (left) context for y:i. Two questions must be asked to determine the correct rule type to be used (Antworth, 1990, p.53): Question 1 Is E the only environment in which L:S is allowed? Question 2 Must L always be realized as S in E? The term environment denotes the combined left and right contexts of a special l~air. E in our example is &amp;quot;p:p_&amp;quot;, L is y and S is i. Thus the answer to question one is true, since y:i only occurs after p:p in our example. The answer to question two is also true, since y is always realized as i after a p:p in the above edit sequence. Which rule type to use is gleaned from Table 1. Thus, to continue our example, we should use the composite rule type (C/:~):</Paragraph>
    <Paragraph position="3"> will lead to an optimal rule set. In most (if not all) of the examples seen, a minimal mapping was also intuitively acceptable.</Paragraph>
    <Paragraph position="4"> sit abstractions (e.g. sets such as V denoting vowels) over the regular pairs are introduced, it will not be so simple to determine what is &amp;quot;a more general context&amp;quot;. However, current implementations require abstractions to be explicitly instantiated during the compilation process ((Karttunen and Beesley, 1992, pp.19-21) and (Antworth, 1990, pp.49-50)) . Thus, with the current state of the art, abstractions serve only to make the  This example shows how to go about coding the set of two-level rules for a single source-target pair. However, this soon becomes a tedious and error prone task when the number of source-target pairs increases, due to the complex interplay of rules and their contexts.</Paragraph>
    <Section position="1" start_page="105" end_page="107" type="sub_section">
      <SectionTitle>
4.1 Minimal Discerning Rule Contexts
</SectionTitle>
      <Paragraph position="0"> It is important to acquire the minimal discerning context for each rule. This ensures that the rules are as general as possible (to work on unseen words as well) and prevents rule conflicts. Recall that one need only code rules for the special pairs. Thus it is necessary to determine a rule type with associated minimal discerning context for each occurrence of a special pair in the final edit sequences. This is done by comparing all the possible contiguous 4 contexts of a special pair against all the possible contexts of all the other feasible pairs. To enable the computational comparison of the growing left and right contexts around a feasible pair, we developed a &amp;quot;mixed-context&amp;quot; representation. We call the particular feasible pair for which a mixed-context is to be constructed, a marker pair (MP), to distinguish it from the feasible pairs in its context. The mixed-context representation is created by writing the first feasible pair to the left of the marker pair, then the first right-context pair, then the second left-context pair and so forth: \[ 13\] LC1, RC1, LC2, RC2, LC3, RC3, ..., MP The marker pair at the end serves as a label. Special symbols indicate the start (SOS) and end (EOS) of an edit sequence. If, say, the right-context ofa MP is shorter than the left-context, an out-of-bounds symbol (OOB) is used to maintain the mixed-context format. For example the mixed-context of y:i in the edit sequence in Example 10, is represented as: \[ 14\] p:p, +:0, p:p, e:e, a:a, r:r, h:h, EOS, +:0, OOB, n:n, OOB, u:u, SOS, OOB, y:i The common prefixes of the mixed-contexts are merged by constructing a minimal AFSA which accepts all and only these mixed-context sequences.</Paragraph>
      <Paragraph position="1">  Question 2 have the The transitions (or edges, when viewed as a DAG) of the AFSA are labeled with the feasible pairs and special symbols in the mixed-context sequence. There is only one final state for this minimal AFSA. Note that all and only the terminal edges leading to this final state will be labeled with the marker pairs, since they appear at the end of the mixed-context sequences. More than one terminal edge may be labeled with the same marker pair. All the possible (mixed) contexts of a specific marker pair can be recovered by following every path from the root to the terminal edges labeled with that marker pair.</Paragraph>
      <Paragraph position="2"> If a path is traversed only up to an intermediate edge, a shortened context surrounding the marker pair can be extracted. We will call such an intermediate edge a delimiter edge, since it delimits a shortened context. For example, traversing the mixed context path of y:i in Example 14 up to e:e would result in the (unmixed) shortened context: \[ 25\] p:p p:p _ +:0 e:e From the shortened context we can write a two-level</Paragraph>
      <Paragraph position="4"> For each marker pair in the DAG which is also a special pair, we want to find those delimiter edges which produce the shortest contexts providing a true answer to at least one of the two rule type decision questions given above. The mixed-context prefix-merged AFSA, viewed as a DAG, allow us to rephrase the two questions in order to find answers in a procedural way: Question 1 Traverse all the paths from the root to the terminal edges labeled with the marker pair L:S. Is there an edge el in the DAG which all these paths have in common? If so, then question one is true for the environment E constructed from the shortened mixed-contexts associated with the path prefixes delimited by el.</Paragraph>
      <Paragraph position="5"> Consider the terminal edges which same L-component as the marker pair L:S and which are reachable from a common edge e2 in the DAG. Do all of these terminal edges also have the same S-component as the marker pair? If so, then question two is true for the environment E constructed from the shortened mixed-contexts associated with the path prefixes delimited by e2.</Paragraph>
      <Paragraph position="6"> For each marker pair, we traverse the DAG and mark the delimiter edges nearest to the root which allow a true answer to either question one, question two or both (i.e. el = e2). This means that each path from the root to a terminal edge can have at most three marked delimiter edges: One delimiting a context for a ~ rule, one delimiting a context for a rule and one delimiting a context for a ~ rule. The marker pair used to answer the two questions, serves as the correspondence part (Section 2) of the rule. To continue with Example 14, let us assume that the DAG edge labeled with e:e is the closest edge to the root which answers true only to question one. Then the ~ rule is indicated: \[ IS\] y:i ~ p:p p:p _ +:0 e:e However, if the edge labeled with r:r answers true to both questions, we prefer the composite rule (C/#) associated with it although this results in a larger context:</Paragraph>
      <Paragraph position="8"> The reasons for this preference are that the C/~ rule * provides a more precise statement about the applicable environment of the rule and it * seems to be preferred in systems designed by linguistic experts.</Paragraph>
      <Paragraph position="9"> Furthermore, from inspecting examples, a delimiter edge indicating a ~ rule generally delimits the shortest contexts, followed by the delimiter for C/~ and the delimiter for ~. The shorter the selected context, the more generally applicable is the rule. We therefore select only one rule per path, in the following preference order: (1) C/~, (2) ~ and (3) ~.</Paragraph>
      <Paragraph position="10"> Note that any of the six possible precedence orders would provide an accurate analysis and generation of the pairs used for learning. However, our suggested precedence seems to strike the best balance between over- or underrecognition and over- or undergeneration when the rules would be applied to unseen pairs.</Paragraph>
      <Paragraph position="11"> The mixed-context representation has one obvious drawback: If an optimal rule has only a left or only a right context, it cannot be acquired. To solve this problem, two additional minimal AFSAs are constructed: One containing only the left context information for all the marker pairs and one containing only the right context information. The same process is then followed as with the mixed contexts.</Paragraph>
      <Paragraph position="12"> The final set of rules is selected from the output of all three the AFSAs: For each special pair 1. we select any of the C/~ rules with the shortest contexts of which the special pair is the left-hand side, or  2. if no C/~ rules were found, we select the shortest and ~ rules for each occurrence of the special pair. They are then merged into a single C/~ rule with disjuneted contexts.</Paragraph>
      <Paragraph position="13"> The rule set learned is complete since all possible combinations of marker pairs, rule types and contexts are considered by traversing all three DAGs. Furthermore, the rules in the set have the shortest possible contexts, since, for a given DAG, there is only one delimiter edge closest to the root for each path, marker pair and rule type combination.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>