<?xml version="1.0" standalone="yes"?> <Paper uid="P95-1002"> <Title>Automatic Induction of Finite State Transducers for Simple Phonological Rules</Title> <Section position="4" start_page="9" end_page="10" type="metho"> <SectionTitle> 3 Problems Using OSTIA to learn </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="9" end_page="10" type="sub_section"> <SectionTitle> Phonological Rules </SectionTitle> <Paragraph position="0"> The OSTIA algorithm can be proven to learn any subsequential relation in the limit. That is, given an infinite sequence of valid input/output pairs, it will at some point derive the target transducer from the samples seen so far. For example, OSTIA's tendency to produce overly &quot;clumped&quot; transducers is illustrated by the arcs with output &quot;b ae&quot; and &quot;n d&quot; in the transducer in Figure 4, or even Figure 2. OSTIA's default behavior is to emit the remainder of the output string for a transduction as soon as enough input symbols have been seen to uniquely identify the input string in the training set. This results in machines which may, seemingly at random, insert or delete sequences of four or five phonemes, something which is linguistically implausible. In addition, the incorrect distribution of output symbols prevents the optimal merging of states during the learning process, resulting in large and inaccurate transducers.</Paragraph> <Paragraph position="1"> Another example of an unnatural generalization is shown in Figure 4, the final transducer induced by OSTIA on the three-word training set of Figure 2. For example, the transducer of Figure 4 will insert an 'ae' after any 'b', and delete any 'ae' from the input. Perhaps worse, it will fail completely upon seeing any symbol other than 'er' or end-of-string after a 't'.
While it might be unreasonable to expect any transducer trained on three samples to be perfect, the transducer of Figure 4 illustrates on a small scale how the OSTIA algorithm might be improved.</Paragraph> <Paragraph position="2"> Similarly, if the OSTIA algorithm is trained on cases of flapping in which the preceding environment is every stressed vowel but one, the algorithm has no way of knowing that it can generalize the environment to all stressed vowels. The algorithm needs knowledge about classes of phonemes to fill in accidental gaps in training data coverage.</Paragraph> </Section> </Section> <Section position="5" start_page="10" end_page="10" type="metho"> <SectionTitle> 4 Using Alignment Information </SectionTitle> <Paragraph position="0"> Our first modification of OSTIA was to add the bias that, as a default, a phoneme is realized as itself, or as a similar phone. Our algorithm guesses the most probable phoneme-to-phoneme alignment between the input and output strings, and uses this information to distribute the output symbols among the arcs of the initial tree transducer. This is demonstrated for the word &quot;importance&quot; in Figures 5 and 6.</Paragraph> <Paragraph position="1"> [Figure 5: alignment of underlying &quot;ih m p oal r t ah n s&quot; with surface &quot;ih m p oal dx ah n t s&quot;.] The modification proceeds in two stages. First, a dynamic programming method is used to compute a correspondence between input and output phonemes. The alignment uses the algorithm of Wagner &amp; Fischer (1974), which calculates the insertions, deletions, and substitutions which make up the minimum edit distance between the underlying and surface strings. The costs of edit operations are based on phonetic features; we used 26 binary articulatory features. The cost function for substitutions was equal to the number of features changed between the two phonemes. The cost of insertions and deletions was 6 (roughly one quarter the maximum possible substitution cost).
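This alignment stage can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature vectors below are hypothetical stand-ins for the 26 binary articulatory features, but the substitution cost (number of differing features) and the insertion/deletion cost of 6 follow the description above.

```python
# Toy feature table; vectors here are hypothetical stand-ins for the
# paper's 26 binary articulatory features.
FEATURES = {
    "t":  (1, 0, 0, 1),
    "dx": (1, 0, 1, 1),
    "ah": (0, 1, 0, 0),
}
INDEL = 6  # roughly one quarter of the maximum substitution cost

def sub_cost(a, b):
    """Substitution cost: the number of features changed between phonemes."""
    return sum(x != y for x, y in zip(FEATURES[a], FEATURES[b]))

def align(underlying, surface):
    """Wagner-Fischer dynamic programming: return (distance, edit ops)."""
    n, m = len(underlying), len(surface)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0], back[i][0] = i * INDEL, "del"
    for j in range(1, m + 1):
        dist[0][j], back[0][j] = j * INDEL, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (dist[i - 1][j - 1] + sub_cost(underlying[i - 1], surface[j - 1]), "sub"),
                (dist[i - 1][j] + INDEL, "del"),
                (dist[i][j - 1] + INDEL, "ins"),
            ]
            dist[i][j], back[i][j] = min(options)
    # Trace the backpointers to recover the edit script.
    ops, i, j = [], n, m
    while i + j:
        op = back[i][j]
        ops.append(op)
        if op == "sub":
            i, j = i - 1, j - 1
        elif op == "del":
            i -= 1
        else:
            j -= 1
    return dist[n][m], list(reversed(ops))
```

For example, aligning underlying "t ah" with surface "dx ah" yields two substitutions at total cost 1, since 't' and 'dx' differ in one toy feature; this is the kind of alignment that licenses mapping the flap to the underlying 't'.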
From the sequence of edit operations, a mapping of output phonemes to input phonemes is generated according to the following rules: * Any output phoneme maps to the input phoneme for which it substitutes * Inserted phonemes map to the input phoneme immediately following the first substitution to the left of the inserted phoneme Second, when adding a new arc to the tree, all the unused output phonemes up to and including those which map to the arc's input phoneme become the new arc's output, and are now marked as having been used. When walking down branches of the tree to add a new input/output sample, the longest common prefix, n, of the sample's unused output and the output of each arc is calculated. The next n symbols of the transduction's output are now marked as having been used. If the length, l, of the arc's output string is greater than n, it is necessary to push back the last l - n symbols onto arcs further down the tree. A tree transducer constructed by this process is shown in Figure 7, for comparison with the unaligned version in Figure 2.</Paragraph> <Paragraph position="2"> Results of our alignment algorithm are summarized in Section 6. The denser distribution of output symbols resulting from the alignment constrains the merging of states early in the merging loop of the algorithm. Interestingly, preventing the wrong states from merging early on allows more merging later, and results in more compact transducers.</Paragraph> </Section> <Section position="6" start_page="10" end_page="12" type="metho"> <SectionTitle> 5 Generalizing Behavior With Decision Trees </SectionTitle> <Paragraph position="0"> In order to allow OSTIA to make natural generalizations in its rules, we added a decision tree to each state of the machine, describing the behavior of that state. For example, the decision tree for state 2 of the machine in Figure 1 is shown in Figure 8.
Note that if the underlying phone is an unstressed vowel (\[-cons,-stress\]), the machine outputs a flap, followed by the underlying vowel; otherwise it outputs a 't' followed by the underlying phone.</Paragraph> <Paragraph position="1"> The decision trees describe the behavior of the machine at a given state in terms of the next input symbol by generalizing from the arcs leaving the state. The decision trees classify the arcs leaving each state, based on the arc's input symbol, into groups with the same behavior. The same 26 binary phonetic features used in calculating edit distance were used to classify phonemes in the decision trees. Thus the branches of the decision tree are labeled with phonetic feature values of the arc's input symbol, and the leaves of the tree correspond to the different behaviors. By an arc's behavior, we mean its output string considered as a function of its input phoneme, and its destination state. Two arcs are considered to have the same behavior if they agree on each of the following: * the index i of the output symbol corresponding to the input symbol (determined from the alignment procedure) * the difference of the phonetic feature vectors of the input symbol and symbol i of the output string * the prefix of length i - 1 of the output string [Figure 8: Example Decision Tree. This tree describes the behavior of State 2 of the transducer in Figure 1. \[ \] in the output string indicates the arc's input symbol (with no features changed). 1: Output: dx \[ \], Destination State: 0. 2: Output: t \[ \], Destination State: 0. 3: On end of string: Output: t, Destination State: 0.]</Paragraph> <Paragraph position="2"> * the suffix of the output string beginning at position i + 1</Paragraph> <Paragraph position="4"> After the process of merging states terminates, a decision tree is induced at each state to classify the outgoing arcs.
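The same-behavior test above can be sketched as a grouping key computed per arc. This is a toy sketch under stated assumptions: the feature table is hypothetical, arcs are represented as (input symbol, output string, destination state, aligned index i) tuples, and i is 1-based as in the criteria listed above.

```python
# Hypothetical toy feature table (the paper uses 26 binary features).
FEATS = {"t": (1, 0, 0), "dx": (1, 0, 1), "b": (1, 1, 0)}

def behavior_key(in_sym, out_str, dest, i):
    """Signature under which two arcs count as having the same behavior:
    the aligned index i, the feature difference between the input symbol
    and output symbol i, the output prefix of length i - 1, the output
    suffix from position i + 1, and the destination state."""
    diff = tuple(a - b for a, b in zip(FEATS[in_sym], FEATS[out_str[i - 1]]))
    return (i, diff, tuple(out_str[:i - 1]), tuple(out_str[i:]), dest)

def group_arcs(arcs):
    """Group a state's outgoing arcs by behavior; each group becomes one
    leaf of the state's decision tree."""
    groups = {}
    for in_sym, out_str, dest, i in arcs:
        groups.setdefault(behavior_key(in_sym, out_str, dest, i), []).append(in_sym)
    return list(groups.values())
```

Under this key, an arc that copies 't' to the output and an arc that copies 'b' share a behavior (zero feature difference, same prefix, suffix, and destination), while an arc realizing 't' as the flap 'dx' falls into a different group.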
Figure 9 shows a tree induced at the initial state of the transducer for flapping.</Paragraph> <Paragraph position="5"> Using phonetic features to build a decision tree guarantees that each leaf of the tree represents a natural class of phonemes, that is, a set of phonemes that can be described by specifying values for some subset of the phonetic features. Thus if we think of the transducer as a set of rewrite rules, we can now express the context of each rule as a regular expression of natural classes of preceding phonemes.</Paragraph> <Paragraph position="6"> [Figure 9: decision tree at the initial state of the flapping transducer.] Some induced transducers may need to be generalized even further, since the input transducer to the decision tree learning may have arcs which are incorrect merely because of accidental prior structure. Consider again the English flapping rule, which applies in the context of a preceding stressed vowel. Our algorithm first learned a transducer whose decision tree is shown in Figure 9. In this transducer all arcs leaving state 0 correctly lead to the flapping state on stressed vowels, except for those stressed vowels which happen not to have occurred in the training set. For these unseen vowels (which consisted of the rounded diphthongs 'oy' and 'ow' with secondary stress), the transducer incorrectly returns to state 0. In this case, we wish the algorithm to make the generalization that the rule applies after all stressed vowels.</Paragraph> <Paragraph position="7"> This type of generalization can be accomplished by pruning the decision trees at each state of the machine. Pruning is done by stepping through each state of the machine and pruning as many decision nodes as possible at each state. The entire training set of transductions is tested after each branch is pruned. If any errors are found, the outcome of the pruned node's other child is tested. If errors are still found, the pruning operation is reversed.
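The prune-and-test loop just described can be sketched as follows. This is a self-contained toy sketch, not the paper's implementation: the Node class, its feature-dictionary interface, and the correct() predicate (standing in for testing the whole machine on the training set) are all assumptions.

```python
class Node:
    """Binary decision node testing one phonetic feature; a leaf holds an
    outcome (an arc behavior). A toy stand-in for the paper's trees."""
    def __init__(self, feature=None, yes=None, no=None, outcome=None):
        self.feature, self.yes, self.no, self.outcome = feature, yes, no, outcome

    def is_leaf(self):
        return self.outcome is not None

    def classify(self, phone_feats):
        if self.is_leaf():
            return self.outcome
        branch = self.yes if phone_feats[self.feature] else self.no
        return branch.classify(phone_feats)

def prune(tree, correct):
    """Collapse fringe nodes (nodes whose children are both leaves) to one
    child's outcome whenever correct(tree) still holds on the training
    set; revert any prune that introduces errors. Repeat until stable."""
    def fringe(node, out):
        if node.is_leaf():
            return
        if node.yes.is_leaf() and node.no.is_leaf():
            out.append(node)
        else:
            fringe(node.yes, out)
            fringe(node.no, out)

    changed = True
    while changed:
        changed = False
        nodes = []
        fringe(tree, nodes)
        for node in nodes:
            for child in (node.yes, node.no):
                saved = node.feature, node.yes, node.no, node.outcome
                node.feature = node.yes = node.no = None
                node.outcome = child.outcome  # tentatively prune
                if correct(tree):
                    changed = True
                    break
                node.feature, node.yes, node.no, node.outcome = saved  # revert
    return tree
```

On a toy flapping-like tree whose unstressed branch splits vacuously on \[cons\], this loop collapses the redundant split but correctly refuses to collapse the stressed/unstressed distinction, since doing so would break a training transduction.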
This process continues at the fringe of the decision tree until no more pruning is possible. Figure 10 shows the correct decision tree for flapping, obtained by pruning the tree in Figure 9.</Paragraph> <Paragraph position="8"> The process of pruning the decision trees is complicated by the fact that the pruning operations allowed at one state depend on the status of the trees at each other state. Thus it is necessary to make several passes through the states, attempting additional pruning at each pass, until no more improvement is possible. In addition, testing each pruning operation against the entire training set is expensive, but in the case of synthetic data it gives the best results. For other applications it may be desirable to keep a cross-validation set for this purpose.</Paragraph> </Section> </Paper>