<?xml version="1.0" standalone="yes"?>
<Paper uid="N01-1015">
  <Title>Re-Engineering Letter-to-Sound Rules</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Aligning the Lexicon
</SectionTitle>
    <Paragraph position="0"> Learning a mapping between sets of strings is difficult unless the task is suitably restricted or additional supervision is provided. Aligning the lexicon allows us to transform the learning task into a classification task to which standard machine learning techniques can be applied.</Paragraph>
    <Paragraph position="1"> Given a lexical entry we ideally would want to align each letter with zero or more phonemes in a way that minimizes the descriptions of the function performing the mapping and of the exceptions.</Paragraph>
    <Paragraph position="2"> Since we do not know how to do this efficiently, we chose to be content with an alignment produced by the first phase of the algorithm described in (Luk and Damper, 1996): we treat the strings to be aligned as bags of symbols, count all possible combinations, and use this to estimate the parameters for a zeroth-order Markov model.</Paragraph>
    <Paragraph position="3">  (a) t e x . t e t E k s t .</Paragraph>
    <Paragraph position="4"> (b) t e x t e . . . . .</Paragraph>
    <Paragraph position="5"> . . . . . t E k s t  where the dot represents the empty string (for reasons of visual clarity), also referred to as &amp;quot;. Alignment (b), while not as intuitively plausible as alignment (a), is possible as an extreme case. In general, when counting the combinations of ' letters with p phonemes, we want to include p empty letters and ' empty phonemes. For example, given the letters 'texte' and corresponding phonemes /tEkst/, we countCL(t;&amp;quot;) = 10,CL(t;t) = 4,CL(t;k) = 2, etc. By normalizing the counts we arrive at an empirical joint probability distribution ~PL for the lexicon. null The existing rewrite rules were another source of information. A rewrite rule is of the form</Paragraph>
    <Paragraph position="7"> where is usually a string of letters and a string of phonemes. The contextual restrictions expressed by and will be ignored. Typically and are very short, rarely consisting of more than four symbols. We created a second lexicon consisting of around 200 pairsh ; imentioned in the rewrite rules, and applied the same procedure as before to obtain counts CR and from those a joint probability distribution ~PR.</Paragraph>
    <Paragraph position="8"> The two empirical distributions were combined and smoothed by linear interpolation with a uniform distribution PU:</Paragraph>
    <Paragraph position="10"> effects of using different coefficient vectors ~ will be discussed in Section 4.</Paragraph>
    <Paragraph position="11"> Since we had available a library for manipulating weighted automata (Mohri et al., 2000), the alignments were computed by using negative log probabilities as weights for a transducer with a single state (hence equivalent to a zeroth-order Markov model), composing on the left with the letter string and on the right with the phoneme string, and finding the best path (Searls and Murphy, 1995; Mohri et al., 2000). This amounts to inserting &amp;quot;-symbols into both the string of letters and the string of phonemes in a way that minimizes the overall weight of the transduction, i. e. maximizes the probability of the alignment with respect to the model.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Building Training Instances
</SectionTitle>
    <Paragraph position="0"> Now we bring in additional restrictions that allow us to express the task of finding a function that maps letter sequences to phoneme sequences as the simpler task of inducing a mapping from a single letter to a single phoneme. This is a standard classification task, and once we have a set of feature functions and training instances we can choose from a multitude of learning algorithms and target representations. However, investigating the implications of different choices is not our goal.</Paragraph>
    <Paragraph position="1"> The first simplifying assumption is to pretend that translating an entire text amounts to translating each word in isolation (but see the discussion of liaison in Section 5 below). Secondly we make use of the fact that the pronunciation of a letter is in most cases fully determined by its local context, much more so in French (Laporte, 1997) than in English.</Paragraph>
    <Paragraph position="2"> Each letter is to be mapped to a phoneme, or the empty string &amp;quot;, in the case of &amp;quot;silent&amp;quot; letters (deletions). An additional mechanism is needed for those cases where a letter corresponds to more than one phoneme (insertions), e. g. the letter 'x' corresponding to the phonemes /ks/ in Figure 2a. The problem is the non-uniform appearance of an explicit empty string symbol that allows for insertions. We avoided having to build a separate classifier to predict these insertion points (see (Riley, 1991) in the context of pronunciation modeling) by simply pretending that an explicit empty string is present before each letter and after the last letter. This is illustrated in Figure 2b. Visual inspection of several aligned lexica revealed that at most one empty string symbol is needed between any two letters.</Paragraph>
    <Paragraph position="3"> From these aligned and padded strings we derived training instances by considering local windows of a fixed size. A context of size one requires a win- null (a) t e x . t e t E k s t .</Paragraph>
    <Paragraph position="4"> (b) . t . e . x . t . e .</Paragraph>
    <Paragraph position="5"> . t . E . k s t . . .</Paragraph>
    <Paragraph position="6">  dow of size three, which is centered on the letter aligned with the target phoneme. Figure 3 shows the first few training instances derived from the example in Figure 2b above. The beginning and end of the string are marked with a special symbol. Note that the empty string symbol only appears in the center of the window, never in the contextual part, where it would not convey any information.</Paragraph>
    <Paragraph position="8"/>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> We delineated a 90%/10% split of the lexicon and performed the alignment using a probability distribution with coefficients 1 = 0, 2 = 0:9, and 3 = 0:1, i. e., no information from the rewrite rules was used and the empirical probabilities derived from the lexicon were smoothed slightly. The value for 3 was determined empirically after several trial runs on a held-out portion. We then generated training instances as described in the previous section, and set aside the 10% we had earmarked earlier for testing purposes. We ran C4.5 on the remaining portion of the data, using the held out 10% for testing. Table 1 summarizes the following aspects of the performance of the induced decision tree classifiers on the test data relative to the size of context used for classification: classification accuracy per symbol; micro-averaged precision (P) and recall (R) per symbol; size of the tree in number of nodes; and size of the saved tree data in kilobytes.</Paragraph>
    <Paragraph position="1"> All trees were pruned and the subsetting option of C4.5 was used to further reduce the size of the trees.</Paragraph>
    <Paragraph position="2"> Further increasing the context size did not result in better performance. We did see a performance in-context acc. P R size of tree  crease, however, when we repeated the above procedure with different coefficients ~ . This time we set</Paragraph>
    <Paragraph position="4"> ular values were again determined empirically. The important thing to note is that the information from the rewrite rules is now dominant, as compared to before when it was completely absent. The effect this had on performance is summarized in Table 2 for three letters of context. As before, classification accuracy is given on a per-symbol basis; average accuracy per word is around 85%. Notice that the size of the tree decreases as a result of a better alignment.</Paragraph>
    <Paragraph position="5"> alignment acc. P R size of tree  These figures are all relative to our existing system. What is most important to us are the vast improvements in efficiency: the decision trees take up less than 10% of the space of the original letter-to-phoneme component, which weighs in at 6.7 MB total with composition deferred until runtime, since off-line composition would have resulted in an impractically large machine. The size of the original component could be reduced through the use of compression techniques (Kiraz, 1999), which would lead to an additional run-time overhead.</Paragraph>
    <Paragraph position="6"> Classification speed of the decision trees is on the order of several thousand letters per second (depending on platform details), which is many times faster than the existing system. The exact details of a speed comparison depend heavily on platform issues and what one considers to be the average case, but a conservative estimate places the speedup at a factor of 20 or more.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Directions for Further Research
</SectionTitle>
    <Paragraph position="0"> The tremendous gains in efficiency will enable us to investigate the use of additional processing modules that are not included in the existing system because they would have pushed performance below an acceptable bound. For example no sophisticated part-of-speech (POS) disambiguation is done at the moment, but would be needed to distinguish, e. g., between different pronunciations of French words ending in -ent, which could be verbs, nouns, adverbs, etc. The need for POS disambiguation is even clearer for languages with &amp;quot;deep&amp;quot; orthographies, such as English. In conjunction with shallow parsing, POS disambiguation would give us enough information to deal with most cases of liaison, an inter-word phenomenon that required special attention in the existing system and that we have so far ignored in the new approach because of the exclusive focus on regularities at the level of isolated words.</Paragraph>
    <Paragraph position="1"> We have been using the existing automaton-based system as our baseline, which is unfair because that system makes mistakes which could very well obscure some regularities the inductive approach might otherwise have discovered. Future comparisons should use an independent gold standard, such as a large dictionary, to evaluate and compare both approaches. The advantage of using the existing system instead of a dictionary is that we could generate large amounts of training data from corpora.</Paragraph>
    <Paragraph position="2"> But even with plenty of training data available, the paradigms of verbal inflections, for example, are quite extensive in French, inflected verb forms are typically not listed in a dictionary, and we cannot guarantee that sufficiently many forms appear in a corpus to guarantee full coverage. In this case it would make sense to use a hybrid approach that reuses the explicit representations of verbal inflections from the existing system.</Paragraph>
    <Paragraph position="3"> More importantly, having more training data available for use with our new approach would only help to a small extent. Though more and/or cleaner data would possibly result in better alignments, we do not expect to find vast improvements unless the restriction imposed by the zeroth-order Markov assumption used for alignment is dropped, which could easily be done. However, it is not clear that using a bigram or trigram model for alignment would optimize the alignment in such a way that the decision tree classifier learned from the aligned data is as small and accurate as possible.</Paragraph>
    <Paragraph position="4"> This points to a fundamental shortcoming of the usual two-step procedure, which we followed here: the goodness of an alignment performed in the first step should be determined by the impact it has on producing an optimal classifier that is induced in the second step. However, there is no provision for feedback from the second step to the first step. For this a different setup would be needed that would discover an optimal alignment and classifier at the same time. This, to us, is one of the key research questions yet to be addressed in learning letter-to-sound rules, since the quality of an alignment and hence the training data for a classifier learner is essential for ensuring satisfactory performance of the induced classifier. The question of which classifier (learner) to use is secondary and not necessarily specific to the task of learning letter-sound correspondences. null</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Relation to Existing Research
</SectionTitle>
    <Paragraph position="0"> The problem of letter-to-sound conversion is very similar to the problem of modeling pronunciation variation, or phonetic/phonological modeling (Miller, 1998). For pronunciation modeling where alternative pronunciations are generated from known forms one can use standard similarity metrics for strings (Hamming distance, Levenshtein distance, etc.), which are not meaningful for mappings between sequences over dissimilar alphabets, such as letter-to-phoneme mappings.</Paragraph>
    <Paragraph position="1"> General techniques for letter-to-phoneme conversion need to go beyond dictionary lookups and should be able to handle all possible written word forms. Since the general problem of learning regular mappings between regular languages is intractable because of the vast hypothesis space, all existing research on automatic methods has imposed restrictions on the class of target functions. In almost all cases, this paper included, one only considers functions that are local in the sense that only a fixed amount of context is relevant for mapping a letter to a phoneme.</Paragraph>
    <Paragraph position="2"> One exception to this is (Gildea and Jurafsky, 1995), where the target function space are the subsequential transducers, for which a limit-identification algorithm exists (Oncina et al., 1993). However, without additional guidance, that algorithm cannot be directly applied to the phonetic modeling task due to data sparseness and/or lack of sufficient bias (Gildea and Jurafsky, 1995). We would argue that the lack of locality restrictions is at the root of the convergence problems for that approach.</Paragraph>
    <Paragraph position="3"> Our approach effectively restricts the hypothesis space even further to include only the k-local (or strictlyk-testable) sequential transducers, where a classification decision is made deterministically and based on a fixed amount of context. We consider this to be a good target since we would like the letter-to-sound mapping to be a function (every piece of text has exactly one contextually appropriate phonetic realization) and to be deterministically computable without involving any kind of search.</Paragraph>
    <Paragraph position="4"> Locality gives us enough bias for efficiently learning classifiers with good performance. Since we are dealing with a restricted subclass of finite-state transducers, our approach is, at a theoretical level, fully consistent with the claim in (Sproat, 2000) that letter-phoneme correspondences can be expressed as regular relations. However, it must be stressed that just because something is finite-state does not mean it should be implemented directly as a finite-state automaton.</Paragraph>
    <Paragraph position="5"> Other machine learning approaches employ essentially the same locality restrictions. Different learning algorithms can be used, including Artificial neural networks (Sejnowski and Rosenberg, 1987; Miller, 1998), decision tree learners (Black et al., 1998), memory-based learners and hybrid symbolic approaches (Van den Bosch and Daelemans, 1993; Daelemans and van den Bosch, 1997), or Markov models. Out of these the approach in (Black et al., 1998) is most similar to ours, but it presupposes that phoneme strings are never longer than the corresponding letter strings, which is mostly true, but has systematic exceptions, e. g. 'exact' in English or French. English has many more exceptions that do not involve the letter 'x', such as 'cubism' (/kjubIz@m/ according to cmudict.0.6) or 'mutualsim'. null The problem of finding a good alignment has not received its due attention in the literature. Work on multiple alignments in computational biology cannot be adapted directly because the letter-to-sound mapping is between dissimilar alphabets. The alignment problem in statistical machine translation (Brown et al., 1990) is too general: long-distance displacement of large chunks of material may occur frequently when translating whole sentences, but are unlikely to play any role for the letter-to-sound mapping, though local reorderings do occur (Sproat, 2000). Ad hoc figures of merit for alignments (Daelemans and van den Bosch, 1997) or hand-corrected alignments (Black et al., 1998) might give good results in practice, but do not get us any closer to a principled solution. The present work is another step towards obtaining better alignments by exploiting easily available knowledge in a systematic fashion.</Paragraph>
  </Section>
</Paper>