<?xml version="1.0" standalone="yes"?>
<Paper uid="J00-2003">
  <Title>A Multistrategy Approach to Improving Pronunciation by Analogy</Title>
  <Section position="5" start_page="200" end_page="203" type="metho">
    <SectionTitle>
4. Previous Work and Extensions
</SectionTitle>
    <Paragraph position="0"> In this section, we briefly review our previous work on a single-strategy approach to PbA (Damper and Eastmond 1996, 1997). The basic purpose of the earlier work was to reimplement D&amp;N's system but to improve the scoring heuristic used to find the best path through the pronunciation lattice. To obtain a more realistic and relevant evaluation on a large corpus of real words, as opposed to a small set of pseudowords, we adopted the methodology of removing each word in turn from the lexicon and deriving a pronunciation by analogy with the remaining words. In the terminology of machine learning, this is called leave-one-out or n-fold cross-validation (Daelemans, van den Bosch, and Weijters 1997; van den Bosch 1997), where n is here the size of the dictionary. PbA has been used in this work to solve three string-mapping problems of importance in speech technology: letter-to-phoneme translation, phoneme-to-letter translation, and letter-to-stress conversion.</Paragraph>
    <Section position="1" start_page="201" end_page="202" type="sub_section">
      <SectionTitle>
4.1 Lexical Database
</SectionTitle>
      <Paragraph position="0"> The lexical database on which the analogy process is based is the 20,009 words of Webster's Pocket Dictionary (1974 edition), manually aligned by Sejnowski and Rosenberg (S&amp;R) (1987) for training their NETtalk neural network. The database is publicly available via the World Wide Web from URL: ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/dictionaries/. It has data arranged in columns:

aardvark  a-rdvark  1 &lt;&lt;&lt;&gt; 2 &lt;&lt;
aback     xb@k-     0 &gt; 1 &lt; &lt;
abacus    @bxkxs    1 &lt; 0 &gt; 0 &lt;
abaft     xb@ft     0 &gt; 1 &lt; &lt;
etc.</Paragraph>
      <Paragraph position="1">  Here the second column is the pronunciation, expressed in the phoneme symbols listed in S&amp;R's Appendix A, pp. 161-162. (In this paper, we retain the use of S&amp;R's phonetic symbols rather than transliterating to the symbols recommended by the International Phonetic Association. We do so to maintain consistency with S&amp;R's publicly available lexicon.) The phoneme inventory is of size 52, and has the advantage (for computer implementation) that all symbols are single characters from the ASCII set. The &amp;quot;-&amp;quot; symbol is the null phoneme, introduced to give a strict one-to-one alignment between letters and phonemes to satisfy the training requirements of NETtalk. The third column encodes the syllable boundaries for the words and their corresponding stress patterns. Stress is associated with vowel letters and arrows with consonants. The arrows point towards the stress nuclei and change direction at syllable boundaries. To this extent, &amp;quot;syllable boundary (right/left)&amp;quot; as above is a misnomer on the part of S&amp;R: indeed, this information is not adequate by itself to place syllable boundaries, which we will denote &amp;quot;|&amp;quot;. We can, however, infer four rules (or regular expressions) to identify syllable boundaries:

R1: [&lt; &gt;] =&gt; [&lt; | &gt;]
R2: [&lt; digit] =&gt; [&lt; | digit]
R3: [digit &gt;] =&gt; [digit | &gt;]
R4: [digit digit] =&gt; [digit | digit]

These have subsequently been confirmed as correct by Sejnowski and Rosenberg (personal communications).</Paragraph>
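      <Paragraph>
The four boundary rules can be sketched directly in code. The following is a minimal illustration (the function name is ours; the arrow characters are written via chr(60) and chr(62)):

```python
# Sketch of the Sejnowski-Rosenberg syllable-boundary rules R1-R4
LT, GT = chr(60), chr(62)  # the left- and right-pointing arrow characters

def insert_syllable_boundaries(stress):
    """Insert "|" wherever rules R1-R4 fire on a stress string."""
    if not stress:
        return stress
    out = [stress[0]]
    for a, b in zip(stress, stress[1:]):
        # R1-R4 reduce to: a boundary falls between a left arrow or
        # digit and a following right arrow or digit
        if (a == LT or a.isdigit()) and (b == GT or b.isdigit()):
            out.append("|")
        out.append(b)
    return "".join(out)
```

Applied to the aardvark and abacus stress strings above, this yields the segmentations aard|vark and ab|a|cus.
</Paragraph>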
      <Paragraph position="2"> Table 1 gives some examples of syllable boundaries and decoded &amp;quot;digit stress&amp;quot; patterns obtained using these rules (last row). By digit stress, we mean that the same stress-level code (one of 1, 2, or 0) is given to all letters--vowels and consonants--within a syllable. There are no &amp;quot;&gt;&amp;quot; or &amp;quot;&lt;&amp;quot; codes in the digit stress pattern. For letter-to-phoneme conversion, homonyms (413 entries) and the two one-letter words i and o were removed from the original NETtalk corpus to leave 19,594 entries. (The one-letter word a is absent from Webster's dictionary.) We do not, of course, contest the importance of homonyms in general. Here, however, we focus on the (sub)problem of pronouncing isolated word forms. The assumption is that there will be another, extended process in the future that handles sentence-level pronunciation. Excluding homonyms keeps the problem tractable and means that we can have meaningful scores: we did not want the same spelling to have different &amp;quot;correct&amp;quot; pronunciations, otherwise we would have to decide which to consider correct, or accept any of them. How would we make this decision? (Indeed, this was one of the problems in scoring Glushko's pseudowords as pronounced by D&amp;N's system.) To apply the analogy method to phoneme-to-letter conversion, we have again used the NETtalk corpus. In this case, on the same logic as above, homophones (463 entries) and one-phoneme pronunciations (/A/ and /o/ in S&amp;R's notation) were removed from this corpus. This left 19,544 pronunciations. Finally, we have used analogy to map letters to digit-stress patterns.</Paragraph>
    </Section>
    <Section position="2" start_page="202" end_page="203" type="sub_section">
      <SectionTitle>
4.2 Best Preliminary Results
</SectionTitle>
      <Paragraph position="0"> Preliminary results were obtained using a single scoring strategy in order to provide a baseline for comparison with the multistrategy approach to be described shortly. The best results (Table 2) for the three conversions--letter-to-phoneme, phoneme-to-letter, and letter-to-stress--were obtained by full pattern matching and using a weighted total product (TP) scoring. The latter is similar to Damper and Eastmond's TP score (Damper and Eastmond 1997), i.e., the sum of the product of the arc frequencies for all shortest paths giving the same pronunciation. In this case, however, each contribution to the total product is weighted by the number of letters associated with each arc.</Paragraph>
      <Paragraph position="1"> Table 2 Best single-strategy results (% correct) for the three conversion problems studied: letter-to-phoneme, phoneme-to-letter, and letter-to-stress. These were obtained by full pattern matching and using a weighted total product (TP) scoring.</Paragraph>
      <Paragraph position="2"> The percentage of words in which both pronunciation and stress are correct was 41.8%. By use of a very simple strategy for silence avoidance, the results for letter-to-phoneme conversion were marginally increased from 61.7% to 61.9% words correct and from 91.6% to 91.8% phonemes correct. The strategy adopted was simply to add a (null-labeled) arc in the case that there was no complete path through the pronunciation lattice, and a single break occurred between adjacent nodes. This corresponds to concatenation of two otherwise complete word fragments. These best results should be compared with 60.7% words correct and 91.2% phonemes correct, as previously obtained by Damper and Eastmond (1997, Table 2).</Paragraph>
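      <Paragraph>
The silence-avoidance strategy can be sketched as a small graph operation. Below, a lattice is reduced to node indices 0..n and a set of arcs; the names and representation are ours, an illustrative reconstruction rather than the authors' code:

```python
def bridge_single_break(arcs, n):
    """Silence-avoidance sketch: nodes are letter positions 0..n and
    arcs are (start, end) index pairs. If there is no complete path
    from 0 to n and exactly one break lies between adjacent nodes,
    add a bridging arc standing for a null-labeled lattice arc."""
    fwd = {0}                      # nodes reachable from the start
    changed = True
    while changed:
        changed = False
        for a, b in arcs:
            if a in fwd and b not in fwd:
                fwd.add(b)
                changed = True
    if n in fwd:
        return arcs                # lattice already complete
    bwd = {n}                      # nodes from which the end is reachable
    changed = True
    while changed:
        changed = False
        for a, b in arcs:
            if b in bwd and a not in bwd:
                bwd.add(a)
                changed = True
    breaks = [j for j in range(n) if j in fwd and (j + 1) in bwd]
    if len(breaks) == 1:           # single break: concatenate fragments
        return arcs | {(breaks[0], breaks[0] + 1)}
    return arcs
```

With two fragments covering positions 0-2 and 3-5 of a five-letter word, the single break between nodes 2 and 3 is bridged; lattices with a complete path, or with more than one break, are returned unchanged.
</Paragraph>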
      <Paragraph position="3"> 5. Information Fusion in Computational Linguistics In the introduction, we stated that our multistrategy approach is a special case of information (or data) fusion. What precisely is this? According to Hall and Llinas (1997, 7-8): The most fundamental characterization of... fusion involves a hierarchical transformation between observed ... parameters (provided by multiple sources as input) and a decision or inference.</Paragraph>
      <Paragraph position="4"> In principle, &amp;quot;fusion provides significant advantages over single source data&amp;quot; including &amp;quot;the statistical advantage gained by combining same-source data (e.g., obtaining an improved estimate ... via redundant observations)&amp;quot; (p. 6). However, dangers include &amp;quot;the attempt to combine accurate (i.e., good) data with inaccurate or biased data, especially if the uncertainties or variances of the data are unknown&amp;quot; (p. 8). Methods of information fusion include &amp;quot;voting methods, Bayesian inference, Dempster-Shafer's method, generalized evidence processing theory, and various ad hoc techniques&amp;quot; (Hall 1992, 135).</Paragraph>
      <Paragraph position="5"> Clearly, the above characterization is very wide ranging. Consequently, fusion has been applied to a wide variety of pattern recognition and decision theoretic problems--using a plethora of theories, techniques, and tools--including some applications in computational linguistics (e.g., Brill and Wu 1998; van Halteren, Zavrel, and Daelemans 1998) and speech technology (e.g., Bowles and Damper 1989; Romary and Pierrel 1989). According to Abbott (1999, 290), &amp;quot;While the reasons \[that\] combining models works so well are not rigorously understood, there is ample evidence that improvements over single models are typical .... A strong case can be made for combining models across algorithm families as a means of providing uncorrelated output estimates.&amp;quot; Our purpose in this paper is to study and exploit such fusion by model (or strategy) combination as a way of achieving performance gains in PbA.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="203" end_page="208" type="metho">
    <SectionTitle>
6. Multiple Strategies for PbA
</SectionTitle>
    <Paragraph position="0"> We have experimented with five different scoring strategies, used singly and in combination, in an attempt to improve the performance of PbA. The chosen strategies are by no means exhaustive, nor do we make any claim that they represent the &amp;quot;best&amp;quot; choices.</Paragraph>
    <Paragraph position="1"> Mostly, they were intuitively appealing measures that had different motivations, chosen in the hope that this would produce uncorrelated outputs (see quote from Abbott above). Also, in view of the remarks of Hall and Llinas about the potential dangers of fusion/combination, we chose deliberately to include some very simple strategies indeed, to see if they harmed performance.</Paragraph>
    <Paragraph position="2">  Figure 2. Example pronunciation lattice for the word longevity. For simplicity, only arcs contributing to the shortest (length-3) paths are shown and null arc labels are omitted. Phoneme symbols are those employed by Sejnowski and Rosenberg.</Paragraph>
    <Paragraph position="3"> In the following, full pattern matching has been used exclusively.</Paragraph>
    <Section position="1" start_page="204" end_page="204" type="sub_section">
      <SectionTitle>
6.1 Pronunciation Candidates
</SectionTitle>
      <Paragraph position="0"> Formally, the pronunciation lattice for a word can be seen as a set of N candidate pronunciations (corresponding to tied, shortest paths) with some features: L(Wi) = {C1, ..., Cj, ..., CN} is the lattice for the word Wi, with Cj, j in [1, N], denoting the candidates.</Paragraph>
      <Paragraph position="1"> Cj is a 3-tuple (Fj, Dj, Pj) where: Fj = {f1, ..., fn} represents the set of arc frequencies along the jth candidate path (of length n).</Paragraph>
      <Paragraph position="2"> Dj = {d1, ..., dk, ..., dn} represents the &amp;quot;path structure,&amp;quot; i.e., dk is the difference of the position indices (within the word) of the nodes at either end of the kth arc.</Paragraph>
      <Paragraph position="3"> Pj = {p1, ..., pm, ..., pl} is the candidate pronunciation, with each pm drawn from the set of phonemes (52 in our case) and l the length of the pronunciation. (Within the NETtalk corpus, primarily because of the use of the null phoneme, words and their pronunciations all have the same length.)</Paragraph>
      <Paragraph position="4"> For example, for the pronunciation lattice given in Figure 2 for the word longevity, we have the six candidate pronunciations shown in Table 3, along with their arc frequencies and path structures. Of course, candidate pronunciations are not necessarily distinct: different shortest paths can obviously correspond to the same phoneme string. In this example, the correct pronunciation is that corresponding to Candidates 4 and 6.</Paragraph>
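      <Paragraph>
The 3-tuple Cj = (Fj, Dj, Pj) maps naturally onto a small record type. A minimal sketch (field names and the example frequency values are ours, not the actual figures of Table 3):

```python
from typing import NamedTuple, Tuple

class Candidate(NamedTuple):
    """One candidate pronunciation Cj = (Fj, Dj, Pj)."""
    freqs: Tuple[int, ...]  # Fj: arc frequencies along the path
    spans: Tuple[int, ...]  # Dj: path structure (letters spanned per arc)
    phons: str              # Pj: the candidate phoneme string

# A hypothetical length-3 candidate for "longevity" (frequencies invented)
c1 = Candidate(freqs=(4, 1, 2), spans=(3, 2, 4), phons="lcGgEvxti")

# The arc spans cover the whole word, and the pronunciation has the
# same length as the word thanks to NETtalk's null phoneme
assert sum(c1.spans) == len("longevity") == len(c1.phons)
```
</Paragraph>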
    </Section>
    <Section position="2" start_page="204" end_page="206" type="sub_section">
      <SectionTitle>
6.2 Combining Scores
</SectionTitle>
      <Paragraph position="0"> According to Hall and Llinas (1997, 8), &amp;quot;observational data may be combined or fused at a variety of levels.&amp;quot; Since each of our strategies operates on the same basic data structure--the pronunciation lattice--the most obvious kind of fusion, and that employed here, is combination at the level of the final decision. In principle, fusion at this level is able to cope with the so-called common currency problem, whereby different sources of information produce incommensurate data of different types to which different physical units of measurement apply. Suppose, for instance, that combination is by a weighted summation in which the weights are learned from the data: in this case, the weights act to emphasize some measures while de-emphasizing others. Here, we use the rank of a pronunciation among the competing candidates as the basis of a weighting, or common currency.</Paragraph>
      <Paragraph position="1"> From a computational point of view, the main idea is to attribute points to each candidate for each scoring strategy. The number of points given to a candidate for scoring strategy si is inversely related to its rank order on the basis of si. Thus, the total number of points (T) awarded for each strategy is:</Paragraph>
      <Paragraph position="2"> T(N) = 1 + 2 + ... + N = N(N + 1)/2 (2)</Paragraph>
      <Paragraph position="3"> where N is the number of candidate pronunciations (N = 6 in our longevity example, so that T(6) = 21).</Paragraph>
      <Paragraph position="4"> Let cand(Rsi) express the number of candidates that have the rank R for the scoring strategy si, so that cand(Rsi) &gt; 1 if there are ties and cand(Rsi) = 1 otherwise. Then P(Cj, Rsi), the number of points awarded to the candidate Cj thanks to its rank on the basis of strategy si, is:</Paragraph>
      <Paragraph position="5"> P(Cj, Rsi) = (1/cand(Rsi)) sum_{k=0}^{cand(Rsi)-1} (N + 1 - Rsi - k) (3)</Paragraph>
      <Paragraph position="6"> Recently, Kittler et al. (1998) have considered the relative merits of several combination rules for decision-level fusion from a theoretical and experimental perspective.</Paragraph>
      <Paragraph position="7"> The rules compared were sum, product, max, min, and majority. Of these, the sum and product rules generally performed best. In view of this, these are the rules used here.</Paragraph>
      <Paragraph position="8"> For the sum rule, the final score for a candidate pronunciation, FS(Cj), is simply taken as the sum of the different numbers of points won for each of the S strategies. Since not all strategies are necessarily included:</Paragraph>
      <Paragraph position="9"> FS(Cj) = sum_{i=1}^{S} delta_si P(Cj, Rsi)</Paragraph>
      <Paragraph position="10"> and for the product rule:</Paragraph>
      <Paragraph position="11"> FS(Cj) = prod_{i=1}^{S} P(Cj, Rsi)^delta_si</Paragraph>
      <Paragraph position="12"> where delta_si is the Kronecker delta, which is 1 if strategy si is included in the combined score and 0 otherwise. Finally, the pronunciation corresponding to the candidate that obtains the best final score is chosen.</Paragraph>
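      <Paragraph>
The rank-into-points conversion and the two fusion rules can be sketched as follows. The exact tie-sharing expression did not survive extraction here, so this sketch simply lets tied candidates share the mean of the points of the positions they occupy, which preserves the per-strategy total T(N) = N(N + 1)/2; function names are ours:

```python
def rank_points(scores, maximize=True):
    """Convert one strategy's raw scores into rank points:
    N points for rank 1 down to 1 point for rank N, with tied
    candidates sharing the mean of their positions' points."""
    n = len(scores)
    order = sorted(range(n), key=lambda j: scores[j], reverse=maximize)
    points = [0.0] * n
    pos = 0
    while pos != n:
        k = pos + 1           # find the group tied with order[pos]
        while k != n and scores[order[k]] == scores[order[pos]]:
            k += 1
        share = sum(n - p for p in range(pos, k)) / (k - pos)
        for j in order[pos:k]:
            points[j] = share
        pos = k
    return points

def fuse(points_per_strategy, rule="sum"):
    """Decision-level fusion of per-strategy rank points."""
    n = len(points_per_strategy[0])
    if rule == "sum":
        return [sum(p[j] for p in points_per_strategy) for j in range(n)]
    out = [1.0] * n           # product rule
    for p in points_per_strategy:
        out = [out[j] * p[j] for j in range(n)]
    return out
```

Strategies whose best value is a minimum (rather than a maximum) pass maximize=False; the candidate with the highest fused score is chosen.
</Paragraph>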
    </Section>
    <Section position="3" start_page="206" end_page="208" type="sub_section">
      <SectionTitle>
6.3 Scoring Strategies
</SectionTitle>
      <Paragraph position="0"> Five different strategies have been used in deriving an overall pronunciation. In the following, we list each strategy, define it, and give the result (in a table below the formal definition) of applying the strategy to the six candidate pronunciations of the example word longevity (Table 3). For each strategy, points are awarded as determined by Equation (3) and total 21 in accordance with Equation (2).</Paragraph>
      <Paragraph position="1"> Strategy 1 This is the product of the arc frequencies (PF) along the shortest path.</Paragraph>
      <Paragraph position="2"> PF(Cj) = f1 f2 ... fn = prod_{k=1}^{n} fk</Paragraph>
      <Paragraph position="3"> The candidate scoring the maximum PF( ) value is given rank 1.</Paragraph>
      <Paragraph position="5"> The frequency of the same pronunciation (FSP), i.e., the number of occurrences of the same pronunciation within the (tied) shortest paths.</Paragraph>
      <Paragraph position="7"> FSP(Cj) = sum_{k} delta(Pj, Pk), with j != k and k in [1, N], where delta(Pj, Pk) is 1 if pronunciations Pj and Pk are identical and 0 otherwise.</Paragraph>
      <Paragraph position="8"> The candidate scoring the maximum FSP( ) value is given rank 1.</Paragraph>
      <Paragraph position="9"> The number of different symbols (NDS) counts the symbol positions at which a candidate differs from the other candidates: NDS(Cj) = sum_{k != j} sum_{i=1}^{l} delta_i(Pj, Pk)</Paragraph>
      <Paragraph position="10"> where delta_i( ) is the Kronecker delta, which is 1 if pronunciations Pj and Pk differ in position i and is 0 otherwise. Table 4 illustrates the computation of NDS( ) for Candidate 1 (/lcGgEvxti/).</Paragraph>
      <Paragraph position="11"> The candidate scoring the minimum NDS( ) value is given rank 1.</Paragraph>
      <Paragraph position="12">  Table 4. Illustration of the computation of NDS( ) for the candidate pronunciation /lcGgEvxti/ of longevity. Phonemes which are not equal to those of the target pronunciation are entered in bold.</Paragraph>
      <Paragraph position="13">  We now consider the results of using these five strategies for scoring the shortest paths both singly and combined in all possible combinations.</Paragraph>
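      <Paragraph>
The three strategies whose definitions survive above (PF, FSP, and NDS) can be sketched on toy candidate data; the data and dictionary keys below are invented for illustration:

```python
from math import prod

# Toy candidate set: arc frequencies and phoneme strings are invented
cands = [
    {"freqs": (4, 1, 2), "phons": "abc"},
    {"freqs": (2, 2, 2), "phons": "abd"},
    {"freqs": (1, 1, 5), "phons": "abd"},
]

def pf(cand):
    """Strategy 1: product of the arc frequencies along the path."""
    return prod(cand["freqs"])

def fsp(cands, j):
    """FSP: occurrences of candidate j's pronunciation among the
    other tied shortest paths (maximum value ranks first)."""
    return sum(1 for k, c in enumerate(cands)
               if k != j and c["phons"] == cands[j]["phons"])

def nds(cands, j):
    """NDS: symbol positions at which candidate j differs from the
    other candidates (minimum value ranks first)."""
    return sum(1 for k, c in enumerate(cands) if k != j
               for a, b in zip(cands[j]["phons"], c["phons"]) if a != b)
```

Each measure induces a ranking over the candidates, which is then converted into points and fused as described in Section 6.2.
</Paragraph>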
    </Section>
  </Section>
</Paper>