File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/96/j96-3003_abstr.xml
Size: 22,709 bytes
Last Modified: 2025-10-06 13:48:39
<?xml version="1.0" standalone="yes"?> <Paper uid="J96-3003"> <Title>Efficient Multilingual Phoneme-to-Grapheme Conversion Based on HMM</Title> <Section position="2" start_page="0" end_page="357" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Phoneme-based speech recognition systems incorporate a phoneme-to-grapheme conversion (PTGC) module to produce orthographically correct output. Many approaches have been used, most of which compare the phonemic strings to a (usually application-specific) dictionary containing both the phonemic and the graphemic form of every word the system can handle (Laface, Micca, and Pieraccini 1987; Levinson et al. 1989, etc.). Considering the effort and cost required to create such a dictionary, this is a serious limitation, especially for inflectionally rich languages such as Greek and German.</Paragraph> <Paragraph position="1"> Another very important issue when searching for words in a dictionary is the number of candidates resulting from each phonemic input. Depending on the language and the errors of the recognizer, this number may be very large, rendering the disambiguation of the words by a subsequent language model a time-consuming and unreliable task.</Paragraph> <Paragraph position="2"> The domain of application is another factor that strongly influences conversion performance; a general dictionary can omit the specialized words of specific domains (e.g., legal, engineering, or medical terminology) and vice versa. Finally, applications that must handle a large number of proper names (e.g., directory service applications) generally cannot include all the possible names. The only remedy in such situations would be to increase the size of the reference dictionary so that every possible input word is included. A final consideration is the type of errors a dictionary-based PTGC system introduces when it encounters a word that is not contained in the dictionary: the system will produce the closest existing word (in its dictionary) as the best candidate, which may give a completely incomprehensible (if not wrong) meaning to the input phrase.</Paragraph> <Paragraph position="3"> Another approach to phoneme-to-grapheme conversion is the use of linguistic and/or heuristic rules (Kerkhoff and Wester 1987). This method works on a phoneme or syllable basis and can give adequate results in languages where the spelling closely follows the pronunciation (such as Italian). Nevertheless, languages with diphthongs or double letters cannot benefit from this method, since it creates long lists of homophonic candidates that are all correct (in the sense that they are pronounced as the input word) but that do not exist in the language. In Greek, for example, where the phoneme /i/ is the sound of five different graphemes (η, ι, υ, ει, οι) and the phoneme /l/ can come from λ and λλ, the phonemic form /m'ila/ would produce a list containing the following 10 transcriptions: μήλα, μίλα, μύλα, μείλα, μοίλα, μήλλα, μίλλα, μύλλα, μείλλα, and μοίλλα, all having the same pronunciation. From this list, only two represent existing, orthographically correct words: "μίλα" 'speak!' and "μήλα" 'apples.' Previous work has shown that an average of 30 graphemic candidates is produced by this transcription for every input phonemic word (Rentzepopoulos 1988).</Paragraph>
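This combinatorial blow-up is easy to reproduce. The snippet below is a small illustration (not part of the original system) that enumerates the homophonic spellings of /m'ila/ from the grapheme alternatives named in the text; a rule-based converter would have to offer all of them as candidates.

```python
from itertools import product

# Grapheme alternatives named in the text: /i/ has five spellings, /l/ has two.
i_spellings = ["ή", "ί", "ύ", "εί", "οί"]
l_spellings = ["λ", "λλ"]

# Every combination is pronounced /m'ila/; only μίλα and μήλα are real words.
candidates = ["μ" + i + l + "α" for i, l in product(i_spellings, l_spellings)]
print(len(candidates))   # 10
print(candidates)        # the full list of homophonic transcriptions
```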
<Paragraph position="4"> To overcome the disadvantages of the above-mentioned methods, a novel statistical approach to the problem of PTGC, based on hidden Markov models (HMM), has been investigated and is presented in this paper. Although statistical approaches have already been widely applied in several fields of natural language processing, they have not been considered for PTGC. The proposed method is language independent, does not use a dictionary, and can be applied with only minimal linguistic knowledge, thus reducing the cost of system development. Initially, a first-order HMM and the common Viterbi algorithm were used to provide a single transcription for each input word. In its current version, the method is based on a second-order HMM and on a modified Viterbi algorithm, which can provide more than one graphemic output for each phonemic input, in descending order of probability. The multiple outputs make it possible to apply a language model at the sentence level for disambiguation at a subsequent stage. This version of the algorithm raised the proportion of correctly transcribed phonemes to 97%-100% for most of the languages the system was tested on. The proposed system assumes that the word boundaries are known; that is, it is a subsequent stage in an isolated-word speech recognition system. The PTGC method can work as a stand-alone module or in cooperation with a look-up module using a small- to moderate-sized dictionary containing the most common words of the language. In the latter case, the look-up module employs a distance threshold: when the difference between the input and the words in the dictionary is greater than this threshold, control is passed to the HMM system, which converts the input phoneme string to graphemes.</Paragraph> <Paragraph position="5"> The basic theory, the pilot implementation, and the proposed final system are presented in Section 2. The evaluation procedure and the error-measure methodology are described in Section 3. In Section 4, the experimental results of the system are presented and the nature of the errors is discussed. The multilingual aspects of the algorithm and experimental results for seven languages are also given in that section.</Paragraph> <Paragraph position="6"> Finally, some conclusions are drawn about the system, and topics for further research and hardware implementation are discussed in Section 5.</Paragraph> <Paragraph position="7"> 2. Description of the System Before the presentation of the proposed system, a brief overview of the theory used and of the issues addressed in its application is given. This includes the basic hidden Markov model theory, the Viterbi algorithm, the N-best algorithm, and the solutions used to make the PTGC system fast and efficient, adequate for real-time applications.</Paragraph>
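As noted in the introduction, the converter can either run stand-alone or sit behind a small dictionary look-up module governed by a distance threshold. A minimal sketch of that control flow is given below; the function names, the distance measure, and the threshold are illustrative assumptions, not details taken from the original system.

```python
from typing import Callable, Dict, List

def convert_word(phonemes: str,
                 dictionary: Dict[str, str],              # phonemic form -> spelling
                 distance: Callable[[str, str], float],   # e.g., an edit distance
                 threshold: float,
                 hmm_convert: Callable[[str], List[str]]) -> List[str]:
    """Try the dictionary first; fall back to the HMM converter when no entry
    is close enough (the distance-threshold scheme described in the text)."""
    if dictionary:
        best = min(dictionary, key=lambda w: distance(phonemes, w))
        if distance(phonemes, best) <= threshold:
            return [dictionary[best]]       # close enough: trust the dictionary entry
    return hmm_convert(phonemes)            # unknown word: HMM phoneme-to-grapheme
```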
<Section position="1" start_page="352" end_page="353" type="sub_section"> <SectionTitle> 2.1 The First-Order Hidden Markov Model </SectionTitle> <Paragraph position="0"> An HMM can model any real-world process that changes states in time, provided that the state changes are more or less time independent (Hannaford and Lee 1990; Rabiner 1989). An HMM is used to describe statistical phenomena that can be considered sequences of hidden (i.e., not directly observable) states that produce observable symbols (Lee 1989). These phenomena are called hidden Markov processes. A hidden Markov process is described by a model λ that consists of the matrices A and B and the vector π.</Paragraph> <Paragraph position="1"> Matrix A contains the transition probabilities of the hidden states, matrix B contains the probability of occurrence of an observation symbol given the hidden state, and vector π contains the initial probabilities of the hidden states. In mathematical terms:</Paragraph> <Paragraph position="2"> \( A = \{a_{ij}\}, \quad a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N \) (1) \( B = \{b_j(m)\}, \quad b_j(m) = P(O_t = V_m \mid q_t = S_j), \quad 1 \le j \le N, \; 1 \le m \le M \) (2) \( \pi = \{\pi_i\}, \quad \pi_i = P(q_1 = S_i), \quad 1 \le i \le N \) (3) </Paragraph> <Paragraph position="3"> where N is the number of possible hidden states and M is the number of all the observable events. Obviously the dimension of matrix A is N x N, that of matrix B is N x M, and π is a vector of N elements.</Paragraph> <Paragraph position="4"> In equations (1)-(3), q_t is the hidden state of the system at time t, S_i is the i-th possible hidden state of the system, O_t is the observation symbol at time t, and V_m is the m-th possible observable symbol.</Paragraph> <Paragraph position="5"> For the application of HMM theory to PTGC, the correspondence of the natural-language intraword features to an HMM can be established on the following basis: the pronunciation of the language can be considered as the output (observation) of a system that uses as input (hidden-state sequence) the spelling of the language (Rentzepopoulos, Tsopanoglou, and Kokkinakis 1991).</Paragraph> <Paragraph position="6"> In this formulation, the sequence of phonemes produced by the system can be seen as the observation-symbol sequence of an HMM that uses the graphemic forms as a hidden-state sequence. With this statement, the PTGC problem can be restated as follows: given the observation-symbol sequence O(t) (phonemes) and the HMM λ, find the hidden-state sequence Q(t) (graphemes) that maximizes the probability P(O | Q, λ).</Paragraph> <Paragraph position="7"> A formal technique for finding the single best state sequence is based on dynamic programming and is the well-known Viterbi algorithm (Forney 1973; Viterbi 1967). In a word-level implementation, the algorithm must find the hidden-state sequence (i.e., the word in its orthographic form) with the best score, given the model λ and the observation sequence O (i.e., the word in its phonemic form). The algorithm proceeds recursively from the beginning to the end of the word, calculating at every time point (in the case of PTGC, time is the position of a phoneme/grapheme in the word) the score of the best path among all possible hidden-state sequences that end at the current state. The model's parameters can be estimated using the definition formulas, since both the hidden-state and the observation-symbol sequences are known during the training phase of the conversion system. Thus there is no need for a special estimation procedure such as the Baum-Welch algorithm (Rabiner 1989), which is used when the hidden-state sequence is not known. In general:</Paragraph> <Paragraph position="8"> \( \hat{a}_{ij} = \frac{n'(q_t = S_i, q_{t+1} = S_j)}{n'(q_t = S_i)}, \quad \hat{b}_j(m) = \frac{n'(q_t = S_j, O_t = V_m)}{n'(q_t = S_j)}, \quad \hat{\pi}_i = \frac{n'(q_1 = S_i)}{n'(q_1)} \) (4) </Paragraph> <Paragraph position="9"> where n(x) is the number of occurrences of x in the training corpus and n'(x) is an estimate of the number of occurrences of x in the application corpus. The size of the training corpus and the sparseness of the resulting matrices can lead to different approaches in the definition of the estimation function n'(x). If a reasonably large text is available for training, then n'(x) ≈ n(x). On the other hand, if the training data are insufficient (something that would result in a very sparse transition matrix), then a smoothing technique should be used for the estimation function n'(x) (Katz 1987; Ney and Essen 1991).</Paragraph>
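Because the grapheme (hidden) and phoneme (observed) sequences are both available during training, the estimation described above reduces to counting and normalizing. The following is a minimal sketch of that relative-frequency estimation, assuming the training corpus is already segmented into aligned grapheme/phoneme symbol sequences of equal length; the add-one bias introduced in Section 2.2 is not shown.

```python
from collections import defaultdict

def estimate_first_order_hmm(corpus):
    """corpus: iterable of (graphemes, phonemes) pairs, where both are lists of
    symbols of equal length, so every hidden state emits exactly one observation."""
    a = defaultdict(lambda: defaultdict(float))   # transition counts n(S_i -> S_j)
    b = defaultdict(lambda: defaultdict(float))   # emission counts  n(S_j, V_m)
    pi = defaultdict(float)                       # initial-state counts n(q_1 = S_i)

    for graphemes, phonemes in corpus:
        pi[graphemes[0]] += 1.0
        for g, p in zip(graphemes, phonemes):
            b[g][p] += 1.0
        for g1, g2 in zip(graphemes, graphemes[1:]):
            a[g1][g2] += 1.0

    def normalise(table):                         # relative frequencies, cf. formula (4)
        return {state: {sym: c / sum(row.values()) for sym, c in row.items()}
                for state, row in table.items()}

    n_words = sum(pi.values())
    return normalise(a), normalise(b), {s: c / n_words for s, c in pi.items()}
```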
</Section> <Section position="2" start_page="353" end_page="355" type="sub_section"> <SectionTitle> 2.2 Pilot System </SectionTitle> <Paragraph position="0"> To implement the above algorithm in PTGC, some decisions had to be made about the states, the observation symbols, and the transition probabilities. These decisions are listed below.</Paragraph> <Paragraph position="1"> a. Every hidden state should produce exactly one observation symbol. To achieve this, all the possible graphemic transcriptions of phonemes were coded as separate graphemic symbols (e.g., π and ππ are two different graphemic symbols even though they are both pronounced /p/).</Paragraph> <Paragraph position="2"> b. The transition probability matrix (A) should be biased to contain at least one occurrence for every transition and no zero elements.</Paragraph> <Paragraph position="3"> Consider first (a). According to the physical meaning given to the hidden states and the observation symbols of the HMM used, there cannot be hidden states (graphemes) that do not produce an observable symbol (phoneme). This is only partially correct for natural languages that include mute letters and diphthongs. To overcome this problem, the hidden-state alphabet and the observation-symbol alphabet should contain not only single characters (single graphemes or phonemes, respectively) but also clusters. This way, it is guaranteed that there will be no case where a sequence of graphemes produces a sequence of phonemes of a different length. The rules for the segmentation of a phoneme string into a sequence of symbols conforming to the above condition are manually defined off-line according to the procedure presented in an informal algorithmic language in Figure 1.</Paragraph> <Paragraph position="4"> [Figure 1: Segmentation rule development algorithm.]</Paragraph> <Paragraph position="5"> The meaning of this algorithm is the following: if a pair of phonemes is written as either a single grapheme or a pair of graphemes, then this pair is considered a single phonemic symbol. The same holds for the reverse procedure, when a pair of graphemes is pronounced as either a single phoneme or a pair of phonemes. For example (in Greek): grapheme ξ is pronounced /ks/, e.g., ξύδι (ksídi) 'vinegar'; grapheme κ is pronounced /k/, e.g., καλό (kaló) 'good'; grapheme σ is pronounced /s/, e.g., σαφή (safí) 'lucid'; graphemes κσ are pronounced /ks/, e.g., έκσταση (ékstasi) 'ecstasy'. In this example the pair of phonemes /ks/ is considered a single phonemic symbol. Accordingly, the pair "κσ" is also considered a single graphemic state, since it is pronounced as /ks/. As can be seen, in order to disambiguate the case of κσ, the phonemic symbol /ks/ and the graphemic state κσ must be introduced.</Paragraph> <Paragraph position="6"> This algorithm is the only language-specific part of the PTGC system, and its formulation requires only familiarity with the spelling of the language, not sophisticated linguistic knowledge. The rules are incorporated in the PTGC system, through an automated procedure, as a separate input function that parses the input strings into states.</Paragraph>
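The verbal rule above can be made concrete with a short sketch. The code below is an illustration of the idea only, not the algorithm of Figure 1: given a hand-written table of grapheme-to-phoneme correspondences, multi-character entries become cluster symbols, and input strings are then segmented by greedy longest match.

```python
# Hand-written grapheme-to-phoneme correspondences (illustrative subset from the
# Greek example in the text; a real table covers the whole orthography).
CORRESPONDENCES = {
    "ξ": "ks",    # ξύδι   /ksidi/
    "κ": "k",     # καλό   /kalo/
    "σ": "s",     # σαφή   /safi/
    "κσ": "ks",   # έκσταση /ekstasi/
}

def cluster_symbols(correspondences):
    """Multi-character entries become cluster symbols, so that one hidden state
    (grapheme symbol) always emits exactly one observation (phoneme symbol)."""
    phoneme_clusters = {p for p in correspondences.values() if len(p) > 1}
    grapheme_clusters = {g for g in correspondences if len(g) > 1}
    return phoneme_clusters, grapheme_clusters

def segment(string, clusters):
    """Greedy longest-match segmentation of a string into symbols."""
    out, i = [], 0
    ordered = sorted(clusters, key=len, reverse=True)
    while i < len(string):
        match = next((c for c in ordered if string.startswith(c, i)), string[i])
        out.append(match)
        i += len(match)
    return out

phoneme_clusters, grapheme_clusters = cluster_symbols(CORRESPONDENCES)
print(segment("ekstasi", phoneme_clusters))   # ['e', 'ks', 't', 'a', 's', 'i']
```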
<Paragraph position="9"> Now consider (b), concerning the transition probability matrix. Matrix A is established according to formula (4) through training on appropriate corpora, using n'(x) = max(n(x), 1). The bias described in (b) is necessary so that the algorithm does not discard a new transition but instead assigns a bad score to it (Rabiner 1989). The bias is one occurrence for each transition that has never occurred in the training corpus, and the model is normalized so that it fulfills the statistical constraints, i.e.:</Paragraph> <Paragraph position="10"> \( \sum_{j=1}^{N} a_{ij} = 1 \) (5) \( \sum_{m=1}^{M} b_j(m) = 1 \) (6) \( \sum_{i=1}^{N} \pi_i = 1 \) (7) </Paragraph> <Paragraph position="11"> This estimation is allowed since the training corpora are reasonably large, and the bias of one occurrence per transition has no significant effect on the validity of the actually nonzero matrix elements.</Paragraph> <Paragraph position="12"> Initially, a system based on a first-order HMM was implemented, and the results of its evaluation, detailed in Section 3, were promising. For Greek, this system gave an average score of 78% correctly transcribed words, while at the phoneme level the score reached 95.5% (Rentzepopoulos and Kokkinakis 1991). Similar rates were achieved in four other languages (English, French, German, and Italian) (Rentzepopoulos and Kokkinakis 1992).</Paragraph> <Paragraph position="13"> The model implemented as above showed some disadvantages. Therefore, a higher-order HMM and a multiple-output conversion algorithm were employed in order to overcome these disadvantages and achieve better results.</Paragraph> </Section> <Section position="3" start_page="355" end_page="356" type="sub_section"> <SectionTitle> 2.3 Second-Order HMM </SectionTitle> <Paragraph position="0"> Since the results of the first-order HMM system were encouraging, we decided to develop an improved version of the system. Two areas were selected for possible advancement: first, to make the system model the language in more detail, and second, to make the system produce more than one output solution for each phonemic input (homophones). This would offer a choice between alternatives, making it possible to find the best solution at a following stage.</Paragraph> <Paragraph position="1"> The first improvement was accomplished using a second-order HMM. This is a model that contains conditional probabilities of the form:</Paragraph> <Paragraph position="2"> \( a_{ijk} = P(q_t = S_k \mid q_{t-2} = S_i, q_{t-1} = S_j) \) (8) </Paragraph> <Paragraph position="3"> i.e., the probability of occurrence of state S_k when the two previous states are S_i and S_j at t - 2 and t - 1, respectively. The complete model needs a new matrix of conditional probabilities that contains the probability of the state pairs at times t = {1, 2}: \( \rho = \{\rho_{ij} : i = 1 \ldots N, \; j = 1 \ldots N\}, \quad \rho_{ij} = P(q_1 = S_i \mid q_2 = S_j) \) (9) So the complete model λ consists of {A, B, π, ρ}. The second-order HMM can be translated into a first-order HMM with an extended state space, in which state pairs are used as single states.</Paragraph> <Paragraph position="4"> To use the above model, a new version of the Viterbi algorithm should be employed, one which can recursively calculate the intermediate values of the probability measure δ using the second-order HMM. A second-order HMM has been introduced before (Kriouile, Mari, and Haton 1990) for other problems in the field of pattern analysis and speech recognition. In He (1988) the Viterbi algorithm is presented for a second-order HMM using the transformation of the model to a first-order model with an extended state space. The algorithm that was developed here uses the features of the Viterbi algorithm in a slightly different way, tailored to the needs of the PTGC problem, as described in Section 2.5.</Paragraph>
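The remark about the extended state space can be illustrated with a small sketch (an illustration under simplifying assumptions, not code from the paper): the pair (S_i, S_j) becomes a single state, the second-order probability a_ijk becomes the first-order transition (S_i, S_j) -> (S_j, S_k), and ρ supplies the initial probabilities of the pair states.

```python
def to_first_order(a2, rho):
    """a2[i][j][k] = P(q_t = S_k | q_(t-2) = S_i, q_(t-1) = S_j)   (equation (8))
       rho[i][j]   = probability attached to the state pair (S_i, S_j) at t = {1, 2}.
    Returns the initial probabilities and transitions of an equivalent first-order
    HMM whose states are the pairs (i, j)."""
    initial = {}       # (i, j) -> probability of starting in the pair state (i, j)
    transitions = {}   # ((i, j), (j, k)) -> probability of moving to the pair (j, k)
    for i, row in enumerate(a2):
        for j, col in enumerate(row):
            initial[(i, j)] = rho[i][j]
            for k, p in enumerate(col):
                transitions[((i, j), (j, k))] = p
    # The emission model need not change: the pair state (i, j) emits with b_j.
    return initial, transitions
```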
</Section> <Section position="4" start_page="356" end_page="356" type="sub_section"> <SectionTitle> 2.4 Multiple-Output (N-best) Conversion Algorithm </SectionTitle> <Paragraph position="0"> The Viterbi algorithm produces the overall best state sequence by maximizing the overall probability P(O | Q). If the N best state sequences are needed, then the algorithm must be modified to keep the N best state sequences from q_1 through q_t. Schwartz and Austin (1991) and Schwartz and Chow (1990) present the N-best algorithm in detail.</Paragraph> <Paragraph position="1"> The following consideration is the basis of the multiple-output conversion algorithm: let \( Q_t^E = \{Q_1(t), Q_2(t), \ldots, Q_E(t)\} \) be the E best hidden-state sequences that end at state q_t = S_i at a given time t. By "best" we mean, as usual, those sequences having the highest probability. If one of the globally E best hidden-state sequences that starts at t = 1 and ends at t = T passes through state S_i at time t, then it must have one of the members of \( Q_t^E \) as part of its path from time 1 to t.</Paragraph> <Paragraph position="2"> To prove this, only the following assumptions are required: Q_x(t) is a state sequence that ends at time t at state S_i; \( Q_x(t) \notin Q_t^E \); and Q_x(t) is part of one of the E globally best hidden-state sequences. Clearly, \( Q_x(t) \notin Q_t^E \Rightarrow P(Q_x(t)) < P(Q_i(t)), \; \forall i \in 1 \ldots E \). The probability of the complete state sequence Q_m(T) (1 ≤ m ≤ E) which contains Q_x(t) would be:</Paragraph> <Paragraph position="3"> \( P(Q_m(T)) = P(Q_x(t)) \cdot P(q_{t+1} \ldots q_T \mid q_t = S_i) < P(Q_i(t)) \cdot P(q_{t+1} \ldots q_T \mid q_t = S_i), \quad \forall i \in 1 \ldots E \) </Paragraph> <Paragraph position="4"> so there would be at least E paths leading to state S_i at time t that are more probable than Q_m(T), which is a path among the first E most probable paths; a contradictory statement.</Paragraph> <Paragraph position="5"> Summarizing, we have shown that we only need to keep the locally (at any time t in 1...T) E best paths as we go along the possible state sequences for every possible state. When we arrive at the end, we only need to keep the E globally highest probabilities and trace back the states that resulted in these.</Paragraph>
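A compact way to see the bookkeeping this implies is the sketch below. It is a simplified, first-order illustration of keeping the E locally best partial paths per state (the paper's actual algorithm is second-order and is given in its Appendix); log-probabilities are used so the inner loop needs only additions, anticipating the transformation mentioned in Section 2.5. The parameter layout (dictionaries pi, A, B) is an assumption made for the example.

```python
import math

def n_best_viterbi(obs, states, pi, A, B, E):
    """Keep, for every state and time step, the E locally best partial paths;
    at the end keep the E globally best complete paths (returned best first).
    pi[s], A[s][s2], B[s][symbol] are ordinary probabilities; obs is the
    phonemic symbol sequence (assumed non-empty)."""
    def log(p):
        return math.log(p) if p > 0.0 else float("-inf")

    # beams[s]: up to E (log-score, path) pairs for partial paths ending in s
    beams = {s: [(log(pi.get(s, 0.0)) + log(B[s].get(obs[0], 0.0)), [s])]
             for s in states}
    for o in obs[1:]:
        new_beams = {}
        for s in states:
            cands = [(score + log(A[prev].get(s, 0.0)) + log(B[s].get(o, 0.0)),
                      path + [s])
                     for prev in states
                     for score, path in beams[prev]]
            # keep only the E locally best partial paths per state (Section 2.4)
            new_beams[s] = sorted(cands, key=lambda c: c[0], reverse=True)[:E]
        beams = new_beams
    finals = sorted((c for paths in beams.values() for c in paths),
                    key=lambda c: c[0], reverse=True)[:E]
    return [(math.exp(score), path) for score, path in finals]
```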
</Section> <Section position="5" start_page="356" end_page="357" type="sub_section"> <SectionTitle> 2.5 Final System </SectionTitle> <Paragraph position="0"> The final version of the conversion system uses the previously mentioned methods, i.e., the second-order HMM and the N-best version of the Viterbi algorithm, along with a transformation that is necessary to speed up the execution of the conversion.</Paragraph> <Paragraph position="1"> The algorithm as described previously has many disadvantages for a PTGC system from the implementation point of view. The dimensions of the model are on the order of 100. This implies that, considering storage, we need to keep in memory 100 x 100 x 100 double-precision floating-point numbers for matrix A, along with the other data of the model and the algorithm. To be exact, we need: for A, N^3 double-precision floating-point numbers; for B, N x M double-precision floating-point numbers; for π, N double-precision floating-point numbers; for ρ, N^2 double-precision floating-point numbers; for δ, N^2 x E x T double-precision floating-point numbers; and for ψ, N^2 x E x T short integers (for the meaning of δ and ψ, see the algorithm presented in the Appendix).</Paragraph> <Paragraph position="2"> If we substitute into the above the values of the Greek PTGC system (N = 140, M = 70, T = 30, E = 4), we can see that we need no less than 45,708,320 bytes to store these data. Aside from the problem of storage, the computer has to execute the inner part of the second-order, multiple-output conversion algorithm N^3 x E x T/2 times (the average length of a word is about T/2), i.e., 219,520,000 times per word. As is clear from the presentation of the algorithm, this part contains a rather time-consuming sorting procedure plus a floating-point multiplication. It is obvious that this is an unacceptable time delay for real-time applications.</Paragraph> <Paragraph position="3"> To decrease the algorithm's execution time and storage needs, we introduced the following improvements:</Paragraph> <Paragraph position="4"> a. Taking advantage of the relative sparseness of matrix B, we first determine whether B_j(t) is nonzero, and only then does the algorithm proceed to the rest of the processing. This has decreased the execution time of the conversion by nearly a factor of 100.</Paragraph> <Paragraph position="5"> b. We do the same for matrix A. This means that if the indices (i, j, k) indicate a zero transition probability, then the algorithm proceeds without trying to calculate the overall probability, thus eliminating a floating-point multiplication.</Paragraph> <Paragraph position="6"> c. Since at every time point the intermediate variable δ(t) is calculated only from δ(t - 1), we keep only two copies of δ_ij, one for t and one for t - 1.</Paragraph> <Paragraph position="7"> Finally, the fact that only multiplications are involved in the processing of the conversion algorithm led us to transform the algorithm to use only additions. The algorithm we implemented is presented in the Appendix.</Paragraph> </Section> </Section> </Paper>