<?xml version="1.0" standalone="yes"?> <Paper uid="J96-3003"> <Title>Efficient Multilingual Phoneme-to-Grapheme Conversion Based on HMM</Title> <Section position="5" start_page="371" end_page="372" type="metho"> <SectionTitle> 5. Conclusion </SectionTitle> <Paragraph position="0"> We have presented a system for phoneme-to-grapheme conversion (PTGC) at the word level that uses the principles of hidden Markov models to statistically correlate the graphemic forms of a natural language with its phonemic forms. First- and second-order HMMs were created, and the Viterbi and N-best algorithms were used for the conversion. In the latter case, experimentation showed that no more than two solutions (output candidates) are necessary to produce the correct output with an accuracy higher than 96% for most of the languages the system was tested on. If four output candidates are allowed, this rate reaches 97% to 100%. Moreover, it must be noted that the success rate of the system, although already satisfactory, can be further improved by training on a larger corpus of selected texts.</Paragraph> <Paragraph position="1"> An important advantage of the system presented here, in comparison to rule-based or dictionary look-up systems, is that it produces only one (or at most a few) graphemic suggestions for each phonemic word. In the first case (one suggestion), no language model is needed to disambiguate potential homophones at the sentence level. In the second case (a few suggestions), the execution speed of the system is substantially higher than that of rule-based or dictionary-based systems, due to the small number of suggestions per word. The prototype system, which was implemented on a 486-based personal computer, responded at an average rate of one word per second for Exp 2 (second-order HMM) and about ten times faster for Exp 1 (first-order model). 
The fact that the algorithm scans the input word linearly (once, from beginning to end) means that it can work in parallel with other modules of a speech recognition system and produce output with a very short delay after the end of the input.</Paragraph> <Paragraph position="2"> Another advantage of this system is that it can work in any language in which the pronunciation of words is statistically dependent only on their spelling. The only language-specific part of the system, i.e., the algorithm for the segmentation rule definition, is straightforward and does not require any special linguistic knowledge, only familiarity with the target language to be processed.</Paragraph> <Paragraph position="3"> The system is not limited by any dictionary. This is a significant advantage in very large or unlimited-dictionary applications. An implication of this property is that the system does not try to match the input utterance to the closest word (by some measure of distance) contained in a dictionary but rather tries to find its most probable spelling. In this sense, the output of the PTGC system never misleads the final human user about what the input was.</Paragraph> <Paragraph position="4"> Note also that the system is symmetric between the two forms of a natural language: graphemic and phonemic. This implies that, without any modification, the algorithm can be used in the reverse direction (i.e., for a grapheme-to-phoneme conversion system, widely used in text-to-speech [or speech synthesis] systems) by simply interchanging the phonemic and graphemic data in the training procedure.</Paragraph> <Paragraph position="5"> Last but not least, the fact that the system is not rule based but uses an algorithm based on probabilities makes it possible to implement the system in hardware, resulting in a system adaptable to any real-time speech recognition system. 
As can be seen in the equations of the appendix, the algorithm is highly parallel: each value d_jk(t) is computed independently of the others from the values d_ij(t-1), so these calculations can be performed concurrently. In this manner, the response time of the complete algorithm can be proportional to N^2 rather than to N^3, yielding a system that can serve as a module for any real-time speech recognition system.</Paragraph> <Paragraph position="6"> In conclusion, the proposed method has the following advantages:</Paragraph> </Section> <Section position="6" start_page="372" end_page="374" type="metho"> <SectionTitle> Appendix: Implementation Notes </SectionTitle> <Paragraph position="0"> The fact that only multiplications are involved in the processing of the conversion algorithm led us to convert the algorithm to use only additions. Instead of using probabilities, we used their negative logarithms, thus yielding distances. This transformation offers two advantages. First, a four-byte integer representation is used for each number instead of a ten-byte floating-point representation, without any loss of accuracy, thus reducing memory requirements. Second, a substantial increase in processing speed is achieved, since fixed-point addition is faster than floating-point multiplication.</Paragraph> <Paragraph position="1"> Clearly, since the probability P is a number between 0 and 1, -log(P) is a number in the range 0...∞. In order to reduce computation, one of the two library-supplied logarithm functions had to be used, i.e., log_10 or log_e. It can easily be seen that if a > b, then -log_a(P) < -log_b(P). For this reason the natural logarithm (base e = 2.71828) was chosen instead of the decimal logarithm.</Paragraph> <Paragraph position="2"> To benefit from the above transformation, fixed-point arithmetic should be used (floating-point addition is as troublesome as floating-point multiplication, if not more so). 
At this point, we had to make decisions taking into account implementation-specific parameters. The system was implemented in the C programming language on a 486-based computer in protected mode, thus exploiting its full 32-bit architecture. The probabilities are first calculated using the widest available floating-point representation, which is 80 bits for a long double. The smallest nonzero value of P in this representation (and effectively its best resolution) is 3.4 × 10^-4932, corresponding to the greatest value of -log(P), which is 11,355.126. An unsigned long integer has a 32-bit dynamic range, which results in a maximum value of 2^32 - 1 = 4,294,967,295. Since for every state we need to add two distances, one from matrix A and one from matrix B, we must be sure that there will be no overflow after all the additions made for each word. The system as tested uses a maximum of 30 states per word, a limit that has not yet been exceeded by any word in any of the languages in which it was tested. This means that the maximum distance value must be 2^32/60 = 71,582,788, which results in a scaling factor f = 71,582,788/11,355.126 ≈ 6,304. By multiplying every distance by this factor and truncating the result to its integral part, it is guaranteed that there will be no overflow during the execution of the Viterbi algorithm. This allows the elimination of code that would check for overflow during the algorithm, resulting in much faster code.</Paragraph> <Paragraph position="3"> For reference, the complete algorithm converted to work with logarithms (as it was implemented) is presented below. Let a', b', π', and ρ' be the HMM parameters after the above transformation and normalization (e.g., a'_ijk = ⌊f · (-log_e(a_ijk))⌋, where f (= 6,304) is the factor that was used to facilitate fixed-point arithmetic). Then we inductively compute the locally minimum distance δ' 
and the path ψ as follows: Initialization</Paragraph> <Paragraph position="5"> where d* is a vector containing the E minimum distance values that correspond to the E state sequences Q_e = {q_t}, t = 1...T, e = 1...E, which are returned as the best (most probable) state sequences.</Paragraph> <Paragraph position="6"> Rentzepopoulos and Kokkinakis Phoneme-to-Grapheme Using HMM</Paragraph> </Section></Paper>