<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1062">
  <Title>Unlimited vocabulary speech recognition for agglutinative languages</Title>
  <Section position="3" start_page="488" end_page="489" type="metho">
    <SectionTitle>
3 Statistical properties of Finnish, Estonian and Turkish
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="488" end_page="489" type="sub_section">
      <SectionTitle/>
      <Paragraph position="0"> Before presenting the speech recognition results, some statistical properties are presented for the three agglutinative languages studied. If we consider choosing a vocabulary of the 50k-70k most common words, as usual in English broadcast news LVCSR systems, the out-of-vocabulary (OOV) rate in English is typically smaller than 1%. Using the language model training data the following OOV rates can be found for a vocabulary including only the most common words: 15% OOV for 69k in Finnish (Hirsimaki et al., 2005), 10% for 60k in Estonian and 9% for 50k in Turkish. As shown in (Hacioglu et al., 2003) this does not only mean the same amount of extra speech recognition errors, but even more, because the recognizer tends to lose track when unknown words get mapped to those that are in the vocabulary. Even doubling the vocabulary is not a suf- null for Turkish language ficient solution, because a vocabulary twice as large (120k) would only reduce the OOV rate to 6% in Estonian and 5% in Turkish. In Finnish even a 400k vocabulary of the most common words still gives 5% OOV in the language model training material.</Paragraph>
      <Paragraph position="1"> Figure 2 illustrates the vocabulary explosion encountered when using words and how using morphs avoids this problem for Turkish. The figure on the left shows the vocabulary growth for both words and morphs. The figure on the right shows the graph for morphs in more detail. As seen in the figure, the number of new words encountered continues to increase as the corpus size gets larger whereas the number of new morphs encountered levels off.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="489" end_page="490" type="metho">
    <SectionTitle>
4 Speech recognition experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="489" end_page="489" type="sub_section">
      <SectionTitle>
4.1 Selection of the recognition tasks
</SectionTitle>
      <Paragraph position="0"> In this work the morph-based language models have been applied in speech recognition for three different agglutinative languages, Finnish, Estonian and Turkish. The recognition tasks are speaker dependent and independent fluent dictation of sentences taken from newspapers and books, which typically require very large vocabulary language models.</Paragraph>
    </Section>
    <Section position="2" start_page="489" end_page="489" type="sub_section">
      <SectionTitle>
4.2 Finnish
</SectionTitle>
      <Paragraph position="0"> Finnish is a highly inflected language, in which words are formed mainly by agglutination and compounding. Finnish is also the language for which the algorithm for the unsupervised morpheme discovery (Creutz and Lagus, 2002) was originally developed.</Paragraph>
      <Paragraph position="1"> The units of the morph lexicon for the experiments in this paper were learned from a joint corpus containing newspapers, books and newswire stories of totally about 150 million words (CSC, 2001). We obtained a lexicon of 25k morphs by feeding the learning algorithm with the word list containing the 160k most common words. For language model training we used the same text corpus and the recently developed growing n-gram training algorithm (Siivola and Pellom, 2005). The amount of resulted n-grams are listed in Table 4. The average length of a morph is such that a word corresponds to 2.52 morphs including a word break symbol.</Paragraph>
      <Paragraph position="2"> The speech recognition task consisted of a book read aloud by one female speaker as in (Hirsimaki et al., 2005). Speaker dependent cross-word triphone models were trained using the first 12 hours of data and evaluated by the last 27 minutes. The models included tied state hidden Markov models (HMMs) of totally 1500 different states, 8 Gaussian mixtures (GMMs) per state, short-time mel-cepstral features (MFCCs), maximum likelihood linear transformation (MLLT) and explicit phone duration models (Pylkkonen and Kurimo, 2004). The real-time factor of recognition speed was less than 10 xRT with a 2.2 GHz CPU. However, with the efficient LVCSR decoder utilized (Pylkkonen, 2005) it seems that by making an even smaller morph lexicon, such as 10k, the decoding speed could be optimized to only a few times real-time without an excessive trade-off with recognition performance.</Paragraph>
    </Section>
    <Section position="3" start_page="489" end_page="490" type="sub_section">
      <SectionTitle>
4.3 Estonian
</SectionTitle>
      <Paragraph position="0"> Estonian is closely related to Finnish and a similar language modeling approach was directly applied to the Estonian recognition task. The text corpus used to learn the morph units and train the statistical language model consisted of newspapers and books, altogether about 55 million words (Segakorpus, 2005). At first, 45k morph units were obtained as the best subword unit set from the list of the 470k most common words in the corpora. For speeding up the recognition, the morph lexicon was afterwards reduced to 37k by splitting the rarest morphs (occurring in only one or two words) further into smaller ones. Corresponding growing n-gram language models as in Finnish were trained from the Estonian corpora resulting the n-grams in Table 4.</Paragraph>
      <Paragraph position="1"> The speech recognition task in Estonian consisted of long sentences read by 50 randomly picked held-out test speakers, 7 sentences each (a part of (Meister  et al., 2002)). Unlike the Finnish and Turkish microphone data, this data was recorded from telephone, i.e. 8 kHz sampling rate and narrow band data instead of 16 kHz and normal (full) bandwidth. The phoneme models were trained for speaker independent recognition using windowed cepstral mean subtraction and significantly more data (over 200 hours and 1300 speakers) than for the Finnish task. The speaker independence, together with the telephone quality and occasional background noises, made this task still a considerably more difficult one. Otherwise the acoustic models were similar cross-word triphone GMM-HMMs with MFCC features, MLLT transformation and the explicit phone duration modeling, except larger: 5100 different states and 16 GMMs per state. Thus, the recognition speed is also slower than in Finnish, about 20 xRT (2.2GHz CPU).</Paragraph>
    </Section>
    <Section position="4" start_page="490" end_page="490" type="sub_section">
      <SectionTitle>
4.4 Turkish
</SectionTitle>
      <Paragraph position="0"> Turkish is another a highly-inflected and agglutinative language with relatively free word order. The same Morfessor tool (Creutz and Lagus, 2005) as in Finnish and Estonian was applied to Turkish texts as well. Using the 360k most common words from the training corpus, 34k morph units were obtained.</Paragraph>
      <Paragraph position="1"> The training corpus consists of approximately 27M words taken from literature, law, politics, social sciences, popular science, information technology, medicine, newspapers, magazines and sports news.</Paragraph>
      <Paragraph position="2"> N-gram language models for different orders with interpolated Kneser-Ney smoothing as well as entropy based pruning were built for this morph lexicon using the SRILM toolkit (Stolcke, 2002). The number of n-grams for the highest order we tried (6grams without entropy-based pruning) are reported in Table 4. In average, there are 2.37 morphs per word including the word break symbol.</Paragraph>
      <Paragraph position="3"> The recognition task in Turkish consisted of approximately one hour of newspaper sentences read by one female speaker. We used decision-tree state clustered cross-word triphone models with approximately 5000 HMM states. Instead of using letter to phoneme rules, the acoustic models were based directly on letters. Each state of the speaker independent HMMs had a GMM with 6 mixture components. The HTK frontend (Young et al., 2002) was used to get the MFCC based acoustic features. The explicit phone duration models were not applied.</Paragraph>
      <Paragraph position="4"> The training data contained 17 hours of speech from over 250 speakers. Instead of the LVCSR decoder used in Finnish and Estonian (Pylkkonen, 2005), the Turkish evaluation was performed using another decoder (AT&amp;T, 2003), Using a 3.6GHz CPU, the real-time factor was around one.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="490" end_page="491" type="metho">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"> The recognition results for the three different tasks: Finnish, Estonian and Turkish, are provided in Tables 1 - 3. In each task the word error rate (WER) and letter error rate (LER) statistics for the morph-based system is compared to a corresponding word-based system. The resulting morpheme strings are glued to words according to the word break symbols included in the language model (see Section 2.2) and the WER is computed as the sum of substituted, inserted and deleted words divided by the correct number of words. LER is included here as well, because although WER is a more common measure, it is not comparable between languages. For example, in agglutinative languages the words are long and contain a variable amount of morphemes. Thus, any incorrect prefix or suffix would make the whole word incorrect. The n-gram language model statistics are given in Table 4.</Paragraph>
    <Paragraph position="1">  independent Estonian task consisting of read sentences recorded via telephone (see Section 4.3). For a reference (word-based) language model a 60k lexicon was used here.</Paragraph>
    <Paragraph position="2">  independent Turkish task consisting of read newspaper sentences (see Section 4.4). For the reference 50k (word-based) language model the accuracy given by 4 and 5-grams did not improve from that of 3-grams.</Paragraph>
    <Paragraph position="3"> In the Turkish recognizer the memory constraints during network optimization (Allauzen et al., 2004) allowed the use of language models only up to 5grams. The language model pruning thresholds were optimized over a range of values and the best results are shown in Table 3. We also tried the same experiments with two-pass recognition. In the first pass, instead of the best path, lattice output was generated with the same language models with pruning. Then these lattices were rescored using the nonpruned 6-gram language models (see Table 4) and the best path was taken as the recognition output.</Paragraph>
    <Paragraph position="4"> For the word-based reference model, the two-pass recognition gave no improvements. It is likely that the language model training corpus was too small to train proper 6-gram word models. However, for the morph-based model, we obtained a slight improvement (0.7 % absolute) by two-pass recognition.</Paragraph>
  </Section>
  <Section position="6" start_page="491" end_page="492" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> The key result of this paper is that we can successfully apply the unsupervised statistical morphs in large vocabulary language models in all the three experimented agglutinative languages. Furthermore, the results show that in all the different LVCSR tasks, the morph-based language models perform very well and constantly dominate the reference language model based on words. The way that the lexi- null that the Turkish language model was not prepared by the growing n-gram algorithm as the others and the model was limited to 6-grams.</Paragraph>
    <Paragraph position="1"> con is built from the word fragments allows the construction of statistical language models, in practice, for almost an unlimited vocabulary by a lexicon that still has a convenient size.</Paragraph>
    <Paragraph position="2"> The recognition was here restricted to agglutinative languages and tasks in which the language used is both rather general and matches fairly well with the available training texts. Significant performance variation in different languages can be observed here, because of the different tasks and the fact that comparable recognition conditions and training resources have not been possible to arrange. However, we believe that the tasks are still both difficult and realistic enough to illustrate the difference of performance when using language models based on a lexicon of morphs vs. words in each task. There are no directly comparable previous LVCSR results on the same tasks and data, but the closest ones which can be found are slightly over 20% WER for the Finnish task (Hirsimaki et al., 2005), slightly over 40 % WER for the Estonian task (Alumae, 2005) and slightly over 30 % WER for the Turkish task (Erdogan et al., 2005).</Paragraph>
    <Paragraph position="3"> Naturally, it is also possible to prepare a huge lexicon and still succeed in recognition fairly well (Saraclar et al., 2002; McTait and Adda-Decker, 2003; Hirsimaki et al., 2005), but this is not a very convenient approach because of the resulting huge language models or the heavy pruning required to keep  them still tractable. The word-based language models that were constructed in this paper as reference models were trained as much as possible in the same way as the corresponding morph language models.</Paragraph>
    <Paragraph position="4"> For Finnish and Estonian the growing n-grams (Siivola and Pellom, 2005) were used including the option of constructing the OOV words from phonemes as in (Hirsimaki et al., 2005). For Turkish a conventional n-gram was built by SRILM similarly as for the morphs. The recognition approach taken for Turkish involves a static decoding network construction and optimization resulting in near real time decoding. However, the memory requirements of network optimization becomes prohibitive for large lexicon and language models as presented in this paper. In this paper the recognition speed was not a major concern, but from the application point of view that is a very important factor to be taken into a account in the comparison. It seems that the major factors that make the recognition slower are short lexical units, large lexicon and language models and the amount of Gaussian mixtures in the acoustic model.</Paragraph>
  </Section>
class="xml-element"></Paper>