<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3005">
  <Title>Automatic Call Routing with Multiple Language Models</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. Modelling
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1. Model Structure
</SectionTitle>
      <Paragraph position="0"> Figure 1 shows the method used to produce an initial language model.</Paragraph>
      <Paragraph position="1"> The algorithm follows that described in [1]:  1. Build an n-gram language model (LM) using the dictionary transcriptions of the WSJ corpus (we used n=6). Make this the current LM.</Paragraph>
      <Paragraph position="2"> 2. Use the current LM in the recognizer to produce a set of phone strings.</Paragraph>
      <Paragraph position="3"> 3. Build a new LM based on the recognizer phone strings: 4. If niterations &lt;=threshold, goto 2 else  finish and produce a single language model for all routes.</Paragraph>
      <Paragraph position="4"> Phonotactic language model (WSJ) Figure. 1 The Iterative training procedure The phone strings are now segmented and clustered so that salient phone sequences for each route can be identified. This is done as follows:</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
FOR EACH ROUTE
</SectionTitle>
    <Paragraph position="0"> 1. Segment each recognized phone string in the route into all possible sequences of 3,4, ... , 9 phones.</Paragraph>
    <Paragraph position="1"> 2. Estimate the MI for each sequence, and identify the salient sequences as the sequences with the highest MI [11].</Paragraph>
    <Paragraph position="2"> 3. Cluster the salient sequences within the route. This is done by calculating and combining two measures of distance (using dynamic programming techniques) for each pair of sequences: * The Levensthein distance between the phone symbols representing the sequences.</Paragraph>
    <Paragraph position="3"> * The acoustic distance in &amp;quot;MFCC space&amp;quot; between the two waveform segments representing the sequences.</Paragraph>
    <Paragraph position="4"> 4. Use a simple lexicon pruning scheme that  eliminates long agglomerations of short primitives [12].</Paragraph>
    <Paragraph position="5"> At this point, we have generated a set of clustered phone sequences for each route. Each phone sequence corresponds to a sequence of frames, and the frame sequences within a cluster are used to build an HMM These HMMs are used later to estimate the class of a segment output by the recognizer (see section 3.2). Finally, we build a language model for each route, as follows by collecting together the recognised phonetic sequences of utterances from each route and using them to construct a language model.</Paragraph>
    <Paragraph position="6"> After iterating the LM, detection of key phonetic sequences improves. However, many utterances do not produce any sequences or produce several sequences from different routes. For recognition, we use a &amp;quot;divide and conquer&amp;quot; approach. Utterances that yield one or more sequences from the same route are classified immediately as that route, and utterances whose output is ambiguous, in that they yield no sequences, or sequences from several routes, or whose recognition confidence is too low to trust, are subject to a more detailed recognition pass in which separate LMs for each route are used. This has the advantage of only applying the extra computational effort required to use multiple LMs for those utterances that need this. In practice, if lattices are used, the additional computational effort is not too great. The confidence measure used was the measure available from the Nuance speech recognizer v8.0.</Paragraph>
    <Paragraph position="7"> Hence recognition proceeds as follows.</Paragraph>
    <Paragraph position="8">  1. A single language model is used in the recognizer to produce an output phone string.</Paragraph>
    <Paragraph position="9"> 2. Any phonetic sequences in the output string that also occur within any of the clusters of key phonetic sequences in any of the routes are found.</Paragraph>
    <Paragraph position="10"> 3. IF the number of key phonetic sequences found is one or more AND the sequences all belong to the same route:  the utterance is classified as belonging to this route. ELSEIF the number of key phonetic sequences is zero OR there are one or more sequences from different routes OR the confidence measure of the whole utterance is lower than some threshold: the utterance is re-recognized using all 18 language models.</Paragraph>
    <Paragraph position="11"> 4. Recognition using multiple language models works as follows. 18 recognized phonetic sequences are output, one from each recognizer (as shown in Figure 2), and key phonetic sequences are detected in each output.</Paragraph>
    <Paragraph position="12"> IF there are one or more sequences from different routes: Putative sections of the speech that contain keywords are identified by comparing the symbolic output of a recognizer using a certain LM with the sequences that were used to form the HMMs of the clustered key phonetic sequences for this LM. These HMMs are then used to determine the likelihood of each sequence given the output string, and the utterance is assigned to the route of the highest likelihood. ELSEIF the number of key phonetic sequences is  Call type classification is done using a vector-based approach as described in [8]. It is perhaps surprising that this classifier gets 100% accuracy (2553/2553) on utterances in which all the sequences are apparently from the same route--we attribute this to the fact that the 18 call-types were used were highly independent in their use of keywords.</Paragraph>
    <Paragraph position="13">  Key phonetic sequences can be incorrectly matched to incorrect segments of the utterance, causing false alarms. To combat this problem, we use matching in the acoustic domain as well as the symbolic domain. HMMs for 41 key phonetic sequences whose number of occurrences was larger than a threshold (we used 30) were built. Each key phonetic sequence was modelled by a five-state left-to-right HMM with no skips and each state is characterised by a mixture Gaussian state observation density. A maximum of 3 mixture components per state is used. The Baum-Welch algorithm is then used to estimate the parameters of the Gaussian densities for all states of subword HMM's. We use key phrase detection as described in [13][14]. By using the phonetic output from the recogniser, the position in the utterance waveform of putative strings can be identified, and this section of the waveform is input into the phonetic sequence HMMs. Detection of phrases is achieved by monitoring the forward probability of the data given the model at any time and searching for peaks in the probability. If full-likelihood recognition is used, we estimate the score ),( twS  is the forward probability of word w at time t [13]. In practice, we used the Viterbi equivalent of equation (1) to determine the likelihood. 4. Experiments 4.1. Phone accuracy based on one LM Figure 3 illustrates the effects of (a) using the recogniser output strings to construct a new language model as described in section 3.1; (b) using 18 different LMs as well as a single LM.  Rec-Phone: Build language model using recognised phonetic sequences of utterances from training set; Trans-Phone: Build language model using phoneme transcriptions of words of utterances from training set 1 LM: Recognition using one language model; 18LMs: Recognition using 18 language models.</Paragraph>
    <Paragraph position="14"> Figure 3 shows that the phone error rate is very much higher when recognised phone sequences (Rec-Phone) rather than dictionary transcriptions (Trans-Phone) are used to build an LM. However, an interesting point is that iterative performance decreases when the transcriptions are used, but increases when the recognised strings are used. This is probably because, when the recognised strings are used, the initial LM, which is trained on WSJ, does not reflect the distribution of n-grams in the data, and so performance is poor. However, the vocabulary in the data is quite small, so that after even a single recognition pass, although the error-rate is high, the new LM is a better reflection of the n-grams in the data. This has the effect of improving the phone recognition performance, and this improvement continues with each iteration.</Paragraph>
    <Paragraph position="15"> When we use an initial language model built using dictionary phoneme transcriptions, the performance is initially much better than using an LM trained on an independent corpus, as would be expected. However, because of the small vocabulary size and the relatively high number of occurrences of a few phonetic sequences, any errors in recognition of these sequences dominate, and this leads to an increasing overall error-rate. These results are not as good as those obtained by Hiyan [1] using an iterative language model. This may be because of the difference in the speech recognisers, or, more likely, in the average length of the phrases in the different vocabularies, which are much shorter than the phrases used here.</Paragraph>
    <Paragraph position="16">  call routing accuracy Table 1 shows the call-routing classification performance when a single LM is used and the LM is iterated. What is interesting here is that an apparently small increase in phone accuracy on iteration gives rise to a huge increase in call-routing accuracy. This is because although the overall phone error-rate improves only slightly, the error rate on the key phonetic sequences is greatly improved, leading to improved classification performance. Note that performance on this dataset when the dictionary translations of the transcriptions of the utterances are used is 93.7%.  Trans-Phone: language model built with dictionary phoneme transcriptions of the utterances; 1 LM: iterative language model built using recognition</Paragraph>
    <Paragraph position="18"> described in section 3.1.</Paragraph>
    <Paragraph position="19"> Table 2 compares the call-routing classification accuracies. The accuracy achieved using the two pass system with multiple LMs (86.5%) is much better than that using a single iterated LM, but not quite as good as that obtained by using the dictionary transcriptions. It could be argued that it is not possible to say whether the improvement shown in column 4 of Table 2 compared with column 3 is due to the use of multiple LMs or to the use of the HMM post-processor.</Paragraph>
    <Paragraph position="20"> However, when a single LM is used, the situation is either that there are one or more fairly unambiguous output sequences from a single call type, or there are many noisy and ambiguous sequences whose positions are not well-defined. It is very difficult to process these putative sequences with all the HMMs of key phonetic sequences. Using multiple LMs has the effect of producing relatively unambiguous sequences from only a small subset set of call-types, whose position in the waveform is quite well-defined. This reduces the number of HMM sequences that need to used and hence also the difficulty of application.</Paragraph>
  </Section>
</Paper>