<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1001"> <Title>Effective Utterance Classification with Unsupervised Phonotactic Models</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Utterance Classification Method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Runtime Operation </SectionTitle> <Paragraph position="0"> The runtime operation of our utterance classification method is simple. It involves applying two models (which are trained as described in the next subsection): a statistical n-gram phonotactic model and a phone string classification model. At runtime, the phonotactic model is used by an automatic speech recognition system to convert a new input utterance into a phone string, which is then mapped to an output class by applying the classification model. (We will often refer to an output class as an &quot;action&quot;, for example, transfer to a specific call-routing destination.) The configuration at runtime is as shown in Figure 1. More details about the specific recognizer and classifier components used in our experiments are given in Section 3.</Paragraph> <Paragraph position="1"> The classifier can optionally make use of more information about the context of an utterance to improve the accuracy of mapping to actions. As noted in Figure 1, in the experiments presented here, we use a single additional feature as a proxy for the utterance context, specifically, the identity of the spoken prompt that elicited the utterance. It should be noted, however, that inclusion of such additional information is not central to the method: whether, and how much, context information to include to improve classification accuracy will depend on the application. Other candidate aspects of context may include the dialog state, the day of the week, the role of the speaker, and so on.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Training Procedure </SectionTitle> <Paragraph position="0"> Training is divided into two phases. First, train a phone n-gram model using only the training utterance speech files and a domain-independent acoustic model. Second, train a classification model mapping phone strings and prompts (the classifier inputs) to actions (the classifier outputs).</Paragraph> <Paragraph position="1"> The recognition training phase is an iterative procedure in which a phone n-gram model is refined successively: the phone strings resulting from the current pass over the speech files are used to construct the phone n-gram model for the next iteration. In other words, this is a &quot;Viterbi re-estimation&quot; or &quot;1-best re-estimation&quot; process. We currently only re-estimate the n-gram model, so the same general-purpose HMM acoustic model is used for ASR decoding in all iterations. Other more expensive n-gram re-estimation methods can be used instead, including ones in which successive n-gram models are re-estimated from n-best or lattice ASR output. Candidates for the initial model used in this procedure are an unweighted phone loop or a general-purpose phonotactic model for the language being recognized.</Paragraph> <Paragraph position="2"> The steps of the training process are as follows. (The procedure is depicted in Figure 2; a schematic code sketch of the loop is given after the list.) 1. Set the phone string model G to an initial phone string model. Initialize the n-gram order N to 1.</Paragraph> <Paragraph position="3"> (Here 'order' means the size of the n-grams, so for example 2 means bi-grams.) 2. Set S to the set of phone strings resulting from recognizing the training speech files with G (after possibly adjusting the insertion penalty, as explained below).</Paragraph> <Paragraph position="4"> 3. Estimate an n-gram model G' of order N from the set of strings S.</Paragraph> <Paragraph position="5"> 4. If N < Nmax, set N ← N + 1 and G ← G', and go to step 2; otherwise continue with step 5.</Paragraph> <Paragraph position="6"> 5. For each recognized string s ∈ S, construct a classifier input pair (s, r), where r is the prompt that elicited the utterance recognized as s.</Paragraph> <Paragraph position="7"> 6. Train a classification model M to generalize the training function f : (s, r) → a, where a is the action associated with the utterance recognized as s. 7. Return the classifier model M and the final n-gram model G' as the results of the training procedure.</Paragraph>
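To make the control flow of the loop above concrete, here is a minimal sketch in Python. The callables for recognition, n-gram estimation, and classifier training are placeholders supplied by the caller; they are illustrative assumptions, not components named in the paper.

```python
# Minimal sketch of the training procedure above.  The helper callables
# (recognize, estimate_ngram, train_classifier) are placeholders, not
# components named in the paper.

def train_models(speech_files, prompts, actions,
                 recognize, estimate_ngram, train_classifier,
                 initial_model, n_max=5):
    """Unsupervised phonotactic-model training followed by classifier training."""
    G, n = initial_model, 1                    # step 1: e.g. an unweighted phone loop, order 1
    while True:
        # Step 2: 1-best phone strings for all training utterances with the current model.
        S = [recognize(f, model=G) for f in speech_files]
        # Step 3: estimate an n-gram model G' of order n from those strings.
        G_prime = estimate_ngram(S, order=n)
        # Step 4: raise the order and re-recognize, or stop once n reaches n_max.
        if n < n_max:
            n, G = n + 1, G_prime
        else:
            break
    # Steps 5-6: classifier inputs are (phone string, eliciting prompt) pairs;
    # note that S holds the strings from the last recognition pass, as in step 5.
    pairs = list(zip(S, prompts))
    M = train_classifier(pairs, actions)
    # Step 7: both models are used at runtime (Figure 1).
    return M, G_prime
```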
Instead of increasing the order N of the phone n-gram model during re-estimation, an alternative would be to iterate Nmax times with a fixed n-gram order, possibly with successively increased weight being given to the language model vs. the acoustic model in ASR decoding. One issue that arises in the context of unsupervised recognition without transcription is how to adjust recognition parameters that affect the length of recognized strings. In conventional training of recognizers from word transcriptions, a &quot;word insertion penalty&quot; is typically tuned after comparing recognizer output against transcriptions. To address this issue, we estimate the expected speaking rate (in phones per second) for the relevant type of speech (human-computer interaction in these experiments). The token insertion penalty of the recognizer is then adjusted so that the speaking rate for automatically detected speech in a small sample of training data approximates the expected speaking rate.</Paragraph> </Section> </Section>
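As an illustration of the calibration step just described, the following sketch searches a small grid of insertion penalties for the one whose decoded speaking rate best matches the expected rate. The recognizer interface, its insertion_penalty attribute, and its decode() method are hypothetical; the paper does not specify this API.

```python
# Rough sketch of the speaking-rate calibration described above.  The
# recognizer object, its insertion_penalty attribute, and its decode() method
# are assumed interfaces for illustration, not the components used in the paper.

def calibrate_insertion_penalty(recognizer, sample_files, target_rate,
                                penalties=(-4.0, -2.0, 0.0, 2.0, 4.0)):
    """Pick the token insertion penalty whose decoded speaking rate
    (phones per second of detected speech) is closest to target_rate."""
    best_penalty, best_gap = None, float("inf")
    for p in penalties:
        recognizer.insertion_penalty = p
        phones, seconds = 0, 0.0
        for f in sample_files:
            result = recognizer.decode(f)       # assumed API: returns phones + timing
            phones += len(result.phones)
            seconds += result.speech_duration   # automatically detected speech only
        rate = phones / max(seconds, 1e-9)
        gap = abs(rate - target_rate)
        if gap < best_gap:
            best_penalty, best_gap = p, gap
    return best_penalty
```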
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experimental Setup </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Data </SectionTitle> <Paragraph position="0"> Three collections of utterances from different domains were used in the experiments. Domain A is the one studied in the previously cited experiments (Gorin et al., 1999; Levit et al., 2001; Petrovska-Delacretaz et al., 2000). Utterances for domains B and C are from similar interactive spoken natural language systems.</Paragraph> <Paragraph position="1"> Domain A. The utterances being classified are the customer side of live English conversations between AT&T residential customers and an automated customer care system. This system is open to the public, so the number of speakers is large (several thousand). There were 40106 training utterances and 9724 test utterances. The average length of an utterance was 11.29 words. The split between training and test utterances was such that the utterances from a particular call were either all in the training set or all in the test set. There were 56 actions in this domain. Some utterances had more than one action associated with them, the average number of actions associated with an utterance being 1.09.</Paragraph> <Paragraph position="2"> Domain B. This is a database of utterances from an interactive spoken language application relating to product line information. There were 10470 training utterances and 5005 test utterances. The average length of an utterance was 3.95 words. There were 54 actions in this domain. Some utterances had more than one action associated with them, the average number of actions associated with an utterance being 1.23.</Paragraph> <Paragraph position="3"> Domain C. This is a database of utterances from an interactive spoken language application relating to consumer order transactions (reviewing order status, etc.) in a limited domain. There were 14355 training utterances and 5000 test utterances. The average length of an utterance was 8.88 words. There were 93 actions in this domain. Some utterances had more than one action associated with them, the average number of actions associated with an utterance being 1.07.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Recognizer </SectionTitle> <Paragraph position="0"> The same acoustic model was used in all the experiments reported here, i.e. for experiments with both the phone-based and word-based utterance classifiers. This model has 42 phones and uses discriminatively trained 3-state HMMs with 10 Gaussians per state. It uses feature space transformations to reduce the feature space to 60 features prior to discriminative maximum mutual information training. This acoustic model was trained by Andrej Ljolje and is similar to the baseline acoustic model used for experiments with the Switchboard corpus, an earlier version of which is described by Ljolje et al. (2000).</Paragraph> <Paragraph position="1"> (Like the model used here, the baseline model in those experiments does not involve speaker and environment normalizations.) The n-gram phonotactic models used were represented as weighted finite state automata. These automata (with the exception of the initial unweighted phone loop) were constructed using the stochastic language modeling technique described by Riccardi et al. (1996). This modeling technique, which includes a scheme for backing off to probability estimates for shorter n-grams, was originally designed for language modeling at the word level.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Classifier </SectionTitle> <Paragraph position="0"> A variety of classification algorithms could be used in our utterance classification method. For the experiments reported here we use the BoosTexter classifier (Schapire and Singer, 2000). Among the alternatives are decision trees (Quinlan, 1993) and support vector machines (Vapnik, 1995). BoosTexter was originally designed for text categorization. It uses the AdaBoost algorithm (Freund and Schapire, 1997; Schapire, 1999), a wide margin machine learning algorithm. At training time, AdaBoost selects features from a specified space of possible features and associates weights with them. A distinguishing characteristic of the AdaBoost algorithm is that it places more emphasis on training examples that are difficult to classify. The algorithm does this by iterating through a number of rounds: at each round, it imposes a distribution on the training data that gives more probability mass to examples that were difficult to classify in the previous round. In our experiments, 500 rounds of boosting were used; each round allows the selection of a new feature and the adjustment of weights associated with existing features. In the experiments, the possible features are identifiers corresponding to prompts, and phone n-grams or word n-grams (for the phone and word-based methods respectively) up to length 4.</Paragraph> </Section>
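The feature space just described (phone n-grams up to length 4 plus a prompt identifier) can be illustrated with generic components. The sketch below uses scikit-learn's AdaBoostClassifier as a stand-in for BoosTexter, so it only loosely mirrors the 500-round boosting setup and is not the paper's implementation; the example utterances and prompt identifiers are made up.

```python
# Illustrative feature space: phone n-grams (length 1-4) plus a prompt id.
# scikit-learn stands in for BoosTexter here; this is not the paper's code.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Each training example is (phone_string, prompt_id); y is the action label.
# These two examples are invented for illustration only.
examples = [("k ae n s ax l m ay b ih l ih ng", "top_level_prompt"),
            ("aa p ax r ey t er", "top_level_prompt")]
actions = ["cancel_billing", "operator"]

phone_ngrams = make_pipeline(
    FunctionTransformer(lambda X: [s for s, _ in X]),      # keep the phone string
    CountVectorizer(analyzer="word", token_pattern=r"\S+",  # phones are whitespace tokens
                    ngram_range=(1, 4), binary=True),
)
prompt_id = make_pipeline(
    FunctionTransformer(lambda X: [[p] for _, p in X]),     # keep the prompt identifier
    OneHotEncoder(handle_unknown="ignore"),
)
model = make_pipeline(make_union(phone_ngrams, prompt_id),
                      AdaBoostClassifier(n_estimators=500)) # 500 boosting rounds
model.fit(examples, actions)
```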
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Experimental Conditions </SectionTitle> <Paragraph position="0"> Three experimental conditions are considered. The suffixes (M and H) in the condition names refer to whether the two training phases (i.e. training for recognition and classification, respectively) use inputs produced by machine (M) or human (H) processing.</Paragraph> <Paragraph position="1"> PhonesMM This experimental condition is the method described in this paper, so no human transcriptions are used. Unsupervised training from the training speech files is used to build a phone recognition model. The classifier is trained on the phone strings resulting from recognizing the training speech files with this model. At runtime, the classifier is applied to the results of recognizing the test files with this model. The initial recognition model for the unsupervised recognition training process was an unweighted phone loop. The final n-gram order used in the recognition training procedure (Nmax in section 2) was 5.</Paragraph> <Paragraph position="2"> WordsHM Human transcriptions of the training speech files are used to build a word trigram model. The classifier is trained on the word strings resulting from recognizing the training speech files with this word trigram model. At runtime, the classifier is applied to the results of recognizing the test files with the word trigram model.</Paragraph> Table 1: Examples of phone sequences learned by the training procedure from domain A training speech files.
Learned phone sequence | Corresponding words
b ih l ih | billing
k ao l z | calls
n ah m b | number
f aa n | phone
r ey t | rate
k ae n s | cancel
aa p ax r | operator
aw t m ay | what my
ch eh k | check
m ay b | my bill
p ae n ih | company
s w ih ch | switch
er n ae sh | international
v ax k w | have a question
l ih ng p | billing plan
r ey t s | rates
k t uw p | like to pay
ae l ax n | balance
m er s er | customer service
r jh f ao | charge for
<Paragraph position="3"> WordsHH Human transcriptions of the training speech files are used to build a word trigram model. The classifier is trained on the human transcriptions of the training speech files. At runtime, the classifier is applied to the results of recognizing the test files with the word trigram model.</Paragraph> <Paragraph position="4"> For all three conditions, median recognition and classification time for test data was less than real time (i.e. the duration of the test speech files) on current microprocessors. As noted earlier, the acoustic model, the number of boosting rounds, and the use of prompts as an additional classification feature are the same for all experimental conditions.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.5 Example learned phone sequences </SectionTitle> <Paragraph position="0"> To give an impression of the kind of phone sequences resulting from the automatic training procedure and applied by the classifier at runtime, see Table 1. The table lists some examples of such phone strings learned from domain A training speech files, together with the English words, or parts of words, to which they may correspond.
(Of course, the words play no part in the method and are only included for expository purposes.) The phone strings are shown in the DARPA phone alphabet.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Classification Accuracy </SectionTitle> <Paragraph position="0"> In this section we compare the accuracy of our phone-string utterance classification method (PhonesMM) with methods (WordsHM and WordsHH) using manual transcription and word string models.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Accuracy Metric </SectionTitle> <Paragraph position="0"> The results are presented as utterance classification rates, specifically the percentage of utterances in the test set for which the predicted action is valid. Here a valid prediction means that the predicted action is the same as one of the actions associated with the test utterance by a human labeler. (As noted in section 3, the average number of actions associated with an utterance was 1.09, 1.23, and 1.07 for domains A, B, and C, respectively.) In this metric we only take into account a single action predicted by the classifier, i.e. this is &quot;rank 1&quot; classification accuracy, rather than the laxer &quot;rank 2&quot; classification accuracy (where the classifier is allowed to make two predictions) reported by Gorin et al. (1999) and Petrovska-Delacretaz et al. (2000).</Paragraph> <Paragraph position="1"> In practical applications of utterance classification, user inputs are rejected if the confidence of the classifier in making a prediction falls below a threshold appropriate to the application. After rejection, the system may, for example, route the call to a human or reprompt the user. We therefore show the accuracy of classifying accepted utterances at different rejection rates, specifically 0% (all utterances accepted), 10%, 20%, 30%, 40%, and 50%. Following Schapire and Singer (2000), the confidence level, for rejection purposes, assigned to a prediction is taken to be the difference between the scores assigned by BoosTexter to the highest ranked action (the predicted action) and the next highest ranked action.</Paragraph> </Section> </Section>
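The following small sketch shows how rank-1 accuracy at a given rejection rate could be computed from per-action classifier scores, using the margin between the top two scores as the confidence, as described above. The data structures (score dictionaries and sets of valid actions) are illustrative assumptions, not the paper's implementation.

```python
# Sketch of accuracy-at-rejection-rate with margin-based confidence.
# scored_predictions: list of dicts mapping action -> score, one per utterance.
# reference_actions: list of sets of valid actions (a labeler may assign several).

def accuracy_at_rejection(scored_predictions, reference_actions, rejection_rate):
    """Rank-1 accuracy over the utterances kept after rejecting the
    lowest-confidence fraction given by rejection_rate (e.g. 0.2 for 20%)."""
    items = []
    for scores, valid in zip(scored_predictions, reference_actions):
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        predicted, top = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
        confidence = top - runner_up           # margin between top two scores
        items.append((confidence, predicted in valid))
    items.sort(reverse=True)                   # most confident utterances first
    kept = items[:max(1, round(len(items) * (1.0 - rejection_rate)))]
    return 100.0 * sum(correct for _, correct in kept) / len(kept)
```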
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> Accuracy Results </SectionTitle> <Paragraph position="0"> Utterance classification accuracy rates, at various rejection rates, for domain A are shown in Table 2 for the three experimental conditions described in section 3.4.</Paragraph> <Paragraph position="1"> The corresponding results for domains B and C are shown in Tables 3 and 4. The utterances in domain A are on average longer and more complex than in domain B; this may partly explain the higher classification rates for domain B. The generally lower classification accuracy rates for domain C may reflect the larger set of actions for this domain (92 actions, compared with 56 and 54 actions for domains A and B). Another difference between the domains was that the recording quality for domain B was not as high as for domains A and C. Despite these differences between the domains, there is a consistent pattern for the comparison of most interest to this paper, i.e. the relative performance of utterance classification methods requiring or not requiring transcription.</Paragraph> <Paragraph position="2"> Perhaps the most surprising outcome of these experiments is that the phone-based method with short &quot;phrasal&quot; contexts (up to four phones) has classification accuracy that is so close to that provided by the longer phrasal contexts of trigram word recognition and word-string classification. Of course, the re-estimation of phone n-grams employed in the phone-based method means that two-word units are implicitly modeled, since the phone 5-grams modeled in recognition, and the 4-grams used in classification, can straddle word boundaries.</Paragraph> <Paragraph position="3"> The experiments suggest that if transcriptions are available (i.e. the effort to produce them has already been expended), then they can be used to slightly improve performance over the phone-based method (PhonesMM) not requiring transcriptions. For domains A and C, this would give an absolute performance difference of about 2%, while for domain B the difference is around 1%.</Paragraph> <Paragraph position="4"> [Table 6: recognition and classification accuracy with increasing values of Nmax for domain B.]</Paragraph> <Paragraph position="6"> Whether it is optimal to train the word-based classifier on the transcriptions (WordsHH) or on the output of the recognizer (WordsHM) seems to depend on the particular data set.</Paragraph> <Paragraph position="7"> When the operational setting of utterance classification demands very high confidence, and a high degree of rejection is acceptable (e.g. if sufficient human backup operators are available), then the small advantage of the word-based methods is reduced further to less than 1%. This can be seen from the high rejection rate rows of the accuracy tables.</Paragraph> <Paragraph position="9"> Effectiveness of Unsupervised Training Tables 5, 6, and 7 show the effect of increasing Nmax (the final n-gram order reached in the unsupervised training of the phone recognition model) for domains A, B and C, respectively. The row with Nmax = 0 corresponds to the initial unweighted phone loop recognition. The classification accuracies shown in these tables are all at 0% rejection. Phone recognition accuracy is standard ASR accuracy, computed from the percentage of phone insertions, deletions, and substitutions determined by aligning the ASR output against reference phone transcriptions produced by the pronunciation component of our speech synthesizer. (Since these reference phone transcriptions are not perfect, the actual phone recognition accuracy is probably slightly higher.)</Paragraph> <Paragraph position="10"> [Table 7: recognition and classification accuracy with increasing values of Nmax for domain C.]</Paragraph> <Paragraph position="11"> Clearly, for all three domains, unsupervised recognition model training improves both recognition and classification accuracy compared with a simple phone loop. Unsupervised training of the recognition model is particularly important for domain B, where the quality of the recordings is not as high as for domains A and C, so the system needs to depend more on the re-estimated n-gram models to achieve the final classification accuracy.</Paragraph> </Section> </Paper>