<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1036"> <Title>Backoff Model Training using Partially Observed Data: Application to Dialog Act Tagging</Title>
<Section position="5" start_page="282" end_page="286" type="evalu"> <SectionTitle> 4 Experimental Results </SectionTitle>
<Paragraph position="0"> We evaluated our hidden backoff model on the ICSI meeting recorder dialog act (MRDA) corpus (Shriberg et al., 2004). MRDA is a rich data set containing 75 natural meetings on different topics, each involving about 6 participants. The DA annotations from ICSI were based on a previous approach (Jurafsky et al., 1997b), with a number of adaptations for meetings described in (Bhagat et al., 2003). Each DA contains a main tag, several optional special tags, and an optional disruption form. The total number of distinct DAs in the corpus is as large as 1260. To make the problem comparable to other work (Ang et al., 2005), our experiments use a DA tag subset consisting of back channels (b), place holders (h), questions (q), statements (s), and disruptions (x). Of the 75 conversations, 51 are used as the training set, 11 as the development set, 11 as the test set, and the remaining 3 are not used. For each experiment, we used a genetic algorithm to search for the best factored language model structure on the development set, and we report the best results.</Paragraph>
<Paragraph position="1"> Our baseline system is the generative model shown in Figure 1; it uses a backoff implementation of the word model and is optimized on the development set. We use the SRILM toolkit with extensions (Bilmes and Kirchhoff, 2003) for training and GMTK (Bilmes and Zweig, 2002) for decoding.</Paragraph>
<Paragraph position="2"> Our baseline system has an error rate of 19.7% on the test set, which is comparable to other approaches on the same task (Ang et al., 2005).</Paragraph>
<Section position="1" start_page="283" end_page="284" type="sub_section"> <SectionTitle> 4.1 Same number of states for all DAs </SectionTitle>
<Paragraph position="0"> To compare against our baseline, we use HBMs in the model shown in Figure 2. To train, we followed Algorithm 1 as described above and as detailed in Figure 3.</Paragraph>
<Paragraph position="1"> In this implementation, an upper triangular matrix (with self-transitions along the diagonal) is used for the hidden state transition probability table, so that sub-DA states propagate in only one direction. When initializing the hidden state sequence of a DA, we expanded the states uniformly along the sentence; this initial alignment is then used for HBM training. In the word models used in our experiments, the backoff path first drops the previous words and then performs a parallel backoff to the hidden state and the DA using a mean combination strategy.</Paragraph>
<Paragraph position="2"> The HBM thus obtained was then fed into the main loop of our embedded EM algorithm. Training was considered to have converged if it either exceeded 10 iterations (which never happened) or the relative log-likelihood change fell below 0.2%. Within each embedded iteration, three EM epochs were used. After each EM iteration, a Viterbi alignment was performed, yielding what we expect to be a better hidden state alignment. This updated alignment was then used to train a new HBM, and the newly generated model was fed back into the embedded training loop until convergence. After the procedure met our convergence criteria, an additional five EM epochs were carried out to provide a good hidden state transition probability table. Finally, after Viterbi alignment and text generation were performed, the word HBM was trained from the best state sequence.</Paragraph>
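<Paragraph> The control flow of this training procedure can be illustrated with a short sketch. The Python fragment below is only an illustration of the loop structure, not the actual SRILM/GMTK pipeline: run_em_epochs, viterbi_align, and train_word_hbm are hypothetical callbacks standing in for the toolkit-level steps, while the upper-triangular transition table, the uniform initial alignment, and the convergence test are spelled out explicitly.

import numpy as np

def upper_triangular_transitions(num_states):
    """Left-to-right sub-DA transition table: self-loops on the diagonal and
    forward transitions only, with rows normalized to sum to one."""
    trans = np.triu(np.ones((num_states, num_states)))
    return trans / trans.sum(axis=1, keepdims=True)

def uniform_alignment(sentence_lengths, num_states):
    """Expand the sub-DA states uniformly along each sentence."""
    return [[t * num_states // max(length, 1) for t in range(length)]
            for length in sentence_lengths]

def embedded_em(data, num_states, run_em_epochs, viterbi_align, train_word_hbm,
                max_iters=10, rel_tol=2e-3):
    """Outer loop of the embedded EM procedure; the three callbacks stand in
    for the toolkit EM, Viterbi alignment, and word-HBM training steps."""
    alignment = uniform_alignment([len(s) for s in data], num_states)
    trans = upper_triangular_transitions(num_states)
    word_hbm = train_word_hbm(data, alignment)
    prev_ll = None
    for _ in range(max_iters):
        # Three EM epochs per embedded iteration, then a Viterbi re-alignment
        # from which a new word HBM is trained.
        trans, ll = run_em_epochs(data, word_hbm, trans, epochs=3)
        alignment = viterbi_align(data, word_hbm, trans)
        word_hbm = train_word_hbm(data, alignment)
        if prev_ll is not None and abs(ll - prev_ll) < rel_tol * abs(prev_ll):
            break  # relative log-likelihood change below 0.2%
        prev_ll = ll
    # Five additional EM epochs to settle the transition table, then a final
    # alignment from which the word HBM is retrained.
    trans, _ = run_em_epochs(data, word_hbm, trans, epochs=5)
    alignment = viterbi_align(data, word_hbm, trans)
    return train_word_hbm(data, alignment), trans
</Paragraph>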
<Paragraph position="3"> To evaluate our hidden backoff model, the Viterbi algorithm was used to find the best DA sequence for the test data, and tagging error rates were calculated. In our first experiment, an equal number of hidden states was used for every DA. The effect of this number on the accuracy of DA tagging is shown in Table 1.</Paragraph>
<Paragraph position="4"> For the baseline system, the backoff path first drops the dialog act; for the HBMs, all backoff paths drop the hidden state first and the DA second. From Table 1 we see that with two hidden states for every DA, the system reduces the tagging error rate by more than 5% relative. As a comparison, in (Ang et al., 2005), where conditionally trained maximum entropy models are used, the error rate is 18.8% when using both word and acoustic prosody features and 20.5% without prosody. When the number of hidden states increases to 3, the improvement decreases, even though the result is still (very slightly) better than the baseline. We believe the reasons are as follows. First, assuming different DAs have the same number of hidden states may not be appropriate. For example, back channels usually consist of shorter sentences with a fairly constant discourse pattern over a DA, whereas questions and statements typically have longer, more complex discourse structures. Second, even within the same DA, the structure and inherent length of a sentence can vary. For example, "yes" can be a statement even though it has only one word. One-word statements therefore need completely different hidden state patterns than subject-verb-object-like statements, so a single monolithic 3-state model for statements might be inappropriate.</Paragraph>
<Paragraph position="5"> This issue is discussed further in Section 4.4.</Paragraph> </Section>
<Section position="2" start_page="284" end_page="284" type="sub_section"> <SectionTitle> 4.2 Different states for different DAs </SectionTitle>
<Paragraph position="0"> In order to mitigate the first problem described above, we allow a different number of hidden states for each DA. This, however, leads to a combinatorial explosion of possibilities if done naively. We therefore tried only a small number of combinations, based on the word-count statistics for each DA given in Table 2.</Paragraph>
<Paragraph position="1"> Table 2 shows the mean and median number of words per sentence for each DA, as well as the standard deviation. The last column gives the parameter p obtained by fitting the length histogram to a geometric distribution (1-p)^n p. As we expected, back channels (b) and place holders (h) tend to have shorter sentences, while questions (q) and statements (s) have longer ones. Based on this analysis, we use fewer states for (b) and (h) and more states for (q) and (s). For disruptions (x), the standard deviation of the word-count histogram is relatively high compared with (b) and (h), so we also used more hidden states in this case. In the experiments below, we used one state for (b) and (h) and various numbers of hidden states for the other DAs.</Paragraph>
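<Paragraph> To make the use of these statistics concrete, the short sketch below fits the geometric parameter p to a list of per-DA sentence lengths and maps the result to a state budget. This is only a hypothetical heuristic consistent with the discussion above: the paper chooses the state counts by hand, and the thresholds, the cap of three states, and the toy word counts in the example are assumptions made for illustration. Assuming the geometric support starts at n = 0, the maximum-likelihood estimate is p = 1 / (1 + mean length).

from statistics import mean, pstdev

def geometric_p(lengths):
    """Maximum-likelihood p for a geometric distribution (1-p)^n p with
    support n = 0, 1, 2, ..., fit directly to the observed word counts."""
    return 1.0 / (1.0 + mean(lengths))

def suggest_num_states(lengths, max_states=3):
    """Hypothetical heuristic: short, low-variance DAs get one state; longer
    or higher-variance DAs get two or three (capped at max_states)."""
    m, sd = mean(lengths), pstdev(lengths)
    if m <= 2 and sd <= 2:
        return 1                          # e.g. back channels, place holders
    return min(max_states, 3 if m > 6 else 2)

# Toy word counts per DA tag (made up for illustration, not MRDA statistics).
da_lengths = {"b": [1, 1, 2, 1], "q": [5, 9, 12, 7], "s": [6, 8, 15, 4]}
for tag, lens in da_lengths.items():
    print(tag, round(geometric_p(lens), 3), suggest_num_states(lens))
</Paragraph>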
<Paragraph position="2"> Tagging error rates are shown in Table 3. We see that using different numbers of hidden states for different DAs can produce better models. Among all the experiments we performed, the best configuration uses three states for (q), two states for (s) and (x), and one state for (b) and (h). This combination gives a 6.1% relative reduction in error rate over the baseline.</Paragraph> </Section>
<Section position="3" start_page="284" end_page="285" type="sub_section"> <SectionTitle> 4.3 Effect of embedded EM training </SectionTitle>
<Paragraph position="0"> Incorporating backoff smoothing procedures into Bayesian networks (and into hidden variable training in particular) can be beneficial in any data domain where smoothing is necessary. To better understand the properties of our algorithm, after each training iteration we used the partially trained model to compute both the log likelihood of the training set and the tagging error rate on the test data. Figure 4 shows these results using the best configuration from the previous section (three states for (q), two for (s)/(x), and one for (b)/(h)). This example is typical of the convergence we see with Algorithm 1, and it empirically suggests that our procedure may behave like a generalized EM (Neal and Hinton, 1998).</Paragraph>
<Paragraph position="1"> We find that the log likelihood after each EM training iteration is strictly increasing, suggesting that our embedded EM algorithm for hidden backoff models improves the overall joint likelihood of the training data under the model. This strict increase in likelihood, combined with the fact that Viterbi training does not carry the same theoretical convergence guarantees as standard EM, indicates that a more detailed theoretical analysis of this algorithm with these particular models is desirable.</Paragraph>
<Paragraph position="2"> From the figure we also see that both the log likelihood and the tagging error rate converge after around four iterations of embedded training.</Paragraph>
<Paragraph position="3"> This quick convergence indicates that our embedded training procedure is effective. The leveling off of the error rate after several iterations shows that model overfitting does not appear to be an issue, presumably because of the smoothed embedded backoff models.</Paragraph> </Section>
<Section position="4" start_page="285" end_page="285" type="sub_section"> <SectionTitle> 4.4 Discussion and Error Analysis </SectionTitle>
<Paragraph position="0"> A large portion of our tagging errors come from confusing the DA of short sentences such as "yeah" and "right". The sentence "yeah" can be either a back channel or an affirmative statement, and there are also cases where "yeah?" is a question. These confusions are difficult to remove in a prosody-less framework, but there are several possibilities. First, we could allow a fork-and-join transition matrix, in which we fork to a DA-specific condition (e.g., short or long) and join thereafter (a sketch of this idea appears below).</Paragraph>
<Paragraph position="1"> Alternatively, hidden Markov chain structuring algorithms or context (i.e., conditioning the number of sub-DAs on the previous DA) might be helpful.</Paragraph>
<Paragraph position="2"> Finding a proper number of hidden states for each DA is also challenging. In our preliminary work, we simply explored different combinations using simple statistics of the data; a more systematic procedure would be beneficial.</Paragraph>
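<Paragraph> As an illustration of the fork-and-join structure mentioned above, the fragment below builds one possible sub-DA transition table of that shape: an initial fork state branches into a short chain and a long chain, and both chains feed a single join state. This is only a hypothetical sketch of an idea left to future work; the chain lengths and the uniform 0.5 probabilities are arbitrary choices, not values evaluated in this paper.

import numpy as np

def fork_join_transitions(short_len=1, long_len=3):
    """One possible fork-and-join sub-DA transition table: state 0 forks into
    a short chain or a long chain; both chains feed a single join state."""
    n = 1 + short_len + long_len + 1             # fork + two chains + join
    T = np.zeros((n, n))
    short = range(1, 1 + short_len)              # indices of the short chain
    long_ = range(1 + short_len, 1 + short_len + long_len)
    join = n - 1
    T[0, short[0]] = T[0, long_[0]] = 0.5        # fork equally into both chains
    for chain in (list(short), list(long_)):
        for i, s in enumerate(chain):
            T[s, s] = 0.5                        # self-loop
            nxt = chain[i + 1] if i + 1 < len(chain) else join
            T[s, nxt] = 0.5                      # advance within the chain
    T[join, join] = 1.0                          # absorbing join state
    return T

print(fork_join_transitions())
</Paragraph>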
<Paragraph position="3"> In this work, we also did not perform any hidden state tying across different DAs. In practice, some states in statements could beneficially be tied with states in questions. Our results show that three states for all DAs is not as good as two states for all, but with tying, more states might be used more successfully.</Paragraph> </Section>
<Section position="5" start_page="285" end_page="286" type="sub_section"> <SectionTitle> 4.5 Influence of Prosody Cues </SectionTitle>
<Paragraph position="0"> It has been shown that prosody cues provide useful information in DA tagging tasks (Shriberg et al., 1998; Ang et al., 2005). We therefore also incorporated prosody features into our models. We used ESPS get_f0, based on the RAPT algorithm (Talkin, 1995), to obtain F0 values. For each speaker, mean and variance normalization is performed. For each word, a linear regression is carried out on the normalized F0 values.</Paragraph>
<Paragraph position="1"> We quantize the slope values into 20 bins and treat these as prosody features associated with each word (a sketch of this feature extraction appears at the end of this subsection).</Paragraph>
<Paragraph position="2"> After adding the prosody features, the simple generative model shown in Figure 5 gives an 18.4% error rate, a 6.6% relative improvement over our baseline.</Paragraph>
<Paragraph position="3"> There is no statistically significant difference between the best performance of this prosody model and that of the earlier best HBM. This implies that the HBM can achieve performance as good as a prosody-based model, but without using prosody.</Paragraph>
<Paragraph position="4"> The next obvious step is to combine an HBM with the prosody information. Strangely, even after experimenting with many different models (including ones where prosody depends on the DA; on the DA and the hidden state; on the DA, hidden state, and word; and many variations thereof), we were unsuccessful in obtaining a complementary benefit from using both prosody and an HBM. One hypothesis is that our prosody features are at the word level rather than at the DA level. Another possible problem is the small size of the MRDA corpus relative to the model complexity.</Paragraph>
<Paragraph position="5"> Yet a third hypothesis is that the errors corrected by the two methods are largely the same; indeed, we have verified that the corrected errors overlap by more than 50%. We plan further investigations in future work.</Paragraph>
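<Paragraph> The word-level prosody feature extraction described above can be summarized in a short sketch. This is only an illustrative reconstruction: it assumes F0 values and word time boundaries have already been extracted (e.g., with ESPS get_f0), and the helper names, the clipping range, and the toy inputs are assumptions made for illustration rather than settings taken from the paper.

import numpy as np

def speaker_normalize(f0, eps=1e-8):
    """Per-speaker mean and variance normalization of F0 values."""
    f0 = np.asarray(f0, dtype=float)
    return (f0 - f0.mean()) / (f0.std() + eps)

def word_slope(times, norm_f0):
    """Slope of a linear regression of normalized F0 over one word's frames."""
    if len(times) < 2:
        return 0.0
    return float(np.polyfit(times, norm_f0, deg=1)[0])

def quantize_slope(slope, num_bins=20, lo=-20.0, hi=20.0):
    """Map a slope into one of num_bins discrete prosody symbols; the
    clipping range [lo, hi] is an assumption, not taken from the paper."""
    s = np.clip(slope, lo, hi)
    return int(min(num_bins - 1, (s - lo) / (hi - lo) * num_bins))

# Toy usage: three F0 frames of one word from one speaker.
norm = speaker_normalize([180.0, 190.0, 205.0])
print(quantize_slope(word_slope([0.0, 0.1, 0.2], norm)))
</Paragraph> </Section> </Section> </Paper>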