<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4034">
  <Title>Multi-Speaker Language Modeling</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Multi-speaker Language Modeling
</SectionTitle>
    <Paragraph position="0"> In a conversational setting, such as during a meeting or telephone call, the words spoken by one speaker are affected not only by his or her own previous words but also by other speakers. Such inter-speaker dependency, however, is typically ignored in standard n-gram language models. In this work, information (i.e., word tokens) from other speakers (A) is used to better predict word tokens of the current speaker (W). When predicting wt, instead of using P(wtjw0; ;wt 1), the form P(wtjw0; ;wt 1;a0; ;at) is used. Here at represents a word spoken by some other speaker with appropriate starting time (Section 3). A straight-forward implementation is to extend the normal trigram model as:</Paragraph>
    <Paragraph position="2"> one from a meeting (b). In (a), only two speakers are involved and the words from the current speaker, W, are affected by the other speaker, A. At the beginning of a conversation, the response to Hi is likely to be Hi or Hello. At the end of the phone call, the response to Take care might be Bye , or You too , etc. In (b), we show a typical meeting conversation. Speaker C2 is interrupting C3 when C3 says Sunday . Because Sunday is a day of the week, there is a high probability that C2's response is also a day of the week. In our model, we only consider two streams at a time, W and A. Therefore, when considering the probability of C2's words, it is reasonable to collapse words from all other speakers (C0,C1,C3,C4, and C5) into one stream A as shown in the gure. This makes available to C2 the rest of the meeting to potentially condition on, although it does not distinguish between different speakers.</Paragraph>
    <Paragraph position="3"> Our model, Equation 1, is different from most language modeling systems since our models condition on both previous words and another potential factor A. Such a model is easily represented using a factored language model (FLM), an idea introduced in (Bilmes and Kirchhoff, 2003; Kirchhoff et al., 2003), and incorporated into the SRILM toolkit (Stolcke, 2002). Note that a form of cross-side modeling was used by BBN (Schwartz, 2004), where in a multi-pass speech recognition system the output of a rst-pass from one speaker is used to prime words in the language model for the other speaker.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Initial Evaluation
</SectionTitle>
    <Paragraph position="0"> We evaluate MSLMs on three corpora: Switchboard-I, Switchboard Eval-2003, and ICSI Meeting data. In Switchboard-I, 6.83% of the words are overlapped in time, where we de ne w1 and w2 as being overlapped if s(w1) s(w2) &lt; e(w1) or s(w2) s(w1) &lt; e(w2), where s( ) and e( ) are the starting and ending time of a word.</Paragraph>
    <Paragraph position="1"> The ICSI Meeting Recorder corpus (Janin et al., 2003) consists of a number of meeting conversations with three or more participants. The data we employed has 32 conversations, 35,000 sentences and 307,000 total words, where 8.5% of the words were overlapped. As mentioned previously, we collapse the words from all other speakers into one stream A as a conditioning set for W. The data consists of all speakers taking their turn being W.</Paragraph>
    <Paragraph position="2"> To be used in an FLM, the words in each stream need to be aligned at discrete time points. Clearly, at should not come from wt's future. Therefore, for each wt, we use the closest previous A word in the past for at such that s(wt 1) s(at) &lt; s(wt). Therefore, each at is used only once and no constraints are placed on at's end time. This is reasonable since one can often predict a speaker's word after it starts but before it completes.</Paragraph>
    <Paragraph position="3"> We score using the model P(wtjwt 1;wt 2;at).1 Different back-off strategies, including different back-off paths as well as combination methods (Bilmes and Kirchhoff, 2003), were tried and here we present the best results. The backoff order (for Switchboard-I and Meeting) rst dropped at, then wt 2, wt 1, ending with the uniform distribution. For Switchboard eval-2003, we used a generalized parallel backoff mechanism. In all cases, modi ed Kneser-Ney smoothing (Chen and Goodman, 1998) was used at all back-off points.</Paragraph>
    <Paragraph position="4"> Results on Switchboard-I and the meeting data employed 5-fold cross-validation. Training data for Switchboard eval-2003 consisted of all of Switchboard-I. In Switchboard eval-2003, hand-transcribed time marks are  unavailable, so A was available only at the beginning of utterances of W.2 Results (mean perplexities and standard deviations) are listed in Table 1 (Switchboard-I and meeting) and the jV j column in Table 3.</Paragraph>
    <Paragraph position="5">  likely improve our results.</Paragraph>
    <Paragraph position="6"> In Table 1, the rst column shows data set names. The second and third columns show our best baseline trigram and four-gram perplexities, both of which used interpolation and modi ed Kneser-Ney at every back-off point. The trigram outperforms the four-gram. The fourth column shows the perplexity results with MSLMs and the last column shows the MSLM's relative perplexity reduction over the (better) trigram baseline. This positive reduction indicates that for both data sets, the utilization of additional information from other speakers can better predict the words of the current speaker. The improvement is larger in the highly conversational meeting setting since additional speakers, and thus more interruptions, occur.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Analysis
</SectionTitle>
      <Paragraph position="0"> It is elucidating at this point to identify when and how A-words can help predict W-words. We thus computed the log-probability ratio of P(wtjwt 1;wt 2;at) and the trigram P(wtjwt 1;wt 2) evaluated on all test set tuples of form (wt 2;wt 1wt;at). When this ratio is large and positive, conditioning on at signi cantly increases the probability of wt in the context of wt 1 and wt 2. The opposite is true when the ratio is large and negative. To ensure the signi cance of our results, we de ne large to mean at least 101:5 32, so that using at makes wt at least 32 times more (or less) probable. We chose 32 in a data-driven fashion, to be well above any spurious probability differences due to smoothing of different models.</Paragraph>
      <Paragraph position="1"> At the rst word of a phrase spoken by W, there are a number of cases of A words that signi cantly increase the probability of a W word relative to the trigram alone.</Paragraph>
      <Paragraph position="2"> This includes (in roughly decreasing order of probability) echos (e.g., when A says Friday , W repeats it), greetings/partings (e.g., a W greeting is likely to follow an A greeting), paraphrases (e.g., crappy followed by ugly , or Indiana followed by Purdue ), is-a relationships (e.g., A saying corporation followed by W saying dell , A- actor followed by W- Swayze , A- name followed by W- Patricia , etc.), and word completions.</Paragraph>
      <Paragraph position="3"> On the other hand, some A contexts (e.g., laughter) signi cantly decrease the probability of many W words.</Paragraph>
      <Paragraph position="4"> Within a W phrase, other patterns emerge. In particular, some A words signi cantly decrease the probability that W will nish a commonly-used phrase.</Paragraph>
      <Paragraph position="5"> For example, in a trigram alone, p(biggerjand; bigger), p(forthjand; back), and p(easyjand; quick), all have high probability. When also conditioning on A, some A words signi cantly decrease the probability of nishing such phrases. For example, we nd that p(easyjand; quick; uh-hmm ) p(easyjand; quick).</Paragraph>
      <Paragraph position="6"> A similar phenomena occurs for other commonly used phrases, but only when A has uttered words such as yeah , good , ok , [laughter] , huh , etc. While one possible explanation of this is just due to decreased counts, we found that for such phrases p(wtjwt 1;wt 2;at) minwt 32S p4(wtjwt 1;wt 2;wt 3) where p4 is a four-gram, S = fw : C(wt;wt 1;wt 2;w) &gt; 0g, and C is the 4-gram word count function for the switchboard training and test sets. Therefore, our hypothesis is that when W is in the process of uttering a predictable phrase and A indicates she knows what W will say, it is improbable that W will complete that phrase.</Paragraph>
      <Paragraph position="7"> The examples above came from Switchboard-I, but we found similar phenomena in the other corpora.</Paragraph>
      <Paragraph position="8">  Class-based language models (Brown et al., 1992; Whittaker and Woodland, 2003) yield great bene ts when data sparseness abounds. SRILM (Stolcke, 2002) can produce classes to maximize the mutual information between the classes I(C(wt);C(wt 1)), as described in (Brown et al., 1992). More recently, a method for clustering words at different positions was developed (Yamamoto et al., 2001; Gao et al., 2002). Our goal is to produce classes that improve the scores P(wtjht) = P(wtjwt 1;wt 2;C1(at)), what we call class-based MSLMs. In our case, the vocabulary for A is partitioned into classes by either maximizing conditional mutual information (MCMI) I(wt;C(at)jwt 1;wt 2) or just maximizing mutual information (MMI) I(wt;C(at)).</Paragraph>
      <Paragraph position="9"> While such clusterings can perform poorly under low counts, our results show further consistent improvements. Our new clustering procedures were implemented into the SRILM toolkit. When partitioned into smaller classes, the A-tokens are replaced by their corresponding class IDs. The result is then trained using the same factored language model as before. The resulting perplexities for the MCMI case are presented in Figure 2, where the horizontal axis shows the number of A-stream classes (the right-most shows the case before clustering), and the vertical axis shows average perplexity. In both data corpora, the average perplexities decrease after applying class-based MSLMs. For both Switchboard-I and the meeting data, the best result is achieved using 500 classes (7.1% and 12.2% improvements respectively).</Paragraph>
      <Paragraph position="10"> To compare different clustering algorithms, results with the standard method of (Brown et al., 1992) (SRILM's ngram-class) are also reported. All the perplexities for these three types of class-based MSLMs are given in Table 2. For Switchboard-I, ngram-class does slightly better than without clustering. On the meeting data, it even does slightly worse than no clustering. Our MMI method does show a small improvement, and the perplexities are further (but not signi cantly) reduced using our MCMI method (but at the cost of much more computation during development).</Paragraph>
      <Paragraph position="11"> We also show results on Switchboard eval-2003 in Table 3. We compare an optimized four-gram, a threegram baseline, and various numbers of cluster sizes using our MCMI method and generalized backoff (Bilmes and Kirchhoff, 2003), which, (again) with 500 clusters, achieves an 8.9% relative improvement over the trigram.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>