<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1121"> <Title>POS Tagging versus Classes in Language Modeling</Title> <Section position="3" start_page="179" end_page="180" type="metho"> <SectionTitle> 1.2 POS-Based Models </SectionTitle> <Paragraph position="0"> One can also use POS tags, which capture the syntactic role of each word, as the basis of the equivalence classes (Jelinek, 1985). Consider the sequence of words &quot;hello can I help you&quot;. Here, &quot;hello&quot; is being used as an acknowledgment, &quot;can&quot; as a modal verb, 'T' as a pronoun, &quot;help&quot; as an untensed verb, and &quot;you&quot; as a pronoun. To use POS tags in language modeling, the typical approach is to sum over all of the POS possibilities. Below, we give the derivation based on using trigrams.</Paragraph> <Paragraph position="2"> The above approach for incorporating POS information into a language model has not been of much success in improving speech recognition performance. Srinivas (1996) reports that suclt a model results in a 24.5% increase in perplexity over a word-based model on the Wall Street Journal; Niesler and Woodland (1996) report an II.3% increase (but a 22-fold decrease in the number of parameters of such a model) for the LOB corpus; and Kneser and Ney (1993) report a 3% increase on the LOB corpus. The POS tags remove too much of the lexical information that is necessary for predicting the next word. Only by interpolating it with a word-based model is an improvement seen (Jelinek, 1985).</Paragraph> <Paragraph position="3"> In the rest of the paper, we first describe the annotations of the Trains corpus. We next present our POS-based language model and contrast its performance with a class-based model. We then augment these models to account for speech repairs and intonational phrase, and show that the POS-based one performs better than the class-based one for modeling speech repairs and intonational phrases.</Paragraph> </Section> <Section position="4" start_page="180" end_page="180" type="metho"> <SectionTitle> 2 The Trains Corpus </SectionTitle> <Paragraph position="0"> As part of the TRAINS project (Allen et al., 1995), a long term research project to build a conversationally proficient planning assistant, we collected a corpus of problem solving dialogs (Heeman and Allen, 1995). The dialogs involve two human participants, one who is playing the role of a user and has a certain task to accomplish, and another who is playing the role of a planning assistant. The collection methodology was designed to make the setting as close to human-computer interaction as possible, but was not a wizard scenario, where one person pretends to be a computer. Table 1 gives information about the corpus.</Paragraph> <Section position="1" start_page="180" end_page="180" type="sub_section"> <SectionTitle> 2.1 POS Annotations </SectionTitle> <Paragraph position="0"> Our POS tagset is based on the Penn Treebank tagset (Marcus et al., 1993), but modified to include tags for discourse markers and end-of-turns, and to provide richer syntactic information (Heeman, 1997). Table 2 lists our tagset with differences from the Penn tagset marked in bold. 
Contractions are annotated using '^' to conjoin the tag for each part; for instance, "can't" is annotated as MD^RB.</Paragraph> </Section> <Section position="2" start_page="180" end_page="180" type="sub_section"> <SectionTitle> 2.2 Speech Repair Annotations </SectionTitle> <Paragraph position="0"> Speech repairs occur where the speaker goes back and changes or repeats what was just said (Heeman, 1997), as illustrated by the following.</Paragraph> <Paragraph position="1"> Example 1 (d92a-2.1 utt29) "the one with the bananas" (reparandum) "I mean" (editing term) "that's taking the bananas" (alteration). Speech repairs have three parts (some of which are optional): the reparandum, which consists of the words the speaker wants to replace; an editing term, which helps mark the repair; and the alteration, which is the replacement of the reparandum. The end of the reparandum is referred to as the interruption point. For annotating speech repairs, we have extended the scheme proposed by Bear et al. (1992) so that it better deals with overlapping and ambiguous repairs. Like their scheme, ours allows the annotator to capture the word correspondences that exist between the reparandum and the alteration. Below, we illustrate how a speech repair is annotated. In this example, the reparandum is "engine two from Elmi(ra)-", the editing term is "or", and the alteration is "engine three from Elmira". The word matches on "engine" and "from" are annotated with 'm' and the word replacement of "two" by "three" is annotated with 'r'.</Paragraph> <Paragraph position="2"> Example 2 (d93-15.2 utt42) engine two from Elmi(ra)- or engine three from Elmira</Paragraph> <Paragraph position="4"/> </Section> <Section position="3" start_page="180" end_page="180" type="sub_section"> <SectionTitle> 2.3 Intonation Annotations </SectionTitle> <Paragraph position="0"> Speakers break up their speech into intonational phrases. This segmentation serves a similar purpose as punctuation does in written text. The ToBI annotation scheme (Silverman et al., 1992) involves labeling the accented words, intermediate phrases and intonational phrases with high and low accents.</Paragraph> <Paragraph position="1"> Since we are currently only interested in the intonational phrase segmentation, we only label the intonational phrase endings.</Paragraph> </Section> </Section> <Section position="5" start_page="180" end_page="183" type="metho"> <SectionTitle> 3 POS-Based Language Model </SectionTitle> <Paragraph position="0"> In this section, we present an alternative formulation for using POS tags in a statistical language model.</Paragraph> <Paragraph position="1"> Here, POS tags are viewed as part of the output of the speech recognizer, rather than intermediate objects (Heeman and Allen, 1997a; Heeman, 1997).</Paragraph> <Section position="1" start_page="180" end_page="181" type="sub_section"> <SectionTitle> 3.1 Redefining the Recognition Problem </SectionTitle> <Paragraph position="0"> To add POS tags into the language model, we refrain from simply summing over all POS sequences as illustrated in Section 1.2. Instead, we redefine the speech recognition problem so that it finds the best word and POS sequence. Let P be a POS sequence for the word sequence W. The goal of the speech recognizer is now to solve the following.</Paragraph> <Paragraph position="1"> \hat{W}\hat{P} = \arg\max_{W,P} \Pr(W P|A) = \arg\max_{W,P} \frac{\Pr(A|W P) \Pr(W P)}{\Pr(A)} = \arg\max_{W,P} \Pr(A|W P) \Pr(W P) </Paragraph> <Paragraph position="2"> The first term Pr(A|W P) is the acoustic model, which traditionally excludes the category assignment.
In fact, the acoustic model can probably be reasonably approximated by Pr(A|W). The second term Pr(W P) is the POS-based language model, and it accounts for both the sequence of words and the POS assignment for those words. We rewrite the sequence W P explicitly in terms of the N words and their corresponding POS tags, thus giving us the sequence W_{1,N} P_{1,N}. The probability Pr(W_{1,N} P_{1,N}) forms the basis for POS taggers, with the exception that POS taggers work from a sequence of given words.</Paragraph> <Paragraph position="3"> As in Equation 4, we rewrite the probability Pr(W_{1,N} P_{1,N}) as follows using the definition of conditional probability,</Paragraph> <Paragraph position="4"> \Pr(W_{1,N} P_{1,N}) = \prod_{i=1}^{N} \Pr(W_i|W_{1,i-1} P_{1,i}) \Pr(P_i|W_{1,i-1} P_{1,i-1}) </Paragraph> <Paragraph position="5"> Equation 8 involves two probability distributions that need to be estimated. Previous attempts at using POS tags in a language model, as well as POS taggers (e.g., Charniak et al., 1993), simplify these probability distributions, as given in Equations 9 and 10. However, to successfully incorporate POS information, we need to account for the full richness of the probability distributions. Hence, we cannot use these two assumptions when learning the probability distributions.</Paragraph> <Paragraph position="7"> \Pr(W_i|W_{1,i-1} P_{1,i}) \approx \Pr(W_i|P_i) \qquad \Pr(P_i|W_{1,i-1} P_{1,i-1}) \approx \Pr(P_i|P_{i-2,i-1}) </Paragraph> </Section> <Section position="2" start_page="181" end_page="183" type="sub_section"> <SectionTitle> 3.2 Estimating the Probabilities </SectionTitle> <Paragraph position="0"> To estimate the probability distributions, we follow the approach of Bahl et al. (1989) and use a decision tree learning algorithm (Breiman et al., 1984) to partition the context into equivalence classes. The algorithm starts with a single node. It then finds a question to ask about the node in order to partition the node into two leaves, each being more informative as to which event occurred than the parent node. Information-theoretic metrics, such as minimizing entropy, are used to decide which question to propose. The proposed question is then verified using heldout data: if the split does not lead to a decrease in entropy according to the heldout data, the split is rejected and the node is not further explored (Bahl et al., 1989). This process continues with the new leaves and results in a hierarchical partitioning of the context.</Paragraph> <Paragraph position="1"> After growing a tree, the next step is to use the partitioning of the context induced by the decision tree to determine the probability estimates. Using the relative frequencies in each node would give estimates biased towards the training data that was used in choosing the questions. Hence, Bahl et al. smooth these probabilities with the probabilities of the parent node using interpolated estimation with a second heldout dataset.</Paragraph> <Paragraph position="2"> Using the decision tree algorithm to estimate probabilities is attractive since the algorithm can choose which parts of the context are relevant, and in what order. Hence, this approach lends itself more readily to allowing extra contextual information to be included, such as both the word identities and POS tags, and even hierarchical clusterings of them. If the extra information is not relevant, it will not be used. The approach of using decision trees will become even more critical in the next two sections, where the probability distributions will be conditioned on even richer context.</Paragraph>
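The following is a minimal Python sketch of this estimation procedure, under several simplifying assumptions: a hand-supplied inventory of named yes/no questions, a fixed interpolation weight lam, and a fixed min_gain threshold (Bahl et al. (1989) instead estimate the interpolation weights from a second heldout set and use a richer question search). It is meant only to make the growing, heldout verification, and parent-smoothing steps concrete, not to reproduce the model used in this paper.

    from collections import Counter
    from math import log2

    def entropy(events):
        counts = Counter(events)
        total = sum(counts.values())
        return -sum(c / total * log2(c / total) for c in counts.values())

    def grow(train, heldout, questions, min_gain=0.01):
        """train/heldout: lists of (context, event); questions: name -> predicate on a context."""
        node = {"dist": Counter(e for _, e in train), "question": None}
        if len(train) < 2 or not questions:
            return node
        base = entropy([e for _, e in train])
        best = None
        for name, q in questions.items():
            yes = [(c, e) for c, e in train if q(c)]
            no = [(c, e) for c, e in train if not q(c)]
            if not yes or not no:
                continue
            split_h = (len(yes) * entropy([e for _, e in yes]) +
                       len(no) * entropy([e for _, e in no])) / len(train)
            if best is None or split_h < best[0]:
                best = (split_h, name, q, yes, no)
        if best is None or base - best[0] < min_gain:
            return node
        split_h, name, q, yes, no = best
        # Verify the proposed split on heldout data; reject it if entropy does not drop there.
        h_yes = [(c, e) for c, e in heldout if q(c)]
        h_no = [(c, e) for c, e in heldout if not q(c)]
        if h_yes and h_no:
            h_split = (len(h_yes) * entropy([e for _, e in h_yes]) +
                       len(h_no) * entropy([e for _, e in h_no])) / len(heldout)
            if h_split >= entropy([e for _, e in heldout]):
                return node
        node["question"] = name
        node["yes"] = grow(yes, h_yes, questions, min_gain)
        node["no"] = grow(no, h_no, questions, min_gain)
        return node

    def smoothed(node, parent=None, lam=0.7):
        """Interpolate a node's relative frequencies with its parent's distribution."""
        total = sum(node["dist"].values())
        probs = {e: c / total for e, c in node["dist"].items()}
        if parent is None:
            return probs
        return {e: lam * probs.get(e, 0.0) + (1 - lam) * parent.get(e, 0.0)
                for e in set(probs) | set(parent)}

    # Tiny usage example with an invented question inventory.
    train = [({"pos": "MD"}, "I"), ({"pos": "MD"}, "you"), ({"pos": "VB"}, "help"),
             ({"pos": "VB"}, "take"), ({"pos": "MD"}, "I"), ({"pos": "VB"}, "help")]
    held = [({"pos": "MD"}, "I"), ({"pos": "VB"}, "help")]
    questions = {"is POS=MD": lambda c: c["pos"] == "MD"}
    tree = grow(train, held, questions)
    print(tree["question"], smoothed(tree["yes"], smoothed(tree)))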
<Paragraph position="3"> One important aspect of using a decision tree algorithm is the form of the questions that it is allowed to ask. We allow two basic types of information to be used as part of the context: numeric and categorical.</Paragraph> <Paragraph position="4"> For a numeric variable N, the decision tree searches for questions of the form 'is N >= n', where n is a numeric constant. For a categorical variable C, it searches over questions of the form 'is C in S', where S is a subset of the possible values of C. We also allow restricted boolean combinations of elementary questions (Bahl et al., 1989).</Paragraph> <Paragraph position="5"> The context that we use for estimating the probabilities includes both word identities and POS tags. To make effective use of this information, we need to allow the decision tree algorithm to generalize between words and POS tags that behave similarly.</Paragraph> <Paragraph position="6"> To learn which words behave similarly, Black et al. (1992) and Magerman (1994) used the clustering algorithm of Brown et al. (1992) to build a hierarchical classification tree. Figure 1 gives the classification tree that we built for the POS tags. The algorithm starts with each token in a separate class and iteratively finds the two classes whose merge results in the smallest loss of information about POS adjacency. Rather than stopping at a certain number of classes, one continues until only a single class remains. However, the order in which classes were merged gives a hierarchical binary tree with the root corresponding to the entire tagset, each leaf to a single POS tag, and intermediate nodes to groupings of tags that are statistically similar. The path from the root to a tag gives the binary encoding for the tag.</Paragraph> <Paragraph position="7"> The decision tree algorithm can ask which partition a word belongs to by asking questions about the binary encoding.</Paragraph> <Paragraph position="8"> For handling word identities, one could follow the approach used for handling the POS tags (e.g., Black et al., 1992; Magerman, 1994) and view the POS tags and word identities as two separate sources of information. Instead, we view the word identities as a further refinement of the POS tags. We start the clustering algorithm with a separate class for each word and each POS tag that it takes on, and only allow it to merge classes if the POS tags are the same. This results in a word classification tree for each POS tag. Building a word classification tree for each POS tag means that the tree will not be polluted by words that are ambiguous as to their POS tag, as exemplified by the word "loads", which is used in the Trains corpus as both a third-person present tense verb VBZ and as a plural noun NNS. Furthermore, building a tree for each POS tag simplifies the task, because the hand annotations of the POS tags resolve a lot of the difficulty that the algorithm would otherwise have to handle. This allows effective trees to be built even when only a small amount of data is available.</Paragraph> <Paragraph position="9"> As an example, consider the word classification tree built for the personal pronouns (PRP). For reference, we list the number of occurrences of each word. Notice that the algorithm distinguished between the subjective pronouns 'I', 'we', and 'they', and the objective pronouns 'me', 'us' and 'them'. The pronouns 'you' and 'it' take both cases and were probably clustered according to their most common usage in the corpus. Although we could have added extra POS tags to distinguish between these two types of pronouns, it seems that the clustering algorithm can make up for some of the shortcomings of the POS tagset. The class 'low' is used to group singleton words.</Paragraph>
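The constrained clustering can be sketched as follows; the toy corpus and the brute-force re-evaluation of mutual information after every candidate merge are illustrative assumptions (Brown et al. (1992) maintain the merge costs incrementally), but the two key ingredients are the same: merges are chosen to lose the least mutual information about class adjacency, and only classes with the same POS tag may be merged.

    from collections import Counter
    from itertools import combinations
    from math import log2

    corpus = [("i", "PRP"), ("can", "MD"), ("help", "VB"), ("you", "PRP"),
              ("we", "PRP"), ("can", "MD"), ("take", "VB"), ("them", "PRP")]
    tokens = [w + "/" + p for w, p in corpus]      # each word/POS combination is a token
    pos_of = {t: t.split("/")[1] for t in tokens}

    def adjacency_mi(assign):
        """Mutual information between the classes of adjacent tokens."""
        bigrams = Counter((assign[a], assign[b]) for a, b in zip(tokens, tokens[1:]))
        n = sum(bigrams.values())
        left, right = Counter(), Counter()
        for (a, b), c in bigrams.items():
            left[a] += c
            right[b] += c
        return sum(c / n * log2((c / n) / ((left[a] / n) * (right[b] / n)))
                   for (a, b), c in bigrams.items())

    assign = {t: t for t in set(tokens)}           # start with one class per token
    merges = []
    while len(set(assign.values())) > len(set(pos_of.values())):
        best = None
        for c1, c2 in combinations(sorted(set(assign.values())), 2):
            # Only consider merging classes whose members carry the same POS tag.
            if pos_of[c1] != pos_of[c2]:
                continue
            trial = {t: (c1 if c == c2 else c) for t, c in assign.items()}
            mi = adjacency_mi(trial)
            if best is None or mi > best[0]:       # keep the merge that loses the least MI
                best = (mi, c1, c2, trial)
        if best is None:
            break
        _, c1, c2, assign = best
        merges.append((c1, c2))                    # the merge order yields the binary encoding

    print(merges)

Reading the sequence of merges above a leaf as a bit string gives the kind of binary encoding that the decision tree questions operate on.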
</Section> <Section position="3" start_page="183" end_page="183" type="sub_section"> <SectionTitle> 3.3 Results </SectionTitle> <Paragraph position="0"> Before giving a comparison between our POS-based model and a class-based model, we first describe the experimental setup and define the perplexity measures that we use to measure the performance.</Paragraph> <Paragraph position="1"> To make the best use of our limited data, we used a six-fold cross-validation procedure: each sixth of the data was tested using a model built from the remaining data. Changes in speaker are marked in the word transcription with the special token <turn>.</Paragraph> <Paragraph position="2"> We treat contractions, such as "that'll" and "gonna", as separate words, treating them as "that" and "'ll" for the first example, and "going" and "ta" for the second (see Heeman and Damnati (1997) for how to treat contractions as separate words in a speech recognizer). We also changed all word fragments into the token <fragment>.</Paragraph> <Paragraph position="3"> Since current speech recognition rates for spontaneous speech are quite low, we have run the experiments on the hand-collected transcripts. In searching for the best sequence of POS tags for the transcribed words, we follow the technique proposed by Chow and Schwartz (1989) and only keep a small number of alternative paths by pruning the low-probability paths after processing each word.</Paragraph> </Section> <Section position="4" start_page="183" end_page="183" type="sub_section"> <SectionTitle> 3.3.2 Branching Perplexity </SectionTitle> <Paragraph position="0"> Our POS-based model is not only predicting the next word, but its POS tag as well. To estimate the branching factor, and thus the size of the search space, we use the following formula for the entropy, where d_i is the POS tag for word w_i.</Paragraph> <Paragraph position="1"> H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 \Pr(w_i d_i | w_{1,i-1} d_{1,i-1}) </Paragraph> </Section> <Section position="5" start_page="183" end_page="183" type="sub_section"> <SectionTitle> 3.3.3 Word Perplexity </SectionTitle> <Paragraph position="0"> In order to compare a POS-based model against a traditional language model, we should not penalize the POS-based model for incorrect POS tags, and hence we should ignore them when defining the perplexity. Just as with a traditional model, we base the perplexity measure on Pr(w_i|w_{1,i-1}). However, for our model, this probability is not estimated. Hence, we must rewrite it in terms of the probabilities that we do estimate. To do this, our only recourse is to sum over all possible POS sequences.</Paragraph> <Paragraph position="1"> \Pr(w_i|w_{1,i-1}) = \frac{\Pr(w_{1,i})}{\Pr(w_{1,i-1})} = \frac{\sum_{P_{1,i}} \Pr(w_{1,i} P_{1,i})}{\sum_{P_{1,i-1}} \Pr(w_{1,i-1} P_{1,i-1})} </Paragraph> </Section> <Section position="6" start_page="183" end_page="183" type="sub_section"> <SectionTitle> 3.3.4 Using Richer Context </SectionTitle> <Paragraph position="0"> Table 3 shows the effect of varying the richness of the information that the decision tree algorithm is allowed to use in estimating the POS and word probabilities. The second column uses the approximations given in Equations 9 and 10. The third column gives the results using the full context. The results show that adding the extra context has the biggest effect on the perplexity measures, decreasing the word perplexity from 43.22 to 24.04, a reduction of 44.4%. The effect on POS tagging is less pronounced, but still gives an error rate reduction of 3.8%. Hence, to use POS tags during speech recognition, one must use a richer context for estimating the probabilities than what is typically used.</Paragraph>
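The two perplexity measures used above can be illustrated with a toy joint model; the bigram dependence on just the previous POS tag and the invented vocabulary and probabilities stand in for the decision-tree estimates of this paper.

    from math import log2

    # prev POS tag -> {(word, tag): probability}; all values are invented for the example.
    model = {
        "<s>": {("hello", "UH"): 0.6, ("can", "MD"): 0.4},
        "UH":  {("can", "MD"): 0.5, ("can", "NN"): 0.2, ("i", "PRP"): 0.3},
        "MD":  {("i", "PRP"): 0.9, ("you", "PRP"): 0.1},
        "NN":  {("i", "PRP"): 0.8, ("you", "PRP"): 0.2},
        "PRP": {("help", "VB"): 0.5, ("you", "PRP"): 0.5},
        "VB":  {("you", "PRP"): 1.0},
    }
    test = [("hello", "UH"), ("can", "MD"), ("i", "PRP"), ("help", "VB"), ("you", "PRP")]

    # Branching perplexity: the model must predict the word and its POS tag.
    prev_tags = ["<s>"] + [d for _, d in test]
    h = -sum(log2(model[p][(w, d)]) for p, (w, d) in zip(prev_tags, test)) / len(test)
    branching_perplexity = 2 ** h

    # Word perplexity: ignore the POS tags by summing over all tag possibilities.
    def word_step(tag_dist, word):
        """Return Pr(word | history) and the updated distribution over tag histories."""
        total, new_dist = 0.0, {}
        for prev, mass in tag_dist.items():
            for (w, d), p in model.get(prev, {}).items():
                if w == word:
                    total += mass * p
                    new_dist[d] = new_dist.get(d, 0.0) + mass * p
        norm = sum(new_dist.values()) or 1.0
        return total, {d: m / norm for d, m in new_dist.items()}

    tag_dist, logp = {"<s>": 1.0}, 0.0
    for w, _ in test:
        p, tag_dist = word_step(tag_dist, w)
        logp += log2(p)
    word_perplexity = 2 ** (-logp / len(test))
    print(round(branching_perplexity, 2), round(word_perplexity, 2))

The word perplexity marginalizes the POS tags out by carrying a distribution over tag histories forward, while the branching perplexity also charges the model for predicting the tags themselves.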
<Paragraph position="1"> In this section, we compare the POS-based model against a class-based model. To make the comparison as focused as possible, we use the same methodology for estimating the probability distributions as we used for the POS-based model. The classes were obtained from the word clustering algorithm, stopping once a certain number of classes had been reached. Unfortunately, the clustering algorithm of Brown et al. does not have a mechanism to decide an optimal number of word classes (cf. Kneser and Ney, 1993). Hence, to give an optimal evaluation of the class-based approach, we chose the number of classes that gave the best perplexity results, which was 100 classes. We then built word classification trees, just as we did for the POS-based approach, where words from different classes are not allowed to be merged. The resulting class-based model achieved a perplexity of 25.24, in comparison to 24.04 for the POS-based model. The improvement of the POS-based model is due to two factors. First, tracking the syntactic role of each word gives valuable information for predicting the subsequent words. Second, the classification trees for the POS-based approach, which the decision tree algorithm uses to determine the equivalence classes, are of higher quality. This is because the POS-based classification trees are built from the hand-annotated POS information: they take advantage of the hand-coded knowledge present in the POS tags and are not polluted by words that take on more than one syntactic role.</Paragraph> </Section> <Section position="7" start_page="183" end_page="183" type="sub_section"> <SectionTitle> 3.3.6 Preliminary Wall Street Journal Results </SectionTitle> <Paragraph position="0"> For building a system that partakes in dialogue, read-speech corpora, such as the Wall Street Journal, are not appropriate. However, to make our results more comparable to the literature, we have done preliminary tests on the Wall Street Journal corpus in the Penn Treebank, which has POS annotations. This corpus has a significantly larger vocabulary size (55800 words) than the Trains corpus.</Paragraph> <Paragraph position="1"> Our current algorithm for clustering the words takes space in proportion to the square of the number of unique word/POS combinations (minus any that get grouped into the low-occurrence class). More work is needed to handle larger vocabulary sizes. Using 78,800 words of data, with a vocabulary size of 9711, we achieved a perplexity of 250.75 on the known words, in comparison to a trigram word-based backoff model (Katz, 1987) built with the CMU toolkit (Rosenfeld, 1995), which achieved a perplexity of 296.43. More work is needed to see if these results scale up to larger vocabulary and training data sizes.</Paragraph> </Section> </Section> <Section position="6" start_page="183" end_page="186" type="metho"> <SectionTitle> 4 Adding Repairs and Phrasing </SectionTitle> <Paragraph position="0"> Just as we redefined the speech recognition problem so as to account for POS tagging, we do the same for modeling intonational phrases and speech repairs. We introduce null tokens between each pair of words w_{i-1} and w_i (Heeman and Allen, 1997b), which will be tagged as to the occurrence of these events.
The variable T_i indicates whether word w_{i-1} ends an intonational phrase (T_i=%) or not (T_i=null).</Paragraph> <Paragraph position="1"> For detecting speech repairs, we have the problem that repairs are often accompanied by an editing term, such as "um", "uh", "okay", or "well", and these must be identified as such. Furthermore, an editing term might be composed of a number of words, such as "let's see" or "uh well". Hence we use two tags: an editing term tag E_i and a repair tag R_i. The editing term tag indicates if w_i starts an editing term (E_i=Push), if w_i continues an editing term (E_i=ET), if w_{i-1} ends an editing term (E_i=Pop), or otherwise (E_i=null). The repair tag R_i indicates whether word w_i is the onset of the alteration of a fresh start (R_i=C), a modification repair (R_i=M), or an abridged repair (R_i=A), or whether there is no repair (R_i=null). Note that for repairs with an editing term, the repair is tagged after the extent of the editing term has been determined. Below we give an example showing all non-null tone, editing term and repair tags.</Paragraph> <Paragraph position="2"> Example 3 (d93-18.1 utt47) it takes one [Push] you [ET] know [Pop] [M] two hours [%] If a modification repair or fresh start occurs, we need to determine the extent (or the onset) of the reparandum, which we refer to as correcting the speech repair. Often, speech repairs have strong word correspondences between the reparandum and alteration, involving word matches and word replacements. Hence, knowing the extent of the reparandum means that we can use the reparandum to predict the words (and their POS tags) that make up the alteration. In our full model, we add three variables to account for the correction of speech repairs (Heeman and Allen, 1997b; Heeman, 1997).</Paragraph> <Paragraph position="3"> We also add an extra variable to account for silences between words. After a silence has occurred, we can use the silence to better predict whether an intonational boundary or speech repair has just occurred. Below we give the redefinition of the speech recognition problem (without speech repair correction and silence information). The speech recognition problem is redefined so that its goal is to find the maximal assignment for the words as well as the POS, intonational, and repair tags.</Paragraph> <Paragraph position="4"> \hat{W}\hat{P}\hat{T}\hat{E}\hat{R} = \arg\max_{WPTER} \Pr(W P T E R|A) = \arg\max_{WPTER} \Pr(A|W P T E R) \Pr(W P T E R) </Paragraph> <Paragraph position="5"> Just as we did in Equation 8, we rewrite the above in terms of five probability distributions, each of which needs to be estimated. The context for each of the probability distributions includes all of the previous context. In principle, we could give all of this context to the decision tree algorithm and let it decide what information is relevant in constructing equivalence classes of the contexts. However, the amount of training data is limited (as are the learning techniques) and so we need to encode the context in order to simplify the task of constructing meaningful equivalence classes. Hence we restructure the context to take into account the speech repairs and boundary tones (Heeman, 1997).</Paragraph>
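As a concrete picture of the enlarged hypothesis space, the annotated utterance of Example 3 can be represented as interleaved word, POS, tone, editing-term and repair values, and a hypothesis can be scored by the chain rule over the five distributions. The POS tags below, the exact conditioning order, and the uniform placeholder models are illustrative assumptions only; the actual model estimates each distribution with decision trees over the restructured context.

    from math import log2

    # Example 3 (d93-18.1 utt47): "it takes one [Push] you [ET] know [Pop] [M] two hours [%]"
    utterance = [
        # word     POS     T(one)  E(dit)   R(epair)
        ("it",     "PRP",  None,   None,    None),
        ("takes",  "VBZ",  None,   None,    None),
        ("one",    "CD",   None,   None,    None),
        ("you",    "PRP",  None,   "Push",  None),   # "you" starts the editing term
        ("know",   "VB",   None,   "ET",    None),   # "know" continues it
        ("two",    "CD",   None,   "Pop",   "M"),    # editing term ends; modification repair onset
        ("hours",  "NNS",  "%",    None,    None),   # intonational phrase ends after "hours"
    ]

    def uniform(_context, _value):
        """Placeholder conditional model; stands in for the decision-tree estimates."""
        return 0.1

    def log_score(utt, p_word=uniform, p_pos=uniform, p_tone=uniform,
                  p_edit=uniform, p_repair=uniform):
        """Chain-rule score over the five distributions, mirroring the decomposition above."""
        total, context = 0.0, []
        for word, pos, tone, edit, repair in utt:
            for model, value in ((p_tone, tone), (p_edit, edit), (p_repair, repair),
                                 (p_pos, pos), (p_word, word)):
                total += log2(model(context, value))
                context = context + [value]
        return total

    print(log_score(utterance))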
<Section position="1" start_page="185" end_page="186" type="sub_section"> <SectionTitle> 4.1 Results </SectionTitle> <Paragraph position="0"> We now contrast the performance of augmenting the POS-based model with speech repair and intonational modeling versus augmenting the class-based model. Just as in Section 3, all results were obtained using a six-fold cross-validation procedure from the hand-collected transcripts. We ran these transcripts through a word-aligner (Ent, 1994), a speech recognizer constrained to recognize what was transcribed, in order to automatically obtain silence durations. In predicting the end-of-turn marker <turn>, we do not use any silence information.</Paragraph> <Paragraph position="1"> We report results on identifying speech repairs and intonational phrases in terms of recall, precision and error rate. The recall rate is the number of times that the algorithm correctly identifies an event over the total number of times that it actually occurred.</Paragraph> <Paragraph position="2"> The precision rate is the number of times the algorithm correctly identifies it over the total number of times it identifies it. The error rate is the number of errors in identifying an event over the number of times that the event occurred.</Paragraph> <Paragraph position="3"> The first set of experiments, whose results are given in Table 4, explores how POS tagging and word perplexity benefit from modeling boundary tones and speech repairs. The second column gives the results of the POS-based language model, introduced in Section 3. The third column adds in speech repair detection and correction and boundary tone identification, and makes use of silence information in detecting speech repairs and boundary tones. We see that this results in a perplexity reduction of 7.0%, and a POS error reduction of 8.1%. As we further improve the modeling of the user's utterance, we expect to see further improvements in the language model. Of course, there is a penalty to pay in terms of increased search space size, as the increase in the branching perplexity shows.</Paragraph> <Paragraph position="4"> In Table 5, we demonstrate that modeling intonational phrases benefits from modeling POS tags. Column two gives the results of augmenting the class-based model of Section 3.3.5 with intonational phrase modeling, and column three gives the results of augmenting the POS-based model. Contrasting the results in column two with those in column three, we see that using the POS-based model results in a reduction in the error rate of 17.2% over the class-based model. Hence, we see that modeling the POS tags allows much better modeling of intonational phrases than can be achieved with a class-based model. The fourth column reports the results using the full model, which accounts for interactions with speech repairs and the benefit of using silence information (Heeman and Allen, 1997b).</Paragraph> <Paragraph position="5"> In Table 6, we demonstrate that modeling the detection of speech repairs (and editing terms) benefits from modeling POS tags. In the results below, we ignore errors that are the result of improperly identifying the type of repair, and hence score a repair as correctly detected as long as it was identified as either an abridged repair, modification repair or fresh start. Column two gives the results of augmenting the class-based model of Section 3.3.5 with speech repair modeling, and column three gives the results of augmenting the POS-based model. In terms of overall detection, the POS-based model reduces the error rate from 52.0% to 46.2%, a reduction of 11.2%. This shows that speech repair detection profits from being able to make use of syntactic generalizations, which are not available from a class-based approach. The final column gives the results of the full model, which accounts for interactions with speech repair correction and intonational phrasing, and uses silence information.</Paragraph>
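The recall, precision and error rate reported in these tables follow directly from the definitions given at the start of this section; below is a small sketch, with invented label sequences, and with the error count taken to be the sum of missed events and false alarms.

    def scores(reference, hypothesis, event):
        """reference/hypothesis: per-slot labels; event: the label being scored."""
        actual = sum(1 for r in reference if r == event)
        predicted = sum(1 for h in hypothesis if h == event)
        correct = sum(1 for r, h in zip(reference, hypothesis) if r == h == event)
        missed = actual - correct            # events the model failed to identify
        false_alarms = predicted - correct   # spurious identifications
        recall = correct / actual if actual else 0.0
        precision = correct / predicted if predicted else 0.0
        error_rate = (missed + false_alarms) / actual if actual else 0.0
        return recall, precision, error_rate

    reference  = ["null", "%", "null", "null", "%", "null", "%"]
    hypothesis = ["null", "%", "null", "%",    "%", "null", "null"]
    print(scores(reference, hypothesis, "%"))   # recall, precision and error rate are each 2/3 here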
</Section> </Section> </Paper>