File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/94/h94-1039_metho.xml
Size: 18,951 bytes
Last Modified: 2025-10-06 14:13:47
<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1039"> <Title>RECENT IMPROVEMENTS IN THE CMU SPOKEN LANGUAGE UNDERSTANDING SYSTEM</Title> <Section position="4" start_page="0" end_page="213" type="metho"> <SectionTitle> 2. SYSTEM OVERVIEW </SectionTitle> <Paragraph position="0"> The CMU spoken language understanding system is called Phoenix, and has been described in previous papers [4, 3]. It is necessary here to give a brief description of the system in order to understand the context within which we were making changes.</Paragraph> <Paragraph position="1"> Our system has a loose coupling between the recognition and parsing stages. The recognizer uses stochastic language models to produce a single word string hypothesis. This hypothesis is then passed to a parsing module which uses semantic grammars to produce a semantic representation for the input utterance. We use a stochastic language model in recognition because of its robustness and efficiency. The parsing stage must then be tolerant of mistakes due to disfluencies and misrecognitions.</Paragraph> <Paragraph position="2"> The parser outputs a frame with slots filled from the current utterance. Information from the current frame is integrated with information from previous frames to form the current context. A set of tests is applied to determine whether to reset context. For example, if new depart and arrive locations are specified, old context is cleared. The current context is then used to build an SQL query.</Paragraph> <Section position="1" start_page="0" end_page="213" type="sub_section"> <SectionTitle> 2.1. Recognition </SectionTitle> <Paragraph position="0"> The CMU Sphinx-II system [2] uses semi-continuous Hidden Markov Models to model context-dependent phones (triphones), including between-word context. The phone models are based on senones, that is, observation distributions are shared between corresponding states in similar models. The system uses four codebooks: Mel-scale cepstra, 1st and 2nd difference cepstra, and power. An observation probability is computed using a mixture of the top 4 distributions from each codebook.</Paragraph> <Paragraph position="1"> The recognizer processes an utterance in four steps.</Paragraph> <Paragraph position="2"> 1. It makes a forward time-synchronous pass using full between-word models, Viterbi scoring and a bigram language model. This produces a word lattice where words have one begin time but several end times.</Paragraph> <Paragraph position="3"> 2. It then makes a backward pass which uses the end times from the words in the first pass and produces a second lattice which contains multiple begin times for words.</Paragraph> <Paragraph position="4"> 3. An A* algorithm is used to generate the set of N-best hypotheses for the utterance from these two lattices. An N of 100 was used for these tests.</Paragraph> <Paragraph position="5"> 4. The set of N-best hypotheses is then rescored using a trigram language model. The best scoring hypothesis after rescoring is output as the result.</Paragraph> <Paragraph position="6"> 2.2. Parsing Our NL understanding system (Phoenix) is designed for robust information extraction. It uses a simple frame mechanism to represent task semantics. Frames are associated with the various types of actions that can be taken by the system. Slots in a frame represent the various pieces of information relevant to the action that may be specified by the subject.
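To make the frame-and-slot mechanism concrete, the following minimal sketch (Python, with hypothetical frame and slot names; an illustration only, not the actual Phoenix data structures) shows a frame as a named set of slots that are filled independently as semantic phrases are recognized.

# Illustrative only: hypothetical frame/slot structures, not the Phoenix implementation.
from copy import deepcopy

flight_info_frame = {
    "name": "flight_info",
    "slots": {                 # top-level semantic tokens that may fill this frame
        "depart_loc": None,
        "arrive_loc": None,
        "depart_time_range": None,
        "fare_class": None,
    },
}

def fill_slot(frame, slot_name, parse_tree):
    """Return a copy of the frame with one slot filled by a recognized semantic phrase."""
    new_frame = deepcopy(frame)
    new_frame["slots"][slot_name] = parse_tree
    return new_frame

# Example: "flights from boston to denver" might fill two slots independently.
hypothesis = fill_slot(flight_info_frame, "depart_loc", ("[depart_loc]", "from boston"))
hypothesis = fill_slot(hypothesis, "arrive_loc", ("[arrive_loc]", "to denver"))
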
For example, the most frequently used frame in ATIS is the one corresponding to a request to display some type of flight information. Slots in the frame specify what information is to be displayed (flights, fares, times, airlines, etc.), how it is to be tabulated (a list, a count, etc.) and the constraints that are to be used (date ranges, time ranges, price ranges, etc.).</Paragraph> <Paragraph position="7"> The Phoenix system uses Recursive Transition Networks to encode semantic grammars. The grammars specify word patterns (sequences of words) which correspond to semantic tokens understood by the system. A subset of the tokens are considered top-level tokens, which means they can be recognized independently of surrounding context. Nets call other nets to produce a semantic parse tree. The top-level tokens appear as slots in the frame structures. The frames serve to associate a set of semantic tokens with a function. Information is often represented redundantly in different nets. Some nets represent more complex bindings between tokens, while others represent simple stand-alone values. There is not one large sentential-level grammar, but separate grammars for each slot (there are approximately 70 of these in our ATIS system). The parse is flexible at the slot level in that it allows slots to be filled independently of order. It is not necessary to represent all different orders in which the slot patterns could occur.</Paragraph> <Paragraph position="8"> The parser operates by matching the word patterns for slots against the input text. A set of possible interpretations are pursued simultaneously. The system is implemented as a top-down Recursive Transition Network chart parser for slots. As slot fillers (semantic phrases) are recognized, they are added to the frames to which they apply. The algorithm is basically a dynamic programming beam search on frames. Many different frames, and several different versions of a frame, are pursued simultaneously. The score for each frame hypothesis is the number of words that it accounts for. A file of words not to be counted in the score is included. At the end of an utterance the parser picks the best scoring frame as the result. There is a heuristic procedure for resolving ties. The output of the parser is the frame name and the parse trees for its filled slots.</Paragraph> </Section> </Section> <Section position="5" start_page="213" end_page="214" type="metho"> <SectionTitle> 3. LANGUAGE MODEL GENERATION </SectionTitle> <Paragraph position="0"> Our system uses two different types of language models, a bigram for speech recognition and a semantic grammar for parsing [4, 3].</Paragraph> <Section position="1" start_page="213" end_page="214" type="sub_section"> <SectionTitle> 3.1. Parsing Grammar </SectionTitle> <Paragraph position="0"> The frame structures and patterns for the Recursive Transition Networks were developed by processing transcripts of subjects performing scenarios of the ATIS task. The data, which consist of around 20,000 utterances from the ATIS2 and ATIS3 corpora, were gathered by several sites. A subset of this data (around 10,000 utterances) has been annotated with reference answers. The details of data collection and annotation are described by Dahl [1].</Paragraph> <Paragraph position="1"> The goal of our system is to extract all relevant information from the utterance. We do not attempt to parse each and every word in the utterance. However, we have a problem with grammar coverage if we miss one of the content words in the utterance. Let us look at some of the sentences from the December 1993 ARPA evaluation where grammar coverage was a problem: (x0s032sx) list *AIRPORT-DESIGNATIONS for flights from st.
petersburg (8k8012sx) find a flight round trip from los angeles to tacoma washington with a stopover in san francisco *NOT -EXCEEDING the price of three hundred dollars for june tenth nineteen ninety three (g02084sx) list *THE -ORIGINS for alaska airlines (g0d014sx) *ARE -SNACKS *SERVED on tower air (i0k05esx) list *THE -DISTANCES to downtown (i0k0eesx) list *THOSE -DISTANCES from los angeles to downtown In these sentences, the words preceded by '-' did not occur in the grammar. However, we could parse the following sentences, which are similar to these; the differences are highlighted in bold letters: list airport designation for flights from st. petersburg find a flight round trip from los angeles to tacoma washington with a stopover in san francisco not more than the price of three hundred dollars for june tenth nineteen ninety three list the origin for alaska airlines are snack served on tower air list the distance to downtown list those distance from los angeles to downtown</Paragraph> <Paragraph position="2"> Our semantic grammars are written to allow flexibility in the pattern match. The patterns for a semantic token consist of mandatory words or tokens, which are necessary to the meaning of the token, and optional elements. The patterns are also written to overgenerate in ways that do not change the semantics. This overgeneration not only makes the pattern matches more flexible but also serves to make the networks smaller.</Paragraph> <Paragraph position="3"> Some of the words that occurred in the test set but were not covered by our grammar are as follows:</Paragraph> </Section> </Section> <Section position="6" start_page="214" end_page="214" type="metho"> <SectionTitle> ALONG COMBINATION DESIGNATIONS DISTANCES EVER EXCEEDING INFORMATIONS ORIGINS RIGHT SEATAC SHOWED SNACKS SOUTHERN TIP TOLD VERY WANNA YEAH </SectionTitle> <Paragraph position="0"> Although we have used much of the training data in developing the grammar, we have mainly focused on the annotated data. For the annotated data, we can compare our answers with the reference answers and determine whether we have parsed the sentence reasonably. This is consistent with the goal of our system, which is to extract all relevant information from the utterance. For the unannotated data, we need to manually look at either the output of the parser or the words missed to decide if the parse is reasonable.
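As a sketch of the kind of coverage check described above (the function and file names are hypothetical; this is an illustration, not the tooling actually used for the ATIS system), one can list the words in a set of test utterances that do not appear in the grammar's vocabulary.

# Illustrative sketch only: report test-set words missing from the grammar's vocabulary.
# File names and formats here are hypothetical.

def load_vocabulary(path):
    """One word per line, e.g. a dump of all terminals used by the grammar nets."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def uncovered_words(utterances, vocabulary):
    """Return, sorted, the words in the utterances that the grammar cannot match."""
    missing = set()
    for utt in utterances:
        missing.update(w for w in utt.lower().split() if w not in vocabulary)
    return sorted(missing)

# Example usage (hypothetical vocabulary file):
#   vocab = load_vocabulary("grammar_words.txt")
#   print(uncovered_words(["are snacks served on tower air"], vocab))
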
</Paragraph> <Paragraph position="1"> We had an overall error rate of 5.8% on the November 1992 ARPA evaluation, and an error rate of around 4% on the ATIS2 annotated data. However, our error rate on the ATIS3 annotated data was around 18% for a nearly identical system. In the November 1992 ARPA evaluation, we had an error rate of 5.6% on the Class A set and an error rate of 6.1% on the Class D set. However, 75% of the Class A errors and 69% of the Class D errors in the evaluation set were caused by the lack of grammar coverage. We do not have exact results, but believe that the overwhelming number of errors on the training set are still caused by lack of grammar coverage.</Paragraph> <Paragraph position="2"> We try to generalize the grammar to parse strings that are syntactically similar to the utterances in the training data. In our current system, this is achieved by manually adding synonyms and using other completion techniques. We add other words that are related to the words in the training corpus; for example, if we see Monday in the training corpus, we add the word Mondays as well as the other days of the week. If we had been more thorough in this completion, we would have correctly parsed the sentences from the December 1993 evaluation mentioned above. The process could be greatly improved by automating completion based on synonyms, antonyms, plurals, possessives and other semantic classes. We could also use morphemes to derive new words from the words used in the training data.</Paragraph> <Section position="1" start_page="214" end_page="214" type="sub_section"> <SectionTitle> 3.2. Recognition Language Model </SectionTitle> <Paragraph position="0"> The lexicon used by the recognizer is initially based on the training data. We augment this lexicon using the completion techniques described above, as well as by adding the words from the grammar used by the parser. We also added many city names that are not in the database. The recognition dictionary contained 2924 unique words plus 10 nonspeech events. Currently, we allow nonspeech events to occur between any two words, like a silence. Since we have added words to the lexicon which were not observed in the training data, we need to generate a bigram with appropriate probabilities for these words. Initially, we used a backed-off class bigram. The class bigram used 1573 word classes and was smoothed with the standard back-off method. In generating the class-based models, the probability of a word given a class was determined by the unigram frequency of the word rather than treating the words as equiprobable. We then compared this to a bigram language model created by interpolating two different bigram models: 1) a backed-off word bigram and 2) a backed-off class bigram. Our initial experiments on an unseen test set indicated that the perplexity decreased from 21.7 to 20.58 when we used the interpolated bigram instead of the class bigram. When the recognizer was run on this test set (516 utterances) with the two language models, the word error rate was reduced from 11% to 10% by using the interpolated bigram (compared to the class-based bigram). This is an error reduction of about 9%.</Paragraph> <Paragraph position="1"> We then tried to use the parser grammar to help smooth the bigram.</Paragraph> <Paragraph position="2"> The hope was to improve the match between the language models used during recognition and parsing and to get a better estimate of the probabilities of unseen bigrams. A word-pair grammar was generated from the parser grammar, and probabilities were added by assuming equiprobable transitions from a word to all of its successors. This should give a better probability to a bigram which had not been seen but was acceptable to the parser than to an unseen bigram not allowed by the parser. We then interpolated the class-based, word-based and grammar-based bigrams.
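To make the interpolation concrete, a minimal sketch follows (Python, with hypothetical component models and weights; the actual weights used in the system are not given here) of a linearly interpolated bigram estimate.

# Illustrative sketch of linearly interpolating several bigram estimates.
# The component models and weights are placeholders, not the values used in the paper.

def interpolated_bigram_prob(prev_word, word, models, weights):
    """P(word | prev_word) as a weighted sum of component bigram models.

    models  : list of callables mapping (prev_word, word) -> probability,
              e.g. a backed-off word bigram, a backed-off class bigram,
              and a grammar-derived bigram.
    weights : non-negative interpolation weights summing to 1.
    """
    return sum(w * m(prev_word, word) for w, m in zip(weights, models))

# Example with two toy component models (uniform over a 3-word vocabulary):
word_bigram = lambda prev, w: 1.0 / 3.0
class_bigram = lambda prev, w: 1.0 / 3.0
p = interpolated_bigram_prob("from", "boston", [word_bigram, class_bigram], [0.7, 0.3])
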
However, when evaluated on a test set, a recognizer using this model was not significantly different from one using the interpolated word and class bigrams.</Paragraph> <Paragraph position="3"> The trigram language model used the same word classes as the bigram, also smoothed by the back-off method, but not interpolated with a word-level model. We felt that we did not have enough data to train word trigrams.</Paragraph> </Section> </Section> <Section position="7" start_page="214" end_page="214" type="metho"> <SectionTitle> 4. PARSING </SectionTitle> <Paragraph position="0"> As described earlier, the system is implemented as a chart parser for slot fillers. As tokens (semantic phrases) are recognized, they are added to the frames to which they apply. This process naturally produces partial interpretations. Words which do not fit into an interpretation are left out. In many cases the partial interpretation is sufficient for the system to take the desired action. If not, it still provides a good basis for beginning a clarification dialog with the user. The system can give meaningful feedback to the user about what was understood and prompt for the most relevant missing information. The algorithm also produces multiple interpretations. Many different frames, and several different versions of a frame, are pursued simultaneously. In earlier papers [4], we have described some of the heuristics used by the parser to select the best interpretation for an utterance. If the score is below a certain threshold (based on the number of words in the utterance), the parser does not generate any parse.</Paragraph> <Paragraph position="1"> The implementation of a chart mechanism had other advantages. The system does not allow overlapping phrases in a frame; two different slots cannot use the same word in the input. The previous version of the system used a subsumption process to favor matching the longest strings possible, but did not keep substrings of the longest string. This led to a phrase overlap problem if one slot could start with the same word that could end another slot in the same frame. For example, in the utterance &quot;Show flights from Boston&quot;, if &quot;show flights&quot; matched one slot and &quot;flights from Boston&quot; matched another slot, both could not be assigned to the same frame since they overlap on the word &quot;flights&quot;. The grammar writer could avoid the problem, but had to be careful to do so. The chart algorithm efficiently produces and keeps substrings. Now it is not necessary to bias toward longer strings; the overlap problem is solved by producing both parses at very little extra cost. In the example, we would produce &quot;show flights&quot; &quot;from Boston&quot; and &quot;show&quot; &quot;flights from Boston&quot;, assuming the grammars allowed these phrases.
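The non-overlap constraint just described can be illustrated with a short sketch (Python, with hypothetical structures and slot names; not the actual Phoenix chart parser): slot fillers are kept as chart entries with word spans, and two fillers may join the same frame interpretation only if their spans do not overlap, so both segmentations of the example are retained.

# Illustrative sketch of the non-overlap constraint; hypothetical slot names.
from dataclasses import dataclass

@dataclass
class SlotFiller:
    slot: str        # semantic token, e.g. "[list]" or "[depart_loc]"
    start: int       # index of first word covered
    end: int         # index one past the last word covered

def compatible(fillers, candidate):
    """A candidate filler may join an interpretation only if its word span
    does not overlap any filler already in it."""
    return all(candidate.end <= f.start or candidate.start >= f.end
               for f in fillers)

# For "show flights from boston" (word indices 0..3) the chart keeps both
# segmentations rather than biasing toward the longest match:
interp_a = [SlotFiller("[list]", 0, 2), SlotFiller("[depart_loc]", 2, 4)]
interp_b = [SlotFiller("[list]", 0, 1), SlotFiller("[flights_from]", 1, 4)]
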
</Paragraph> </Section> <Section position="8" start_page="214" end_page="215" type="metho"> <SectionTitle> 5. Alternate Interpretations </SectionTitle> <Paragraph position="0"> Sometimes the heuristics fail to identify the best interpretation, since they do not use additional information available in the context. We found it helpful to sometimes pursue alternate interpretations, that is, interpretations which were not the best according to the heuristics used by the parser. In this section, we describe when alternate interpretations are needed.</Paragraph> <Paragraph position="1"> One of the design decisions in our system was to use task knowledge in the backend instead of in the parser. The parser only uses domain-independent heuristics. This sometimes leads to ambiguities. For example, Show me the fares could refer to fares for flights or fares for ground transportation. In general, the heuristics correctly identify the parse. However, consider the following two sentences:</Paragraph> </Section> <Section position="9" start_page="215" end_page="215" type="metho"> <SectionTitle> I NEED FARES FROM MILWAUKEE TO LONG BEACH AIRPORT (q0g0b7ss) I NEED FARES FROM MILWAUKEE TO GENERAL MITCHELL INTERNATIONAL AIRPORT </SectionTitle> <Paragraph position="0"> These sentences are syntactically identical. The user is most likely asking for flight fares in the first sentence. In the second sentence, the user is most likely asking for the cost of ground transportation.</Paragraph> <Paragraph position="1"> However, the system needs domain knowledge to make these distinctions; it needs to know that Long Beach Airport is in California, while General Mitchell International is in Milwaukee. Since the parser has no domain knowledge, it cannot parse both of these sentences correctly. However, the backend has domain knowledge and notices that one of the parses is incorrect.</Paragraph> <Paragraph position="2"> We address this problem by generating a beam of interpretations.</Paragraph> <Paragraph position="3"> The parser still produces the single best interpretation, but keeps track of a number of other interpretations. Whenever the backend notices a problem, it asks the parser for another interpretation. The parser then selects the next best interpretation. However, the score of the new interpretation must be within a certain threshold of the best score.</Paragraph> <Paragraph position="4"> We next look at the parses for the above-mentioned sentences: i need fares from milwaukee to general mitchell international airport</Paragraph> </Section> </Paper>