<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1072"> <Title>The Test Results</Title> <Section position="7" start_page="367" end_page="369" type="evalu"> <SectionTitle> EXPERIMENTAL RESULTS </SectionTitle> <Paragraph position="0"> We have used the N-best interface, with TINA as the filter, as our baseline in measuring performance improvements derived from combining parse and acoustic information. To aid us in assessing the impact of various changes, we have used a composite score for system performance, computed as the percent of correct answers minus the percent of wrong answers.</Paragraph> <Paragraph position="1"> Here we define &quot;correct answer&quot; very strictly, namely producing the call to the VOYAGER back-end that would have been produced by a &quot;clean&quot; transcription of the sentence, with false starts and filled pauses removed. The advantage of this strict method is that the procedure can be fully automated and requires no human judgements. It does allow certain &quot;meaning preserving&quot; alterations, e.g., insertion or deletion of &quot;the&quot; as in &quot;the Royal East&quot; vs. &quot;Royal East&quot;. Such a criterion seems reasonable, given that correctness for a spoken language system should measure understanding rather than word accuracy. (This criterion is stricter than the one used for the results reported at the DARPA June 1990 meeting [10], where correctness was judged in terms of producing the same action, as determined by an expert. Under the new, stricter criterion, if the transcribed sentence produces no function call (action), the recognized sentence cannot possibly be correct, even if it has produced a reasonable interpretation of the input. We estimate that approximately 5% of the sentences counted as incorrect here would have been judged correct under the earlier criterion, which accounts for a difference of about 10 points in score and brings the two results into approximate agreement.)</Paragraph> <Paragraph position="2"> The N-best Interface In the N-best interface, the grammar functions only as a filter, and N is used as a rejection criterion. As N increases, the number of correct answers increases, but the number of incorrect answers also increases. Overall system performance rises rapidly between N = 1 and N = 6, peaks at N = 25, and then drops off gradually, as the system finds incorrect answers at a faster rate than correct answers.</Paragraph> <Section position="1" start_page="368" end_page="368" type="sub_section"> <SectionTitle> Adding Parse Probabilities </SectionTitle> <Paragraph position="0"> If we now combine a parse score with the acoustic score, we get much better results. We can see how this works by looking at the example in Table 2. Here we see that the correct answer is eventually found in the N-best output (the eleventh sentence). However, it is preceded by other sentences that parse and produce possible (but wrong) function calls to VOYAGER. The N-best output produces its candidates in order of acoustic score. We see that the correct sentence has a worse acoustic score (-1521 vs. -1336), but its parse score is substantially better (-14.1 vs. -18.0). In general, we note that the normalized parse score is a good discriminator of right vs. wrong sentences: the mean for correct answers is -2.92 with a standard deviation of 0.75, while for incorrect answers it is -4.31 with a standard deviation of 1.78.
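For concreteness, the baseline N-best filtering scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual TINA/VOYAGER code; the names (Hypothesis, parses, nbest_filter) are hypothetical stand-ins.

```python
# Minimal sketch of the N-best interface: the grammar acts purely as a
# filter, and N serves as the rejection criterion. Hypothetical names;
# not the actual TINA/VOYAGER API.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Hypothesis:
    words: str
    acoustic_score: float  # recognizer score; the list is sorted by it

def nbest_filter(nbest: List[Hypothesis],
                 parses: Callable[[str], bool],
                 n: int) -> Optional[Hypothesis]:
    """Accept the first hypothesis among the top n that parses;
    return None (i.e., reject the utterance) if none of them does."""
    for hyp in nbest[:n]:
        if parses(hyp.words):
            return hyp
    return None
```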
If we compute the most obvious thing, which is a linear combination of the normalized parse score and the acoustic score, it is possible, by proper choice of weight, to get the correct answer to have the best combined score. This is illustrated in Table 3, which shows the relative combined scores at two different weights: at weight W = 100, the wrong answer still has a higher combined score, but as we increase the weight of the parse score (e.g., to W = 200), the correct sentence receives a higher combined score.

Rank  Acoustics  Parse  #Wds  Sentence
 1.   -1336      X            it i get to kendall sq
 2.   -1387      -18.0    6   could i get to kendall sq
 3.   -1432      -18.0    6   would i get to kendall sq
 4.   -1455      X            it i'd get to kendall sq
 5.   -1460      X            it i do the kendall sq
 6.   -1472      X            at do i get to kendall sq
 7.   -1506      X            could i'd get to kendall sq
 8.   -1509      X            i'd i get to kendall sq
 9.   -1511      X            could i do the kendall sq
10.   -1516      X            it i get at kendall sq
11.   -1521      -14.1    7   how do i get to kendall sq

Table 2: N-best Output with Acoustic and Parse Scores (X = no parse)

We can determine an optimal parse-score weight for the training data by looking at the overall score (percent correct minus percent incorrect) as a function of the parse-score weight. The combination that produced the optimal overall score for the VOYAGER training data was Acoustics + 600 * Normalized-Parse, as shown in Figure 3. In order to determine the effect of the size of N on this result, we also ran experiments varying N. It turns out that although the optimal N using only the acoustic score is N = 25, the optimal N for the combined parse and acoustic score is N = 35; performance is fairly stable from N = 25 to N = 100. Using the combined acoustic plus weighted parse score, some of the original errors are corrected: the percent correct (at N = 25) goes up from 36.1% for the N-best case to 38.7% for the combined score, while the percent incorrect goes down from 17.8% to 15.1%. At N = 25, we get an overall score of 23.6%, compared to 18.3% for N-best alone (an increase of 30%).</Paragraph> <Paragraph position="1"> Finally, if we make use of the normalized parse score to formulate an explicit rejection criterion, we find that we can improve our results still further. Figure 4 shows how percent correct and percent incorrect vary with the choice of threshold. Using an empirically determined threshold of -4.0, the performance at N = 25 shows 37.5% correct (losing some correct answers that fall below the threshold), 10.9% incorrect (a substantial reduction from 15.1% without the rejection threshold), and an overall score of 26.6% (up from 23.6% for the combined parse and acoustic score without a rejection criterion). We also experimented with a rejection criterion based on acoustic score (e.g., the difference between the best score and the current score) but did not find it useful in this domain; however, it did turn out to be useful in the ATIS domain [12]. A comparison of four different configurations at N = 25 is shown in Figure 5, with results in Table 4.</Paragraph> </Section> <Section position="2" start_page="368" end_page="369" type="sub_section"> <SectionTitle> The Test Results </SectionTitle> <Paragraph position="0"> The overall score was optimized by running on a set of 568 training sentences (the development test set). Once we had determined optimum parameters for the parse-score weight (W = 600), the rejection threshold (T = -4), and the value of N, we ran the test data (497 sentences) using these parameters.
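Concretely, the decision rule whose parameters were just fixed (combined score = Acoustics + W * Normalized-Parse, with W = 600 and rejection threshold T = -4) can be sketched as follows. This is a minimal illustration rather than the actual system code; in particular, it assumes, consistent with the #Wds column of Table 2 and the W = 100 vs. W = 200 behavior above, that the normalized parse score is the raw parse score divided by the number of words, and parse_score is a hypothetical stand-in for TINA.

```python
# Sketch of combined-score rescoring with an explicit rejection criterion,
# using the settings reported in the text (W = 600, T = -4.0, N = 25).
from typing import Callable, List, Optional, Tuple

W = 600.0   # parse-score weight (optimal on the VOYAGER training set)
T = -4.0    # rejection threshold on the normalized parse score

def rescore(nbest: List[Tuple[str, float]],   # (sentence, acoustic score)
            parse_score: Callable[[str], Optional[Tuple[float, int]]],
            n: int = 25) -> Optional[str]:
    """parse_score returns (raw parse score, word count), or None if the
    string does not parse. Returns the best sentence, or None to reject."""
    best, best_combined = None, float("-inf")
    for sentence, acoustic in nbest[:n]:
        result = parse_score(sentence)
        if result is None:
            continue                        # no parse: filtered out by the grammar
        raw, n_words = result
        normalized = raw / n_words          # assumed per-word normalization
        if normalized < T:
            continue                        # explicit rejection criterion
        combined = acoustic + W * normalized
        if combined > best_combined:
            best, best_combined = sentence, combined
    return best                             # None means the utterance is rejected
```

Under this assumed normalization, the Table 2 numbers reproduce the weight effect described above: at W = 100, sentence 2 scores -1387 + 100 x (-18.0/6) = -1687 and beats sentence 11's -1521 + 100 x (-14.1/7), about -1722, whereas at W = 200 sentence 11 wins, about -1924 vs. -1987.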
The resulting increases in score are shown in Figure 6 for both training and test data. Overall, the test results are quite comparable to the training results. The use of the combined parse plus acoustic score resulted in an increase in overall score from 21.5 to 28.0 (30%). The use of a rejection threshold together with the combined score resulted in a small additional increase to 28.8, more than 33% over the N-best results for N = 25.</Paragraph> </Section> </Section> <Section position="8" start_page="369" end_page="370" type="evalu"> <SectionTitle> FUTURE DIRECTIONS </SectionTitle> <Paragraph position="0"> All of this research has been done as a first step towards coupling the recognizer and the language understanding system more closely. Our initial results show more than a 33% improvement in score from using parse information in addition to the acoustic score. Having demonstrated that it is beneficial to change the shape of the search space using this knowledge, we are now pursuing experiments with a tightly coupled system to explore ways of increasing search efficiency. We currently have a tightly coupled version of the system running that produces identical output but uses TINA to predict allowable next words for the recognizer, given a string of words hypothesized by the recognizer. This approach has the potential to reduce the search space for the recognizer, since it will explore only word strings that can be interpreted by TINA. This reduction in search space comes, of course, at the price of considerable computation (namely, parsing the current hypotheses). We plan to investigate the trade-offs between the greater pruning provided by tight coupling and the greater computation required. However, our initial results are quite promising: the tightly coupled system produces its answer in under a minute, running unoptimized on a Sun SPARC-2 workstation. The next step in tight coupling will be to incorporate the parse probabilities into the overall A* (or other) search strategy. By tuning the algorithm and off-loading some of the acoustic search to special-purpose signal processing boards, we believe that the tightly coupled mode will provide improved performance over the N-best interface. Our results to date provide strong evidence that we can use additional knowledge from syntactic and semantic probabilities to improve overall system performance. They also indicate that explicit rejection criteria play an important part in improving system performance. In particular, the parse score threshold provides a good rejection criterion based on syntactic and semantic information. Once we develop reliable rejection criteria, we can begin to experiment with recovery strategies from rejection. For example, given a sentence that fails the rejection criterion, it might be possible to interact with the user, saying, e.g., &quot;I thought you said '...'; did I understand you correctly?&quot; This would allow the user to confirm a correctly understood sentence and to correct a misunderstood one. This is surely preferable to providing misleading information on the basis of an incorrectly understood sentence. The notion of rejection criteria should also be helpful in identifying new words and the sentences that contain them. We plan to explore how to use human-machine interaction and combined syntactic, semantic, and acoustic knowledge to make further improvements in the performance and usability of the spoken language interface.</Paragraph> </Section> </Paper>