<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1017"> <Title>Lattice-Based Search for Spoken Utterance Retrieval</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Evaluation Metrics </SectionTitle> <Paragraph position="0"> For evaluating ASR performance we use the standard word error rate (WER) as our metric. Since we are interested in retrieval, we use the OOV rate by type to measure the OOV word characteristics. For evaluating retrieval performance we use precision and recall with respect to manual transcriptions. Let Correct(q) be the number of times the query q is found correctly, Answer(q) be the number of answers returned for the query q, and Reference(q) be the number of times q is found in the reference.</Paragraph> <Paragraph position="1"> $\mathrm{Precision}(q) = \frac{\mathrm{Correct}(q)}{\mathrm{Answer}(q)} \qquad \mathrm{Recall}(q) = \frac{\mathrm{Correct}(q)}{\mathrm{Reference}(q)}$ </Paragraph> <Paragraph position="2"> We compute precision and recall rates for each query and report the average over all queries. The set of queries Q consists of all the words seen in the reference except for a stoplist of the 100 most common words. The measurement is not weighted by frequency, i.e. each query q ∈ Q is presented to the system only once, independent of the number of occurrences of q in the transcriptions.</Paragraph> <Paragraph position="3"> $\mathrm{Precision} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{Precision}(q) \qquad \mathrm{Recall} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{Recall}(q)$ </Paragraph> <Paragraph position="4"> For lattice-based retrieval methods, different operating points can be obtained by changing the threshold. The precision and recall at these operating points can be plotted as a curve.</Paragraph> <Paragraph position="5"> In addition to individual precision-recall values we also compute the F-measure, defined as</Paragraph> <Paragraph position="6"> $F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ </Paragraph> <Paragraph position="7"> and report the maximum F-measure (maxF) to summarize the information in a precision-recall curve.</Paragraph> </Section>
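The metric computation above is straightforward to express in code. The following is a minimal sketch, assuming per-query hit counts have already been extracted and that the precision-recall curve is available as a list of operating points; the function names, the Counts layout, and the zero-denominator conventions are our own illustration, not part of the paper.

```python
from collections import namedtuple

# Per-query counts at a fixed operating point (score threshold).
Counts = namedtuple("Counts", ["correct", "answer", "reference"])

def macro_precision_recall(counts_by_query):
    """Average per-query precision and recall over all queries in Q."""
    precisions, recalls = [], []
    for c in counts_by_query.values():
        # Precision(q) = Correct(q)/Answer(q); Recall(q) = Correct(q)/Reference(q).
        # The edge cases below (no answers, no reference hits) are assumptions;
        # the paper does not spell them out.
        precisions.append(c.correct / c.answer if c.answer else 1.0)
        recalls.append(c.correct / c.reference if c.reference else 0.0)
    n = len(counts_by_query)
    return sum(precisions) / n, sum(recalls) / n

def f_measure(precision, recall):
    """F = 2PR / (P + R), taken as 0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def max_f(curve):
    """maxF over a precision-recall curve given as (P, R) operating points."""
    return max(f_measure(p, r) for p, r in curve)

if __name__ == "__main__":
    counts = {"lattice": Counts(3, 4, 5), "retrieval": Counts(2, 2, 2)}
    print(macro_precision_recall(counts))  # (0.875, 0.8)
```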
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Corpora </SectionTitle> <Paragraph position="0"> We use three different corpora to assess the effectiveness of different retrieval techniques.</Paragraph> <Paragraph position="1"> The first corpus is the DARPA Broadcast News corpus, consisting of excerpts from TV or radio programs under various acoustic conditions. The test set is the 1998 Hub-4 Broadcast News (hub4e98) evaluation test set (available from the LDC, Catalog no. LDC2000S86), which is 3 hours long and was manually segmented into 940 segments. It contains 32411 word tokens and 4885 word types. For ASR we use a real-time system (Saraclar et al., 2002). Since the system was designed for SDR, its recognition vocabulary has over 200K words. The pronunciation dictionary has an average of 1.25 pronunciations per word.</Paragraph> <Paragraph position="2"> The second corpus is the Switchboard corpus, consisting of two-party telephone conversations. The test set is the RT02 evaluation test set, which is 5 hours long, has 120 conversation sides, and was manually segmented into 6266 segments. It contains 65255 word tokens and 3788 word types. For ASR we use the first pass of the evaluation system (Ljolje et al., 2002). The recognition vocabulary of the system has over 45K words. For these words the average number of pronunciations per word is 1.07.</Paragraph> <Paragraph position="3"> The third corpus is named Teleconferences since it consists of multiparty teleconferences on various topics. The audio from the legs of the conference is summed and recorded as a single channel. A test set of six teleconferences (about 3.5 hours) was transcribed. It contains 31106 word tokens and 2779 word types. Calls are automatically segmented into a total of 1157 segments prior to ASR, using an algorithm that detects changes in the acoustics. We again use the first pass of the Switchboard evaluation system for ASR.</Paragraph> <Paragraph position="4"> In Table 1 we present the ASR performance on these three tasks as well as the OOV rate by type for each corpus. It is important to note that the recognition vocabularies for the Switchboard and Teleconferences tasks are the same, and no data from the Teleconferences task was used while building the ASR systems. The mismatch between the Teleconference data and the models trained on the Switchboard corpus contributes to the significant increase in WER.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Using ASR Best Word Hypotheses </SectionTitle> <Paragraph position="0"> As a baseline, we use the best word hypotheses of the ASR system for indexing and retrieval. The performance of this baseline system on the various LVCSR tasks is given in Table 2. As expected, we obtain very good performance on the Broadcast News corpus. It is interesting to note that when moving from Switchboard to Teleconferences, the degradation in precision-recall is the same as the degradation in WER.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Using ASR Word Lattices </SectionTitle> <Paragraph position="0"> In the second set of experiments we investigate the use of ASR word lattices. In order to reduce storage requirements, lattices can be pruned to contain only the paths whose costs (i.e. negative log likelihoods) are within a threshold of the cost of the best path. The smaller this cost threshold is, the smaller the lattices and the index files are. In Figure 1 we present the precision-recall curves for different pruning thresholds on the Teleconferences task.</Paragraph> <Paragraph position="1"> In Table 3 the resulting index sizes and maximum F-measure values are given. On the Teleconferences task we observed that cost=6 yields good results, and we used this value for the rest of the experiments. Note that this increases the index size with respect to the ASR 1-best case by 3 times for Broadcast News, by 5 times for Switchboard, and by 9 times for Teleconferences.</Paragraph> </Section>
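To make the pruning criterion of Section 4.4 concrete, here is a hedged sketch of cost-based pruning over an acyclic lattice: an arc survives only if some complete path through it has cost within the threshold of the best path. The (src, dst, label, cost) arc layout and the plain-Python dynamic program are our own assumptions for illustration; the paper's lattices would in practice be handled with finite-state tools.

```python
import math
from collections import defaultdict

def prune_lattice(arcs, start, final, threshold):
    """Keep only arcs lying on some path whose total cost (negative log
    likelihood) is within `threshold` of the best path's cost."""
    succs = defaultdict(list)
    for a in arcs:
        succs[a[0]].append(a)

    # Topological order via DFS; the lattice is assumed acyclic.
    order, seen = [], set()
    def visit(u):
        if u in seen:
            return
        seen.add(u)
        for _, v, _, _ in succs[u]:
            visit(v)
        order.append(u)
    visit(start)
    order.reverse()  # sources before sinks

    # alpha[u]: best cost start->u; beta[u]: best cost u->final.
    alpha = defaultdict(lambda: math.inf, {start: 0.0})
    for u in order:
        for _, v, _, w in succs[u]:
            alpha[v] = min(alpha[v], alpha[u] + w)
    beta = defaultdict(lambda: math.inf, {final: 0.0})
    for u in reversed(order):
        for _, v, _, w in succs[u]:
            beta[u] = min(beta[u], w + beta[v])

    best = alpha[final]
    return [a for a in arcs
            if alpha[a[0]] + a[3] + beta[a[1]] <= best + threshold]
```

With threshold=0 only best-path arcs survive; raising the threshold grows the pruned lattice, matching the index-size trend reported in Table 3.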
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 Using ASR Phone Lattices </SectionTitle> <Paragraph position="0"> Next, we compare the two methods of phonetic transcription discussed in Section 3.3 (phone recognition and word-to-phone conversion) for retrieval using only phone lattices. In Table 4 the precision and recall values that yield the maximum F-measure, as well as the maximum F-measure values, are presented. These results clearly indicate that phone recognition is inferior for our purposes.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.6 Using ASR Word and Phone Lattices </SectionTitle> <Paragraph position="0"> We investigated the strategies mentioned in Section 3.4, and found strategy 3 (search the word index; if no result is returned, search the phone index) to be superior to the others. We give a comparison of the maximum F-measure values for the three strategies in Table 5.</Paragraph> <Paragraph position="1"> In Figure 2 we present results for this strategy on the Teleconferences corpus. The phone indices used in these experiments were obtained by converting the word lattices into phone lattices. Using the phone indices obtained by phone recognition gave significantly worse results.</Paragraph> </Section>
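Strategy 3 from Section 4.6 amounts to a simple fallback rule, sketched below under assumed interfaces: word_index and phone_index map a key to a list of hits, and word_to_phones returns a query's pronunciation as a phone list. All three are hypothetical stand-ins, not the paper's implementation; the min_phones guard anticipates the false-alarm filter studied in Section 4.7.

```python
def search(query, word_index, phone_index, word_to_phones, min_phones=3):
    """Strategy 3: search the word index; only when it returns nothing,
    fall back to the phone index."""
    hits = word_index.get(query, [])
    if hits:
        return hits

    phones = word_to_phones(query)  # e.g. "lattice" -> ["l", "ae", "t", "ih", "s"]
    # Short pronunciations produce many false alarms in the phone index
    # (Section 4.7), so only answer queries with more than min_phones phones.
    if len(phones) <= min_phones:
        return []
    return phone_index.get(" ".join(phones), [])
```

Because the phone index is consulted only for queries the word index misses (chiefly OOV words), the fallback recovers recall without flooding in-vocabulary queries with phonetic false alarms.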
<Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.7 Effect of Minimum Pronunciation Length for Queries </SectionTitle> <Paragraph position="0"> When searching for words with short pronunciations in the phone index, the system produces many false alarms. One way of reducing the number of false alarms is to disallow queries with short pronunciations. In Figure 3 we show the effect of imposing a minimum pronunciation length for queries. For a query to be answered, its pronunciation has to have more than minphone phones; otherwise no answers are returned. The best maximum F-measure is obtained using minphone=3.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.8 Effects of Recognition Vocabulary Size </SectionTitle> <Paragraph position="0"> In Figure 4 we present results for different recognition vocabulary sizes (5k, 20k, 45k) on the Switchboard corpus. The OOV rates by type are 32%, 10%, and 6%, respectively. The word error rates are 41.5%, 40.1%, and 40.1%, respectively. The precision-recall curves are almost the same for the 20k and 45k vocabulary sizes.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.9 Using Word Pair Queries </SectionTitle> <Paragraph position="0"> So far, in all the experiments the query list consisted of single words. In order to observe the behavior of the various methods when faced with longer queries, we used a set of word pair queries. Instead of using all the word pairs seen in the reference transcriptions, we chose the ones which were more likely to occur together than with other words. For this, we sorted the word pairs (w1, w2) according to their pointwise mutual information $\log \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)}$ and used the top pairs as queries in our experiments. Note that in these experiments only the query set is changed; the indices remain the same as before.</Paragraph> <Paragraph position="1"> As it turns out, the precision of the system is very high on this type of query. For this reason, it is more interesting to look at the operating point that achieves the maximum F-measure for each technique, which in this case coincides with the point that yields the highest recall. In Table 6 we present results on the Switchboard corpus using 1004 word pair queries. Using word lattices it is possible to increase the recall of the system by 16.4% while degrading the precision by only 2.2%. Using phone lattices we can get another 3.7% increase in recall for a 1.2% loss in precision. The final system still has 95% precision.</Paragraph> <Paragraph position="2"> Finally, we make a comparison of the various techniques on the different tasks. In Table 7 the maximum F-measure (maxF) is given for each. Using word lattices yields a relative gain of 3-5% in maxF over using best word hypotheses. For the final system that uses both word and phone lattices, the relative gain over the baseline increases to 8-12%.</Paragraph> <Paragraph position="3"> In Figure 5 we present the precision-recall curves. The gain from using the better techniques utilizing word and phone lattices increases as retrieval performance gets worse.</Paragraph> </Section> </Section> </Paper>