<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0404"> <Title>Correct parts extraction from speech recognition results using semantic distance calculation, and its application to speech translation</Title> <Section position="4" start_page="24" end_page="25" type="metho"> <SectionTitle> 2 Correct Parts Extraction using Constituent Boundary Parser </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="24" end_page="24" type="sub_section"> <SectionTitle> 2.1 Constituent Boundary Parser (CB-parser) </SectionTitle> <Paragraph position="0"> For effective and robust spoken-language translation, a speech translation system called Transfer Driven Machine Translation (TDMT), which carries out analysis and translation in an example-based framework, has been proposed\[6\]. TDMT, a form of Example-Based Machine Translation (EBMT)\[7\], does not require a full analysis; instead, it defines patterns on sentences/phrases expressed by &quot;variables&quot; and &quot;constituent boundaries&quot;. These patterns are classified into several classes, for example a complex sentence pattern class, an embedded clause pattern class, and a phrase pattern class. A long-distance dependency structure can be handled by complex sentence patterns. The process employs a fast nearest-matching method to find the closest translation example by measuring the semantic conceptual distance of a given linguistic expression from a set of equivalents in the example corpus. In general, the EBMT method is particularly effective when the structure of an input expression is short or well-defined and its bounds have been recognized. When applying it to the translation of longer utterances, the input must first be chunked to determine potential patterns, by analyzing it into phrases after adding part-of-speech tags. In TDMT, translation is performed by means of stored translation examples, which are represented by &quot;constituent boundary patterns&quot;.
These are built using limited word-tag information, derived from morphological analysis, in the following sequence\[6\]: (a) insertion of constituent boundary markers, (b) derivation of possible structures by pattern matching, and (c) structural disambiguation using similarity calculation\[8\].</Paragraph> <Paragraph position="1"> Figure 1. An example of CPE. Language model for speech recognition: word bi-gram; threshold for semantic distance: 0.2. Input sentence: /He says the bus leaves Kyoto at 11 a.m./ Recognition result: /He sells though the bus leaves Kyoto at 11 a.m./</Paragraph> <Paragraph position="3"> If the process of the similarity calculations for candidate phrase patterns were executed top-down & breadth-first, the calculation cost would be prohibitive and the decision on the best phrase would have to be postponed.</Paragraph> <Paragraph position="4"> Instead, the current TDMT uses an incremental method that determines the best structure locally in a bottom-up & best-only way to constrain the number of competing structures; this reduces the translation cost while still analyzing phrases or partial sentences. This means that even if TDMT fails to analyze a whole sentence, partially analyzed substructures can still be obtained.</Paragraph> </Section> <Section position="2" start_page="24" end_page="25" type="sub_section"> <SectionTitle> 2.2 Correct Parts Extraction </SectionTitle> <Paragraph position="0"> Our proposed correct parts extraction (CPE) method obtains correct parts from recognition results by using the CB-parser. CPE uses the following two factors for the extraction: (1) the semantic distance between the input expression and an example expression, and (2) the structure selected by the shortest semantic distance.</Paragraph> <Paragraph position="1"> The merits of using the CB-parser are as follows.
* The CB-parser can analyze spontaneous speech that cannot be analyzed within the CFG framework, provided that the example expressions are selected from a spontaneous speech corpus. With more expressions in spontaneous speech, there is an increased ability to distinguish between erroneous sentences and correct ones.</Paragraph> <Paragraph position="2"> * The CB-parser can deal with patterns spanning over N words, which cannot be dealt with during speech recognition (see Table 5).</Paragraph> <Paragraph position="3"> * The CB-parser can extract some partial structures from the parsing results independently, even if the parsing fails for the whole sentence.</Paragraph> <Paragraph position="4"> Correct parts are extracted under the following conditions: * Expressions including erroneous words show large distance values to the examples. When the distances are over the distance threshold, the parts are defined as &quot;erroneous parts&quot;.</Paragraph> <Paragraph position="5"> * Correct parts are extracted only from global parts consisting of over N words. If local parts including fewer than N words cannot have a relation to other parts, the parts are defined as &quot;erroneous parts&quot;, even if their semantic distances are under the threshold. Figure 1 shows an example of CPE. The input sentence /He says the bus leaves Kyoto at 11 a.m./ is recognized as /He sells though the bus leaves Kyoto at 11 a.m./ by continuous speech recognition using a word bi-gram. The solid lines in Figure 1 indicate partial structures, and the number for each structure denotes the corresponding semantic distance value. The dotted line indicates the failed analysis result. In this example, the analysis for the whole sentence is unsuccessful because the part /He says/ is mis-recognized as /He sells though/. First, the distance value of the longest part, /though the bus leaves Kyoto at 11 a.m./, is compared with the threshold value.
The part is considered to include erroneous words because the distance value 0.4 is larger than the threshold value 0.2. Secondly, the next longest part /the bus leaves Kyoto at 11 a.m./ is evaluated. This part is extracted as a correct part because the distance 0.005 is under the threshold value. Thirdly, the remaining part /He sells/ is evaluated. The distance of the part /He sells/ is under the threshold value, but the part includes only two words, which is under N, so the part /He sells/ is regarded as an erroneous part.</Paragraph> </Section> </Section> <Section position="5" start_page="25" end_page="27" type="metho"> <SectionTitle> 3 Evaluation </SectionTitle> <Paragraph position="0"> We evaluated CPE using the speech translation system shown in Figure 2. CPE has already been integrated into TDMT as explained in the previous section. First, the obtained recognition results were analyzed, and partial structures and their semantic distances were output. Next, the correct parts were extracted, and only the extracted parts were translated into target sentences. We evaluated the following three things: (1) the recall and precision rates of the extracted parts, (2) the effectiveness of the method in understanding misrecognized results, and (3) the effectiveness of the method in improving the translation rate. For the evaluations, we used 70 erroneous results output by a speech recognition experiment using the ATR spoken language database on travel arrangement \[10\].</Paragraph> <Section position="1" start_page="25" end_page="26" type="sub_section"> <SectionTitle> 3.1 Rate of correct parts extraction </SectionTitle> <Paragraph position="0"> To evaluate CPE, we compared the recall and precision rates after extraction to the same rates before extraction. Recall and precision are defined as follows: recall = (number of correct words in the extracted parts) / (number of words in the correct sentence); precision = (number of correct words in the extracted parts) / (number of words in the recognition result). The extraction defines the threshold for the number of words in a structure to be N+1, on the assumption that the semantic distances of local parts consisting of fewer than N words are not useful for determining whether the parts are correct or not. To confirm whether this assumption is true, extraction experiments were performed under variable threshold conditions for the number of words in a structure. Figure 3 shows the obtained recall and precision rates.</Paragraph> <Paragraph position="1"> * The recall rates under all conditions are over 92%, and the best recall rate is 97%. This indicates that the rates increased by over 15% from before the extraction.</Paragraph> <Paragraph position="2"> * The precision rates show a decrease of over 20% from before the extraction. This means that some correct parts could not be extracted. * When the threshold is two, the recall rates decrease much more than when the threshold is over three.</Paragraph> <Paragraph position="3"> * When the threshold is over four, the precision rate decreases considerably.</Paragraph> <Paragraph position="4"> Furthermore, extraction experiments were performed under variable threshold values of the semantic distance, to examine the relation between the threshold for the semantic distance and the rate of correct parts extraction. The recall and precision rates are shown in Figure 4. * There is a general trend that when the threshold increases, the recall rate decreases and the precision rate increases. However, the differences in these rates are smaller than the differences obtained by changing the threshold for the number of words, as shown in Figure 3. In particular, the precision rate changes only slightly.</Paragraph> <Paragraph position="5"> * When the threshold is defined as below 0.2, the recall and precision rates do not change.
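The extraction rule and the recall/precision metrics above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the part representation (a list of (text, distance) pairs), the whitespace tokenization, and the function names are all assumptions made for this sketch.

```python
from collections import Counter

def extract_correct_parts(parts, dist_threshold=0.2, min_words=3):
    """Keep a parsed part only if its semantic distance is within the
    threshold AND it spans at least min_words (= N + 1) words."""
    return [(text, dist) for text, dist in parts
            if dist <= dist_threshold and len(text.split()) >= min_words]

def recall_precision(extracted, correct_sentence, recognition_result):
    """Word-level recall/precision as defined above, using simple
    bag-of-words matching against the correct sentence."""
    ext = Counter(w for text, _ in extracted for w in text.split())
    ref = Counter(correct_sentence.split())
    n_correct = sum((ext & ref).values())  # correct words in extracted parts
    recall = n_correct / len(correct_sentence.split())
    precision = n_correct / len(recognition_result.split())
    return recall, precision

# The Figure 1 example: only /the bus leaves Kyoto at 11 a.m./ survives;
# /though .../ fails the distance test, and /He sells/ has fewer than
# min_words words, so it is discarded despite its small distance.
parts = [("though the bus leaves Kyoto at 11 a.m.", 0.4),
         ("the bus leaves Kyoto at 11 a.m.", 0.005),
         ("He sells", 0.1)]
kept = extract_correct_parts(parts)
```

On the Figure 1 example this keeps exactly one part, and the metrics behave as the definitions imply: the denominator of recall is fixed by the correct sentence and that of precision by the recognition result, so discarding correct words lowers precision while recall measures how much of the reference survives.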
These results show the following: * Words extracted by CPE are almost all genuinely correct words.</Paragraph> <Paragraph position="6"> * The threshold for the number of words should be defined as three or more when a bi-gram is adopted, because the recall rates decrease when the threshold is two. It therefore seems that the assumption is true that local parts consisting of fewer than N words are not useful for determining the correct parts.</Paragraph> <Paragraph position="7"> * The best threshold condition for the number of words is three, in consideration of both the recall and the precision. Under this condition, the recall rate is typically 96% and the precision rate is typically 63%.</Paragraph> <Paragraph position="8"> * The best threshold condition for the semantic distance is 0.2, because when the threshold is defined as over 0.2, the recall rate decreases.</Paragraph> </Section> <Section position="2" start_page="26" end_page="27" type="sub_section"> <SectionTitle> 3.2 Effect on speech understanding </SectionTitle> <Paragraph position="0"> To confirm the effectiveness of CPE in understanding speech recognition sentences, we compared the understanding rate of extracted parts using CPE with the rate of the recognition results before extraction. The same 70 erroneous sentences as in the previous experiments were used. The threshold for the number of words was defined as three and the threshold for the semantic distance was defined as 0.2, which were confirmed to be the best values in Figures 3 and 4. The results were evaluated by five Japanese evaluators. They gave one of the following five levels (L1)-(L5) to each misrecognition result before extraction and after extraction, by comparing the results with the corresponding correct sentence before speech recognition. The five levels were:</Paragraph> <Paragraph position="1"> (L1) Able to understand.</Paragraph> <Paragraph position="2"> (L2) Able to understand, but the expression is slightly awkward.</Paragraph> <Paragraph position="3"> (L3) Unable to understand, but the result is helpful in imagining the correct sentence.
(L4) Understanding of the wrong meaning; CPE is not helpful.</Paragraph> <Paragraph position="4"> (L5) Output of the message &quot;Recognition impossible.&quot; Each of the average rates of the five evaluators is shown in Table 1. CPE was effective in reducing the misunderstanding rate by more than half (from 35.5% to 15.2%). The understandable results, i.e., those given (L1) or (L2), increased only slightly with CPE (from 19.6% to 20.3% for (L1), and from 22.0% to 22.6% for (L2)). The tendency was that most of the misrecognition sentences including only negligible errors could be understood even without CPE, because the evaluators could spot the errors themselves while reading the misrecognition results. On the other hand, most of the misrecognition sentences that included many erroneous parts were understood incorrectly. The proposed CPE was very effective here in preventing misunderstandings. Nonetheless, other additional mechanisms seem necessary, such as an error recovery mechanism that increases the number of understandable sentences.</Paragraph> </Section> <Section position="3" start_page="27" end_page="27" type="sub_section"> <SectionTitle> 3.3 Effect on speech translation </SectionTitle> <Paragraph position="0"> We evaluated the effectiveness of CPE in Japanese-English speech translation experiments using the speech translation system shown in Figure 2; the threshold values for the CPE method were the same as in the previous experiments. The translation results were evaluated by three Japanese evaluators, each with a high ability to converse in English. They gave one of five levels (L1)-(L5) to each translation result of the misrecognized sentences, by comparing the result with the corresponding translation result of the correct sentence before speech recognition.
(L1)-(L4) for the evaluations were the same as in the previous experiments, and (L5) meant &quot;Cannot translate&quot;.</Paragraph> <Paragraph position="1"> Each of the average rates of the three evaluators is shown in Table 2.</Paragraph> <Paragraph position="2"> Without CPE, 85.7% of the recognition results could not be translated. It seems that CPE is good for (L1)-(L3) but poor for (L4); (L5) shows a negligible effect. The correctness rate for translation after CPE is more than double the rate before CPE (11.9% to 25.7%). The sum of (L1)-(L3) is 69%. This means that the proposed CPE is effective in improving the translation performance. However, we cannot ignore the fact that 21% of the recognition results were translated into erroneous sentences.</Paragraph> </Section> </Section> </Paper>