File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/c04-1168_metho.xml
Size: 22,185 bytes
Last Modified: 2025-10-06 14:08:46
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1168"> <Title>A Uni ed Approach in Speech-to-Speech Translation: Integrating Features of Speech Recognition and Machine Translation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Feature-based Log-linear Models </SectionTitle> <Paragraph position="0"> in Speech Translation The speech translation experimental system used in this study illustrated in Fig. 1 is a typical, statistics-based one. It consists of two major cascaded components: an automatic speech recognition (ASR) module and a statistical machine translation (SMT) module. Additionally, a third module, 'Rescore', has been added to the system and it forms a key component in the system. Features derived from ASR and SMT are combined in this module to rescore translation candidates.</Paragraph> <Paragraph position="1"> Without loss of its generality, in this paper we use Japanese-to-English translation to explain the generic speech translation process. Let X denote acoustic observations of a Japanese utterance, typically a sequence of short-time spectral vectors received at a frame rate of every centi-second. It is rst recognized as a Japanese sentence, J. The recognized sentence is then translated into a corresponding English sentence, E.</Paragraph> <Paragraph position="2"> The conversion from X to J is performed in the ASR module. Based on Bayes' rule, P(JjX) can be written as</Paragraph> <Paragraph position="4"> where Pam(XjJ) is the acoustic model likelihood of the observations given the recognized sentence J; Plm(J), the source language model probability; and P(X), the probability of all acoustic observations.</Paragraph> <Paragraph position="5"> In the experiment we generated a set of N-best hypotheses, JN1 = fJ1;J2; ;JNg 1 and each Ji is determined by</Paragraph> <Paragraph position="7"> where i is the set of all possible source sentences excluding all higher ranked Jk's, 1 k i 1.</Paragraph> <Paragraph position="8"> The conversion from J to E in Fig. 1 is the machine translation process. According to the statistical machine translation formalism (Brown et al., 1993), the translation process is to search for the best sentence bE such that</Paragraph> <Paragraph position="10"> where P(JjE) is a translation model characterizing the correspondence between E and J; P(E), the English language model probability.</Paragraph> <Paragraph position="11"> In the IBM model 4, the translation model P(JjE) is further decomposed into four submodels: null Lexicon Model { t(jje): probability of a word j in the Japanese language being translated into a word e in the English language. null 1Hereafter, J1 is called the single-best hypothesis of speech recognition; JN1 , the N-best hypotheses. Fertility model { n( je): probability of a English language word e generating words.</Paragraph> <Paragraph position="12"> Distortion model { d: probability of distortion, which is decomposed into the distortion probabilities of head words and non-head words.</Paragraph> <Paragraph position="13"> NULL translation model { p1: a xed probability of inserting a NULL word after determining each English word.</Paragraph> <Paragraph position="14"> In the above we listed seven features: two from ASR (Pam(XjJ), Plm(J)) and ve from SMT (P(E), t(jje), n( je), d, p1).</Paragraph> <Paragraph position="15"> The third module in Fig. 1 is to rescore translation hypotheses from SMT by using a feature-based log-linear model. 
<Paragraph position="15"> The third module in Fig. 1 rescores the translation hypotheses from SMT by using a feature-based log-linear model. All translation candidates output by the speech recognition and translation modules are re-evaluated using all relevant features, and the translation candidate with the highest score is selected. The log-linear model used in our speech translation process, P(E|X), is</Paragraph> <Paragraph position="16"> $$P(E \mid X) = \frac{\exp\left(\sum_{i=1}^{M} \lambda_i f_i(X, E)\right)}{\sum_{E'} \exp\left(\sum_{i=1}^{M} \lambda_i f_i(X, E')\right)} \qquad (1)$$ </Paragraph> <Paragraph position="17"> In Eq. 1, f_i(X, E) is the logarithm of the i-th feature value and λ_i is the weight of the i-th feature. Integrating different features into the equation results in different models. In the experiments reported in Section 4, four different models are trained by successively increasing the number of features, in order to investigate the effect of the different features on improving speech translation. In addition to the above seven features, the following features are also incorporated.</Paragraph> <Paragraph position="18"> Part-of-speech language models: English part-of-speech language models were used. The POS dependence of a translated English sentence is an effective constraint for pruning English sentence candidates. In our experiments 81 part-of-speech tags and a 5-gram POS language model were used.</Paragraph> <Paragraph position="19"> Length model P(l|E, J): l is the length (number of words) of a translated English sentence.</Paragraph> <Paragraph position="20"> Jump weight: the jump width for adjacent cepts in Model 4 (Marcu and Wong, 2002).</Paragraph> <Paragraph position="21"> Example matching score: the translated English sentence is matched against phrase translation examples, and a score is derived based on the count of matches (Watanabe and Sumita, 2003).</Paragraph> <Paragraph position="22"> Dynamic example matching score: similar to the example matching score, but the phrases are extracted dynamically from sentence examples (Watanabe and Sumita, 2003).</Paragraph> <Paragraph position="23"> Altogether, we used M (=12) different features. In Section 3, we review Powell's algorithm (Press et al., 2000) as our tool for optimizing the model parameters λ_1^M based on different objective translation metrics.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Parameter Optimization Based on Translation Metrics </SectionTitle> <Paragraph position="0"> The denominator in Eq. 1 can be ignored, since the normalization applies equally to every hypothesis. Hence, the choice of the best translation Ê out of all possible translations E is independent of the denominator,</Paragraph> <Paragraph position="1"> $$\hat{E} = \mathop{\arg\max}_{E} \sum_{i=1}^{M} \lambda_i \log P_i(X, E) \qquad (2)$$ </Paragraph> <Paragraph position="2"> where we write the features f_i(X, E) explicitly as logarithms, log P_i(X, E). The effectiveness of the model in Eq. 2 depends upon the optimization of the parameter set λ_1^M with respect to metrics that are objectively measurable but subjectively relevant.</Paragraph> <Paragraph position="3"> Suppose we have L speech utterances, and for each utterance we generate the N best speech recognition hypotheses. For each recognition hypothesis, K English translation hypotheses are generated. For the l-th input speech utterance, there are then C_l = {E_l1, ..., E_l,N×K} translations. All L speech utterances generate L × N × K translations in total.</Paragraph> <Paragraph position="4"> Our goal is to minimize the translation "distortion" between the reference translations, R, and the translated sentences, Ê:</Paragraph> <Paragraph position="5"> $$\hat{\lambda}_1^M = \mathop{\arg\min}_{\lambda_1^M} D(\hat{E}, R) \qquad (3)$$ where Ê = {Ê_1, ..., Ê_L} is the set of translations of all utterances. The translation Ê_l of the l-th utterance is produced by Eq. 2, where E ∈ C_l.</Paragraph>
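As a minimal illustration of the selection rule in Eq. 2, the sketch below scores each candidate by the weighted sum of its log-domain features and keeps the argmax; the feature values and weights are invented (M=3), and the softmax denominator of Eq. 1 never needs to be computed.

```python
def select_best_translation(candidates, weights):
    """Eq. 2: return the candidate with the highest weighted sum of
    log-domain feature values. The denominator of Eq. 1 is identical
    for every candidate, so it can be dropped."""
    return max(candidates,
               key=lambda cand: sum(w * f for w, f in zip(weights, cand[1])))

# Hypothetical example: two candidates, M=3 features each.
candidates = [
    ("where is the station", [-1510.0, -40.2, -35.7]),
    ("where is a station",   [-1512.5, -39.8, -34.9]),
]
weights = [0.8, 1.0, 1.2]
print(select_best_translation(candidates, weights)[0])  # "where is the station"
```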
<Paragraph position="6"> Let R = {R_1, ..., R_L} be the set of translation references for all utterances. Human translators paraphrased 16 reference sentences for each utterance, i.e., R_l contains 16 reference candidates for the l-th utterance.</Paragraph> <Paragraph position="7"> D(Ê, R) is a translation "distortion", i.e., an objective translation assessment. The following four metrics were used in this study: BLEU (Papineni et al., 2002): a weighted geometric mean of the n-gram matches between test and reference sentences, multiplied by a brevity penalty that penalizes short translation sentences.</Paragraph> <Paragraph position="8"> NIST: an arithmetic mean of the n-gram matches between test and reference sentences, multiplied by a length factor, which again penalizes short translation sentences.</Paragraph> <Paragraph position="9"> mWER (Niessen et al., 2000): multiple-reference word error rate, which computes the edit distance (minimum number of insertions, deletions, and substitutions) between the test sentence and the reference sentences.</Paragraph> <Paragraph position="10"> mPER: multiple-reference position-independent word error rate, which computes the edit distance without considering the word order.</Paragraph> <Paragraph position="11"> The BLEU and NIST scores were calculated with the publicly downloadable evaluation tool.</Paragraph> <Paragraph position="12"> Because the objective function in Eq. 3 is not a smooth function, we used Powell's search method to find a solution. The Powell's algorithm used in this work is similar to the one in (Press et al., 2000), but we modified the line optimization code, a subroutine of Powell's algorithm, with reference to (Och, 2003).</Paragraph> <Paragraph position="13"> Finding a global optimum is usually difficult in a high-dimensional vector space. To make sure that we had found a good local optimum, we restarted the algorithm from various initializations and took the best local optimum as the final solution.</Paragraph> </Section>
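To make the two error-rate metrics above concrete, here is a small self-contained per-sentence sketch of mWER and mPER. The mPER computation uses one common bag-of-words formulation, the example sentences are invented, and corpus-level scores would aggregate edit distances before normalizing.

```python
from collections import Counter

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance: minimum number of insertions,
    deletions, and substitutions turning hyp into ref."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]

def mwer(hyp, refs):
    """Multiple-reference WER: lowest normalized edit distance
    against any of the references."""
    h = hyp.split()
    return min(edit_distance(h, r.split()) / len(r.split()) for r in refs)

def mper(hyp, refs):
    """Position-independent variant: words compared as bags, ignoring order
    (one common formulation)."""
    h = Counter(hyp.split())
    best = float("inf")
    for r in refs:
        rc = Counter(r.split())
        matches = sum((h & rc).values())
        errors = max(sum(h.values()), sum(rc.values())) - matches
        best = min(best, errors / sum(rc.values()))
    return best

refs = ["where is a station", "where's the station"]
print(mwer("where is the station", refs))  # 0.25: one substitution vs. ref 1
print(mper("where is the station", refs))  # 0.25: bags differ by one word
```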
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Corpus & System </SectionTitle> <Paragraph position="0"> The data used in this study was the Basic Travel Expression Corpus (BTEC) (Kikui et al., 2003), consisting of commonly used sentences listed in travel guidebooks and tour conversations. The corpus was designed for developing multilingual speech-to-speech translation systems. It contains four different languages: Chinese, Japanese, Korean, and English; only the Japanese-English parallel data was used in this study. The speech data was recorded by multiple speakers and was used to train the acoustic models, while the text database was used for training the language and translation models.</Paragraph> <Paragraph position="1"> The standard BTEC training corpus and the first and second files from the BTEC standard test corpus #01 were used for training, development, and testing, respectively. The statistics of the corpus are shown in Table 1.</Paragraph> <Paragraph position="2"> The speech recognition engine used in the experiments was an HMM-based, large-vocabulary continuous speech recognizer. The acoustic HMMs were triphone models with 2,100 states in total, using 25-dimensional, short-time spectrum features. In the first and second decoding passes, a multiclass word bigram over a lexicon of 37,000 words plus 10,000 compound words was used. A word trigram was used in rescoring the results.</Paragraph> <Paragraph position="3"> The machine translation system is a graph-based decoder (Ueffing et al., 2002). The first pass of the decoder generates a word graph, a compact representation of alternative translation candidates, using a beam search based on the scores of the lexicon and language models. In the second pass, an A* search traverses the graph. The edges of the word graph, i.e., the phrase translation candidates, are generated from the list of word translations obtained from the inverted lexicon model. The phrase translations extracted from the Viterbi alignments of the training corpus also constitute edges, as do phrase translations extracted dynamically from the bilingual sentences (Watanabe and Sumita, 2003). The decoder used IBM Model 4 with a trigram language model and a 5-gram part-of-speech language model. The training of IBM Model 4 was carried out with the GIZA++ package (Och and Ney, 2003).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Model Training </SectionTitle> <Paragraph position="0"> In order to quantify the translation improvements contributed by features from speech recognition and machine translation respectively, we built four log-linear models by adding features successively. The four models are: Standard translation model (stm): only the features from IBM Model 4 (M=5) described in Section 2 are used in the log-linear model. We did not perform parameter optimization on this model; it is equivalent to setting all the λ_1^M to 1. This is the standard model used in most statistical machine translation systems, and it is referred to as the baseline model.</Paragraph> <Paragraph position="1"> Optimized standard translation model (ostm): this model consists of the same features as "stm", but the parameters were optimized by Powell's algorithm. It is intended to exhibit the effect of parameter optimization through comparison with the baseline "stm". Optimized enhanced translation model (oetm): the additional translation features described in Section 2 are incorporated to enrich the model "ostm". In this model the total number of features, M, is 10.</Paragraph> <Paragraph position="2"> The model parameters were optimized. This model is intended to show how much the enhanced features can improve translation quality.</Paragraph> <Paragraph position="3"> Optimized enhanced speech translation model (oestm): the features from speech recognition, i.e., the likelihood scores of the acoustic and language models, are additionally incorporated into the model "oetm". All 12 features described in Section 2 are used, and the model parameters were optimized.</Paragraph> <Paragraph position="4"> To optimize the parameters of the log-linear models, we used the development data of 510 speech utterances. We adopted an N-best hypothesis approach (Och, 2003) to train λ.</Paragraph> <Paragraph position="5"> For each input speech utterance, N × K candidate translations were generated, where N is the number of generated recognition hypotheses and K is the number of translation hypotheses per recognition hypothesis. A vector of dimension M, corresponding to the multiple features used in the translation model, was generated for each translation candidate.</Paragraph> <Paragraph position="6"> Powell's algorithm was used to optimize these parameters.</Paragraph>
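A minimal sketch of this training step follows, under stated assumptions: scipy's built-in Powell method stands in for the modified Press et al. implementation described in Section 3, `utterances` holds, per development utterance, an (N·K × M) matrix of log-feature vectors together with the candidate strings and references, and `distortion` is a sentence-level metric such as the mWER sketch above. All names are illustrative, not the paper's code.

```python
import numpy as np
from scipy.optimize import minimize

def corpus_distortion(lam, utterances, distortion):
    """Eq. 3 objective: total distortion of the translations picked by
    Eq. 2 under the weight vector lam."""
    total = 0.0
    for feats, texts, refs in utterances:   # feats: (N*K, M) numpy array
        best = int(np.argmax(feats @ lam))  # Eq. 2 candidate selection
        total += distortion(texts[best], refs)
    return total

def train_weights(utterances, distortion, M, restarts=10, seed=0):
    """Powell search restarted from several random initializations; the
    best local optimum found is kept as the final solution (cf. Section 3)."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(restarts):
        result = minimize(corpus_distortion, rng.uniform(-1.0, 1.0, M),
                          args=(utterances, distortion), method="Powell")
        if best is None or result.fun < best.fun:
            best = result
    return best.x
```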
<Paragraph position="7"> We used a large K to ensure that promising translation candidates were not pruned out. In the training, we set N=100 and K=1,000.</Paragraph> <Paragraph position="8"> Table 2: Speech recognition performance in terms of word accuracy, sentence accuracy, and insertion, deletion, and substitution error rates (%).

              word   sent   ins   del   sub
single-best   93.5   78.7   2.0   0.8   3.6
N-best        96.1   87.0   1.2   0.3   2.2
</Paragraph> <Paragraph position="9"> By using the different objective translation evaluation metrics described in Section 3, for each model we obtained four sets of optimized parameters, with respect to the BLEU, NIST, mWER, and mPER metrics, respectively.</Paragraph> </Section>
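Summarizing Section 4.2, the four models differ only in which features enter Eq. 2 and in whether the weights are optimized. A compact sketch of the configurations (the feature labels reuse those of the earlier sketches and are illustrative, not identifiers from the paper):

```python
# Feature subsets for the four rescoring configurations.
IBM4 = ["log_p_e", "log_t", "log_n", "log_d", "log_p1"]          # M = 5
ENHANCED = IBM4 + ["log_pos_lm", "log_length", "log_jump",
                   "log_example", "log_dyn_example"]             # M = 10
ASR = ["log_p_am", "log_p_lm"]

MODELS = {
    "stm":   {"features": IBM4,           "optimized": False},  # baseline, all weights 1
    "ostm":  {"features": IBM4,           "optimized": True},
    "oetm":  {"features": ENHANCED,       "optimized": True},
    "oestm": {"features": ENHANCED + ASR, "optimized": True},   # M = 12
}
```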
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Translation Improvement by Additional Features </SectionTitle> <Paragraph position="0"> All 508 utterances in the test data were used to evaluate the models. As with the development data, the speech recognizer generated the N-best (N=100) recognition hypotheses for each test speech utterance. Table 2 shows the speech recognition results on the test data for the single-best and N-best hypotheses. We observed that an over 8% sentence accuracy improvement was obtained from the single-best to the N-best recognition hypotheses. The recognized sentences were then translated into corresponding English sentences; 1,000 translation candidates were produced for each recognition hypothesis. These candidates were then rescored by each of the four models, using the four sets of optimized parameters obtained in training, and the candidate with the best score was chosen.</Paragraph> <Paragraph position="1"> The best translations generated by a model were evaluated with the translation assessment metric that was used to optimize the model parameters in development. The experimental results are shown in Table 3.</Paragraph> <Paragraph position="2"> In the experiments we varied the number of speech recognition hypotheses, N, to see how the translation performance changes with N. We found that the best translation was achieved when a relatively small set of hypotheses, N=5, was used. Hence, the values in Table 3 were obtained with N set to 5.</Paragraph> <Paragraph position="3"> We tested each model on both the single-best recognition hypothesis translations and the N-best recognition hypothesis translations. The single-best translation came from the translation of the single best hypothesis of the speech recognition, and the N-best hypothesis translation came from the translations of all the hypotheses produced by the speech recognition.</Paragraph> <Paragraph position="4"> [Table 3: Translation results from the baseline model (stm) to the optimized enhanced speech translation model (oestm). Models are optimized using the same metric as shown in the columns. Numbers are in percentages except for NIST.]</Paragraph> <Paragraph position="5"> In Table 3, we observe that a large improvement is achieved from the baseline model "stm" to the final model "oestm": the BLEU, NIST, mWER, and mPER scores are improved by 7.9%, 2.7, 6.1%, and 5.4%, respectively. Note that higher BLEU and NIST scores mean better translations, whereas higher mWER and mPER values mean worse translations. Consistent performance improvement was achieved in both the single-best and N-best recognition hypothesis translations. We observed that the improvements were due to the following reasons: Optimization. Models with optimized parameters yielded better translations than the models with unoptimized parameters. This can be seen by comparing the model "stm" with the model "ostm" for both the single-best and the N-best results.</Paragraph> <Paragraph position="6"> N-best recognition hypotheses. In the majority of the cells in Table 3, the translation performance of the N-best recognition is better than that of the corresponding single-best recognition; the N-best BLEU score of "ostm" improved over the single-best score of "ostm" by 2.1%. However, the NIST score was indifferent to the change; it appears that the NIST score is too insensitive to detect slight translation changes.</Paragraph> <Paragraph position="7"> Enhanced features. Translation performance improved steadily as more features were incorporated into the log-linear models. The translation performance of the model "oetm" is better than that of the model "ostm" because more effective translation features are used, and the model "oestm" is better than the model "oetm" owing to its additional speech recognition features. This confirms that our approach of integrating features from speech recognition and machine translation works very well.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Recognition Improvement of Incorrectly Recognized Sentences </SectionTitle> <Paragraph position="0"> In the previous experiments we demonstrated that speech translation performance was improved by the proposed enhanced speech translation model "oestm". In this section we show that this improvement comes from the significant improvement on incorrectly recognized sentences when the N-best recognition hypotheses are used.</Paragraph> <Paragraph position="1"> We carried out the following experiments.</Paragraph> <Paragraph position="2"> Only the incorrectly recognized sentences were extracted for translation and rescored, by the model "oetm" for the single-best case and by the model "oestm" for the N-best case. The translation results are shown in Table 4: the translations of incorrectly recognized sentences are improved significantly.</Paragraph> <Paragraph position="3"> Because we used the N-best recognition hypotheses, the log-linear model chose, from among the N hypotheses, the recognition hypothesis that yielded the best translation. As a result, speech recognition itself could be improved when more accurate recognition hypotheses were chosen for translation. This effect can be observed clearly when we extract the chosen recognition hypotheses of the incorrectly recognized sentences. Table 5 shows the word accuracy and sentence accuracy of the recognition hypotheses selected by the translation module; the sentence accuracy of the incorrectly recognized sentences was improved.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Discussions </SectionTitle> <Paragraph position="0"> As regards integrating speech recognition with translation, a coupling structure (Ney, 1999) was proposed as a speech translation infrastructure that multiplies acoustic probabilities with translation probabilities in a one-step decoding procedure.
However, no experimental results have been given on whether and how this coupling structure improves speech translation.</Paragraph> <Paragraph position="1"> (Casacuberta et al., 2002) used a finite-state transducer in which scores from the acoustic information sources and the lexicon translation models were integrated, with word pairs of the source and target languages tied in the decoding graph. However, this method was only tested on a pair of similar languages, i.e., Spanish to English. For translation between languages of different families, whose syntactic structures can be quite different, such as Japanese and English, the effectiveness of such rigid word-pair tying remains to be shown.</Paragraph> <Paragraph position="2"> Our approach is rather general, easy to implement, and flexible to extend. In the experiments we incorporated features from the acoustic models and language models, but the framework is flexible enough to include more effective features. Indeed, the log-linear modeling paradigm adopted in the proposed speech translation approach has been shown to be effective in many applications (Beyerlein, 1998; Vergyri, 2000; Och, 2003).</Paragraph> <Paragraph position="3"> In order to use the speech recognition features, the N-best speech recognition hypotheses are needed, and using the N-best hypotheses can incur a computational burden. However, our experiments have shown that a small N seems adequate to achieve most of the translation improvement without a significant increase in computation.</Paragraph> </Section> </Paper>