<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2061">
  <Title>Integration of Speech to Computer-Assisted Translation Using Finite-State Automata</Title>
  <Section position="4" start_page="467" end_page="468" type="metho">
    <SectionTitle>
2 Speech-Enabled CAT Models
</SectionTitle>
    <Paragraph position="0"> In a speech-enabled computer-assisted translation system, we are given a source language sentence fJ1 = f1 ...fj ...fJ, which is to be translated into a target language sentence eI1 = e1 ...ei ...eI, and an acoustic signal xT1 = x1 ...xt ...xT , which is the spoken target language sentence.</Paragraph>
    <Paragraph position="1"> Among all possible target language sentences, we will choose the sentence with the highest probability: null</Paragraph>
    <Paragraph position="3"> Eq. 1 is decomposed into Eq. 2 by assuming conditional independency between xT1 and fJ1 .</Paragraph>
    <Paragraph position="4"> The decomposition into three knowledge sources allows for an independent modeling of the target language model Pr(eI1), the translation model Pr(fJ1 |eI1) and the acoustic model Pr(xT1|eI1).</Paragraph>
    <Paragraph position="5"> Another approach for modeling the posterior probability Pr(eI1|fJ1 ,xT1 ) is direct modeling using a log-linear model. The decision rule is given by:</Paragraph>
    <Paragraph position="7"> Each of the terms hm(eI1,fJ1 ,xT1 ) denotes one of the various models which are involved in the recognition procedure. Each individual model is weighted by its scaling factor lm. As there is no direct dependence between fJ1 and xT1 , the hm(eI1,fJ1 ,xT1 ) is in one of these two forms: hm(eI1,xT1 ) and hm(eI1,fJ1 ). Due to the argmax operator which denotes the search, no renormalization is considered in Eq. 3. This approach has been suggested by (Papineni et al., 1997; Papineni et al., 1998) for a natural language understanding task, by (Beyerlein, 1998) for an ASR task, and by (Och and Ney, 2002) for an MT task. This approach is a generalization of Eq. 2. The direct modeling has the advantage that additional models can be easily integrated into the overall system. The model scaling factors lM1 are trained on a development corpus according to the final recognition quality measured by the word error rate (WER)(Och, 2003).</Paragraph>
    <Paragraph position="8"> Search The search in the MT and the ASR systems is already very complex, therefore a fully integrated search to combine ASR and MT models will considerably increase the complexity. To reduce the complexity of the search, we perform two independent searches with the MT and the ASR systems, the search result of each system will be represented as a large word graph. We consider MT and ASR word graphs as FSA. Then, we are able to use FSA algorithms to integrate MT and ASR word graphs. The FSA implementation of the search allows us to use standard optimized algorithms, e.g. available from an open source toolkit (Kanthak and Ney, 2004).</Paragraph>
    <Paragraph position="9"> The recognition process is performed in two steps. First, the baseline ASR system generates a word graph in the FSA format for a given utterance xT1 . Second, the translation models rescore each word graph based on the corresponding source language sentence. For each utterance, the decision about the best sentence is made according to the recognition and the translation models.</Paragraph>
  </Section>
  <Section position="5" start_page="468" end_page="468" type="metho">
    <SectionTitle>
3 Baseline Components
</SectionTitle>
    <Paragraph position="0"> In this section, we briefly describe the basic system components, namely the MT and the ASR systems.</Paragraph>
    <Section position="1" start_page="468" end_page="468" type="sub_section">
      <SectionTitle>
3.1 Machine Translation System
</SectionTitle>
      <Paragraph position="0"> We make use of the RWTH phrase-based statistical machine translation system for the English to German automatic translation. The system includes the following models: an n-gram language model, a phrase translation model and a word-based lexicon model. The latter two models are used for both directions: German to English and English to German. Additionally, a word penalty and a phrase penalty are included. The reordering model of the baseline system is distance-based, i.e.</Paragraph>
      <Paragraph position="1"> it assigns costs based on the distance from the end position of a phrase to the start position of the next phrase. More details about the baseline system can be found in (Zens and Ney, 2004; Zens et al., 2005).</Paragraph>
    </Section>
    <Section position="2" start_page="468" end_page="468" type="sub_section">
      <SectionTitle>
3.2 Automatic Speech Recognition System
</SectionTitle>
      <Paragraph position="0"> The acoustic model of the ASR system is trained on the VerbMobil II corpus (Sixtus et al., 2000).</Paragraph>
      <Paragraph position="1"> The corpus consists of German large-vocabulary conversational speech: 36k training sentences (61.5h) from 857 speakers. The test corpus is created from the German part of the bilingual English-German XEROX corpus (Khadivi et al., 2005): 1562 sentences including 18k running words (2.6h) from 10 speakers. The test corpus contains 114 out-of-vocabulary (OOV) words. The remaining part of the XEROX corpus is used to train a back off trigram language model using the SRI language modeling toolkit (Stolcke, 2002). The LM perplexity of the speech recognition test corpus is about 83. The acoustic model of the ASR system can be characterized as follows:  The test corpus recognition word error rate is 20.4%. Compared to the previous system (Khadivi et al., 2005), which has a WER of 21.2%, we obtain a 3.8% relative improvement in WER. This improvement is due to a better and complete optimization of the overall ASR system.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="468" end_page="472" type="metho">
    <SectionTitle>
4 Integration Approaches
</SectionTitle>
    <Paragraph position="0"> In this section, we will introduce several approaches to integrate the MT models with the ASR models. To present the content of this section in a more reader-friendly way, we will first explain the task and corpus statistics, then we will present the results of N-best rescoring. Afterwards, we will describe the new methods for integrating the MT models with the ASR models. In each sub-section, we will also present the recognition results.</Paragraph>
    <Section position="1" start_page="468" end_page="468" type="sub_section">
      <SectionTitle>
4.1 Task
</SectionTitle>
      <Paragraph position="0"> The translation models are trained on the part of the English-German XEROX corpus which was not used in the speech recognition test corpus. We divide the speech recognition test corpus into two parts, the first 700 utterances as the development corpus and the rest as the evaluation corpus. The development corpus is used to optimize the scaling factors of different models (explained in Sec-</Paragraph>
    </Section>
    <Section position="2" start_page="468" end_page="469" type="sub_section">
      <SectionTitle>
4.2 N-best Rescoring
</SectionTitle>
      <Paragraph position="0"> To rescore the N-best lists, we use the method of (Khadivi et al., 2005). But the results shown here are different from that work due to a better optimization of the overall ASR system, using a  better MT system, and generating a larger N-best list from the ASR word graphs. We rescore the ASR N-best lists with the standard HMM (Vogel  et al., 1996) and IBM (Brown et al., 1993) MT models. The development and evaluation sets N-best lists sizes are sufficiently large to achieve almost the best possible results, on average 1738 hypotheses per each source sentence are extracted from the ASR word graphs.</Paragraph>
      <Paragraph position="1"> The recognition results are summarized in Table 2. In this table, the translation results of the MT system are shown first, which are obtained using the phrase-based approach. Then the recognition results of the ASR system are shown. Afterwards, the results of combined speech recognition and translation models are presented.</Paragraph>
      <Paragraph position="2"> For each translation model, the N-best lists are rescored based on the translation probability p(eI1|fJ1 ) of that model and the probabilities of speech recognition and language models. In the last row of Table 2, the N-best lists are rescored based on the full machine translation system explained in Section 3.1.</Paragraph>
      <Paragraph position="3"> The best possible hypothesis achievable from the N-best list has the WER (oracle WER) of 11.2% and 12.4% for development and test sets, respectively.</Paragraph>
    </Section>
    <Section position="3" start_page="469" end_page="469" type="sub_section">
      <SectionTitle>
4.3 Direct Integration
</SectionTitle>
      <Paragraph position="0"> At the first glance, an obvious method to combine the ASR and MT systems is the integration at the level of word graphs. This means the ASR system generates a large word graph for the input target language speech, and the MT system also generates a large word graph for the source language text. Both MT and ASR word graphs are in the target language. These two word graphs can be considered as two FSA, then using FSA theory, we can integrate two word graphs by applying the composition algorithm.</Paragraph>
      <Paragraph position="1"> We conducted a set of experiments to integrate the ASR and MT systems using this method. We obtain a WER of 19.0% and 20.9% for development and evaluation sets, respectively. The results are comparable to N-best rescoring results for the phrase-based model which is presented in  ASR baseline are statistically significant at the 99% level (Bisani and Ney, 2004). However, the results are not promising compared to the results of the rescoring method presented in Table 2 for HMM and IBM translation models. A detailed analysis revealed that only 31.8% and 26.7% of sentences in the development and evaluation sets have identical paths in both FSA, respectively. In other words, the search algorithm was not able to find any identical paths in two given FSA for the remaining sentences. Thus, the two FSA are very different from each other. One explanation for the failure of this method is the large difference between the WERs of two systems, as shown in Table 2 the WER for the MT system is more than twice as high as for the ASR system.</Paragraph>
    </Section>
    <Section position="4" start_page="469" end_page="469" type="sub_section">
      <SectionTitle>
4.4 Integrated Search
</SectionTitle>
      <Paragraph position="0"> In Section 4.3, two separate word graphs are generated using the MT and the ASR systems.</Paragraph>
      <Paragraph position="1"> Another explanation for the failure of the direct integration method is the independent search to generate the word graphs. The search in the MT and the ASR systems is already very complex, therefore a full integrated search to combine ASR and MT models will considerably increase the complexity.</Paragraph>
      <Paragraph position="2"> However, it is possible to reduce this problem by integrating the ASR word graphs into the generation process of the MT word graphs. This means, the ASR word graph is used in addition to the usual language model. This kind of integration forces the MT system to generate identical paths to those in the ASR word graph. Using this approach, the number of identical paths in MT and ASR word graphs are increased to 39.7% and 34.4% of the sentences in development and evaluation sets, respectively. The WER of the integrated system are 19.0% and 20.7% for development and evaluation sets.</Paragraph>
    </Section>
    <Section position="5" start_page="469" end_page="470" type="sub_section">
      <SectionTitle>
4.5 Lexicon-Based Transducer
</SectionTitle>
      <Paragraph position="0"> The idea of a dynamic vocabulary, restricting and weighting the word lexicon of the ASR was first  introduced in (Brousseau et al., 1995). The idea was also seen later in (Paulik et al., 2005b), they extract the words of the MT N-best list to restrict the vocabulary of the ASR system. But they both reported a negative effect from this method on the recognition accuracy. Here, we extend the dynamic vocabulary idea by weighting the ASR vocabulary based on the source language text and the translation models. We use the lexicon model of the HMM and the IBM MT models. Based on these lexicon models, we assign to each possible target word e the probability Pr(e|fJ1 ). One way to compute this probability is inspired by IBM</Paragraph>
      <Paragraph position="2"> We can design a simple transducer (or more precisely an acceptor) using probability in Eq. 4 to efficiently rescore all paths (hypotheses) in the word graph with IBM Model 1:</Paragraph>
      <Paragraph position="4"> The transducer is formed by one node and a number of self loops for each target language word. In each arc of this transducer, the input label is target word e and the weight is [?]log 1J+1 *p(e|fJ1 ).</Paragraph>
      <Paragraph position="5"> We conducted experiments using the proposed transducer. We built different transducers with the lexicons of HMM and IBM translation models. In Table 3, the recognition results of the rescored word graphs are shown. The results are very promising compared to the N-best list rescoring, especially as the designed transducer is very simple. Similar to the results for the N-best rescoring approach, these experiments also show the benefit of using HMM and IBM Models to rescore the ASR word graphs.</Paragraph>
      <Paragraph position="6"> Due to its simplicity, this model can be easily integrated into the ASR search. It is a sentence specific unigram LM.</Paragraph>
    </Section>
    <Section position="6" start_page="470" end_page="471" type="sub_section">
      <SectionTitle>
4.6 Phrase-Based Transducer
</SectionTitle>
      <Paragraph position="0"> The phrase-based translation model is the main component of our translation system. The pairs of source and corresponding target phrases are extracted from the word-aligned bilingual training  design a transducer to rescore the ASR word graph using the phrase-based model of the MT system. For each source language sentence, we extract all possible phrases from the word-aligned training corpus. Using the target part of these phrases we build a transducer similar to the lexicon-based transducer. But instead of a target word on each arc, we have the target part of a phrase. The weight of each arc is the negative logarithm of the phrase translation probability.</Paragraph>
      <Paragraph position="1"> This transducer is a good approximation of non-monotone phrase-based-lexicon score. Using the designed transducer it is possible that some parts of the source texts are not covered or covered more than once. Then, this model can be compared to the IBM-3 and IBM-4 models, as they also have the same characteristic in covering the source words. The above assumption is not critical for rescoring the ASR word graphs, as we are confident that the word order is correct in the ASR output. In addition, we assume low probability for the existence of phrase pairs that have the same target phrase but different source phrases within a particular source language sentence.</Paragraph>
      <Paragraph position="2"> Using the phrase-based transducer to rescore the ASR word graph results in WER of 18.8% and 20.2% for development and evaluation sets, respectively. The improvements are statistically significant at the 99% level compared to the ASR system. The results are very similar to the results obtained using N-best rescoring method. But the transducer implementation is much simpler because it does not consider the word-based lexicon, the word penalty, the phrase penalty, and the reordering models, it just makes use of phrase translation model. The designed transducer is much faster in rescoring the word graph than the MT system in rescoring the N-best list. The average speed to rescore the ASR word graphs with this transducer is 49.4 words/sec (source language  text words), while the average speed to translate the source language text using the MT system is 8.3 words/sec. The average speed for rescoring the N-best list is even slower and it depends on the size of N-best list.</Paragraph>
      <Paragraph position="3"> A surprising result of the experiments as has also been observed in (Khadivi et al., 2005), is that the phrase-based model, which performs the best in MT, has the least contribution in improving the recognition results. The phrase-based model uses more context in the source language to generate better translations by means of better word selection and better word order. In a CAT system, the ASR system has much better recognition quality than MT system, and the word order of the ASR output is correct. On the other hand, the ASR recognition errors are usually single word errors and they are independent from the context. Therefore, the task of the MT models in a CAT system is to enhance the confidence of the recognized words based on the source language text, and it seems that the single word based MT models are more suitable than phrase-based model in this task.</Paragraph>
    </Section>
    <Section position="7" start_page="471" end_page="472" type="sub_section">
      <SectionTitle>
4.7 Fertility-Based Transducer
</SectionTitle>
      <Paragraph position="0"> In (Brown et al., 1993), three alignment models are described that include fertility models, these are IBM Models 3, 4, and 5. The fertility-based alignment models have a more complicated structure than the simple IBM Model 1. The fertility model estimates the probability distribution for aligning multiple source words to a single target word. The fertility model provides the probabilities p(ph|e) for aligning a target word e to ph source words. In this section, we propose a method for rescoring ASR word graphs based on the lexicon and fertility models.</Paragraph>
      <Paragraph position="1"> In (Knight and Al-Onaizan, 1998), some transducers are described to build a finite-state based translation system. We use the same transducers for rescoring ASR word graphs. Here, we have three transducers: lexicon, null-emitter, and fertility. The lexicon transducer is formed by one node and a number of self loops for each target language word, similar to IBM Model 1 transducer in Section 4.5. On each arc of the lexicon transducer, there is a lexicon entry: the input label is a target word e, the output label is a source word f, and the weight is [?]logp(f|e).</Paragraph>
      <Paragraph position="2"> The null-emitter transducer, as its name states, emits the null word with a pre-defined probability after each input word. The fertility transducer is also a simple transducer to map zero or several instances of a source word to one instance of the source word.</Paragraph>
      <Paragraph position="3"> The ASR word graphs are composed successively with the lexicon, null-emitter, fertility transducers and finally with the source language sentence. In the resulting transducer, the input labels of the best path represent the best hypothesis.</Paragraph>
      <Paragraph position="4"> The mathematical description of the proposed method is as follows. We can decompose Eq. 1 using Bayes' decision rule:</Paragraph>
      <Paragraph position="6"> In Eq. 5, the term Pr(xT1|eI1) is the acoustic model and can be represented with the ASR word graph1, the term Pr(eI1|fJ1 ) is the translation model of the target language text to the source language text. The translation model can be represented by lexicon, fertility, and null-emitter transducers.</Paragraph>
      <Paragraph position="7"> Finally, the term Pr(fJ1 ) is a very simple language model, it is the source language sentence.</Paragraph>
      <Paragraph position="8"> The source language model in Eq. 5 can be formed into the acceptor form in two different ways: 1. a linear acceptor, i.e. a sequence of nodes with one incoming arc and one outgoing arc, the words of source language text are placed consecutively in the arcs of the acceptor, 2. an acceptor containing possible permutations. To limit the permutations, we used an approach as in (Kanthak et al., 2005).</Paragraph>
      <Paragraph position="9"> Each of these two acceptors results in different constraints for the generation of the hypotheses. The first acceptor restricts the system to generate exactly the same source language sentence, while the second acceptor forces the system to generate the hypotheses that are a reordered variant of the source language sentence. The experiments conducted do not show any significant difference in the recognition results among the two source language acceptors, except that the second acceptor is much slower than the first acceptor. Therefore, we use the first model in our experiments. Table 4 shows the results of rescoring the ASR word graphs using the fertility-based transducers. 1Actually, the ASR word graph is obtained by using Pr(xT1 |eI1) and Pr(eI1) models. However, It does not cause any problem in the modeling, especially when we make use of the direct modeling, Eq. 3  As Table 4 shows, we get almost the same or slightly better results when compared to the lexicon-based transducers.</Paragraph>
      <Paragraph position="10"> Another interesting point about Eq. 5 is its similarity to speech translation (translation from target spoken language to source language text). Then, we can describe a speech-enabled CAT system as similar to a speech translation system, except that we aim to get the best ASR output (the best path in the ASR word graph) rather than the best translation. This is because the best translation, which is the source language sentence, is already given.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>