<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0718">
  <Title>The VI framework program in Europe: some thoughts about Speech to Speech Translation research.</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In the last ten years, many projects have addressed the speech-to-speech translation (S2ST) problem, e.g.</Paragraph>
    <Paragraph position="1"> VERBMOBIL [1], C-STAR [2], NESPOLE! [3], EU-TRANS [4], BABYLON [5], etc. Many results and advances have been achieved in methodology, approaches and even performance. These projects have produced prototypes and demonstrations in different communicative situations: speech-to-speech translation over the telephone, and machine-mediated translation in face-to-face communication (either physically face to face or through videoconferencing). Several basic approaches have been explored: direct or data-driven translation (both example-based and statistical), indirect translation through an interlingua/interchange format (IF), and mixed, i.e. multi-engine, approaches. In terms of performance, significant results have been obtained in the VERBMOBIL project using a statistical approach.</Paragraph>
    <Paragraph position="2"> Real applications using ASR technology are found in many areas of everyday life [6]: dictation machines in limited domains, simple automatic services over the telephone, command and control in cars, and spoken document retrieval from broadcast news. Despite the new-economy bubble and some dramatic events, like the L&amp;H case, speech companies are still on the market. However, in terms of the technology employed, we are far from providing the free communication functionality that is necessary when more complex automatic services are needed, even in communicative situations where a small number of concepts is involved (very limited domains).</Paragraph>
    <Paragraph position="3"> Automatic timetable inquiry systems work in a strictly menu-driven fashion. Automatic directory assistance services can also be classified in this class of applications. Here a further complexity is given by the high perplexity of the directory names, but in the end it is still a complex communicative situation. [Proceedings of the Workshop on Speech-to-Speech Translation: Algorithms and Systems, Philadelphia, July 2002, pp. 129-135. Association for Computational Linguistics.]</Paragraph>
    <Paragraph position="4"> In fact, consider the difficulty of modelling the large number of sentences that can be used when trying to obtain the telephone number of an item in the Yellow Pages.</Paragraph>
    <Paragraph position="5"> The microelectronics and telecommunications market offers new opportunities for communication via cell phones, PDAs and laptops in wired or wireless environments. The communication process in this case is helped, or &amp;quot;complicated&amp;quot;, by multimodal interfaces and multimedia information. A new framework could be offered by the Web, which potentially &amp;quot;integrates&amp;quot; multimedia data with multimodal communication. In this case the paradigm shifts towards multimedia, multimodal person-to-person communication, in which meanings are conveyed by language and enhanced with multimedia content and non-verbal cues. The answer to a given question in a multilingual conversation could be more effective if given in textual and/or visual form. In this case the problem to be solved becomes a combination of language understanding, information extraction and multimedia generation in the target language. Document retrieval, summarization and translation could also be involved in this communication process. All these technologies should be thought of as pieces of a whole: a new model for person-to-person, information-mediated communication that brings together all of the resources available: verbal and non-verbal communication, multimedia, face-to-face interaction. Approaching multilingual communication as a whole means implementing each new technology as a brick within an entire edifice.</Paragraph>
    <Paragraph position="6"> Starting from the state of the art in speech-to-speech translation research, considering the experience gained in deploying real ASR applications, and bearing in mind the opportunities offered by new devices in wired and wireless environments, a question arises for the development of real multilingual communication in the next decade: what next? What are the main breakthroughs needed? Many issues need to be addressed. First of all, how can we reach the necessary performance required of the three basic technologies involved, i.e. speech recognition, synthesis and machine translation?</Paragraph>
    <Paragraph position="7"> Do we need a shift in the research paradigm? Is it mainly a matter of the amount and quality of the data needed? How important are issues such as devices, the multimedia information involved in a human-to-human dialogue, and the environmental-contextual information provided by intelligent networks? How can all this contextual information be integrated in a consistent way? Many steps and advances are needed in order to answer these questions. These are some of the questions addressed in a project whose acronym is TC-STAR_P, Technology and Corpora for Speech to Speech Translation, recently funded by the European Union in the last call of the V framework. In what follows, first the state of the art of the basic technologies involved in S2ST systems is summarized, then the most important challenges are listed, and finally the TC-STAR_P project is presented.</Paragraph>
    <Paragraph position="8"> 2 State of the art</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Speech recognition
</SectionTitle>
      <Paragraph position="0"> In the last 15 years a number of speech recognition tasks have been studied and evaluated. Each task has presented different challenges. The features characterizing these tasks are: type of speech (well formed vs spontaneous), target of communication (computer, audience, person), and bandwidth (FBW, full bandwidth; TBW, telephone bandwidth; FF, far field). Some of these tasks are dictation (WSJ), broadcast news, switchboard, voicemail and meetings. In what follows, they are ordered in terms of word error rate (WER):
Dictation: 7%, well formed, computer, FBW
Broadcast news: 12%, various, audience, FBW
Switchboard: 20-30%, spontaneous, person, TBW
Voicemail: 30%, spontaneous, person, TBW
Meetings: 50-60%, spontaneous, person, FF
At present spontaneous speech is the feature with the largest effect on word error rate, followed by environment effects and domain dependence.</Paragraph>
      <Paragraph position="1"> The main challenge for the next years will be to develop speech recognition systems that mimic human performance. This means, in general, being independent of environment and domain, and working as well for spontaneous as for read speech. The focus areas will mainly concentrate, first of all, on improving spontaneous speech models (i.e. prosodic features and articulatory models, multi-speaker speech, collecting adequate amounts of conversational speech, ...) and on modeling and training techniques for multi-environment and multi-domain operation. Another key issue will be language modeling. It is well known that different static language models each work best on a specific domain. Implementing a language model that works well on many domains will be an important achievement towards the goal of mimicking human performance. Very quick dynamic adaptation at the word/sentence level is an important target of this research. Finally, other factors driving progress will be the continuous improvement of computer speed over time, independence from vocabulary, and the involvement of all potential researchers in the field, not only a few institutions. Improving performance on conversational speech and introducing highly dynamic language models are the two fundamental requirements for improving S2ST performance. This is maybe the most critical point, because word error rates under 10% on conversational speech seem today a hard problem.</Paragraph>
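The dynamic cross-domain language modeling discussed above is often realized by interpolating several domain-specific models and adapting the mixture weight on the fly. The following is a minimal sketch under toy assumptions: two hypothetical unigram models (`lm_travel`, `lm_news` with invented probabilities) and a deliberately crude weight-update rule, not any specific system's method.

```python
import math

# Toy domain-specific unigram language models (probabilities are
# illustrative values, not real estimates).
lm_travel = {"ticket": 0.05, "train": 0.04, "the": 0.06}
lm_news = {"ticket": 0.001, "train": 0.002, "the": 0.06}

def interpolated_logprob(word, lam, floor=1e-6):
    """Linear interpolation of two domain LMs:
    P(w) = lam * P_travel(w) + (1 - lam) * P_news(w)."""
    p = lam * lm_travel.get(word, floor) + (1 - lam) * lm_news.get(word, floor)
    return math.log(p)

def adapt_weight(recent_words, lam=0.5, step=0.1):
    """Crude dynamic adaptation: shift the interpolation weight toward
    whichever domain LM better explains the most recent words."""
    for w in recent_words:
        if lm_travel.get(w, 0.0) > lm_news.get(w, 0.0):
            lam = min(1.0, lam + step)
        elif lm_news.get(w, 0.0) > lm_travel.get(w, 0.0):
            lam = max(0.0, lam - step)
    return lam

lam = adapt_weight(["train", "ticket"])  # travel-like context raises lam
```

In practice the weights would be re-estimated (e.g. by EM on a history window) rather than nudged heuristically, but the structure, several static models combined by a dynamically updated mixture, is the same.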
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Speech synthesis
</SectionTitle>
      <Paragraph position="0"> Speech synthesis is an important component of a speech-to-speech translation system. Mimicking the human voice is still one of the most challenging goals for speech synthesis. The multilingual human-to-human communication framework introduces new challenges: gender, age and cultural adaptation. Emotion and prosody are also very important issues [7] [8].</Paragraph>
      <Paragraph position="1"> Today the most effective way to generate synthetic speech is based on the concatenation of different acoustic units. This approach is in contrast to traditional rule-based synthesis, where the design of the deterministic units required explicit knowledge and expertise. In a corpus-based approach the unit selection process involves a combinatorial search over the entire speech corpus, and consequently fast search algorithms have been developed for this purpose as an integral part of current synthesis systems. There are three main factors in corpus-based methods for the specification of the speech segments required for concatenative synthesis: first of all a unit selection algorithm, then the objective measures used in the selection criteria, and finally the design of the required speech corpus. From the application point of view, the huge amount of memory necessary for exploiting the concatenation of speech units strongly limits the class of applications. Prosody and speaker characteristics are, together with speech segment design, the other two important issues in speech synthesis. In order to control prosody, it is necessary to ensure adequate intonation and stress, rhythm, tempo and accent. Segmental duration control and fundamental frequency control are needed. Speech waveforms contain not only linguistic information but also speaker voice characteristics, as manifested in the glottal waveform of the voice excitation and in the global spectral features representing vocal tract characteristics. Moreover, paralinguistic factors cause changes in speaking style, reflected in changes of both voice quality and prosody.</Paragraph>
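The combinatorial unit selection search mentioned above is typically cast as a shortest-path problem: each candidate unit carries a target cost (fit to the desired segment) and a join cost (smoothness of the concatenation), and dynamic programming finds the cheapest sequence. A minimal sketch, with hypothetical units and illustrative cost functions of my own choosing:

```python
# Hypothetical unit-selection sketch: pick one candidate unit per target
# segment, minimizing target cost + concatenation (join) cost by dynamic
# programming, as in corpus-based concatenative synthesis.

def select_units(candidates, target_cost, join_cost):
    """candidates: one list of corpus units per target segment.
    Returns the cheapest unit sequence (Viterbi-style search)."""
    # best[u] = (cost of best path ending in unit u, that path)
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for t in range(1, len(candidates)):
        new_best = {}
        for u in candidates[t]:
            cost, path = min(
                (c + join_cost(p[-1], u) + target_cost(t, u), p)
                for c, p in best.values()
            )
            new_best[u] = (cost, path + [u])
        best = new_best
    return min(best.values())[1]

# Toy example: units are (segment_label, pitch); costs are illustrative.
cands = [[("a", 100), ("a", 140)], [("b", 105), ("b", 150)]]
tc = lambda t, u: 0.0                      # assume all units match the target
jc = lambda u, v: abs(u[1] - v[1]) / 100   # penalize pitch discontinuities
print(select_units(cands, tc, jc))         # → [('a', 100), ('b', 105)]
```

Real systems precompute candidate lists per phone-in-context and prune aggressively; the memory cost of keeping the whole corpus available is exactly the application limitation noted above.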
      <Paragraph position="2"> Prosodic modeling is probably the domain from which most of the improvements will come. Investigations in this direction, trying to master linguistic and extra-linguistic phenomena, will probably also address multicultural issues, which are very important in a multilingual human-to-human communication framework.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Machine Translation
</SectionTitle>
      <Paragraph position="0"> Besides speech recognition and synthesis, the translation component is the core of a speech-to-speech translation system. The classical machine translation (MT) problem, translating a text in a given source language, e.g. Italian, into a target language, e.g. Chinese, is a completely different problem from the S2ST problem. First of all, in the classical MT problem no human is involved and the process is a one-way process: the text is supposed to be linguistically 'correct'. In the S2ST process two humans are involved, the process is bi-directional, and the language is conversational, spontaneous, ungrammatical and mixed with non-verbal cues.</Paragraph>
      <Paragraph position="1"> Moreover the environment, in terms of acoustic noise and modality of interaction, is a critical issue.</Paragraph>
      <Paragraph position="2"> Near real-time translation is mandatory in S2ST.</Paragraph>
      <Paragraph position="3"> Then, because humans are involved directly in the process, the understanding phase is carried out by the humans in a collaborative way. Finally, given that a machine is in any case involved in the translation, an important issue related to human-machine communication also has to be considered. In order to tackle the S2ST problem, all these factors have to be taken into account.</Paragraph>
      <Paragraph position="4"> Different architectures have been exploited: some using an intermediate language (interlingua, interchange format), some exploiting a direct translation method. A typical example of the first case is represented by the JANUS [9] and NESPOLE! architectures. The Italian implementation of the NESPOLE! S2ST system architecture consists of two main processing chains: the analysis chain and the synthesis chain. The analysis chain converts an Italian acoustic signal into a (sequence of) IF representation(s) by going through the recognizer, which produces a sequence of word hypotheses for the input signal, and the understanding module, which exploits a multi-layer argument extractor and a statistically based classifier to deliver IF representations. The synthesis chain starts from an IF expression and produces a target-language synthesized audio message expressing that content. It consists of two modules. The generator first converts the IF representation into a more language-oriented representation and then integrates it with domain knowledge to produce sentences in Italian. Such sentences feed a speech synthesizer.</Paragraph>
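The analysis/synthesis chains described above can be sketched as a simple pipeline. This is an illustrative sketch only: every module below is a hypothetical stub (the function names, the IF slot names and the hard-coded outputs are all invented for the example), not the NESPOLE! implementation.

```python
# Illustrative interlingua (IF) based S2ST pipeline; all modules are
# hypothetical stubs standing in for real ASR/NLU/NLG/TTS components.

def recognize(audio):
    """ASR: acoustic signal -> word hypothesis sequence (stubbed)."""
    return ["vorrei", "una", "camera", "doppia"]

def understand(words):
    """Understanding: words -> language-independent IF representation
    (stubbed; a real module would do argument extraction + classification)."""
    return {"dialogue-act": "request", "concept": "room", "type": "double"}

def generate(if_repr, lang="en"):
    """Generation: IF -> target-language sentence (stubbed template)."""
    return f"I would like a {if_repr['type']} {if_repr['concept']}"

def synthesize(text):
    """TTS: text -> audio (stubbed as a tagged string)."""
    return f"<audio:{text}>"

def translate(audio, target_lang="en"):
    # Analysis chain: recognizer + understanding module produce the IF.
    if_repr = understand(recognize(audio))
    # Synthesis chain: generator + speech synthesizer render the target.
    return synthesize(generate(if_repr, target_lang))

print(translate(b"..."))  # → <audio:I would like a double room>
```

The design point the sketch makes concrete is that the IF decouples the two chains: adding a new language means adding one analysis chain and one synthesis chain, not one translator per language pair.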
      <Paragraph position="5"> An example of the direct translation approach is represented by the ATR-MATRIX [10] architecture, which exploits a cascade of a speech recognizer and a direct translation algorithm, TDMT, whose output text is then synthesized. This direct translation approach is implemented using example-based algorithms. A second kind of direct translation, based on statistical modeling, was pioneered by IBM [11] [12], starting from text translation. Statistical translation has also been developed in the European project EU-TRANS and in the framework of the German project VERBMOBIL.</Paragraph>
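The statistical approach pioneered by IBM is commonly formulated as a noisy-channel search: choose the target sentence e maximizing P(e) * P(f | e), i.e. language model times translation model. A minimal sketch over an explicit candidate list, with toy, invented model scores (real decoders search a huge hypothesis space rather than enumerate candidates):

```python
import math

# Noisy-channel sketch of statistical translation: pick the target
# sentence e maximizing log P(e) + log P(f | e). Candidates and model
# probabilities below are toy, hypothetical values.

def best_translation(f, candidates, lm_logprob, tm_logprob):
    """argmax over candidate translations of log P(e) + log P(f|e)."""
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f, e))

lm = {"hello world": math.log(0.02), "hi world": math.log(0.01)}
tm = {("ciao mondo", "hello world"): math.log(0.3),
      ("ciao mondo", "hi world"): math.log(0.5)}

e_best = best_translation(
    "ciao mondo",
    ["hello world", "hi world"],
    lambda e: lm[e],
    lambda f, e: tm[(f, e)],
)
print(e_best)  # → hello world (LM advantage outweighs the TM score)
```

Note how the language model can overrule the translation model: the decomposition is exactly what lets monolingual target-side data improve translation quality.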
      <Paragraph position="6"> At the moment research is going on to develop unified or integrated approaches. To unify speech recognition, understanding and translation as a single statistical process is the ultimate goal of this line of work, as stated in [13]: &amp;quot;We consider this integrated approach and its suitable implementation to be an open question for future research on spoken language translation&amp;quot;. From the performance point of view, the most important experience was obtained in the VERBMOBIL project: a large-scale end-to-end evaluation showed that the statistical approach resulted in significantly lower error rates than three competing translation approaches; the sentence error rate was 29%, compared with 52% to 62% for the other approaches.</Paragraph>
      <Paragraph position="7"> Finally, a key issue for S2ST systems is the end-to-end evaluation methodology. The goal is to develop a methodology based on objective measurements. Evaluation methodologies have been proposed and developed in VERBMOBIL, in C-STAR, and by many other groups.</Paragraph>
    </Section>
  </Section>
</Paper>