<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0415"> <Title>Spoken Language Translation with the ITSVox System</Title> <Section position="2" start_page="0" end_page="96" type="metho"> <SectionTitle> 2 Architecture of ITSVox </SectionTitle> <Paragraph position="0"> The ITSVox system consists of (i) a signal processing module based on the standard N-best approach (cf.</Paragraph> <Paragraph position="1"> Jimenez et al., 1995), (ii) a robust GB-based parser, (iii) a transfer-based translation module, and (iv) a speech synthesis module.</Paragraph> <Paragraph position="2"> To sketch the translation process and the interaction of the four components, consider the following example.</Paragraph> <Paragraph position="3"> (1) Je voudrais une chambre avec douche et avec vue sur le jardin.</Paragraph> <Paragraph position="4"> 'I would like a room with shower and with view on the garden' The spoken input is first processed by the HMM-based signal processing component, which produces a word lattice that is then mapped into ranked strings of phonetic words.</Paragraph> <Paragraph position="5"> For simplicity, let us consider only the highest-ranked candidate in the list, which might (ideally) be something like (2), where / stands for voiceless fricatives and 0 for schwas.</Paragraph> <Paragraph position="6"> (2) j vudr iin /gbr av k du/ av k vii sir lO jardi Those word hypotheses constitute the input for the linguistic component. A lexical lookup using the phonetic trie representation described in the next section produces a lexical chart. Applying linguistic constraints, the parser then tries to disambiguate these words to produce a set of ranked GB-style enriched surface structures, as illustrated in (3). 
(3) [TP [DP je] voudrais [DP une chambre [ConjP [PP avec douche] et [PP avec vue [PP sur le jardin]]]]] The best analysis in the automatic mode -- or the analysis chosen by the user in the interactive mode -- undergoes lexical transfer and then a generation process (involving transformations and morphology), which produces target-language GB-style enriched surface structures, as displayed in (4). These structures serve as input either to the orthographic display component or to the speech synthesis component. In the case of English output, most of the speech synthesis work relies on the DECtalk system, although linguistic structures help to disambiguate non-homophonous homographs (read, lead, record, wind, etc.). The French speech output uses the MBROLA synthesizer developed by T. Dutoit at the University of Mons.</Paragraph> <Paragraph position="7"> (4) [TP [DP I] would like [DP a room [ConjP [PP with shower] and [PP with view [PP on the garden]]]]] Several of the components used by ITSVox have been described elsewhere. For instance, the translation engine is based on the ITS-2 interactive model (cf. Wehrli, 1996). The GB parsers (French and English) have been discussed in Laenzlinger &amp; Wehrli (1991) and Wehrli (1992). As for the French speech synthesis system, it is described in Gaudinat and Wehrli (1997).</Paragraph> <Section position="1" start_page="96" end_page="96" type="sub_section"> <SectionTitle> 2.1 The phonetic trie </SectionTitle> <Paragraph position="0"> The phonetic lexicon is organized as a trie structure (Knuth, 1973), that is, a tree structure in which nodes correspond to phonemes and subtrees to possible continuations. Each terminal node specifies one or more lexical entries in the lexical database. 
For instance, the phonetic sequence [sa] leads to a terminal node in the trie connected to the lexical entries corresponding (i) to the feminine possessive determiner sa (her), and (ii) to the demonstrative pronoun ça (that).</Paragraph> <Paragraph position="1"> With such a structure, words are recognized one phoneme at a time. Each time the system reaches a terminal node, it has recognized a lexical unit, which is inserted into a chart (oriented graph) that serves as the data structure for syntactic parsing.</Paragraph> </Section> <Section position="2" start_page="96" end_page="96" type="sub_section"> <SectionTitle> 2.2 Interaction </SectionTitle> <Paragraph position="0"> ITSVox is interactive in the sense that it can request on-line information from the user. Typically, interaction takes the form of clarification dialogues. Furthermore, all interactions are conducted in the source language only, which means that knowledge of the target language is not a prerequisite for users of ITSVox. User consultation can occur at several levels of the translation process. First, it can occur at the lexicographic level, if an input sentence contains unknown words. In such cases, the system opens an editing window with the input sentence and asks the user to correct or modify the sentence. At the syntactic level, interaction occurs when the parser faces difficult ambiguities, for instance when the resolution of an ambiguity depends on contextual or extra-linguistic knowledge, as in the case of some prepositional phrase attachments or coordination structures. By far the most frequent cases of interaction occur during transfer, to a large extent because lexical correspondences are all too often of the many-to-many variety, even at the abstract level of lexemes. It is also at this level that our decision to restrict dialogues to the source language is the most challenging. 
While some cases of polysemy can be disambiguated relatively easily, for instance on the basis of a gender distinction in the source sentence, as in (5), other cases such as the (much simplified) one in (6) are obviously much harder to handle, unless additional information is included in the bilingual dictionary.</Paragraph> <Paragraph position="1"> (5)a. Jean regarde les voiles.</Paragraph> <Paragraph position="2"> 'Jean is looking at the sails/veils' b. masculin (le voile) féminin (la voile) (6)a. Jean n'aime pas les avocats.</Paragraph> <Paragraph position="3"> 'Jean doesn't like lawyers/avocados' b. avocats: homme de loi (lawyer) fruit (fruit) Another common case of interaction that occurs during transfer concerns the interpretation of pronouns, or rather the determination of their antecedent. In a sentence such as (7), the possessive son could refer either to Jean, to Marie or (less likely) to some other person, depending on context. (7) Jean dit à Marie que son livre se vend bien.</Paragraph> <Paragraph position="4"> 'Jean told Marie that his/her book is selling well' In such a case, a dialogue box specifying all possible source-language (SL) antecedents is presented to the user, who can select the most appropriate one(s).</Paragraph> </Section> <Section position="3" start_page="96" end_page="96" type="sub_section"> <SectionTitle> 2.3 Speech output </SectionTitle> <Paragraph position="0"> Good quality speech synthesis systems need a significant amount of linguistic knowledge in order (i) to disambiguate homographs which are not homophones (words with the same spelling but different pronunciations, such as to lead/the lead, to wind/the wind, he read/to read, he records/the records, etc.), (ii) to derive the syntactic structure which is used to segment sentences into phrases, to set accent levels, etc., and finally (iii) to determine an appropriate prosodic pattern. 
In a language like French, the type of attachment is crucial to determine whether a liaison between a word ending with a (latent) consonant and a word starting with a vowel is obligatory, possible, or impossible 1.</Paragraph> <Paragraph position="1"> 1 For instance, liaison is obligatory between a prenominal adjective and a noun (e.g. petit animal), or between Such information is available during the translation process. It turns out that in a linguistically sound machine translation system, the surface structure representations specify all the lexical, morphological and syntactic information that a speech synthesis system needs.</Paragraph> </Section> </Section> <Section position="3" start_page="96" end_page="96" type="metho"> <SectionTitle> 3 Concluding remark </SectionTitle> <Paragraph position="0"> Although a small prototype has been completed, the ITSVox system described in this paper needs further improvements. The speech processing system under development at IDIAP is speaker-independent, HMM-based and contains models of phonetic units.</Paragraph> <Paragraph position="1"> A lexicon of word forms and an N-gram language model constitute the linguistic knowledge of this component. With respect to the linguistic components, current efforts focus on such tasks as retrieving punctuation and using stochastic information to rank parses. Those developments, however, will not affect the basic guideline of this project, which is that speech-to-speech translation systems and text translation systems must be minimally different.</Paragraph> </Section> </Paper>