SPEECHALATOR: TWO-WAY SPEECH-TO-SPEECH TRANSLATION IN YOUR HAND

3. TRANSLATION

As part of this work we investigated two different techniques for translation, both interlingua-based. The first was purely knowledge-based, following our previous work [3]. The engine developed for this approach was too large to run on the device itself, although we were able to run the generation component off-line, connected seamlessly to the handheld over a wireless link. The second technique used a statistical training method to build a model that translates the structured interlingua (IF) into text in the target language. Because this approach was developed with the handheld in mind, it is efficient enough to run directly on the device, and it is the one used in this demo. (A schematic sketch of IF-based generation is given at the end of Section 5.)

4. SYNTHESIS

The synthesis engine is Cepstral's Theta system. Because the Speechalator runs on very small hardware devices (small, at least, compared to standard desktops), it was important that the synthesis footprint remain as small as possible. At the same time, the Speechalator is intended for users with little exposure to synthetic speech, so the output quality must be very high. Cepstral's unit selection voices, tailored to the domain, meet the requirements for both quality and size: normal unit selection voices may take hundreds of megabytes, but the 11 kHz voices developed by Cepstral were around 9 megabytes each.

5. ARABIC

The Arabic language poses a number of challenges for any speech translation system. The first problem is the wide range of dialects. Just as Jamaican and Glaswegian speakers may find it difficult to understand each other's dialect of English, Arabic speakers of different dialects may find it impossible to communicate. Modern Standard Arabic (MSA) is well defined and widely understood by educated speakers across the Arab world; however, MSA is principally a written language, not a spoken one. Our interest was in dealing with a normal spoken dialect, and we chose Egyptian Arabic: speakers of that dialect were readily accessible to us, and media influence has made it perhaps the most broadly understood of the regional dialects.

Another feature of Arabic is that the written form, except in specific rare cases, does not mark (short) vowels. This makes pronunciations hard to derive for both speech recognition and synthesis. For recognition, solutions have been tested in which vowels are not explicitly modeled but are instead modeled implicitly by context; this would not work well for synthesis, however. We therefore defined an internal romanization, based on the CallHome [4] romanization, from which full phonetic forms can easily be derived. This romanization is suitable for both the recognizer and the synthesizer, and can easily be transformed into Arabic script for display.
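To make the romanization concrete, the following minimal Python sketch uses a hypothetical, heavily abridged symbol inventory (the actual CallHome-based romanization covers the full Egyptian Arabic phone set). The point it illustrates is that a single internal form can yield both the conventional script, with short vowels dropped, and a full phonetic form, with all vowels kept.

    # Hypothetical, abridged romanization table; the real CallHome-based
    # inventory is much larger.
    TO_SCRIPT = {
        "k": "ك", "t": "ت", "b": "ب",  # consonants are always written
        "A": "ا",                       # long vowels are written (alif)
        "a": "", "i": "", "u": "",      # short vowels are normally omitted
    }
    TO_PHONES = {
        "k": "k", "t": "t", "b": "b",
        "A": "aa",                      # long vowel
        "a": "a", "i": "i", "u": "u",
    }

    def to_arabic_script(rom):
        # Drop short vowels to produce the conventional written form.
        return "".join(TO_SCRIPT[c] for c in rom)

    def to_phones(rom):
        # Keep every vowel: the recognizer and synthesizer need them all.
        return " ".join(TO_PHONES[c] for c in rom)

    word = "kitAb"                  # illustrative romanization of "book"
    print(to_arabic_script(word))   # -> كتاب
    print(to_phones(word))          # -> k i t aa b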
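As promised in Section 3, here is a schematic Python sketch of interlingua-based generation. Every name and structure below is a hypothetical stand-in: the real IF specification is richer, and the trained statistical generation model is reduced here to a single template lookup.

    # Hypothetical IF-style frame: a speech act plus concept and arguments,
    # independent of both the source and the target language.
    if_frame = {
        "speech_act": "give-information",
        "concept": "symptom",
        "args": {"symptom_name": "pain", "body_part": "head"},
    }

    # Stand-in for the statistically trained generator: map an
    # (act, concept) pair to a target-language template and fill it in.
    TEMPLATES = {
        ("give-information", "symptom"): "I have a {symptom_name} in my {body_part}",
    }

    def generate(frame):
        template = TEMPLATES[(frame["speech_act"], frame["concept"])]
        return template.format(**frame["args"])

    print(generate(if_frame))  # -> I have a pain in my head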
6. SYSTEM

The end-to-end system runs on a standard Pocket PC device. We have tested it on a number of different machines, including various HP (Compaq) iPaq models (38xx, 39xx) and Dell Axims. It can run on 32 MB machines, but it runs best on a 64 MB machine with about 40 MB made available for program space. The time from the end of spoken input to the start of translated speech is around 2-4 seconds, depending on the length of the sentence and the processor. We have found the 206 MHz StrongARM processors in older Pocket PCs to be slightly faster than the 400 MHz XScale, though no optimization for the newer processors has been attempted.

Upon startup, the user is presented with the screen shown in Figure 1. The speaker presses a push-to-talk button and speaks in his or her own language. The recognized utterance is displayed first, followed by its translation, and the utterance is then spoken in the target language. Buttons are provided for replaying the output and for switching the input to the other language.
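The turn-taking loop just described can be summarized in a short Python sketch. The component interfaces below (recognizer, translator, synthesizer, ui) are hypothetical stand-ins, not the actual module APIs of the system.

    # Schematic push-to-talk turn; all interfaces are hypothetical
    # stand-ins for the actual Speechalator modules.
    def run_turn(recognizer, translator, synthesizer, ui, audio, src, tgt):
        """Recognize one utterance, display it, translate, display, speak."""
        text = recognizer.recognize(audio, language=src)
        ui.show(text)                                 # recognized utterance first
        translation = translator.translate(text, source=src, target=tgt)
        ui.show(translation)                          # then the translation
        synthesizer.speak(translation, language=tgt)  # finally, spoken output

    def switch_languages(src, tgt):
        """Swap input and output languages (the 'switch' button)."""
        return tgt, src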