<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1003"> <Title>THE ESPRIT PROJECT POLYGLOT</Title> <Section position="4" start_page="0" end_page="7" type="metho"> <SectionTitle> COMMON TASKS </SectionTitle> <Paragraph position="0"> In ESPRIT, projects are set up as collaborative enterprises.</Paragraph> <Paragraph position="1"> Thus, explicit efforts are made to ensure that all partners cooperating in a Consortium use common, or at least compatible, procedures. Ideally, they should even use common hardware. In Polyglot the ideal of common hardware could not be reached, since most of the partners had already acquired most of the computers necessary for carrying out the research before the start of the project. The budget available for the project did not allow the purchase of completely new equipment. This necessitated considerable effort in specifying standards for hardware and software in order to obtain a common platform. One very important advantage of a collaborative project is that the costs of software development can be kept to a minimum by distributing the development tasks over the partners. Obviously, this is one aspect where the distinction between Language Independent frameworks and Language Specific data plays a major role: the Language Independent software needs to be written only once and made available to all partners. Since it is not yet feasible to produce completely system-independent software, it was specified that all software written as part of the project should be in 'C' and that, with few exceptions, every program should be able to compile and run on a SUN station and on an MS-DOS PC.</Paragraph> <Paragraph position="2"> Another field where standardization is crucial is the recording of speech databases. Since databases for seven different languages are needed, it was not possible to do all recordings at a single site. 
In order to obtain compatible recordings from seven or so different sites in seven different countries, precise specifications of the recording conditions had to be developed. That process was complicated by the fact that several Work Packages had different requirements with respect to recording quality and procedures.</Paragraph> <Paragraph position="3"> It has been the task of the WP Common Tasks to provide all standards and specifications. Moreover, this WP was responsible for the organization and the monitoring of the acquisition of all databases needed in the other WP's. Finally, it was responsible for the production of the software for common use.</Paragraph> </Section> <Section position="5" start_page="7" end_page="7" type="metho"> <SectionTitle> ISOLATED WORD SPEECH RECOGNITION </SectionTitle> <Paragraph position="0"> The WP IWSR aims at the implementation of very large vocabulary, speaker adaptive, isolated word speech recognition for all seven languages of the consortium. In practice, an attempt is made to extend an existing system for Italian to six other languages \[3\]. That system was designed to offer fast speaker enrollment, easy modification of the dictionary and flexible control.</Paragraph> <Paragraph position="1"> The systems run on an MS-DOS PC that uses one or two special-purpose plug-in boards. After signal processing, resulting in vectors of 20 LPC Cepstrum coefficients and two energy values for each 10 ms speech frame, each frame is given the label of the nearest phonemic template. The string of prototypes thus formed is then used for a fast lexical access that retrieves the 100 or so most likely word candidates. The Dynamic Programming string match used in this preselection phase relies on knowledge about phoneme confusions, phoneme durations and phoneme and diphone frequencies in the language. Typically, some 25 templates are used during preselection. 
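The preselection stage just described (nearest-template frame labeling followed by a Dynamic Programming string match against the lexicon) can be sketched as follows. This is an illustrative simplification, not the project's actual C code: the feature dimensions, cost tables and function names are assumptions.

```python
import numpy as np

def label_frames(frames, prototypes):
    """Assign each speech frame the index of the nearest phonemic
    template (Euclidean distance over the feature vector)."""
    dists = np.linalg.norm(frames[:, None, :] - prototypes[None, :, :], axis=2)
    return dists.argmin(axis=1)

def dp_string_match(observed, target, sub_cost):
    """Weighted edit distance between the observed label string and a
    word's phonemic form; sub_cost[a][b] models phoneme confusions
    (insertions and deletions get a flat cost of 1 here)."""
    n, m = len(observed), len(target)
    d = np.zeros((n + 1, m + 1))
    d[:, 0] = np.arange(n + 1)
    d[0, :] = np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i, j] = min(d[i - 1, j] + 1,      # deletion
                          d[i, j - 1] + 1,      # insertion
                          d[i - 1, j - 1] + sub_cost[observed[i - 1]][target[j - 1]])
    return d[n, m]

def preselect(observed, lexicon, sub_cost, n_best=100):
    """Return the n_best lexicon entries with the lowest match cost."""
    ranked = sorted(lexicon, key=lambda w: dp_string_match(observed, lexicon[w], sub_cost))
    return ranked[:n_best]
```

In the real system the confusion, duration and frequency knowledge would all enter the cost table; here a single substitution-cost matrix stands in for them.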
Next, Fine Phonetic Analysis (FPA) is used to sort the word candidates produced during preselection and retain only the 1-5 best scoring candidates. The objective function is based on the distance from spectral prototypes (during FPA the number of prototypes is typically around 70), duration and three features derived from energy. A left-to-right beam search is used to find the optimal alignment between the speech and the phonetic representations of the words returned by preselection. Finally, a language model is used to select the best scoring word among the output of FPA. The language model combines word frequencies, a bigram model, some deterministic knowledge and the acoustic probability of each candidate in a single probabilistic score.</Paragraph> <Paragraph position="2"> Equipped with just the DSP board that performs the LPC analysis, the system runs in real-time with vocabularies of 20,000 words. In order to obtain real-time performance with much larger vocabularies (say between 60,000 and 100,000 words) another special purpose board, built with four different ASIC's that speed up preselection, has been designed.</Paragraph> <Paragraph position="3"> Speaker Enrollment Most of the knowledge in the system is obtained from processing large amounts of speech from a large number of different speakers. Thus, only the prototypes used in preselection are speaker dependent. 
Since the number of prototypes used during that stage is typically very small, it is an easy task to acquire personal prototypes for a new speaker.</Paragraph> <Paragraph position="4"> Enrollment consists of speaking some 40 carefully chosen words that are processed by automatic prototype extraction software.</Paragraph> <Paragraph position="5"> Modifying the Vocabulary Tools are provided for the maintenance of the dictionaries.</Paragraph> <Paragraph position="6"> When new words are added, the graphemic forms are automatically converted to phonemic forms and rules are provided for the generation of the most common pronunciation variants. Since lexical access during preselection and the scoring during FPA rely heavily on phonemic models, accurate modeling of the pronunciation is mandatory.</Paragraph> <Section position="1" start_page="7" end_page="7" type="sub_section"> <SectionTitle> Flexible Control </SectionTitle> <Paragraph position="0"> For many applications the performance of an IWSR system can be immensely improved if the size of the vocabulary can be dynamically adapted to the state of the dialogue. In fact, very large vocabularies are only needed during free text dictation.</Paragraph> <Paragraph position="1"> The IWSR system developed for Italian, and in the process of adaptation for the other languages, allows on-line selection of subsets of words from the base vocabulary. Obviously, the ability to make this selection is especially important in preselection.</Paragraph> <Paragraph position="2"> State-of-the-Art Most of the work needed to implement preselection for all languages has now been completed. Dictionaries comprising representations for use in preselection have been compiled for all languages. Also, prototypes for a small number of speakers in each language have been built. 
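The automatic prototype extraction mentioned above can be sketched as a simple centroid computation over labelled enrollment frames. This is a hypothetical simplification (the actual software is not described in this paper); it assumes the enrollment frames have already been assigned phonemic labels, e.g. by forced alignment of the 40 enrollment words.

```python
import numpy as np

def extract_prototypes(frames, labels, n_labels):
    """For each phonemic label, average the enrollment frames assigned
    to it and take the centroid as the speaker's personal prototype."""
    protos = np.zeros((n_labels, frames.shape[1]))
    for k in range(n_labels):
        protos[k] = frames[labels == k].mean(axis=0)
    return protos
```

Because only a small number of preselection prototypes is involved, a handful of words per phoneme suffices to estimate the centroids, which is why enrollment can be fast.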
Preliminary tests run in January 1991 have shown that acceptable preselection results (for at least 98% of the words spoken, the correct word is in the preselection list) are obtained for all languages under consideration. Formal tests of preselection performance with lexica of 2000 words are planned for April 1991. For the pilot languages English and Dutch much larger dictionaries are available (and will be tested).</Paragraph> </Section> </Section> <Section position="6" start_page="7" end_page="7" type="metho"> <SectionTitle> CONTINUOUS SPEECH RECOGNITION </SectionTitle> <Paragraph position="0"> This Work Package aims at demonstrating the feasibility of continuous speech recognition in situations that differ from the DARPA Resource Management task. In addition, it is proposed to carry out an in-depth investigation of the viability of alternatives to the HMM approach. Still, the DARPA RM task was chosen as a reference, in order to be able to relate the performance of the systems built in Polyglot to a generally accepted and well-understood standard.</Paragraph> <Paragraph position="1"> Originally it was planned to do a large number of experiments in which integrated search would be compared with bottom-up phoneme based search. Moreover, both search strategies would be used with HMM, TDNN, and the frame labels produced by the Olivetti IWSR system described above.</Paragraph> <Paragraph position="2"> Finally, the approaches would be compared with respect to acoustic-phonetic decoding and word and sentence accuracy.</Paragraph> <Paragraph position="3"> Unfortunately, the limited resources available for this WP combined with a host of practical problems forced a drastic reduction of these plans. It is now intended to limit the investigation to TDNN and HMM in integrated search. On the other hand, more emphasis will be put on work on language models and on system integration aspects, especially with respect to the possibility of using linguistic constraints to improve phonetic decoding. 
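A language-model score of the kind discussed here — a bigram model interpolated with the acoustic evidence, backing off to word frequencies for unseen bigrams — might look like the following sketch. The interpolation weight, the back-off constant and all names are illustrative assumptions, not values from the project.

```python
import math

def sentence_score(words, acoustic_logprob, unigram, bigram, lm_weight=0.7):
    """Combine acoustic and bigram language-model log-probabilities
    into a single score, backing off to unigram frequencies when a
    bigram was never observed."""
    lm = 0.0
    prev = "<s>"                              # sentence-start symbol
    for w in words:
        p = bigram.get((prev, w))
        if p is None:                         # back off to the unigram estimate
            p = 0.4 * unigram.get(w, 1e-6)
        lm += math.log(p)
        prev = w
    return lm_weight * lm + (1.0 - lm_weight) * acoustic_logprob
```

In integrated search such a score would be accumulated incrementally during decoding rather than computed per complete sentence.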
Apart from the DARPA speaker dependent RM task, the systems will be tested on a corpus of read newspaper text. Such a corpus, limited to a vocabulary of 5000 words, is available for British-English and German.</Paragraph> <Paragraph position="4"> Formal tests of the performance on the DARPA RM task will be available in August 1991. Preliminary tests with the Continuous Mixtures Densities HMM system developed by Philips Hamburg Research showed encouraging performance: using 46 monophones and 227 triphones a word error rate of 23.3% was obtained in the no-grammar condition. The triphones were selected by choosing only those that had a frequency of occurrence > 10 in the dictionary (data presented by Herman Ney during the January 1991 Review Meeting).</Paragraph> </Section> <Section position="7" start_page="7" end_page="7" type="metho"> <SectionTitle> TEXT-TO-SPEECH CONVERSION </SectionTitle> <Paragraph position="0"> In Polyglot, a relatively large part of the resources is devoted to Text-to-Speech (TTS) conversion. This is because we believe that a high quality TTS system is essential for the majority of the applications in which speech technology is to provide the major user interface. Moreover, high quality TTS systems are not yet available for most languages represented in Polyglot. Last but not least, even if such systems existed for some languages, they could not be integrated into a single system that has an architecture that is uniform for all languages.</Paragraph> <Paragraph position="1"> Advanced Features Automatic Language Identification. The TTS system that is being developed in Polyglot will have a number of unique features. One is an automatic Language Identification Module (LIM) that is able to identify the language of each sentence sent to the TTS system. Since we will have a multi-lingual system with a uniform architecture for all languages the LIM will act as an intelligent switch that selects the appropriate language for each sentence. 
LIM is implemented as a rule-based program that uses three knowledge sources:
* a list of very frequent words for all languages
* a list of letter combinations that can or cannot occur word initially and word finally in any of the languages
* a list of letter sequences that cannot occur word internally in any of the languages
These word-level knowledge sources are not sufficient to determine the language for each word. In fact, many words can, and do, occur in more than one language. Therefore, a sentence level scoring mechanism is added that selects the most likely language for each complete sentence \[4\]. It has been shown that LIM in its present form performs virtually without errors. Most problem cases that were found in a test on large text corpora appeared to be due to errors of a very preliminary version of the sentence boundary detection algorithm and the inability of the LIM to recognize, e.g., addresses in foreign languages.</Paragraph> <Paragraph position="2"> Syntax and Prosody. Most existing TTS systems suffer from inadequate prosody, due to the fact that syntactic processing is kept to a minimum. However, the Polyglot system will do sufficient syntactic and prosodic processing to be able to generate adequate intonation in most neutral texts. To that end it will use a medium sized lexicon (between 5,000 and 10,000 most frequent full words for each language) containing phonemic forms and word class information, a set of 'morphological' rules that guess the word classes of the words not found in the dictionary, a Markov grammar that computes the optimal ordering of the possible classes of all words and a Wild Card Parser (WPC), i.e., a deterministic parser based on a Context Free Grammar. The WPC attempts to account for the maximum number of words in an input sentence using the minimum number of major syntactic constituents.</Paragraph> <Paragraph position="3"> Thus, it yields a partial parse each time a complete parse of the input is not possible. 
Partial parses may contain words that are not part of a syntactic constituent; these unaccounted words are called Wild Cards \[5\]. The output of the WPC is given to a prosodic processor that implements a form of the 'Focus-Accent' theory that predicts the relation between syntax and prosody as well as the words that should carry pitch accents \[6,7\]. Experiments for Dutch have shown that the approach yields excellent results. Consultation with the partners working on other languages has confirmed that the same approach should work for all languages under consideration.</Paragraph> <Paragraph position="4"> Multi-level Data Structure. In order to be able to take full advantage of the syntactic and prosodic information it was necessary to design a multi-level data structure for the TTS system in which information on several levels can be stored in such a way that levels can be linked with one another and each rule can access all relevant information on all levels \[8\]. In order for this quite complicated data structure to be addressable by phonetic rule writers, a rule formalism had to be designed that allows expression of rules in the form of a graphical representation of the relevant levels of the data structure. It is expected that the prosodic information will not only be helpful in generating high quality intonation contours, but that it will also enable us to improve the segmental rules, because it offers an easy way to account for interaction between prosodic and segmental phenomena.</Paragraph> <Paragraph position="5"> The Architecture The architecture of the Polyglot TTS system is highly modular. Thanks to the flexibility of the multi-layered data structure and the access functions that come with it, it is possible for each language to use exactly those modules that are needed. It is possible to add layers to the data structure, and that can be done in such a way that only those languages that use the new layers will actually implement them. 
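The multi-level idea — items on different levels linked so that a rule can follow the links to whatever information it needs — can be sketched with a toy structure. The level names and the linking interface are hypothetical; the project's actual rule formalism and access functions are only described, not shown, in this paper.

```python
class Item:
    """One unit on a level of the multi-level data structure
    (a word, a syllable, a phoneme, ...)."""
    def __init__(self, value):
        self.value = value
        self.links = {}                   # level name -> list of linked Items

    def link(self, level, other, back_level=None):
        """Link this item to an item on another level; optionally store
        the reverse link so rules can traverse in both directions."""
        self.links.setdefault(level, []).append(other)
        if back_level is not None:
            other.links.setdefault(back_level, []).append(self)

# A word linked to its phonemes: a segmental rule working on a phoneme
# can climb back to the word level, e.g. to read prosodic information.
word = Item("cat")
for ph in ["k", "ae", "t"]:
    word.link("phoneme", Item(ph), back_level="word")
```

Adding a new layer amounts to introducing a new level name and linking items to it, which is why languages that do not need, say, a morphology layer can simply leave it out.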
Thus it is possible for each language to choose exactly those types of information (i.e., layers) that are necessary. For instance, morphological analysis is essential for English, but it may not be necessary for Italian. If the Italian version of the TTS system does not need the morphology layer, it simply does not use it. Of course, it is then not possible for rules in the Italian system to refer to morphological data.</Paragraph> <Paragraph position="6"> The Polyglot TTS system will use rule-based synthesis.</Paragraph> <Paragraph position="7"> Segmental rules that produce highly intelligible speech are available for Italian, Spanish and Dutch. For Dutch the rule-based system that is under development, partly in the framework of Polyglot, partly under the national Dutch SPIN program, has been shown to equal the best competing diphone system in intelligibility both on the level of segments and paragraphs \[9\]. It is believed that rule-based synthesis offers better opportunities to improve speech quality than diphone systems. The rules for the other languages will be obtained by adapting existing rules for other languages. In order to support that conversion task a very flexible working environment has been built that allows the rule developer to look at parameter tracks and spectrograms of both natural and synthetic utterances, to listen to both natural and synthetic versions of an utterance, and to change rules interactively.</Paragraph> <Paragraph position="8"> For all languages a prosody data base has been recorded that consists of large numbers of sentences covering most syntactic and prosodic structures of interest as well as a number of short prose passages. All material has been read by two professional speakers, a female and a male. The speech material is transcribed on the segmental and the suprasegmental level and the resulting information is stored in a data base that allows one to access the data in many different ways. 
It is intended to segment and label the speech on the level of segments, so that the information can be used to develop and test duration rules.</Paragraph> <Paragraph position="9"> In addition, the suprasegmental transcriptions are used to derive and test rules that predict pause locations and pitch contours.</Paragraph> <Paragraph position="10"> Most of the linguistic and phonetic rules are implemented in such a way that they can be executed on a PC. The conversion of phoneme target values to filter parameters and the computation of the eventual speech signal are done on a DSP32C board.</Paragraph> </Section> <Section position="8" start_page="7" end_page="10" type="metho"> <SectionTitle> APPLICATIONS </SectionTitle> <Paragraph position="0"> ESPRIT projects are mainly application driven. Unlike the situation in the DARPA community, one of the major evaluation criteria for ESPRIT projects is the commercial interest of the results and the commitment of the industrial partners to commercialize those results. Thus, it is only natural that a considerable amount of the resources in Polyglot is spent on the development of prototype applications. Unlike most groups working on speech technology, the Polyglot Consortium is not exclusively aiming at applications that involve information access via the public telephone network.</Paragraph> <Paragraph position="1"> Most of the applications that are under development are intended for use in the office or in the classroom. Of course, many commercially viable applications will require the combination, if not the integration, of speech recognition and text-to-speech conversion. 
As can be seen below, this is reflected in some of the prototypical applications in Polyglot.</Paragraph> <Paragraph position="2"> Also, the applications that are at this moment limited to either speech input or speech output can be easily extended to exploit a combination of both technologies.</Paragraph> <Paragraph position="3"> Application Architecture In Polyglot it was decided that the work on applications should not be limited to the development of a number of prototypes. Instead, it was felt that the development of a uniform Application Architecture that would enable application developers to integrate speech technology in an easy way was at least as important from the point of view of future exploitation of the technology. Therefore, one of the major tasks of the APP WP is the specification and implementation of so-called Application Programming Interfaces (API's) that will allow almost every application program to interface with the Polyglot IWSR and TTS systems. The API's will be provided for MS-DOS and MS-WINDOWS; the desirability of providing API's for UNIX is still under investigation.</Paragraph> <Paragraph position="4"> Application Prototypes Dictation. Polyglot is working on a number of prototype applications. Perhaps the most ambitious one is report preparation. Although there are many domains where such a system would be useful, the demonstration prototype will be limited to medical dictation. The prototype will work in two modes, viz. interactive and free dictation. In interactive mode the system will guide the user through a predefined protocol.</Paragraph> <Paragraph position="5"> The system will ask a number of questions, using the TTS system, and at each point in the dialogue the user is offered a number of possible answers that he or she can speak. Each alternative will generate the appropriate passage in the eventual report. 
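The interactive mode just described — a fixed protocol of questions whose recognized answer selects a report passage — can be sketched as follows. The protocol content and the recognizer callback are invented for illustration; the actual medical protocols are not part of this paper.

```python
def run_protocol(protocol, answer_fn):
    """Walk a predefined question protocol; at each step the recognizer
    (modelled here as a callback over the offered alternatives) returns
    one alternative, whose passage is appended to the report."""
    report = []
    for question, alternatives in protocol:
        choice = answer_fn(question, list(alternatives))
        report.append(alternatives[choice])
    return " ".join(report)

# A toy radiology-style protocol: each question maps spoken
# alternatives to report passages.
protocol = [
    ("Which side?", {"left": "Left lung:", "right": "Right lung:"}),
    ("Findings?",  {"normal": "no abnormalities seen.",
                    "opacity": "an opacity is present."}),
]
```

Free dictation would add one extra alternative per question that switches the recognizer into large-vocabulary mode and copies the recognized text into the report instead of a fixed passage.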
In free dictation mode the dialogue will still be the same, thus ensuring that the report will be complete under all circumstances, but now the user is offered an extra alternative after each question; if (s)he does not choose one of the fixed alternatives, but the added alternative free text, the system will go into dictation mode and the user can enter arbitrary text that will be copied to the report. A laboratory version of the complete system is available for a Radiology application in Italian. A fully developed prototype will follow soon. For British-English a laboratory version of the interactive mode is available for applications in Radiology and Pathology. Prototypes for the other languages are expected to be ready by the end of 1991.</Paragraph> <Paragraph position="6"> Office Automation. Bull SA already markets an application named Microname that offers access to a data base of telephone numbers in a company. At present, access is only via a terminal. Ergonomic and marketing studies have shown that there is a need for access by voice. The speech version of Microname will be offered integrated with a speech-driven version of another product, called Micropost, that offers telephone access to Electronic Mail. The user will be offered the possibility of accessing his or her E-Mail system via the public telephone network, scanning through the messages and asking that one or more messages be read by the TTS system.</Paragraph> <Paragraph position="7"> Obviously, this is one of the applications where automatic identification of the language of the message is essential.</Paragraph> <Paragraph position="8"> Teaching Aids. In a multi-lingual community like Europe there is an ever increasing need for language training and teaching. One of the Polyglot applications addresses this need by offering flexible computer assisted instruction in learning the meaning of words, in spoken language comprehension and in spelling proficiency. 
Based on a CD-ROM containing a large number of multi-lingual dictionaries, the user will be able to enter any word in one of the languages and see and hear the translation in another language. Since most words will have several, if not many, translations, it will be possible to view and to hear the words in sentence contexts that help to make the correct choice. In another application the TTS system will read sentences or passages to the user in one of the languages. The user is then asked questions that test comprehension. On request, the computer can check the user's spelling proficiency by asking him or her to type the sentences that were spoken and comparing the input with the text sent to the TTS system.</Paragraph> </Section> </Paper>