<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1010"> <Title>Spoken Language Processing in the Framework of Human-Machine Communication at LIMSI</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> LIMSI & CNRS </SectionTitle> <Paragraph position="0"> The LIMSI laboratory is a &quot;full&quot; CNRS (French National Research Agency) laboratory. The acronym stands for &quot;Laboratoire d'Informatique pour la M6canique et les Sciences de t'Ing6nieur&quot; (Laboratory of Informatics for Mechanical, Chemical and Electrical Engineering).</Paragraph> <Paragraph position="1"> CNRS is the French National Research Agency. It was created in 1939. With 12,000 permanent researchers and 15,000 permanent technical and administrative staff, it is probably the largest research institution in Europe. It has 375 full laboratories, and 1000 academic &quot;associated&quot; laboratories. The 1991 budget of CNRS was 2,000 M$ (74% for salaries). CNRS has a general director, and is divided into 6 different scientific departments, each having a scientific director. LIMSI is attached to the &quot;Engineering Sciences&quot; department, with a secondary attachment to the &quot;Social and Human Sciences&quot; Department.</Paragraph> <Paragraph position="2"> There are also 40 advisory committees, forming the &quot;National Committee for Scientific Research&quot;, and corresponding to general research areas. Those committees are responsible for evaluating the quality of each laboratory and of each researcher. LIMSI is principally attached to the committee 7 (Information Sciences and Technologies: Computer Science, Control and Signal Processing), but has secondary attachments to sections 10 (Chemical and Mechanical Engineering), 34 (Language Sciences), and may in the near future be attached to section 29 (Cognition and Psychology).</Paragraph> <Paragraph position="3"> For 1991, the laboratory staff comprised approximately 180 members, with a total budget of 6 MS (60% for salaries).</Paragraph> <Paragraph position="4"> With J. Mariani as director, the laboratory is structured in Departments, Groups and Research Topics, each with its own manager. It has two departments: Human-Machine Communication, headed by J. Mariani and Mechanical and Chemical Engineering, headed by P. Le Qu6r6, which will not be presented here.</Paragraph> </Section> <Section position="4" start_page="0" end_page="55" type="metho"> <SectionTitle> THE HUMAN-MACHINE COMMUNICATION DEPARTMENT </SectionTitle> <Paragraph position="0"> The Human-Machine Communication department has a total of about 100 persons (38 permanent researchers (CNRS and University, including 30 PhDs), 3 technical and administrative staff, 38 PhD students and 36 postdoctoral, contractual, associate and visiting researchers, or Master Thesis students). There are 3 research groups: the Speech Communication group, headed by F. N6el, the Language C/J Cognition group, headed by G. Sabah, and the Non-Verbal Commu- null nication group, headed by D. Teil. The groups are divided in research topics, each being headed by a researcher. This structure is flexible, and researchers may participate in different, research topics (see Appendix).</Paragraph> <Paragraph position="1"> The activities in Speech Communication in the laboratory were initiated in 1968 by Dr J.S. Li~nard. The group itself was created in 1981. The Natural Language Processinggroup was created in 1985, when a group headed by Dr G. 
Sabah at University Paris VI joined a group already situated at University Paris XI, the location of LIMSI. The Non-Verbal Communication group was created in Fall 1989 by merging teams in the laboratory already working on 3D modeling, Computer Vision and Robotics.</Paragraph> <Paragraph position="2"> The Human-Machine Communication Department conducts research in closely related areas: Speech, Language, Vision and other means of communication between humans and machines. These areas share common methodologies, and studying them in the same laboratory allows for an in-depth study of the different communication modes, which, we believe, is mandatory for the development of multimodal communication systems. The research activities are multidisciplinary, dealing with Computer Science, Linguistics, Cognitive Psychology and Neurosciences, and have both theoretical and applied aspects.</Paragraph> <Paragraph position="3"> The communication process is studied as a triplet Perception - Production - Cognition: Perception by the machine of speech (which could be enlarged to any sound), of vision (with reading as a subpart), and of touch or gesture; Production of speech or text, and image synthesis (which could be enlarged to Solid Modeling); and all the cognitive aspects related to dialogue, reasoning, knowledge representation and the integration of different communication modes. We try as much as possible to link the studies of production and perception (i.e. the emission and reception of information), as can be observed in speech recognition and synthesis, text analysis and generation, scene analysis and image synthesis, and movement sensing and effort feedback.</Paragraph> <Paragraph position="4"> The relationship between Language and Image processing is becoming more and more important, with the concept of &quot;Intelligent Images&quot;, where the different parts of the image are in agreement with their mutual constraints and with the constraints of the real physical world (Newton's law of gravity, the phenomena which occur in an explosion, ...). Those images need advanced Human-Machine Communication, as the task is complex, with commands like &quot;Put the ball on the table and make it bounce back&quot;.</Paragraph> <Paragraph position="5"> This opens the perspective of a true multimodal communication system, including oral, written, visual and gestural communication, with the typical problems of multireference processing (like a voice message accompanied by a gesture). It can even be argued that processing multiple communication modes simultaneously is mandatory for each of the modes, as it is a necessity for knowledge acquisition in training within self-organizing approaches.</Paragraph> <Paragraph position="6"> We find common methodologies in the different domains: signal processing techniques, statistical or structural coding, and Vector or Matrix Quantization in Speech and Vision. Pattern recognition techniques such as Hidden Markov Models, Markov Random Fields, Multi-Layer Perceptrons and Boltzmann Machines are used in speech, language and vision.</Paragraph>
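As an illustration of one such shared methodology, below is a minimal vector quantization sketch in Python. It is not the laboratory's code; the sizes, names and random data are assumptions chosen only to show the idea: a codebook is learned with plain k-means, and each feature vector, whether a speech analysis frame or an image patch, is then replaced by the index of its nearest codeword, yielding the discrete symbol streams consumed by models such as discrete HMMs.

import numpy as np

def train_codebook(vectors, k, iters=20, seed=0):
    """Learn a k-entry codebook with plain k-means (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Assign every vector to its nearest codeword (Euclidean distance).
        dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each codeword to the centroid of the vectors assigned to it.
        for j in range(k):
            members = vectors[labels == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook

def quantize(vectors, codebook):
    """Replace each vector by the index of its nearest codeword."""
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)

# The same routine serves speech and vision: rows could be 12-dimensional
# cepstral frames of an utterance, or flattened image patches.
frames = np.random.randn(500, 12)
codebook = train_codebook(frames, k=64)
symbols = quantize(frames, codebook)   # discrete observation sequence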
<Paragraph position="7"> Morphological, Lexical, Grammatical, Syntactic, Semantic or Pragmatic analyses are applicable to written and spoken language, with specifics for speaking or writing.</Paragraph> <Paragraph position="8"> Finally, these domains also have the commonality of requiring a study of Human Factors (Ergonomics) for the design of acceptable systems.</Paragraph> </Section> <Section position="5" start_page="55" end_page="56" type="metho"> <SectionTitle> THE SPEECH COMMUNICATION GROUP </SectionTitle> <Paragraph position="0"> The Speech Communication group produced, very early on, a stand-alone text-to-speech (TTS) synthesis system for French (Icophone). The Icophone V system was marketed in 1975 (and appeared in Electronics, June 1975) by the TITN company, and a single unit of this one-cubic-meter TTS system was sold (to the Iranian Minister of Education...) at that time.</Paragraph> <Paragraph position="1"> The TTS system used a battery of 44 analog oscillators to produce diphone-based synthesis (using 427 diphones). In 1980, it was replaced by a single-board digital version, Icolog, which is marketed by the Vecsys company. The grapheme-to-phoneme conversion algorithm uses about 1,000 rules. It has to take into account the difficult problem of liaisons between words in French.</Paragraph> <Paragraph position="2"> Another early application was formant synthesis, with a first synthesizer designed in 1975 (Icophone VI). A parallel formant synthesizer based on segmental rules is now being developed in the framework of the CEC/Esprit Polyglot project, and should be adapted for 7 different languages within the consortium.</Paragraph> <Paragraph position="3"> Together with the Text-to-Speech effort, research was conducted on the segmentation of continuous speech into phonemes and on speech recognition. An analog speech segmenter, using a bank of 32 analog filters, was presented in 1974 (J.S. Liénard et al., 1974). The speech recognition work addressed both analytical phoneme recognition, in the design of a &quot;Speech Understanding System&quot; comparable to those developed in the ARPA-SUR project (J. Mariani et al., 1978), and word-based recognition.</Paragraph> <Paragraph position="4"> The first approach was also aimed at designing a &quot;phoneme vocoder&quot; with a 50 b/s rate. The experiments conducted in this project showed that a phoneme recognition rate of at least 85%, with no major recognition errors (like vowel/consonant substitutions), was necessary in order to transmit a message that could be understood by the human listener (J.S. Liénard et al., 1977).</Paragraph> <Paragraph position="5"> The word-based approach took advantage of the idea that the stationary parts of the signal convey less, and more variable, information than the transition parts (J.L. Gauvain, J. Mariani, J.S. Liénard, 1983). This led us to non-linear fixed-length compression for isolated word recognition (the Morse system, in 1980), and to non-linear variable-length compression for connected word recognition (the Mozart system, in 1982) (J.L. Gauvain, J. Mariani, 1982). Both systems used template matching via dynamic programming, and both resulted in single-board products marketed by the Vecsys company.</Paragraph>
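For background, the sketch below shows the kind of dynamic-programming template matching such systems rely on. It is a generic DTW formulation, not the actual Morse or Mozart implementation: feature extraction is abstracted away, and the local path constraints and all names are illustrative assumptions.

import numpy as np

def dtw_distance(template, utterance):
    """Template matching by dynamic time warping.

    Both arguments are sequences of feature vectors (frames). The DP
    recurrence allows non-linear compression and stretching of the time
    axis, which is what makes fixed templates usable for variable-rate
    speech.
    """
    n, m = len(template), len(utterance)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(template[i - 1] - utterance[j - 1])
            # Classic local constraints: diagonal match, insertion, deletion.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m] / (n + m)   # length-normalized, so templates are comparable

def recognize(utterance, templates):
    """Pick the vocabulary word whose template aligns best with the input."""
    return min(templates, key=lambda w: dtw_distance(templates[w], utterance))

# Usage: `templates` maps each vocabulary word to a reference frame sequence,
# e.g. word = recognize(frames, {"gauche": t1, "droite": t2}).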
<Paragraph position="6"> These systems were the first isolated word recognition (IWR) and connected word recognition (CWR) systems made in France, and they have been used in several applications, such as voice-activated telephone dialing (Jeumont-Schneider, 1985) and packet sorting (NMPP, 1983). Most of these applications provided the opportunity for experimenting with voice input as a new communication means, but the market stayed limited to a few units. The application to which most effort was devoted was pilot-plane dialog. In collaboration with the Crouzet company, a program was conducted for the &quot;Research and Technology Agency&quot; (DRET) of the French DoD. A flight using an IWR voice command system issuing actual commands to the plane was made in July 1982, and was reported as the first voice-controlled flight. The conclusions of the flight trials specified the need for continuous speech recognition and for a stable level of performance regardless of the pilot, and noted that voice input was especially interesting in critical conditions (high Gs, stress), which unfortunately correspond to adverse environments that tend to lower recognition rates.</Paragraph> <Paragraph position="7"> The computing power needed for the Dynamic Programming algorithm led us to develop an ASIC chip. The MuPCD (Microprocessor for Dynamic Programming) was developed at LIMSI, together with the Bull company, under a contract from the Ministry of Telecommunications. It became available in 1989 (G. Quénot et al., 1989). It has 120,000 transistors in 2-micron CMOS technology, and delivers 70 MOPS. It allows for the recognition of up to 5,000 isolated words, or 300 connected words. This chip is used in the latest generation of Vecsys Datavox recognition systems.</Paragraph> <Paragraph position="8"> The work on language modeling was also an early project at LIMSI. It started with a set of experiments on the problem of phoneme-to-grapheme conversion in French. A typical 9-phoneme string can generate more than 32,000 possible combinations of segmentation into words and spellings of those words. In many cases, the fact that plural marks are not pronounced (-s at the end of nouns, -nt at the end of verbs) generates many of these homophones. First, a simple heuristic was tried with a 20,000-word lexicon, segmenting the phoneme string into the smallest number of words. It gave good results for the segmentation task, but also demonstrated the necessity of using a language model to improve the quality. This resulted in a collaboration with researchers using stochastic language modeling based on grammatical categories for document retrieval, and results were reported in 1979 on phoneme-to-grapheme conversion with a 270,000-word full-form lexicon (A. Andreewski et al., 1979). This approach was also applied to stenotype-to-grapheme conversion.</Paragraph>
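A minimal sketch of this kind of phoneme-to-grapheme decoding follows, under loud assumptions: the pseudo-phoneme keys, the toy homophone lexicon and the category-bigram scores are all invented for illustration and do not reproduce the 1979 system. It shows how silent French plural marks make &quot;chat&quot;/&quot;chats&quot; and &quot;mange&quot;/&quot;mangent&quot; homophones, and how a stochastic model over grammatical categories resolves them while segmenting the phoneme string into words.

# Toy pronunciation lexicon: pseudo-phoneme string -> homophone spellings,
# each tagged with a grammatical category (number marks are silent, hence
# several written forms per pronunciation). All entries are illustrative.
LEXICON = {
    "le":   [("les", "DET_PL"), ("lait", "NOUN_SG")],        # /le/
    "Sa":   [("chat", "NOUN_SG"), ("chats", "NOUN_PL")],     # /Sa/: -s silent
    "ma~Z": [("mange", "VERB_SG"), ("mangent", "VERB_PL")],  # /ma~Z/: -nt silent
}

# Invented category-bigram log-scores standing in for the stochastic model.
BIGRAM = {
    ("<s>", "DET_PL"): -0.1,
    ("DET_PL", "NOUN_PL"): -0.2, ("DET_PL", "NOUN_SG"): -5.0,
    ("NOUN_PL", "VERB_PL"): -0.3, ("NOUN_PL", "VERB_SG"): -5.0,
    ("NOUN_SG", "VERB_SG"): -0.3, ("NOUN_SG", "VERB_PL"): -5.0,
}

def decode(phonemes):
    """Jointly segment a phoneme string into words and pick spellings.

    Viterbi-style DP: best[i][category] = (score, words) over the first i
    phonemes, so segmentation and homophone choice are decided together.
    """
    best = {0: {"<s>": (0.0, [])}}
    for i in range(1, len(phonemes) + 1):
        for j in range(i):
            chunk = "".join(phonemes[j:i])   # candidate word pronunciation
            if chunk not in LEXICON or j not in best:
                continue
            for prev_cat, (score, words) in best[j].items():
                for spelling, cat in LEXICON[chunk]:
                    s = score + BIGRAM.get((prev_cat, cat), -10.0)
                    cur = best.setdefault(i, {})
                    if cat not in cur or s > cur[cat][0]:
                        cur[cat] = (s, words + [spelling])
    end = best.get(len(phonemes), {})
    return max(end.values())[1] if end else None

print(decode(["le", "Sa", "ma~Z"]))   # -> ['les', 'chats', 'mangent']

Without the category scores, every spelling combination of the homophones would tie, which is precisely the combinatorial ambiguity described above.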
<Paragraph position="9"> Another area of research is speaker verification (J. Mariani et al., 1983). The algorithm used for word-based recognition was adapted to speaker verification, with dynamic adaptation of the reference templates to the speaker's daily variations. The Sesame system was tested &quot;live&quot; at the Machines Parlantes exhibition in 1985, and had an impostor acceptance rate of 4 per 1,000, obtained under informal test conditions. The system has been in everyday operational use as the entry system at LIMSI, by about 100 users, since 1987.</Paragraph> </Section> <Section position="6" start_page="56" end_page="57" type="metho"> <SectionTitle> PRESENT PROJECTS </SectionTitle> <Paragraph position="0"> In the Speech Communication group, work is now conducted around two main projects: the Dictation project and the Dialog project.</Paragraph> <Paragraph position="1"> In the dictation project, several steps have been taken in the design of a Voice-Activated Typewriter (VAT) for French since the beginning of the project 10 years ago. Continuing the study of phoneme-to-grapheme conversion for continuous, error-free phonemic strings, using a large vocabulary and a natural language syntax, LIMSI participated in ESPRIT project 860, &quot;Linguistic Analysis of the European Languages&quot;. In this framework, the approach to language modeling developed at LIMSI was extended to 7 European languages. The language model was then linked to acoustic recognition, resulting first in a complete system (Hamlet) for a limited vocabulary (2,000 words) pronounced in isolation, and then in a 5,000-word VAT system taking advantage of the specialized MuPCD DTW chip. The complete system was demonstrated in Spring 1988. Now, work is being conducted within the ESPRIT Polyglot project, with the goal of designing speech-to-text and text-to-speech systems for the 7 languages. In this framework, the methods first developed at Olivetti for dictation in isolated mode are being adapted to French, and other methods, based on discrete and tied-mixture HMMs and on TDNN and TDNN-HMM combinations, are being developed for continuous speech recognition. Comparative tests are being conducted on part of the DARPA Resource Management Database.</Paragraph> <Paragraph position="2"> We have recorded BREF, a large read-speech corpus for French, containing over 36 GBytes of speech data from 120 speakers. The text materials were selected verbatim from the French newspaper Le Monde, so as to provide a large vocabulary (over 20,000 words) and a wide range of phonetic environments. Separate text materials, with similar distributional properties, were selected for training, development-test and evaluation purposes. A series of experiments on vocabulary-independent phone recognition has recently been carried out using this corpus. A baseline phone accuracy of 60% was obtained with context-independent phone models and no phone grammar, and a phone accuracy of 68.6% with context-dependent phone models and a bigram phone language model (J.L. Gauvain, L. Lamel, this conference).</Paragraph> <Paragraph position="3"> The dialog project has been explored in the framework of an air-traffic-controller training application. Currently, the training sessions are limited by the availability of the human instructor who plays the role of a pilot. Our goal is to replace him with a spoken dialog system. This allows for greater availability of the system, and also makes it possible to have several voices, corresponding to different pilots, in the synthesis module. In this project, speech understanding uses speech recognition in conjunction with a representation of the semantic and pragmatic knowledge related to the task. While the language is supposed to follow a pre-defined &quot;phraseology&quot;, most of the time this is not the case. The language model is a bigram model based on grammatical categories. The probabilities of word successions are changed depending on the previous step in the dialog (prediction), and corrections of the recognized sentence can be made using redundancy within the sentence and a word confusion matrix. In an evaluation test involving 6 speakers and 5 scripts averaging 20 sentences each, prediction improved the results by 10%, and correction added an extra 18.5% (the sentence understanding rate improved from 68% to 96.5%) (A. Matrouf et al., 1991). Prior to this work, large &quot;Wizard of Oz&quot; experiments were conducted (D. Luzzati, 1984), and a linguistic analysis of the resulting corpus, in a train timetable enquiry system simulation, was carried out (D. Luzzati, 1987). In general, all the implementations used linguistic analyses of real corpora in order to meet users' needs. The recording of a spontaneous speech database (Spot) has also been started.</Paragraph>
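The sketch below illustrates how such prediction and correction might be combined for a single recognized word. The vocabulary, dialog-state names and all probabilities are invented, since the source does not give the actual model; this is only one plausible instantiation. The dialog state reweights the category bigram, and the word confusion matrix lets the system recover the intended word from a likely misrecognition.

import math

# Illustrative numbers only: a category bigram, per-dialog-state boosts for
# what a pilot is likely to say next, and an acoustic word confusion matrix.
BIGRAM = {("ORDER", "VALUE"): 0.5, ("ORDER", "PREP"): 0.4}
PREDICTION = {"expect_flight_level": {"VALUE": 2.0}}  # dialog-state reweighting
CONFUSION = {                    # P(recognized | spoken), e.g. from held-out data
    "two": {"two": 0.6, "to": 0.4},
    "to":  {"to": 0.7, "two": 0.3},
}
CATEGORY = {"two": "VALUE", "to": "PREP"}

def rescore(recognized, prev_cat, dialog_state):
    """Re-rank confusable words for one recognized token.

    Combines (i) the category bigram, reweighted by what the current dialog
    step predicts, with (ii) the probability that `recognized` comes out
    when the candidate word was actually spoken (confusion matrix).
    """
    scores = {}
    for spoken, row in CONFUSION.items():
        cat = CATEGORY[spoken]
        lm = BIGRAM.get((prev_cat, cat), 1e-3)
        lm *= PREDICTION.get(dialog_state, {}).get(cat, 1.0)
        scores[spoken] = math.log(lm) + math.log(row.get(recognized, 1e-3))
    return max(scores, key=scores.get)

# The recognizer heard "to" right after an order, in a dialog step where a
# flight-level value is expected; with the prediction boost the corrected
# hypothesis is "two" (without the boost, "to" would win).
print(rescore("to", prev_cat="ORDER", dialog_state="expect_flight_level"))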
<Paragraph position="4"> Other research topics concern speech signal processing, with both basic research on wavelet analysis and the development of real-time, PC-based speech analysis tools (the Unice package, marketed by Vecsys). The study of phonological variations has been pursued on a text (&quot;La Bise et le Soleil&quot;) pronounced by several speakers, and will continue with the analysis of BREF. Another area of interest is the use of symbolic coding for improving cochlear implants. Also, apart from the classical Multi-Layer Perceptron and TDNN approaches, an original &quot;Guided Propagation&quot; connectionist model is being experimented with. In the hardware domain, the use of several MuPCD ASIC chips in parallel is now being implemented, and the design of a new chip, taking advantage of improved technology, is envisioned.</Paragraph> </Section> <Section position="7" start_page="57" end_page="57" type="metho"> <SectionTitle> APPLICATION OF THE ALGORITHMS DEVELOPED FOR SPEECH TO OTHER AREAS </SectionTitle> <Paragraph position="0"> The algorithms developed for speech recognition have been applied in other areas. The DP matching process developed for connected word recognition, with the MuPCD ASIC, has been adapted to the problem of optical character recognition. Instead of considering the recognition of individual characters after segmentation (possibly including segmentation errors), the complete line of characters is considered, and segmentation is included in the recognition process (M. Khemakhem, 1987). The algorithm has also been extended successfully to two dimensions in Computer Vision for matching similar images, with applications to stereovision and to movement analysis (G. Quénot, 1992). Preliminary studies of gesture recognition (throwing away, taking, holding tight...) using a Data Glove (TM) have also been conducted. It is expected that the increase in quality obtained with stochastic modeling, instead of template matching, in speech recognition can also be obtained in the fields of character recognition and computer vision, and we are now considering applying these techniques.</Paragraph> </Section>
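The sketch below illustrates the principle of this segmentation-free line recognition with a one-dimensional DP over pixel columns. It is a simplified, rigid-width rendition with illustrative names; the cited work warps template widths DTW-style and extends the matching to two dimensions.

import numpy as np

def recognize_line(line, templates):
    """Jointly segment and recognize a line of characters by DP.

    `line` is a (height x width) pixel array; `templates` maps each
    character to a reference array of the same height. As in connected
    word recognition, no prior segmentation is performed: the DP chooses
    character boundaries and identities at the same time.
    """
    width = line.shape[1]
    best = np.full(width + 1, np.inf)  # best[j]: cost of explaining columns 0..j
    best[0] = 0.0
    back = [None] * (width + 1)        # (start_column, character) backpointers
    for j in range(1, width + 1):
        for ch, tpl in templates.items():
            w = tpl.shape[1]
            if j - w < 0 or not np.isfinite(best[j - w]):
                continue
            # Rigid-width pixel match for brevity; allowing DTW-style width
            # warping inside each template is where the real gain lies.
            cost = best[j - w] + np.abs(line[:, j - w:j] - tpl).sum()
            if cost < best[j]:
                best[j], back[j] = cost, (j - w, ch)
    chars, j = [], width
    while j > 0 and back[j] is not None:   # trace the best path back
        j, ch = back[j]
        chars.append(ch)
    return "".join(reversed(chars))

# `templates` might map, e.g., "A" to a (height x w) bitmap cut from
# training data; hypothetical names, for illustration only.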
<Section position="8" start_page="57" end_page="57" type="metho"> <SectionTitle> THE LANGUAGE & COGNITION GROUP </SectionTitle> <Paragraph position="0"> The activities of the Language & Cognition group naturally have many interactions with those of the Speech Communication group. Several common research projects have been, or are being, conducted, such as the use of Conceptual Graphs to represent semantic information in a speech understanding system, or the stochastic modeling of semantic information. Also, the grapheme-to-phoneme conversion software has been used for error correction. There are also many interactions between the Connectionist Systems and Connectionist Models research topics in the two groups, and the &quot;Time & Space Representation&quot; topic has a close relationship with the Non-Verbal Communication group.</Paragraph> <Paragraph position="1"> A new activity is now starting in the field of cognitive psychology. A group is being created, integrating the former Center for Cognitive Psychology (Cepco), a University Paris XI laboratory, into LIMSI. It includes researchers in the psychology of reading and text comprehension, visuo-spatial mental representation, and cognitive ergonomics.</Paragraph> </Section> <Section position="9" start_page="57" end_page="57" type="metho"> <SectionTitle> SPEECH IN THE FRAMEWORK OF MULTIMODAL COMMUNICATION </SectionTitle> <Paragraph position="0"> Speech can be used with other communication modes in order to obtain a more versatile and reliable means of human-machine communication. A study of an automated telematic (voice/text) switchboard has been conducted jointly by the Speech Communication and Language & Cognition groups. Experiments combining speech recognition with gesture (touch screen) showed that using both together allows for better efficiency and better comfort (D. Teil et al., 1991), with gestural communication being preferable whenever low-level analog information needs to be given. Timing and co-reference are the difficult problems to solve in the integrated system.</Paragraph> </Section> <Section position="10" start_page="57" end_page="57" type="metho"> <SectionTitle> PERSPECTIVES FOR FUTURE RESEARCH </SectionTitle> <Paragraph position="0"> A more ambitious project is now starting, including computer vision and 3D modeling, natural language and knowledge representation, and speech and gestural communication.</Paragraph> <Paragraph position="1"> This project aims at examining the theoretical problems of model training in the framework of multimodal information: how non-verbal (visual, gestural) information can be used in building a language model, and how linguistic information can help build models of objects.</Paragraph> </Section> <Section position="11" start_page="57" end_page="1987" type="metho"> <SectionTitle> REFERENCES </SectionTitle> <Paragraph position="0"> &quot;French Ready terminal that speaks English&quot;, Electronics, June 26, 1975.</Paragraph> <Paragraph position="1"> A. Andreewski, J.P. Binquet, F. Debili, C. Fluhr, Y. Hlal, B. Pouderoux, J.S. Liénard, J. Mariani, &quot;Les dictionnaires en forme complète et leur utilisation dans la transformation lexicale et syntaxique de chaînes phonétiques correctes&quot;, 10èmes JEP du GALF, Grenoble, Mai-Juin 1979.</Paragraph> <Paragraph position="2"> J.L. Gauvain, J. Mariani, &quot;A method for connected word recognition and word spotting on a microprocessor&quot;, Proc. IEEE ICASSP 82, Paris, 3-5 Mai 1982.</Paragraph> <Paragraph position="3"> J.L. Gauvain, J. Mariani, J.S. Liénard, &quot;On the use of time compression for word-based recognition&quot;, ICASSP 83, Boston, April 14-16, 1983.</Paragraph> <Paragraph position="4"> J.L. Gauvain, L.F.
Lamel, &quot;Speaker-Independent Phone Recognition using BREF&quot;, DARPA Speech and Language Workshop, Arden House, February 1992.</Paragraph> <Paragraph position="5"> M. Khemakhem, J.L. Gauvain, J. Rivaillier, &quot;Reconnaissance de caractères imprimés par comparaison dynamique&quot;, 6ème Congrès AFCET-INRIA &quot;Reconnaissance des Formes et Intelligence Artificielle&quot;, Antibes, 16-20 Novembre 1987.</Paragraph> <Paragraph position="6"> J.S. Liénard, M. Mlouka, J. Mariani, J. Sapaly, &quot;Time segmentation of speech&quot;, Speech Communication Seminar, Stockholm, Août 1974.</Paragraph> <Paragraph position="7"> J.S. Liénard, J. Mariani, G. Renard, &quot;Intelligibilité de phrases synthétiques altérées : application à la transmission phonétique de la parole&quot;, ICA, Madrid, Juillet 1977.</Paragraph> <Paragraph position="8"> D. Luzzati, &quot;ORSO. Projet pour la constitution et l'étude de dialogues homme-machine&quot;, LIMSI internal report, Septembre 1984.</Paragraph> <Paragraph position="9"> D. Luzzati, &quot;ALORS: a skimming parser for spontaneous speech processing&quot;, Computer Speech and Language, Vol. 2, 1987.</Paragraph> <Paragraph position="10"> J. Mariani, J.S. Liénard, &quot;ESOPE 0 : un programme de compréhension de la parole continue procédant par prédiction-vérification aux niveaux phonétique, lexical et syntaxique&quot;, 1er Congrès AFCET &quot;Reconnaissance des Formes et Intelligence Artificielle&quot;, Chatenay-Malabry, Février 1978.</Paragraph> <Paragraph position="11"> J. Mariani, J.L. Gauvain, J.L. Soury, &quot;Un système de vérification du locuteur&quot;, 13èmes JEP du GALF, 28-30 Mai 1984.</Paragraph> <Paragraph position="12"> A. Matrouf, F. Néel, &quot;Use of Upper Level Knowledge to Improve Human-Machine Interaction&quot;, Venaco Workshop & ETRW on &quot;The Structure of Multimodal Dialogue&quot;, 1991.</Paragraph> </Section> </Paper>