<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1041"> <Title>RECENT ADVANCES IN JANUS: A SPEECH TRANSLATION SYSTEM</Title> <Section position="3" start_page="0" end_page="211" type="metho"> <SectionTitle> 1. ~TRODUCTION </SectionTitle> <Paragraph position="0"> In this paper we describe recent improvements of JANUS, a speech to speech translation system. Improvements have been made mainly along the following dimensions: 1.) better context-dependent modeling improves performance in the speech recognition module, 2.) improved language models, smoothing, and word equivalence classes improve coverage and robustness of the sentence that the system accepts, 3.) an improved N-best search reduces run-time from several minutes to now real time, 4.) trigram and parser rescoring improves selection of suitable hypotheses from the N-best list for subsequent translation. On the machine translation side, 5.) a cleaner interlinguawas designed and syntactic and domain-specific analysis were separated for greater reusability of components and greater quality of translation, 6.) a semantic parser was developed to achieve semantic analysis, should more careful analysis fail.</Paragraph> <Paragraph position="1"> The JANUS \[1, 2\] framework as it is presented here also allows us to experiment with components of a speech translation system, in an effort to achieve both robustness and high-quality translation. In the following we describe these efforts and system components that have been developed to date. At present, JANUS consists conceptually out of three major components: speech recognition, machine translation and speech synthesis. Since we have not made any significant attempts at improving performance on the synthesis end (DEC-talk and synthesizers produced by NEC and AEG-Daimler are used for English, Japanese and Gerrnan output, respectively), our discussion will focus on the recognition and translation parts.</Paragraph> </Section> <Section position="4" start_page="211" end_page="211" type="metho"> <SectionTitle> 2. RECOGNITION ENGINE </SectionTitle> <Paragraph position="0"> Our recognition engine uses several techniques to optimize the overall system performance. Speech input is preprocessed into time frames of spectral coefficients. Acoustic models are trained to give a score for each phoneme, representing the phoneme probability at the given frame. These scores are used by an N-best search algorithm to produce a list of sentencchypothcses. Based on thislist, more computationally expensive language models are then applied to achieve further improvement of recognition accuracy.</Paragraph> <Section position="1" start_page="211" end_page="211" type="sub_section"> <SectionTitle> 2.1. Acoustic modeling </SectionTitle> <Paragraph position="0"> For acoustic modeling, several alternative algorithms are being evaluated including TDNN, MS-TDNN, MLP and LVQ \[6, 5\]. In the main JANUS system, an LVQ algorithm with context-dependent phonemes is now used for speaker independent recognition. For each phoneme, there is a context independent set of prototypical vectors. The output scores for each phoneme segment are computed from the euclidian distance using context dependent segment weights.</Paragraph> <Paragraph position="1"> Error rates using context dependent phonemes are lower by a factor 2 to 3 for English (1.5 to 2 for German) than using context independent phonemes. 
<Paragraph position="1"> Error rates using context-dependent phonemes are lower by a factor of 2 to 3 for English (1.5 to 2 for German) than with context-independent phonemes. Results are shown in Table 1.</Paragraph> <Paragraph position="2"> The performance on the RM-task at comparable perplexities is significantly better than on the CR-task, suggesting that the CR-task is somewhat more difficult.</Paragraph> </Section> <Section position="2" start_page="211" end_page="211" type="sub_section"> <SectionTitle> 2.2. Search </SectionTitle> <Paragraph position="0"> The search module of the recognizer builds a sorted list of sentence hypotheses. Speed and memory requirements have been dramatically improved: although the number of hypotheses computed for each utterance was increased from 6 to 100, the time required for their computation was reduced from typically 3 minutes to 3 seconds.</Paragraph> <Paragraph position="1"> This was achieved by implementing the word-dependent N-best algorithm \[3\] as the backward pass of the forward-backward algorithm \[4\]: first, a fast first-best-only search is performed, saving the scores at each possible word ending. In a second pass, this information is used for aggressive pruning to reduce the search effort of the N-best search. A further speedup was achieved by dynamically adapting the beam width to keep the number of active states constant, and by carefully avoiding the evaluation of states in large inactive regions of words. Important for total system performance is the fact that the first-best hypothesis can already be analyzed by the MT modules while the N-best list is computed.</Paragraph> <Paragraph position="2"> All language models (word pairs, bigrams or smoothed bigrams, and trigrams for resorting) are now trained on more than 1000 CR-sentences, using word-class-specific equivalence classes (digits, names, towns, languages, etc.).</Paragraph> </Section> <Section position="3" start_page="211" end_page="211" type="sub_section"> <SectionTitle> 2.3. Resorting </SectionTitle> <Paragraph position="0"> The resulting N-best list is resorted using trigrams to further improve results. Resorting improves the word accuracy of the best-scoring hypothesis (created using smoothed bigrams) from 91.5% to 98%, and the average rank of the correct hypothesis within the list from 5.7 to 1.1. Much longer N-best lists (500-1000) have been used for experiments. However, it is very unlikely that a rescoring algorithm moves a hypothesis from the very bottom of such a long list to the first position. For practical application, a list of 100 hypotheses was found to be best.</Paragraph> </Section> </Section>
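A minimal sketch of such trigram resorting, assuming a simple probability table in place of the actual smoothed language models, might look like this:

import math

def trigram_logprob(words, trigram_probs):
    # trigram_probs: dict (w1, w2, w3) -> P(w3 | w1, w2); unseen trigrams
    # are floored to a small constant instead of a proper back-off model
    padded = ["<s>", "<s>"] + words + ["</s>"]
    logp = 0.0
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        logp += math.log(trigram_probs.get((w1, w2, w3), 1e-6))
    return logp

def resort_nbest(nbest, trigram_probs, lm_weight=1.0):
    # nbest: list of (acoustic_score, hypothesis_string), higher score = better
    rescored = [(score + lm_weight * trigram_logprob(hyp.split(), trigram_probs), hyp)
                for score, hyp in nbest]
    return sorted(rescored, reverse=True)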
<Section position="5" start_page="211" end_page="212" type="metho"> <SectionTitle> 3. THE MACHINE TRANSLATION (MT) ENGINE </SectionTitle> <Paragraph position="0"> The MT component that we used previously has now been replaced by a new module that can run several alternative processing strategies in parallel. To translate spoken language from one language to another, the analysis of spoken sentences, which suffer from ill-formed input and recognition errors, is most certainly the hardest part. Based on the list of N-best hypotheses delivered by the recognition engine, we can now attempt to select and analyze the most plausible sentence hypothesis with a view to producing an accurate and meaningful translation. Two goals are central in this attempt: high-fidelity and accurate translation wherever possible, and robustness or graceful degradation should attempts at high-fidelity translation fail in the face of ill-formed or misrecognized input. At present, three parallel modules attempt to address these goals: 1) an LR-parser-based syntactic approach, 2) a semantic-pattern-based approach, and 3) a connectionist approach. The most useful analysis from these modules is mapped onto a common Interlingua, a language-independent but domain-specific representation of meaning. The analysis stage attempts to derive a high-precision analysis first, using a strict syntax and domain-specific semantics. Connectionist and/or semantic parsers are currently applied as a back-up if the higher-precision analysis fails. The Interlingua ensures that alternate modules can be applied in a modular fashion and that different output languages can be added without redesign of the analysis stage.</Paragraph> <Paragraph position="1"> 3.1. Generalized LR Parser The first step of the translation process is syntactic parsing with the Generalized LR Parser/Compiler \[16\]. The Generalized LR parsing algorithm is an extension of LR parsing with a special device called the &quot;Graph-Structured Stack&quot; \[14\]; it can handle arbitrary context-free grammars while most of the LR efficiency is preserved. A grammar with about 455 rules for general colloquial English is written in a Pseudo Unification formalism \[15\], which is similar to Unification Grammar and LFG formalisms. Figure 2 shows the result of syntactic parsing of the sentence &quot;Hello is this the conference office&quot;. Robust GLR Parsing: Modifications have been made to make the Generalized LR Parser more robust against ill-formed input sentences \[18\]. In case the standard parsing procedure fails to parse an input sentence, the parser nondeterministically skips some word(s) in the sentence and returns the parse with the fewest skipped words. In this mode, the parser will return some parse(s) for any input sentence, unless no part of the sentence can be recognized at all.</Paragraph> </Section> <Section position="6" start_page="212" end_page="214" type="metho"> <SectionTitle> (HELLO IS THIS THE CONFERENCE OFFICE $) </SectionTitle> <Paragraph position="0"> ;+++ GLR Parser running to produce English structure +++ (1) ambiguities found and took 1.164878 seconds of real time</Paragraph> <Paragraph position="2"> In the example in figure 3, the input sentence &quot;Hello is this is this the office for the AI conference which will be held soon&quot; is parsed as &quot;Hello is this the office for the conference&quot; by skipping 8 words. Because the analysis grammar or the interlingua does not handle the relative clause &quot;which will be held soon&quot;, 8 is the fewest possible words to skip to obtain a grammatical sentence that can be represented in the interlingua. In Generalized LR parsing, an extra procedure is applied every time a word is shifted onto the Graph-Structured Stack. A heuristic similar to beam search makes the algorithm computationally tractable.</Paragraph> <Paragraph position="3"> When the standard GLR parser fails on all of the 20 best sentence candidates, this robust GLR parser is applied to the best sentence candidate.</Paragraph>
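The word-skipping idea can be illustrated by the following sketch, which assumes a hypothetical is_parsable predicate standing in for the GLR parser plus interlingua mapping; the real parser interleaves skipping with shift actions on the Graph-Structured Stack and prunes with a beam rather than enumerating subsets:

from itertools import combinations

def parse_with_fewest_skips(words, is_parsable, max_skips=10):
    # return (kept_words, n_skipped) for the smallest number of skipped words
    for n_skipped in range(min(max_skips, len(words)) + 1):
        # try every way of dropping exactly n_skipped words
        for dropped in combinations(range(len(words)), n_skipped):
            kept = [w for i, w in enumerate(words) if i not in dropped]
            if is_parsable(kept):
                return kept, n_skipped
    return None, None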
<Section position="1" start_page="212" end_page="212" type="sub_section"> <SectionTitle> 3.2. The Interlingua </SectionTitle> <Paragraph position="0"> This result, called a &quot;syntactic f-structure&quot;, is then fed into a mapper to produce an Interlingua representation. For the mapper, we use a software tool called the Transformation Kit \[17\]. A mapping grammar with about 300 rules is written for the Conference Registration domain of English.</Paragraph> <Paragraph position="1"> Figure 4 is an example of the Interlingua representation produced from the sentence &quot;Hello is this the conference office&quot;. In the example, &quot;Hello&quot; is represented as the speech act *ACKNOWLEDGEMENT, and the rest as the speech act *IDENTIFY-OTHER.</Paragraph> <Paragraph position="2"> Input sentence: (hello is this is this the AI conference office which will be held soon $)</Paragraph> <Paragraph position="4"> The JANUS interlingua is tailored to dialog translation. Each utterance is represented as one or more speech acts. A speech act can be thought of as the effect the speaker intends a particular utterance to have on the listener. Our interlingua currently has eleven speech acts, such as request direction, inform, and command. For the purposes of this task, each sentence utterance corresponds to exactly one speech act. So the first task in the mapping process is to match each sentence with its corresponding speech act. In the current system, this is done on a sentence-by-sentence basis. Rules in the mapping grammar look for cues in the syntactic f-structure such as mood, combinations of auxiliary verbs, and the person of the subject and object where it applies. In the future we plan to use more information from context in determining which speech act to assign to each sentence.</Paragraph> <Paragraph position="5"> Once the speech act is determined, the rule for that particular speech act is fired. Each speech act has a top-level semantic slot where the semantic representation for a particular instance of the speech act is stored during translation. This semantic structure is represented as a hierarchical concept list which resembles the argument structure of the sentence. Each speech act rule contains information about where in the syntactic structure to look for constituents to fill thematic roles such as agent, recipient, and patient in the semantic structure. Specific lexical rules map nouns and verbs onto concepts. In addition to the top-level semantic slot, there are slots where information about tone and mood is stored. Each speech act rule contains information about what to look for in the syntactic structure in order to know how to fill this slot. For instance, the auxiliary verb which is used in a command determines how imperative the command is. For example, 'You must register for the conference within a week' is much more imperative than 'You should register for the conference within a week'. The second example leaves some room for negotiation where the first does not.</Paragraph> </Section>
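A toy sketch of this mapping step, with invented cue tests and slot names that are far simpler than the actual Transformation Kit rules, might look like this:

def map_to_interlingua(f_structure):
    # f_structure: dict with keys such as 'mood', 'aux', 'subject', 'object'
    mood = f_structure.get("mood")
    aux = f_structure.get("aux")

    # speech-act selection from syntactic cues (mood, auxiliaries, person)
    if mood == "interrogative" and f_structure.get("subject") == "this":
        speech_act = "*IDENTIFY-OTHER"
    elif mood == "imperative" or aux in ("must", "should"):
        speech_act = "*COMMAND"          # illustrative speech-act name
    else:
        speech_act = "*INFORM"

    return {
        "speech-act": speech_act,
        # how imperative a command is depends on the auxiliary verb used
        "attitude": "strong" if aux == "must" else "weak",
        "semantic": {"predicate": f_structure.get("predicate"),
                     "agent": f_structure.get("subject"),
                     "patient": f_structure.get("object")},
    }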
<Section position="2" start_page="212" end_page="213" type="sub_section"> <SectionTitle> 3.3. The Generator </SectionTitle> <Paragraph position="0"> The generation of the target language from an Interlingua representation involves two steps. Figure 5 shows sample traces for German and Japanese, from the Interlingua in figure 4.</Paragraph> <Paragraph position="1"> First, with the same Transformation Kit used in the analysis phase, the Interlingua representation is mapped into a syntactic f-structure of the target language.</Paragraph> <Paragraph position="2"> There are about 300 rules in the generation mapping grammar for German, and 230 rules for Japanese. The f-structure is then fed into sentence generation software called &quot;GENKIT&quot; \[17\] to produce a sentence in the target language. A grammar for GENKIT is written in the same formalism as the Generalized LR Parser: phrase structure rules augmented with pseudo unification equations. The GENKIT grammar for general colloquial German has about 90 rules, and the Japanese one about 60 rules. Software called MORPHE is also used for morphological generation for German.</Paragraph> </Section> <Section position="3" start_page="213" end_page="214" type="sub_section"> <SectionTitle> 3.4. Semantic Pattern Based Parsing </SectionTitle> <Paragraph position="0"> A human-human translation task is even harder than human-machine communication, in that the dialog structure in human-human communication is more complicated and the range of topics is usually less restricted. These factors point to the requirement for robust strategies in speech translation systems.</Paragraph> <Paragraph position="1"> Our robust semantic parser combines frame-based semantics with semantic phrase grammars. We use a frame-based parser similar to the DYPAR parser used by Carbonell et al. to process ill-formed text \[9\], and the MINDS system previously developed at CMU \[10\]. Semantic information is represented in a set of frames. Each frame contains a set of slots representing pieces of information. In order to fill the slots in the frames, we use semantic fragment grammars. Each slot type is represented by a separate Recursive Transition Network (RTN), which specifies all ways of saying the meaning represented by the slot. The grammar is a semantic grammar: non-terminals are semantic concepts instead of parts of speech. The grammar is also written so that information-carrying fragments (semantic fragments) can stand alone (be recognized by a net) as well as being embedded in a sentence. Fragments which do not form a grammatical English sentence are still parsed by the system.</Paragraph> <Paragraph position="2"> Here there is not one large network representing all sentence-level patterns, but many small nets representing information-carrying chunks. Networks can &quot;call&quot; other networks, thereby significantly reducing the overall size of the system. These networks are used to perform pattern matches against input word strings. This general approach has been described in earlier papers \[7, 8\]. The operation of the parser can be viewed as &quot;phrase spotting&quot;. A beam of possible interpretations is pursued simultaneously. An interpretation is a frame with some of its slots filled. The RTNs perform pattern matches against the input string. When a phrase is recognized, it is used to extend all current interpretations; that is, it is assigned to the slots in active interpretations that it can fill. Phrases assigned to slots in the same interpretation are not allowed to overlap. In case of overlap, multiple interpretations are produced. When two interpretations for the same frame end with the same phrase, the lower-scoring one is pruned. This amounts to dynamic programming on series of phrases. The score for an interpretation is the number of input words that it accounts for. At the end of the utterance, the best-scoring interpretation is picked.</Paragraph> <Paragraph position="3"> Our strategy is to apply grammatical constraints at the phrase level and to associate phrases with frames. Phrases represent word strings that can fill slots in frames. The slots represent information which, taken together, the frame is able to act on.</Paragraph>
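A simplified sketch of this phrase-spotting and dynamic-programming scheme, with plain word-tuple patterns standing in for the Recursive Transition Networks, might look like this:

def spot_phrases(words, slot_patterns):
    # slot_patterns: dict slot_name -> list of word tuples (stand-in for RTNs)
    # returns (start, end, slot, phrase) matches found in the input
    matches = []
    for slot, patterns in slot_patterns.items():
        for pat in patterns:
            for i in range(len(words) - len(pat) + 1):
                if tuple(words[i:i + len(pat)]) == pat:
                    matches.append((i, i + len(pat), slot, " ".join(pat)))
    return matches

def best_interpretation(words, slot_patterns):
    # pick a non-overlapping set of phrases covering the most input words
    # (dynamic programming on series of phrases; score = words accounted for)
    matches = sorted(spot_phrases(words, slot_patterns), key=lambda m: m[1])
    best = {0: (0, [])}                      # end position -> (score, filled slots)
    for start, end, slot, phrase in matches:
        for pos, (score, filled) in list(best.items()):
            if pos <= start:
                cand = (score + (end - start), filled + [(slot, phrase)])
                if cand[0] > best.get(end, (-1, []))[0]:
                    best[end] = cand
    return max(best.values(), key=lambda x: x[0])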
<Paragraph position="4"> We also use semantic rather than lexical grammars. Semantics provide more constraint than parts of speech and must ultimately be dealt with in order to take actions. We believe that this approach offers a good compromise between constraint and robustness for the phenomena of spontaneous speech.</Paragraph> <Paragraph position="5"> Restarts and repeats most often occur between phrases, so individual phrases can still be recognized correctly. Poorly constructed input often consists of well-formed phrases, and is often semantically well-formed; it is only syntactically incorrect.</Paragraph> <Paragraph position="6"> The parsing grammar was designed so that each frame has exactly one corresponding speech act. Each top-level slot corresponds to some thematic role or other major semantic concept, such as action. Subnets correspond to more specific semantic classes of constituents. In this way, the interpretation returned by the parser can be easily mapped onto the interlingua, and missing information can be filled in with meaningful default values with minimal effort.</Paragraph> <Paragraph position="7"> Once an utterance is parsed in this way, it must then be mapped onto the interlingua discussed earlier in this paper. The mapping grammar contains rules for each slot and subnet in the parsing grammar which correspond to either concepts or speech acts in the interlingua. These rules specify how the relationship between a subnet and the subnets it calls will be represented in the interlingua structure it produces. Each rule potentially contains four parts; it need not contain all of them. The first part contains a default interlingua structure for the concept represented by a particular rule. If all else fails, this default representation will be returned. The next part contains a skeletal interlingua representation for that rule. This is used in cases where a net calls multiple subnets which fill particular slots within the structure corresponding to the rule.</Paragraph> <Paragraph position="8"> A third part is used if the slot is filled by a terminal string of words. This part of the rule contains a context which can be placed around that string of words so that it can be parsed and mapped by the LR system. It also contains information about where in the structure returned from the LR system to find the constituent corresponding to this rule. The final part contains rules for where in the skeletal structure to place the interlingua structures returned from the subnets called by this net.</Paragraph> </Section>
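Such a four-part rule might be represented roughly as follows (the names are illustrative, not taken from the actual system):

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MappingRule:
    # one rule tying a parsing-grammar subnet to the interlingua
    subnet: str                              # slot or subnet this rule covers
    default: dict                            # part 1: fallback interlingua structure
    skeleton: Optional[dict] = None          # part 2: skeleton filled by called subnets
    lr_context: Optional[str] = None         # part 3: context for parsing terminal strings
    placement: dict = field(default_factory=dict)  # part 4: where subnet results go

# illustrative rule for a hypothetical [conference-office] subnet
office_rule = MappingRule(
    subnet="conference-office",
    default={"concept": "OFFICE"},
    skeleton={"concept": "OFFICE", "for": None},
    lr_context="is this {}",                 # wrap the word string before LR parsing
    placement={"conference-name": ["for"]},  # subnet result fills the 'for' slot
)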
<Section position="4" start_page="214" end_page="214" type="sub_section"> <SectionTitle> 3.5. Connectionist Parsing </SectionTitle> <Paragraph position="0"> The connectionist parsing system PARSEC \[12\] is used as a fall-back module if the symbolic high-precision one fails to analyze the input. The important aspect of the PARSEC system is that it learns to parse sentences from a corpus of training examples. A connectionist approach to parsing spontaneous speech offers the following advantages: 1. Because PARSEC learns and generalizes from the examples given in the training set, no explicit grammar rules have to be specified by hand. In particular, this is of importance when the system has to cope with spontaneous utterances which frequently are &quot;corrupted&quot; by disfluencies, restarts, repairs or ungrammatical constructions.</Paragraph> <Paragraph position="1"> Specifying symbolic grammars that capture these phenomena has proven to be very difficult. On the other hand, there is a &quot;built-in&quot; robustness against these phenomena in a connectionist system.</Paragraph> <Paragraph position="2"> 2. The connectionist parsing process is able to combine symbolic information (e.g. syntactic features of words) with non-symbolic information (e.g. the statistical likelihood of sentence types). Moreover, the system can easily integrate different knowledge sources. For example, instead of just training on the symbolic input string, we trained PARSEC on both the symbolic input string and the pitch contour. After training was completed, the system was able to use the additional information to determine the sentence mood in cases where syntactic clues were not sufficient. We plan to extend the idea of integrating prosodic information into the parsing process in order to increase the performance of the system when it is confronted with corrupted input. We hope that prosodic information will help to indicate restarts and repairs.</Paragraph> <Paragraph position="3"> The current PARSEC system comprises six hierarchically ordered (back-propagation) connectionist modules. Each module is responsible for a specific task. For example, there are two modules which determine phrase and clause boundaries.</Paragraph> <Paragraph position="4"> Other modules are responsible for assigning to phrases or clauses labels which indicate their function and/or their relationship to other constituents. The top module determines the mood of the sentence.</Paragraph> <Paragraph position="5"> Recent Extensions: We applied a slightly modified PARSEC system to the domain of air travel information (ATIS).</Paragraph> <Paragraph position="6"> We could show that the system was able to analyze utterances like &quot;show me flights from boston to denver on us air&quot; and that the system's output representation could be mapped to a Semantic Query Language (SQL). In order to do this, we included semantic information (represented as binary features) in the lexicon. By doing the same for the CR-task we hope to increase the overall parsing performance.</Paragraph> <Paragraph position="7"> We have also changed PARSEC to handle syntactic structures of arbitrary depth (both left and right branching) \[13\].</Paragraph> <Paragraph position="8"> The main idea of the modified PARSEC system is to make it auto-recursive, i.e., in recursion step n it takes its output from the previous step n-1 as its input. This offers the following advantages: 1. Increased Expressive Power: The enhanced expressive power allows a much more natural mapping of linguistic intuitions to the specification of the training set.</Paragraph> <Paragraph position="9"> 2. Ease of Learning: Learning difficulties can be reduced. Because PARSEC is now allowed to make more abstraction steps, each individual step can be smaller and, hence, easier to learn.</Paragraph> <Paragraph position="10"> 3. Compatibility: Because PARSEC is now capable of producing arbitrary tree structures as its output, it can be more easily used as a submodule in NLP systems (e.g. the JANUS system). For example, it is conceivable to produce f-structures as the parsing output, which can then be mapped directly to the generation component \[11\].</Paragraph> </Section> </Section>
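The auto-recursive idea can be sketched schematically as follows (purely illustrative; the real recursion steps are carried out by the back-propagation modules, not by the pairing stub used here):

def group_step(constituents):
    # one hypothetical recursion step: merge adjacent constituents into pairs,
    # standing in for one pass of the connectionist modules
    merged = []
    for i in range(0, len(constituents) - 1, 2):
        merged.append((constituents[i], constituents[i + 1]))
    if len(constituents) % 2:                 # carry over an unpaired item
        merged.append(constituents[-1])
    return merged

def auto_recursive_parse(words):
    # apply the step to its own output until one root structure remains
    structure = list(words)
    while len(structure) > 1:
        structure = group_step(structure)
    return structure[0] if structure else None

# auto_recursive_parse(["show", "me", "flights", "to", "denver"])
# -> ((('show', 'me'), ('flights', 'to')), 'denver')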
<Section position="7" start_page="214" end_page="215" type="metho"> <SectionTitle> 4. SYSTEM INTEGRATION </SectionTitle> <Paragraph position="0"> The system accepts continuous speech speaker-independently in either input language, and produces synthetic speech output in near real-time. Our system can be linked to different language versions of the system or to corresponding partner systems via Ethernet or via telephone modem lines. This possibility has recently been tested between sites in the US, Japan and Germany to illustrate the feasibility of international telephone speech translation.</Paragraph> <Paragraph position="1"> The minimal equipment for this system is a Gradient Desklab 14 A/D converter, an HP 9000/730 (64 MB RAM) workstation for each input language, and a DECtalk speech synthesizer. Included in the processing are A/D conversion, signal processing, continuous speech recognition, language analysis and parsing (both syntactic and semantic) into a language-independent interlingua, text generation from that interlingua, and speech synthesis.</Paragraph> <Paragraph position="2"> The amount of time needed to process an utterance depends on its length and acoustic quality, but also on the perplexity of the language model, on whether or not the first hypothesis is parsable, and on the grammatical complexity and ambiguity of the sentence. While it can take the parser several seconds to process a long list of hypotheses for a complex utterance with many relative clauses (extremely rare in spoken language), the time consumed for parsing is usually negligible (0.1 seconds).</Paragraph> <Paragraph position="3"> For our current system, we have eliminated considerable communication delays by introducing socket communication between the pipelined parts of the system. Thus the search can start before the preprocessing program is done, and the parser starts working on the first hypothesis while the N-best list is computed.</Paragraph> </Section> </Paper>