<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1082"> <Title>Research on Architectures for Integrated Speech/Language Systems in Verbmobil</Title> <Section position="3" start_page="0" end_page="486" type="intro"> <SectionTitle> 2 Design and Implementation of </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="484" type="sub_section"> <SectionTitle> Incremental Interactive Speech Parsers </SectionTitle> <Paragraph position="0"> In a Left Right Incremental architecture (LRI), higher level modules can work in parallel with lower level modules. The obvious benefits of such an arrangement are twofold: the system does not have to wait for a speaker to stop talking, and top-down constraints from higher level to lower level modules can be used easily. To achieve LRI behavior the individual modules must fulfill the following requirements: Processing proceeds incrementally along the time axis (&quot;left to right&quot;).</Paragraph> <Paragraph position="1"> Pieces of output have to be transferred to the next module as soon as possible.</Paragraph> <Paragraph position="2"> So far in INTARC-1.3 we have achieved an LRI style coupling of four different modules: word recognition module, syntactic parser, semantic module and prosodic boundary module. Our word recognition module is a modified Viterbi decoder, where two changes in the algorithm design were made: we use only the forward search pass, and whenever a final HMM state is reached for an active word model, a corresponding word hypothesis is sent to the parser. Hence backward search becomes a part of the parsing algorithm. The LRI parsing algorithm is a modified active chart parser with an agenda driven control mechanism. The chart vertices correspond to the frames of the signal representation. Edges correspond to word or phrase hypotheses, being partial in the case of active edges. 
A parsing cycle corresponds to a new time point related to the utterance; in every cycle a new vertex is created and new word hypotheses ending at that time point are read and inserted into the chart. In one cycle, a backward search is performed to the beginning of the utterance or to some designated time point in the past constituting a starting point for grammatical analysis. Search is guided by a weighted linear combination of acoustic score, bigram score, prosody score, grammar derivation score and grammatical parsability. The search procedure is a beam search implemented as an agenda access mechanism. The grammar is a probabilistic typed UG with separate rules for pauses and other spontaneous speech phenomena.</Paragraph> </Section> <Section position="2" start_page="484" end_page="484" type="sub_section"> <SectionTitle> 2.1 Basic Objects </SectionTitle> <Paragraph position="0"> In the following we use record notation to refer to subcomponents of an object. A chart vertex Vt corresponds to frame number t. Vertices have four lists with pointers to edges ending in and starting in that vertex: inactive-out, inactive-in, active-out and active-in. A word hypothesis W is a quadruple (from, to, key, score) with from and to being the start and end frames of W.</Paragraph> <Paragraph position="1"> W.key is the name of the lexical entry of W and W.score is the acoustic score of W for the frames spanned, given by a corresponding HMM acoustic word model. An edge E consists of from, the start vertex, and to, a list of end vertices. Note that after a Viterbi forward pass identical word hypotheses always come in sequence, differing only in ending time. E.actual is the last vertex added to E.to in an operation. These &quot;families&quot; of hypotheses are represented as one edge with a set of end vertices. E.words keeps the covered string of word hypotheses while SCORE is a record keeping score components. 
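The basic objects just described could be rendered roughly as the following Python records (a sketch; field names such as frm are our own, since the paper's "from" is a Python keyword):

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    t: int                                   # frame number
    inactive_out: list = field(default_factory=list)
    inactive_in: list = field(default_factory=list)
    active_out: list = field(default_factory=list)
    active_in: list = field(default_factory=list)

@dataclass
class WordHyp:
    frm: int        # start frame ("from" in the paper)
    to: int         # end frame
    key: str        # name of the lexical entry
    score: float    # acoustic score from the HMM word model

@dataclass
class Edge:
    frm: Vertex     # start vertex
    to: list        # list of end vertices: a "family" of hypotheses
    words: list     # covered string of word hypotheses
    score: dict     # SCORE record: Inside-X / Outside-X components
```

Representing a whole family of word hypotheses as one edge with a set of end vertices keeps the chart compact after the Viterbi forward pass.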
Besides that, an edge consists of a grammar rule E.rule and E.next, a pointer to some element of the right hand side of E.rule or NIL. As in standard active chart parsing an edge is passive if E.next = NIL, otherwise it is active. E.cat points to the left hand side of the grammar rule. SCORE is a record with entries for inside and outside probabilities given to an edge by acoustic, bigram, prosody and grammar model: Inside-X Model scores for the spanned portion of an edge.</Paragraph> <Paragraph position="2"> Outside-X Optimistic estimates for the portion from vertex 0 to the beginning of an edge.</Paragraph> <Paragraph position="3"> For every vertex we keep a best first store of scored edge pairs. We call that store Agenda[i] in cycle i.</Paragraph> </Section> <Section position="3" start_page="484" end_page="485" type="sub_section"> <SectionTitle> 2.2 Basic Operations </SectionTitle> <Paragraph position="0"> There are five basic operations that define the parsing algorithm. The two operations Combine and Seek-Down are similar to the well known Earley algorithm operations Completer and Predictor. Furthermore, there are two operations to insert new word hypotheses, Insert and Inherit. All these operations can create new edges, so operations to calculate new scores from old ones are attached to them. In order to implement our beam search method appropriately but simply, we define an operation Agenda-Push, which selects pairs of active and passive edges to be pruned or to be processed in the future. The basic operations are given below in simplified notation.</Paragraph> <Paragraph position="1"> For a pair of active and passive edges (A, I), if A.next = I.cat and I.from ∈ A.to, insert edge</Paragraph> <Paragraph position="3"> The operator ⊕ performs an addition of a number to every element of a set. Trans(X,A,I) is the specific transition penalty a model will give to two edges. 
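The score bookkeeping just defined might look as follows in Python; this is a hedged sketch, and the bigram table and fallback penalty are invented for illustration:

```python
# Hypothetical bigram log-penalties; the table contents are ours, not
# the paper's.
BIGRAM = {("on", "thursday"): -0.7}

def oplus(scores, x):
    """The ⊕ operator: add a number to every element of a score set
    (one score per end vertex of a 'family' edge)."""
    return {t: s + x for t, s in scores.items()}

def trans(model, A, B):
    """Trans(X, A, B): model-specific transition penalty between the
    last word covered by A and the first word covered by B."""
    if model == "acoustic":
        return 0.0                        # always zero, can be neglected
    if model == "bigram":
        return BIGRAM.get((A["words"][-1], B["words"][0]), -5.0)
    return 0.0                            # other models omitted in this sketch
```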
In the case of acoustic scores, the penalty is always zero and can be neglected. In the bigram case it will be the transition from the last word covered by A to the first word covered by B.</Paragraph> <Paragraph position="4"> Whenever an active edge A is inserted, insert an edge E for every rule R such that A.next = E.cat, E.rule = R, E.from = A.actual, E.to = {A.actual}</Paragraph> <Paragraph position="6"> E.inside-X. This recursive operation of introducing new active edges is precompiled in our parser and extremely efficient.</Paragraph> <Paragraph position="7"> For a new word hypothesis W = (a,i,key,score) such that no W' = (a,i-1,key,score') exists, insert an edge E with E.rule = lex(key), E.cat =</Paragraph> <Paragraph position="9"> grammar score of lex(key).</Paragraph> <Paragraph position="10"> For a new word hypothesis W = (a,i,key,score) such that a W' = (a,i-1,key,score') exists: For all E in Vi-1.inactive-in or Vi-1.active-in: If last(E.words) = key then add Vi to E.to, add (i,E.Inside-Acoustic[i-1] - score' + score) to E.Inside-Acoustic and add (i,E.Outside-</Paragraph> <Paragraph position="12"> If E is active, perform a Seek-Down on E in Vi.</Paragraph> <Paragraph position="13"> Whenever an edge E is inserted into the chart, if E is active then for all passive A such that A.from ∈ E.to and combined-score(E,A) > Beam-Value, insert (E,A,combined-score(E,A)) into the actual agenda. If E is passive then for all active A such that E.from ∈ A.to and combined-score(A,E) > Beam-Value, insert (A,E,combined-score(A,E)) into the actual agenda. Combined-Score is a linear combination of the outside components of an edge C which would be created by A and E in a Combine operation. Beam-Value is calculated as a fixed offset from the maximum Combined-Score on an agenda. Since we process best-first inside the beam, the maximum is known when the first triple is inserted into an agenda. 
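The agenda mechanism just described (beam as a fixed offset below the maximum combined score, best-first removal) can be sketched as follows; this is our own minimal reading, not the INTARC implementation:

```python
import heapq
import itertools

class Agenda:
    """Best-first store of scored (active, passive) edge pairs, pruned
    by a beam: a pair survives only if its combined score lies within a
    fixed offset (Beam-Value) of the best score on this agenda."""

    def __init__(self, beam_offset):
        self.beam_offset = beam_offset
        self.best = float("-inf")
        self.heap = []
        self.tie = itertools.count()      # tie-breaker for equal scores

    def push(self, active, passive, combined_score):
        self.best = max(self.best, combined_score)
        if combined_score > self.best - self.beam_offset:
            # max-heap via negated score
            heapq.heappush(self.heap,
                           (-combined_score, next(self.tie), active, passive))

    def pop(self):
        """Agenda-Pop: return the best surviving triple, or None."""
        while self.heap:
            neg, _, a, p = heapq.heappop(self.heap)
            if -neg > self.best - self.beam_offset:   # re-check the beam
                return (a, p, -neg)
        return None
```

A pair admitted early can still fall out of the beam later, which is why `pop` re-checks each triple against the current Beam-Value.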
Agenda-Pop will remove the best triple from an actual agenda and return it.</Paragraph> </Section> <Section position="4" start_page="485" end_page="485" type="sub_section"> <SectionTitle> 2.3 A simple LRI lattice parser </SectionTitle> <Paragraph position="0"> The following control loop implements a simple LRI lattice parser.</Paragraph> <Paragraph position="1"> 1. T = 0. Create VT 2. Insert initial active edge E into VT, with E.next = S 3. Increment T. Create VT 4. For every W with W.end = T: Insert(W) or Inherit(W) 5. Until Agenda[T] is empty: (a) Combine(Agenda-Pop) (b) When combination with initial edge is successful, send result to SEMANTICS 6. Communicate with PROSODY and go to 3</Paragraph> </Section> <Section position="5" start_page="485" end_page="485" type="sub_section"> <SectionTitle> 2.4 The Grammar Model </SectionTitle> <Paragraph position="0"> The UG used in our experiments consists of 700 lexical entries and 60 rules. We used a variant of inside-outside training to estimate a model of UG derivations. It is a rule bigram model similar to a PCFG with special extensions for UG type operations. The probability of future unifications is made dependent on the result type of earlier unifications. The model is described in more detail in (Weber 1994a; Weber 1995); it is very similar to (Brew 1995).</Paragraph> </Section> <Section position="6" start_page="485" end_page="485" type="sub_section"> <SectionTitle> 2.5 LRI Coupling with Prosody </SectionTitle> <Paragraph position="0"> In INTARC we use four classes of boundaries: B0 (no boundary), B2 (phrase boundary), B3 (sentence boundary) and B9 (real break). The prosody module, developed at the University of Bonn, classifies time intervals according to these classes. A prosody hypothesis consists of a beginning and ending time and model probabilities for the boundary types, which sum to one. 
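A prosody hypothesis as just described (a time interval plus boundary-class probabilities summing to one) might be modelled like this; the class is illustrative only:

```python
import math

class ProsodyHyp:
    """A prosody hypothesis: a time interval plus model probabilities
    for the four boundary classes, which must sum to one."""
    CLASSES = ("B0", "B2", "B3", "B9")  # none / phrase / sentence / break

    def __init__(self, frm, to, probs):
        assert set(probs) == set(self.CLASSES)
        assert math.isclose(sum(probs.values()), 1.0)
        self.frm, self.to, self.probs = frm, to, probs

    def covers(self, t):
        """True if the vertex at frame t falls inside this interval,
        i.e. the hypothesis is attached to that vertex as an attribute."""
        return self.frm <= t < self.to
```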
A prosodic transition penalty used in the Combine operation was taken to be the score of the best combination of a bottom-up boundary hypothesis Bx and a trigram score (lword, Bx, rword). Here lword is the last word of the edge to the left and rword is the first word spanned by the edge to the right. Prosody hypotheses are consumed by the parser in every cycle and represented as attributes of vertices which fall inside a prosodic time interval. In initial tests we already achieved a reduction in the number of edges of about 10% without any change in recognition rate, using a very simple trigram with only five word categories.</Paragraph> </Section> <Section position="7" start_page="485" end_page="486" type="sub_section"> <SectionTitle> 2.6 Experimental Results </SectionTitle> <Paragraph position="0"> In a system like INTARC-1.3, the analysis tree is of much higher importance than the recovered string; for the goal of speech translation an adequate semantic representation for a string with word errors is more important than a good string with a wrong reading. The grammar scores have only indirect influence on the string; their main function is picking the right tree. We cannot measure something like a &quot;tree recognition rate&quot; or &quot;rule accuracy&quot;, because there is no treebank for our grammar. The word accuracy results cannot be compared to word accuracy as usually applied to an acoustic decoder in isolation. We counted only those words as recognized which could be built into a valid parse from the beginning of the utterance. Words to the right which could not be integrated into a parse were counted as deletions, although they might have been correct in standard word accuracy terms. This evaluation method is much harder than standard word accuracy, but it appears to be a good approximation to &quot;rule accuracy&quot;. 
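The strict counting scheme can be made concrete with a small helper; this is our own formulation of the rule that unparsed right-context words count as deletions, ignoring insertions for simplicity:

```python
def strict_word_accuracy(reference, parsed_prefix):
    """Only words built into a valid parse from the start of the
    utterance count as recognized; reference words beyond the parsed
    prefix count as deletions, even if the decoder hypothesized them
    correctly."""
    hits = sum(1 for r, p in zip(reference, parsed_prefix) if r == p)
    deletions = len(reference) - len(parsed_prefix)   # unparsed right context
    substitutions = len(parsed_prefix) - hits
    return (len(reference) - deletions - substitutions) / len(reference)
```

For example, if the parser integrates only two of four reference words into a parse and one of those is wrong, the strict accuracy is 0.25 even though the decoder may have recognized the remaining words correctly.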
Using this strict method we achieved a word accuracy of 47%, which is quite promising.</Paragraph> <Paragraph position="1"> Results on top-down prediction of possible word hypotheses by the parser, inspired by (Kita et al. 1989), have already been published in (Hauenstein and Weber 1994a; Hauenstein and Weber 1994b), (Weber 1994a), and (Weber 1995).</Paragraph> <Paragraph position="2"> Recognition rates were improved there for read speech. In spontaneous speech we could not achieve the same effects.</Paragraph> </Section> <Section position="8" start_page="486" end_page="486" type="sub_section"> <SectionTitle> 2.7 Current Work </SectionTitle> <Paragraph position="0"> Our current work, which led to INTARC-2.0, uses a new approach to the interaction of syntax and semantics and a revision of the interaction of the parser with a new decoder. For the latter we implemented a precompiler for word-based prediction, which in our experience so far is clearly superior to the previous word-class based prediction. For the interaction of syntax and semantics we proceed as follows: a new turn-based UG has been written, for which a context-sensitive stochastic training is being performed. The resulting grammar is then stripped down to a pure type skeleton which is actually used for syntactic parsing. Using full structure sharing in the syntactic chart, which contains only packed edges, we achieve a complexity of O(n^3). In contrast, for semantic analysis a second, unpacked chart is used, whose edges are provided by an unpacker module which is the interface between the two analysis levels.</Paragraph> <Paragraph position="1"> The unpacker, which has exponential complexity, selects only the n best scored packed edges, where n is a parameter. Only if semantic analysis fails does it request further edges from the unpacker. 
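The lazy best-n unpacking interface could be sketched as follows (a minimal reading of the mechanism, with invented names):

```python
import heapq

class Unpacker:
    """Sketch of the interface between the packed syntactic chart and
    semantics: hand out unpacked edges best-first, n at a time, so the
    exponentially expensive unpacking is performed only on demand."""

    def __init__(self, scored_edges):
        # max-heap of (score, edge) pairs via negated scores
        self.heap = [(-score, edge) for score, edge in scored_edges]
        heapq.heapify(self.heap)

    def request(self, n):
        """Return up to the n best edges not yet handed out; semantics
        calls this again only if its analysis fails."""
        batch = []
        while self.heap and len(batch) < n:
            _, edge = heapq.heappop(self.heap)
            batch.append(edge)
        return batch
```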
In this way, the overall computational effort is kept as low as possible.</Paragraph> </Section> </Section> </Paper>