<?xml version="1.0" standalone="yes"?> <Paper uid="H89-2028"> <Title>Advisory Committee:</Title> <Section position="3" start_page="204" end_page="208" type="metho"> <SectionTitle> 2 DETAILED CONCEPTS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="204" end_page="205" type="sub_section"> <SectionTitle> 2.1 A Better Likelihood Function </SectionTitle> <Paragraph position="0"> The uniform search is inefficient because it delays extension of the longer theories while it extends the shorter (poorer) theories. Instead, an approximation to an &quot;A*&quot; search \[1,2\] will be used. This uses a likelihood function which gives much better comparisons between theories of varying lengths and results in a much more efficient search. If properly implemented, it is an admissible search (i.e., it is guaranteed to find the best path). In practice, it may not be possible to compute some of the parameters, so the required approximations may compromise the guarantee. (In fact, intentionally using incorrect parameters can further reduce the search space and one may trade off computation for search error risk--see below.) One way of implementing the A* search is to use the difference between the actual log-probability of reaching a subgoal and an upper bound upon that log-probability as the search control function. A reasonably good upper bound may be computed for the CSR component, N-gram languages and, hopefully, also for NL grammars. (In practice, estimates for the upper bound might have to be used.) This likelihood function can be evaluated in a strictly left-to-right fashion and thus the search may begin before the end of the acoustic data is found.</Paragraph> <Paragraph position="1"> Thus, the basic costs used here will be log likelihoods (i.e., the difference between the upper bound log probability and the actual log probability). (The term cost as used here is more like value: high is good and low is bad.)
The stack likelihood function should also include some extra control parameters: stack_likelihood = CSR_likelihood + α*length + β*NLP_likelihood + γ*nr_words, where α is an acoustic length penalty, length is the amount of acoustic data covered by the theory, β is a grammar weight, γ is a word insertion penalty, and nr_words is the number of words in the theory.</Paragraph> <Paragraph position="2"> α controls the width of the search: α > 0 will encourage the longer theories and thus reduce the search and α < 0 will penalize the longer theories and thus increase the search. Since the length of the entire acoustic input is a constant across all theories, α cannot alter the relative likelihood of a complete theory--but it can prevent the best theory from being found first if it is too large. (This is, in effect, a pruning error.) β controls the relative weights of the acoustic and grammatical evidence. γ controls the relative number of insertion and deletion errors. In a perfect A* search, both CSR_likelihood and NLP_likelihood would be less than or equal to zero.</Paragraph> <Paragraph position="3"> By manipulating these parameters and the likelihoods returned by the CSR and NLP, it is possible to implement a wide variety of search strategies including uniform and A*. This interface is capable of operating with any of this range of strategies--the best one is a function of the CSR and NLP algorithmic sophistication and the allowable amount of computation. Finding the best set of likelihood function parameters is an optimization which can only be performed when the components are integrated into a complete SLS.</Paragraph> </Section> <Section position="2" start_page="205" end_page="206" type="sub_section"> <SectionTitle> 2.2 Partial Theory Memory </SectionTitle> <Paragraph position="0"> Memoryless CSR and NLP components as used in 1.1 are inefficient because they require recomputation of the embedded left sentence likelihoods.
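The stack likelihood combination of Section 2.1 can be sketched as follows; this is a minimal illustration, not part of the specification, and the parameter defaults are invented (the text only fixes the sign conventions: component likelihoods are at most zero, and higher stack likelihood is better).

```python
def stack_likelihood(csr_likelihood, nlp_likelihood, length, nr_words,
                     alpha=0.0, beta=1.0, gamma=0.0):
    """Combine component scores into the stack's search-control function.

    alpha: acoustic length penalty (alpha > 0 encourages longer theories
           and narrows the search; alpha < 0 widens it).
    length: amount of acoustic data covered by the theory.
    beta:  grammar weight; gamma: word insertion penalty.
    """
    return (csr_likelihood
            + alpha * length
            + beta * nlp_likelihood
            + gamma * nr_words)

# With alpha > 0, a longer theory with slightly worse component scores
# can outrank a shorter one, keeping the search narrow.
short = stack_likelihood(-10.0, -2.0, length=50, nr_words=3, alpha=0.05)
long_ = stack_likelihood(-12.0, -2.5, length=120, nr_words=6, alpha=0.05)
```

Note that since every complete theory covers the same acoustic input, the alpha term is a constant offset on complete theories and cannot change the final ranking, exactly as the text states.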
Thus, both the CSR and the NLP will cache the partial theories and the information required to efficiently compute any extensions of those theories. The theory identifiers will have a one-to-one correspondence with the theories.</Paragraph> <Paragraph position="1"> An alternative would be to store all partial theory information on the stack. This would allow an &quot;almost memoryless&quot; CSR and NLP. This scheme has been rejected for the present, due to its communications overhead. It might be useful in a later version for a loosely-coupled multi-processor environment. (See Sec. 3.2.)</Paragraph> </Section> <Section position="3" start_page="206" end_page="206" type="sub_section"> <SectionTitle> 2.3 Stochastic Grammars </SectionTitle> <Paragraph position="0"> Likelihoods (which are, of course, based upon probabilities) are the common language for communication between the two modules and the search control. Thus, grammars which give the probability of a full or partial sentence provide much more information to the combined CSR-NLP system than grammars which just accept or reject a sentence. The simple strategy of estimating the probability of a word as 1/(nr of possible words at this point) may or may not be useful. (It does not help the Resource Management word-pair grammar when used in our CSR.) A much better first cut at a stochastic grammar would be to use N-gram probabilities on top of an &quot;accept-reject&quot; grammar.</Paragraph> <Paragraph position="1"> In the long run, the probabilities should be integrated into the NL grammar, but the first-cut is a reasonable baseline. (Observe, for instance, IBM's success with purely N-gram grammars \[1\].) The control scheme used in this proposed specification is tolerant: it can handle full probabilities, branching factor based probabilities, or just acceptance-rejection &quot;probabilities&quot; (i.e., 1's or 0's).
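The first-cut stochastic grammar of Section 2.3--N-gram probabilities layered on an accept-reject grammar--could be sketched as below. The toy bigram grammar and counts are invented for illustration; a rejected extension gets probability 0, i.e., log10 likelihood -Inf in the wire format of Section 4.2.

```python
import math

def make_scorer(accepts, bigram_counts):
    """Return a scorer giving log10 P(word | prev) when the accept-reject
    grammar allows the extension, else -inf (probability 0)."""
    def score(prev, word):
        if (prev, word) not in accepts:
            return float("-inf")          # rejected by the grammar
        total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
        count = bigram_counts.get((prev, word), 0)
        if count == 0 or total == 0:
            return float("-inf")
        return math.log10(count / total)  # relative-frequency bigram prob
    return score

# Invented toy grammar and counts:
accepts = {("who", "is"), ("is", "he"), ("is", "she")}
counts = {("who", "is"): 4, ("is", "he"): 1, ("is", "she"): 3}
score = make_scorer(accepts, counts)
```

A pure accept-reject grammar is the degenerate case where every accepted extension gets the same probability; the N-gram layer is what gives the search control extra discrimination.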
Presumably, the more accurate the probabilities, the better the overall performance.</Paragraph> </Section> <Section position="4" start_page="206" end_page="206" type="sub_section"> <SectionTitle> 2.4 Fast matches </SectionTitle> <Paragraph position="0"> To reduce the search space, both the CSR and NLP will provide fast matches. These matches take a partial theory and use computationally-efficient methods for providing a quick estimate of the probabilities of the words which may follow. The lists from both components are combined to give the stack a list of words for the slower detailed match. The goal here is just to get the correct word on a small list of candidates. The &quot;fast&quot; probabilities will be used in combining, ordering, and pruning the list, but not in the stack likelihood function.</Paragraph> <Paragraph position="1"> Methods for performing acoustic fast matches are currently known. NLP fast matches may or may not be possible. (Typically, neither will be available in the early stages of module development.) The interface will still be able to operate, but a wider search and more computation will generally be required.</Paragraph> </Section> <Section position="5" start_page="206" end_page="207" type="sub_section"> <SectionTitle> 2.5 Multiple Output Sentences: Top-N Mode </SectionTitle> <Paragraph position="0"> The stack controller can continue to output sentences in decreasing likelihood order. Thus, the user may be asked to choose from a short list of outputs if the system cannot choose one sufficiently reliably.</Paragraph> <Paragraph position="1"> This mode may also be used to allow non-left-to-right NLP search strategies. The SC-CSR can operate without a grammar or with a purely stochastic grammar (such as N-gram) to generate a list of sentences with (stack) likelihoods. The NLP can then add its likelihood contribution and the best sentence in the list is chosen. 
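One way the two fast-match lists of Section 2.4 might be combined for the stack is sketched below; the combination rule (sum the fast log-likelihoods, keep the top k) is an assumption, since the specification only says the lists are combined, ordered, and pruned.

```python
def combine_fast_matches(csr_list, nlp_list, k=5):
    """Merge per-word fast-match scores from the CSR and NLP lists and
    keep the best k candidates for the slower detailed match.

    The 'fast' scores are used only to order and prune this list; they
    do not enter the stack likelihood function (Sec. 2.4).
    """
    merged = {}
    for word, ll in csr_list:
        merged[word] = merged.get(word, 0.0) + ll
    for word, ll in nlp_list:
        merged[word] = merged.get(word, 0.0) + ll
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return [w for w, _ in ranked[:k]]

# Invented example scores:
csr = [("he", -0.3), ("she", -0.4), ("ship", -2.0)]
nlp = [("he", -0.6), ("she", -0.1)]
candidates = combine_fast_matches(csr, nlp, k=2)
```

The goal, per the text, is only that the correct word survives onto a small candidate list; the detailed match then supplies the likelihoods actually used by the stack.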
(In the case of an accept-reject grammar, the NLP can simply reject non-grammatical sentences in order until one is accepted.) This decoupled mode will reduce the overall computation over the coupled mode only if the NLP requires significantly more computation than the no (or limited) grammar CSR.</Paragraph> <Paragraph position="2"> This mode may also be used in the tradeoff of search width vs. risk of search error. If the search is narrowed too much by increasing α, the sentences may be recognized out of (likelihood function) order. It may be cheaper to run a narrower search and choose the winner later than to run an (empirically) admissible search where the best answer will be output first. Again, these tradeoffs can only be determined in the context of a complete system.</Paragraph> </Section> <Section position="6" start_page="207" end_page="207" type="sub_section"> <SectionTitle> 2.6 Second Stage Re-evaluation or Discrimination </SectionTitle> <Paragraph position="0"> If a second stage re-evaluation of the evidence for the top few sentences is desired, the system can be operated in Top-N mode. When the Top-N list is full, a re-evaluation may be performed and the chosen sentence output. This mode only makes sense if a more detailed but greedier or non-left-to-right acoustic matching algorithm or NLP is used. This is similar to the decoupled mode mentioned in 2.5, except (hopefully) more accurate re-evaluation is being performed after the initial evaluation, using the stack. The search then proceeds in three stages: fast-coarse, medium-medium, and slow-detailed.</Paragraph> </Section> <Section position="7" start_page="207" end_page="207" type="sub_section"> <SectionTitle> 2.7 Speech Understanding </SectionTitle> <Paragraph position="0"> Since in speech understanding more than one word sequence can have the same meaning, a mechanism has been considered for combining theories.
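The decoupled Top-N mode of Section 2.5 amounts to the following; the function names are illustrative, and the N-best list and scores are invented. An accept-reject grammar is just the special case where the NLP contribution is 0 for accepted sentences and -inf for rejected ones.

```python
def choose_from_top_n(n_best, nlp_score):
    """Pick the winner from a decoupled Top-N list.

    n_best: [(stack_likelihood, sentence)] in decreasing likelihood order,
            as emitted by the SC-CSR running with no (or an N-gram) grammar.
    nlp_score: sentence -> NLP log-likelihood contribution (-inf rejects).
    """
    best, best_score = None, float("-inf")
    for stack_ll, sentence in n_best:
        total = stack_ll + nlp_score(sentence)
        if total > best_score:
            best, best_score = sentence, total
    return best

# Invented two-sentence list; the grammar here rejects "who is he".
n_best = [(-1.0, "who is he"), (-1.1, "who is she")]
reject_he = lambda s: 0.0 if s.endswith("she") else float("-inf")
winner = choose_from_top_n(n_best, reject_he)
```

As the text notes, this only pays off when the NLP is much more expensive than the weak-grammar CSR pass that generates the list.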
However, such a combination is incompatible with the Top-N re-evaluations described in 2.5 and 2.6. Once two theories are combined, they cannot be separated, but it may be necessary to distinguish between them in the re-evaluation.</Paragraph> <Paragraph position="1"> Thus, such a mechanism is not being included in this version of the interface specification.</Paragraph> <Paragraph position="2"> The &quot;normal&quot; output of this system is the best word sequence which matches the acoustic and NL constraints. In addition, a mechanism is included for the NLP to output the meaning of the recognized sentence. This meaning will be expressed as text (i.e., ascii characters to make it machine-independent), but its format is undefined by this specification. This will allow the NLP to feed an interpretation or a parse tree to a later module for execution. For example, a database query SLS might output in a database query language or it might output a parse tree for later interpretation into a database query. (Of course, if the SLS is fully integrated into the task, its explicit output might be ignored--its output might be a change of state in the task which may be observable by the user via other modalities. For instance, a chess-playing system might move the chess piece and the user would just see the move on the game board.)</Paragraph> </Section> <Section position="8" start_page="207" end_page="208" type="sub_section"> <SectionTitle> 2.8 Features </SectionTitle> <Paragraph position="0"> Linguistic features which have acoustic expression (prosodics, beginning of sentence, end of sentence, etc.) may be attached to words by the NLP. Global features, i.e., features which apply for the entire sentence, must be stated at the beginning of the sentence (due to left-to-right evaluations). A global feature is treated as if it is attached to each and every word.
There is a mechanism for the CSR and NLP to exchange feature lists to allow the systems to adapt to each other.</Paragraph> <Paragraph position="1"> The actual features are undefined by this specification. Only the syntax and mechanisms for transmission are defined here. The features themselves are just text strings--they have no meaning except as interpreted by the CSR and NLP.</Paragraph> </Section> <Section position="9" start_page="208" end_page="208" type="sub_section"> <SectionTitle> 2.9 Control </SectionTitle> <Paragraph position="0"> The stack is the sole controller of the system. It sends out a request to a slave and waits for a reply. Either slave (i.e., the CSR or NLP) may, in turn, make a request of a helper, but any such helpers must be slaves of the CSR or NLP. Neither the CSR nor the NLP nor any helpers may initiate any action involving the stack.</Paragraph> <Paragraph position="1"> 2.10 Integration of the Stack and the CSR.</Paragraph> <Paragraph position="2"> If the system were configured into three separate modules as shown in the figure, it would require excessive communications overhead. The communications with the NLP are simpler than with the CSR--time registration is not an issue for the NLP. Because the CSR must actually return time distributions (likelihood as a function of time), the stack and the CSR are integrated into a single stack-controller CSR module (the SC-CSR) to remove the higher bandwidth channel. This causes no change in the control structure: the stack is still the sole master and the CSR and the NLP are still its slaves. This also causes no change in the NLP interface.</Paragraph> <Paragraph position="3"> To allow efficient &quot;layered&quot; grammars, the NLP may request a search abort. This abort keeps the same acoustic data but re-initializes the stack to its initial state.
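The feature precedence of Sections 2.8 and 4.1--word features override global features, which override defaults--can be sketched as below; the specification leaves the features themselves undefined, so the feature names here are invented.

```python
def effective_features(defaults, global_features, word_features):
    """Resolve the feature-value pairs in force for one word.

    Precedence (Secs. 2.8, 4.1): word > global > default.  A global
    feature behaves as if it were attached to each and every word.
    """
    resolved = dict(defaults)          # lowest precedence
    resolved.update(global_features)   # asserted at sentence start
    resolved.update(word_features)     # attached to this word; wins
    return resolved

# Invented feature names and values:
defaults = {"stress": "none", "rate": "normal"}
global_f = {"rate": "fast"}
word_f = {"stress": "primary"}
features = effective_features(defaults, global_f, word_f)
```

Because evaluation is strictly left-to-right, the global features must already be known when the first word is scored, which is why the wire format requires them before any words.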
Thus, a system which first tries a restrictive grammar and then decides that this grammar is unable to match the input may abort the search and try again with a less restrictive grammar. The NLP may request as many aborts as necessary (although it may be necessary to place an upper limit enforced by the controller to prevent infinite loops).</Paragraph> </Section> </Section> <Section position="4" start_page="208" end_page="208" type="metho"> <SectionTitle> 2.12 Errors </SectionTitle> <Paragraph position="0"> Either the CSR or the NLP can make an error reply to the stack. Four responses are possible: ignore the error, abort this theory, abort this sentence, or abort the program. The first two responses have the option of reporting the error; the third and fourth must report the error. (For instance, in a demo one might wish to suppress error reporting, while in a debugging run, one might want to see all of the errors.) A possible cause for non-fatal errors might be features which are only implemented for some phones in the CSR.</Paragraph> </Section> <Section position="5" start_page="208" end_page="209" type="metho"> <SectionTitle> 2.13 Comments </SectionTitle> <Paragraph position="0"> Either the SC-CSR or the NLP may place comments onto the pipe interface. These comments will be ignored by the modules. Their only purpose is to place additional information into the communication streams for debugging or demonstration purposes.</Paragraph> </Section> <Section position="6" start_page="209" end_page="210" type="metho"> <SectionTitle> 3 THE ARCHITECTURE </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="209" end_page="209" type="sub_section"> <SectionTitle> 3.1 The Physical Connection </SectionTitle> <Paragraph position="0"> Logically, the architecture consists of the three parts listed above: the stack controller (SC), a CSR, and an NLP.
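The four error reactions of Section 2.12 (numbered 0-3 in the wire format of Section 4.3) could be dispatched as follows; the handler is a sketch, and the reporting callback is an invented detail. Reactions 0 and 1 may optionally report, while 2 and 3 must.

```python
IGNORE, DELETE_THEORY, GIVE_UP_SENTENCE, ABORT_PROGRAM = 0, 1, 2, 3

def handle_error(react, explanation=None, report=print):
    """React to an \\error reply from the CSR or NLP.

    react 0/1: reporting the error is optional (e.g., suppressed in a
    demo); react 2/3: the error must be reported (Sec. 2.12).
    """
    if react in (GIVE_UP_SENTENCE, ABORT_PROGRAM) and explanation is None:
        explanation = "(unspecified error)"   # reporting is mandatory here
    if explanation is not None:
        report(f"error react#{react}: {explanation}")
    return react

log = []
handle_error(IGNORE, None, report=log.append)            # silent is allowed
handle_error(GIVE_UP_SENTENCE, None, report=log.append)  # must report
```

In a debugging run one would pass every reply through with a reporting callback; in a demo, reactions 0 and 1 could stay silent, exactly the distinction the text draws.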
(As described in 2.10, the stack controller will be combined in a single process with the CSR, but will have the same functionality as if the two were separate.) The SC-CSR process will communicate with the NLP process via UNIX pipes. (A complete interchange has been benchmarked at about 1 ms on a SUN 4/260.) Therefore, the two processes need not be written in the same language and need not even be running on the same machine. (This interchange has been benchmarked at about 4 ms between a SUN 4/260 and 4/110 on our local Ethernet. Network overhead would be prohibitive if any number of gateways were involved.) The NLP will receive its commands on the I/O channel &quot;stdin&quot; and reply on I/O channel &quot;stdout&quot;. (Stderr will retain its usual function.) The specification as defined here uses standard (unnamed) pipes. An easy way to make intermachine pipes is with the rsh (remote shell) command. (Rsh sets up stdin, stdout, and stderr such that the network between the machines is invisible.) An alternative is to use sockets. (Pipes are implemented on some machines using sockets.) Sockets have some advantages, but are more complex to use. Thus any attempt to include sockets in this specification will be delayed until a clear need is developed. Once the socket-based interconnection is established, the communication would be the same as in the pipe-based interconnection.</Paragraph> <Paragraph position="1"> To minimize the communications overhead, the requests for detailed matches may be batched in groups which are extensions of the same theory. Thus the block of commands will be sent from the SC to the NLP and the replies will be expected as a block (in corresponding order) when the NLP is finished.
This will be particularly important when two separate machines are used.</Paragraph> </Section> <Section position="2" start_page="209" end_page="210" type="sub_section"> <SectionTitle> 3.2 Parallel Processing </SectionTitle> <Paragraph position="0"> If the CSR and NLP are implemented on separate machines, they may execute simultaneously--i.e., both may perform a fast match for the same theory, or both may perform the (possibly blocked) detailed analysis of a theory.</Paragraph> <Paragraph position="1"> Parallel execution of the CSR or NLP can be performed by removing several theories from the stack and sending each to a different processor. The difficulty centers on the cached theories which must be located and transmitted between processors on demand. (If all theory information were stored on the stack, the CSR and NLP modules would be memoryless and this would not be a problem.</Paragraph> <Paragraph position="2"> However, all partial theory information would have to be transferred from and onto the stack for every operation. The overhead would be prohibitive.) Only the form of parallelism described in the previous paragraph is supported in this version of the specification. If necessary, a later version could support the second form. Note that the system would eventually bottleneck on the stack controller.</Paragraph> </Section> </Section> <Section position="7" start_page="210" end_page="213" type="metho"> <SectionTitle> 4 THE DATA FORMAT SPECIFICATION </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="210" end_page="210" type="sub_section"> <SectionTitle> 4.1 The Messages </SectionTitle> <Paragraph position="0"> The messages will consist entirely of text in order to make them machine and language independent and easy to debug. (Appropriate use of hashing and special purpose I/O routines can be used to minimize the overhead of conversion to and from text.) 
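A toy sketch of the batched command-reply exchange of Section 3.1 over UNIX pipes is given below. The stub NLP child and the command strings are invented; a real NLP would be a separate program (started via rsh in the intermachine case) speaking the full protocol of Section 4, with a blank line terminating each block.

```python
import subprocess
import sys

# Invented stand-in NLP: echoes 'ok <command>' per line; a blank input
# line produces a blank output line, terminating the reply block.
nlp_stub = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    line = line.strip()\n"
    "    if not line:\n"
    "        print(''); sys.stdout.flush(); continue\n"
    "    print('ok ' + line); sys.stdout.flush()\n"
)

nlp = subprocess.Popen([sys.executable, "-c", nlp_stub],
                       stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                       text=True)

def send_block(commands):
    """Send a batch of commands (extensions of the same theory) as one
    block and collect the corresponding block of replies (Sec. 3.1)."""
    for cmd in commands:
        nlp.stdin.write(cmd + "\n")
    nlp.stdin.write("\n")                 # blank line terminates the block
    nlp.stdin.flush()
    replies = []
    while (reply := nlp.stdout.readline().strip()):
        replies.append(reply)
    return replies

replies = send_block(["fast 3 who", "fast 3 is"])
nlp.stdin.close()
nlp.wait()
```

Batching matters most across machines: one round trip per block rather than per command keeps the per-interchange latency (about 4 ms over Ethernet in the paper's benchmark) from dominating the search.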
Both processes will cache partial theories which will be identified by a positive unordered integer handle (label). Handle &quot;0&quot; will be the null theory. Communications between the stack-CSR and the NLP will be in a command-reply format.</Paragraph> <Paragraph position="1"> Features are expressed as &quot;word\{feature-value-pair1 feature-value-pair2 ...\}&quot; and global features must be asserted at the beginning of the sentence before any words: &quot;\{global-feature-value-pair1 global-feature-value-pair2\} word1 ...&quot;. The features are not interpreted in any way by the stack--they are simply passed as (ascii) text between the CSR and the NLP. (Note: feature=value will be passed from the NLP to the CSR--it will not be interpreted by the stack.) Word features will override global features which will, in turn, override default features. The actual features are not defined by this specification.</Paragraph> </Section> <Section position="2" start_page="210" end_page="210" type="sub_section"> <SectionTitle> 4.2 Data Formats </SectionTitle> <Paragraph position="0"> Most of the messages are short and can be transmitted as a single line terminated by a &quot;new-line&quot; character (i.e., a standard UNIX single line). Lists are transmitted as a group of lines, one list item per line, and terminated by a blank line. White space shall separate items on a line. All probabilities and likelihoods are expressed in log base 10, and log10(0) shall be expressed as &quot;-Inf&quot;.</Paragraph> <Paragraph position="1"> The numbers themselves will be written as \[-\]x.xxx (C language f format).</Paragraph> </Section> <Section position="3" start_page="210" end_page="213" type="sub_section"> <SectionTitle> 4.3 Top-N Mode Output Format </SectionTitle> <Paragraph position="0"> Top-N mode will output its sentences in the following format: a likelihood, white space, the sentence text, and a &quot;new line&quot; per sentence. A blank line terminates the output list.
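The wire formatting rules of Sections 4.2 and 4.3 (C-style f-format likelihoods, log10(0) as &quot;-Inf&quot;, blank-line-terminated lists) can be sketched as below; the three-decimal precision follows the \[-\]x.xxx example, and the helper names are invented.

```python
import io

def format_likelihood(x):
    """Likelihoods are log base 10; log10(0) goes out as '-Inf' and
    finite values in C-style f format (Sec. 4.2)."""
    return "-Inf" if x == float("-inf") else f"{x:.3f}"

def write_list(out, items):
    """Lists: one item per line, terminated by a blank line (Sec. 4.2)."""
    for item in items:
        out.write(item + "\n")
    out.write("\n")

# A one-sentence Top-N output block: likelihood, white space, sentence.
buf = io.StringIO()
write_list(buf, [f"{format_likelihood(-1.1)} who is he \\end"])
wire = buf.getvalue()
```

Keeping every message as plain text, as the specification requires, makes the stream machine- and language-independent and easy to inspect on the pipe.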
The list may or may not be in likelihood order. An ordered list will make further processing more convenient, but an unordered list can be output as soon as each sentence is found to allow parallelism with any later processing.</Paragraph> <Paragraph position="1"> The commands and their replies are:
fastmatch -- reply: yes/no -- Does the NLP have a fast match?
reset -- reply: ok -- Reset NLP to start state.
old-id <new-id word list> -- reply: <likelihood \[\end or \optend\] list> -- Append word to old-id and assign to new-id; respond with the incremental log-likelihood. The old-id appears only on the 1st item. If \end, this must be the end of the sentence; if \optend, an optional end of sentence.
meaning id -- reply: text-meaning -- Give the meaning of the sentence (language of text-meaning undefined).
fast id -- reply: <wd likelihood-list> -- Fast match.
purge id -- reply: ok -- Purge \[partial\] theory.
norm id -- reply: likelihood -- Get A* normalization prob for id.
(any) -- reply: \abort -- Abort search, restart with same input.
(any) -- reply: \error react# \[explanation\] -- Error return from any command: react#=0 ignore (the following lines are a normal response); react#=1 delete theory; react#=2 give up on sentence; react#=3 abort program. If present, the explanation is reported.
\# SC-CSR comment or \# NLP comment -- A comment from either source has a '#' at the start of the line.</Paragraph> <Paragraph position="2"> &quot;\&quot; is used to introduce anything which must be interpreted as a control word where it might be confused with a vocabulary word (&quot;\&quot; itself is written &quot;\\&quot;). All lists are terminated by a blank line.
A typical session for the sentence &quot;who is he&quot; might be (the acoustic probabilities do not show at the interface): -1.1 \end theory &quot;who is she \end&quot; (stack now picks 5 and outputs &quot;who is he&quot;) reset ok ready for next sentence</Paragraph> </Section> </Section> <Section position="8" start_page="213" end_page="213" type="metho"> <SectionTitle> 5 SIMULATORS </SectionTitle> <Paragraph position="0"> To allow each group to work on its part of the task (CSR or NLP) independently of the other part, a set of simulators will be used. These simulators will communicate using the protocols specified above. Both would be designed to be computationally cheap to expedite the developmental work.</Paragraph> <Paragraph position="1"> The stack/CSR simulator will be text-driven, use a dictionary and acoustic phoneme models generated from real speech data to cause errors to be &quot;realistic&quot; and have controls to adjust the error rates. (NLP evaluation tests could use defined settings of the control parameters.) The NLP simulator would use an N-gram language model for efficiency. For the Resource Management database, the BBN word-pair or BBN class grammar could be used.</Paragraph> </Section> class="xml-element"></Paper>