<?xml version="1.0" standalone="yes"?>
<Paper uid="P91-1039">
  <Title>FACTORIZATION OF LANGUAGE CONSTRAINTS IN SPEECH RECOGNITION</Title>
  <Section position="4" start_page="299" end_page="300" type="metho">
    <SectionTitle>
2. Syntax Driven Continuous Speech
Recognition
</SectionTitle>
    <Paragraph position="0"> The general trend in large vocabulary continuous speech recognition research is that of building integrated systems (Huang, 1990; Murveit, 1990; Paul, 1990; Austin, 1990) in which all the relevant knowledge sources, namely acoustic, phonetic, lexical, syntactic, and semantic, are integrated into a unique representation. The speech signal, for the purpose of speech recognition, is represented by a sequence of acoustic patterns each consisting of a set of measurements taken on a small portion of signal (generally on the order of 10 reset). The speech recognition process is carried out by searching for the best path that interprets the sequence of acoustic patterns, within a network that represents, in its more detailed structure, all the possible sequences of acoustic configurations. The network, generally called a decoding network, is built in a hierarehical way.</Paragraph>
    <Paragraph position="1"> In current speech recognition systems, the syntactic structure of the sentence is represented generally by a regular grammar that is typically implemented as a finite state network (syntactic FSN). The ares of the syntactic FSN represent vocabulary items, that are again represented by FSN's (lexical FSN), whose arcs are phonetic units. Finally every phonetic unit is again represented by an FSN (phonetic FSN). The nodes of the phonetic FSN, often referred to as acoustic states, incorporate particular acoustic models developed within a statistical framework known as hidden Markov model (HMM). 1 The 1. The reader is referred to Rabiner (1989) for a tutorial introduction of HMM.</Paragraph>
    <Paragraph position="2"> model pertaining to an acoustic state allows computation of a likelihood score, which represents the goodness of acoustic match for the observation of a given acoustic patterns. The decoding network is obtained by representing the overall syntactic FSN in terms of acoustic states. Therefore the recognition problem can be stated as follows. Given a sequence of acoustic patterns, corresponding to an uttered sentence, find the sequence of acoustic states in the decoding network that gives the highest likelihood score when aligned with the input sequence of acoustic patterns. This problem can be solved efficiently and effectively using a dynamic programming search procedure. The resulting optimal path through the network gives the optimal sequence of acoustic states, which represents a sequence of phonetic units, and eventually the recognized string of words.</Paragraph>
    <Paragraph position="3"> Details about the speech recognition system we refer to in the paper can be found in Lee (1990/1). The complexity of such an algorithm consists of two factors. The first is the complexity arising from the computation of the likelihood scores for all the possible pairs of acoustic state and acoustic pattern. Given an utterance of fixed length the complexity is linear with the number of distinct acoustic states. Since a finite set of phonetic units is used to represent all the words of a language, the number of possible different acoustic states is limited by the number of distinct phonetic units. Therefore the complexity of the local likelihood computation factor does not depend either on the size of the vocabulary or on the complexity of the language.</Paragraph>
    <Paragraph position="4"> The second factor is the combinatorics or bookkeeping that is necessary for carrying out the dynamic programming optimization.</Paragraph>
    <Paragraph position="5"> Although the complexity of this factor strongly depends on the implementation of the search algorithm, it is generally true that the number of operations grows linearly with the number of arcs in the decoding network. As the overall number of arcs in the decoding network is a linear function of the number of ares in the syntactic network, the complexity of the bookkeeping factor grows linearly with the number of ares in the FSN representation of the grammar.</Paragraph>
    <Paragraph position="6">  The syntactic FSN that represents a certain task language may be very large if both the size of the vocabulary and the munber of syntactic constraints are large. Performing speech recognition with a very large syntactic FSN results in serious computational and memory problems. For example, in the DARPA resource management task (RMT) (Price, 1988) the vocabulary consists of 991 words and there are 990 different basic sentence structures (sentence generation templates, as explained later). The original structure of the language (RMT grammar), which is given as a non-deterministic finite state semantic grammar (Hendrix, 1978), contains 100,851 rules, 61,928 states and 247,269 arcs. A two step automatic optimization procedure (Brown, 1990) was used to compile (and minimize) the nondeterministic FSN into a deterministic FSN, resulting in a machine with 3,355 null arcs, 29,757 non-null arcs, and 5832 states. Even with compilation, the grammar is still too large for the speech recognizer to handle very easily. It could take up to an hour of cpu time for the recognizer to process a single 5 second sentence, running on a 300 Mflop Alliant supercomputer (more that 700 times slower than real time). However, if we use a simpler covering grammar, then recognition time is no longer prohibitive (about 20 times real time).</Paragraph>
    <Paragraph position="7"> Admittedly, performance does degrade somewhat, but it is still satisfactory (Lee, 1990/2) (e.g. a 5% word error rate). A simpler grammar, however, represents a superset of the domain language, and results in the recognition of word sequences that are outside the defined language. An example of a covering grammars for the RMT task is the so called word-pair (WP) grammar where, for each vocabulary word a list is given of all the words that may follow that word in a sentence. Another covering grammar is the so called null grammar (NG), in which a word can follow any other word. The average word branching factor is about 60 in the WP grammar. The constraints imposed by the WP grammar may be easily imposed in the decoding phase in a rather inexpensive procedural way, keeping the size of the FSN very small (10 nodes and 1016 arcs in our implementation (Lee, 1990/1) and allowing the recognizer to operate in a reasonable time (an average of 1 minute of CPU time per sentence) (Pieraccini, 1990). The sequence of words obtained with the speech recognition procedure using the WP or NG grammar is then used as input to a second stage that we call the semantic decoder.</Paragraph>
  </Section>
  <Section position="5" start_page="300" end_page="302" type="metho">
    <SectionTitle>
3. Semantic Decoding
</SectionTitle>
    <Paragraph position="0"> The RMT grammar is represented, according to a context free formalism, by a set of 990 sentence generation templates of the form:</Paragraph>
    <Paragraph position="2"> where a generic ~ may be either a terminal symbol, hence a word belonging to the 991 word vocabulary and identified by its orthographic transcription, or a non-terminal symbol (represented by sharp parentheses in the rest of the paper). Two examples of sentence generation templates and the corresponding production of non-terminal symbols are given in Table 1 in which the symbol e corresponds to the empty string.</Paragraph>
    <Paragraph position="3"> A characteristic of the the RMT grammar is that there are no reeursive productions of the kind:</Paragraph>
    <Paragraph position="5"> For the purpose of semantic decoding, each sentence template may then be represented as a FSN where the arcs correspond either to vocabulary words or to categories of vocabulary words. A category is assigned to a vocabulary word whenever that vocabulary word is a unique element in the tight hand side of a production.</Paragraph>
    <Paragraph position="6"> The category is then identified with the symbol used to represent the non-terminal on the left hand side of the production. For instance, following the example of Table 1, the words SHIPS, FRIGATES, CRUISERS, CARRIERS, SUBMARINES, SUBS, and VESSELS belong to the category &lt;SH/PS&gt;, while the word LIST belongs to the category &lt;LIST&gt;. A special word, the null word, is included in the vocabulary and it is represented by the symbol e.</Paragraph>
    <Paragraph position="7"> Some of the non-terminal symbols in a given sentence generation template are essential for the representation of the meaning of the sentence, while others just represent equivalent syntactic variations with the same meaning. For instance,  the correct detection by the recognizer of the words uttered in place of the non-terminals &lt;SHIPS&gt; and &lt;THREATS&gt;, in the former examples, is essential for the execution of the correct action, while an error introduced at the level of the nonterminals &lt;OPTALL&gt;, &lt;OP'ITHE&gt; and &lt;LIST&gt; does not change the meaning of the sentence, provided that the sentence generation template associated to the uttered sentence has been correctly identified. Therefore there are non-terminals associated with essential information for the execution of the action expressed by the sentence that we call semantic variables. An analysis of the 990 sentence generation templates allowed to define a set of 69 semantic variables.</Paragraph>
    <Paragraph position="8"> The function of the semantic decoder is that of finding the sentence generation template that most likely produced the uttered sentence and give the correct values to its semantic variables. The sequence of words given by the recognizer, that is the input of the semantic decoder, may have errors like word substitutions, insertions or deletions. Hence the semantic decoder should be provided with an error correction mechanism.</Paragraph>
    <Paragraph position="9"> With this assumptions, the problem of semantic decoding may be solved by introducing a distance criterion between a string of words and a sentence template that reflects the nature of the possible word errors. We defined the distance between a string of words and a sentence generation templates as the minimum Levenshtein 2 distance between the string of words and all the string of words that can be generated by the sentence generation template.</Paragraph>
    <Paragraph position="10"> The Levenshtein distance can be easily computed using a dynamic programming procedure. Once the best matching template has been found, a traceback procedure is executed to recover the modified sequence of words.</Paragraph>
    <Section position="1" start_page="301" end_page="302" type="sub_section">
      <SectionTitle>
3.1 Semantic Filter
</SectionTitle>
      <Paragraph position="0"> After the alignment procedure described above, a semantic check may be performed on the words that correspond to the non-terminals 2. The Levenshtein distance (Levenshtein, 1966) between two strings is defined as the minimum number of editing operations (substitutions, deletions, and insertions) for transforming one string into the other.  associated with semantic variables in the selected template. If the results of the check is positive, namely the words assigned to the semantic variables belong to the possible values that those variables may have, we assume that the sentence has been correctly decoded, and the process stops. In the case of a negative response we can perform an additional acoustic or phonetic verification, using the available constraints, in order to find which production, among those related to the considered nonterminal, is the one that more likely produced the acoustic pattern. There are different ways of carrying out the verification. In the current implementation we performed a phonetic verification rather than an acoustic one. The recognized sentence (i.e. the sequence of words produced by the recognizer) is transcribed in terms of phonetic units according to the pronunciation dictionary used in speech decoding. The template selected during semantic decoding is also transformed into an FSN in terms of phonetic units. The transformation is obtained by expanding all the non-terminals into the corresponding vocabulary words and each word in terms of phonetic units. Finally a matching between the string of phones describing the recognized sentence and the phone-transcribed sentence template is performed to find the most probable sequence of words among those represented by the template itself (phonetic verification). Again, the matching is performed in order to minimize the Levenshtein distance. An example of this verification procedure is shown in Table 2.</Paragraph>
      <Paragraph position="1"> The first line in the example of Table 2 shows the sentence that was actually uttered by the speaker. The second line shows the recognized sentence. The recognizer deleted the word WERE, substituted the word THERE for the word THE and the word EIGHT for the word DATE. The semantic decoder found that, among the 990 sentence generation templates, the one shown in the third line of Table 2 is the one that minimizes the criterion discussed in the previous section. There are three semantic variables in this template, namely &lt;NUMBER&gt;, &lt;SHIPS&gt; and &lt;YEAR&gt;. The backtracking procedure associated to them the words DATE, SUBMARINES, and EIGHTY TWO respectively. The semantic check gives a false response for the variable &lt;NUMBER&gt;. In fact there are no productions of the kind &lt;NUMBER&gt; := DATE. Hence the recognized string is translated into its phonetic representation. This representation is aligned with the phonetic representation of the template and gives the string shown in the last line of the table as the best interpretation.</Paragraph>
    </Section>
    <Section position="2" start_page="302" end_page="302" type="sub_section">
      <SectionTitle>
3.2 Acoustic Verification
</SectionTitle>
      <Paragraph position="0"> A more sophisticated system was also experimented allowing for acoustic verification after semantic postprocessing.</Paragraph>
      <Paragraph position="1"> For some uttered sentences it may happen that more than one template shows the very same minimum Levenshtein distance from the recognized sentence. This is due to the simple metric that is used in computing the distance between a recognized string and a sentence template. For example, if the uttered sentence is:</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="302" end_page="303" type="metho">
    <SectionTitle>
WHEN WILL THE PERSONNEL CASUALTY
REPORT FROM THE YORKTOWN BE
</SectionTitle>
    <Paragraph position="0"> and the recognized sentence is:</Paragraph>
  </Section>
  <Section position="7" start_page="303" end_page="303" type="metho">
    <SectionTitle>
WILL THE PERSONNEL CASUALTY REPORT
THE YORKTOWN BE RESOLVED
</SectionTitle>
    <Paragraph position="0"> there are two sentence templates that show a minimum Levenshtein distance of 2 (i.e. two words are deleted in both cases) from the recognized sentence, namely:  In this case both the templates are used as input to the acoustic verification system. The final answer is the one that gives the highest acoustic score. For computing the acoustic score, the selected templates are represented as a FSN in terms of the same word HMMs that were used in the speech recognizer. This FSN is used for constraining the search space of a speech recognizer that runs on the original acoustic representation of the uttered sentence.</Paragraph>
  </Section>
class="xml-element"></Paper>