<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1074">
  <Title>Speech-Based Retrieval Using Semantic Co-Occurrence Filtering</Title>
  <Section position="3" start_page="373" end_page="375" type="metho">
    <SectionTitle>
3. N-Best Matching
</SectionTitle>
    <Paragraph position="0"> For each spoken word, the recognizer outputs the most likely corresponding phone sequence. As a result of recognition errors, the phonetic sequence may not match any entry in the phonetic dictionary, or worse, might match an incorrect word.</Paragraph>
    <Paragraph position="1"> For example, when a speaker intended the word &amp;quot;president&amp;quot; the recognizer output &amp;quot;P R EH S EH D EH N T&amp;quot; would incorrectly match the word &amp;quot;precedent&amp;quot;. We therefore employ a statistical model of the errors typically made by the recognizer and use it to determine what words were likely to have been said.</Paragraph>
    <Paragraph position="2"> Given the phonetic sequence produced by the recognizer, and a statistical model of recognition errors, we want to efficiently determine the n most likely entries in the phonetic dictionary to have been the actual spoken word. As will become apparent, our objective is to make sure the intended word is somewhere in the list. We have investigated two methods for producing the n-best word hypotheses. The first follows a generate-and-test strategy and the second, more successful approach involves an HMM-based search through the phonetic dictionary.</Paragraph>
    <Paragraph position="3"> In the remainder of this section we will discuss the characterization and estimation of error statistics, and then describe the n-best algorithms.</Paragraph>
    <Section position="1" start_page="374" end_page="374" type="sub_section">
      <SectionTitle>
3.1. Characterizing Recognizer Errors
</SectionTitle>
      <Paragraph position="0"> Errors made by the recognizer are described by matrices containing probabilities of various substitution, deletion and insertion errors. We produce the error matrices by using an alignment program that compares phonetic recognizer output for a set of spoken words with the correct phonetic transcriptions for those words. (This set comprises 1000 words).</Paragraph>
      <Paragraph position="1"> Speaker characteristics are also modelled in the error matrices, as axe systematic pronunciation differences between the phonetic dictionary and the speaker. For words that are generated automatically (as mentioned in Section 2.3) we would expect a separate distribution to be helpful because the characteristics of an automatic system are likely to have errors distributed in a different way.</Paragraph>
      <Paragraph position="2"> The results described in this paper are based on a context independent error model. However, recognition errors are strongly correlated with context, and an improved model would use context dependent statistics. Both of the n-best methods described below are easily adapted to the use of context dependent statistics.</Paragraph>
      <Paragraph position="3"> Given the relatively small amount of training data used for estimating error statistics, some form of smoothing is desirable. We employ a Laplacian estimator - if phone i occurs a total of Ni times and is recognized as phone j a total of Ni~ times, the estimated probability of such a substitution is</Paragraph>
      <Paragraph position="5"> where M is the size of the phonetic alphabet (M ---- 39 for our phone set.) 3.2. Generate and Test Our initial method for determining the n-best word hypotheses employed a best-first approach. The most likely phone substitutions/insertions/deletions are applied in a best-first order to the phone string produced by the recognizer. After each such modification if the resulting phone string is present in the phonetic index, it is added to the n-best list with its associated probability (being the product of the probabilities of the modifications applied to the original string in order to obtain it). Finite-state recognizers are used to determine whether a phone string is present in the index. This search method has the potential advantage that it does not require matching against every entry 'in the phonetic index to produce the n-best hypotheses.</Paragraph>
    </Section>
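    A minimal sketch of this best-first generate-and-test strategy, under stated assumptions: single_edits is a hypothetical generator of all one-edit variants of a phone string together with their log-probabilities from the error matrices, and phonetic_index is a hash map standing in for the finite-state recognizers over the phonetic index:

        import heapq
        import math

        def generate_and_test(phones, single_edits, phonetic_index, n=30, max_pops=50000):
            """phones: tuple of phones output by the recognizer.
            single_edits(seq): yields (log_prob, new_seq) for every phone string one
                substitution, insertion or deletion away from seq.
            phonetic_index: dict mapping phone tuples to dictionary words."""
            heap = [(0.0, phones)]            # min-heap on negated log-probability
            seen = {phones}
            hypotheses = []
            pops = 0
            while heap and len(hypotheses) < n and pops < max_pops:
                neg_logp, seq = heapq.heappop(heap)
                pops += 1
                if seq in phonetic_index:     # an index entry matches this edited string
                    hypotheses.append((phonetic_index[seq], math.exp(-neg_logp)))
                for edit_logp, new_seq in single_edits(seq):
                    if new_seq not in seen:
                        seen.add(new_seq)
                        heapq.heappush(heap, (neg_logp - edit_logp, new_seq))
            return hypotheses

    The max_pops cap merely bounds the expansion when few edited strings are present in the index; the HMM search described next scores every index entry instead.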
    <Section position="2" start_page="374" end_page="375" type="sub_section">
      <SectionTitle>
3.3. HMM Search
</SectionTitle>
      <Paragraph position="0"> This involves matching the recognizer output against a special HMM network for each phonetic entry in the index (n.b.</Paragraph>
      <Paragraph position="1"> these HMM's are quite separate from those used by the phonetic recognizer).</Paragraph>
      <Paragraph position="2"> Let p(wlyl, y2,...,y,) be the probability that word to was spoken given that the phonetic output produced by the recognizer is yl, y2,..., y,. It is necessary to find the n words for which p(to\[yl, y2,.-., y,) is greatest. By Bayes law: p(to\[y,, ~2 ..... Yr,) = p(zta, y2,..., y,dw)P(w) p(~,,y~ .... ,y.)  The prior probabilities P(to) are assumed uniform here and p(yl, y2,..., y,) is independent of w, so the problem is to find the n words for which P(yl, y2,-.., y,\[w) is maximum. If the phonetic dictionary entry for word to is zl, z2,..., z,~, then given the error statistics, the probability can be computed by adding the probability of every sequence of substitutions, deletions and insertions, which when applied to za, x2,..., z,n results in the sequence ya, y2, .... y,. Assuming that these types of errors are statistically independent, the calculation can be performed efficiently using dynamic programming. By defining a discrete HMM for w, in which the output symbols are phones, the calculation reduces to the computation of the probability that y~, y2,..., y, would be produced by the HMM (i.e. the &amp;quot;forward&amp;quot; probability). null For example, the structure of the HMM for the word ,go, consisting of phonemes /g/ and /ow/is shown in Figure 4.</Paragraph>
      <Paragraph position="3"> The large states represent the phones of the word, and have output probabilities determined from the substitution probabilities. The remaining output states (smaller and gray in the figure) model possible insertions. The output probabilities for these states are the estimated insertion probabilities, conditioned on the event that an insertion occurs. Self loops on these states allow for the possibility of multiple insertions. The null state underneath each large phone state models the deletion of that phone, and the null states underneath insertion states allow for the possibility that no insertion occurs at that position. The transition probabilities are determined from estimated insertion and deletion probabilities.</Paragraph>
      <Paragraph position="4"> The HMM structure shown in Figure 4 could be replaced by a structure having no null states. However the structure chosen is preferable for two reasons. First, the computation of p(yt, y2 .... , ynlza, z2 ..... zm) requires only O(mn) operations rather than O(m2n) which would be required without null states. Second, the computation for this structure is easily implemented using the phonetic pronunciation for each word to index a table of transition and output probabilities, so that an HMM does not need to be explicitly stored for each word.</Paragraph>
      <Paragraph position="5"> We have implemented the n-best search efficiently and a pass through 43,000 phonetic index entries takes a few seconds.</Paragraph>
      <Paragraph position="6"> Including the signal processing (also done in software) the system takes between 5-10 seconds to produce the n-best hypotheses per spoken word (running on a Sun SPARC-10, using a value of n = 30). After the HMM search is complete we have a list of the most likely matching words and their associated probabilities.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="375" end_page="375" type="metho">
    <SectionTitle>
4. Semantic Co-Occurrence Filtering
</SectionTitle>
    <Paragraph position="0"> Let us consider an example where the user speaks the words &amp;quot;president&amp;quot; and &amp;quot;kennedy&amp;quot; into the system. These might result in the following rank ordered lists of word hypotheses: president: (precedent, prescient, president...) kennedy: (kennerty, kennedy, kemeny, remedy...)  In neither case is the intended word the most likely, although both are present and near the tops of the lists. The next step effectively uses the text of the encyclopedia (accessed via the IR system) as a semantic filter. In the encyclopedia the only members of the above lists which co-occur in close proximity are =president&amp;quot; and &amp;quot;kennedy&amp;quot;. The intended words of the query are semantically related and thus co-occur close to each other (many times in this example) but the hypotheses that are the result of recognition errors do not.</Paragraph>
    <Paragraph position="1"> Each spoken word is represented by an OR term containing its most likely word hypotheses. These terms are combined with an AND operator having a proximity constraint on its members. For our example we might have: 17 unsuccessful queries all failed because the correct word was not present in the top 30 hypotheses. For each spoken word we compared the rank of the correct phonetic hypothesis output from the n-best component with that produced after semantic co-occurrence filtering. Table 1 shows that such filtering finds relevant documents while simultaneously improving recognition performance.</Paragraph>
    <Paragraph position="2"> Some of the successful queries are shown below (practically all of the test queries comprise two or three words).</Paragraph>
    <Paragraph position="3">  (OR kennerty, kennedy, kemeny, remedy... )) This query is submitted to the IR system and segments of text that satisfy the constraints are retrieved. The word hypotheses involved in each text segment are then identified. They are used to score the segments and also to rank the best word hypotheses. Scoring includes the phonetic likelihood of the hypotheses, the total number of occurrences of specific hypothesis combinations in all retrieved text segments, and typical IR word weighting.</Paragraph>
  </Section>
</Paper>