<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1043">
  <Title>SOME RESULTS ON STOCHASTIC LANGUAGE MODELLING</Title>
  <Section position="3" start_page="0" end_page="225" type="intro">
    <SectionTitle>
ISLAND-DRIVEN PARSING
</SectionTitle>
    <Paragraph position="0"> search process that generates partial interpretations of a spoken sentence called theories; theories are scored on the basis of a likelihood L = O(Pr(A | th) Pr(th)). We are interested in the computation of Pr(th) when th is a partial interpretation of a spoken sentence generated by a Stochastic Context-Free Grammar (SCFG) G_s. A recent report [2] reviews this problem and gives interesting results.</Paragraph>
    <Paragraph position="1"> The most popular parsers used in Automatic Speech Recognition (ASR) generate new theories in a left-to-right fashion. To score the theories generated by these parsers, the probability of all parse trees generating the first p words of a sentence must be computed; the appropriate algorithms are given in [8].</Paragraph>
    <Paragraph position="2"> Parsers that are "island-driven" proceed outward in both directions from islands of words that have been hypothesized with high acoustic evidence. Interesting island-driven parsers have been proposed in [13], [12], and [6], which also discuss the motivations for considering these parsers for ASU. None of these parsers uses a stochastic grammar.</Paragraph>
    <Paragraph position="3"> If island-driven parsers are used for generating partial interpretations of a spoken sentence, it is important to compute Pr(th), which is the probability that a SCFG generates sequences of words intermixed with gaps corresponding to portions of the acoustic signal that are still uninterpreted.</Paragraph>
    <Paragraph position="4"> Recent work provides a precise theoretical framework for this computation [3].</Paragraph>
    <Paragraph position="5"> Many different cases involving islands and gaps have been examined; space considerations do not permit us to give here the lengthy formulas obtained for each of these cases. Instead, this paper lists the cases along with the worst-case time complexity of the computation of Pr(th) for each. Perhaps the most striking result of this work is the sharp division between the cases where one must compute the probability that a partial tree generates substrings of a sentence intermixed with a gap of unknown length, and the cases where the gap has a known length. The former computation appears to have an unacceptable time complexity; the latter computation is quite tractable. For this reason a later section considers ways in which one might estimate the length of a gap.</Paragraph>
    <Section position="1" start_page="0" end_page="225" type="sub_section">
      <SectionTitle>
Definitions
</SectionTitle>
      <Paragraph position="0"> A SCFG is a quadruple G_s = (N, Σ, P, S), where N is a finite set of nonterminal symbols, Σ is a finite set of terminal symbols disjoint from N, P is a finite set of productions of the form H → α, H ∈ N, α ∈ (Σ ∪ N)*, and S ∈ N is a special symbol called the start symbol. Each production is associated with a probability, indicated with Pr(H → α).</Paragraph>
      <Paragraph position="1"> If the grammar is proper the following relation holds:</Paragraph>
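Assuming the standard definition of a proper SCFG (the probabilities of all productions that share a left-hand side sum to one), the relation can be written as follows. The symbols follow the definition of the grammar given above.

```latex
\sum_{\alpha \in (\Sigma \cup N)^{*}} \Pr(H \to \alpha) = 1
\qquad \text{for every } H \in N
```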
      <Paragraph position="3"> In the following we will always refer to SCFGs in Chomsky Normal Form (CNF).</Paragraph>
      <Paragraph position="4"> In the adopted formalism, u, v and t represent strings of already recognized terminals; i, j and l are position indices; p, q and r are shift indices; m indicates a (known) gap length; and k, h are used as running indices. Furthermore, z^(m) stands for a gap of unknown terminals with specified length m, while a gap of unknown terminals with unknown length is represented by z^(*). Finally, Σ* represents the set of all strings of finite length over Σ, while Σ^m, m ≥ 0, is the set of all strings in Σ* of length m.</Paragraph>
      <Paragraph position="5"> The derivation of a string in G_s is usually represented as a parse (or derivation) tree, which shows the rules employed. It is also possible to associate with each derivation tree the probability that it was generated from a nonterminal symbol H by the grammar G_s. This probability is the product of the probabilities of all the rules employed in the derivation. Given a string z ∈ Σ*, the notation H &lt; z &gt;, H ∈ N, indicates the set of all trees with root H generated by G_s and spanning z. Therefore Pr(H &lt; z &gt;) is the sum of the probabilities of these trees, i.e. the probability that the string z has been generated by G_s starting from symbol H.</Paragraph>
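As an illustration of the quantity just defined, the following minimal sketch sums the probabilities of all trees rooted in a nonterminal and spanning a fully known string, using the standard inside (CYK-style) recursion for a grammar in CNF. The toy grammar, the dictionary layout and the name inside_probability are assumptions made for this example and do not come from the paper.

```python
from collections import defaultdict

# Hypothetical toy SCFG in CNF (an assumption for illustration only).
# BINARY maps (H, F, G) to Pr(H -> F G); LEXICAL maps (H, a) to Pr(H -> a).
BINARY = {("S", "S", "S"): 0.3}
LEXICAL = {("S", "a"): 0.7}


def inside_probability(words, root="S", binary=BINARY, lexical=LEXICAL):
    """Sum the probabilities of all trees rooted in `root` that span `words`."""
    n = len(words)
    # chart[(i, j)][H] holds the probability that H derives words[i:j].
    chart = defaultdict(lambda: defaultdict(float))
    # Spans of width one are covered by lexical rules H -> a.
    for i, word in enumerate(words):
        for (h, a), p in lexical.items():
            if a == word:
                chart[(i, i + 1)][h] += p
    # Wider spans combine two adjacent sub-spans through a binary rule H -> F G.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (h, f, g), p in binary.items():
                    chart[(i, j)][h] += p * chart[(i, k)][f] * chart[(k, j)][g]
    return chart[(0, n)][root]
```

With this toy grammar the string a a a has two derivation trees, each using the binary rule twice and the lexical rule three times, so the function adds two equal products of rule probabilities.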
      <Paragraph position="6"> We assume that the grammar G_s is consistent. This means that the following condition holds:</Paragraph>
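Assuming the usual notion of consistency for a stochastic grammar (the probabilities of all finite strings derivable from the start symbol sum to one), the condition can be written as follows, with the angle brackets denoting the set of trees rooted in S that span z.

```latex
\sum_{z \in \Sigma^{*}} \Pr(S \langle z \rangle) = 1
```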
      <Paragraph position="8"> From this hypothesis it follows that a similar condition holds for all nonterminals.</Paragraph>
      <Paragraph position="9"> We are concerned with the computation of probabilities of strings involving islands. The assumed model of computation is the Random Access Machine, taken under the uniform cost criterion (see [1]). We will indicate with |P| the size of set P, i.e. the number of productions in G_s. We will also write f(z) = O(g(z)) whenever there exist constants c, z_0 &gt; 0 such that f(z) ≤ c g(z) for every z &gt; z_0. In the following section, we give the worst-case time complexity results we have derived.</Paragraph>
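To make the known-length case concrete, here is a rough sketch that computes the probability of islands separated by a gap of known length m by treating each gap position as a wildcard that matches every terminal. It is a plain cubic-time variant of the inside recursion, not the formulas derived in this work, which are not reproduced here. The toy grammar, the GAP marker and the names island_probability and with_gap are hypothetical.

```python
from collections import defaultdict

GAP = None  # hypothetical marker for one unknown terminal inside a gap

# Hypothetical proper CNF grammar (an assumption for illustration only).
BINARY = {("S", "A", "S"): 0.4, ("S", "B", "S"): 0.2}
LEXICAL = {("S", "a"): 0.4, ("A", "a"): 1.0, ("B", "b"): 1.0}


def with_gap(left, m, right=()):
    """Build the sequence u, m gap positions, v from islands u and v."""
    return list(left) + [GAP] * m + list(right)


def island_probability(symbols, root="S", binary=BINARY, lexical=LEXICAL):
    """Probability that `root` generates the islands with any filling of the gap."""
    n = len(symbols)
    chart = defaultdict(lambda: defaultdict(float))
    for i, sym in enumerate(symbols):
        for (h, a), p in lexical.items():
            # A gap position matches every terminal, so every lexical rule counts.
            if sym is GAP or a == sym:
                chart[(i, i + 1)][h] += p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (h, f, g), p in binary.items():
                    chart[(i, j)][h] += p * chart[(i, k)][f] * chart[(k, j)][g]
    return chart[(0, n)][root]
```

For instance, island_probability(with_gap(["a"], 1, ["a"])) adds the probabilities of the strings a a a and a b a, which is approximately 0.096 for this toy grammar. The sketch only works because the total length is fixed; when the gap length is unknown, no finite chart of this kind covers all fillings.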
    </Section>
  </Section>
</Paper>