Discriminative Syntactic Language Modeling for Speech Recognition

2 Background

2.1 Previous Work

Techniques for exploiting stochastic context-free grammars for language modeling have been explored for more than a decade. Early approaches included algorithms for efficiently calculating string prefix probabilities (Jelinek and Lafferty, 1991; Stolcke, 1995) and methods that exploit such algorithms to produce n-gram models (Stolcke and Segal, 1994; Jurafsky et al., 1995). The work of Chelba and Jelinek (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000; Chelba, 2000) involved the use of a shift-reduce parser trained on Penn Treebank style annotations, which maintains a weighted set of parses as it traverses the string from left to right. Each word is predicted by each candidate parse in this set at the point when the word is shifted, and the conditional probability of the word given the previous words is taken as the weighted sum of the conditional probabilities provided by each parse. In this approach, the probability of a word is conditioned on the top two lexical heads on the stack of the particular parse. Enhancements in the feature set and improved parameter estimation techniques have extended this approach in recent years (Xu et al., 2002; Xu et al., 2003).

Roark (2001a; 2001b) pursued a different derivation strategy from Chelba and Jelinek, and used the parse probabilities directly to calculate the string probabilities. This work made use of a left-to-right, top-down, beam-search parser, which exploits rich lexico-syntactic features from the left context of each derivation to condition derivation move probabilities, leading to a very peaked distribution. Rather than normalizing a prediction of the next word over the beam of candidates, as in Chelba and Jelinek, in this approach the string probability is derived by simply summing the probabilities of all derivations for that string in the beam.

Other work on syntactic language modeling includes that of Charniak (2001), which made use of a non-incremental, head-driven statistical parser to produce string probabilities. In the work of Wen Wang and Mary Harper (Wang and Harper, 2002; Wang, 2003; Wang et al., 2004), a constraint dependency grammar and a finite-state tagging model derived from that grammar were used to exploit syntactic dependencies. The processing advantages of the finite-state encoding of the model have allowed probabilities calculated off-line from this model to be used in the first pass of decoding, which has provided additional benefits. Finally, Och et al. (2004) use a reranking approach with syntactic information within a machine translation system.

Rosenfeld et al. (2001) investigated the use of syntactic features in a Maximum Entropy approach. In their paper, they used a shallow parser to annotate base constituents, and derived features from sequences of base constituents.
The features were indicator features that were either (1) exact matches between a set or sequence of base constituents and those annotated on the hypothesis transcription, or (2) tri-tag features from the constituent sequence. The generative model built from their feature set yielded only a very small improvement in either perplexity or word-error rate.

2.2 Global Linear Models

We follow the framework of Collins (2002; 2004), recently applied to language modeling in Roark et al. (2004a; 2004b). The model we propose consists of the following components:

* GEN(a) is a set of candidate strings for an acoustic input a. In our case, GEN(a) is a set of 1000-best strings from a first-pass recognizer.
* T(w) is the parse tree for string w.
* Φ(a, w) ∈ ℝ^d is a feature-vector representation of an acoustic input a together with a string w.
* ᾱ ∈ ℝ^d is a parameter vector.
* The output of the recognizer for an input a is defined as

\[
w^{*}(a) = \arg\max_{w \in \mathrm{GEN}(a)} \langle \Phi(a, w), \bar{\alpha} \rangle
\]

In principle, the feature vector Φ(a, w) could take into account any features of the acoustic input a together with the utterance w. In this paper we impose two restrictions. First, we define the first feature to be

\[
\Phi_1(a, w) = \beta \log P_l(w) + \log P_a(a \mid w)
\]

where P_l(w) and P_a(a|w) are the language model and acoustic model scores from the baseline speech recognizer. In our experiments we kept β fixed at the value used in the baseline recognizer. It can then be seen that our model is equivalent to the model in Eq. 2. Second, we restrict the remaining features Φ₂(a, w), ..., Φ_d(a, w) to be sensitive to the string w alone.[2] In this sense, the scope of this paper is limited to the language modeling problem. As one example, the language modeling features might take into account n-grams, for example through definitions such as

\[
\Phi_i(a, w) = \text{number of times a particular } n\text{-gram is seen in } w
\]

Previous work (Roark et al., 2004a; Roark et al., 2004b) considered features of this type. In this paper, we introduce syntactic features, which may be sensitive to the parse tree for w, for example

\[
\Phi_j(a, w) = \text{number of times } S \rightarrow NP\ VP \text{ is seen in } T(w)
\]

where S → NP VP is a context-free rule production. Section 3 describes the full set of features used in the empirical results presented in this paper.

[2] Future work may consider features of the acoustic sequence a together with the string w, allowing the approach to be applied to acoustic modeling.
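As a concrete illustration of this setup (not taken from the paper), the sketch below builds a sparse feature vector Φ(a, w) for a hypothesis, using the baseline recognizer score as the first feature together with simple bigram and rule-production counts, and then selects the highest-scoring member of an n-best list under ⟨Φ(a, w), ᾱ⟩. The Hypothesis container, the feature names, and the helper functions are hypothetical; they stand in for whatever representation a first-pass recognizer and parser actually provide.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Hypothesis:
    words: List[str]      # candidate transcription w
    rules: List[str]      # context-free productions in T(w), e.g. "S -> NP VP"
    lm_logprob: float     # log P_l(w) from the baseline language model
    am_logprob: float     # log P_a(a | w) from the baseline acoustic model

def features(hyp: Hypothesis, beta: float) -> Dict[str, float]:
    """Map a hypothesis to a sparse feature vector Phi(a, w)."""
    phi: Dict[str, float] = Counter()
    # Feature 1: the combined baseline recognizer score, with beta held fixed.
    phi["baseline_score"] = beta * hyp.lm_logprob + hyp.am_logprob
    # n-gram count features over the string w (here: bigrams).
    padded = ["<s>"] + hyp.words + ["</s>"]
    for w1, w2 in zip(padded, padded[1:]):
        phi[f"bigram:{w1}_{w2}"] += 1.0
    # Syntactic count features over the parse tree T(w).
    for rule in hyp.rules:
        phi[f"rule:{rule}"] += 1.0
    return phi

def rerank(gen_a: List[Hypothesis], alpha: Dict[str, float], beta: float) -> Hypothesis:
    """Return argmax over GEN(a) of the inner product <Phi(a, w), alpha>."""
    def score(hyp: Hypothesis) -> float:
        phi = features(hyp, beta)
        return sum(alpha.get(name, 0.0) * value for name, value in phi.items())
    return max(gen_a, key=score)
```

Representing Φ as a sparse dictionary reflects the fact that only a small fraction of the millions of possible n-gram and rule features fire on any one hypothesis.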
We now describe how the parameter vector ᾱ is estimated from a set of training utterances. The training set consists of examples (a_i, w_i) for i = 1 ... m, where a_i is the i'th acoustic input and w_i is the transcription of this input. We briefly review the two training algorithms described in Roark et al. (2004b): the perceptron algorithm and global conditional log-linear models (GCLMs).

Figure 1 shows the perceptron algorithm. It is an online algorithm, which makes several passes over the training set, updating the parameter vector after each training example. For a full description of the algorithm, see Collins (2002; 2004).

Figure 1: The perceptron training algorithm.
  Input: a parameter T specifying the number of iterations over the training set; a value α for the first parameter; a feature-vector representation Φ(a, w) ∈ ℝ^d; training examples (a_i, w_i) for i = 1 ... m; and an n-best list GEN(a_i) for each training utterance. We take s_i to be the member of GEN(a_i) which has the lowest WER when compared to w_i.
  Initialization: set α₁ = α, and α_j = 0 for j = 2 ... d.
  Algorithm: for t = 1 ... T and i = 1 ... m, compute y_i = argmax_{w ∈ GEN(a_i)} ⟨Φ(a_i, w), ᾱ⟩; if y_i ≠ s_i, set ᾱ = ᾱ + Φ(a_i, s_i) − Φ(a_i, y_i).
  Output: either the final parameters ᾱ, or the averaged parameters ᾱ_avg = Σ_{t,i} ᾱ_{t,i} / (mT), where ᾱ_{t,i} is the parameter vector after training on the i'th training example in the t'th pass through the training data.
Following Roark et al. (2004a), the parameter α₁ is set to be some constant α that is typically chosen through optimization over the development set. Recall that α₁ dictates the weight given to the baseline recognizer score.

A second parameter estimation method, which was used in Roark et al. (2004b), is to optimize the log-likelihood under a log-linear model. Similar approaches have been described in Johnson et al. (1999) and Lafferty et al. (2001). The objective function used in optimizing the parameters is

\[
L(\bar{\alpha}) = \sum_{i=1}^{m} \log P_{\bar{\alpha}}(s_i \mid a_i) - C \|\bar{\alpha}\|^{2},
\qquad
P_{\bar{\alpha}}(s_i \mid a_i) = \frac{\exp\bigl(\langle \Phi(a_i, s_i), \bar{\alpha} \rangle\bigr)}{\sum_{w \in \mathrm{GEN}(a_i)} \exp\bigl(\langle \Phi(a_i, w), \bar{\alpha} \rangle\bigr)}
\]

Here, each s_i is the member of GEN(a_i) which has the lowest WER with respect to the target transcription w_i. The first term in L(ᾱ) is the log-likelihood of the training data under a conditional log-linear model. The second term is a regularization term which penalizes large parameter values; C is a constant that dictates the relative weighting given to the two terms. The optimal parameters are defined as

\[
\bar{\alpha}^{*} = \arg\max_{\bar{\alpha}} L(\bar{\alpha}).
\]

We refer to these models as global conditional log-linear models (GCLMs).

Each of these algorithms has advantages. A number of results, for example those in Sha and Pereira (2003) and Roark et al. (2004b), suggest that the GCLM approach leads to slightly higher accuracy than perceptron training. However, the perceptron converges very quickly, often in just a few passes over the training set; in comparison, GCLMs can take tens or hundreds of gradient calculations before convergence. In addition, the perceptron can be used as an effective feature selection technique, in that at each training example it only increments features seen on s_i or y_i, effectively ignoring all other features seen on members of GEN(a_i). For example, in the experiments in Roark et al. (2004a), the perceptron converged in around three passes over the training set, while picking non-zero values for around 1.4 million n-gram features out of a possible 41 million n-gram features seen in the training set.

For the present paper, to get a sense of the relative effectiveness of the various kinds of syntactic features that can be derived from the output of a parser, we report results using just the perceptron algorithm. This has allowed us to explore more of the potential feature space than we could have with the more costly GCLM estimation techniques. In future work we plan to apply GCLM parameter estimation methods to the task.
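To complement Figure 1, the following is a minimal sketch of how the perceptron updates could be implemented on top of the hypothetical Hypothesis, features, and rerank helpers from the earlier sketch. It is an illustration under those assumptions, not the authors' implementation, and it omits the parameter averaging described in the figure's Output step.

```python
from typing import Dict, List, Tuple

def perceptron_train(
    training_set: List[Tuple[List["Hypothesis"], "Hypothesis"]],
    beta: float,
    alpha_init: float,
    num_passes: int,
) -> Dict[str, float]:
    """Perceptron training in the style of Figure 1 (without averaging).

    Each training example pairs GEN(a_i) with s_i, the member of GEN(a_i)
    that has the lowest WER against the reference transcription w_i.
    """
    # Initialization: the weight on the baseline recognizer score starts at a
    # constant chosen on development data; all other weights start at zero.
    alpha: Dict[str, float] = {"baseline_score": alpha_init}

    for _ in range(num_passes):                  # t = 1 ... T
        for gen_ai, s_i in training_set:         # i = 1 ... m
            y_i = rerank(gen_ai, alpha, beta)    # model's current best hypothesis
            if y_i.words != s_i.words:
                # Promote features of the low-WER hypothesis s_i and
                # demote features of the model's pick y_i.
                for name, value in features(s_i, beta).items():
                    alpha[name] = alpha.get(name, 0.0) + value
                for name, value in features(y_i, beta).items():
                    alpha[name] = alpha.get(name, 0.0) - value
    return alpha
```

Because only features that fire on s_i or y_i are ever touched, the learned parameter vector stays sparse, which is the feature-selection effect noted above; averaging the per-example parameter vectors, as in the figure's Output step, is a straightforward extension.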