<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1080">
  <Title>Applying SPHINX-II to the DARPA Wall Street Journal CSR Task</Title>
  <Section position="3" start_page="395" end_page="396" type="metho">
    <SectionTitle>
4. WSJ-CSR Experimental Setup
</SectionTitle>
    <Paragraph position="0"> The WSJ corpus consists of approximately 45 million words of text published by the Wall Street Journal between 1987 and 1989. This corpus was made available through the Association for Computational Linguistics/Data Collection Initiative (ACL/DCI) \[12\].</Paragraph>
    <Section position="1" start_page="396" end_page="396" type="sub_section">
      <SectionTitle>
4.1. Language Models
</SectionTitle>
      <Paragraph position="0"> For the purposes of the February dry run, eight standard bigram language models were provided by D. Paul at Lincoln Labs \[13\]. The language models were trained only on the WSJ data that was not held out for acoustic training and testing. The language models are characterized along three dimensions: lexicon size (5k or 20k), closed or open vocabulary, and verbalized (VP) or non-verbalized punctuation (NVP). The distinction between open and closed vocabulary models lies in how the lexicon is chosen. For the open vocabulary, the lexicon consists of approximately the N most common words in the corpus. For the closed vocabulary, a set of N words was selected so as to allow the creation of a sub-corpus with 100% lexical coverage by this closed vocabulary. For further details see \[14\]. The development test set perplexities for the eight language models are given in Table 2.</Paragraph>
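The perplexity figures reported for these bigram models can be illustrated with a minimal sketch. The example below uses simple add-alpha smoothing on a toy corpus; this is an assumption for illustration only, not the smoothing actually used for the WSJ-CSR models.

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens, alpha=1.0):
    """Perplexity of an add-alpha smoothed bigram model.
    Illustrative only; the paper's models used standard toolkit smoothing."""
    vocab = set(train_tokens)
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    v = len(vocab)
    log_prob = 0.0
    n = 0
    for w1, w2 in zip(test_tokens, test_tokens[1:]):
        # smoothed conditional probability P(w2 | w1)
        p = (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * v)
        log_prob += math.log2(p)
        n += 1
    # perplexity is 2 to the power of the average negative log-likelihood
    return 2 ** (-log_prob / n)

train = "the dow rose as the dow fell".split()
test = "the dow rose".split()
print(round(bigram_perplexity(train, test), 2))
```

Lower perplexity on the development test set indicates a language model that better predicts held-out WSJ text.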
    </Section>
    <Section position="2" start_page="396" end_page="396" type="sub_section">
      <SectionTitle>
4.2. Training and Evaluation
Acoustic Data Sets
</SectionTitle>
      <Paragraph position="0"> The baseline speaker-independent training data set provided by the National Institute of Standards and Technology (NIST) \[15\] consisted of 7240 utterances of read WSJ text equally divided among VP and NVP texts. The texts chosen to train the system were quality filtered to remove very long and very short sentences, as well as sentences containing words not among the 64k most frequently occurring words in the WSJ corpus \[13\].</Paragraph>
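The quality filtering described above can be sketched as follows. The length thresholds here are assumptions for illustration; the paper does not state the exact cutoffs used.

```python
def quality_filter(sentences, lexicon, min_len=1, max_len=40):
    """Sketch of the NIST-style quality filter: drop very short or very
    long sentences and sentences with out-of-lexicon words.
    min_len/max_len are illustrative thresholds, not the paper's."""
    kept = []
    for sent in sentences:
        words = sent.lower().split()
        # length band check (both bounds inclusive)
        in_band = len(words) >= min_len and max_len >= len(words)
        # every word must appear in the 64k-word lexicon
        covered = all(w in lexicon for w in words)
        if in_band and covered:
            kept.append(sent)
    return kept

lex = {"stocks", "rose", "today", "the"}
print(quality_filter(["Stocks rose today", "Zyxwv rose"], lex))
```

The second sentence is dropped because it contains a word outside the lexicon.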
      <Paragraph position="1"> The data were collected from 843 speakers, equally divided between male and female. Recording was performed at three different locations: MIT, SRI, and TI. At all three locations the same close-speaking, noise-canceling microphone was used; however, environmental conditions varied from a sound booth to a laboratory environment. At CMU we used a subset of the 7240 utterances, excluding 89 utterances that contained cross-talk or overlapping noise events, as indicated by the detailed orthographic transcription (DOT) of each utterance.</Paragraph>
      <Paragraph position="2"> One speaker was recorded at two different sites and so is counted as two different speakers. The speaker-independent evaluation data set consisted of eight data sets containing a total of 1200 utterances from 10 speakers. Again, each data set was equally divided between male and female speakers. For further details on the evaluation test sets see \[14\].</Paragraph>
    </Section>
    <Section position="3" start_page="396" end_page="396" type="sub_section">
      <SectionTitle>
4.3. Acoustic Configuration
</SectionTitle>
      <Paragraph position="0"> The configuration of SPHINX-II for WSJ-CSR consists of 16,713 phonetic models that share 6255 semi-continuous distributions. For between-word modeling, only the left context is considered. There is no speaker normalization component or vocabulary adaptation component. The dictionary provided by Dragon Systems was programmatically converted into the CMU-style phonetic baseforms, with some additional manual post-processing to fix problems with the transcription of flaps /dx/.</Paragraph>
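A programmatic baseform conversion of this kind can be sketched as a symbol-table lookup over each pronunciation, with unmapped symbols flagged for manual review. The mapping table below is a hypothetical fragment, not Dragon's or CMU's actual symbol inventories.

```python
# Hypothetical fragment of a source-to-CMU phone mapping; DX denotes the flap.
PHONE_MAP = {"aa": "AA", "t_flap": "DX", "ih": "IH", "s": "S"}

def convert_baseform(phones, phone_map=PHONE_MAP):
    """Map a source-dictionary pronunciation to CMU-style phones.
    Raises on unmapped symbols so they can be fixed by hand."""
    out = []
    for p in phones:
        if p not in phone_map:
            raise KeyError("unmapped phone: " + p)  # flag for manual review
        out.append(phone_map[p])
    return out

print(convert_baseform(["s", "ih", "t_flap", "ih"]))
```

Raising on unmapped symbols mirrors the need for manual post-processing: any pronunciation the table cannot handle is surfaced rather than silently converted.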
    </Section>
  </Section>
</Paper>