<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1075">
  <Title>Collection and Analyses of WSJ-CSR Data at MIT</Title>
  <Section position="2" start_page="0" end_page="367" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> One of the key ingredients that has contributed to the steady improvement in speech recognition technology in recent years is the availability of large speech corpora [1,3,7,8]. With the help of these corpora, researchers have been able to develop recognition systems and obtain reliable estimates of system parameters. Perhaps just as important, these corpora, together with standardized performance evaluation procedures and metrics, have encouraged objective comparison of different systems, leading to better understanding and cross-fertilization of research ideas [4].</Paragraph>
    <Paragraph position="1"> The various speech corpora that the DARPA community has collected serve a wide range of purposes. The TIMIT corpus was designed with acoustic-phonetic research in mind. The Resource Management corpus addresses the need for developing recognition systems with moderate vocabulary (1,000 words) and perplexity (60, with a word-pair language model). The VOYAGER and ATIS corpora contain spontaneously generated speech, and are useful for spoken language system development.</Paragraph>
    <Paragraph position="2"> All the presently available corpora have moderate vocabulary sizes and perplexities, and thus cannot adequately support research and development of very large vocabulary continuous speech recognition (CSR) systems in American English. As a result, the DARPA community recently initiated an effort to construct a new corpus to meet these needs.</Paragraph>
    <Paragraph position="3"> The domain chosen by the community is the Wall Street Journal (WSJ), and the text prompts are selected from the CD-ROM distributed by ACL/DCI [5]. While the ultimate goal is to collect around 300 hours of speech from more than 100 speakers, it was thought that we should first collect a pilot corpus of approximately 40 hours, partly to satisfy near-term needs and partly to debug the text preprocessing and data collection processes. Since August 1991, our group has been one of three actively participating in the collection of the WSJ-CSR pilot corpus. The purpose of this paper is to document our involvement in this process, present some comparative analyses of the resulting data, and describe an experiment investigating the preparation of the prompt text.</Paragraph>
  </Section>
</Paper>