<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1078">
  <Title>DARPA FEBRUARY 1992 PILOT CORPUS CSR &quot;DRY RUN&quot; BENCHMARK TEST RESULTS</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Continuous speech recognition research within the DARPA Spoken Language community has, over the past several years, focused on the Resource Management (RM) and Air Travel Information System (ATIS) corpora. Within the past year, plans have been developed for a large, multi-component &quot;general-purpose English, large vocabulary, natural language, high perplexity corpus&quot; known as the DARPA [Wall Street Journal-based] Continuous Speech Recognition (CSR) Corpus [1]. Doug Paul, of MIT Lincoln Laboratory (MIT/LL), and Janet Baker, of Dragon Systems, are responsible for many of the details of these plans. This corpus is intended to supplant the RM corpora and to supplement the ATIS corpora as resources for the DARPA speech recognition research community.</Paragraph>
    <Paragraph position="1"> Since October 1991, the design and collection of the CSR Corpus have been coordinated by the DARPA CSR Corpus Coordinating Committee (CCCC), chaired by George Doddington, following discussions held by an earlier group [2].</Paragraph>
    <Paragraph position="2"> In a meeting held at the MIT Laboratory for Computer Science (MIT/LCS) in August of 1991, plans were developed for an initial &quot;Pilot Corpus&quot; comprising approximately 40 hours of recorded speech material, which was to be made available within the DARPA community in time to permit reporting of preliminary or &quot;dry run&quot; benchmark tests at the February 1992 meeting. Following that meeting, NIST, acting as a DARPA &quot;agent&quot;, contracted with Texas Instruments and SRI International (SRI) [3] for collection of the Pilot Corpus, and the spoken language group at MIT/LCS also agreed to collect a substantial amount of material for the Pilot Corpus [4]. NIST prepared the material for production on recordable CD-ROM media (at MIT/LCS) and screened the associated transcriptions for conformance to standards. The group at SRI was the only group that collected &quot;spontaneous dictation&quot; in addition to the &quot;read speech&quot; comprising the bulk of the Pilot CSR Corpus. More than 80 hours of material (per microphone channel) had been collected and distributed to several DARPA contractors. This material included Speaker-Dependent, Longitudinal Speaker-Dependent, and Speaker-Independent training components as well as specifically designated Development Test sets.</Paragraph>
    <Paragraph position="3"> On January 17th (approximately one month before the Speech and Natural Language Workshop), two CD-ROMs containing a selected portion of the Pilot Corpus Evaluation Test set were distributed by NIST to four sites: CMU, Dragon Systems, MIT/LL, and SRI International. These sites had indicated interest in participating in the initial &quot;dry run&quot; benchmark test associated with the CSR Pilot Corpus.</Paragraph>
  </Section>
</Paper>