<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1073"> <Title>The Design for the Wall Street Journal-based CSR Corpus*</Title> <Section position="4" start_page="0" end_page="357" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> As spoken language technology progresses and goals expand, progressively larger and more challenging corpora need to be created to support advanced research. The SLS DARPA 1994 goals are ambitious, focusing on cooperative speakers generating goal-directed, spontaneous continuous speech, in speaker-adaptive and speaker-independent modes, for expandable vocabularies (5000 or more words active), moderate perplexity (100-200), with integrated speech and natural language processing, for speakers in a moderate noise environment, using multiple types of microphones, engaged in command/database and dictation applications. In contrast to typical command/database applications, dictation (i.e., interactive speech-driven word processing) tasks focus on cooperative speakers (e.g., speaker-dependent/adaptive sustained usage) who generate continuous speech (usually in a somewhat careful fashion to facilitate accurate transcription), verbalizing their words and sentence punctuation. The existing Resource Management[15] and subsequent Air Travel Information System[16] corpora target specific database inquiry tasks, characterized by medium vocabularies (&lt;1500 words) with language model perplexities ranging from 9 to 60. The WSJ corpus described here is designed to advance CSR technology and support the 1994 SLS research goals. A similar read speech corpus in the French language has been successfully completed using text from the newspaper Le Monde[5]. *This work was sponsored by the Defense Advanced Research Projects Agency. The views expressed are those of the authors and do not reflect the official policy or position of the U.S. Government.</Paragraph>
<Paragraph position="1"> Prompted by serious contractor concerns about suitable CSR corpora[12] dating from the mid-1980s, the DARPA SLS Coordinating Committee began considering new corpus requirements in early 1990. The CSR Corpus Committee was subsequently formed, and its work culminated in the WSJ Corpus design. The CSR Corpus Committee members include J.M. Baker (Dragon, chair), F. Kubala (BBN), D. Pallett (NIST), D. Paul (LL), M. Phillips (MIT), M.</Paragraph> <Paragraph position="2"> Picheny (IBM), R. Rajasekran (TI), B. Weide (CMU), M.</Paragraph> <Paragraph position="3"> Weintraub (SRI), and J. Wilpon (ATT). A survey of the DARPA contractors' CSR research interests revealed highly diverse, often opposing views. All contractors, however, cited a common interest in pursuing research on &quot;Domain-independent Acoustic Models&quot;, &quot;Domain-independent Language Models&quot;, and &quot;Speaker-adaptation&quot;.</Paragraph> <Paragraph position="4"> Lively meetings and discussions resulted in the definition and preliminary authorization of a major (>400 hrs.) corpus based primarily on WSJ material (backed by WSJ text from 1987-89 provided by the ACL/DCI[9] to enable statistical language modeling) and supplemented by other material (spontaneous dictation, Hansard, etc., shown in Table 1).
This corpus will provide a uniquely rich resource, in a carefully crafted structure designed to elicit a highly productive flow of diagnostic research information with an array of comparative test paradigms.</Paragraph> <Paragraph position="5"> Although this WSJ corpus is large relative to many other available corpora, a caution is in order: since most research experiments continue to show marked improvement as more training data becomes available, this corpus, too, will likely not be large enough for systems to reach asymptotic performance. Most systems remain undertrained, or are constrained to work in suboptimal lower-dimensional spaces, because they are starved of data. Indeed, this is not really surprising in light of the much larger amounts of speech data to which young children must be exposed before gaining recognition proficiency with even modest-sized vocabularies. The structure, features, and dimensions of this corpus are the outcome of a heavily debated consensus process, and they satisfy the basic (though certainly not all) requirements of the different research interests of all parties involved. Significant portions of this corpus will be heavily used by one or more research groups and not at all by others. Nonetheless, the common basis and careful structuring of these materials should allow for highly informative intra- and inter-group comparisons. The members of this committee are to be commended and should take pride in their success in jointly exercising a rare &quot;statesmanlike&quot; cooperation to support the legitimate diversity of expert research interests in this field (often overcoming strong pressures of both personal and political convictions to support only their own narrower research interests).</Paragraph> </Section> </Paper>