<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1075">
  <Title>Collection and Analyses of WSJ-CSR Data at MIT</Title>
  <Section position="7" start_page="372" end_page="372" type="concl">
    <SectionTitle>
SUMMARY
</SectionTitle>
    <Paragraph position="0"> This paper describes our involvement in the collection of the WSJ-CSR pilot corpus. By paying close attention to developing an easy-to-use computer interface, we were able to collect over 33 hours of speech from 64 subjects over a relatively short period. By using in-house equipment to produce CD-ROM-compatible WORM disks, we were able to distribute the data rapidly to interested researchers. Our analyses of the collected data show that the WSJ-CSR corpus differs significantly from other corpora in the research community. We expect that it will have a long-lasting impact on speech recognition research within the DARPA community and around the world.</Paragraph>
    <Paragraph position="1"> The preliminary text preprocessing experiment that we conducted suggests that the current preprocessing scheme may not adequately capture the ways people would naturally speak the sentences. Clearly, more extensive experiments must be performed. Whether one should preprocess the text at all is a decision that the community must make collectively.</Paragraph>
  </Section>
</Paper>