<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1081"> <Title>Data-driven Classification of Linguistic Styles in Spoken Dialogues</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Corpora </SectionTitle> <Paragraph position="0"> The task of a spoken dialogue system is to engage in spoken human-computer interaction. It is well known that spoken human-computer interaction differs from its human-human counterpart in various dimensions (Doran et al., 2001), including linguistic complexity. For this investigation, three sources were used: a corpus of task-dependent human-human interactions (negotiation dialogues), a corpus of free human-human conversations, and a corpus of human-computer interactions. For all corpora, the part-of-speech tag of each word was annotated automatically by the IMS tree tagger (Schmid, 1994) using the STTS tagset (Schiller et al., 1995).</Paragraph> <Paragraph position="1"> Verbmobil The Verbmobil (VM) corpus (Wahlster, 1993) is one of the largest spoken dialogue corpora available for German. It contains spontaneous human-human spoken dialogues in the appointment negotiation and travel planning domains. The corpus used for this investigation has data from 837 speakers (24569 turns with 448737 words, av. 29.35 turns per speaker).</Paragraph> <Paragraph position="2"> CallHome The CallHome (CH) corpus (Linguistic Data Consortium, 1997) contains 80 dialogues, each consisting of 10 minutes of unconstrained telephone conversation between two humans. The corpus has utterances from 160 speakers (17744 turns with 145552 words, av. 110.9 turns per speaker).</Paragraph> <Paragraph position="3"> TABA The TABA corpus contains human-computer dialogues in the domain of train timetable information (Aust et al., 1995). The transcription was done automatically by the speech recognizer of the dialogue system.
Because the recognizer can only output words present in its recognition lexicon and is itself subject to errors, the corpus, unlike the other two corpora, probably does not always contain the words actually uttered by the speaker. The corpus consists of 5200 dialogues (33568 turns with 90377 words, av. 6.45 turns per speaker).</Paragraph> </Section></Paper>