<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1047">
<Title>Acquiring a Lexicon from Unsegmented Speech</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> We are interested in how a lexicon of discrete words can be acquired from continuous speech, a problem fundamental both to child language acquisition and to the automated induction of computer speech recognition systems; see (Olivier, 1968; Wolff, 1982; Cartwright and Brent, 1994) for previous computational work in this area. For the time being, we approximate the problem as induction from phone sequences rather than acoustic pressure, and assume that learning takes place in an environment where simple semantic representations of the speech intent are available to the acquisition mechanism.</Paragraph>
<Paragraph position="1"> For example, we approximate the greater problem as that of learning from inputs like
Phon. Input: /~raebltslne~ b~W t/
Sem. Input: { BOAT A IN RABBIT THE BE }
(The rabbit's in a boat.)
where the semantic input is an unordered set of identifiers corresponding to word paradigms. Obviously the artificial pseudo-semantic representations make the problem much easier: we experiment with them as a first step, somewhere between learning language &quot;from a radio&quot; and providing an unambiguous textual transcription, as might be used for training a speech recognition system.</Paragraph>
<Paragraph position="2"> Our goal is to create a program that, after training on many such pairs, can segment a new phonetic utterance into a sequence of morpheme identifiers.</Paragraph>
<Paragraph position="3"> Such output could be used as input to many grammar acquisition programs.</Paragraph>
</Section>
</Paper>