<?xml version="1.0" standalone="yes"?> <Paper uid="P95-1047"> <Title>Acquiring a Lexicon from Unsegmented Speech</Title>
<Section position="3" start_page="0" end_page="311" type="metho"> <SectionTitle> 2 A Simple Prototype </SectionTitle>
<Paragraph position="0"> We have implemented a simple algorithm as an exploratory effort. It maintains a single dictionary, a set of words. Each word consists of a phone sequence and a set of sememes (semantic symbols). Initially, the dictionary is empty. When presented with an utterance, the algorithm goes through the following sequence of actions: * It attempts to cover ("parse") the utterance phones and semantic symbols with a sequence of words from the dictionary, each word offset a certain distance into the phone sequence, with words potentially overlapping.</Paragraph>
<Paragraph position="1"> * It then creates new words that account for uncovered portions of the utterance, and adjusts words from the parse to better fit the utterance.</Paragraph>
<Paragraph position="2"> * Finally, it reparses the utterance with the old dictionary and the new words, and adds the new words to the dictionary if the resulting parse covers the utterance well.</Paragraph>
<Paragraph position="3"> Occasionally, the program removes rarely used words from the dictionary, and removes words which can themselves be parsed. The general operation of the program should be made clearer by the following two examples. In the first, the program starts with an empty dictionary, early in the acquisition process, and receives the simple utterance /nina/ { NINA } (a child's name). Naturally, it is unable to parse the input. It therefore creates a new word, /nina/ { NINA }, covering the whole utterance, and reparses. Having successfully parsed the input, it adds the new word to the dictionary. Later in the acquisition process, it encounters the sentence you kicked off the sock, when the dictionary contains (among other words) /yu/ { YOU }, /ðə/ { THE }, and /rsuk/ { SOCK }. On this basis, it adds /kIktɔf/ { KICK OFF } and /sak/ { SOCK } to the dictionary. /rsuk/ { SOCK }, not used in this analysis, is eventually discarded from the dictionary for lack of use. /kIktɔf/ { KICK OFF } is later found to be parsable into two subwords, and is also discarded.</Paragraph>
<Paragraph position="4"> One can view this procedure as a variant of the expectation-maximization procedure (Dempster et al., 1977), with the parse of each utterance as the hidden variables. There is currently no preference for which words are used in a parse, save to minimize mismatches and unparsed portions of the input, but obviously a word grammar could be learned in conjunction with this acquisition process and used as a disambiguation step.</Paragraph>
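A minimal Python sketch of this loop, assuming a greedy, non-overlapping, left-to-right cover and omitting the word-adjustment step; the identifiers (Word, parse_utterance, process), the coverage test, and the assignment of all leftover sememes to each new word are our illustrative choices, not details of the actual implementation:

```python
# Illustrative sketch of the acquisition loop: greedy non-overlapping
# cover, no word adjustment, and a strict "full coverage" acceptance
# test stand in for the paper's more permissive parse scoring.
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    phones: str           # e.g. "nina"
    sememes: frozenset    # e.g. frozenset({"NINA"})

def parse_utterance(phones, sememes, dictionary):
    """Greedily cover the phone string with dictionary words whose
    sememes occur in the utterance semantics; return the words used
    and the uncovered phone spans."""
    used, uncovered, i = [], [], 0
    while i < len(phones):
        match = next((w for w in dictionary
                      if phones.startswith(w.phones, i)
                      and w.sememes <= sememes), None)
        if match is not None:
            used.append(match)
            i += len(match.phones)
        else:
            if uncovered and uncovered[-1][1] == i:
                uncovered[-1] = (uncovered[-1][0], i + 1)  # grow the current gap
            else:
                uncovered.append((i, i + 1))               # open a new gap
            i += 1
    return used, uncovered

def process(phones, sememes, dictionary):
    """One utterance presentation: parse, create words for uncovered
    material, reparse, and keep the new words if coverage is good."""
    used, uncovered = parse_utterance(phones, sememes, dictionary)
    accounted = frozenset().union(*(w.sememes for w in used)) if used else frozenset()
    leftover = sememes - accounted
    new_words = {Word(phones[a:b], leftover) for a, b in uncovered}
    _, still_uncovered = parse_utterance(phones, sememes, dictionary | new_words)
    if not still_uncovered:
        dictionary |= new_words
    return dictionary

lexicon = set()
lexicon = process("nina", frozenset({"NINA"}), lexicon)
print(lexicon)  # {Word(phones='nina', sememes=frozenset({'NINA'}))}
```

Even this reduced version reproduces the /nina/ example above: the first parse fails, a word spanning the whole utterance is created, and the reparse succeeds, so the word enters the dictionary.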
</Section>
<Section position="4" start_page="311" end_page="312" type="metho"> <SectionTitle> 3 Tests and Results </SectionTitle>
<Paragraph position="0"> To test the algorithm, we used 34438 utterances from the Childes database of mothers' speech to children (MacWhinney and Snow, 1985; Suppes, 1973). These text utterances were run through a publicly available text-to-phone engine. A semantic dictionary was created by hand, in which each root word from the utterances was mapped to a corresponding sememe. Various forms of a root ("see", "saw", "seeing") all map to the same sememe, e.g., SEE.</Paragraph>
<Paragraph position="2"> Semantic representations for a given utterance are merely unordered sets of sememes, generated by taking the union of the sememes of each word in the utterance. Figure 1 contains the first 6 utterances from the database.</Paragraph>
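As a toy illustration of this input encoding (the table entries and phone strings below are invented stand-ins, not the actual text-to-phone engine or hand-built dictionary):

```python
# Toy stand-ins for the hand-built semantic dictionary and the
# text-to-phone engine; all entries are invented for illustration.
SEMEMES = {"see": "SEE", "saw": "SEE", "seeing": "SEE",
           "you": "YOU", "kicked": "KICK", "off": "OFF",
           "the": "THE", "sock": "SOCK"}
PHONES = {"you": "yu", "kicked": "kIkt", "off": "ɔf",
          "the": "ðə", "sock": "sak"}

def utterance_input(text):
    """Build the pair fed to the learner: phones concatenated with no
    word boundaries, semantics as the unordered union of word sememes."""
    words = text.lower().split()
    return ("".join(PHONES[w] for w in words),
            frozenset(SEMEMES[w] for w in words))

print(utterance_input("you kicked off the sock"))
# ('yukIktɔfðəsak', frozenset({'YOU', 'KICK', 'OFF', 'THE', 'SOCK'}))
# (sememe order in the printed set may vary)
```

Note that because the semantics is a set, repeated words contribute a single sememe and all word-order information is discarded.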
<Paragraph position="3"> We describe the results of a single run of the algorithm, trained on one exposure to each of the 34438 utterances, containing a total of 2158 different stems. The final dictionary contains 1182 words, where some entries are different forms of a common stem. 82 of the words in the dictionary have never been used in a good parse. We eliminate these words, leaving 1100. Figure 2 presents some entries in the final dictionary, and figure 3 presents all 21 (2%) of the dictionary entries that might reasonably be considered mistakes.</Paragraph>
[Figure 2 caption: the left 10 words are those used most frequently in good parses; the right 10 were selected randomly from the 1100 entries.]
<Paragraph position="6"> Some of them, like /ʃiz/, are conglomerations that should have been divided. Others, like /t/, /wo/, and /don/, demonstrate how the system compensates for the morphological irregularity of English contractions. The /Iŋ/ problem is discussed in the text; misanalysis of the role of /Iŋ/ also manifests itself in the entry for something.</Paragraph>
<Paragraph position="7"> The most obvious error visible in figure 3 is the suffix -ing (/Iŋ/), which should have an empty sememe set. Indeed, such a word is properly hypothesized, but a special mechanism prevents semantically empty words from being added to the dictionary.</Paragraph>
<Paragraph position="8"> Without this mechanism, the system would chance upon a new word like ring, /rIŋ/, use the /Iŋ/ {} to account for most of the sound, and build a new word /r/ { RING } to cover the rest; witness something in figure 3. Most other semantically empty affixes (the plural /s/, for instance) are also properly hypothesized and disallowed, but the dictionary learns multiple entries to account for them (/eg/ "egg" and /egz/ "eggs"). The system learns synonyms ("is", "was", "am", ...) and homonyms ("read", "red"; "know", "no") without difficulty.</Paragraph>
<Paragraph position="9"> Removing the restriction on empty semantics, and also setting the semantics of the function words a, an, the, that, and of to {}, the most common empty words learned are given in figure 4. The ring problem surfaces: among the other words learned are now /k/ { CAR } and /br/ { BRING }. To fix such problems, it is clear that more constraints on morpheme order must be incorporated into the parsing process, perhaps in the form of a statistical grammar acquired simultaneously with the dictionary.</Paragraph>
[Figure 4 caption: the most common empty words in the final dictionary.]
</Section>
<Section position="5" start_page="312" end_page="312" type="metho"> <SectionTitle> 4 Current Directions </SectionTitle>
<Paragraph position="0"> The algorithm described above is extremely simple, as was the input fed to it. In particular: * The input was phonetically oversimplified, each word pronounced the same way each time it occurred, regardless of environment. There was no phonological noise and no cross-word effects.</Paragraph>
<Paragraph position="1"> * The semantic representations were not only noise-free and unambiguous, but corresponded directly to the words in the utterance.</Paragraph>
<Paragraph position="2"> To better investigate more realistic formulations of the acquisition problem, we are extending our coverage to actual phonetic transcriptions of speech, by allowing for various phonological processes and noise, and by building in probabilistic models of morphology and syntax. We are further reducing the information present in the semantic input by removing all function-word symbols and merging various content symbols to encompass several word paradigms. We hope to transition to phonemic input produced by a phoneme-based speech recognizer in the near future.</Paragraph>
<Paragraph position="3"> Finally, we are instituting an objective test measure: rather than examining the dictionary directly, we will compare segmentation and morpheme labeling to textual transcripts of the input speech.</Paragraph>
</Section> </Paper>