<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0110">
  <Title>Segment Predictability as a Cue in Word Segmentation: Application to Modern Greek</Title>
  <Section position="2" start_page="0" end_page="3" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> A substantial portion of research in child language acquisition focuses on the word segmentation problem--how children learn to extract words (or word candidates) from a continuous speech signal prior to having acquired a substantial vocabulary. While a number of robust strategies have been proposed and tested for infants learning English and a few other languages (discussed in Section 1.1), it is not clear whether or how these apply to all or most languages. In addition, experiments on infants often leave undetermined many details of how particular cues are actually used. Computational simulations of word segmentation have also focused mainly on data from English corpora, and should also be extended to cover a broader range of the corpora available.</Paragraph>
    <Paragraph position="1"> The line of research proposed here is twofold: on the one hand we wish to understand the nature of the cues present in Modern Greek, on the other we wish to establish a framework for orderly comparison of word segmentation algorithms across the desired broad range of languages.</Paragraph>
    <Paragraph position="2"> Finite-state techniques, used by e.g., Belz (1998) in modeling phonotactic constraints and syllable within various languages, provide one straightforward way to formulate some of these comparisons, and may be useful in future testing of multiple cues.</Paragraph>
    <Paragraph position="3"> Previous research (Rytting, 2004) examined the role of utterance-boundary information in Modern Greek, implementing a variant of Aslin and colleagues' (1996) model within a finite-state framework. The present paper examines more closely the proposed cue of segment predictability. These two studies lay the groundwork for examining the relative worth of various cues, separately and as an ensemble.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Infant Studies
</SectionTitle>
      <Paragraph position="0"> Studies of English-learning infants find the earliest evidence for word segmentation and acquisition between 6 and 7.5 months (Jusczyk and Aslin, 1995) although many of the relevant cues and strategies seem not to be learned until much later.</Paragraph>
      <Paragraph position="1"> Several types of information in the speech signal have been identified as likely cues for infants, including lexical stress, co-articulation, and phonotactic constraints (see e.g., Johnson &amp; Jusczyk, 2001 for a review). In addition, certain heuristics using statistical patterns over (strings of) segments have also been shown to be helpful in the absence of other cues.</Paragraph>
      <Paragraph position="2"> One of these (mentioned above) is extrapolation from the segmental context near utterance boundaries to predict word boundaries (Aslin et al., 1996). Another proposed heuristic utilizes the relative predictability of the following segment or syllable. For example, Saffran et al. (1996) have confirmed the usefulness of distributional cues for 8-month-olds on artificially designed micro- null Proceedings of the Workshop of the languages--albeit with English-learning infants only.</Paragraph>
      <Paragraph position="3"> The exact details of how infants use these cues are unknown, since the patterns in their stimuli fit several distinct models (see Section 1.2). Only further research will tell how and to what degree these strategies are actually useful in the context of natural language-learning settings--particularly for a broad range of languages. However, what is not in doubt is that infants are sensitive to the cues in question, and that this sensitivity begins well before the infant has acquired a large vocabulary.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
1.2 Implementations and Ambiguities
</SectionTitle>
      <Paragraph position="0"> While the infant studies discussed above focus primarily on the properties of particular cues, computational studies of word-segmentation must also choose between various implementations, which further complicates comparisons. Several models (e.g., Batchelder, 2002; Brent's (1999a) MBDP-1 model; Davis, 2000; de Marcken, 1996; Olivier, 1968) simultaneously address the question of vocabulary acquisition, using previously learned word-candidates to bootstrap later segmentations.</Paragraph>
      <Paragraph position="1"> (It is beyond the scope of this paper to discuss these in detail; see Brent 1999a,b for a review.) Other models do not accumulate a stored vocabulary, but instead rely on the degree of predictability of the next syllable (e.g., Saffran et al., 1996) or segment (e.g., Christiansen et al., 1998). The intuition here, first articulated by Harris (1954), is that word boundaries are marked by a spike in unpredictability of the following phoneme. The results from Saffran et al. (1996) show that English-learning infants do respond to areas of unpredictability; however, it is not clear from the experiment how this unpredictability is best measured. Two specific ambiguities in measuring (un)predictability are examined here.</Paragraph>
      <Paragraph position="2"> Brent (1999a) points out one type of ambiguity, namely that Saffran and colleagues' (1996) results can be modeled as favoring word-breaks at points of either low transitional probability or low mutual information. Brent reports results for models relying on each of these measures. It should be noted that these models are not the main focus of his paper, but provided for illustrative purposes; nevertheless, these models provide the best comparison to Saffran and colleagues' experiment, and may be regarded as an implementation of the same.</Paragraph>
      <Paragraph position="3"> Brent (1999a) compares these two models in terms of word tokens correctly segmented (see Section 3 for exact criteria), reporting approximately 40% precision and 45% recall for transitional probability (TP) and 50% precision and 53% recall for mutual information (MI) on the first 1000 utterances of his corpus (with improvements given larger corpora). Indeed, their performance on word tokens is surpassed only by Brent's main model (MBDP-1), which seems to have about 73% precision and 67% recall for the same range.</Paragraph>
      <Paragraph position="4">  Another question which Saffran et al. (1996) leave unanswered is whether the segmentation depends on local or global comparisons of predictability. Saffran et al. assume implicitly, and Brent (1999a) explicitly, that the proper comparison is local--in Brent, dependent solely on the adjacent pairs of segments. However, predictability measures for segmental bigrams (whether TP or MI) may be compared in any number of ways. One straightforward alternative to the local comparison is to compare the predictability measures compare to some global threshold. Indeed, Aslin et al. (1996) and Christiansen et al. (1998) simply assumed the mean activation level as a global activation threshold within their neural network framework.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
1.3 Global and Local Comparisons
</SectionTitle>
      <Paragraph position="0"> The global comparison, taken on its own, seems a rather simplistic and inflexible heuristic: for any pair of phonemes xy, either a word boundary is always hypothesized between x and y, or it never is. Clearly, there are many cases where x and y sometimes straddle a word boundary and sometimes do not. The heuristic also takes no account of lengths of possible words. However, the local comparison may take length into account too much, disallowing words of certain lengths. In order to see that, we must examine Brent's (1999a) suggested implementation of Saffran et al. (1996) more closely.</Paragraph>
      <Paragraph position="1"> In the local comparison, given some string ...wxyz..., in order for a word boundary to be inserted between x and y, the predictability measure for xy must be lower than both that of wx and of yz. It follows that neither wx nor yz can have word boundaries between them, since they cannot simultaneously have a lower predictability measure than xy. This means that, within an utterance, word boundaries must have at least two segments between them, so this heuristic will not correctly segment utterance-internal one-phoneme  The specific percentages are not reported in the text, but have been read off his graph. Brent does not report precision or recall for utterance boundaries; those percentages would undoubtedly be higher.</Paragraph>
      <Paragraph position="2">  These methodologies did not ignore local information, but encoded it within the feature vector. However, Rytting (2004) showed that this extra context, while certainly helpful, is not strictly necessary in the Greek corpus under question. A context of just one phoneme yielded better-than-chance results.</Paragraph>
      <Paragraph position="3"> words.</Paragraph>
      <Paragraph position="4">  Granted, only a few one-phoneme word types exist in either English or Greek (or other languages). However, these words are often function words and so are less likely to appear at edges of utterances (e.g., ends of utterances for articles and prepositions; beginnings for postposed elements). Neither Brent's (1999a) implementation of Saffran's et al. (1996) heuristic nor Aslin's et al. (1996) utterance-boundary heuristic can explain how these might be learned. Brent (1999a) himself points out another lengthrelated limitation--namely, the relative difficulty that the 'local comparison' heuristic has in segmenting learning longer words. The bigram MI frequencies may be most strongly influenced by-and thus as an aggregate largely encode--the most frequent, shorter words. Longer words cannot be memorized in this representation (although common ends of words such as prefixes and suffixes might be).</Paragraph>
      <Paragraph position="5"> In order to test for this, Brent proposes that precision for word types (which he calls &amp;quot;lexicon precision&amp;quot;) be measured as well as for word tokens. While the word-token metric emphasizes the correct segmentation of frequent words, the word-type metric does not share this bias. Brent defines this metric as follows: &amp;quot;After each block [of 500 utterances], each word type that the algorithm produced was labeled a true positive if that word type had occurred anywhere in the portion of the corpus processed so far; otherwise it is labeled a false positive.&amp;quot; Measured this way, MI yields a word type precision of only about 27%; transitional probability yields a precision of approximately 24% for the first 1000 utterances, compared to 42% for MBDP-1. He does not measure word type recall.</Paragraph>
      <Paragraph position="6"> This same limitation in finding longer, less frequent types may apply to comparisons against a global threshold as well. This is also in need of testing. It seems that both global and local comparisons, used on their own as sole or decisive heuristics, may have serious limitations. It is not clear a priori which limitation is most serious; hence both comparisons are tested here.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>