<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0110"> <Title>Segment Predictability as a Cue in Word Segmentation: Application to Modern Greek</Title>
<Section position="3" start_page="3" end_page="9" type="metho"> <SectionTitle> 2 Constructing a Finite-State Model 2.1 Outline of current research </SectionTitle>
<Paragraph position="0"> While in its general approach the study reported here replicates the mutual-information and transitional-probability models in Brent (1999a), it differs slightly in the details of their use. (Footnote: At the edges of utterances, this restriction will not apply, since word boundaries are automatically inserted at utterance boundaries, while still allowing the possibility of a boundary insertion at the next position.) First, whereas Brent dynamically updated his measures over a single corpus, and thus blurred the line between training and testing data, our model precompiles statistics for each distinct bigram type offline, over a separate training corpus. (Footnote: While this difference is not intended as a strong theoretical claim, it can be seen as reflecting the fact that even before infants seem to begin the word segmentation process, they have already been exposed to a substantial amount of linguistic material. However, it is not anticipated to affect the general pattern of results.)</Paragraph>
<Paragraph position="1"> Second, we compare the use of a global threshold (described in more detail in Section 2.3, below) to Brent's (1999a) use of the local context (as described in Section 1.3 above).</Paragraph>
<Paragraph position="2"> Like Brent (1999a), but unlike Saffran et al. (1996), our model focuses on pairs of segments, not on pairs of syllables. While Modern Greek syllabic structure is not as complicated as that of English, it is still more complicated than the CV structure assumed in Saffran et al. (1996); hence, access to syllabification cannot be assumed. (Footnote: Furthermore, if Brent's 'local comparison' implementation were based on syllables to coincide more closely with Saffran et al.'s (1996) experiment (not something Brent ever suggests), it would fail to detect any one-syllable words, clearly problematic for both Greek and English, and many languages besides.)</Paragraph>
<Section position="1" start_page="5" end_page="5" type="sub_section"> <SectionTitle> 2.2 Corpus Data </SectionTitle>
<Paragraph position="0"> In addition to the technical differences discussed above, this replication breaks new ground in terms of the language from which the training and test corpora are drawn. Modern Greek differs from English in having only five vowels, generally simpler syllable structures, and a substantial amount of inflectional morphology, particularly at the ends of words. It also contains not only preposed function words (e.g., determiners) but postposed ones as well, such as the possessive pronoun, which cannot appear utterance-initially.</Paragraph>
<Paragraph position="1"> For an in-depth discussion of Modern Greek, see Holton et al. (1997). While it is not anticipated that Modern Greek will be substantially more challenging to segment than English, the choice does serve as an additional check on current assumptions.</Paragraph>
<Paragraph position="2"> The Stephany corpus (Stephany, 1995) is a database of conversations between children and caretakers, broadly transcribed and currently without notations for lexical stress, included as part of the CHILDES database (MacWhinney, 2000). In order to preserve adequate unseen data for future simulations and experiments, and also to use data most closely approximating children of a very young age, files from the youngest child only were used in this study. However, since the heuristics and cues used here are very simple compared to vocabulary-learning models such as Brent's MDLP-1, it is anticipated that they will require relatively little context, and so the small size of the training and testing corpora will not adversely affect the results to a great degree.</Paragraph>
<Paragraph position="3"> As in other studies, only adult input was used for training and testing. In addition, non-segmental information such as punctuation, dysfluencies, parenthetical references to real-world objects, etc., was removed. Spaces were taken to represent word boundaries without comment or correction; however, it is worth noting that the transcribers sometimes departed from standard orthographic practice when transcribing certain types of word-clitic combinations. The text also contains a significant number of unrealized vowels, such as [ap] for /apo/ 'from', or [in] or even [n] for /ine/ 'is'. Such variation was not regularized, but treated as part of the learning task.</Paragraph>
<Paragraph position="4"> The training corpus contains 367 utterance tokens with a total of 1066 word tokens (319 types). Whereas the average number of words per utterance (2.9) is almost identical to that in the Korman (1984) corpus used by Christiansen et al. (1998), utterances and words were slightly longer in terms of phonemes (12.8 and 4.4 phonemes respectively, compared to 9.0 and 3.0 in Korman).</Paragraph>
<Paragraph position="5"> The test corpus consists of 373 utterance tokens with a total of 980 words (306 types). All utterances were produced by adults and addressed to the same child as in the training corpus. As with the training corpus, dysfluencies, missing words, and other irregularities were removed; word boundaries were kept as given by the annotators, even when this disagreed with standard orthographic word breaks.</Paragraph> </Section>
<Section position="2" start_page="5" end_page="8" type="sub_section"> <SectionTitle> 2.3 Model Design </SectionTitle>
<Paragraph position="0"> Used as a solitary cue (as it is in the tests run here), comparison against a global threshold may be implemented within the same framework as Brent's (1999a) TP and MI heuristics. However, it may also be implemented within a finite-state framework, with equivalent behavior. This section describes how the 'global comparison' heuristic is modeled within a finite-state framework.</Paragraph>
<Paragraph position="1"> While such an implementation is not technically necessary here, one advantage of the finite-state framework is the compositionality of finite-state machines, which allows for later composition of this approach with other heuristics depending on other cues, analogous to Christiansen et al. (1998). Since the finite-state framework selects the best path over the whole utterance, it also allows for optimization over a sequence of decisions, rather than optimizing each local decision separately. (Footnote: See Rabiner (1989) for a discussion of choosing optimization criteria. It is worth noting that this distinction does not come into play in the one-cue model reported here, as all decisions are modeled as independent of one another. However, it is expected to take on some importance in models combining multiple cues, such as those proposed in Section 4 of this paper.)</Paragraph>
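As a concrete illustration (an editorial sketch, not part of the original paper), the bigram statistics described in Section 2.1 might be precompiled offline along the following lines. The Python code below assumes each training utterance is an unsegmented string with one character per segment, and uses the standard definitions of pointwise mutual information and transitional probability; the function and variable names are illustrative, not the authors'.

import math
from collections import Counter

def bigram_statistics(utterances):
    """Precompile MI and TP for every segment bigram seen in the training corpus."""
    unigrams, bigrams = Counter(), Counter()
    for utt in utterances:
        segments = list(utt)                        # one character per segment (an assumption)
        unigrams.update(segments)
        bigrams.update(zip(segments, segments[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    mi, tp = {}, {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        mi[(x, y)] = math.log(p_xy / (p_x * p_y))   # pointwise mutual information
        tp[(x, y)] = count / unigrams[x]            # transitional probability P(y|x)
    return mi, tp

On this view, the two dictionaries would supply the arc weights (as negative MI or negative log TP) for the transducers described below.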
<Paragraph position="2"> Unlike Belz (1998), where the actual FSM structure (including classes of phonemes that could be grouped onto one arc) was learned, here the structure of each FSM is determined in advance.</Paragraph>
<Paragraph position="3"> Only the weight on each arc is derived from data. No attempt is made to combine phonemes to produce more minimal FSMs; each phoneme (and phoneme pair) is modeled separately.</Paragraph>
<Paragraph position="4"> Like Brent (1999a), and indeed most models in the literature, this model assumes (for the sake of convenience and simplicity) that the child hears each segment produced within an utterance without error. This assumption translates into the finite-state domain as a simple acceptor (or, equivalently, an identity transducer) over the segment sequence for a given utterance. (Footnote: While modeling the mishearing of segments would be more realistic and highly interesting, it is beyond the scope of this study. However, a weighted transducer representing a segmental confusion matrix could in principle replace the current identity transducer, without disrupting the general framework of the model.)</Paragraph>
<Paragraph position="5"> Word boundaries are inserted by means of a transducer that computes the cost of word-boundary insertion from the predictability scores. In the MI model, the cost of inserting a word boundary is proportional to the mutual information. For ease in modeling, this was represented with a finite-state transducer with two paths between every pair of phonemes (x,y), with zero counts modeled with a maximum weight of 99. The direct path, representing a path with no word boundary inserted, costs -MI(x,y), which is positive for bigrams of low predictability (negative MI), where word boundaries are more likely. The other path, representing a word-boundary insertion, carries the cost of the global threshold, in this case arbitrarily set to zero (although it could be optimized with held-out data). A small subset of the resulting FST, representing the connections over the alphabet {a,b}, is illustrated in Figure 1, below.</Paragraph>
<Paragraph position="6"> The best (least-cost) path over this subset model inserts boundaries between two adjacent a's and two adjacent b's, but not between ab or ba; thus the (non-Greek) string ...ababaabbaaa... would be segmented ...ababa#ab#ba#a#a... by the FSM.</Paragraph>
<Paragraph position="7"> The FSM for transitional probability has the same structure as that of MI, but with different weights on each path. For each pair of phonemes xy, the cost of the direct path from x to y is -log(P(y|x)). The global threshold cost of inserting a word boundary was set (again, arbitrarily) to the negative log of the mean of all TP values. In the two-phoneme subset (shown in Figure 2), the only change is that the direct pathway from a to b is now more expensive than the threshold path, so the best path over the FSM will insert word boundaries between a and b as well. Hence our example string ...ababaabbaaa... would be segmented ...a#ba#ba#a#b#ba#a#a... by the FSM. (The stranded 'word' #b# would of course be an error, but this problem does not arise in actual Greek input, since two adjacent b's, like all geminate consonants, are ruled out by Greek phonotactics.)</Paragraph>
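Because every boundary decision is independent in this one-cue model (see the footnote in Section 2.3 above), the best path through the two-path transducer reduces to comparing each bigram's direct-path cost against the global threshold. The Python sketch below (an editorial illustration, not the authors' AT&T-toolkit implementation) makes that equivalence concrete; the toy costs are invented values chosen only to mirror the {a,b} example in the text, not statistics estimated from the Greek corpus.

def segment(utterance, cost, threshold, unseen_cost=99.0):
    """Insert '#' wherever the direct (no-boundary) cost exceeds the boundary cost."""
    if not utterance:
        return ''
    out = [utterance[0]]
    for x, y in zip(utterance, utterance[1:]):
        if cost.get((x, y), unseen_cost) > threshold:
            out.append('#')
        out.append(y)
    return ''.join(out)

# Toy MI-style costs (-MI(x,y)): only aa and bb exceed the zero threshold.
mi_cost = {('a', 'a'): 1.0, ('b', 'b'): 1.0, ('a', 'b'): -1.0, ('b', 'a'): -1.0}
print(segment('ababaabbaaa', mi_cost, threshold=0.0))   # ababa#ab#ba#a#a

# Toy TP-style costs (-log P(y|x)): the a-to-b path is now also above threshold.
tp_cost = {('a', 'a'): 2.0, ('b', 'b'): 2.0, ('a', 'b'): 2.0, ('b', 'a'): 0.5}
print(segment('ababaabbaaa', tp_cost, threshold=1.0))   # a#ba#ba#a#b#ba#a#a

Both printed outputs match the segmentations given in the text for the MI and TP subset models, respectively.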
<Paragraph position="8"> During testing, each FST model was composed (separately) with the segment identity transducer for the utterance under consideration. A short sample section of such a composition, with the best path in bold, is shown in Figure 3. [Figure 3: sample section of the composition of a model and an utterance acceptor.] The output projection of the best path from the resulting FST was converted back into text and compared to the text of the original utterance. These compositions, best-path projections, and conversions were performed using the AT&T finite-state toolkit (Mohri et al., 1998).</Paragraph> </Section>
<Section position="3" start_page="8" end_page="9" type="sub_section"> <SectionTitle> 2.4 A Concrete Example </SectionTitle>
<Paragraph position="0"> Take, for example, an utterance from the test corpus, /tora#Telis#na#aniksume#afto/ 'now you want us to open this.' The mutual information and transitional probability figures for this utterance are given in Table 1. [Table 1: MI and TP values for each segment pair in the example utterance; values above threshold are bold, local maxima italicized.]</Paragraph>
<Paragraph position="1"> In this example, the correct boundaries fall between the pairs (a,T), (s,n), (a,a), and (e,a). Both the mutual information and the transitional probability for the first three of these pairs are above the global mean, so word boundaries are posited under both global models. (Since each of these is also a local maximum, the local models also posit boundaries between these three pairs.) The pair (e,a) is above threshold for MI but not for TP, so the global TP model fails to posit a boundary here. (Footnote: Since all values are given in terms of negative MI and negative log probability, high values for both measures indicate relatively improbable pairings.) Finally, the two local models posit a number of spurious boundaries at the other local maxima, shown by the italic numbers in the table. The resulting predictions for each model are:</Paragraph> </Section> </Section> </Paper>
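For completeness, the following Python sketch (an editorial illustration, not part of the paper) shows how a predicted segmentation can be scored against the gold standard of the Section 2.4 example by boundary precision and recall. It assumes, purely for illustration, that the global TP model posits exactly the three above-threshold boundaries mentioned in the text and no spurious ones.

def boundary_positions(segmented):
    """Offsets in the unsegmented string before which a '#' boundary appears."""
    positions, offset = set(), 0
    for ch in segmented:
        if ch == '#':
            positions.add(offset)
        else:
            offset += 1
    return positions

def precision_recall(predicted, gold):
    p, g = boundary_positions(predicted), boundary_positions(gold)
    hits = len(p & g)
    return hits / len(p), hits / len(g)

gold      = 'tora#Telis#na#aniksume#afto'
global_tp = 'tora#Telis#na#aniksumeafto'    # hypothetical prediction: misses only the (e,a) boundary
print(precision_recall(global_tp, gold))    # (1.0, 0.75)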