<?xml version="1.0" standalone="yes"?> <Paper uid="J96-2001"> <Title>Psycholinguistics</Title> <Section position="2" start_page="0" end_page="158" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> As a number of writers on morphology have noted (most recently and notably Beard [1995]), it is common to find that a particular affix or other morphological marker serves more than one function in a language. For example, in many morphologically complex languages several slots in a paradigm are often filled with the same form; put another way, it is common to find that a particular morphological form is in fact ambiguous between several distinct functions. This phenomenon, which in the domain of inflectional morphology is termed syncretism, can be illustrated by a Dutch example such as lopen 'walk', which can be either the infinitive form ('to walk') or the finite plural (present tense) form ('we, you, or they walk'). In some cases, syncretism is completely systematic: for example, the case cited in Dutch, where the -en suffix can always function in the two ways cited; or in Latin, where the plural dative and ablative forms of nouns and adjectives are always identical, no matter what paradigm the noun belongs to. In other cases, a particular instance of syncretism may be displayed only in some paradigms: for example, Russian feminine nouns, such as loshad' 'horse' (Cyrillic лошадь), have the same form, loshadi (Cyrillic лошади), for both the genitive singular and the nominative plural, whereas masculine nouns typically distinguish these forms. * Wundtlaan 1, 6525 XD, Nijmegen, The Netherlands. E-mail: baayen@mpi.nl † 600 Mountain Avenue, Murray Hill, NJ 07974, USA. E-mail: rws@research.att.com © 1996 Association for Computational Linguistics. Computational Linguistics, Volume 22, Number 2. 
In still other cases, the syncretism may be partial in that two forms may be identical at one level of representation (say, orthography) but not at another (say, pronunciation). For example, the written form goroda in Russian (Cyrillic города) may be either the nominative plural or the genitive singular of 'city'. In the genitive singular, the stress is on the first syllable (/g%rodA/), whereas in the nominative plural the stress falls on the final syllable (/gorAd~a/); note that the difference in stress results in very different vowel qualities for the two forms, as indicated in the phonetic transcriptions.</Paragraph> <Paragraph position="1"> Syncretism and related morphological ambiguities present a problem for many NL applications where lexical disambiguation is important; cases where the orthographic form is identical but the pronunciations of the various functions differ are particularly important for speech applications, such as text-to-speech, since appropriate word pronunciations must be computed from orthographic forms that underspecify the necessary information. Ideally one would like to build models that use contextual information to perform lexical disambiguation (Yarowsky 1992, 1994), but such models must be trained on specialized tagged corpora (either hand-generated or semi-automatically generated), and such training corpora are often not available, at least in the early phases of constructing a particular application. Lacking good contextual models, one is forced to fall back on estimates of the lexical prior probabilities for the various functions of a form. Following standard terminology, a lexical prior can be defined as follows: Imagine that a given form is n-way ambiguous; the lexical prior probability of sense i of this form is simply the probability of sense i independent of the context in which the particular instantiation of the form occurs. 
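Stated symbolically, and with notation that is an assumption of this sketch rather than the original's (w for the form, s_1, ..., s_n for its senses, c for a context of occurrence), the lexical prior can be read as the distribution over senses with the context summed out:

```latex
\[
  P(s_i \mid w) \;=\; \sum_{c} P(s_i \mid w, c)\, P(c \mid w),
  \qquad
  \sum_{i=1}^{n} P(s_i \mid w) = 1 .
\]
```

A contextual model would instead use P(s_i | w, c) directly for the context at hand.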
Assuming one has a tagged corpus, one can usually get reasonable estimates of the lexical priors for the frequent forms (such as Dutch lopen 'walk') by simply counting the number of times the form occurs in each of its various functions and dividing by the total number of instances of the form (in any function). This yields the Maximum Likelihood Estimate (MLE) for the lexical prior probability. But for infrequent or unseen forms, it is less clear how to compute the estimate. Consider another Dutch example like aanlokken 'entice, appeal'. This form occurs only once, as an infinitive, in the Uit den Boogaart (henceforth UdB) corpus (Uit den Boogaart 1975); in other words it is a hapax legomenon (from Greek hapax 'once', legomenon 'said') in this corpus. Obviously the lexical prior probability of this form expressing the finite plural is not zero, so the MLE, which assigns that function a probability of zero, is a poor estimate in such cases. When one considers forms that do not occur in the training corpus (e.g., bedraden 'to wire') the situation is even worse. The problem, then, is to provide a more reasonable estimate of the relative probabilities of the various potential functions of such forms. 1 2. Estimating the Lexical Priors for Rare Forms For a common form such as lopen 'walk' a reasonable estimate of the lexical prior probabilities is the MLE, computed over all occurrences of this form. In the UdB corpus, lopen occurs 92 times as an infinitive and 43 times as a finite plural, so the MLE of the probability of the infinitive is 92/135 = 0.68. 1 Even models of disambiguation that make use of context, such as statistical n-gram taggers, often presume some estimate of lexical priors, in addition to requiring estimates of the transition probabilities of sequences of lexical tags (Church 1988; DeRose 1988; Kupiec 1992), and this again brings up the question of what to do about unseen or low-frequency forms. 
In working taggers, a common approach is simply to apply a uniform small probability to the various senses of unseen or low-frequency forms: this was done in the tagger discussed in Church (1988), for example.</Paragraph> <Paragraph position="2"> a function of the (natural) log of the frequency of the word forms. The horizontal solid line represents the overall MLE, the relative frequency of the infinitive as computed over all tokens; the horizontal dashed line represents the relative frequency of the infinitive among the hapax legomena. The solid curve represents a locally weighted regression smoothing (Cleveland 1979).</Paragraph> <Paragraph position="3"> For low-frequency forms such as aanlokken or bedraden, one might consider basing the MLE on the aggregate counts of all ambiguous forms in the corpus. In the UdB corpus, there are 21,703 infinitive tokens and 9,922 finite plural tokens, so the MLE for aanlokken being an infinitive would be 21,703/31,625 = 0.69. Note, however, that the application of this overall MLE presupposes that the relative frequencies of the various functions of a particular form are independent of the frequency of the form itself. For the Dutch example at hand, this presupposition predicts that if we were to classify -en forms according to their frequency, and then for each frequency class thus defined, plot the relative frequency of infinitives and finite plurals, the regression line should have a slope of approximately zero.</Paragraph> <Section position="1" start_page="156" end_page="157" type="sub_section"> <SectionTitle> 2.1 Dutch Verb Forms in -en </SectionTitle> <Paragraph position="0"> Figure 1 shows that this prediction is not borne out. This scatterplot shows the relative frequency of the infinitive versus the finite plural, as a function of the log-frequency of the -en form. At the left-hand edge of the graph, the relative frequency of the infinitives for the hapax legomena is shown. 
This proportion is also highlighted by the dashed horizontal line. As we proceed to the right, we observe that there is a general downward curvature representing a lowering of the proportion of infinitives for the higher-frequency words. This trend is captured by the solid nonparametric regression line; an explanation for this trend will be forthcoming in Section 3. (It will be noted that in Figure 1 the variance is fairly small for the lower-frequency ranges, higher for the middle ranges, and then small again for the high-frequency ranges; anticipating somewhat, we note the same trends in Figures 2 and 3. This variance pattern follows from the high variability in the absolute numbers of types realized, especially in the middle log-frequency classes, in combination with the assumption that for any log-frequency class, the proportion for that class is itself a random variable.) The solid horizontal line represents the proportion of infinitives calculated over all frequency classes, and the dashed horizontal line represents the proportion of infinitives calculated over just the hapax legomena. The two horizontal lines can be interpreted as MLEs for the probability of an -en form being an infinitive: the solid line or overall MLE clearly provides an estimate based on the whole population, whereas the dashed line or hapax-based MLE provides an estimate for the hapaxes. The overall MLE computes a lower relative frequency for the infinitives, compared to the hapax-based MLE. The question, then, is: Which of these MLEs provides a better estimate for low-frequency types? 
In particular, for types that have not been seen in the training corpus, and for which we therefore have no direct estimate of the word-specific prior probabilities, we would like to know whether the hapax-based or overall MLE provides a better estimate.</Paragraph> <Paragraph position="1"> To answer this question we compared the accuracy of the overall and hapax-based MLEs using tenfold cross-validation. We first randomized the list of -en tokens from the UdB corpus, then divided the randomized list into ten equal-sized parts. Each of the ten parts was held out in turn as the test set, and the remaining nine-tenths was used as the training set over which the two MLE estimates were computed. The results are shown in Table 1. In this table, No(inf) and No(pl) are the observed numbers of infinitive and plural tokens in the held-out portion of the data that belong to types not seen in the training data. The final four rows compare the estimates for these numbers of tokens given the overall MLE (Eo[No(inf)] and Eo[No(pl)]), versus the hapax-based MLE (Eh[No(inf)] and Eh[No(pl)]). For all ten runs, the hapax-based MLE is clearly a far better predictor than the overall MLE. 2</Paragraph> </Section> <Section position="2" start_page="157" end_page="158" type="sub_section"> <SectionTitle> 2.2 English Verb Forms in -ed </SectionTitle> <Paragraph position="0"> The pattern that we have observed for the Dutch infinitive-plural ambiguity can be replicated for other cases of morphological ambiguity. Consider the case of English verbs ending in -ed, which are systematically ambiguous between being simple past tenses and past participles. The upper panel of Figure 2 shows the distribution of the relative frequencies of the two functions, plotted against the natural log of the frequency for the Brown corpus (Francis and Kucera 1982). (All lines, including the nonparametric regression line, are interpretable as in Figure 1.) 
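The tenfold cross-validation procedure described in Section 2.1 can be sketched in code. This is a minimal illustration under stated assumptions, not the original implementation: tokens are taken to be (form, function) pairs with exactly two functions, the labels "inf" and "pl" are illustrative, and the helper names mle_estimates and cross_validate are hypothetical.

```python
import random
from collections import Counter


def mle_estimates(train):
    """Return (overall MLE, hapax-based MLE) for P(function == "inf").

    train: non-empty list of (form, function) token pairs. The labels
    "inf" and "pl" are illustrative assumptions, not the corpus tag set.
    """
    form_freq = Counter(form for form, _ in train)
    n_inf = sum(1 for _, fn in train if fn == "inf")
    # Functions of tokens whose form occurs exactly once in the training set.
    hapax_fns = [fn for form, fn in train if form_freq[form] == 1]
    overall = n_inf / len(train)
    # Assumption: fall back to the overall MLE when there are no hapaxes.
    hapax = hapax_fns.count("inf") / len(hapax_fns) if hapax_fns else overall
    return overall, hapax


def cross_validate(tokens, k=10, seed=0):
    """Yield observed and expected counts for unseen types, one dict per fold."""
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k parts of near-equal size
    for i, held_out in enumerate(folds):
        train = [tok for j, fold in enumerate(folds) if j != i for tok in fold]
        seen = {form for form, _ in train}
        # Held-out tokens whose type never occurs in the training set.
        unseen_fns = [fn for form, fn in held_out if form not in seen]
        n0 = len(unseen_fns)
        n0_inf = unseen_fns.count("inf")
        overall, hapax = mle_estimates(train)
        yield {
            "N0_inf": n0_inf, "N0_pl": n0 - n0_inf,
            "Eo_inf": overall * n0, "Eo_pl": (1 - overall) * n0,
            "Eh_inf": hapax * n0, "Eh_pl": (1 - hapax) * n0,
        }
```

Calling list(cross_validate(tokens)) on a list of tagged tokens gives ten dictionaries whose entries correspond to the observed counts for unseen types and to their expected values under the overall and hapax-based MLEs.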
Results of a tenfold cross-validation are shown in Table 2. Clearly, in this case the magnitude of the difference between the overall MLE and the hapax-based MLE is smaller than in the previous example: indeed in cross-validation runs 6, 8, and 9, the overall MLE is superior.</Paragraph> <Paragraph position="1"> Nonetheless, the hapax-based MLE remains a significantly better predictor overall. 3 2 A paired t-test on the ratios No(inf)/No(pl) versus Eo[No(inf)]/Eo[No(pl)] reveals a highly significant difference (t9 = 13.4, p &lt; 0.001); conversely a comparison of No(inf)/No(pl) and Eh[No(inf)]/Eh[No(pl)] reveals no significant difference (t9 = 0.96, p > 0.10).</Paragraph> <Paragraph position="2"> 3 A paired t-test on the ratios No(vbn)/No(vbd) versus Eo[No(vbn)]/Eo[No(vbd)] reveals a significant difference (t9 = 2.47, p &lt; 0.05); conversely a comparison of No(vbn)/No(vbd) and Eh[No(vbn)]/Eh[No(vbd)] reveals no significant difference (t9 = 0.48, p > 0.10).</Paragraph> <Paragraph position="3"> Results of tenfold cross-validation for Dutch -en verb forms from the Uit den Boogaart corpus. Columns represent different cross-validation runs. N(inf) and N(pl) are the number of tokens of the infinitives and finite plurals, respectively, in the training set. N1(inf) and N1(pl) are the number of tokens of the infinitives and finite plurals, respectively, among the hapaxes in the training set. OMLE and HMLE are, respectively, the overall and hapax-based MLEs. No(inf) and No(pl) denote the number of tokens in the held-out portion that have not been observed in the training set. 
The expected numbers of tokens of infinitives and plurals for types unseen in the training set, using the overall MLE, are denoted as Eo[No(inf)] and Eo[No(pl)]; the corresponding estimates using the hapax-based MLE are denoted as Eh[No(inf)] and Eh[No(pl)].</Paragraph> </Section> <Section position="3" start_page="158" end_page="158" type="sub_section"> <SectionTitle> 2.3 Dutch Words in -en: A More General Problem </SectionTitle> <Paragraph position="0"> In the two examples we have just considered, the hapax-based MLE, while being a better predictor of the a priori lexical probability for unseen cases than the overall MLE, does not actually yield a different prediction as to which function of a form is more likely. This does not hold generally, however, and the bottom panel of Figure 2 presents a case where the hapax-based MLE does yield a different prediction as to which function is more likely. In this plot we consider Dutch word forms from the UdB corpus ending in -en. As we have seen, Dutch -en is used as a verb marker: it marks the infinitive, present plural, and for strong verbs, also the past plural; it is also used as a marker of noun plurals. The case of noun plurals is somewhat different from the preceding two cases since it is not, strictly speaking, a case of morphological syncretism. However, it is a potential source of ambiguity in text analysis, since a low-frequency form in -en, where one may not have seen the stem of the word, could potentially be either a noun or a verb. Also, systematic ambiguity exists among cases of noun-verb conversion: for example, fluiten is either a noun meaning 'flutes' or a verb meaning 'to play the flute'; spelden means either 'pins' or 'to pin'; and ploegen means either 'ploughs' or 'to plough'. Results for a tenfold cross-validation for these data are shown in Table 3. 4 In this case, the overall MLE would lead one to predict that for an unseen form in -en, the verbal function would be more likely. 
Contrariwise, the hapax-based MLE predicts that the nominal function would be more likely. Again, it is the hapax-based MLE that proves to be superior.</Paragraph> </Section> </Section> </Paper>