<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1096"> <Title>Wordform- and class-based prediction of the components of German nominal compounds in an AAC system</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Compounding in German </SectionTitle> <Paragraph position="0"> Compounding is an extremely common and productive means of forming words in German.</Paragraph> <Paragraph position="1"> In an analysis of the APA newswire corpus (a corpus of over 28 million words), we found that almost half (47%) of the word types were compounds.</Paragraph> <Paragraph position="2"> However, the compounds accounted for only a small portion of the overall token count (7%). This suggests that, as expected, many of them are productively formed hapax legomena or very rare words (83% of the compounds had a corpus frequency of 5 or lower).</Paragraph> <Paragraph position="3"> By far the most common type of German compound is the N+N type, i.e., a sequence of two nouns (62% of the compounds in our corpus have this shape). Thus, we decided to limit ourselves, for now, to handling compounds of this shape.</Paragraph> <Paragraph position="4"> In German, nominal compounds, including the N+N type, are right-headed, i.e., the rightmost element of the compound determines its basic semantic and morphosyntactic properties.</Paragraph> <Paragraph position="5"> Thus, the context of a compound is often more informative about its right element (the head) than about its left element (the modifier).</Paragraph> <Paragraph position="6"> In modifier position, nouns are sometimes followed by a linking suffix (Krott, 2001; Dressler et al., 2001), or they take other special inflectional shapes.</Paragraph> <Paragraph position="7"> As a consequence of the presence of linking suffixes and related patterns, the forms that nouns take in modifier position are sometimes specific to this position only, i.e., they are bound forms that do not occur as independent words.</Paragraph> <Paragraph position="8"> We did not parse special modifier forms in order to reconstruct their independent nominal forms. Thus, we treat all inflected modifier forms, including bound forms, as unanalyzed primitive nominal wordforms.</Paragraph> <SectionTitle> 4 The split compound prediction model </SectionTitle> <Paragraph position="9"> In Baroni et al.
(2002), we present and evaluate a split compound model in which N+N compounds are predicted by treating them as the sequence of a modifier and a head.</Paragraph> <Paragraph position="10"> Modifiers are predicted on the basis of weighted probabilities deriving from the following three terms: the unigram and bigram training corpus frequency of nominal wordforms as modifiers or independent words, and the training corpus type frequency of nominal wordforms as modifiers; these terms are combined as in equation 2 (see footnote 2).</Paragraph> <Paragraph> Footnote 2: Here and below, c stands for the last word in the left context of w; w is the suffix of the word to be predicted minus the (possibly empty) prefix typed by the user up to the current point.</Paragraph> <Paragraph position="12"> The type frequency of nouns as modifiers is determined by the number of distinct compounds in which a noun form occurs as modifier.</Paragraph> <Paragraph position="13"> Heads are predicted on the basis of weighted probabilities deriving from three terms analogous to the ones used for modifiers: the unigram and bigram frequency of nouns as heads or independent words, and the type frequency of nouns as heads; these terms are combined as in equation 3.</Paragraph> <Paragraph position="15"> The type frequency of nouns as heads is determined by the number of distinct compounds in which a noun form occurs as head.</Paragraph> <Paragraph position="16"> Given that compound heads determine the syntactic properties of compounds, bigrams for head prediction are collected by considering not the immediate left context of heads (i.e., their modifiers), but the word preceding the compound (e.g., die Abendsitzung is counted as an instance of the bigram die Sitzung).</Paragraph> <Paragraph position="17"> For reasons of size and efficiency, single uni- and bigram count lists are used for predicting modifiers and heads (see footnote 3). For the same reasons, and to minimize the chances of over-fitting to the training corpus, all n-gram/frequency tables are trimmed by removing elements that occur only once in the training corpus. We currently use a simple interpolation model, in which all terms are assigned equal weight.</Paragraph> <Paragraph> Footnote 3: This has a distorting effect on the bigram counts (words occurring before compounds are counted twice, once as the left context of the modifier and once as the left context of the head). However, preliminary experiments indicated that the empirical effect of this distortion is minimal.</Paragraph>
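<Paragraph> Equations 2 and 3 themselves are not preserved in this extraction. Purely as a sketch, assuming the equal-weight interpolation described above and the notation of footnote 2 (c is the last word of the left context, w the candidate completion), the two wordform-based probabilities plausibly take a form along the following lines, where the unigram and bigram estimates pool occurrences of a noun as modifier/head and as independent word, and t_mod(w) and t_head(w) are the numbers of distinct compounds in which w occurs as modifier and as head; the original notation and normalization may differ:

    P_{mod}(w)  = \frac{1}{3}\left[ P_{uni}(w) + P_{bi}(w \mid c) + \frac{t_{mod}(w)}{\sum_{w'} t_{mod}(w')} \right]

    P_{head}(w) = \frac{1}{3}\left[ P_{uni}(w) + P_{bi}(w \mid c) + \frac{t_{head}(w)}{\sum_{w'} t_{head}(w')} \right]

P_uni and P_bi would be relative frequencies read off the trimmed unigram and bigram tables described above. </Paragraph>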
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Improving head prediction </SectionTitle> <Paragraph position="0"> While we obtained encouraging results with it (Baroni et al., 2002), we feel that a particularly unsatisfactory aspect of the model described in the previous section is that information on the modifier is not exploited when trying to predict the head of a compound. Intuitively, knowing what the modifier is should help us guess the head of a compound.</Paragraph> <Paragraph position="3"> However, constructing a plausible head-prediction term based on modifier-head dependencies is not straightforward.</Paragraph> <Paragraph position="4"> The wordform-based compound-bigram frequency of a head, i.e., the number of times a specific head occurs after a specific modifier, is not a very useful measure: counting how often a modifier-head pair occurs in the training corpus is equivalent to collecting statistics on unanalyzed compounds, and it will not help us generalize beyond the compounds encountered in the training corpus.</Paragraph> <Paragraph position="5"> Moreover, if a specific modifier-head bigram is frequent, i.e., the corresponding compound is a frequent word, it is probably better to treat the whole compound as an unanalyzed lexical unit anyway.</Paragraph> <Paragraph position="6"> POS-based modifier-head bigrams are not going to be of any help either, since we are considering only N+N compounds, and thus we would collect a single POS bigram (N N) with probability 1 (see footnote 4). We decided instead to try to exploit a semantically driven route. It seems plausible that modifiers that are semantically related will tend to co-occur with heads that are, in turn, semantically related. Consider for example the relationship between the class of fruits and the class of sweets in English compounds. It is easy to think of compounds in which a member of the class of fruits (bananas, cherries, apricots...) modifies a member of the class of sweets (pies, cakes, muffins...). Thus, if you have to predict the head of a compound given a fruit modifier, it would be reasonable, all else being equal, to guess some kind of sweet.</Paragraph> <Paragraph> Footnote 4: Even if the model handled other compound types, very few POS combinations are attested within compounds.</Paragraph> <Paragraph position="7"> While semantically driven prediction makes sense in principle, clustering nouns into semantic classes is certainly not a trivial job, and, if a large input lexicon must be partitioned, it is not a task that could be accomplished by a human expert. Drawing inspiration from Brown et al. (1990), we instead constructed semantic classes with a clustering algorithm that extracts them from a corpus, on the basis of the average mutual information (MI) between pairs of words (Rosenfeld, 1996) (see footnote 5).</Paragraph> <Paragraph> Footnote 5: We are aware of the fact that other measures of lexical association have been proposed (Evert and Krenn, 2001, and references quoted there) and are sometimes claimed to be more reliable than MI, and we are planning to run our clustering algorithm using alternative measures.</Paragraph> <Paragraph position="8"> MI values were computed using Adam Berger's trigger toolkit (Berger, 1997). The same training corpus of about 25.5M words (and with N+N compounds split) that we describe below was used to collect MI values for noun pairs. All modifiers and heads of N+N compounds and all corpus words that were parsed as nouns by the Xerox morphological analyzer (Karttunen et al., 1997) were counted as nouns for this purpose.</Paragraph> <Paragraph position="9"> MI was computed only for pairs that co-occurred at least three times in the corpus (thus, only a subset of the input nouns appears in the output list). Valid co-occurrences were bounded by a maximal distance between elements of 500 words, and a minimal distance of 2 words (to avoid lexicalized phrases, such as proper names or phrasal loanwords).</Paragraph>
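<Paragraph> The following is a minimal sketch only of the pair-collection step just described; the actual pair list was produced with the trigger toolkit cited above, which computes average mutual information, whereas this toy version uses a simplified pointwise-MI-style score as a stand-in. All function and variable names are hypothetical, and the quadratic loop is meant for illustration on a toy corpus, not for 25.5M words. </Paragraph>

    # Toy stand-in for the noun-pair collection step: count noun pairs that
    # co-occur at a distance of 2-500 tokens, keep pairs seen at least 3 times,
    # and rank them with a pointwise-MI-style score (not the toolkit's average MI).
    from collections import Counter
    from itertools import combinations
    import math

    MIN_DIST, MAX_DIST, MIN_PAIR_COUNT = 2, 500, 3

    def collect_pairs(tokens, nouns):
        """Count unordered noun pairs whose token distance falls within the window."""
        noun_positions = [(i, t) for i, t in enumerate(tokens) if t in nouns]
        pair_counts = Counter()
        for (i, a), (j, b) in combinations(noun_positions, 2):
            if a != b and (j - i) in range(MIN_DIST, MAX_DIST + 1):
                pair_counts[tuple(sorted((a, b)))] += 1
        return pair_counts

    def score_pairs(pair_counts, tokens, nouns):
        """Score the pairs that survive the minimum co-occurrence threshold."""
        noun_freq = Counter(t for t in tokens if t in nouns)
        n_nouns = sum(noun_freq.values())
        n_pairs = sum(pair_counts.values())
        scores = {}
        for (a, b), c in pair_counts.items():
            if c >= MIN_PAIR_COUNT:
                p_ab = c / n_pairs
                p_a, p_b = noun_freq[a] / n_nouns, noun_freq[b] / n_nouns
                scores[(a, b)] = math.log(p_ab / (p_a * p_b))
        return scores

<Paragraph> A toy call would be score_pairs(collect_pairs(tokens, nouns), tokens, nouns), which yields the kind of scored noun-pair list that the clustering step described next takes as input. </Paragraph>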
<Paragraph position="10"> Having obtained a list of pairs from the toolkit, the next step was to cluster them into classes, by grouping together nouns with a high MI. For space reasons, we do not discuss our clustering algorithm in detail here (we motivate and analyze the algorithm in a paper currently in preparation).</Paragraph> <Paragraph position="11"> In short, the algorithm starts by building classes out of nouns that occur with very few other nouns in the MI pair list, and whose assignment to classes is therefore relatively unambiguous; it then adds progressively more ambiguous nouns (ambiguous in the sense that they occur in a progressively larger number of MI pairs, and thus it becomes harder to determine with which other nouns they should be clustered). Each input word is assigned to a single class (thus, we do not try to capture polysemy). Moreover, not all words in the input are clustered (see step 5 below; see also footnote 7). Schematically, the algorithm works as follows (the input vocabulary of step 1 is simply a list of all the words that occur at least once in the MI pair list); the numbered steps themselves are not preserved in this extraction.</Paragraph> <Paragraph> Footnote 7 (beginning not preserved): ...algorithm that tried to cluster all words, through multiple passes. The classes generated by the non-iterative procedure described in the text, however, gave better results, when integrated in the head prediction task, than those generated with the iterative version.</Paragraph> <Paragraph position="12"> Inspection of the class list indicates that most classes constructed by our algorithm are intuitively reasonable, while there are also, undoubtedly, classes that contain heterogeneous elements and missed generalizations (see Table 1). The algorithm generated 3744 classes, containing a total of 14059 nouns (about one third of the nouns in the training corpus).</Paragraph> <Paragraph position="13"> Class-based modifier-head bigrams were then collected by labeling all the modifiers and heads in the training corpus with their semantic classes, and counting how often each combination of modifier and head class occurred.</Paragraph> <Paragraph position="14"> Like the other tables, the class-based bigram table was trimmed by removing combinations that occurred only once in the training corpus. We compute the class-based probability of a compound head given its modifier by combining a class-bigram term with a second, class-size-dependent term; the latter term assigns equal probability to all members of a class, but lower probability to members of larger classes.</Paragraph> <Paragraph position="17"> Class-based probability is added to the wordform-based terms of equation 3, obtaining equation 7, the formula we use to compute head probability.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> The new split compound model and a baseline model with no compound processing were evaluated in a series of simulations, using the APA newswire articles from January to September 1999 (containing 25,466,500 words) as the training corpus, and all the 90,643 compounds found in the Frankfurter Rundschau newspaper articles from June 29 to July 12 of 1992 (in bigram context) as the testing targets (see footnote 8). In order to train and test the split compound model, all words in both sets were run through the morphological analyzer, and all N+N compounds were split into their modifier and head surface forms.</Paragraph> <Paragraph position="1"> We first ran simulations in which compound heads were predicted using each of the terms in equation 7 separately.</Paragraph>
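<Paragraph> The class-based term and equation 7 are likewise not preserved in this extraction. As a sketch only, following the description in section 4.1 (a class-bigram term multiplied by a uniform distribution over class members, added with equal weight to the wordform-based terms of equation 3), a plausible shape is the following, where C(x) is the semantic class of x, m is the modifier already selected by the user, c is the last word of the compound's left context, and t_head(w) is the number of distinct compounds in which w occurs as head; the original notation and weighting may differ:

    P_{class}(w \mid m) = P(C(w) \mid C(m)) \cdot \frac{1}{|C(w)|}

    P_{head}(w) = \frac{1}{4}\left[ P_{uni}(w) + P_{bi}(w \mid c) + \frac{t_{head}(w)}{\sum_{w'} t_{head}(w')} + P_{class}(w \mid m) \right]

Here P(C(w) | C(m)) would be estimated from the trimmed class-based bigram table, and the 1/|C(w)| factor is the term that assigns equal probability to all members of a class but lower probability to members of larger classes. </Paragraph>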
<Paragraph> The results are reported in table 2.</Paragraph> <Paragraph position="2"> As an independent predictor, the class-based term performs slightly worse than wordform-based bigram prediction.</Paragraph> <Paragraph position="3"> We then simulated head and compound prediction using the head prediction model of equation 7.</Paragraph> <Paragraph> Footnote 8: In other experiments, including those reported in Baroni et al. (2002), we tested on another section of the APA corpus from the same year. Not surprisingly, ksr's in the experiments with the APA corpus were overall higher, and the difference between the split compound and baseline models was less dramatic (because many compounds in the test set were already in the training corpus).</Paragraph> <Paragraph position="4"> The results of this simulation are reported in table 3, together with the results of a simulation in which class-based prediction was not used, and the results obtained with the baseline no-split-compound model.</Paragraph> <Paragraph position="5"> Table 3 compares three models: split with class bigrams, split without class bigrams, and no split (the ksr values themselves are not preserved in this extraction). Class bigrams lead to an improvement in head prediction of more than 2% over the split compound model without class-based prediction. This translates into an improvement of 1.3% in the prediction of whole compounds. Overall, the split compound model with class bigrams leads to an improvement of more than 15% over the baseline model.</Paragraph> <Paragraph position="6"> The results of these experiments confirm the usefulness of the split compound model, and they also show that the addition of class-based prediction improves the performance of the model, even if this improvement is not dramatic. Clearly, future research should concentrate on whether alternative measures of association, clustering techniques and/or integration strategies can make class-based prediction more effective.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Preliminary experiments in integration </SectionTitle> <Paragraph position="0"> In a working word prediction system, compounds are obviously not the only type of words that the user needs to type. Thus, the predictions provided by the compound model must be integrated with predictions of simple words. In this section, we report preliminary results we obtained with a model limited to the integration of N+N compound prediction with simple noun prediction.</Paragraph> <Paragraph position="1"> In our approach to compound/simple prediction integration, candidate modifiers are presented together and in competition with simple word solutions as soon as the user starts typing a new word. The user can distinguish modifiers from simple words in the prediction window because the former are suffixed with a special symbol (for example an underscore). If the user selects a modifier, the head prediction model is activated, and the user can start typing the prefix of the desired compound head, while the system suggests completions based on the head prediction model.</Paragraph> <Paragraph position="2"> For example, if the user has just typed Abe, the prediction window could contain, among other things, the candidates Abend and Abend_. If the user selects the latter, possible head completions for a compound having Abend as its modifier are presented.</Paragraph>
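<Paragraph> The sketch below is a schematic illustration only of the interaction just described; it is not the authors' implementation, all names are hypothetical, and the modifier-penalty variant introduced below is omitted. It simply shows simple-word and modifier candidates competing in one ranked list, with modifiers marked by "_". </Paragraph>

    # Toy sketch of the integrated prediction window. p_simple and p_mod are
    # assumed to be precomputed probability tables for the current left context
    # (see the scoring described in the surrounding text).
    def candidates(prefix, p_simple, p_mod, n=5):
        """Return the top-n completions for a typed prefix."""
        pool = []
        for word, p in p_simple.items():
            if word.startswith(prefix):
                pool.append((p, word))         # simple-word candidate
        for word, p in p_mod.items():
            if word.startswith(prefix):
                pool.append((p, word + "_"))   # modifier candidate, marked with "_"
        pool.sort(reverse=True)
        return [w for _, w in pool[:n]]

    # If the user picks a candidate ending in "_", the system would switch to
    # head prediction mode and rank completions with the head model instead.

<Paragraph> With the prefix "Abe", such a list could contain both Abend and Abend_, as in the example above. </Paragraph>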
<Paragraph> Modifier candidates are proposed on the basis of Pmod(w) computed as in equation 2 above. Simple noun candidates are proposed on the basis of their unigram and bigram probabilities (interpolated with equal weights).</Paragraph> <Paragraph position="3"> We experimented with two versions of the integrated model.</Paragraph> <Paragraph position="4"> In one, modifier and simple noun candidates are ranked directly on the basis of their probabilities. This risks leading to over-prediction of modifier candidates (recall that, from the point of view of token frequency, compounds are much rarer than simple words; the prediction window should not be cluttered by too many modifier candidates when, most of the time, users will want to type simple words).</Paragraph> <Paragraph position="5"> Thus, we constructed a second version of the integrated model in which Pmod(w) is multiplied by a penalty term. This term discounts the probability of modifier candidates built from nominal wordforms that occur more frequently in the training corpus as independent nouns than as modifiers (forms that are equally or more frequent in modifier position are not affected by the penalty).</Paragraph> <Paragraph position="6"> The same training corpus and procedures described in section 5 above were used to train the two versions of the integrated model, and the baseline model that does not use compound prediction.</Paragraph> <Paragraph position="7"> These models were tested by treating all the nouns in the test corpus as prediction targets. The integrated test set contained 90,643 N+N tokens and 395,731 additional nouns. The results of the simulations are reported in table 4.</Paragraph> <Paragraph position="8"> Table 4 (ksr by model):

    Model          integrated, no penalty   integrated, w/ penalty   simple pred only
    compound ksr   47.6                     45.9                     34.9
    simple n ksr   40.5                     42.5                     45.6
    combined ksr   42.5                     43.5                     42.6

Because the simple noun predictions get in the way, the integrated models perform compound prediction worse than the non-integrated split compound model of table 3. However, the integrated models still perform compound prediction considerably better than the baseline model.</Paragraph> <Paragraph position="9"> The integrated model with modifier penalties performs worse than the model without penalties when predicting compounds. This is expected, since the modifier penalties make this model more conservative in proposing modifier candidates.</Paragraph> <Paragraph position="10"> However, the model with penalties outperforms the model without penalties in simple noun prediction. Given that in our test set (and, we expect, in most German texts) simple noun tokens greatly outnumber compound tokens, this results in an overall better performance of the model with penalties.</Paragraph> <Paragraph position="11"> The integrated model with penalties achieves an overall ksr that is about 1% higher than that achieved by the baseline model.</Paragraph> <Paragraph position="12"> Thus, these preliminary experiments indicate that an approach to integrating compound and simple word predictions along the lines sketched at the beginning of this section, and in particular the version of the model in which modifier predictions are penalized, is feasible. However, the model is clearly in need of further refinement, given that the improvement over the baseline model is currently minimal.</Paragraph> </Section> </Paper>