<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3210"> <Title>A Naive Theory of Affixation and an Algorithm for Extraction</Title> <Section position="3" start_page="80" end_page="82" type="metho"> <SectionTitle> 3 An Algorithm for Affix Extraction </SectionTitle> <Paragraph position="0"> The key question is, if words in natural languages are constructed as W explained above, can we recover the segmentation? That is, can we find B and S, given only W? The answer is yes, we can partially decide this. To be more specific, we can compute a score Z such that Z(x) > Z(y) if x [?] SW and y /[?] SW. In general, the converse need not hold, i.e if both x,y [?] SW, or both x,y /[?] SW, then it may still be that Z(x) > Z(y). This is equivalent to constructing a ranked list of all possible segments, where the true members of SW appear at the top, and somewhere down the list the junk, i.e nonmembers of SW, start appearing and fill up the rest of the list. Thus, it is not said where on the list the true-affixes/junk border begins, just that there is a consistent such border.</Paragraph> <Paragraph position="1"> Now, howshouldthislistbecomputed? Giventhe FFA, it's tempting to look at frequencies alone, i.e just go through all words and make a list of all segments, ranking them by frequency? This won't do it because 1. it doesn't compensate between segments of different length; naturally, short segments will be more frequent than long ones, solely by virtue of their shortness 2. it overcounts ill-segmented true affixes, e.g -ng will invariably get a higher (or equal) count than -ing. What we will do is a modification of this strategy, because 1. can easily be amended by subtracting estimated prior frequencies (under ACA) and there is a clever way of tackling 2. Note that, to amend 2., when going through w and each striangleleftw, it would be nice if we could count s only when it is well-segmented in w. We are given only W so this information is not available to us, but, the FFA assumption let's us make a local guess of it.</Paragraph> <Paragraph position="2"> We shall illustrate the idea with an example of an evolving frequency curve of a word &quot;playing&quot; and its segmentations &quot;playing&quot;, &quot;aying&quot;, &quot;ying&quot;, &quot;ing&quot;, &quot;ng&quot;, &quot;g&quot; (W being the set of words from an English bible corpus (Eng, 1977)). Figure 1 shows a frequency for s triangleleft w = playing.</Paragraph> <Paragraph position="3"> frequency curve fW(s) and its expected frequency curve eW(s). The expected frequency of a suffix s doesn't depend on the actual characters of s and is defined as:</Paragraph> <Paragraph position="5"> tion that its characters are uniformly distributed. We don't simply use 26 in the case of lowercase English since not all characters are equally frequent. Instead we estimate the size of a would-be uniform distribution from the entropy of the distribution of the characters in W. This gives r [?] 18.98 for English and other languages with a similar writing practice.</Paragraph> <Paragraph position="6"> Next, define the adjusted frequency as the difference between the observed frequency and the expected frequency:</Paragraph> <Paragraph position="8"> It is the slope of this curve that predicts the presence of a good split. 
After these examples, we are ready to define the segmentation score of a suffix relative to a word, Z:

    Z_W(s, w) = f̂_W(s) − f̂_W(s′)

where s′ ◁ w is the segmentation of w one character longer than s (for s = w we take f̂_W(s′) = 0); this is the slope of the adjusted frequency curve at the split point. Table 2 shows the evolution of exact values from the running example.

To move from a Z-score for a segment that is relative to a word, we simply sum over all words to get the final score Z : S_W → Q:

    Z_W(s) = Σ_{w ∈ W} Z_W(s, w)    (1)

To be extra clear, the FFA assumption is "exploited" in two ways. On the one hand, frequent affixes get many opportunities to get a score (which could, however, be negative) in the final sum over w ∈ W. On the other hand, the frequency is what makes up the appearance of the slope that predicts the segmentation point.

The final Z-score in equation 1 is the one that purports to have the property that Z(x) > Z(y) if x ∈ S_W and y ∉ S_W - at least if purged (see below). A summary of the algorithm described in this section is displayed in table 3.

Table 3: Summary of the algorithm.
    Input: a text corpus C
    Step 1. Extract the set of words W from C (thus all contextual and word-frequency information is discarded)
    Step 2. Calculate Z_W(s, w) for each w ∈ W and s ◁ w
    Step 3. Accumulate Z_W(s) = Σ_{w ∈ W} Z_W(s, w)

The time-complexity bounding factor is the number of suffixes, i.e. the cardinality of S_W, which is linear (in the size of the input) if words are bounded in length by a constant, and quadratic in the (really) worst case if not.

4 Experimental Results

For a regular English 1-million-token newspaper corpus we get the top 30 plus bottom 3 suffixes as shown in table 4.

English has little affixation compared to e.g. Turkish, which is at the opposite end of the typological scale (Dryer, 2005). The corresponding results for Turkish on a bible corpus (Tur, 1988) are shown in table 5; 56,881 unique words yielded a total of 175,937 ranked suffixes.

The results largely speak for themselves, but some comments are in order. As is easily seen from the lists, some suffixes are suffixes of each other, so one could purge the list in some way to keep only the most "competitive" suffixes. One purging strategy would be to remove x from the list if there is a z such that x = yz and Z(z) > Z(x) (this would remove e.g. -ting if -ing is above it on the list). A more sophisticated purging method is the following, which does slightly more. First, for a word w ∈ W, define its best segmentation as: Segment(w) = argmax_{s ◁ w} Z(s). Then purge by keeping only those suffixes which are the best parse for at least one word: S′_W = {s ∈ S_W | ∃w Segment(w) = s}.

Such purging kicks out the bulk of "junk" suffixes. Table 6 shows the effects of purging on the size of the suffix list for English, Turkish and the virtually affixless Maori (Bauer et al., 1993). It should be noted that "junk" suffixes still remain after purging - typically common stem-final characters - and that there is no simple relation between the number of suffixes left after purging and the amount of morphology of the language in question. Otherwise we would have expected the morphology-less Maori to be left with no, or 28-ish, suffixes, or at least fewer than English.
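To make Steps 2 and 3 concrete, here is a minimal sketch building on the helpers above; it assumes the slope reading of Z_W(s, w) given earlier, and the function names are again our own.

    from collections import defaultdict

    def z_score(s, w, f_hat):
        """Z_W(s, w): drop in adjusted frequency when s is extended by one
        character of w (taken as f^_W(s) itself when s is the whole word)."""
        if len(s) == len(w):
            return f_hat[s]
        longer = w[len(w) - len(s) - 1:]
        return f_hat[s] - f_hat[longer]

    def extract_suffix_scores(W):
        """Steps 1-3: accumulate Z_W(s) as the sum of Z_W(s, w) over all w."""
        r = alphabet_size(W)
        f = frequencies(W)
        f_hat = {s: adjusted_frequency(s, f, W, r) for s in f}
        Z = defaultdict(float)
        for w in W:
            for s in suffixes(w):
                Z[s] += z_score(s, w, f_hat)
        return Z  # sort by descending Z[s] to obtain the ranked suffix list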
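The best-parse purging can be sketched the same way; segment below is our rendering of Segment(w) = argmax_{s ◁ w} Z(s).

    def purge(W, Z):
        """S'_W: keep only suffixes that are the best parse of at least one word."""
        def segment(w):
            return max(suffixes(w), key=lambda s: Z[s])
        return {segment(w) for w in W}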
A good sign is that the purged list and its order seem to be largely independent of corpus size (as long as the corpus is not very small), but we do get some significant differences between bible English and newspaper English.

We have chosen to illustrate using suffixes, but the method readily generalizes to prefixes as well, and even to prefixes and suffixes at the same time. As an example of this, we show top-10 purged prefix and suffix scores for some typologically differing languages in table 7. Again, we use bible corpora for cross-language comparability (Swedish (Swe, 1917) and Swahili (Swa, 1953)). The scores have been normalized in each language to allow cross-language comparison - which, judging from the table, seems meaningful. Swahili is an exclusively prefixing language, but verbs tend to end in -a (whose status as a morpheme in the linguistic sense can be doubted), whereas Swedish is suffixing, although some prefixes are or were productive in word-formation.

A full discussion of further aspects, such as a more informed segmentation of words, the peeling off of multiple suffix layers, and the purging of unwanted affixes, is beyond the scope of this paper.
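For completeness, the prefix generalization can be obtained by running the same pipeline on reversed words; the per-language normalization sketched below (scaling by the top score) is only one plausible reading of the normalization used for table 7, not necessarily the paper's scheme.

    def extract_prefix_scores(W):
        """Prefix scores via the suffix machinery applied to reversed words."""
        Z_rev = extract_suffix_scores({w[::-1] for w in W})
        return {s[::-1]: z for s, z in Z_rev.items()}

    def normalize(Z):
        """Scale scores so the best affix gets 1.0, for cross-language
        comparison (an assumed normalization)."""
        top = max(Z.values())
        return {s: z / top for s, z in Z.items()}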