<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1037"> <Title>Memory-Based Morphological Analysis</Title> <Section position="6" start_page="289" end_page="290" type="concl"> <SectionTitle> 5 Conclusions </SectionTitle> <Paragraph position="0"> We have demonstrated the applicability of memory-based learning to morphological analysis, by reformulating the problem as a classification task in which letter sequences are classified as marking different types of morpheme boundaries. The generalization performance of memory-based learning algorithms on the task is encouraging, given that the tests are done on held-out (dictionary) words. Estimates of free-text performance give indications of high accuracies: 84.6% correct fully-analyzed words (64.6% on unseen words), and 96.7% correctly segmented and coarsely-labeled words (about 90% for unseen words). Precision and recall of fully-labeled morphemes are estimated in real texts to be over 93% (about 84% for unseen words). Finally, the prediction of (possibly ambiguous) syntactic classes of unknown word-forms in the test material was shown to be 91.2% correct; the corresponding free-text estimate is 97.2% correctly-tagged wordforms.</Paragraph> <Paragraph position="1"> In comparison with the traditional approach, which is not immune to costly hand-crafting and spurious ambiguity, the memory-based learning approach applied to a reformulation of the problem as a classification task of the segmentation type has a number of advantages: (i.e., it does not retry analysis generation) and fast, and is only linearly related to the length of the wordform being processed.</Paragraph> <Paragraph position="2"> The language-independence of the approach can be illustrated by means of the following partial results on MBMA of English. We performed experiments on 75,745 English wordforms from CELEX and addressed the lower-granularity tasks of predicting morpheme boundaries (Van den Bosch et al., 1996).
Experiments yielded 88.0% correctly segmented test words when deciding only on the location of morpheme boundaries, and 85.6% correctly segmented test words when discerning between derivational and inflectional morphemes. Both results are roughly comparable to the 90% reported here (but note the difference in training set size).</Paragraph> <Paragraph position="3"> A possible limitation of the approach may be the fact that it cannot return more than one possible segmentation for a wordform. E.g.</Paragraph> <Paragraph position="4"> the compound word kwartslagen can be interpreted as either kwart+slagen (quarter turns) or kwarts+lagen (quartz layers). The memory-based approach would select one segmentation. However, true segmentation ambiguity of this type is very rare in Dutch. Labeling ambiguity occurs more often (3.6% of all morphemes), and the current approach simply produces ambiguous tags. However, it is possible for our approach to return distributions of possible classes, if desired, and it is also possible to &quot;unpack&quot; ambiguous labeling into lists of possible morphological analyses of a wordform. If, for example, MBMA's output for the word bakken (bake, an infinitive or plural verb form, or bins, a plural noun) were [bak]V/N[en]tm/i/m, then this output could be expanded unambiguously into the noun analysis [bak]N[en]m (plural) and the two verb readings [bak]V[en]i (infinitive) and [bak]V[en]tm (present tense plural). Points of future research are comparisons with other morphological analyzers and lemmatizers; applications of MBMA to other languages (particularly those with radically different morphologies); and qualitative analyses of MBMA's output in relation to linguistic predictions of errors and markedness of exceptions.</Paragraph> </Section> </Paper>