<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0108">
  <Title>A Comparison of Two Different Approaches to Morphological Analysis of Dutch</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Dutch Morphology: Issues and
Resources
</SectionTitle>
    <Paragraph position="0"> Dutch can be situated between English and German if we define a scale of morphological richness in Germanic languages. It lacks certain aspects of the rich inflectional system of German, but features a more extensive inflection, conjugation and derivation system than English. Contrary to English, Dutch for instance includes a set of diminutive suffixes: e.g. appel+tje (little apple) and has a larger set of suffixes to handle conjugation.</Paragraph>
    <Paragraph position="1"> Compounding in Dutch can occur in different ways: words can simply be concatenated (e.g. plaats+bewijs (seat ticket)), they can be conjoined using the 's' infix (e.g. toegang+s+bewijs (entrance ticket)) or the 'e(n)' infix (e.g. fles+en+mand (bottle basket)). In Dutch affixes are used to produce derivations: e.g. aanvaard+ing (accept-ance).</Paragraph>
    <Paragraph position="2"> Morphological processes in Dutch account for a wide range of spelling alternations. For instance: a syllable containing a long vowel is written with two vowels in a closed syllable (e.g. poot (paw)) or with one vowel in an open syllable (e.g. poten (paws)). Consonants in the coda of a syllable can become voiced (e.g. huis huizen (house(s)) or doubled (e.g. kip - kippen (chicken(s))). These and other types of spelling alternations make morphological segmentation of Dutch word forms a challenging task. It is therefore not surprising to find that only a handful of research efforts have been attempted.</Paragraph>
    <Paragraph position="3"> (Heemskerk, 1993; Dehaspe et al., 1995; Van den Bosch et al., 1996; Van den Bosch and Daelemans, 1999; Laureys et al., 2002). This limited number may to some extent also be due to the limited amount of Dutch morphological resources available.</Paragraph>
    <Paragraph position="4"> The Morphological Database of CELEX Currently, CELEX is the only extensive and publicly available morphological database for Dutch (Baayen et al., 1995). Unfortunately, this database is not readily applicable as an information source in a practical system due to both a considerable amount of annotation errors and a number of practical considerations.</Paragraph>
    <Paragraph position="5"> Since both of the systems described in this paper are data-driven in nature, we decided to semi-automatically make some adjustments to allow for more streamlined processing. A full overview of these adjustments can be found in (Laureys et al., 2004) but we point out some of the major problems that were rectified: * Annotation of diminutive suffix and unanalyzed plurals and participles was added (e.g. appel+tje).</Paragraph>
    <Paragraph position="6"> * Inconsistent treatment of several suffixes was resolved (e.g. acrobaat+isch (acrobat+ic) vs. agnostisch (agnostic)).</Paragraph>
    <Paragraph position="7"> * Truncation operations were removed (e.g. filosoof+(isch+)ie (philosophy)).</Paragraph>
    <Paragraph position="8"> The Task: Morphological Segmentation The morphological database of CELEX contains hierarchically structured and fully tagged morphological analyses, such as the following analysis for the word 'arbeidsfilosofie' (labor philosophy):</Paragraph>
    <Paragraph position="10"> The systems described in this paper deal with the most important subtask of morphological analysis: segmentation, i.e. breaking up a word into its respective morphemes. This type of analysis typically also requires the modeling of the previously mentioned spelling changes, exemplified in the above example (arbeidsfilosofie - arbeid+s+filosoof+ie). In the next 2 sections, we will describe two different approaches to the segmentation/alternation task: one using a machine-learning method, the other using finite state techniques. Both systems however were trained and tested on the same data, i.e.</Paragraph>
    <Paragraph position="11"> the Dutch morphological database of CELEX.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 A Machine Learning Approach
</SectionTitle>
    <Paragraph position="0"> One of the most notable research efforts modeling Dutch morphology can be found in Van den Bosch and Daelemans (1999). Van den Bosch and Daelemans (1999) define morphological analysis as a classification task that can be learned from labeled data. This is accomplished at the level of the grapheme by recording a local context of five letters before and after the focus letter and associating this context with a morphological classification which not only predicts a segmentation decision, but also a graphemic (alternation) and hierarchical mapping.</Paragraph>
    <Paragraph position="1"> The system described in Van den Bosch and Daelemans (1999) employs the ib1-ig memory-based learning algorithm, which uses information-gain to attribute weighting to the features. Using this method, the system is able to achieve an accuracy of 64.6% of correctly analyzed word forms. On the segmentation task alone, the system achieves a 90.7% accuracy of correctly segmented words. On the morpheme level, a 94.5% F-score is observed.</Paragraph>
    <Paragraph position="2"> Towards a Cascaded Alternative The machine learning approach to morphological analysis described in this paper is inspired by the method outlined in Van den Bosch and Daelemans (1999), but with some notable differences. The first difference is the data set used: rather than using the extended morphological database, we concentrated on the database excluding inflections and conjugated forms. These morphological processes are to a great extent regular in Dutch. As derivation and compounding pose the most challenging task when modeling Dutch morphology, we have opted to concentrate on those processes first. This allows us to evaluate the systems with a clearer insight into the quality of the morphological analyzers with respect to the hardest tasks.</Paragraph>
    <Paragraph position="3"> Further, the systems described in this paper use the adjusted version of CELEX described in section 2, instead of the original dataset. The main reason for this can be situated in the context of the FLaVoR project: since our morphological analyzer needs to operate within a speech recognition engine, it is paramount that our analyzers do not have to deal with truncated forms, as it would require us to hypothesize unrealized forms in the acoustic input stream.</Paragraph>
    <Paragraph position="4"> Even though using the modified dataset does not affect the general applicability of the morphological analyzer itself, it does entail that a direct comparison with the results in Van den Bosch and Daelemans (1999) is not possible.</Paragraph>
    <Paragraph position="5"> The overall design of our memory-based system for morphological analysis differs from the one described in Van den Bosch and Daelemans (1999) as our approach takes a more traditional stance with respect to classification. Rather than encoding different types of classification in conglomerate classes, we have set up a cascaded approach in which each classification task (spelling alternation, segmentation) is handled separately. This allows us to identify problems at each point in the task and enables us to optimize each classification step accordingly. To avoid percolation of bad classification decisions at one point throughout the entire classification cascade, we ensure that all solutions are retained throughout the entire process, effectively turning later classification steps into re-ranking mechanisms.</Paragraph>
    <Paragraph position="6"> Alternation The input of the morphological analyzer is a word form such as 'arbeidsfilosofie'. As a first step to arrive at the desired segmented output 'arbeid+s+filosoof+ie', we need to account for the spelling alternation. This is done as a precursor to the segmentation task, since preliminary experiments showed that segmentation on a word form like 'arbeidsfilosoofie' is easier to model accurately than segmentation on the initial word form.</Paragraph>
    <Paragraph position="7"> First, we record all possible alternations on the training set. These range from general alternations like doubling the vowel of the last syllable (e.g. arbeidsfilosoof) to very detailed, almost word-specific alternations (e.g. Europa -euro). Next, these alternations in the training set are annotated and an instance base is extracted. Table 1 shows an example of instances for the word 'aanbidder' (admirer). In this example we see that alternation number 3 is associated with the double 'd', denoting a deletion of that particular letter.</Paragraph>
    <Paragraph position="8">  These instances were used to train the memory-based learner TiMBL (Daelemans et al., 2003). Table 2 displays the results for the alternation task on the test set. Even though these appear quite modest, the only restriction we face with respect to consecutive processing steps lies in the recall value. The results show that 255 out of 2,146 alternations in the test set were not retrieved. This means that we will not be able to correctly analyze 2.27% of the test set (which contains 11,256 items).</Paragraph>
    <Paragraph position="9">  Left Context Focus Right Context Combined Class - - - - - a a n b i d -a -aa aan 0 - - - - a a n b i d d -aa aan anb 0 - - - a a n b i d d e aan anb nbi 0 - - a a n b i d d e r anb nbi bid 0 - a a n b i d d e r - nbi bid idd 0 a a n b i d d e r - - bid idd dde 0 a n b i d d e r - - - idd dde der 3 n b i d d e r - - - - dde der er- 0 b i d d e r - - - - - der er- r- 0</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Segmentation
</SectionTitle>
      <Paragraph position="0"> A memory-based learner trained on an instance base extracted from the training set constitutes the segmentation system. An initial feature set was extracted using a windowing approach similar to the one described in Van den Bosch and Daelemans (1999). Preliminary experiments were however unable to replicate the high segmentation accuracy of said method, so that extra features needed to be added. Table 3 shows an example of instances extracted for the word 'rijksontvanger' (state collector). Experiments on a held-out validation set confirmed both left and right context sizes determined in Van den Bosch and Daelemans (1999)1 . The last two features are combined features from the left and right context and were shown to be beneficial on the validation set. They denote a group containing the focus letter and the two consecutive letters and a group containing the focus letter and the three previous letters respectively.</Paragraph>
      <Paragraph position="1"> A numerical feature ('Dist' in Table 3) was added that expresses the distance to the previous morpheme boundary. This numerical feature avoids overeager segmentation, i.e. a small value for the distance feature has to be compensated by other strong indicators for a morpheme boundary. We also consider the morpheme that was compiled since the last morpheme boundary (features in the column 'Current Morpheme').</Paragraph>
      <Paragraph position="2"> A binary feature indicates whether or not this morpheme can be found in the lexicon extracted from the training set. The next two features consider the morpheme formed by adding the next letter in line.</Paragraph>
      <Paragraph position="3"> Note however that the introduction of these features makes it impossible to precompile the instance base for the test set, since for instance 1Context size was restricted to four graphemes for reasons of space in Table 3.</Paragraph>
      <Paragraph position="4"> the distance to the previous morpheme boundary can obviously not be known before actual segmentation takes place. We therefore set up a server application and generated instances on the fly.</Paragraph>
      <Paragraph position="5"> 1,141,588 instances were extracted from the training set and were used to power a TiMBL server. The optimal algorithmic parameters were determined with cross-validation on the training set2. A client application extracted instances from the test set and sent them to the server on the fly, using the previous output of the server to determine the value of the above-mentioned features. We also adjusted the verbosity level of the output so that confidence scores were added to the classifier's decision.</Paragraph>
      <Paragraph position="6"> A post-processing step generated all possible segmentations for all possible alternations. The possible segmentations for the word 'apotheker' (pharmacist) for example constituted the following set: {(apotheek)(er), (apotheker), (apotheeker), (apothek)(er)}. Next, the confidence scores of the classifier's output were multiplied for each possible segmentation to express the overall confidence score for the morpheme sequence. Also, a lexicon extracted from the training set with associated probabilities was used to compute the overall probability of the morpheme sequence (using a Laplace-type smoothing process to account for unseen morphemes). Finally, a bigram model computed the probability of the possible morpheme sequences as well.</Paragraph>
      <Paragraph position="7"> Table 4 describes the results at different stages of processing and expresses the number of words that were correctly segmented. Only using the confidence scores output by the memory-based learner (equivalent to using a non-ranking 2ib1-ig was used with Jeffrey divergence as distance metric, no weighting, considering 11 nearest neighbors using inverse linear weighting.</Paragraph>
      <Paragraph position="8">  - - - - r i j k s 0 r 1 ri 0 rij --r 0 - - - r i j k s o 1 ri 0 rij 1 ijk -ri 0 - - r i j k s o n 2 rij 1 rijk 1 jks -rij 0 - r i j k s o n t 3 rijk 1 rijks 0 kso rijk 1</Paragraph>
      <Paragraph position="10"> s o n t v a n g e 3 ontv 0 ontva 0 van ontv 0 o n t v a n g e r 4 ontva 0 ontvan 0 ang ntva 0 n t v a n g e r - 5 ontvan 0 ontvang 1 nge tvan 0 t v a n g e r - - 6 ontvang 1 ontvange 0 ger vang 1 v a n g e r - - - 0 e 1 er 1 er- ange 0 a n g e r - - - - 1 er 1 er- 0 r- nger 1  approach) achieves a low score of 81.36%. Using only the lexical probabilities yields a better score, but the combination of the two achieves a satisfying 86.37% accuracy. Adding bigram probabilities to the product further improves accuracy to 87.57%. In Section 5 we will look at the results of the memory-based morphological analyzer in more detail.</Paragraph>
      <Paragraph position="11">  state technology has been the dominant framework for computational morphological analysis. In the FLaVoR project a finite state morphological analyzer for Dutch is being developed. We have several motivations for this. First, until now no finite state implementation for Dutch morphology was freely available. In addition, finite state morphological analysis can be considered a kind of reference for the evaluation of other analysis techniques. In the current project, however, most important is the inherent bidirectionality of finite state morphological processing. This bidirectionality should allow for a flexible integration of the morphological model in the speech recognition engine as it leaves open a double option: either the morphological system acts in analysis mode on word hypotheses offered by the recognizer's search algorithm, or the system works in generation mode on morpheme hypotheses. Only future practical implementation of the complete recognition system will reveal which option is preferable.</Paragraph>
      <Paragraph position="12"> After evaluation of several finite state implementations it was decided to implement the current system in the framework of the Xerox finite state tools, which are well described and allow for time and space efficient processing (Beesley and Karttunen, 2003). The output of the finite state morphological analyzer is further processed by a filter and a probabilistic score function, as will be detailed later.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Morphotactics and Orthographic
Alternation
</SectionTitle>
      <Paragraph position="0"> The morphological system design is a composition of finite state machines modeling morphotactics and orthographic alternations. For morphotactics a lexicon of 29,890 items was generated from the training set (118 prefixes, 189 suffixes, 3 infixes and 29,581 roots). The items were divided in 23 lexicon classes, each of which could function as an item's continuation class.</Paragraph>
      <Paragraph position="1"> The resulting finite state network has 24,858 states and 61,275 arcs.</Paragraph>
      <Paragraph position="2"> The Xerox finite state tools allow for a specification of orthographical alternation by means of (conditional) replace rules. Each replace rule compiles into a single finite state transducer. These transducers can be put in cascade or in parallel. In the case at hand, all transducers were put in cascade. The resulting finite state transducer contains 3,360 states and 81,523 arcs. The final transducer (a composition of the lexical network and the orthographicaltransducer)contains 29,234states and 106,105 arcs.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Dealing with Overgeneration
</SectionTitle>
      <Paragraph position="0"> As the finite state machine has no memory stack3, the use of continuation classes only allows for rather crude morphotactic modeling. For example, in 'on-ont-vlam-baar' (un-inflame-able) the noun 'vlam' first combines with the prefix 'ont' to form a verb. Next, the suffix 'baar' is attached and an adjective is built. Finally, the prefix 'on' negates the adjective. This example shows that continuation classes cannot be strictly defined: the suffix 'baar' combines with a verb but immediately follows a noun root, while the prefix 'on' requires an adjective but is immediately followed by another prefix.</Paragraph>
      <Paragraph position="1"> Obviously, such a model leads to overgeneration. In practice, the average number of analyses per test set item is 7.65. The maximum number of analyses is 1,890 for the word 'belastingadministratie' (tax administration).</Paragraph>
      <Paragraph position="2"> In section 3 the numerical feature 'Dist' was used to avoid overeager segmentation. We apply a similar kind of filter to the segmentations generated by the morphological finite state transducer. A penalty function for short morphemes is defined: 1- and 2-character morphemes receive penalty 3, 3-character morphemes penalty 1. Both an absolute and relative4 penalty threshold are set. Optimal threshold values (11 and 2.5 respectively) were determined on the basis of the training set. Analyses exceeding one of both thresholds are removed. This filter proves quite effective as it reduces the average number of analyses per item with 36.6% to 4.85.</Paragraph>
      <Paragraph position="3"> Finally, all remaining segmentation hypotheses are scored and ranked using an N-gram morpheme model. We applied a bigram and trigram model, both using Katz back-off and modified Kneser-Ney smoothing. The bigram slightly 3Actually, the Xerox finite state tools do allow for a limited amount of 'memory' by a restricted set of unification operations termed flag diacritics. Yet, they are insufficient for modeling long distance dependencies with hierarchical structure.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Monomorphemic Items
</SectionTitle>
      <Paragraph position="0"> The biggest remaining problem at this stage of development is the scoring of monomorphemic test items which are not included as word roots in the lexical finite state network. Sometimes these items do not receive any analysis at all, in which case we correctly consider them monomorphemic. Mostly however, monomorphemes are wrongly analyzed as morphologically complex. Scoring all test items as potentially monomorphemic does not offer any solution, as the items at hand were not in the training data and thus receive just the score for unknown items. This problem of spurious analyses accounts for 57.23% of all segmentation errors made by the finite state system.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Comparing the Two Systems
</SectionTitle>
    <Paragraph position="0"> tation task To evaluate both morphological systems, we defined a training and test set. Of the 124,136 word forms in CELEX, 110,627 constitute the training set. The test set is further split up into words with only one possible analysis (11,256 word forms) and words with multiple analyses (2,253). Since the latter set requires morphological processes beyond segmentation, we focus our evaluation on the former set in this paper. For the machine learning experiments, we also defined a held-out validation set of 5,000 word forms, which is used to perform parameter optimization and feature selection.</Paragraph>
    <Paragraph position="1"> Tables 5, 6 and 7 show a comparison of the results5. Table 5 describes the percentage of words in the test set that have been segmented correctly. We defined a baseline system which considers all words in the test set as monomorphemic. Obviously this results in a very low 5MBM: the memory-based morphological analyzer, FSM: the finite state morphological analyzer full word score (which shows us that 18.64% of the words in the test set are actually monomorphemic). The finite state system seems to have a slight edge over the memory-based analyzer when we looking at the single best solution. Yet, when we consider 2-best and 3-best scores, the memory-based analyzer in turn beats the finite state analyzer.</Paragraph>
    <Paragraph position="2">  phemes) on the segmentation task We also calculated Precision and Recall on morpheme boundaries. The results are displayed in Table 6. This score expresses how many of the morpheme boundaries have been retrieved. We make a distinction between word-internal morpheme boundaries and all morpheme boundaries. The former does not include the morpheme boundaries at the end of a word, while the latter does. We provide the latter in reference to Van den Bosch and Daelemans (1999), but the focus lies on the results for word-internal boundaries as these are nontrivial. We notice that the memory-based system outperforms the finite state system, but the difference is once again small. However, when we look at Table 7 in which we calculate the amount of full morphemes that have been correctly retrieved (meaning that both the start and end-boundary have been correctly placed), we see that the finite state method has the advantage. null Slight differences in accuracy put aside, we find that both systems achieve similar scores on this dataset. When we look at the output, we do notice that these systems are indeed performing quite well. There are not many instances where the morphological analyzer cannot be said to have found a decent alternative analysis to the gold standard one. In many cases, both systems even come up with a more advanced morphological analysis: e.g. 'gekwetst' (hurt) is featured in the database as a monomorphemic artefact.</Paragraph>
    <Paragraph position="3"> Both systems described in this paper correctly segment the word form as 'ge+kwets+t', even though they have not yet specifically been designed to handle this type of inflection.</Paragraph>
    <Paragraph position="4"> When performing an error analysis of the output, one does notice a difference in the way the systems have tried to solve the erroneous analyses. The finite state method often seems to generate more morpheme boundaries than necessary, while the reverse is the case for the memory-based system, which seems too eager to revert to monomorphemic analyses when in doubt. This behavior might also explain the reversed situation when comparing Table 6 to Table 7. Also noteworthy is the fact that almost 60% of the errors is made on wordforms that both systems were not able to analyze correctly. Work is currently also underway to improve the performance by combining the rankings of both systems, as there is a large degree of complementarity between the two systems.</Paragraph>
    <Paragraph position="5"> Each system is able to uniquely find the correct segmentation for about 5% of the words in the test set, yielding an upperbound performance of 98.75% on the full word score for an optimally combined system.</Paragraph>
  </Section>
class="xml-element"></Paper>