<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0704"> <Title>The Role of Algorithm Bias vs Information Source in Learning Algorithms for Morphosyntactic Disambiguation</Title> <Section position="4" start_page="20" end_page="21" type="metho"> <SectionTitle> 3 Experimental Setup </SectionTitle> <Paragraph position="0"> We have set up the experiments in such a way that neither tagger is given an unfair advantage over the other. The output of the actual taggers (MBT and MXPOST) is not suitable to study the proper effect of the relevant issues of information source and algorithmic parameterisation, since different information sources are used for each tagger. Therefore the taggers need to be emulated using symbolic learners and a preprocessing front-end to translate the corpus data into feature value vectors.</Paragraph> <Paragraph position="1"> The tagging experiments were performed on the LOB-corpus (Johansson et al, 1986). The corpus was divided into 3 partitions: an 80% training partition, consisting of 931.062 words, and two 10% partitions: the VALIDATION SET (114.479 words) and the TEST SET (115.101 words) on which the learning algorithms were evaluated.</Paragraph> <Paragraph position="2"> The comparison was done in both directions: we compared both systems using information sources as described in Daelemans et al. (1996) as well as those described in Ratnaparkhi (1996).</Paragraph> <Section position="1" start_page="20" end_page="20" type="sub_section"> <SectionTitle> Corpus Preprocessing </SectionTitle> <Paragraph position="0"> Since the implementations of both learning algorithms take propositional data as their input (feature-value pairs), it is necessary to translate the corpora into this format first. This can be done in a fairly straightforward manner, as is illustrated in Tables 1 and 2 for the sentence She looked him up and down.</Paragraph> <Paragraph position="1"> The disambiguation of known words is usually based on contextual features. A word is considered to be known when it has an ambiguous tag (henceforth ambitag) attributed to it in the LEXICON, which is compiled in the same way as for the MBT-tagger (Daelemans et al., 1996). A lexicon entry like telephone for example carries the ambitag NN-VB, meaning that it was observed in the training data as a noun or a verb and that it has more often been observed as a noun (frequency being expressed by order).</Paragraph> <Paragraph position="2"> Surrounding context for the focus word (fi are disambiguated tags (a 0 on the left-hand side and ambiguous tags (a) on the right-hand side.</Paragraph> <Paragraph position="3"> In order to avoid the unrealistic situation that all disambiguated tags assigned to the left context of the target word are correct, we simulated a realistic situation by tagging the validation and test set with a trained memory-based or maximum entropy tagger (trained on the training set), and using the tags predicted by this tagger as left context tags.</Paragraph> <Paragraph position="4"> word p s s s c h She S S h e T F looked 1 k e d F F him h h i m F F up u * u p F F and a a n d F F down d o w n F F</Paragraph> <Paragraph position="6"> Unknown words need more specific word-form information to trigger the correct disambiguation. 
<Paragraph position="7"> 4 Using MBT-type features This section describes tagging experiments for both algorithms using features as described in Daelemans et al. (1996). A large number of experiments were done to find the most suitable feature selection for each algorithm, the most relevant results of which are presented here.</Paragraph> </Section> <Section position="2" start_page="20" end_page="21" type="sub_section"> <SectionTitle> Validation Phase </SectionTitle> <Paragraph position="0"> In the validation phase, both learning algorithms iteratively exhaust different feature combinations, as well as learner-specific parameterisations, on the VALIDATION SET. For each algorithm, we try all feature combinations that hardware restrictions allow: we confined ourselves to a context of at most six surrounding tags, since we had already noticed performance degradation for both systems when using a context of more than five surrounding tags. For unknown words, we have to distinguish between two different tuning phases: first we find the optimal contextual feature set, and then the optimal morphological features, presupposing that both types of features operate independently.</Paragraph> <Paragraph position="1"> We investigate seven of the variations of Memory-Based Learning available in TIMBL (see Daelemans et al. (1999b) for details) and one instantiation of MACCENT, since the current version does not implement many variations.</Paragraph> <Paragraph position="2"> A summary of the most relevant results of the validation phase can be found in Table 3.</Paragraph> <Paragraph position="3"> The result of the absolute optimal feature set for each algorithm is indicated in bold. For some contexts, we observe a big difference between IGTREE on the one hand and IB1-IG and IB1-MVDM on the other. For unknown words, the abstraction made by the IGTREE algorithm seems to be quite harmful compared to the true lazy learning of the other variants (see Daelemans et al. (1999b) for a possible explanation for this type of behaviour). Of all algorithms, Maximum Entropy has the highest tagging accuracy for known words, although it outperforms the TIMBL algorithms by only a very small margin. The overall optimal context for the algorithms turned out to be dfa and ddfaa respectively, while enlarging the context on either side of the focus word resulted in a lower tagging accuracy.</Paragraph> <Paragraph position="4"> Overall, we noticed a tendency for TIMBL to perform better when the information source is rather limited (i.e. when few features are used), while MACCENT seems more robust when dealing with a more elaborate feature space.</Paragraph> </Section>
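As a schematic illustration of this validation procedure (not the experimental code itself: the learner list is an illustrative subset, the pattern generator and the evaluate stub are assumptions), the search over context sizes and learner variants could look as follows:

```python
from itertools import product

def context_patterns(max_left=3, max_right=3):
    """Generate dfa-style feature patterns (dfa, ddfaa, dddfaaa, ...) with at
    most six surrounding tags in total, as in the validation phase above."""
    for d, a in product(range(max_left + 1), range(max_right + 1)):
        if 0 < d + a <= 6:
            yield "d" * d + "f" + "a" * a

# An illustrative subset of learner configurations; the paper tunes seven
# TiMBL variants and one MACCENT instantiation.
LEARNERS = ["IB1-IG", "IB1-MVDM k=5", "IGTREE", "MACCENT"]

def evaluate(learner, pattern, train_set, validation_set):
    """Stand-in: train `learner` on train_set with features `pattern` and
    return tagging accuracy on validation_set."""
    raise NotImplementedError

def validate(train_set, validation_set):
    best_acc, best_config = -1.0, None
    for learner in LEARNERS:
        for pattern in context_patterns():
            acc = evaluate(learner, pattern, train_set, validation_set)
            if acc > best_acc:
                best_acc, best_config = acc, (learner, pattern)
    return best_acc, best_config  # accuracy, (learner, feature pattern)
```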
<Section position="3" start_page="21" end_page="21" type="sub_section"> <SectionTitle> Test Phase </SectionTitle> <Paragraph position="0"> The Test Phase of the experiment consists of running the optimised algorithmic variant, paired with the optimal feature set, on the test set. TIMBL, augmented with the Modified Value Difference Metric and with k set to 5, was used to disambiguate known words with the dfa feature pattern and unknown words with the feature pattern ddaapss. MACCENT used the same features for unknown words, but used more elaborate features (ddfaa) to disambiguate known words. The results of the optimised algorithms on the test set can be found in Table 4.</Paragraph> </Section> </Section> <Section position="5" start_page="21" end_page="22" type="metho"> <SectionTitle> TIMBL MACCENT </SectionTitle> <Paragraph position="0"> Overall tagging accuracy is similar for both algorithms, indicating that, for the overall tagging problem, the careful selection of optimal information sources in a validation phase has a bigger influence on accuracy than the inherent properties or bias of the two learning algorithms.</Paragraph> <Section position="1" start_page="22" end_page="22" type="sub_section"> <SectionTitle> Beam Search </SectionTitle> <Paragraph position="0"> Note that MACCENT does not include the beam search over the N highest-probability tag sequence candidates at the sentence level, which is part of the MXPOST tagger (but not part of maximum entropy-based learning proper; it could be combined with MBL as well). To make sure that this omission does not affect maximum entropy learning adversely for this task, we implemented the beam search and compared the results with the condition in which the most probable tag is used, for different beam sizes and different amounts of training data. The differences in accuracy were not statistically significant (beam search even turned out to be significantly worse for small training sets). The beam search very rarely changes the probability order suggested by MACCENT, and when it does, the number of times the suggested change is correct is about equal to the number of times the change is wrong. This is in contrast with the results of Ratnaparkhi (1996), and will be investigated further in future research.</Paragraph>
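The following generic sketch illustrates the sentence-level beam search being compared here against simply taking the most probable tag per word. It is not the MXPOST or MACCENT source code; the model.tag_probs(sentence, i, history) interface, which returns a tag-to-probability mapping for word i given the tags chosen so far, is an assumed stand-in for the underlying probabilistic tagger.

```python
import math

def beam_search_tag(sentence, model, beam_size=5):
    """Keep the beam_size most probable partial tag sequences per word and
    return the most probable complete sequence for the sentence."""
    beams = [(0.0, [])]                       # (log probability, tag history)
    for i in range(len(sentence)):
        candidates = []
        for logp, history in beams:
            for tag, p in model.tag_probs(sentence, i, history).items():
                candidates.append((logp + math.log(p), history + [tag]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]        # prune to the N best candidates
    return max(beams, key=lambda b: b[0])[1]

def greedy_tag(sentence, model):
    """The comparison condition: simply take the most probable tag per word."""
    history = []
    for i in range(len(sentence)):
        probs = model.tag_probs(sentence, i, history)
        history.append(max(probs, key=probs.get))
    return history
```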
<Paragraph position="1"> 5 Using MXPOST-type features In order to complete the systematic comparison, we compared maximum entropy (again using the MACCENT implementation) with MBL when using the features suggested in (Ratnaparkhi, 1996). Due to the computational expense of the iterative scaling method that is inherent to maximum entropy learning, it was not tractable to incorporate an extensive validation phase for feature selection or algorithmic variant selection. We simply took the features suggested in that paper, and two different settings for our MBL implementation, IGTREE and MVDM with k = 5, the latter being the optimal algorithm in the previous experiments. The results on the test set are shown in Table 5.</Paragraph> <Paragraph position="2"> Beam search. Notice that, again, the sentence-level beam search does not add significantly to accuracy.</Paragraph> <Paragraph position="3"> Also note that the results reported in Table 5 differ significantly from those reported for MXPOST in (van Halteren et al., 1998). The difference in tagging accuracy is most likely due to the problematic translation of MXPOST's binary features to nominal features. This involves creating instances with a fixed number of features (not just the features that are active for the instance, as is the case in MXPOST), resulting in a bigger, less manageable instance space. Because IGTREE compresses this elaborate instance space, we consequently notice a significant improvement over an MVDM approach.</Paragraph> </Section> </Section> <Section position="6" start_page="22" end_page="23" type="metho"> <SectionTitle> 6 Error Analysis </SectionTitle> <Paragraph position="0"> The following table contains some more detailed information about the distribution of the errors. In 87% of the cases where both algorithms are wrong, they assign the same tag to a word. This indicates that about 55% of the errors can be attributed either to a general shortcoming present in both algorithms or to an inadequate information source. We can also state that 97.8% of the time, the two algorithms agree on which tag to assign to a word (even though they both agree on the wrong tag 1.7% of the time).</Paragraph> <Paragraph position="1"> We also observed the same (erroneous) tagging behaviour in both algorithms for lower-frequency tags, for the interchanging of noun tags and adjective tags, for the interchanging of past tense tags and past participle tags, and the like.</Paragraph> <Paragraph position="2"> Another issue is the information value of the ambitag. Although the ambitag has substantial information value, we have observed several cases where the correct tag was not in the distribution specified by the ambitag. In our test set, this is the case for 1235 words (not considering unknown words); in 553 of these cases, neither algorithm finds the correct tag. Differences can be observed in the way the algorithms deal with the information value of the ambitag: Maximum Entropy exhibits a more conservative approach with respect to the distribution suggested by the ambitag and is more reluctant to break free from it. It finds the correct part-of-speech tag only 507 times, whereas TiMBL performs better at 594 correct tags. There is a downside to this: sometimes the correct tag is featured in the ambitag, but the algorithm breaks free from the ambitag nevertheless. This happens to TiMBL 267 times, and to MACCENT 288 times.</Paragraph> <Paragraph position="3"> In any case, the construction of the ambitag seems to be a problematic issue that needs to be resolved, since it accounts for almost 40% of all tagging errors. This is especially a problem for MBT, as it relies on ambitags in its representation.</Paragraph> </Section> </Paper>