<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0704"> <Title>The Role of Algorithm Bias vs Information Source in Learning Algorithms for Morphosyntactic Disambiguation</Title> <Section position="3" start_page="19" end_page="20" type="intro"> <SectionTitle> 2 Algorithms and Implementation </SectionTitle> <Paragraph position="0"> In this Section, we provide a short description of the two learning methods we used and their associated implementations.</Paragraph> <Section position="1" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 2.1 Memory-Based Learning </SectionTitle> <Paragraph position="0"> Memory-Based Learning is based on the assumption that new problems are solved by direct reference to stored experiences of previously solved problems, instead of by reference to rules or other knowledge structures extracted from those experiences (Stanfill and Waltz, 1986). A memory-based (case-based) approach to tagging has been investigated in Cardie (1994) and Daelemans et al. (1996).</Paragraph> </Section> <Section position="2" start_page="19" end_page="19" type="sub_section"> <SectionTitle> Implementation </SectionTitle> <Paragraph position="0"> For our experiments we have used TIMBL 2 (Daelemans et al., 1999a). TIMBL includes a number of algorithmic variants and parameters.</Paragraph> <Paragraph position="1"> The base model (ISl) defines the distance between a test item and each memory item as the number of features for which they have a different value. Information gain can be introduced (IBi-IG) to weigh the cost of a feature value mismatch. The heuristic approximation of computationally expensive pure MBL variants, (IGTREE), creates an oblivious decision tree with features as tests, ordered according to information gain of features. The number of nearest neighbors that are taken into account for extrapolation, can be determined with the parameter K.</Paragraph> <Paragraph position="2"> For typical symbolic (nominal) features, values are not ordered. In the previous variants, mismatches between values are all interpreted as equally important, regardless of how similar (in terms of classification behavior) the values are. We adopted the modified value difference metric (MVDM) to assign a different distance between each pair of values of the same feature.</Paragraph> </Section> <Section position="3" start_page="19" end_page="20" type="sub_section"> <SectionTitle> 2.2 Maximum Entropy </SectionTitle> <Paragraph position="0"> In this classification-based approach, diverse sources of information are combined in an exponential statistical model that computes weights (parameters) for all features by iteratively maximizing the likelihood of the training data. The binary features act as constraints for the model.</Paragraph> <Paragraph position="1"> The general idea of maximum entropy modeling is to construct a model that meets these constraints but is otherwise as uniform as possible. A good introduction to the paradigm of maximum entropy can be found in Berger et al.</Paragraph> <Paragraph position="2"> (1996).</Paragraph> <Paragraph position="3"> MXPOST (Ratnaparkhi, 1996) applied maximum Entropy learning to the tagging problem.</Paragraph> <Paragraph position="4"> The binary features of the statistical model are defined on the linguistic context of the word to be disambiguated (two positions to the left, two positions to the right) given the tag of the word. 
<Section position="3" start_page="19" end_page="20" type="sub_section">
<SectionTitle> 2.2 Maximum Entropy </SectionTitle>
<Paragraph position="0"> In this classification-based approach, diverse sources of information are combined in an exponential statistical model that computes weights (parameters) for all features by iteratively maximizing the likelihood of the training data. The binary features act as constraints for the model.</Paragraph>
<Paragraph position="1"> The general idea of maximum entropy modeling is to construct a model that meets these constraints but is otherwise as uniform as possible. A good introduction to the maximum entropy paradigm can be found in Berger et al. (1996).</Paragraph>
<Paragraph position="3"> MXPOST (Ratnaparkhi, 1996) applied maximum entropy learning to the tagging problem.</Paragraph>
<Paragraph position="4"> The binary features of the statistical model are defined on the linguistic context of the word to be disambiguated (two positions to the left, two positions to the right) given the tag of the word. Information sources used include the words themselves, the tags of the previous words, and, for unknown words, prefix letters, suffix letters, and information about whether a word contains a number, an uppercase character, or a hyphen. These are the primitive information sources which are combined during feature generation. In tagging an unseen sentence, a beam search is used to find the sequence of tags with the highest probability, using binary features extracted from the context to predict the most probable tags for each word.</Paragraph>
<Paragraph position="5"> Implementation
For our experiments, we used MACCENT, an implementation of maximum entropy modeling that allows symbolic features as input. The package takes care of the translation of symbolic values into binary feature vectors, and implements the iterative scaling approach to find the probabilistic model. The only parameters available in the current version are the maximum number of iterations and a value frequency threshold, which is set to 2 by default (values occurring only once are not taken into account).</Paragraph>
<Paragraph position="6"> Details on how to obtain MACCENT can be found at http://www.cs.kuleuven.ac.be/~ldh/</Paragraph>
</Section>
</Section>
</Paper>
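As an illustration of the two ingredients described in Section 2.2 above, the following minimal Python sketch shows binary context features for a word and a beam search over tag sequences under an exponential model p(tag | context). It is not MXPOST or MACCENT: it conditions only on the previous tag rather than a two-word window on each side, and the tag set, feature templates, beam width, and the tiny hand-set weight table are invented for illustration. In the actual systems the weights are estimated from training data by iterative scaling, which is not shown here.

```python
# Minimal sketch (not MXPOST or MACCENT) of binary context features and
# beam-search decoding under an exponential model p(tag | context).
from math import exp

TAGS = ["N", "V", "DET"]          # toy tag set (assumption)
BEAM = 3                          # beam width (assumption)

def features(words, i, prev_tag):
    """Binary features for position i: word, previous tag, and shape cues."""
    w = words[i]
    feats = ["word=" + w.lower(), "prevtag=" + prev_tag,
             "suffix=" + w[-2:], "prefix=" + w[:2]]
    if any(c.isdigit() for c in w):
        feats.append("has_number")
    if any(c.isupper() for c in w):
        feats.append("has_uppercase")
    if "-" in w:
        feats.append("has_hyphen")
    return feats

def tag_probs(weights, feats):
    """p(tag | context) proportional to exp(sum of weights of active features)."""
    scores = {t: exp(sum(weights.get((f, t), 0.0) for f in feats)) for t in TAGS}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

def beam_search(words, weights):
    """Keep the BEAM most probable partial tag sequences at each position."""
    beam = [([], 1.0)]                              # (tag sequence, probability)
    for i in range(len(words)):
        candidates = []
        for seq, p in beam:
            prev = seq[-1] if seq else "BOS"
            for tag, q in tag_probs(weights, features(words, i, prev)).items():
                candidates.append((seq + [tag], p * q))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:BEAM]
    return beam[0][0]

# Hypothetical weights keyed by (feature, tag); normally learned, not hand-set.
weights = {("word=the", "DET"): 2.0, ("prevtag=DET", "N"): 1.5,
           ("suffix=ks", "V"): 1.0}
print(beam_search(["The", "dog", "barks"], weights))
```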