1 The WPDV algorithm

Weighted Probability Distribution Voting (WPDV) is a supervised learning approach to classification. A case which is to be classified is represented as a set of feature-value pairs:

    F_{case} = \{\{f_1 : v_1\}, \ldots, \{f_n : v_n\}\}

An estimate of the probabilities of the various classes for the case in question is then based on the classes observed with similar feature-value pair sets in the training data. To be exact, the probability of class C for F_case is estimated as a weighted sum over all possible subsets F_sub of F_case:

    \hat{P}(C) = N(C) \sum_{F_{sub} \subseteq F_{case}} W_{F_{sub}} \cdot \frac{freq(F_{sub} \wedge C)}{freq(F_{sub})}

with the frequencies (freq) measured on the training data, and N(C) a normalizing factor such that \sum_C \hat{P}(C) = 1.

In principle, the weight factors W_{F_sub} can be assigned per individual subset. For the time being, however, they are assigned for groups of subsets. First of all, it is possible to restrict the subsets that are taken into account in the model, using the size of the subset (e.g. F_sub contains at most 4 elements) and/or its frequency (e.g. F_sub occurs at least twice in the training material). Subsets which do not fulfil the chosen criteria are not used. For the subsets that are used, weight factors are not assigned per individual subset either, but rather per "family", where a family consists of those subsets which contain the same combination of feature types (i.e. the same f_i).

The two components of a WPDV model, distributions and weights, are determined separately. In this paper, I will use the term training set for the data on which the distributions are based and tuning set for the data on the basis of which the weights are selected. Whether these two sets should be disjoint or can coincide is one of the subjects under investigation.

2 Family weights

The various family weighting schemes can be classified according to the type of use they make of the tuning data. Here, I use a very rough classification, into weighting scheme orders.

With 0th order weights, no information whatsoever is used about the data in the tuning set. Examples of such rudimentary weighting schemes are the use of a weight of k! for all subsets containing k elements, as has been used e.g. for wordclass tagger combination (van Halteren et al., to appear), or even a uniform weight for all subsets.

With 1st order weights, information is used about the individual feature types, i.e. each feature type f_i receives its own weight W_{f_i}, from which the weights of the families are derived. First order weights ignore any possible interaction between two or more feature types, but have the clear advantage of corresponding to a reasonably low number of weights, viz. as many as there are feature types.
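As an illustration of the estimation formula and of first order family weights, the sketch below scores a case by voting over all subsets of its feature-value pairs. It is a minimal sketch under stated assumptions, not the WPDV implementation itself: the function names are hypothetical, and deriving a family weight as the product of its members' first order weights is an assumption made for the example.

```python
from itertools import combinations
from collections import defaultdict

def train(cases, max_subset_size=4, min_freq=2):
    """Build the frequency model. cases: iterable of (feature_value_pairs, class_label),
    where feature_value_pairs is a list of (feature_type, value) tuples."""
    subset_freq = defaultdict(int)                       # freq(F_sub)
    class_freq = defaultdict(lambda: defaultdict(int))   # freq(F_sub and C)
    for pairs, label in cases:
        pairs = sorted(pairs)
        for k in range(1, max_subset_size + 1):
            for sub in combinations(pairs, k):
                subset_freq[sub] += 1
                class_freq[sub][label] += 1
    # subsets below the frequency threshold are not used
    return {sub: (freq, dict(class_freq[sub]))
            for sub, freq in subset_freq.items() if freq >= min_freq}

def score_case(model, pairs, feature_weights, max_subset_size=4):
    """Estimate class probabilities for one case by a weighted vote over all
    subsets of its feature-value pairs. feature_weights maps feature types to
    first order weights W_fi; a family weight is taken here to be the product
    of its members' weights (an assumption for illustration)."""
    scores = defaultdict(float)
    pairs = sorted(pairs)
    for k in range(1, max_subset_size + 1):
        for sub in combinations(pairs, k):
            if sub not in model:
                continue
            total, per_class = model[sub]
            family_weight = 1.0
            for feature_type, _ in sub:
                family_weight *= feature_weights.get(feature_type, 1.0)
            for label, freq in per_class.items():
                scores[label] += family_weight * freq / total
    norm = sum(scores.values()) or 1.0     # normalize so the estimates sum to 1
    return {label: s / norm for label, s in scores.items()}
```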
With nth order weights, interaction patterns among up to n feature types are determined and the family weights are adjusted to compensate for the interaction. When n is equal to the total number of feature types, this corresponds to weight determination per individual family. nth order weighting generally requires much larger numbers of weights, which can be expected to lead to much slower tuning procedures. In this paper, therefore, I focus on first order weighting.

3 First order weight determination

As argued in an earlier paper (van Halteren, 2000a), a theory-based feature weight determination would have to take into account each feature's decisiveness and reliability. However, clear definitions of these qualities, and hence also means to measure them, are as yet sorely lacking. As a result, a more pragmatic approach will have to be taken. Reliability is ignored altogether at the moment,[1] and decisiveness is replaced by an entropy-related measure.

3.1 Initial weights

The weight given to each feature type f_i should preferably increase with the amount of information it contributes to the classification process. A measure related to this is Information Gain, which represents the difference between the entropy of the choice with and without knowledge of the presence of a feature (cf. Quinlan (1986)).

As do Daelemans et al. (2000), I opt for a factor proportional to the feature type's Gain Ratio, a normalising derivative of the Information Gain value. The weight factors W_{f_i} are set to an optimal multiplication constant C times the measured Gain Ratio for f_i. C is determined by calculating the accuracies for various values of C on the tuning set[2] and selecting the C which yields the highest accuracy.

[1] It may still be present, though, in the form of the above-mentioned frequency threshold for features.

[2] If the tuning set coincides with the training set, all parts of the tuning procedure are done in leave-one-out mode: in the WPDV implementation, it is possible to (virtually) remove the information about each individual instance from the model when that specific instance has to be classified.
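The following sketch shows one way the Gain Ratio of a feature type can be computed from training cases, in the spirit of Quinlan (1986); the function names and data layout are illustrative assumptions, not taken from the WPDV implementation.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain_ratio(cases, feature_index):
    """Gain Ratio of one feature type: Information Gain divided by the
    split information of that feature. cases: list of (feature_value_tuple, class_label)."""
    class_counts = Counter(label for _, label in cases)
    class_entropy = entropy(class_counts.values())

    by_value = defaultdict(Counter)            # class counts per feature value
    for values, label in cases:
        by_value[values[feature_index]][label] += 1

    n = len(cases)
    conditional = sum(sum(c.values()) / n * entropy(c.values())
                      for c in by_value.values())
    information_gain = class_entropy - conditional

    split_info = entropy([sum(c.values()) for c in by_value.values()])
    return information_gain / split_info if split_info else 0.0

# Initial first order weights: W_fi = C * gain_ratio(training_cases, i),
# with the constant C chosen by accuracy on the tuning set.
```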
3.2 Hill-climbing

Since the initial weight determination is based on pragmatic rather than theoretical considerations, it is unlikely that the resulting weights are already the optimal ones. For this reason, an attempt is made to locate even better weight vectors in the n-dimensional weight space. The navigation mechanism used in this search is hill-climbing. This means that systematic variations of the currently best vector are investigated. If the best variation is better than the currently best vector, that variation is taken as the best vector and the process is repeated. This repetition continues until no better vector is found.

In the experiments described here, the variation consists of multiplication or division of each individual W_{f_i} by a variable V (i.e. 2n new vectors are tested each time), which is increased if a better vector is found, and otherwise decreased. The process is halted as soon as V falls below some pre-determined threshold.

Hill-climbing, like most other optimization techniques, is vulnerable to overtraining. To lessen this vulnerability, the WPDV hill-climbing implementation splits its tuning material into several (normally five) parts. A switch to a new weight vector is only made if the accuracy increases on the tuning set as a whole and does not decrease on more than one part, i.e. some losses are accepted, but only if they are localized.
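The loop below sketches this search procedure: each round tests multiplication and division of every individual weight by V, keeps the best candidate if it improves tuning accuracy, and grows or shrinks V accordingly. The accuracy callback and the start, stop and adjustment values for V are placeholders; the actual implementation additionally applies the per-part acceptance criterion just described.

```python
def hill_climb(weights, accuracy, v_start=2.0, v_stop=1.01, grow=1.5, shrink=0.5):
    """Hill-climbing over the first order weight vector (sketch).
    weights:  dict mapping feature types to W_fi
    accuracy: callable scoring a weight vector on the tuning material."""
    best, best_acc = dict(weights), accuracy(weights)
    v = v_start
    while v > v_stop:                      # halt once V falls below the threshold
        candidates = []
        for f in best:                     # multiply or divide each individual weight
            for factor in (v, 1.0 / v):    # by V, giving 2n candidate vectors
                cand = dict(best)
                cand[f] *= factor
                candidates.append(cand)
        scored = [(accuracy(c), c) for c in candidates]
        cand_acc, cand_best = max(scored, key=lambda pair: pair[0])
        if cand_acc > best_acc:
            best, best_acc = cand_best, cand_acc
            v *= grow                      # V is increased if a better vector is found
        else:
            v *= shrink                    # ... and decreased otherwise
    return best
```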
4 Quality of the first order weights

In order to determine the quality of the WPDV system, using first order weights as described above, I run a series of experiments, using tasks introduced by Daelemans et al. (1999):[3]

The part-of-speech tagging task (POS) is to determine a wordclass tag on the basis of the disambiguated tags of two preceding tokens and the undisambiguated tags of the focus and two following tokens.[4] 5 features with 170-480 values; 169 classes; 837K cases training; 2x105K cases test.

The grapheme-to-phoneme conversion with stress task (GS) is to determine the pronunciation of an English grapheme, including the presence of stress, on the basis of the focus grapheme, three preceding and three following graphemes. 7 features with 42 values each; 159 classes; 540K cases training; 2x68K cases test.

The PP attachment task (PP) is prepositional phrase attachment to either a preceding verb or a preceding noun, on the basis of the verb, the noun, the preposition in question and the head noun of the prepositional complement. 4 features with 3474, 4612, 68 and 5780 values.

The NP chunking task (NP) is the determination of the position of the focus token in a base NP chunk (at beginning of chunk, in chunk, or not in chunk), on the basis of the words and tags for two preceding tokens, the focus and one following token, and also the predictions by three first stage classifiers for the task.[5] 11 features with 3 (first stage classifiers), 90 (tags) and 20K (words) values; 3 classes; 201K cases training; 2x25K cases test.[6]

[3] I only give a rough description of the tasks here. For the exact details, I refer the reader to Daelemans et al. (1999).

[5] For a WPDV approach to a more general chunking task, see my contribution to the CoNLL shared task, elsewhere in these proceedings.

[6] The number of feature combinations for the NP task is so large that the WPDV model has to be limited. For the current experiments, I have opted for a maximum size for F_sub of four features and a threshold frequency of two observations in the training set.

For each of the tasks, sections a to h of the data set are used as the training set and sections i and j as (two separate) test sets. All three are also used as tuning sets. This allows a comparison between tuning on the training set itself and tuning on a held-out set.

For comparison with some other well-known machine learning algorithms, I complement the WPDV experiments with accuracy measurements for three other systems: 1) a system using a Naive Bayes probability estimation; 2) TiMBL, which uses memory based learning and probability estimation based on the nearest neighbours (Daelemans et al., 2000),[7] for which I use the parameters which yielded the best results according to Daelemans et al. (1999); and 3) Maccent, a maximum entropy based system,[8] for which I use both the default parameters, viz. a frequency threshold of 2 for features to be used and 150 iterations of improved iterative scaling, and a more ambitious parameter setting, viz. a threshold of 1 and 300 iterations.

The results for the various WPDV weights and the other machine learning techniques are listed in Tables 1 to 4.[9] Except for one case (PP with tuning on j and testing on i), the first order weight WPDV results are all higher than those for the comparison systems.[10] 0th order weights generally do not reach this level of accuracy.

Hill-climbing with the tuning set equal to the training set produces the best results overall. It always leads to an improvement of the accuracies on both test sets over the initial weights, although the improvement is sometimes very small (GS). Equally important, the improvement on the test sets is comparable to that on the tuning/training set. This is certainly not the case for hill-climbing with the tuning set equal to the other test set, which generally does not reach the same level of accuracy and may even be detrimental (climbing on PP_j).

Strangely enough, hill-climbing with the tuning set equal to the test set itself sometimes does not even yield the best quality for that test set (POS with test set i and especially NP with j). This shows that the weight-to-accuracy function does have local maxima, and the increased risk for smaller data sets to run into a sub-optimal one is high enough that it happens in at least two of the eight test set climbs.

[10] The accuracies for TiMBL are lower than those found by Daelemans et al. (1999): POS_i 97.95, POS_j 97.90, GS_i 93.75, GS_j 93.58, PP_i 83.64, PP_j 82.51, NP_i 98.38 and NP_j 98.25. This is due to the use of eight-part training sets instead of nine. The extreme differences for the GS task show how much this task depends on individual observations rather than on generalizations, which probably also explains why Naive Bayes and Maximum Entropy (Maccent) handle this task so badly.

In summary, hill-climbing should preferably be done with the tuning set equal to the training set. This is not surprising, as the leave-one-out mechanism allows the training set to behave as held-out data, while containing eight times more cases than a test set turned tuning set. The disadvantage is a much more time-intensive hill-climbing procedure, but when developing an actual production model, the weights only have to be determined once, and the results appear to be worth it most of the time.
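To illustrate the leave-one-out mechanism referred to here (and in footnote 2), the sketch below evaluates accuracy on the training set while virtually removing each instance's own counts from the frequency model before it is classified. It builds on the hypothetical train()/score_case() sketch from Section 1 and is illustrative only; the real implementation works on the actual WPDV model, not on these example structures.

```python
from itertools import combinations

def loo_accuracy(model, cases, feature_weights, max_subset_size=4):
    """Accuracy on the training set in leave-one-out mode (sketch).
    Before classifying a training instance, its own counts are (virtually)
    removed from the model and restored afterwards. Relies on the
    hypothetical score_case() defined in the earlier sketch."""
    correct = 0
    for pairs, label in cases:
        pairs_sorted = sorted(pairs)
        removed = []
        for k in range(1, max_subset_size + 1):
            for sub in combinations(pairs_sorted, k):
                if sub in model:
                    total, per_class = model[sub]
                    removed.append((sub, total, per_class))
                    if total == 1:
                        del model[sub]       # subset supported only by this instance
                    else:
                        model[sub] = (total - 1,
                                      {**per_class, label: per_class.get(label, 0) - 1})
        scores = score_case(model, pairs, feature_weights, max_subset_size)
        if scores and max(scores, key=scores.get) == label:
            correct += 1
        for sub, total, per_class in removed:   # restore the removed counts
            model[sub] = (total, per_class)
    return correct / len(cases)
```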