<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0710">
  <Title>Learning Distributed Linguistic Classes</Title>
  <Section position="4" start_page="55" end_page="55" type="metho">
    <SectionTitle>
2 Dichotomizer Ensembles
</SectionTitle>
    <Paragraph position="0"> Dichotomizer ensembles must be diverse apart from accurate. Diversity is necessary in order to decorrelate the predictions of the various dichotomizers. This is a consequence of the voting mechanism underlying ECOC, where bit functions can only outvote other bit functions if they do not make similar predictions. Selecting different features per dichotomizer was proposed for this purpose (Ricci and Aha, 1997). Another possibility is to add limited non-locality to a local classifier, since classifiers that use global information such as class probabilities during classification, are much less vulnerable to correlated predictions. The following ideas were tested empirically on a suite of natural language learning tasks.</Paragraph>
    <Paragraph position="1"> * A careful feature selection approach, where every dichotomizer is trained to select (possibly) different features.</Paragraph>
    <Paragraph position="2"> * A careless feature selection approach, where every bit is predicted by a voting committee of dichotomizers, each of which randomly selects features (akin in spirit to the Multiple Feature Subsets approach for non-distributed classifiers (Bay, 1999).</Paragraph>
    <Paragraph position="3"> * A careless feature selection approach, where blocks of two adjacent bits are predicted by a voting committee of quadrotomizers, each of which randomly selects features. Learning blocks of two bits allows for bit codes that are twice as long (larger error-correction), but with half as many classifiers. Assuming a normal distribution of errors and bit values in every 2 bits-block, there is a 25% chance that both bits in a 2-bit block are wrong. The other 75% chance of one bit wrong would produce performance equal to voting per bit.</Paragraph>
    <Paragraph position="4"> Formally, this implies a switch from N two-class problems to N/2 four-class problems, where separate regions of the class landscape are learned jointly.</Paragraph>
    <Paragraph position="5"> * Adding non-locality to 1-3 in the form of larger values for k.</Paragraph>
    <Paragraph position="6"> * The use of the Modified Value Difference Metric, which alters the distribution of instances over the hyperspace of features, yielding different class boundaries.</Paragraph>
  </Section>
  <Section position="5" start_page="55" end_page="56" type="metho">
    <SectionTitle>
3 Memory-based learning
</SectionTitle>
    <Paragraph position="0"> The memory-based learning paradigm views cognitive processing as reasoning by analogy.</Paragraph>
    <Paragraph position="1"> Cognitive classification tasks are carried out by  matching data to be classified with classified data stored in a knowledge base. This latter data set is called the training data, and its elements are called instances. Every instance consists of a feature-value vector and a class label. Learning under the memory-based paradigm is lazy, and consists only of storing the training instances in a suitable data structure. The instance from the training set which resembles the most the item to be classified determines the classification of the latter. This instance is called the nearest neighbor, and models based on this approach to analogy are called nearest neighbor models (Duda and Hart, 1973). So-called k-nearest neighbor models select a winner from the k nearest neighbors, where k is a parameter and winner selection is usually based on class frequency. Resemblance between instances is measured using distance metrics, which come in many sorts. The simplest distance metric is the overlap metric:  k (3) 5(vi, vj) = 0 if vi = vj 5(vi, vj) = 1 if vi C/ vj (~ri(I) is the i-th projection of the feature vector I.) Another distance metric is the Modified Value Difference Metric (MVDM) (Cost and Salzberg, 1993). The MVDM defines similarity between two feature values in terms of posterior probabilities: 5(vi, vj) = ~ I P(c I vi) - P(c Ivj) l (4) cEClasses  When two values share more classes, they are more similar, as 5 decreases. Memory-based learning has fruitfully been applied to natural language processing, yielding state-of-the-art performance on all levels of linguistic analysis, including grapheme-to-phoneme conversion (van den Bosch and Daelemans, 1993), PoS-tagging (Daelemans et al., 1996), and shallow parsing (Cardie et al., 1999). In this study, the following memory-based models are used, all available from the TIMBL package (Daelemans et al., 1999). IBi-IG is a k-nearest distance classifier which employs a weighted over-</Paragraph>
    <Paragraph position="3"> In stead of drawing winners from the k-nearest neighbors pool, IBi-IG selects from a pool of instances for k nearest distances. Features are separately weighted based on Quinlan's information gain ratio (Quinlan, 1993), which measures the informativity of features for predicting class labels. This can be computed by subtracting the entropy of the knowledge of the feature values from the general entropy of the class labels. The first quantity is normalized with the a priori probabilities of the various feature values of feature F:</Paragraph>
    <Paragraph position="5"> H(C\[F=v\] ) is the class entropy computed over the subset of instances that have v as value for Fi. Normalization for features with many values is obtained by dividing the information gain for a feature by the entropy of its value set (called the split info of feature Fi.</Paragraph>
    <Paragraph position="7"> IGTREE is a heuristic approximation of IB1-IG which has comparable accuracy, but is optimized for speed. It is insensitive to k-values larger than 1, and uses value-class cooccurrence information when exact matches fail.</Paragraph>
  </Section>
  <Section position="6" start_page="56" end_page="57" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> The effects of a distributed class representation on generalization accuracy were measured using an experimental matrix based on 5 linguistic datasets, and 8 experimental conditions, addressing feature selection-based ECOC vs. voting-based ECOC, MVDM, values of k larger than 1, and dichotomizer weighting. The following linguistic tasks were used. DIMIN is a Dutch diminutive formation task derived from the Celex lexical database for Dutch (Baayen et al., 1993). It predicts Dutch nominal diminutive suffixes from phonetic properties (phonemes and stress markers) of maximally the  last three syllables of the noun. The STRESS task, also derived from the Dutch Celex lexP cal database, assigns primary stress on the basis of phonemic values. MORPH assigns morphological boundaries (a.o. root morpheme, stress-changing affix, inflectional morpheme), based on English CELEX data. The WSJ-NPVP task deals with NP-VP chunking of PoS-tagged Wall Street Journal material. GRAPHON, finally, is a grapheme-to-phoneme conversion task for English based on the English Celex lexical database. Numeric characteristics of the different tasks are listed in table 1. All tasks with the exception of GRAPHON happened to be five-class problems; for GRAPHON, a five-class subset was taken from the original training set, in order to keep computational demands manageable. The tasks were subjected to the  8 different experimental situations of table 2.</Paragraph>
    <Paragraph position="1"> For feature selection-based ECOC, backward sequential feature elimination was used (Raaijmakers, 1999), repeatedly eliminating features in turn and evaluating each elimination step with 10-fold cross-validation. For dichotomizer weighting, error information of the dichotomizers, determined from separate unweighted 10-fold cross-validation experiments on a separate training set, produced a weighted Hamming distance metric. Error-based weights were based on raising a small constant ~ in the interval \[0, 1) to the power of the number of errors made by the dichotomizer (Cesa-Bianchi et al., 1996).</Paragraph>
    <Paragraph position="2"> Random feature selection drawing features with replacement created feature sets of both different size and composition for every dichotomizer.</Paragraph>
  </Section>
  <Section position="7" start_page="57" end_page="57" type="metho">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"> Table 3 lists the generalization accuracies for the control groups, and table 4 for the ECOC algorithms. All accuracy results are based on 10-fold cross-validation, with p &lt; 0.05 using paired t-tests. The results show that dis-</Paragraph>
  </Section>
  <Section position="8" start_page="57" end_page="59" type="metho">
    <SectionTitle>
ALGORITHM DESCRIPTION
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> tributed class representations can lead to statistically significant accuracy gains for a variety of linguistic tasks. The ECOC algorithm based on feature selection and weighted Hamming distance performs best. Voting-based ECOC performs poorly on DIMIN and STRESS with voting per bit, but significant accuracy gains are achieved by voting per block, putting it on a par with the best performing algorithm. Regression analysis was applied to investigate the effect of the Modified Value Difference Metric on ECOC accuracy. First, the accuracy gain of MVDM as a function of the information gain ratio of the features was computed. The results show a high correlation (0.82, significant at p &lt; 0.05) between these variables, indicating a linear relation. This is in line with the idea underlying MVDM: whenever two feature values are very predictive of a shared class, they contribute to the similarity between the instances they belong to, which will lead to more accurate classifiers.</Paragraph>
    <Paragraph position="3"> Next, regression analysis was applied to determine the effect of MVDM on ECOC, by relating the accuracy gain of MVDM (k=3) compared to  control group (in round brackets) , and x deterioration at p &lt; 0.05 using paired t-tests). A 1&amp;quot; indicates 25 voters for performance reasons.</Paragraph>
    <Paragraph position="4"> control group II to the accuracy gain of ECOC (algorithm $6, compared to control group IV).</Paragraph>
    <Paragraph position="5"> The correlation between these two variables is very high (0.93, significant at p &lt; 0.05), again indicative of a linear relation. From the perspective of learning class boundaries, the strong effect of MVDM on ECOC accuracy can be understood as follows. When the overlap metric is used, members of a training set belonging to the same class may be situated arbitrarily remote from each other in the feature hyperspace. For instance, consider the following two instances taken from DIMIN: ......... d,A,k,je ......... d,A,x,je (Hyphens indicate absence of feature values.) These two instances encode the diminutive formation of Dutch dakje (little roo\]~ from dak (roo\]~, and dagje (lit. little day, proverbially used) from dag (day). Here, the values k and x, corresponding to the velar stop 'k' and the velar fricative 'g', are minimally different from a phonetic perspective. Yet, these two instances have coordinates on the twelfth dimension of the feature hyperspace that have nothing to do with each other. The overlap treats the k-x value clash just like any other value clash. This phenomenon may lead to a situation where inhabitants of the same class are scattered over the feature hyperspace. In contrast, a value difference metric like MVDM which attempts to group feature values on the basis of class cooccurrence information, might group k and x together if they share enough classes. The effect of MVDM on the density of the feature hyperspace can be compared with the density obtained with the overlap metric as follows. First, plot a random numerical transform of a feature space. For expository reasons, it is adequate to restrict attention to a low-dimensional (e.g.</Paragraph>
    <Paragraph position="6"> two-dimensional) subset of the feature space, for a specific class C. Then, plot an MVDM transform of this feature space, where every coordinate (a, b) is transformed into (P(Cla), P(C I b)). This idea is applied to a subset of DIMIN, consisting of all instances classified as j e (one of the five diminutive suffixes for Dutch). The features for this subset were limited to the last two, consisting of the rhyme and coda of the last syllable of the word, clearly the most informative features for this task. Figure 2 displays the two scatter plots. As can be seen, instances are widely scattered over the feature space for the numerical transform, whereas the MVDM-based transform forms many clusters and produces much higher density. In a condensed fea- null ues based on the overlap metric (left) vs. numerical transform of feature values based on MVDM (right), for a two-features-one-class subset of DIMIN.</Paragraph>
    <Paragraph position="7"> ture hyperspace the number of class boundaries to be learned per bit function reduces. For instance, figures 3 displays the class boundaries for a relatively condensed feature hyperspace, where classes form localized populations, and a scattered feature hyperspace, with classes distributed over non-adjacent regions. The number of class boundaries in the scattered feature space is much higher, and this will put an addi- null tional burden on the learning problems constituted by the various bit functions.</Paragraph>
    <Paragraph position="8">  feature space (right).</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML