<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0409">
  <Title>Exceptionality and Natural Language Learning</Title>
  <Section position="2" start_page="0" end_page="2" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Our paper is a follow-up of the study done by Daelemans et al. (1999) in which the authors show that keeping exceptional training instances is useful for increasing generalization accuracy when natural language learning tasks are involved. The tasks used in their experiments are: grapheme-phoneme conversion, part of speech tagging, prepositional phrase attachment and base noun phrase chunking. Their study provides empirical evidence that editing exceptional instances leads to a decrease in memory-based learner performance. Next, the memory-based learner is compared on the same tasks with a decision-tree learner and their results favor the memory-based learner. Moreover, the authors provide evidence that the performance of their memory-based learner is linked to its property of holding all instances (including exceptional ones) and general properties of language learning tasks (difficultness in discriminating between noise and valid exceptions and sub-regularities for those tasks).</Paragraph>
    <Paragraph position="1"> We continue on the same track by investigating if their results hold on a different set of tasks. Our tasks come from the area of spoken dialog systems and have smaller datasets and more features (with many of the features being numeric, in contrast with the previous study that had none). We observe in our experiments with these tasks a much smaller exceptionality measure range compared with the previous study. Our results indicate that the previous results do not generalize to all our tasks.</Paragraph>
    <Paragraph position="2"> An additional goal of our research is to investigate a new topic by looking into whether exceptionality measures can be used to characterize the performance of our learners: a memory-based learner (IB1-IG) and a rule-based learner (Ripper). Our results indicate that for some of the exceptionality measures we will examine, IB1-IG is better for predicting typical instances while Ripper is better for predicting exceptional instances.</Paragraph>
    <Paragraph position="3"> We will use the following conventions throughout the paper. The term &amp;quot;exceptional&amp;quot; will be used to label instances that do not follow the rules that characterize the class they are part of (in language learning terms, they are &amp;quot;bad&amp;quot; examples of their class rules). We will use &amp;quot;typical&amp;quot; as the antonym of this term; it will label instances that are good examples of their class rules.</Paragraph>
    <Paragraph position="4"> The fact that an instance is typical should not be confused with an exceptionality measure we will use that has the same name (typicality measure).</Paragraph>
    <Paragraph position="5"> Learning methods We will use in our study the same memory-based learner that was used in the previous study: IB1-IG. The abstraction-based learner used in the previous study was C5.0 (a commercial implementation of the C4.5 decision tree learner). In our study we will use a rule-based learner, Ripper. Although the two abstraction-based learners are different, they share many features (many techniques used in rule-based learning have been adapted from decision tree learning (Cohen, 1995))</Paragraph>
    <Paragraph position="7"> We used Ripper because its implementation was available and previous studies on our language learning tasks were per-</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
formed using Ripper
2.1 IB1-IG
</SectionTitle>
      <Paragraph position="0"> Our memory-based learner is called IB1-IG and is part of TiMBL, a software package developed by the ILK Research Group, Tilburg University and the CNTS Research Group, University of Antwerp. TiMBL is a collection of memory-based learners that sit on top of the classic k-NN classification kernel with added metrics, algorithms, and extra functions.</Paragraph>
      <Paragraph position="1"> Memory-based reasoning is based on the hypothesis that humans, in order to react to a new situation, first compare the new situation with previously encountered situations (which reside in their memory), pick one or more similar situations, and react to the new one based on how they reacted to those similar situations. This type of learning is also called lazy learning because the learner does not build a model from the training data.</Paragraph>
      <Paragraph position="2"> Instead, typically, the whole training set is stored. To predict the class for a new instance, the lazy learner compares it with stored instances using a similarity metric and the new instance class is determined based on the classes of the most similar training instances. At the algorithm level, lazy learning algorithms are versions of k-nearest neighbor (k-NN) classifiers.</Paragraph>
      <Paragraph position="3"> IB1-IG is a k-NN classifier that uses a weighted overlap metric, where a feature weight is automatically computed as the Information Gain (IG) of that feature.</Paragraph>
      <Paragraph position="4"> The weighted overlap metric for two instances X and Y is defined as:  Information gain is computed for every feature in isolation by computing the difference in uncertainty between situations with or without knowledge of the feature value (for more information, see Daelemans et al., 2001). These values describe the importance of that feature in predicting the class of an instance and are used as feature weights.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.2 Ripper
</SectionTitle>
      <Paragraph position="0"> Ripper is a fast and effective rule-based learner developed by William Cohen (Cohen, 1995). The algorithm has an overfit-and-simplify learning strategy: first an initial rule set is devised by overfitting a part of the training set (called the growing set) and then this rule set is repeatedly simplified by applying pruning operators and testing the error reduction on another part of the training set (called the pruning set). Ripper produces a model consisting of an ordered set of if-then rules.</Paragraph>
      <Paragraph position="1"> There are several advantages to using rule-based learners. The most important one is the fact that people can understand relatively easy the model learned by a rule-based learner compared with the one learned by a decision-tree learner, neural network or memory-based learner. Also, domain knowledge can be incorporated in a rule-based learner by altering the type of rules it can learn. Finally, rule-based learners are relatively good at filtering the potential noise from the training set. But in the context of natural language learning tasks where distinguishing between noise and exceptions and sub-regularities is very hard, this filtering may result in a decrease in accuracy. In contrast, memory-based learners, by keeping all instances around (including exceptional ones), may have higher classification accuracy for such tasks.</Paragraph>
      <Paragraph position="2"> Exceptionality measures One of the main disadvantages of memory-based learning is the fact that the entire training set is kept. This leads to serious time and memory performance drawbacks if the training set is big enough. Moreover, to improve accuracy, one may want to have noisy instances present in the training set pruned. To address these problems there has been a lot of work on trying to edit part of the training set without hampering the accuracy of the predictor. Two types of editing can be done. One can edit redundant regular instances (because the training set contains a lot of similar instances for that class) and/or unproductive instances (the ones that present irregularities with respect to the training set space). There are many measures that capture both types of instances. We will use the ones from the previous study (typicality and class prediction strength) and a new one called local typicality. Even though these measures were devised with the purpose of editing part of the training set, they are used in our study and the previous study to point out instances that should not be removed, at least for language learning tasks.</Paragraph>
      <Paragraph position="3"> Typicality We will use the typicality definition from Daelemans et al. (1999) which is similar to the definition from Zhang (1992). In both cases, a typicality function is defined whose extremes correspond to exceptional and typical instances. The function requires a similarity measure which is defined in both cases as the inverse of the distance between two instances. The difference between the two implementations of typicality is that Zhang (1992) defines the distance as the Euclidian distance while Daelemans et al. (1999) use the normalized weighted Manhattan distance from (1). Thus, our similarity measure will be defined as:</Paragraph>
      <Paragraph position="5"> For every instance X, a subset of the dataset called family of X, Fam(X), is defined as being all instances from the dataset that have the same class as X. All remaining instances form the unrelated instances subset, Unr(X). Then, intra-concept similarity is defined as the average similarity between X and instances from Fam(X) and inter-concept similarity as the average similarity between X and instances from Unr(X).</Paragraph>
      <Paragraph position="7"> Finally, typicality of an instance X is defined as the ratio of its intra-concept and inter-concept similarity.</Paragraph>
      <Paragraph position="9"> The typicality values are interpreted as follows: if the value is higher than 1, then that instance has an intra-concept similarity higher than inter-concept similarity, thus one can say that the instance is a good example of its class (it is a typical instance). A value less than 1 implies the opposite: the instance is not a good example of its class (it is an exceptional instance). Values around 1 are called by Zhang boundary instances since they seem to reside at the border between concepts.</Paragraph>
      <Paragraph position="10">  Class prediction strength Another measure used in the previous study is the class prediction strength (CPS). This measure tries to capture the ability of an instance to predict correctly the class of a new instance. We will employ the same CPS definition used in the previous study (the one proposed by Salzberg (1990)). In the context of k-NN, predicting the class means, typically, that the instance is the closest neighbor for a new instance. Thus the CPS function is defined as the ratio of the number of times our instance is the closest neighbor for an instance of the same class and the number of times our instance is the closest neighbor for another instance regardless of its class. A CPS value of 1 means that if our instance is to influence another instance class (by being its closest neighbor) its influence is good (in the sense that predicting the class using our instance class will result in an accurate prediction). Thus our instance is a good predictor for our class, i.e. it is a typical instance. In contrast, a value of 0 indicates a bad predictor for the class and thus labels an exception instance. A value of 0.5 will correspond to instances at the border between concepts.</Paragraph>
      <Paragraph position="11"> Unlike typicality, when computing CPS, we can encounter situations when its value is undefined (zero divided by zero). This means that the instance is not the closest neighbor for any other instance. Since there is no clear interpretation of instance properties in this case, we will set its CPS value to a constant higher than 1 (no particular meaning of the value, just to recognize it in our graphs).</Paragraph>
      <Paragraph position="12"> Local typicality While CPS captures information very close to an instance, typicality as defined by Zhang captures information from the entire dataset. But this may not be the most desirable measure in cases such as those when a concept is made of at least two disjunctive clusters.</Paragraph>
      <Paragraph position="13"> Consider the example from Figure 1. For an instance in the center of cluster A  , its similarity with instances from the same cluster is very high but very low with instances from cluster A  . At the same time, its similarity with instances from class B is somewhere between above t instan ave com around 1 even i  wo values. When everything is averaged, ce intra-concept and inter-concept similarity h parable values thus leading to a typicality value f the instance is highly typical for the  To address this problem, we changed the definition (X) and Unr(X). Instead of considering all inces from the dataset when building the two subsets, e using only instances from a vicinity of our ce. The typicality computed using these new subwill be called local typicality. To define the vicinsed again the similarity metric. When two similarity has the maximum e which is the sum of all feature weights. An inof another instance if and only if r similarity has a value higher than a given percent aximum similarity value (using this definition of ecified number of nearest makes our exceptionality measure adaptive e density of the local neighborhood). For our data, a percent value of 90% yields the best results furing a measure that is different from both typicality PS.</Paragraph>
      <Paragraph position="14"> Like CPS, division by zero can appear when comg local typicality. This means that inter-concept  causes flattening in typicality distribution similarity is zero and this can only happen if there is no instance with a different class in the vicinity of our instance. In this case, if the intra-concept similarity is higher than 0 (there is at least one instance from the same class in the vicinity) we set the local typicality to a maximum value, while if the intra-concept similarity is 0, then we set the typicality to a minimum value (no one in the vicinity of this instance is a good indication of an exceptional instance). When inter-concept similarity is higher than 0, we will set the local typicality to a minimum value if its intra-concept similarity is 0 (so that we will not have a big gap between local typicality values). Minimum and maximum values are computed as values to the left and right of the local typicality interval for non-exceptional cases.</Paragraph>
      <Paragraph position="15"> We can rank our exceptionality measures by the level of information they capture (from most general to most local): typicality, local typicality and CPS.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>