<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1063">
  <Title>Revision Learning and its Application to Part-of-Speech Tagging</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Multi-Class Classification
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Method
</SectionTitle>
      <Paragraph position="0"> Let us consider the problem to decide the class of an example x among multiple classes. Such a problem is called multi-class classification problem. Many tasks in natural language processing such as POS tagging are regarded as a multi-class classification problem. When we only have binary (positive or negative) classification algorithm at hand, we have to reformulate a multi-class classification problem into a binary classification problem. We assume a binary classifier f(x) that returns positive or negative real value for the class ofx, where the absolute valuejf(x)j reflects the confidence of the classification.</Paragraph>
      <Paragraph position="1"> The one-versus-rest method is known as one of such methods (Allwein et al., 2000). For one training example of a multi-class problem, this method creates a positive training example for the true class and negative training examples for the other classes. As a result, positive and negative examples for each class are generated.</Paragraph>
      <Paragraph position="2"> Suppose we have five candidate classes A, B, C, D and E , and the true class of x is B. Figure 1 (left) shows the created training examples.</Paragraph>
      <Paragraph position="3"> Note that there are only two labels (positive and negative) in contrast with the original problem.</Paragraph>
      <Paragraph position="4"> Then a binary classifier for each class is trained using the examples, and five classifiers are created for this problem. Given a test example x0, all the classifiers classify the example whether it belongs to a specific class or not. Its class is decided by the classifier that gives the largest value of f(x0). The algorithm is shown in Figure  # Training Procedure of One-versus-Rest # This procedure is given training examples # f(xi;yi)g, and creates classifiers.</Paragraph>
      <Paragraph position="5"> # C = fc0;:::;ck!1g: the set of classes, # xi: the ith training example, # yi 2 C: the class of xi, # k: the number of classes, # l: the number of training examples, # fc(C/): the binary classifier for the class c # (see the text).</Paragraph>
      <Paragraph position="6"> procedure TrainOVR(f(x0;y0);:::;(xl!1;yl!1)g) begin # Create the training data with binary label</Paragraph>
      <Paragraph position="8"> if cj 6= yi then Add xi to the training data for the class cj as a  negative example.</Paragraph>
      <Paragraph position="9"> else Add xi to the training data for the class cj as a positive example.</Paragraph>
      <Paragraph position="10"> end end # Train the binary classifiers for j := 0 to k!1 Train the classifier fcj(C/) using the training data. end # Test Function of One-versus-Rest # This function is given a test example and # returns the predicted class of it. # C = fc0;:::;ck!1g: the set of classes, # x: the test example, # k: the number of classes, # fc(C/): binary classifier trained with the # algorithm above.</Paragraph>
      <Paragraph position="12"> However, this method has the problem of being computationally costly in training, because the negative examples are created for all the classes other than the true class, and the total number of the training examples becomes large (which is equal to the number of original training examples multiplied by the number of classes). The computational cost in testing is also large, because all the classifiers haveto work on each test example.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Revision Learning
</SectionTitle>
    <Paragraph position="0"> As discussed in the previous section, the one-versus-rest method has the problem of computational cost. This problem become more serious when costly binary classifiers are used or when a large amount of data is used. To cope with this problem, let us consider the task of POS tagging. Most portions of POS tagging is not so difficult and a simple POS-based HMMs learning 1 achieves more than 95% accuracy simply using the POS context (Brants, 2000). This means that the low capacity model is enough to do most portions of the task, and we need not use a high accuracy but costly algorithm in every portion of the task. This is the base motivation of the revision model we are proposing here.</Paragraph>
    <Paragraph position="1"> Revision learning uses a binary classifier with higher capacity to revise the errors made by the stochastic model with lower capacity as follows: During the training phase, a ranking is assigned to each class by the stochastic model for a training example, that is, the candidate classes are sorted in descending order of its conditional probability given the example. Then, the classes are checked in their ranking order to create binary classifiers as follows. If the class is incorrect (i.e. it is not equal to the true class for the example), the example is added to the training data for that class as a negative example, and the next ranked class is checked. If the class is correct, the example is added to the training data for that class as a positive exam1HMMs can be applied to either of unsupervised or supervised learning. In this paper, we use the latter case, i.e., visible Markov Models, where POS-tagged data is used for training.</Paragraph>
    <Paragraph position="2"> ple, and the remaining ranked classes are not taken into consideration (Figure 1, right). Using these training data, binary classifiers are created. Note that each classifier is a pure binary classifier regardless with the number of classes in the original problem. The binary classifier is trained just for answering whether the output from the stochastic model is correct or not.</Paragraph>
    <Paragraph position="3"> During the test phase, first the ranking of the candidate classes for a given example is assigned by the stochastic model as in the training. Then the binary classifier classifies the example according to the ranking. If the classifier answers the example as incorrect, the next highest ranked class becomes the next candidate for checking. But if the example is classified as correct, the class of the classifier is returned as the answer for the example. The algorithm is shown in Figure 3.</Paragraph>
    <Paragraph position="4"> The amount of training data generated in the revision learning can be much smaller than that in one-versus-rest. Since, in revision learning, negative examples are created only when the stochastic model fails to assign the highest probability to the correct POS tag, whereas negative examples are created for all but one class in the one-versus-rest method. Moreover, testing time of the revision learning is shorter, because only one classifier is called as far as it answers as correct, but all the classifiers are called in the one-versus-rest method.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Morphological Analysis with
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Revision Learning
</SectionTitle>
      <Paragraph position="0"> We introduced revision learning for multi-class classification in the previous section. However, Japanese morphological analysis cannot be regarded as a simple multi-class classification problem, because words in a sentence are not separated by spaces in Japanese and the morphological analyzer has to segment the sentence into words as well as to decide the POS tag of the words. So in this section, we describe how to apply revision learning to Japanese morphological analysis.</Paragraph>
      <Paragraph position="1"> For a given sentence, a lattice consisting of all possible morphemes can be built using a mor- null # Training Procedure of Revision Learning # This procedure is given training examples # f(xi;yi)g, and creates classifiers.</Paragraph>
      <Paragraph position="2"> # C = fc0;:::;ck!1g: the set of classes, # xi: the ith training example, # yi 2 C: the class of xi, # k: the number of classes, # l: the number of training examples, # ni: the ordered indexes of C # (see the following code), # fc(C/): the binary classifier for the class c # (see the text).</Paragraph>
      <Paragraph position="3"> procedure TrainRL(f(x0;y0);:::;(xl!1;yl!1)g) begin # Create the training data with binary label</Paragraph>
      <Paragraph position="5"> Call the stochastic model to obtain the</Paragraph>
      <Paragraph position="7"> if cnj 6= yi then Add xi to the training data for the class cnj as a  negative example.</Paragraph>
      <Paragraph position="8"> else begin Add xi to the training data for the class cnj as a positive example.</Paragraph>
      <Paragraph position="9"> break end end end # Train the binary classifiers for j := 0 to k!1 Train the classifier fcj(C/) using the training data. end # Test Function of Revision Learning # This function is given a test example and # returns the predicted class of it. # C = fc0;:::;ck!1g: the set of classes, # x: the test example, # k: the number of classes, # ni: the ordered indexes of C # (see the following code), # fc(C/): binary classifier trained with the # algorithm above.</Paragraph>
      <Paragraph position="10">  pheme dictionary as in Figure 4. Morphological analysis is conducted by choosing the most likely path on it. We adopt HMMs as the stochastic model and SVMs as the binary classifier. For any sub-paths from the beginning of the sentence (BOS) in the lattice, its generative probability can be calculated using HMMs (Nagata, 1999). We first pick up the end node of the sentence as the current state node, and repeat the following revision learning process backward until the beginning of the sentence. Rankings are calculated by HMMs to all the nodes connected to the current state node, and the best of these nodes is identified based on the SVMs classifiers. The selected node then becomes the current state node in the next round. This can be seen as SVMs deciding whether two adjoining nodes in the lattice are connected or not. In Japanese morphological analysis, for any given morpheme ,,, we use the following features for the SVMs:  1. the POS tags, the lexical forms and the inflection forms of the two morphemes preceding ,,; 2. the POS tags and the lexical forms of the two morphemes following ,,; 3. the lexical form and the inflection form of  ,,.</Paragraph>
      <Paragraph position="11"> The preceding morphemes are unknown because the processing is conducted from the end of the sentence, but HMMs can predict the most likely preceding morphemes, and we use them as the features for the SVMs.</Paragraph>
      <Paragraph position="12"> English POS tagging is regarded as a special case of morphological analysis where the segmentation is done in advance, and can be conducted in the same way. In English POS tagging, given a word w, we use the following features for the SVMs:  1. the POS tags and the lexical forms of the two words preceding w, which are given by HMMs; 2. the POS tags and the lexical forms of the two words following w; 3. the lexical form of w and the prefixes and suffixes of up to four characters, the exis-</Paragraph>
      <Paragraph position="14"/>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"> This section gives experimental results of English POS tagging and Japanese morphological analysis with revision learning.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Experiments of English
Part-of-Speech Tagging
</SectionTitle>
      <Paragraph position="0"> Experiments of English POS tagging with revision learning (RL) are performed on the Penn Treebank WSJ corpus. The corpus is randomly separated into training data of 41,342 sentences and test data of 11,771 sentences. The dictionary for HMMs is constructed from all the words in the training data.</Paragraph>
      <Paragraph position="1"> T3 of ICOPOST release 0.9.0 (Schr&amp;quot;oder, 2001) is used as the stochastic model for ranking stage. This is equivalent to POS-based second order HMMs. SVMs with second order polynomial kernel are used as the binary classifier. The results are compared with TnT (Brants, 2000) based on second order HMMs, and with POS tagger using SVMs with one-versus-rest (1v-r) (Nakagawa et al., 2001).</Paragraph>
      <Paragraph position="2"> The accuracies of those systems for known words, unknown words and all the words are shown in Table 1. The accuracies for both known words and unknown words are improved through revision learning. However, revision learning could not surpass the one-versus-rest.</Paragraph>
      <Paragraph position="3"> The main difference in the accuracies stems from those for unknown words. The reason for that seems to be that the dictionary of HMMs for POS tagging is obtained from the training data, as a result, virtually no unknown words exist in the training data, and the HMMs never make mistakes for unknown words during the training. So no example of unknown words is available in the training data for the SVM reviser. This is problematic: Though the HMMs handles unknown words with an exceptional method, SVMs cannot learn about errors made by the unknown word processing in the HMMs. To cope with this problem, we force the HMMs to make mistakes by eliminating low frequent words from the dictionary. We eliminated the words appearing only once in the training data so as to make SVMs to learn about unknown words. The results are shown in Table 1 (row &amp;quot;cutoff-1&amp;quot;). Such procedure improves the accuracies for unknown words.</Paragraph>
      <Paragraph position="4"> One advantage of revision learning is its small computational cost. We compare the computation time with the HMMs and the one-versusrest. We also use SVMs with linear kernel function that has lower capacity but lower computational cost compared to the second order polynomial kernel SVMs. The experiments are performed on an Alpha 21164A 500MHz processor.</Paragraph>
      <Paragraph position="5"> Table 2 shows the total number of training examples, training time, testing time and accuracy for each of the five systems. The training time and the testing time of revision learning are considerably smaller than those of the oneversus-rest. Using linear kernel, the accuracy decreases a little, but the computational cost is much lower than the second order polynomial kernel.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Experiments of Japanese
Morphological Analysis
</SectionTitle>
      <Paragraph position="0"> We use the RWCP corpus and some additional spoken language data for the experiments of Japanese morphological analysis. The corpus is randomly separated into training data of 33,831 sentences and test data of 3,758 sentences. As the dictionary for HMMs, we use IPADIC version 2.4.4 with 366,878 morphemes (Matsumoto and Asahara, 2001) which is originally constructed for the Japanese morphological analyzer ChaSen (Matsumoto et al., 2001).</Paragraph>
      <Paragraph position="1"> A POS bigram model and ChaSen version 2.2.8 based on variable length HMMs are used as the stochastic models for the ranking stage, and SVMs with the second order polynomial kernel are used as the binary classifier.</Paragraph>
      <Paragraph position="2"> We use the following values to evaluate Japanese morphological analysis: recall = h# of correct morphemes in system's outputih# of morphemes in test datai ; precision = h# of correct morphemes in system's outputih# of morphemes in system's outputi ; F-measure = 2PSrecallPSprecisionrecall + precision : The results of the original systems and those with revision learning are shown in Table 3, which provides the recalls, precisions and F-measures for two cases, namely segmentation (i.e. segmentation of the sentences into morphemes) and tagging (i.e. segmentation and POS tagging). The one-versus-rest method is not used because it is not applicable to morphological analysis of non-segmented languages directly.</Paragraph>
      <Paragraph position="3"> When revision learning is used, all the measures are improved for both POS bigram and ChaSen. Improvement is particularly clear for the tagging task.</Paragraph>
      <Paragraph position="4"> The numbers of correct morphemes for each POS category tag in the output of ChaSen with and without revision learning are shown in Table 4. Many particles are correctly revised by revision learning. The reason is that the POS tags for particles are often affected by the following words in Japanese, and SVMs can revise such particles because it uses the lexical forms of the following words as the features. This is the advantage of our method compared to simple HMMs, because HMMs have difficulty in handling a lot of features such as the lexical forms of words.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Related Works
</SectionTitle>
    <Paragraph position="0"> Our proposal is to revise the outputs of a stochastic model using binary classifiers. Brill  studiedtransformation-basederror-drivenlearning (TBL) (Brill, 1995), which conducts POS tagging by applying the transformation rules to the POS tags of a given sentence, and has a resemblance to revision learning in that the second model revises the output of the first model.</Paragraph>
    <Paragraph position="1">  However, our method differs from TBL in two ways. First, our revision learner simply answers whether a given pattern is correct or not, and any types of binary classifiers are applicable.</Paragraph>
    <Paragraph position="2"> Second, in our model, the second learner is applied to the output of the first learner only once. In contrast, rewriting rules are applied repeatedly in the TBL.</Paragraph>
    <Paragraph position="3"> Recently, combinations of multiple learners have been studied to achieve high performance (Alpaydm, 1998). Such methodologies to combine multiple learners can be distinguished into two approaches: one is the multi-expert method and the other is the multi-stage method. In the former, each learner is trained and answers independently, and the final decision is made based on those answers. In the latter, the multiple learners are ordered in series, and each learner is trained and answers only if the previous learner rejects the examples. Revision learning belongs to the latter approach. In POS tagging, some studies using the multi-expert method were conducted (van Halteren et al., 2001; M`arquez et al., 1999), and Brill and Wu (1998) combined maximum entropy models, TBL, unigram and trigram, and achieved higher accuracy than any of the four learners (97.2% for WSJ corpus).</Paragraph>
    <Paragraph position="4"> Regarding the multi-stage methods, cascading (Alpaydin and Kaynak, 1998) is well known, and Even-Zohar and Roth (2001) proposed the sequential learning model and applied it to POS tagging. Their methods differ from revision learning in that each learner behaves in the same way and more than one learner is used in their methods, but in revision learning the stochastic model assigns rankings to candidates and the binary classifier selects the output. Furthermore, mistakes made by a former learner are fatal in their methods, but is not so in revision learning because the binary classifier works to revise them.</Paragraph>
    <Paragraph position="5"> The advantage of the multi-expert method is that each learner can help each other even if it has some weakness, and generalization errors can be decreased. On the other hand, the computational cost becomes large because each learner is trained using every training data and answers for every test data. In contrast, multi-stage methods can decrease the computational cost, and seem to be effective when a large amount of data is used or when a learner with high computational cost such as SVMs is used.</Paragraph>
  </Section>
class="xml-element"></Paper>