File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/i05-3023_metho.xml
Size: 7,371 bytes
Last Modified: 2025-10-06 14:09:43
<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-3023">
<Title>Perceptron Learning for Chinese Word Segmentation</Title>
<Section position="4" start_page="154" end_page="154" type="metho">
<SectionTitle> 3 Learning Algorithm </SectionTitle>
<Paragraph position="0"> The Perceptron is a simple and effective learning algorithm. For a binary classification problem, it examines the training examples one by one, predicting the label of each. If the prediction is correct, the example is passed over; otherwise, the example is used to correct the model. The algorithm stops when the model classifies all training examples correctly. The margin Perceptron not only classifies every training example correctly but also requires the value it outputs for every training example (before thresholding) to be larger than a predefined parameter, the margin. The margin Perceptron has better generalisation capability than the standard Perceptron. Li et al. (2002) proposed the Perceptron algorithm with uneven margins (PAUM) by introducing two margin parameters, τ+ and τ−, into the update rules for the positive and negative examples, respectively. The two margin parameters allow PAUM to handle imbalanced datasets better than both the standard Perceptron and the margin Perceptron. PAUM has been used successfully for document classification and information extraction (Li et al., 2005).</Paragraph>
<Paragraph position="1"> We used the PAUM algorithm to train one classifier for each of the four classes in Chinese word segmentation. For a test example, the outputs of the Perceptron classifiers before thresholding were used for comparison among the four classifiers.</Paragraph>
<Paragraph position="2"> The important parameters of the learning algorithm are the uneven margins parameters τ+ and τ−. In all our experiments τ+ = 20 and τ− = 1 were used.</Paragraph>
<Paragraph position="3"> Table 1 presents the results for each of the four classification problems, obtained from 4-fold cross-validation on the training set. Not surprisingly, classifying the middle character of a multi-character word was much harder than the other three classification problems, since the middle character of a Chinese word is less characteristic than the beginning character, the end character, or a single-character word. On the other hand, improving the classification of the middle character, while keeping the performance of the other classifiers, would improve the overall performance of segmentation.</Paragraph>
[Table 1 caption fragment: "... training sets of the four corpora. C1, C2 and C3 refer to the classifiers for the beginning, middle and end characters of multi-character words, respectively, and C4 refers to the classifier for single-character words."]
<Paragraph position="4"> Support vector machines (SVM) is a popular learning algorithm that has been successfully applied to many classification problems in natural language processing. Like PAUM, SVM is a maximal-margin algorithm. Table 2 presents a comparison of performance and computation time between PAUM and SVM with a linear kernel on three subsets of the cityu corpus of different sizes. The performance of SVM was better than that of PAUM; however, the larger the training data, the closer the performance of PAUM came to that of SVM. On the other hand, SVM took much longer to train than PAUM. As a matter of fact, we have been running the SVM with a linear kernel on the whole cityu training corpus using 4-fold cross-validation for one month and it has not finished yet. In contrast, PAUM took only about one hour to run the same experiment.</Paragraph>
</Section>
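For concreteness, the following is a minimal sketch of the uneven-margins Perceptron (PAUM) training loop in primal form, assuming dense NumPy feature vectors and labels in {+1, −1}; the function names, fixed epoch limit, and unit learning rate are illustrative assumptions rather than details taken from the paper, while the default margins follow the τ+ = 20, τ− = 1 setting used in the experiments.

    import numpy as np

    def train_paum(X, y, tau_pos=20.0, tau_neg=1.0, max_epochs=20):
        """Perceptron with uneven margins (PAUM), primal form.

        X: (n_samples, n_features) array; y: labels in {+1, -1}.
        tau_pos / tau_neg: margins required for positive / negative examples.
        """
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(max_epochs):
            updates = 0
            for x_i, y_i in zip(X, y):
                required = tau_pos if y_i > 0 else tau_neg
                # Update whenever the (uneven) margin is violated.
                if y_i * (np.dot(w, x_i) + b) <= required:
                    w += y_i * x_i
                    b += y_i
                    updates += 1
            if updates == 0:  # every example satisfies its margin
                break
        return w, b

    def score(w, b, x):
        # Raw output before thresholding; for segmentation, the scores of the
        # four one-vs-rest classifiers are compared and the largest one wins.
        return np.dot(w, x) + b

One classifier of this form would be trained per class (C1 to C4), and a character is assigned to the class whose classifier gives the highest raw score, as described above.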
<Section position="5" start_page="154" end_page="155" type="metho">
<SectionTitle> 4 Features for Each Character </SectionTitle>
<Paragraph position="0"> In our system every character was regarded as one instance for classification. The features for one character were the character form itself and the character forms of the two preceding and the two following characters.</Paragraph>
<Paragraph position="1"> In other words, the features for one character c0 were the character forms from a context window centering at c0 and containing five characters {c−2, c−1, c0, c1, c2} in a sentence. Our experiments on training data showed that co-occurrences of characters in the context window were helpful. Taking account of all co-occurrences of characters in the context window is equivalent to using a quadratic kernel in the Perceptron, while not using any co-occurrences amounts to a linear kernel. Alternatively, we can use only part of the co-occurrences as features, which can be regarded as a kind of semi-quadratic kernel.</Paragraph>
[Table 2 caption fragment: "... for Chinese word segmentation: averaged F1 (%) over the 4-fold cross-validation on three subsets of the cityu corpus and the computation time (in seconds) for each experiment. The three subsets have 100, 1000 and 5000 sentences, respectively."]
<Paragraph position="2"> Table 3 compares the three types of kernel for the Perceptron, where for the semi-quadratic kernel we used the character co-occurrences in the context window adopted in (Xue and Shen, 2003), namely {c−2c−1, c−1c0, c0c1, c1c2, c−1c1}. The quadratic kernel gave much better results than the linear kernel, and the semi-quadratic kernel was slightly better than the fully quadratic kernel. The semi-quadratic kernel also led to fewer features and less computation time than the fully quadratic kernel. Therefore, this kind of semi-quadratic kernel was used in our submissions.</Paragraph>
<Paragraph position="3"> The quadratic kernel for the Perceptron, as well as for SVM, has performed better than the linear kernel for information extraction and other NLP tasks (see, e.g., Carreras et al. (2003)). However, the quadratic kernel is usually implemented in dual form for the Perceptron, which takes a very long time for training. We implemented the quadratic kernel for the Perceptron in primal form by encoding the linear and quadratic features into the feature vector explicitly. Our implementation performed even slightly better than the Perceptron with the full quadratic kernel, as we used only part of the quadratic features, and it was still as efficient as the Perceptron with a linear kernel.</Paragraph>
</Section>
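As an illustration of the explicit primal-form feature encoding described in Section 4, the sketch below builds the five unigram window features and the five co-occurrence (semi-quadratic) features for one character position; the padding symbol and feature-string format are illustrative assumptions, not taken from the paper.

    def char_features(sentence, i, pad="<PAD>"):
        """Features for the character at position i: the window
        {c-2, c-1, c0, c1, c2} plus the co-occurrences
        {c-2c-1, c-1c0, c0c1, c1c2, c-1c1} of Xue and Shen (2003)."""
        def c(k):
            j = i + k
            return sentence[j] if 0 <= j < len(sentence) else pad

        window = {k: c(k) for k in range(-2, 3)}
        feats = [f"C{k}={window[k]}" for k in range(-2, 3)]        # unigram (linear) features
        for a, b in [(-2, -1), (-1, 0), (0, 1), (1, 2), (-1, 1)]:  # semi-quadratic features
            feats.append(f"C{a}C{b}={window[a]}{window[b]}")
        return feats

    # Example: features for the third character of a five-character sentence.
    print(char_features(list("中文分词好"), 2))

Mapping each feature string to one dimension of a sparse binary vector lets a linear Perceptron over this expanded representation emulate the semi-quadratic kernel without the cost of a dual-form implementation.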
<Section position="6" start_page="155" end_page="155" type="metho">
<SectionTitle> 5 Open Test </SectionTitle>
<Paragraph position="0"> While the closed test required participants to use only the information presented in the training material, the open test allowed the use of any external information or resources besides the training data. In our submissions for the open test we used only minimal external information, namely knowledge of the utf-8 codes for identifying a piece of English text or an Arabic number. What we did with this knowledge was to pre-process the text by replacing each piece of English text with one symbol &quot;E&quot; and every Arabic number with another symbol &quot;N&quot;. This kind of pre-processing resulted in smaller training data and less computation time, and yet slightly better performance on the training data, as shown in Table 4, which compares the results of collapsing the English text only, and of collapsing both the English text and the Arabic numbers, with those for the closed test.</Paragraph>
<Paragraph position="1"> Table 4 also presents the 95% confidence intervals for the F-measures.</Paragraph>
</Section>
</Paper>
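The pre-processing for the open test can be sketched as follows; this is a minimal illustration, and the exact character ranges treated as English text and Arabic numbers are assumptions, since the paper only mentions using utf-8 code knowledge.

    import re

    # Collapse runs of Latin letters to "E" and runs of digits to "N".
    # The precise character classes the authors used are not specified,
    # so these regexes are assumptions.
    ENGLISH_RE = re.compile(r"[A-Za-z]+")
    NUMBER_RE = re.compile(r"[0-9]+")

    def collapse(text, collapse_english=True, collapse_numbers=True):
        if collapse_english:
            text = ENGLISH_RE.sub("E", text)
        if collapse_numbers:
            text = NUMBER_RE.sub("N", text)
        return text

    print(collapse("GDP增长8.5个百分点"))  # -> "E增长N.N个百分点"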