<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1106">
  <Title>Text Classification in Asian Languages without Word Segmentation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Language Model Text Classifiers
</SectionTitle>
    <Paragraph position="0"> The goal of language modeling is to predict the probability of natural word sequences; or more simply, to put high probability on word sequences that actually occur (and low probability on word sequences that never occur). Given a word sequence</Paragraph>
    <Paragraph position="2"> a11a19a18 to be used as a test corpus, the quality of a language model can be measured by the empirical perplexity (or entropy) on this corpus</Paragraph>
    <Paragraph position="4"> The goal of language modeling is to obtain a small perplexity.</Paragraph>
    <Paragraph position="5"> 2.1 a38 -gram language modeling The simplest and most successful basis for language modeling is the a7 -gram model. Note that by the chain rule of probability we can write the probability of any sequence as</Paragraph>
    <Paragraph position="7"> An a7 -gram model approximates this probability by assuming that the only words relevant to predicting</Paragraph>
    <Paragraph position="9"> words; that is, it assumes the Markov a7 -gram independence as- null where #(.) is the number of occurrences of a specied gram in the training corpus. Unfortunately, using grams of length up to a7 entails estimating the probability ofa48 a42 events, wherea48 is the size of the word vocabulary. This quickly overwhelms modern computational and data resources for even modest choices of a7 (beyond 3 to 6). Also, because of the heavy tailed nature of language (i.e. Zipf's law) one is likely to encounter novel a7 -grams that were never witnessed during training. Therefore, some mechanism for assigning non-zero probability to novel a7 -grams is a central and unavoidable issue. One standard approach to smoothing probability estimates to cope with sparse data problems (and to cope with potentially missing a7 -grams) is to use some sort of  The discounted probability (5) can be computed using different smoothing approaches including Laplace smoothing, linear smoothing, absolute smoothing, Good-Turing smoothing and Witten-Bell smoothing (Chen and Goodman, 1998).</Paragraph>
    <Paragraph position="10"> The language models described above use individual words as the basic unit, although one could instead consider models that use individual characters as the basic unit. The remaining details remain the same in this case. The only difference is that the character vocabulary is always much smaller than the word vocabulary, which means that one can normally use a much higher order, a7 , in a character level a7 -gram model (although the text spanned by a character model is still usually less than that spanned by a word model). The bene ts of the character level model in the context of text classi cation are multi-fold: it avoids the need for explicit word segmentation in the case of Asian languages, and it greatly reduces the sparse data problems associated with large vocabulary models. In this paper, we experiment with character level models to avoid word segmentation in Chinese and Japanese.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Language models as text classifiers
</SectionTitle>
      <Paragraph position="0"> Text classi ers attempt to identify attributes which distinguish documents in different categories. Such attributes may include vocabulary terms, word average length, local a7 -grams, or global syntactic and semantic properties. Language models also attempt capture such regularities, and hence provide another natural avenue to constructing text classi ers.</Paragraph>
      <Paragraph position="1"> Our approach to applying language models to text categorization is to use Bayesian decision theory.</Paragraph>
      <Paragraph position="2"> Assume we wish to classify a text</Paragraph>
      <Paragraph position="4"> training data or can be used to incorporate more assumptions, such as a uniform or Dirichelet distribution. null Therefore, our approach is to learn a separate back-off language model for each category, by training on a data set from that category. Then, to categorize a new text</Paragraph>
      <Paragraph position="6"> model, evaluate the likelihood (or entropy) of   under the model, and pick the winning category according to Equ. (9).</Paragraph>
      <Paragraph position="7"> The inference of an a7 -gram based text classi er is very similar to a naive Bayes classi er (to be dicussed below). In fact, a7 -gram classi ers are a straightforward generalization of naive Bayes (Peng and Schuurmans, 2003).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Traditional Text Classifiers
</SectionTitle>
    <Paragraph position="0"> We introduce the three standard text classi ers that we will compare against below.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Naive Bayes classifiers
</SectionTitle>
      <Paragraph position="0"> A simple yet effective learning algorithm for text classi cation is the naive Bayes classi er. In this model, a document  is de ned. The features to be used during classi cation are usually selected by employing heuristic methods, such asa8a147a9 or mutual information scoring, that involve setting cutoff thresholds and conducting a greedy search for a good feature subset. We refer this method as ad hoc a7 -gram based text classi er. The nal classi cation decision is made according to</Paragraph>
      <Paragraph position="2"> Different distance metrics can be used in this approach. We implemented a simple re-ranking distance, which is sometimes referred to as the out-outplace (OOP) measure (Cavnar and Trenkle, 1994).</Paragraph>
      <Paragraph position="3"> In this method, a document is represented by an a7 -gram pro le that contains selected a7 -grams sorted by decreasing frequency. For each a7 -gram in a test document pro le, we nd its counterpart in the class pro le and compute the number of places its loca-tion differs. The distance between a test document and a class is computed by summing the individual out-of-place values.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Support vector machine classifiers
</SectionTitle>
      <Paragraph position="0"> Given a set of a38 linearly separable training exam- null , the SVM approach seeks the optimal hyperplane a11a160a159 a100 a139a162a161 a20 a59 that separates the positive and negative examples with the largest margin. The problem can be formulated as solving the following</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Empirical evaluation
</SectionTitle>
    <Paragraph position="0"> We now present our experimental results on Chinese and Japanese text classi cation problems. The Chinese data set we used has been previously investigated in (He et al., 2001). The corpus is a subset of the TREC-5 People's Daily news corpus published by the Linguistic Data Consortium (LDC) in 1995.</Paragraph>
    <Paragraph position="1"> The entire TREC-5 data set contains 164,789 documents on a variety of topics, including international and domestic news, sports, and culture. The corpus was originally intended for research on information retrieval. To make the data set suitable for text categorization, documents were rst clustered into 101 groups that shared the same headline (as indicated by an SGML tag). The six most frequent groups were selected to make a Chinese text categorization data set.</Paragraph>
    <Paragraph position="2"> For Japanese text classi cation, we consider the Japanese text classi cation data investigated by (Aizawa, 2001). This data set was converted from the NTCIR-J1 data set originally created for Japanese text retrieval research. The conversion process is similar to Chinese data. The nal text classication dataset has 24 categories which are unevenly distributed.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Experimental paradigm
</SectionTitle>
      <Paragraph position="0"> Both of the Chinese and Japanese data sets involve classifying into a large number of categories, where each document is assigned a single category. Many classi cation techniques, such as SVMs, are intrinsically de ned for two class problems, and have to be extended to handle these multiple category data sets. For SVMs, we employ a standard technique of rst converting the a33  For the experiments on Chinese data, we follow (He et al., 2001) and convert the problem into 6 binary classi cation problems. In each case, we randomly select 500 positive examples and then select 500 negative examples evenly from among the remaining negative categories to form the training data. The testing set contains 100 positive documents and 100 negative documents generated in the same way. The training set and testing set do no overlap and do not contain repeated documents.</Paragraph>
      <Paragraph position="1"> For the experiments on Japanese data, we follow (Aizawa, 2001) and directly experiment with a 24-class classi cation problem. The NTCIR data sets are unevenly distributed across categories. The training data consists of 310,355 documents distributed unevenly among the categories (with a minimum of 1,747 and maximum of 83,668 documents per category), and the testing set contains 10,000 documents unevenly distributed among categories (with a minimum of 56 and maximum of 2,696 documents per category).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Measuring classification performance
</SectionTitle>
      <Paragraph position="0"> In the Chinese experiments, where 6 binary classi cation problems are formulated, we measured classication performance by micro-averaged F-measure scores. To calculate the micro-averaged score, we formed an aggregate confusion matrix by adding up the individual confusion matrices from each category. The micro-averaged precision, recall, and F-measure can then be computed based on the aggregated confusion matrix.</Paragraph>
      <Paragraph position="1"> For the Japanese experiments, we measured over-all accuracy and the macro-averaged F-measure.</Paragraph>
      <Paragraph position="2"> Here the precision, recall, and F-measures of each individual category can be computed based on a</Paragraph>
      <Paragraph position="4"> a33 confusion matrix. Macro-averaged scores can be computed by averaging the individual scores.</Paragraph>
      <Paragraph position="5"> The overall accuracy is computed by dividing the number of correctly identi ed documents (summing the numbers across the diagonal) by the total number of test documents.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Results on Chinese data
</SectionTitle>
      <Paragraph position="0"> Table 1 gives the results of the character level language modeling approach, where rows correspond to different smoothing techniques. Columns correspond to different a7 -gram order a7a142a20</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
a55a157a156a94a55a157a176a94a55a150a177 . The
</SectionTitle>
    <Paragraph position="0"> entries are the micro-average F-measure. (Note that the naive Bayes result corresponds to a7 -gram order 1 with add one smoothing, which is italicized in the table.) The results the ad hoc OOP classi er, and for the SVM classi er are shown in Table 2 and Table 3 respectively, where the columns labeled Feature # are the number of features selected.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Results on Japanese data
</SectionTitle>
      <Paragraph position="0"> For the Japanese data, we experimented with byte level models (where in fact each Japanese character is represented by two bytes). We used byte level models to avoid possible character level segmentation errors that might be introduced, because we lacked the knowledge to detect misalignment errors in Japanese characters. The results of byte level language modeling classi ers on the Japanese data are shown in Table 4. (Note that the naive Bayes result corresponds to a7 -gram order 2 with add one smoothing, which is italicized in the table.) The results for the OOP classi er are shown in Table 5.</Paragraph>
      <Paragraph position="1"> Note that SVM is not applied in this situation since we are conducting multiple category classi cation directly while SVM is designed for binary classi cation. However, Aizawa (Aizawa, 2001) reported a performance of abut 85% with SVMs by converting the problem into a 24 binary classi cation problem and by performing word segmentation as preprocessing. null</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Discussion and analysis
</SectionTitle>
    <Paragraph position="0"> We now give a detailed analysis and discussion based on the above results. We rst compare the language model based classi ers with other classiers, and then analyze the in uence of the order a7 of the a7 -gram model, the in uence of the smoothing method, and the in uence of feature selection in tradition approaches.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Comparing classifier performance
</SectionTitle>
      <Paragraph position="0"> Table 6 summarizes the best results obtained by each classi er. The results for the language model (LM) classi ers are better than (or at least comparable to ) other approaches for both the Chinese and Japanese data, while avoiding word segmentation. The SVM result on Japanese data is obtained from (Aizawa, 2001) where word segmentation was performed as a preprocessing. Note that SVM classi ers do not perform as well in our Chinese text classi cation as they did in English text classi cation (Dumais, 1998), neither did they in Japanese text classi cation (Aizawa, 2001). The reason worths further investigations. null Overall, the language modeling approach appears to demonstrate state of the art performance for Chinese and Japanese text classi cation. The reasons for the improvement appear to be three-fold: First, the language modeling approach always considers every feature during classi cation, and can thereby avoid an error-prone feature selection process. Second, the use of a7 -grams in the model relaxes the restrictive independence assumption of naive Bayes.</Paragraph>
      <Paragraph position="1"> Third, the techniques of statistical language modeling offer better smoothing methods for coping with features that are unobserved during training.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Influence of the $n$-gram order
</SectionTitle>
      <Paragraph position="0"> The ordera7 is a key factor ina7 -gram language modeling. An order a7 that is too small will not capture suf cient information to accurately model character dependencies. On the other hand, a context a7 that is too large will create sparse data problems in training. In our Chinese experiments, we did not observe signi cant improvement when using higher order a7 -gram models. The reason is due to the early onset of sparse data problems. At the moment, we only have limited training data for Chinese data set (1M in size, 500 documents per class for training). If more training data were available, the higher order models may begin to show an advantage. For example, in the larger Japanese data set (average 7M size, 12,931 documents per class for training) we  observe an obvious increase in classi cation performance with higher order models (Table 4). However, here too, whena7 becomes too large, over tting will begin to occur, as better illustrated in Figure 1.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Influence of smoothing techniques
</SectionTitle>
      <Paragraph position="0"> Smoothing plays an key role in language modeling. Its effect on classi cation is illustrated in Figure 2. In both cases we have examined, add one smoothing is obviously the worst smoothing technique, since it systematically over ts much earlier than the more sophisticated smoothing techniques.</Paragraph>
      <Paragraph position="1"> The other smoothing techniques do not demonstrate a signi cant difference in classi cation accuracy on our Chinese and Japanese data, although they do show a difference in the perplexity of the language models themselves (not shown here to save space).</Paragraph>
      <Paragraph position="2"> Since our goal is to make a nal decision based on the ranking of perplexities, not just their absolute values, a superior smoothing method in the sense of perplexity reduction does not necessarily lead to a better decision from the perspective of categorization accuracy.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Influence of feature selection
</SectionTitle>
      <Paragraph position="0"> The number of features selected is a key factor in determining the classi cation performance of the OOP and SVM classi ers, as shown in Figure 3. Obviously the OOP classi er is adversely affected by increasing the number of selected features. By contrast, the SVM classi er is very robust with respect to the number of features, which is expected because the complexity of the SVM classi er is determined by the number of support vectors, not the dimensionality of the feature space. In practice, some heuristic search methods are normally used to obtain an optimal subset of features. However, in our language modeling based approach, we avoid explicit feature selection by considering all possible features and the importance of each individual feature is measured by its contribution to the perplexity (or entropy) value.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>