<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1025">
  <Title>Language and Task Independent Text Categorization with Simple Language Models</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 n-Gram Language Modeling
</SectionTitle>
    <Paragraph position="0"> The dominant motivation for language modeling has traditionally come from speech recognition, but language models have recently become widely used in many other application areas.</Paragraph>
    <Paragraph position="1"> The goal of language modeling is to predict the probability of naturally occurring word sequences, s = w1w2:::wN; or more simply, to put high probability on word sequences that actually occur (and low probability on word sequences that never occur). Given a word sequence w1w2:::wN to be used as a test corpus, the quality of a language model can be measured by the empirical perplexity and entropy scores on this corpus</Paragraph>
    <Paragraph position="3"> where the goal is to minimize these measures.</Paragraph>
    <Paragraph position="4"> The simplest and most successful approach to language modeling is still based on the n-gram model. By the chain rule of probability one can write the probability of any word sequence as</Paragraph>
    <Paragraph position="6"> An n-gram model approximates this probability by assuming that the only words relevant to predicting</Paragraph>
    <Paragraph position="8"> A straightforward maximum likelihood estimate of n-gram probabilities from a corpus is given by the observed frequency of each of the patterns</Paragraph>
    <Paragraph position="10"> where #(.) denotes the number of occurrences of a specified gram in the training corpus. Although one could attempt to use simple n-gram models to capture long range dependencies in language, attempting to do so directly immediately creates sparse data problems: Using grams of length up to n entails estimating the probability of Wn events, where W is the size of the word vocabulary. This quickly overwhelms modern computational and data resources for even modest choices of n (beyond 3 to 6).</Paragraph>
    <Paragraph position="11"> Also, because of the heavy tailed nature of language (i.e.</Paragraph>
    <Paragraph position="12"> Zipf's law) one is likely to encounter novel n-grams that were never witnessed during training in any test corpus, and therefore some mechanism for assigning non-zero probability to novel n-grams is a central and unavoidable issue in statistical language modeling. One standard approach to smoothing probability estimates to cope with sparse data problems (and to cope with potentially missing n-grams) is to use some sort of back-off estimator.</Paragraph>
    <Paragraph position="14"> is the discounted probability and fl(wi!n+1:::wi!1) is a normalization constant</Paragraph>
    <Paragraph position="16"> The discounted probability (6) can be computed with different smoothing techniques, including absolute smoothing, Good-Turing smoothing, linear smoothing, and Witten-Bell smoothing (Chen and Goodman, 1998).</Paragraph>
    <Paragraph position="17"> The details of the smoothing techniques are omitted here for simplicity.</Paragraph>
    <Paragraph position="18"> The language models described above use individual words as the basic unit, although one could instead consider models that use individual characters as the basic unit. The remaining details remain the same in this case. The only difference is that the character vocabulary is always much smaller than the word vocabulary, which means that one can normally use a much higher order, n, in a character-level n-gram model (although the text spanned by a character model is still usually less than that spanned by a word model). The benefits of the character-level model in the context of text classification are several-fold: it avoids the need for explicit word segmentation in the case of Asian languages, it captures important morphological properties of an author's writing, it models the typos and misspellings that are common in informal texts, it can still discover useful inter-word and inter-phrase features, and it greatly reduces the sparse data problems associated with large vocabulary models.</Paragraph>
    <Paragraph position="19"> In this paper, we experiment with character-level models to achieve flexibility and language independence.</Paragraph>
    <Paragraph position="20"> 3 Language Models as Text Classifiers Our approach to applying language models to text categorization is to use Bayesian decision theory. Assume we wish to classify a text D into a category c 2 C = fc1;:::;cjCjg. A natural choice is to pick the category c that has the largest posterior probability given the text.</Paragraph>
    <Paragraph position="22"> Using Bayes rule, this can be rewritten as</Paragraph>
    <Paragraph position="24"> where deducing Eq. (10) from Eq. (9) assumes uniformly weighted categories (since we have no other prior knowledge). Here, Pr(Djc) is the likelihood of D under category c, which can be computed by Eq. (11). Likelihood is related to perplexity and entropy by Eq. (1) and Eq. (2).</Paragraph>
    <Paragraph position="25"> Therefore, our approach is to learn a separate language model for each category, by training on a data set from that category. Then, to categorize a new text D, we supply D to each language model, evaluate the likelihood (or entropy) of D under the model, and pick the winning category according to Eq. (10).</Paragraph>
    <Paragraph position="26"> The inference of an n-gram based text classifier is very similar to a naive-Bayes classifier. In fact, n-gram classifiers are a straightforward generalization of naive-Bayes: A uni-gram classifier with Laplace smoothing corresponds exactly to the traditional naive-Bayes classifier. However, n-gram language models, for larger n, possess many advantages over naive-Bayes classifiers, including modeling longer context and applying superior smoothing techniques in the presence of sparse data.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental Comparison
</SectionTitle>
    <Paragraph position="0"> We now proceed to present our results on several text categorization problems on different languages. Specifically, we consider language identification, Greek authorship attribution, Greek genre classification, English topic detection, Chinese topic detection and Japanese topic detection. null For the sake of consistency with previous research (Aizawa, 2001; He et al., 2000; Stamatatos et al., 2000), we measure categorization performance by the overall accuracy, which is the number of correctly identified texts divided by the total number of texts considered. We also measure the performance with Macro Fmeasure, which is the average of the F-measures across all categories. F-measure is a combination of precision and recall (Yang, 1999).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Language Identification
</SectionTitle>
      <Paragraph position="0"> The first text categorization problem we examined was language identification--a useful pre-processing step in information retrieval. Language identification is probably the easiest text classification problem because of the significant morphological differences between languages,  even when they are based on the same character set.1 In our experiments, we considered one chapter of Bible that had been translated into 6 different languages: English, French, German, Italian, Latin and Spanish. In each case, we reserved twenty sentences from each language for testing and used the remainder for training. For this task, with only bi-gram character-level models and any smoothing technique, we achieved 100% accuracy.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Authorship Attribution
</SectionTitle>
      <Paragraph position="0"> The second text categorization problem we examined was author attribution. A famous example is the case of the Federalist Papers, of which twelve instances are claimed to have been written both by Alexander Hamilton and James Madison (Holmes and Forsyth, 1995). Authorship attribution is more challenging than language identification because the difference among the authors is much more subtle than that among different languages. We considered a data set used by (Stamatatos et al., 2000) consisting of 20 texts written by 10 different modern Greek authors (totaling 200 documents). In each case, 10 texts from each author were used for training and the remaining 10 for testing.</Paragraph>
      <Paragraph position="1"> The results using different orders of n-gram models and different smoothing techniques are shown in Table 1.</Paragraph>
      <Paragraph position="2"> With 3-grams and absolute smoothing, we observe 90% accuracy. This result compares favorably to the 72% accuracy reported in (Stamatatos et al., 2000) which is based on linear least square fit (LLSF).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Text Genre Classification
</SectionTitle>
      <Paragraph position="0"> The third problem we examined was text genre classification, which is an important application in information retrieval (Kesseler et al., 1997; Lee et al., 2002). We considered a Greek data set used by (Stamatatos et al., 2000) consisting of 20 texts of 10 different styles extracted from various sources (200 documents total). For each style, we used 10 texts as training data and the remaining 10 as testing. null  The results of learning an n-gram based text classifier are shown in Table 2. The 86% accuracy obtained with bi-gram models compares favorably to the 82% reported in (Stamatatos et al., 2000), which again is based on a much deeper NLP analysis.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Topic Detection
</SectionTitle>
      <Paragraph position="0"> The fourth problem we examined was topic detection in text, which is a heavily researched text categorization problem (Dumais et al., 1998; Lewis, 1992; McCallum, 1998; Yang, 1999; Sebastiani, 2002). Here we demonstrate the language independence of the language modeling approach by considering experiments on English, Chinese and Japanese data sets.</Paragraph>
      <Paragraph position="1">  The English 20 Newsgroup data has been widely used in topic detection research (McCallum, 1998; Rennie, 2001).2 This collection consists of 19,974 non-empty documents distributed evenly across 20 newsgroups. We use the newsgroups to form our categories, and randomly select 80% of the documents to be used for training and set aside the remaining 20% for testing.</Paragraph>
      <Paragraph position="2"> In this case, as before, we merely considered text to be a sequence of characters, and learned character-level n-gram models. The resulting classification accuracies are reported in in Table 3. With 3-gram (or higher order) models, we consistently obtain accurate performance, peaking at 89% accuracy in the case of 6-gram models with Witten-Bell smoothing. (We note that word-level models were able to achieve 88% accuracy in this case.) These results compare favorably to the state of the art result of 87.5% accuracy reported in (Rennie, 2001), which was based on a combination of an SVM with error correct output coding (ECOC).</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4.2 Chinese Data
</SectionTitle>
      <Paragraph position="0"> Chinese topic detection is often thought to be more challenging than English, because words are not white-space delimited in Chinese text. This fact seems to  require word segmentation to be performed as a pre-processing step before further classification (He et al., 2000). However, we avoid the need for explicit segmentation by simply using a character level n-gram classifier. For Chinese topic detection we considered a data set investigated in (He et al., 2000). The corpus in this case is a subset of the TREC-5 data set created for research on Chinese text retrieval. To make the data set suitable for text categorization, documents were first clustered into 101 groups that shared the same headline (as indicated by an SGML tag) and the six most frequent groups were selected to make a Chinese text categorization data set.</Paragraph>
      <Paragraph position="1"> In each group, 500 documents were randomly selected for training and 100 documents were reserved for testing.</Paragraph>
      <Paragraph position="2"> We observe over 80% accuracy for this task, using bi-gram (2 Chinese characters) or higher order models. This is the same level of performance reported in (He et al., 2000) for an SVM approach using word segmentation and feature selection.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4.3 Japanese Data
</SectionTitle>
      <Paragraph position="0"> Japanese poses the same word segmentation issues as Chinese. Word segmentation is also thought to be necessary for Japanese text categorization (Aizawa, 2001), but we avoid the need again by considering character level language models.</Paragraph>
      <Paragraph position="1"> We consider the Japanese topic detection data investigated by (Aizawa, 2001). This data set was con- null verted from the NTCIR-J1 data set originally created for Japanese text retrieval research. The data has 24 categories. The testing set contains 10,000 documents distributed unevenly between categories (with a minimum of 56 and maximum of 2696 documents per category). This imbalanced distribution causes some difficulty since we assumed a uniform prior over categories. Although this is easily remedied, we did not fix the problem here. Nevertheless, we obtain experimental results in Table 5 that still show an 84% accuracy rate on this problem (for 6-gram or higher order models). This is the same level of performance as that reported in (Aizawa, 2001), which uses an SVM approach with word segmentation, morphology analysis and feature selection.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Analysis
</SectionTitle>
    <Paragraph position="0"> The perplexity of a test document under a language model depends on several factors. The two most influential factors are the order, n, of the n-gram model and the smoothing technique used. Different choices will result in different perplexities, which could influence the final decision in using Eq. (10). We now experimentally assess the influence of each of these factors below.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Effects of n-Gram Order
</SectionTitle>
      <Paragraph position="0"> The order n is a key factor in n-gram language models.</Paragraph>
      <Paragraph position="1"> If n is too small then the model will not capture enough context. However, if n is too large then this will create severe sparse data problems. Both extremes result in a larger perplexity than the optimal context length. Figures 1 and 2 illustrate the influence of order n on classification performance and on language model quality in the previous five experiments (all using absolute smoothing).</Paragraph>
      <Paragraph position="2"> Note that in this case the entropy (bits per character) is the average entropy across all testing documents. From the curves, one can see that as the order increases, classification accuracy increases and testing entropy decreases, presumably because the longer context better captures the regularities of the text. However, at some point accu- null racy begins to decrease and entropy begins to increase as the sparse data problems begin to set in. Interestingly, the effect is more pronounced in some experiments (Greek genre classification) but less so in other experiments (topic detection under any language). The sensitivity in the Greek genre case could still be attributed to the sparse data problem (the over-fitting problem in genre classification could be more serious than the other problems, as seen from the entropy curves).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Effects of Smoothing Technique
</SectionTitle>
      <Paragraph position="0"> Another key factor affecting the performance of a language model is the smoothing technique used. Figures 3 and 4 show the effects of smoothing techniques on classification accuracy and testing entropy (Chinese topic detection and Japanese topic detection are not shown in the figure to save space).</Paragraph>
      <Paragraph position="1"> Here we find that, in most cases, the smoothing technique does not have a significant effect on text categorization accuracy, because of the small vocabulary size of  character level n-gram models. However, there are two exceptions--Greek authorship attribution and Greek text genre classification--where Good-Turing smoothing is not as effective as other techniques, even though it gives better test entropy than some others. Since our goal is to make a final decision based on the ranking of perplexities, not just their absolute values, a superior smoothing method in the sense of perplexity reduction (i.e. from the perspective of classical language modeling) does not necessarily lead to a better decision from the perspective of categorization accuracy. In fact, in all our experiments we have found that it is Witten-Bell smoothing, not Good-Turing smoothing, that gives the best results in terms of classification accuracy. Our observation is consistent with previous research which reports that Witten-Bell smoothing achieves benchmark performance in character level text compression (Bell et al., 1990). For the most part, however, one can use any standard smoothing technique in these problems and obtain comparable performance, since the rankings they produce are almost always the same.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Relation to Previous Research
</SectionTitle>
      <Paragraph position="0"> In principle, any language model can be used to perform text categorization based on Eq. (10). However, n-gram models are extremely simple and have been found to be effective in many applications. For example, character level n-gram language models can be easily applied to any language, and even non-language sequences such as DNA and music. Character level n-gram models are widely used in text compression--e.g., the PPM model (Bell et al., 1990)--and have recently been found to be effective in text classification problems as well (Teahan and Harper, 2001). The PPM model is a weighted linear interpolation n-gram models and has been set as a benchmark in text compression for decades. Building an adaptive PPM model is expensive however (Bell et al., 1990), and our back-off models are relatively much simpler. Using compression techniques for text categorization has also been investigated in (Benedetto et al., 2002), where the authors seek a model that yields the minimum compression rate increase when a new test document is introduced. However, this method is found not to be generally effective nor efficient (Goodman, 2002). In our approach, we evaluate the perplexity (or entropy) directly on test documents, and find the outcome to be both effective and efficient.</Paragraph>
      <Paragraph position="1"> Many previous researchers have realized the importance of n-gram models in designing language independent text categorization systems (Cavnar and Trenkle, 1994; Damashek, 1995). However, they have used n-grams as features for a traditional feature selection process, and then deployed classifiers based on calculating feature-vector similarities. Feature selection in such a classical approach is critical, and many required procedures, such as stop word removal, are actually language dependent. In our approach, all n-grams are considered as features and their importance is implicitly weighted by their contribution to perplexity. Thus we avoid an error prone preliminary feature selection step.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>