<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1610"> <Title>Automatic Arabic Document Categorization Based on the Naive Bayes Algorithm</Title> <Section position="6" start_page="1" end_page="2" type="evalu"> <SectionTitle> 5 Experiments and results </SectionTitle>
<Paragraph position="0"> For classification problems, it is customary to measure a classifier's performance in terms of its classification error rate. A data set of labeled documents $X = \{(D_1, C_1), \ldots, (D_n, C_n)\}$ is split into two subsets: a training set and a testing set. The trained classifier $AC$ is used to assign a class $AC(D_i)$ to each document $D_i$ in the test set, as if its true class label were not known. If $AC(D_i) = C_i$, the classification is considered correct; otherwise, it is counted as an error: $Error(D_i) = 1$ if $AC(D_i) \neq C_i$, and $Error(D_i) = 0$ otherwise. For a given class, the error rate is computed as the ratio of the number of errors made on the whole test set of unlabeled documents $X_{test}$ to the size of that set: $ErrorRate = \frac{1}{|X_{test}|}\sum_{D_i \in X_{test}} Error(D_i)$.</Paragraph>
<Paragraph position="1"> In order to measure the performance of the NB algorithm on Arabic document classification, we conducted several experiments: cross validation using the original feature space (all the words in the documents), cross validation based on feature selection (a subset of terms/roots only), and experiments based on an independently constructed evaluation set. The following paragraphs describe the data set used and the experiments.</Paragraph>
<Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 5.1 The data set </SectionTitle>
<Paragraph position="0"> We collected 300 web documents for each of five categories from www.aljazeera.net, the website of Aljazeera, the Qatari Arabic-language television news channel. This site contains over seven million documents corresponding to the programs broadcast on the television channel; it is arguably the most visited Arabic web site.</Paragraph>
<Paragraph position="1"> Aljazeera.net presents documents in (manually constructed) categories. The five categories used for this work are: sports, business, culture and art, science, and health.</Paragraph> </Section>
<Section position="2" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 5.2 Cross validation </SectionTitle>
<Paragraph position="0"> In cross validation, a fixed number of documents is reserved for testing (as if they were unlabeled documents) and the remainder are used for training (as labeled documents). Several such partitions of the data set are constructed by making random splits, NB's performance is evaluated on each partition, and the error statistics are then aggregated. The steps of the cross validation experiments are delineated in Figure 3. In these experiments, each document in the data set X is represented by all the word roots it contains. We ran the cross validation experiments of Figure 3 with training-testing splits of 1/3-2/3, 1/2-1/2, and 2/3-1/3, as well as leave-one-out; the best performance was obtained in the leave-one-out experiment (as illustrated in Table 1). Tables 2, 3, 4, and 5 give, respectively, the confusion matrices of these cross validation experiments.</Paragraph>
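<Paragraph position="1"> For concreteness, the following sketch (an illustration, not the code used in the experiments) shows how such a random-split evaluation of an NB text classifier can be organized with the scikit-learn library; the variables docs and labels, holding the root-stemmed document strings and their category names, are hypothetical placeholders.</Paragraph>

    # Hypothetical sketch: repeated random-split cross validation of a
    # Naive Bayes text classifier, reporting the average error rate.
    # `docs` is assumed to be a list of (root-stemmed) document strings
    # and `labels` the corresponding category names.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    def cross_validate(docs, labels, test_fraction, n_trials=10):
        error_rates = []
        for trial in range(n_trials):
            X_train, X_test, y_train, y_test = train_test_split(
                docs, labels, test_size=test_fraction, random_state=trial)
            vectorizer = CountVectorizer()          # word/root counts
            train_matrix = vectorizer.fit_transform(X_train)
            test_matrix = vectorizer.transform(X_test)
            nb = MultinomialNB().fit(train_matrix, y_train)
            predictions = nb.predict(test_matrix)
            # Error(D_i) = 1 when AC(D_i) != C_i; the error rate is the
            # mean of these indicators over the test set.
            errors = sum(1 for p, c in zip(predictions, y_test) if p != c)
            error_rates.append(errors / len(y_test))
        return sum(error_rates) / len(error_rates)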
<Paragraph position="2"> The percentage reported in an entry of a confusion matrix is the percentage of documents that actually belong to the category given by the row header of the matrix but that are assigned by NB to the category given by the column header.</Paragraph>
<Paragraph position="3"> Table 5 gives the confusion matrix with no feature selection (leave-one-out). The diagonals of Tables 2-5 indicate higher classification performance for the Sports and Business categories than for the Culture, Science, and Health categories. Moreover, the leave-one-out experiment yields the best result by category (Table 5), compared to the error rates reported in Tables 2-4. Tables 2-5 also reveal that the error rates by category decrease from experiment to experiment: the error rates recorded in the 1/3-2/3 experiment are higher than those in the 1/2-1/2 experiment, those in the 1/2-1/2 experiment are higher than those in the 2/3-1/3 experiment, and those in the 2/3-1/3 experiment are higher than those in the leave-one-out experiment. Thus, larger training sets yield higher accuracy when all the terms in the data set are used.</Paragraph>
<Paragraph position="4"> When investigating some of the misclassifications made by NB, we noticed that the misclassified documents in fact contain a large number of words that are representative of other categories. In other words, documents known to belong to one category contain numerous words that have a higher frequency in other categories; these words therefore have a stronger influence on the classifier's prediction. For instance, the confusion matrix in Table 5 shows that 30% of Culture documents were misclassified into the Sports category. The misclassified documents contain words that are more frequent in the Sports category, such as the Arabic words meaning both prize and trophy, both champion and lead character, and both scoring and recording.</Paragraph>
<Paragraph position="5"> 5.2.2 Cross validation using feature selection. Feature selection techniques have been widely used in information retrieval as a means of coping with the large number of words in a document; a selection is made to keep only the most relevant words. Various feature selection techniques have been used in automatic text categorization, including document frequency (DF), information gain (IG) (Tzeras and Hartman, 1993), the minimum description length principle (Lang, 1995), and the $\chi^2$ statistic. (Yang and Pedersen, 1997) found strong correlations between DF, IG, and the $\chi^2$ statistic of a term. On the other hand, (Rogati and Yang, 2002) report that the $\chi^2$ statistic produces the best performance. In this paper, we use TF-IDF (a kind of augmented DF) as the feature selection criterion, in order to keep our results comparable with those in (Yahyaoui, 2001).</Paragraph>
<Paragraph position="6"> TF-IDF (term frequency-inverse document frequency) is one of the most widely used feature selection techniques in information retrieval (Yates and Neto, 1999). Specifically, it is used as a metric for measuring the importance of a word in a document within a collection, so as to improve the recall and the precision of the retrieved documents. While TF measures the importance of a term in a given document, IDF measures the relative importance of a term in a collection of documents: the importance of each term is assumed to be inversely proportional to the number of documents that contain it. $TF_{D,t}$ denotes the frequency of term $t$ in document $D$, and $IDF_t = \log(N/df_t)$, where $N$ is the number of documents in the collection and $df_t$ is the number of documents containing the term $t$. (Salton and Yang, 1973) proposed combining TF and IDF into a single weighting scheme, and their product has been shown to give better performance. Thus, the weight of each term/root in a document is given by $w_{D,t} = TF_{D,t} \times IDF_t$.</Paragraph>
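<Paragraph position="7"> As a minimal sketch (not the code used in the experiments), the term selection step can be implemented by accumulating the weights $w_{D,t}$ over the whole collection and keeping the k highest-ranked terms; aggregating the per-document weights by summation is an assumption of this sketch, and docs is again a hypothetical list of root-stemmed document strings.</Paragraph>

    # Hypothetical sketch: rank terms by accumulated TF-IDF weight
    # and keep the top k as the selected feature set.
    import math
    from collections import Counter

    def select_top_terms(docs, k):
        n_docs = len(docs)
        tokenized = [doc.split() for doc in docs]
        # df_t: number of documents containing term t
        df = Counter(t for tokens in tokenized for t in set(tokens))
        # Accumulate w_{D,t} = TF_{D,t} * log(N / df_t) over all
        # documents, then rank terms by their total weight.
        total_weight = Counter()
        for tokens in tokenized:
            tf = Counter(tokens)
            for term, freq in tf.items():
                total_weight[term] += freq * math.log(n_docs / df[term])
        return [term for term, _ in total_weight.most_common(k)]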
<Paragraph position="8"> We conducted five cross validation experiments based on TF-IDF, selecting, in turn, the 50, 100, 500, 1000, and 2000 terms that best represent the five predefined categories. We repeated the experiments of Figure 3 for each number of terms. A summary of the results is presented in Table 6; the performance levels obtained are comparable to those obtained without feature selection. Figure 4 plots the average categorization error rate versus the number of terms used in the different trials.</Paragraph> </Section>
<Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 5.3 Experiments using an evaluation set </SectionTitle>
<Paragraph position="0"> Cross validation has been used to determine the average performance of NB for Arabic text categorization, and to design the training sets that produce the best performance. The experiment described here, based on a separately and independently constructed evaluation set, is designed to evaluate the performance of NB on documents that have never been submitted to the classifier. For this purpose, we manually collected 10 further documents from Aljazeera.net for each of the 5 predefined categories, selecting for each category the documents that best represent its variability. We refer to this collection as the evaluation set, and we present it to the trained classifier for categorization. For each category, we use the NB classifier trained with the training set that produced the best classification accuracy for that category in the cross validation experiments. In our case, we used the whole data set (1,500 documents) as the training set, represented by 2,000 terms, since the best cross validation accuracy was obtained in the leave-one-out experiment with 2,000 terms. Table 7 summarizes NB's performance on the evaluation set. The results show higher performance for the Sports and Business categories, with classification accuracies above 70%; the accuracy for the other categories ranges from 40% to 60%, and the average accuracy over all categories is 62%.</Paragraph>
<Paragraph position="1"> The results obtained in the evaluation set experiment are thus quite consistent with the performance obtained in the cross validation experiments.</Paragraph> </Section> </Section> </Paper>