<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1610"> <Title>Automatic Arabic Document Categorization Based on the Naive Bayes Algorithm</Title> <Section position="3" start_page="1" end_page="1" type="metho"> <SectionTitle> 2 Related Works </SectionTitle> <Paragraph position="0"> Many machine learning algorithms have been applied to text categorization for many years. They include decision tree learning, Bayesian learning, nearest neighbor learning, and artificial neural networks; early such works may be found in (Lewis and Ringnette, 1994), (Creecy and Masand, 1992), and (Wiene and Pedersen, 1995), respectively.</Paragraph> <Paragraph position="1"> The bulk of text categorization work has been devoted to the automatic categorization of English and Latin-character documents. For example, (Fang et al., 2001) discusses the evaluation of two different text categorization strategies with several variations of their feature spaces. A good study comparing document categorization algorithms can be found in (Yang and Liu, 1999). More recently, (Sebastiani, 2002) has provided a thorough survey of document categorization; recent works can also be found in (Joachims, 2002), (Crammer and Singer, 2003), and (Lewis et al., 2004).</Paragraph> <Paragraph position="2"> Concerning Arabic, one automatic categorizer has been reported to be in operational use for classifying Arabic documents; it is referred to as &quot;Sakhr's categorizer&quot; (Sakhr, 2004). Unfortunately, there is no technical documentation or specification concerning this Arabic categorizer. Sakhr's marketing literature claims that the categorizer is based on Arabic morphology and on research carried out in natural language processing. The present work evaluates the performance on Arabic documents of the Naive Bayes algorithm (NB) - one of the simplest algorithms applied to English document categorization (Mitchell, 1997). The aim of this work is to gain some insight into whether Arabic document categorization (using NB) is sensitive to the root extraction algorithm used or to different data sets. This work is a continuation of that initiated in (Yahyaoui, 2001), which reports an overall NB classification correctness of 75.6% in cross-validation experiments on a data set consisting of 100 documents for each of 12 categories (the data set was collected from different Arabic portals). A 50% overall classification accuracy is also reported when testing with a separately collected evaluation set (3 documents for each of the 12 categories). The present work expands the work in (Yahyaoui, 2001) by experimenting with the use of a better root extraction algorithm (El Kourdi, 2004) for document preprocessing, and by using a different data set, collected from the largest Arabic site on the web: aljazeera.net.</Paragraph> </Section> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 Preprocessing of documents </SectionTitle> <Paragraph position="0"> Prior to applying document categorization techniques to an Arabic document, the document is typically preprocessed: it is parsed in order to remove stopwords (conjunctions, disjunctions, and similar function words). In addition, at this stage in this work, vowels are stripped from the full text representation when the document is (fully or partially) voweled/vocalized; most modern Arabic writing (web, novels, articles) is written without vowels. Then roots are extracted for the words in the document.</Paragraph> <Paragraph position="1"> In Arabic, however, the use of stems will not yield satisfactory categorization. This is mainly due to the fact that Arabic is a non-concatenative language (Al-Shalabi and Evens, 1998), and that the stem/infix obtained by suppressing infix and prefix add-ons is not the same for words derived from the same origin, called the root. The infix form (or stem) needs to be processed further in order to obtain the root. This processing is not straightforward: it requires expert knowledge of Arabic word morphology (Al-Shalabi and Evens, 1998). For example, two close roots (i.e., roots made of the same letters) that are semantically different can yield the same infix form, thus creating ambiguity.</Paragraph> <Paragraph position="2"> The root extraction process is concerned with transforming all Arabic word derivatives into their single common root or canonical form. This process is very useful for reducing and compressing the indexing structure, and for taking advantage of the semantic/conceptual relationships between the different forms of the same root. In this work, we use the Arabic root extraction technique of (El Kourdi, 2004). It compares favorably to other stemming or root extraction algorithms (Yates and Neto, 1999; Al-Shalabi and Evens, 1998; Houmame, 1999), with a performance of over 97% in extracting the correct root in web documents, and it addresses the challenges of the Arabic broken plural and hollow verbs. In the remainder of this paper, we use the terms &quot;root&quot; and &quot;term&quot; interchangeably to refer to the canonical forms obtained through this root extraction process.</Paragraph>
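<Paragraph position="3"> As a rough illustration of this preprocessing stage, the following Python sketch removes stopwords, strips vowel diacritics, and maps each remaining word to a root. It is only a sketch: the stopword set is a tiny invented sample, and root_of is a placeholder standing in for the root extraction algorithm of (El Kourdi, 2004), which is not reproduced here.
import re

# Illustrative resources only: a tiny stopword sample and a stub root extractor
# standing in for the algorithm of (El Kourdi, 2004).
STOPWORDS = {"و", "أو", "ثم", "في", "من", "على"}  # e.g., conjunctions and prepositions
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # Arabic diacritics (U+064B-U+0652) and dagger alif

def strip_vowels(word):
    """Remove vowel and other diacritic marks from a (possibly vocalized) word."""
    return DIACRITICS.sub("", word)

def root_of(word):
    """Placeholder for a real root extraction algorithm; returns the word unchanged."""
    return word

def preprocess(text):
    """Tokenize, drop stopwords, strip diacritics, and map each word to its root."""
    roots = []
    for tok in text.split():
        tok = strip_vowels(tok)
        if not tok or tok in STOPWORDS:
            continue
        roots.append(root_of(tok))
    return roots
</Paragraph>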
</Section> <Section position="5" start_page="1" end_page="1" type="metho"> <SectionTitle> 4 NB for document categorization </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.1 The classifier module </SectionTitle> <Paragraph position="0"> The classifier module is the core component of the document categorizer: it is responsible for assigning a given Arabic document to its target class. This is performed using the Naive Bayes (NB) algorithm. The NB classifier computes a posteriori probabilities of the classes, using estimates obtained from a training set of labeled documents. When an unlabeled document is presented, the a posteriori probability is computed for each class using Equation (1) in Figure 1, and the document is then assigned to the class with the largest a posteriori probability.</Paragraph> <Paragraph position="1"> Figure 1. A posteriori probability computation. Let D be a document represented as a finite set of terms D = {w_1, w_2, ..., w_{|D|}}, let C_1, ..., C_N be the predefined categories, and let |Examples| be the number of documents in the training set of labeled documents. Then the a posteriori probability, as given by Bayes' theorem, is
P(C_j | D) = P(C_j) P(D | C_j) / P(D)    (1)
When comparing a posteriori probabilities for the same document D, P(D) is the same for all categories and does not affect the comparison. The other quantities in (1) are estimated from the training set using NB learning (see Figure 2). The class AC(D) assigned to document D is the class with the largest a posteriori probability:
AC(D) = argmax_{C_j} P(C_j) P(D | C_j)</Paragraph>
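<Paragraph position="2"> The classification step can be sketched in Python as follows, assuming the priors P(C_j) and the conditional term probabilities P(w_i | C_j) have already been estimated by the learning module of Section 4.2 (see the sketch given there). Accumulating log-probabilities and falling back to a tiny probability for roots outside the vocabulary are implementation conveniences added here, not part of the formulation above.
import math

def classify(doc_roots, priors, cond_prob):
    """Assign a document (a list of roots) to the class with the largest
    a posteriori probability (Equation (1)); P(D) is ignored since it is
    identical for all categories.

    priors:    dict mapping category -> P(C_j)
    cond_prob: dict mapping (root, category) -> P(w_i | C_j)
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)  # log P(C_j)
        for w in doc_roots:
            # add log P(w_i | C_j); fall back to a tiny probability for unseen roots
            score += math.log(cond_prob.get((w, c), 1e-12))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
</Paragraph>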
</Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 4.2 The learning module </SectionTitle> <Paragraph position="0"> The main task of the learning module is to learn from a set of labeled documents with predefined categories, so that the categorizer can classify newly encountered documents D and assign them to one of the predefined target categories C_1, ..., C_N. This module is based on the NB learning algorithm given in Figure 2; it is one way of estimating the quantities needed in (1) by learning from a training set of documents.</Paragraph> <Paragraph position="1"> Figure 2. The NB learning algorithm for document categorization. Let C_1, ..., C_N be the predefined categories and |Examples| be the number of documents in the training set of labeled documents.
Step 1: collect the vocabulary, defined as the set of distinct words in the whole training set.
Step 2: for each category C_j, compute
P(C_j) = docs_j / |Examples|    (2)
P(w_i | C_j) = (n_i + 1) / (n_j + |Vocabulary|)    (3)
P(D | C_j) = P(w_1, w_2, ..., w_{|D|} | C_j)    (4)
P(D | C_j) = prod_i P(w_i | C_j)    (5)
where docs_j is the number of training documents for category C_j, T_j is a single document generated by concatenating all the training documents for category C_j, n_j is the total number of distinct terms in all training documents labeled C_j, and n_i is the number of occurrences of term w_i in T_j.</Paragraph> <Paragraph position="2"> The m-estimate method (with m equal to the size of the word vocabulary) (Cestink, 1990) is used to compute the probability terms and to handle zero-count probabilities (smoothing); Equation (3) gives this estimate for P(w_i | C_j).</Paragraph>
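<Paragraph position="3"> A minimal Python sketch of this learning step, assuming the training set is given as (list-of-roots, category) pairs. It computes Equation (2) and the m-estimate of Equation (3), taking n_j as the total number of term occurrences in T_j, as in the textbook formulation in (Mitchell, 1997); the resulting tables can be passed to the classify() sketch of Section 4.1.
from collections import Counter, defaultdict

def learn(training_set):
    """Estimate P(C_j) and P(w_i | C_j) from labeled documents.

    training_set: list of (roots, category) pairs, where roots is the list of
    canonical forms produced by preprocessing.
    Returns (priors, cond_prob) as used by classify() above.
    """
    vocabulary = set()
    docs_per_class = Counter()            # docs_j
    term_counts = defaultdict(Counter)    # occurrences of each root in T_j

    for roots, c in training_set:
        docs_per_class[c] += 1
        vocabulary.update(roots)
        term_counts[c].update(roots)

    total_docs = len(training_set)        # |Examples|
    priors = {c: docs_per_class[c] / total_docs for c in docs_per_class}   # Equation (2)

    cond_prob = {}
    v = len(vocabulary)
    for c, counts in term_counts.items():
        n_j = sum(counts.values())        # total term occurrences in T_j (assumption, see lead-in)
        for w in vocabulary:
            # Equation (3): m-estimate with m = |Vocabulary| and uniform prior
            cond_prob[(w, c)] = (counts[w] + 1) / (n_j + v)
    return priors, cond_prob
</Paragraph>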
<Paragraph position="4"> Various assumptions are needed in order to simplify Equation (1), whose computation is otherwise expensive. These assumptions are applied in Figure 2 to obtain the quantities needed for the class-conditional probabilities (Equations (4) and (5)). They are: 1. The probability of encountering a specific word within a document is the same regardless of the word's position; in other words, P(w_i = w | C_j) = P(w_m = w | C_j) for every i, j, and m, where i and m are different possible positions of the same word w within the document. This assumption allows a document to be represented as a bag of words (Equation (4) in Figure 2). 2. The probability of occurrence of a word is independent of the occurrence of the other words in the same document. This is reflected in Equation (5). It is indeed a naive assumption, but it significantly reduces computation cost, since the number of probabilities that must be computed is decreased. Even though this assumption does not hold in reality, NB performs surprisingly well for text classification (Mitchell, 1997).</Paragraph> </Section> </Section> </Paper>