<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3809">
  <Title>Random-Walk Term Weighting for Improved Text Classification</Title>
  <Section position="3" start_page="53" end_page="54" type="metho">
    <SectionTitle>
2 Graph-based Ranking Algorithms
</SectionTitle>
    <Paragraph position="0"> The basic idea implemented by an iterative graph-based ranking algorithm is that of &amp;quot;voting&amp;quot; or &amp;quot;recommendation&amp;quot;. When one vertex links to another one, it is basically casting a vote for that other vertex. The higher the number of votes that are cast for a vertex, the higher the importance of the vertex. Moreover, the importance of the vertex casting a vote determines how important the vote itself is, and this information is also taken into account by the ranking algorithm. Hence, the score associated with a vertex is determined based on the votes that are cast for it, and the scores of the vertices casting these votes.</Paragraph>
    <Paragraph position="1"> While there are several graph-based ranking algorithms previously proposed in the literature (Herings et al., 2001), we focus on only one such algorithm, namely PageRank (Brin and Page, 1998), as it was previously found successful in a number of applications, including Web link analysis (Brin and Page, 1998), social networks (Dom et al., 2003), citation analysis, and more recently in several text processing applications (Mihalcea and Tarau, 2004), (Erkan and Radev, 2004).</Paragraph>
    <Paragraph position="2"> Given a graph G = (V,E), let In(Va) be the set of vertices that point to vertex Va (predecessors), and let Out(Va) be the set of vertices that vertex Va points to (successors). The PageRank score associated with the vertex Va is then defined using a recursive function that integrates the scores of its predecessors:</Paragraph>
    <Paragraph position="3"> S(Va) = (1 - d) + d * sum over Vb in In(Va) of S(Vb) / |Out(Vb)| </Paragraph>
    <Paragraph position="4"> where d is a parameter that is set between 0 and 1 (see footnote 1).</Paragraph>
    <Paragraph position="5"> The score of each vertex is recalculated upon each iteration based on the new weights that the neighboring vertices have accumulated. The algorithm terminates when the convergence point is reached for all the vertices, meaning that the error rate for each vertex falls below a pre-defined threshold. (Footnote 1: The typical value for d is 0.85 (Brin and Page, 1998), and this is the value we also use in our implementation.) Formally,</Paragraph>
    <Paragraph position="6"> for a vertex Vi let Sk(Vi) be the rank or the score at iteration k and Sk+1(Vi) be the score at iteration k + 1. The error rate ER is defined as:</Paragraph>
    <Paragraph position="7"> ER = Sk+1(Vi) - Sk(Vi) </Paragraph>
    <Paragraph position="8"> This vertex scoring scheme is based on a random walk model, where a walker takes random steps on the graph G, with the walk being modeled as a Markov process - that is, the decision on what edge to follow is based solely on the vertex where the walker is currently located. Under certain conditions, this model converges to a stationary distribution of probabilities associated with the vertices in the graph. Based on the Ergodic theorem for Markov chains (Grimmett and Stirzaker, 1989), the algorithm is guaranteed to converge if the graph is both aperiodic and irreducible. The first condition is achieved for any non-bipartite graph, while the second condition holds for any strongly connected graph - a property achieved by PageRank through the random jumps introduced by the (1 - d) factor. In matrix notation, the PageRank vector of stationary probabilities is the principal eigenvector for the matrix Arow, which is obtained from the adjacency matrix A representing the graph, with all rows normalized to sum to 1 (P = Arow^T P).</Paragraph>
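As an illustration of the scoring scheme described above, the following is a minimal sketch (not the authors' implementation) of the PageRank update S(Va) = (1 - d) + d * sum of S(Vb)/|Out(Vb)| over predecessors Vb, iterated until the per-vertex change falls below a threshold; the graph representation and parameter defaults are ours:

```python
def pagerank(out_links, d=0.85, threshold=1e-4, max_iter=100):
    """out_links: dict mapping each vertex to the list of vertices it points to."""
    vertices = set(out_links)
    for targets in out_links.values():
        vertices.update(targets)
    # Precompute In(Va), the predecessors of each vertex
    in_links = {v: [] for v in vertices}
    for src, targets in out_links.items():
        for t in targets:
            in_links[t].append(src)
    scores = {v: 1.0 for v in vertices}  # any positive initialization works
    for _ in range(max_iter):
        new_scores = {v: (1 - d) + d * sum(scores[u] / len(out_links[u])
                                           for u in in_links[v])
                      for v in vertices}
        # stop when the change (error rate) is below the threshold everywhere
        if max(abs(new_scores[v] - scores[v]) for v in vertices) < threshold:
            return new_scores
        scores = new_scores
    return scores
```

On a symmetric two-vertex cycle the scores converge to 1.0 for both vertices, which is the fixed point of the update above.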
    <Paragraph position="9"> Intuitively, the stationary probability associated with a vertex in the graph represents the probability of finding the walker at that vertex during the random walk, and thus it represents the importance of the vertex within the graph. In the context of sequence data labeling, the random walk is performed on the label graph associated with a sequence of words, and thus the resulting stationary distribution of probabilities can be used to decide on the most probable set of labels for the given sequence.</Paragraph>
    <Section position="1" start_page="53" end_page="54" type="sub_section">
      <SectionTitle>
2.1 TextRank
</SectionTitle>
      <Paragraph position="0"> Given a natural language processing task, the TextRank model includes four general steps for the application of a graph-based ranking algorithm to graph structures derived from natural language texts:  1. Identify text units that best define the proposed task and add them as vertices in the graph.</Paragraph>
      <Paragraph position="1"> 2. Identify relations that connect such text units, and use these relations to draw edges between  vertices in the graph. Edges can be directed or undirected, weighted or un-weighted.</Paragraph>
      <Paragraph position="2">  3. Iterate the graph ranking algorithm to convergence. 4. Sort vertices based on their final score. Use the  values attached to each vertex for ranking. The strength of this model lies in the global representation of the context and its ability to model how the co-occurrence between features might propagate across the context and affect other distant features. While TextRank has already been applied to several language processing tasks, we focus here on the keyword extraction task, since it best relates to our approach. The goal of a keyword extraction tool is to find a set of words or phrases that best describe a given document. The co-occurrence relation within a specific window is used to portray the correlation between words, which are represented as vertices in the graph. Two vertices are connected if their corresponding lexical units co-occur within a window of at most N words, where N can be set to any value greater than two. The TextRank application to keyword extraction has also used different syntactic filters for vertex selection, including all open class words, nouns and verbs, nouns and adjectives, and others. The algorithm was found to provide the best results using nouns and adjectives with a window size of two.</Paragraph>
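The four TextRank steps above can be sketched end-to-end for keyword extraction; this is an illustrative reimplementation with our own function names, using plain tokens (no syntactic filter) and a default window of two:

```python
def textrank_keywords(tokens, window=2, d=0.85, threshold=1e-4, max_iter=100):
    # Steps 1 and 2: tokens become vertices; co-occurrence within the window
    # draws undirected, unweighted edges
    neighbors = {t: set() for t in tokens}
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[j] != t:
                neighbors[t].add(tokens[j])
                neighbors[tokens[j]].add(t)
    # Step 3: iterate the unweighted PageRank update to convergence
    scores = {t: 1.0 for t in neighbors}
    for _ in range(max_iter):
        new = {t: (1 - d) + d * sum(scores[u] / len(neighbors[u])
                                    for u in neighbors[t])
               for t in neighbors}
        if max(abs(new[t] - scores[t]) for t in neighbors) < threshold:
            scores = new
            break
        scores = new
    # Step 4: sort vertices by their final score
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

On an undirected co-occurrence graph the stationary scores grow with vertex connectivity, so frequently co-occurring terms rise to the top of the ranking.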
      <Paragraph position="3"> Our approach follows the same main steps as used in the TextRank keyword extraction application. We are however incorporating a larger number of lexical units, and we use different window sizes, as we will show in the following section.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="54" end_page="55" type="metho">
    <SectionTitle>
3 TextRank for Term Weighting
</SectionTitle>
    <Paragraph position="0"> The goal of the work reported in this paper is to study the ranking scores obtained using TextRank, and evaluate their potential usefulness as a new measure of term weighting.</Paragraph>
    <Paragraph position="1"> To understand how the random-walk weights (rw) might be a good replacement for the traditional term frequency weights (tf), consider the example in Figure 1. The example represents a sample document from the Reuters collection. A graph is constructed as follows. If a term has not been previously seen, then a node is added to the graph to represent this term. A term can only be represented by one node in the graph. An undirected edge is drawn between two nodes if they co-occur within a certain window size. This example assumes a window size of two, corresponding to two consecutive terms in the text (e.g. London is linked to based).</Paragraph>
    <Paragraph position="2"> London-based sugar operator Kaines Ltd confirmed it sold two cargoes of white sugar to India out of an estimated overall sales total of four or five cargoes in which other brokers participated. The sugar, for April/May and April/June shipment, was sold at between 214 and 218 dlrs a tonne cif, it said.  Table 1 shows the tf and rw weights, also plotted in Figure 3. By analyzing the rw weights, we can observe a non-linear correlation with the tf weights, with an emphasis given to terms surrounding important key terms such as &amp;quot;sugar&amp;quot; or &amp;quot;cargoes.&amp;quot; This spatial locality has resulted in higher ranks for terms like &amp;quot;operator&amp;quot; compared to other terms like &amp;quot;lon-</Paragraph>
  </Section>
  <Section position="5" start_page="55" end_page="57" type="metho">
    <SectionTitle>
4 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> To evaluate our random-walk based approach to feature weighting, we integrate it in a text classification algorithm, and evaluate its performance on several standard text classification data sets.</Paragraph>
    <Section position="1" start_page="55" end_page="55" type="sub_section">
      <SectionTitle>
4.1 Random-Walk Term Weighting
</SectionTitle>
      <Paragraph position="0"> Starting with a given document, we determine a ranking over the words in the document by using the approach described in Section 3.</Paragraph>
      <Paragraph position="1"> First, we tokenize the document, handling punctuation, special symbols, and word abbreviations. We also remove common words, using a list of approximately 500 frequently used words as used in the Smart retrieval system (see footnote 3).</Paragraph>
      <Paragraph position="2"> Next, the resulting text is processed to extract both tf and rw weights for each term in the document.</Paragraph>
      <Paragraph position="3"> Note that we do not apply any syntactic filters, as was previously done in applications of TextRank.</Paragraph>
      <Paragraph position="4"> Instead, we consider each word as a potential feature. To determine tf we simply count the frequencies of each word in the document. To determine rw, all the terms are added as vertices in a graph representing the document. A co-occurrence scanner is then applied to the text to relate the terms that co-occur within a given window size. For a given term, all the terms that fall in the vicinity of this term are considered dependent terms. This is represented by a set of edges that connect this term to all the other terms in the window. Experiments are performed for window sizes of 2, 4, 6, and 8. Once the graph is constructed and the edges are in place, the TextRank algorithm is applied (see footnote 4). The result of the ranking process is a list of all input terms and their corresponding rw scores.</Paragraph>
      <Paragraph position="5"> We then calculate tf.idf and rw.idf as follows:</Paragraph>
      <Paragraph position="6"> tf.idf = tf x log(ND / n) </Paragraph>
      <Paragraph position="7"> where ND represents the total number of documents in the collection and n is the number of documents in which the target term appears at least once.</Paragraph>
      <Paragraph position="8"> Similarly,</Paragraph>
      <Paragraph position="9"> rw.idf = rw x log(ND / n) </Paragraph>
      <Paragraph position="10"> These term weights (tf.idf or rw.idf) are then used to create a feature vector for each document.</Paragraph>
      <Paragraph position="11"> The vectors are fed to a traditional text classification system, using one of the learning algorithms described below. The results obtained using tf.idf will act as a baseline in our evaluation.</Paragraph>
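The weighting step above can be sketched as follows, assuming the standard w.idf = w x log(ND / n) formulation; the function names are ours, not from the evaluated system:

```python
import math

def idf(term, docs):
    """docs: list of token collections (one per document in the collection)."""
    nd = len(docs)
    n = sum(1 for doc in docs if term in doc)
    return math.log(nd / n) if n else 0.0

def weight_idf(term_weights, docs):
    """term_weights: dict term -> tf or rw weight for one document.
    Returns the corresponding tf.idf or rw.idf feature vector."""
    return {t: w * idf(t, docs) for t, w in term_weights.items()}
```

Passing tf counts yields the tf.idf baseline vector, and passing rw scores yields the rw.idf vector, so the same routine covers both weighting schemes.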
    </Section>
    <Section position="2" start_page="55" end_page="56" type="sub_section">
      <SectionTitle>
4.2 Text Classification
</SectionTitle>
      <Paragraph position="0"> Text classification is a problem typically formulated as a machine learning task, where a classifier learns how to distinguish between categories in a given set (Footnote 3: ftp://ftp.cs.cornell.edu/pub/smart)</Paragraph>
      <Paragraph position="1"> (Footnote 4: We use an implementation where the maximum number of iterations is limited to 100, the damping factor is set to 0.85, and the convergence threshold to 0.0001. Each graph node is assigned an initial weight of 0.25.)</Paragraph>
      <Paragraph position="2">  using features automatically extracted from a collection of training documents. A large body of algorithms has been tested on text classification problems, in part because this task is one of the testbeds of choice for machine learning algorithms. In the experiments reported here, we compare results obtained with four frequently used text classifiers - Rocchio, Naive Bayes, Nearest Neighbor, and Support Vector Machines - selected based on their diversity of learning methodologies.</Paragraph>
      <Paragraph position="3"> Na&amp;quot;ive Bayes. The basic idea in a Na&amp;quot;ive Bayes text classifier is to estimate the probability of a category given a document using joint probabilities of words and documents. Na&amp;quot;ive Bayes assumes word independence, which means that the conditional probability of a word given a category is assumed to be independent of the conditional probability of other words given the same category.</Paragraph>
      <Paragraph position="4"> Despite this simplification, Na&amp;quot;ive Bayes classifiers were shown to perform surprisingly well on text classification (Joachims, 1997), (Schneider, 2004).</Paragraph>
      <Paragraph position="5"> While there are several versions of Na&amp;quot;ive Bayes classifiers (variations of multinomial and multivariate Bernoulli), we use the multinomial model (Mc-Callum and Nigam, 1998), which was shown to be more effective.</Paragraph>
      <Paragraph position="6"> Rocchio. This is an adaptation of the relevance feedback method developed in information retrieval (Rocchio, 1971). It uses standard tf.idf weighted vectors to represent documents, and builds a prototype vector for each category by summing up the vectors of the training documents in each category.</Paragraph>
      <Paragraph position="7"> Test documents are then assigned to the category that has the closest prototype vector, based on cosine similarity. Text classification experiments with different versions of the Rocchio algorithm showed competitive results on standard benchmarks (Joachims, 1997), (Moschitti, 2003).</Paragraph>
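The prototype-and-cosine procedure described above can be sketched as follows; this is an illustrative reimplementation with our own names, not the evaluated system:

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(a.get(t, 0.0) * w for t, w in b.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rocchio_classify(train, test_vec):
    """train: dict category -> list of term -> weight vectors.
    Builds one prototype per category by summing its training vectors,
    then assigns the test vector to the closest prototype."""
    prototypes = {}
    for cat, vecs in train.items():
        proto = {}
        for vec in vecs:
            for t, w in vec.items():
                proto[t] = proto.get(t, 0.0) + w
        prototypes[cat] = proto
    return max(prototypes, key=lambda c: cosine(prototypes[c], test_vec))
```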
      <Paragraph position="8"> KNN. K-Nearest Neighbor is one of the earliest text categorization approaches (Makoto and Takenobu, 1995; Masand et al., 1992). The algorithm classifies a test document based on the best class label identified for its K nearest neighbors among the training documents. The best class label is chosen by weighting the class of each similar training document with its similarity to the target test document.</Paragraph>
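The weighted K-nearest-neighbor rule above can be sketched as follows, with each of the K most similar training documents voting for its class, weighted by cosine similarity; this is an illustrative sketch, not the evaluated implementation:

```python
import math

def knn_classify(train, test_vec, k=3):
    """train: list of (term -> weight dict, label) pairs."""
    def cosine(a, b):
        dot = sum(a.get(t, 0.0) * w for t, w in b.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    # take the K most similar training documents
    sims = sorted(((cosine(vec, test_vec), label) for vec, label in train),
                  reverse=True)[:k]
    # each neighbor votes for its class with weight equal to its similarity
    votes = {}
    for sim, label in sims:
        votes[label] = votes.get(label, 0.0) + sim
    return max(votes, key=votes.get)
```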
      <Paragraph position="9"> SVM. Support Vector Machines (Vapnik, 1995) is a state-of-the-art machine learning approach based on decision planes. The algorithm finds the hyperplane that separates the sets of points associated with different class labels with a maximum margin.</Paragraph>
      <Paragraph position="10"> The unlabeled examples are then classified by deciding on which side of the hyper-surface they reside. The hyperplane can be a simple linear plane, as first proposed by Vapnik, or a non-linear surface such as a polynomial, radial, or sigmoid kernel. In our evaluation we used the linear kernel, since it was shown to be as powerful as the other kernels when tested on text classification data sets (Yang and Liu, 1999).</Paragraph>
    </Section>
    <Section position="3" start_page="56" end_page="57" type="sub_section">
      <SectionTitle>
4.3 Data Sets
</SectionTitle>
      <Paragraph position="0"> In our experiments we use Reuters-21578, WebKB, 20Newsgroups, and LingSpam datasets. These datasets are commonly used for text classification evaluations (Joachims, 1996; Craven et al., 1998; Androutsopoulos et al., 2000; Mihalcea and Hassan, 2005).</Paragraph>
      <Paragraph position="1"> Reuters-21578. This is a publicly available subset of the Reuters news collection, containing about 120 categories. We use the standard ModApte data split (Apte et al., 1994). The unlabeled documents were discarded and only the documents with one or more class labels were used in the classification experiments. WebKB. This is a data set collected from computer science departments of various universities by the CMU text learning group. The dataset contains seven class labels: Project, Student, Department, Faculty, Staff, Course, and Other. The Other label was removed from the dataset for evaluation purposes. Most of the evaluations in the literature have been performed on only four of the categories (Project, Student, Faculty, and Course), since they represent the largest categories. However, since we wanted to see how our system behaves when only a few training examples are available, as in the Staff and the Department classes, we performed our evaluations on two versions of WebKB: one with four categories (WebKB4) and one with six categories (WebKB6).</Paragraph>
      <Paragraph position="2"> 20-Newsgroups. This is a collection of 20,000 messages from 20 different newsgroups, corresponding to different topics or subjects. Each newsgroup has about 1,000 messages, split into 400 test and 600 training documents.</Paragraph>
      <Paragraph position="3"> LingSpam. This is a spam corpus, consisting of email messages organized in 10 collections to allow for 10-fold cross validation. Each collection has roughly 300 spam and legitimate messages. There are four versions of the corpus: bare, stop-word filtered, lemmatized, and both stop-word filtered and lemmatized. We use the bare collection with a standard 10-fold cross validation.</Paragraph>
    </Section>
    <Section position="4" start_page="57" end_page="57" type="sub_section">
      <SectionTitle>
4.4 Performance Measures
</SectionTitle>
      <Paragraph position="0"> To evaluate the classification system we used the traditional accuracy measure, defined as the number of correct predictions divided by the number of evaluated examples.</Paragraph>
      <Paragraph position="1"> We also use the correlation coefficient (r) as a diversity measure to evaluate the dissimilarity between the weighting models. Pairwise diversity measures have traditionally been used to measure the statistical independence among an ensemble of classifiers (Kuncheva and Whitaker, 2003). Here, we use them to measure the correlation between our random-walk approach and the traditional term frequency approach. The typical setting in which pairwise diversity measures are used is a set of different classifiers that classify the same set of feature vectors or documents over a given dataset. In our evaluation, we instead use the same classifier to evaluate two different sets of feature vectors, produced by two different weighting schemes: the random-walk weighting (rw) and the term frequency weighting (tf). Since the two feature vector collections are evaluated by one classifier at a time, the resulting diversity scores reflect the diversity of the two systems.</Paragraph>
      <Paragraph position="2"> Let Di and Dj be two feature weighting models with the following contingency table.</Paragraph>
      <Paragraph position="4"> Here a is the number of documents classified correctly by both Di and Dj, b the number classified correctly by Di only, c the number classified correctly by Dj only, and d the number misclassified by both. The correlation coefficient is then r = (ad - bc) / sqrt((a + b)(c + d)(a + c)(b + d)).</Paragraph>
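A small sketch of computing the pairwise correlation coefficient r from the standard 2x2 contingency counts (both models correct, only Di correct, only Dj correct, both wrong), following the classifier-diversity formulation of Kuncheva and Whitaker (2003); the function name and input encoding are ours:

```python
import math

def correlation(pred_i, pred_j, gold):
    """pred_i, pred_j: predictions of the two weighting models; gold: true labels."""
    a = b = c = d = 0
    for pi, pj, g in zip(pred_i, pred_j, gold):
        if pi == g and pj == g:
            a += 1          # both correct
        elif pi == g:
            b += 1          # only Di correct
        elif pj == g:
            c += 1          # only Dj correct
        else:
            d += 1          # both wrong
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0
```

Identical models yield r = 1, while models whose errors are statistically independent yield r near 0, which is what makes r useful as a dissimilarity measure between the two weighting schemes.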
    </Section>
  </Section>
</Paper>