File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/w00-1108_intro.xml
Size: 2,580 bytes
Last Modified: 2025-10-06 14:01:00
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1108"> <Title>A Text Categorization Based on Summarization Technique</Title> <Section position="3" start_page="79" end_page="79" type="intro"> <SectionTitle> 2 Methods </SectionTitle> <Paragraph position="0"> This section describes a series of algorithms for text categorization based on the title summarization technique.</Paragraph> <Section position="1" start_page="79" end_page="79" type="sub_section"> <SectionTitle> 2.1 Preprocessing and Feature Selection </SectionTitle> <Paragraph position="0"> We divide the corpus texts into words, delimited by white space and punctuation. All characters are converted to lower case and stop words are removed.</Paragraph> <Paragraph position="1"> After the words are stemmed, we call them terms. These terms are then used as features.</Paragraph> </Section> <Section position="2" start_page="79" end_page="79" type="sub_section"> <SectionTitle> 2.2 Term Weighting </SectionTitle> <Paragraph position="0"> Weights are now assigned to the surviving features in each category. We design several different formulas for term weighting. In each formula, we associate a weight, W(f, c), with each surviving feature, f, in category c, in the same way that weights are assigned to index terms in information retrieval. In addition, we normalize the term frequency, tf, across categories.</Paragraph> <Paragraph position="1"> The probability of each category is also taken into account. 
We define W(f, c) by equations 1 through 3.</Paragraph> <Paragraph position="3"> The quantities used in these equations are the frequency of the feature f appearing in the category c, the number of categories, the number of categories that contain the feature f, the maximum frequency of any feature in category c, and the number of documents belonging to category c in the training set.</Paragraph> </Section> <Section position="3" start_page="79" end_page="79" type="sub_section"> <SectionTitle> 2.3 Category Ranking </SectionTitle> <Paragraph position="0"> We now have an index suitable for use in the category ranking process. The index contains features and a weighted value, W(f, c), associated with each feature f in each category c. Given a document, d, a rank can be associated with each category with respect to d. Let F_c be the set of features, f, in category c. The ranking of category c with respect to document d, R(c, d), is defined by equation 4.</Paragraph> <Paragraph position="2"> where tf_fd = the frequency of the feature f appearing in the document d, F_c = the set of features f in category c.</Paragraph> </Section> </Section> </Paper>
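The preprocessing of Section 2.1 and the ranking step of Section 2.3 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the text: the stop-word list and the suffix-stripping stemmer are toy stand-ins, the names extract_terms and rank_categories are hypothetical, and because equation 4 is not reproduced here, the rank is assumed to be the sum of tf(f, d) * W(f, c) over the features f in category c. The weights are taken as given rather than computed from equations 1 through 3.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the source does not say which list was used.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Lower-case, split on white space and punctuation, drop stop words, stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def rank_categories(text, weights):
    """Rank categories for one document.

    weights maps category -> {feature: W(f, c)}.
    Assumed form of R(c, d): sum over f in category c of tf(f, d) * W(f, c).
    """
    tf = Counter(extract_terms(text))  # tf(f, d) for the document
    scores = {
        category: sum(tf[f] * w for f, w in feature_weights.items())
        for category, feature_weights in weights.items()
    }
    # Highest-scoring category first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Illustrative weights only; not values from the paper.
    demo_weights = {"sports": {"runn": 2.0, "park": 1.0}, "pets": {"cat": 1.5}}
    print(extract_terms("The cats are running in the park."))
    print(rank_categories("The cats are running in the park.", demo_weights))
```

A production system would likely swap the toy stemmer for an established one and derive the weights from the training corpus; the sketch only shows how the term extraction feeds the ranking.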