<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1909"> <Title>Mining Linguistically Interpreted Texts</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Text Mining </SectionTitle>
<Paragraph position="0"> Text mining processes are usually divided into five major phases: A) Document collection: defining the set of documents from which knowledge is to be extracted. B) Pre-processing: a set of operations that transform the natural language documents into a list of useful terms. C) Data preparation and selection: identifying and selecting relevant terms from the pre-processed ones. D) Knowledge extraction: applying machine learning techniques to identify patterns that can classify or cluster the documents in the collection. E) Evaluation and interpretation: analyzing the results.</Paragraph>
<Paragraph position="1"> The pre-processing phase in text mining is essential and usually very expensive and time-consuming. Since texts are originally unstructured, a series of steps is required to represent them in a format compatible with knowledge extraction methods and tools. The usual techniques employed in phase B are the use of a stop-word list, whose entries are discarded from the original documents, and stemming, which reduces words to their roots.</Paragraph>
<Paragraph position="2"> Having the proper tools to process Portuguese texts, we investigate whether linguistic information can have an impact on the results of the whole process. In the next section we describe the tools we used for acquiring the linguistic knowledge on which we base our experiments.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Tools for acquiring linguistic knowledge </SectionTitle>
<Paragraph position="0"> The linguistic knowledge we use in the experiments is based on the syntactic analysis performed by the PALAVRAS parser (Bick, 2000). This Portuguese parser is robust enough to always produce an output, even for incomplete or incorrect sentences (which might be the case for the type of documents used in text mining tasks). It has a comparatively low error rate (less than 1% for word class and 3-4% for surface syntax) (Bick, 2003). We also used another tool that simplifies the extraction of features from the analyzed texts: the Palavras Xtractor (Gasperin et al., 2003). This tool converts the parser output into three XML files, containing: a) the list of all words from the text and their identifiers; b) morpho-syntactic information for each word; c) the sentences' syntactic structures. Using XSL (eXtensible Stylesheet Language) we can extract specific terms from the texts, according to their linguistic value. The resulting lists of terms, one for each combination of grammatical categories, are then passed to phases C, D and E. The experiments are described in detail in the next section.</Paragraph> </Section>
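To illustrate this extraction step, the sketch below selects words by part-of-speech tag from XML files in the spirit of those produced by the Palavras Xtractor. It is a minimal sketch only: the element and attribute names (word, analysis, pos, ref) are our assumptions rather than the tool's actual schema, and the paper performs this step with XSL stylesheets rather than Python.

```python
# Hypothetical sketch of POS-based term extraction from parser output that
# has been converted to XML. Tag and attribute names are illustrative
# guesses; the real Palavras Xtractor schema may differ.
import xml.etree.ElementTree as ET

def extract_terms(words_file, morpho_file,
                  keep_pos=frozenset({"n", "adj", "prop"})):
    """Return the words whose part-of-speech tag is in keep_pos."""
    # Assumed file 1: word list, e.g. <word id="w1">mineracao</word>
    words = {w.get("id"): w.text
             for w in ET.parse(words_file).getroot().iter("word")}
    # Assumed file 2: morpho-syntactic info, e.g. <analysis ref="w1" pos="n"/>
    root = ET.parse(morpho_file).getroot()
    return [words[a.get("ref")]
            for a in root.iter("analysis")
            if a.get("pos") in keep_pos and a.get("ref") in words]
```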
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Corpus </SectionTitle>
<Paragraph position="0"> The corpus used in the experiments is composed of a subset of the NILC corpus (Núcleo Interdisciplinar de Linguística Computacional), containing 855 documents corresponding to newspaper articles from Folha de São Paulo published in 1994. These documents belong to five newspaper sections: informatics, property, sports, politics and tourism.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Pre-processing techniques </SectionTitle>
<Paragraph position="0"> We prepared three different versions of the corpus (V1, V2 and V3) for 3-fold cross-validation.</Paragraph>
<Paragraph position="1"> Each version is partitioned into different training and testing parts, containing 2/3 and 1/3 of the documents, respectively.</Paragraph>
<Paragraph position="2"> For the experiments with the usual methods, irrelevant terms (stop-words) were eliminated from the documents on the basis of a stop-word list containing 476 terms (mainly articles, prepositions, auxiliary verbs and pronouns). The remaining terms were stemmed according to Martin Porter's algorithm (Porter, 1980). Based on these techniques we generated a collection of pre-processed documents called PD1.</Paragraph>
<Paragraph position="3"> To test our proposal we then pre-processed the 855 documents in a different way: we parsed all texts of our corpus, generated the corresponding XML files, and extracted terms according to their grammatical categories using XSL. Based on these techniques we generated a collection of pre-processed documents called PD2.</Paragraph>
<Paragraph position="4"> All other text mining phases were applied equally to both PD1 and PD2. We used relative frequency for the selection of relevant terms, and documents were represented according to the vector space model. For the categorization task, a vector of the most frequent terms was built for each class; these vectors were then combined into a global vector. We also tested different numbers of terms in the global vector (30, 60, 90, 120, 150). For the clustering task we measured the similarity of the documents using the cosine measure. After calculating document similarity, the data were encoded in the format required by the machine learning tool Weka (Witten, 2000). Weka is a collection of machine learning algorithms for data mining tasks that contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.</Paragraph>
<Paragraph position="5"> The machine learning techniques adopted in this work are decision trees for the categorization process and k-means for text clustering.</Paragraph>
<Paragraph position="6"> Decision tree induction is a supervised learning algorithm based on the recursive division of the training examples into representative subsets, using the information gain metric. Once a classification tree has been induced, it can be applied to new examples described with the same attributes as the training examples.</Paragraph>
<Paragraph position="7"> K-means divides a set of objects into k groups such that the resulting intracluster similarity is high while the intercluster similarity is low. Group similarity is measured with respect to the mean value of the objects in a group, which can be seen as the group's center of gravity (centroid). K-means was run with the default parameters suggested by the tool: seed 10 and 5 groups.</Paragraph>
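To make the preceding steps concrete, the next three sketches illustrate the PD1 pre-processing, the term weighting and similarity computation, and the clustering setup. First, a minimal sketch of the PD1-style pre-processing, assuming a tiny sample of the 476-term stop-word list; NLTK's Portuguese Snowball stemmer stands in for the Porter (1980) stemmer used in the paper.

```python
# Minimal sketch of PD1-style pre-processing: stop-word removal followed by
# stemming. The stop-word set below is a tiny illustrative sample, and the
# Portuguese Snowball stemmer is a stand-in for Porter's algorithm.
from nltk.stem.snowball import SnowballStemmer

STOP_WORDS = {"o", "a", "de", "que", "e", "em", "para", "um", "uma"}  # sample
stemmer = SnowballStemmer("portuguese")

def preprocess(text):
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]
```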
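Next, a sketch of the term weighting and similarity steps: terms are weighted by their relative frequency within a document, and document similarity is the cosine between the resulting vectors. The function names are ours, for illustration.

```python
# Relative-frequency term weighting and cosine similarity between two
# documents represented as sparse term -> weight dictionaries.
import math
from collections import Counter

def rel_freq_vector(terms):
    """Weight each term by its relative frequency in the document."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def cosine(v1, v2):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```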
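Finally, an illustrative stand-in for the clustering setup: the paper runs Weka's k-means with seed 10 and 5 groups, and here scikit-learn's KMeans plays the same role, on the assumption that doc_vectors is a documents-by-terms matrix built from the vectors above.

```python
# Illustrative replacement for the Weka k-means run (seed 10, 5 groups).
from sklearn.cluster import KMeans

def cluster_documents(doc_vectors, k=5, seed=10):
    """Assign each document (row of doc_vectors) to one of k clusters."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10)
    return km.fit_predict(doc_vectors)
```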
<Paragraph position="8"> The evaluation of the results for the categorization task is based on the classification error, which was used to compare the results for PD1 and PD2. For the clustering task, the results are evaluated in terms of recall and precision, computed from the generated confusion matrices.</Paragraph> </Section> </Section> </Paper>