<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1049">
  <Title>Building Semantic Perceptron Net for Topic Spotting</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Topic spotting is the problem of identifying the presence of a predefined topic in a text document.</Paragraph>
    <Paragraph position="1"> More formally, given a set of n topics together with a collection of documents, the task is to determine, for each document, the probability that one or more of the topics are present in it. Topic spotting may be used to automatically assign subject codes to newswire stories, filter email and on-line news, and pre-screen documents in information retrieval and information extraction applications.</Paragraph>
    <Paragraph position="2"> Topic spotting, together with the related problem of text categorization, has been an active area of research for over a decade. A large number of techniques have been proposed to tackle the problem, including regression models, nearest neighbor classification, Bayesian probabilistic models, decision trees, inductive rule learning, neural networks, on-line learning, and support vector machines (Yang &amp; Liu, 1999; Tzeras &amp; Hartmann, 1993). Most of these methods are word-based and consider only the relationships between features and topics, not the relationships among features.</Paragraph>
    <Paragraph position="3"> It is well known that the performance of word-based methods is greatly affected by their lack of linguistic understanding and, in particular, their inability to handle synonymy and polysemy. A number of simple linguistic techniques have been developed to alleviate such problems, ranging from stemming, lexical chains, and thesauri (Jing &amp; Tzoukermann, 1999; Green, 1999) to word-sense disambiguation (Chen &amp; Chang, 1998; Leacock et al., 1998; Ide &amp; Veronis, 1998) and context (Cohen &amp; Singer, 1999; Jing &amp; Tzoukermann, 1999).</Paragraph>
    <Paragraph position="4"> The connectionist approach has been widely used to extract knowledge in many information processing tasks, including natural language processing, information retrieval, and image understanding (Anderson, 1983; Lee &amp; Dubin, 1999; Sarkas &amp; Boyer, 1995; Wang &amp; Terman, 1995). Because the connectionist approach closely resembles the human cognitive process in text processing, it seems natural to adopt this approach, in conjunction with linguistic analysis, to perform topic spotting. However, there have been few attempts in this direction, mainly because of the difficulty of automatically constructing the semantic networks for the topics.</Paragraph>
    <Paragraph position="5"> In this paper, we propose an approach to automatically build a semantic perceptron net (SPN) for topic spotting. The SPN is a connectionist model with a hierarchical structure. It uses a combination of context, co-occurrence statistics, and a thesaurus to group distributed but semantically related words into basic semantic nodes. The semantic nodes are then used to identify the topic. This paper discusses the design, implementation, and testing of an SPN for topic spotting.</Paragraph>
    <Paragraph position="6"> The paper is organized as follows. Section 2 discusses the topic representation, which is the prototype structure for the SPN. Sections 3 and 4 discuss, respectively, our approach to extracting the semantic correlations between words and to building the semantic groups and topic tree. Section 5 describes the building and training of the SPN, while Section 6 presents the experimental results. Finally, Section 7 concludes the paper.</Paragraph>
  </Section>
</Paper>