<?xml version="1.0" standalone="yes"?> <Paper uid="N03-3001"> <Title>Semantic Language Models for Topic Detection and Tracking</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> TDT is a research program investigating methods for automatically organizing news stories by the events that they discuss (Allan, 2002a). The goal of TDT is to break the stream of news into individual news stories, to monitor the stories for events that have not been seen before, and to gather stories into groups that each discuss a single topic.</Paragraph> <Paragraph position="1"> Several approaches have been explored for comparing news stories in TDT. The traditional vector space approach (Yang et al., 1999) using cosine similarity has by far been the most consistently successful approach across different tasks and several data sets.</Paragraph> <Paragraph position="2"> In the recent past, a new probabilistic approach called Language Modeling (Ponte and Croft, 1998) has proven to be very effective in several information retrieval tasks. One of the attractive features of language models is that they are firmly rooted in the theory of probability, thereby allowing a researcher to explore more sophisticated models guided by the theoretical framework.</Paragraph> <Paragraph position="3"> Allan et al. (1999) applied language models to the first story detection task of TDT and found that their performance is on par with that of the traditional vector space models, if not better. In the language modeling approach to TDT, we measure the similarity of a news story D to a topic by the probability of its generation from the topic model M. 
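This generative scoring can be sketched in a few lines of Python. The sketch below is a minimal, hypothetical illustration, not the paper's implementation: it assumes whitespace tokenization and add-one (Laplace) smoothing for the topic model, and all function names are illustrative.

```python
import math
from collections import Counter

def topic_model(on_topic_stories):
    """Estimate a unigram topic model M from stories known to be on the topic.

    Add-one smoothing is an assumption of this sketch; the paper does not
    specify a smoothing scheme in this section.
    """
    counts = Counter()
    for story in on_topic_stories:
        counts.update(story.lower().split())
    total = sum(counts.values())
    vocab = len(counts)
    # Reserve one extra count for unseen terms so probabilities stay nonzero.
    return lambda w: (counts[w] + 1) / (total + vocab + 1)

def log_p_story(story, model):
    """log P(D|M): sum of log term probabilities, assuming term independence."""
    return sum(math.log(model(w)) for w in story.lower().split())

# Usage: a higher log-probability means the story is closer to the topic.
m = topic_model(["earthquake strikes city", "earthquake damage reported in city"])
print(log_p_story("earthquake reported", m) > log_p_story("stock market rally", m))
```

Working in log space avoids numerical underflow when multiplying many small per-term probabilities, a standard choice for unigram models.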
Using the unigram assumption of independence of terms, one can compute the probability of generation of a news story as the product of the probabilities of generation of the terms in the story, as shown in the following equation:</Paragraph> <Paragraph position="4"> P(D|M) = ∏_i P(w_i|M) </Paragraph> <Paragraph position="5"> where w_i is the i-th term in the story. The topic model M is typically estimated from the statistics of a set of stories that are known to be on the topic under consideration.</Paragraph> <Paragraph position="6"> One potential drawback of the unigram language model is that it treats all terms on an equal footing and ignores the semantic information carried by the terms. We believe that such information could be useful in determining the relative importance of a term to the topic of the story. For example, terms that belong to named-entity types such as person, location, or organization may convey more information about the topic of the story than other terms. Likewise, one might expect that nouns and verbs play a more important role than adjectives, adverbs, or prepositions in determining the topic of the story.</Paragraph> <Paragraph position="7"> The present work is an attempt to extend the language modeling framework to incorporate a model of the relative importance of terms according to the semantic class they belong to.</Paragraph> <Paragraph position="8"> The remainder of the report is organized as follows.</Paragraph> <Paragraph position="9"> Section 2 summarizes past attempts at capturing semantic-class information in information retrieval tasks. We present the methodology of the new semantic language modeling approach in Section 3. In Section 4, we present details of the link detection task and its evaluation. Section 5 describes the experiments performed and presents the results obtained. In Section 6 we analyze the performance of the new model. 
Section 7 ends the discussion with a few observations and outlines directions for future work.</Paragraph> </Section> </Paper>