<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2030">
  <Title>Using Bilingual Comparable Corpora and Semi-supervised Clustering for Topic Tracking</Title>
  <Section position="5" start_page="231" end_page="231" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> Most work addressing the small number of positive training stories applies statistical techniques based on word distributions together with machine-learning techniques. Allan et al. explored on-line adaptive filtering approaches based on a threshold strategy to tackle the problem (Allan et al., 1998). The basic idea behind their work is that stories closer together in the stream are more likely to discuss related topics than stories further apart. The method is based on unsupervised learning techniques except for its incremental nature. When a tracking query is first created from the N t training stories, it is also given a threshold. During the tracking phase, if a story S scores over that threshold, S is regarded as relevant and the query is regenerated as if S were among the N t training stories. This method was tested on the TDT1 corpus, and the adaptive approach was found to be highly successful, although adding more than four training stories provided little further benefit, even though up to 12 training stories were added in their experiments. The method proposed in this paper is similar to Allan's; however, our method collects relevant stories using story pairs extracted from bilingual comparable corpora. Methods for finding bilingual story pairs are well studied in the cross-language IR task and in work on MT systems and bilingual lexicons (Dagan, 1997).</Paragraph>
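The adaptive scheme described above can be sketched as follows. This is a minimal illustration only: the raw term-count query and dot-product score are stand-ins (Allan et al. use weighted term vectors), and all names are ours, not theirs.

```python
from collections import Counter

def make_query(stories):
    """Build a tracking query as the aggregate term-count vector
    of the current training stories (illustrative stand-in for
    the query-regeneration step)."""
    query = Counter()
    for story in stories:
        query.update(story)
    return query

def score(query, story):
    """Dot-product overlap between query and story term counts."""
    return sum(query[t] * c for t, c in Counter(story).items())

def track(training, stream, threshold):
    """Adaptive filtering: a story scoring at or above the threshold
    is marked relevant and folded back into the training set, and
    the query is regenerated as if it had been a training story."""
    training = list(training)
    query = make_query(training)
    relevant = []
    for story in stream:
        if score(query, story) >= threshold:
            relevant.append(story)
            training.append(story)
            query = make_query(training)  # regenerate the query
    return relevant
```

The key design point is the feedback loop: each accepted story shifts the query toward the evolving topic, which is why stories close together in the stream reinforce each other.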
    <Paragraph position="1"> Much of the previous work uses cosine similarity between story term vectors with weighting techniques such as TF-IDF (Allan et al., 1998), or cross-language similarities of terms. However, most rely on only the two stories in question to estimate whether or not they are about the same topic. We use multiple links among stories to produce better results.</Paragraph>
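The pairwise comparison used in that line of work amounts to the following. This is a generic sketch of TF-IDF weighting plus cosine similarity over sparse term vectors, not the exact weighting of any cited system.

```python
import math
from collections import Counter

def tfidf_vectors(stories):
    """TF-IDF weighting over a small story collection: term frequency
    times log(N / document frequency)."""
    n = len(stories)
    df = Counter()
    for story in stories:
        df.update(set(story))
    vectors = []
    for story in stories:
        tf = Counter(story)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Deciding topic identity from one such pairwise score is exactly the limitation the paragraph points out; our approach instead aggregates evidence over multiple links among stories.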
    <Paragraph position="2"> In the TDT tracking task, classifying negative stories into meaningful groups is also an important issue for tracking topics, since a large number of labelled negative stories are available in the TDT context. Basu et al. proposed a method using k-means clustering with the EM algorithm, where labeled data provides prior information about the conditional distribution of hidden category labels (Basu, 2002). They reported that the method outperformed standard random seeding and COP-k-means (Wagstaff, 2001). Our method shares its basic idea with Basu et al. An important difference from their method is that ours does not require the number of clusters k in advance, since k is determined during clustering: we use the BIC as the splitting criterion to estimate the proper value of k. This is important because, in the tracking task, the number of topics among the negative training stories is not known in advance.</Paragraph>
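The BIC-based splitting decision can be illustrated as follows. This is a simplified, x-means-style sketch for 1-D points under a spherical Gaussian model with a pooled variance; the paper's actual model and parameter count may differ.

```python
import math

def bic(points, centers, assignment):
    """BIC score for a clustering of 1-D points: maximized
    log-likelihood minus a complexity penalty of
    0.5 * params * log(n). Higher is better."""
    n, k = len(points), len(centers)
    # maximum-likelihood variance, pooled over all clusters
    sq = sum((x - centers[a]) ** 2 for x, a in zip(points, assignment))
    var = max(sq / n, 1e-9)
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1.0)
    params = 2 * k  # one mean and one weight per cluster (simplified)
    return loglik - 0.5 * params * math.log(n)
```

During clustering, a cluster is split only when the BIC of the two-cluster model exceeds that of keeping it whole, so k grows just far enough to fit the data, with no prior knowledge of the number of topics required.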
  </Section>
</Paper>