<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2030"> <Title>Using Bilingual Comparable Corpora and Semi-supervised Clustering for Topic Tracking</Title> <Section position="4" start_page="0" end_page="231" type="intro"> <SectionTitle> 3 </SectionTitle> <Paragraph position="0"> English and Japanese newspapers, Mainichi and Yomiuri Shimbun). Our hypothesis concerning bilingual corpora is that broadcasting stations in one country report local events more frequently and in more detail than overseas stations do, even when the events are world famous. Consider a topic from the TDT corpora: 'Kobe Japan quake', a world-famous topic from the TDT1, covers only 89 stories in that corpus. The Japanese newspapers Mainichi and Yomiuri, however, carry far more stories from the same period, namely 5,029 and 4,883 stories, respectively. These observations show that it is crucial to investigate the use of bilingual comparable corpora, together with NLP techniques, for collecting more information about specific topics. We extract Japanese stories that are relevant to the positive English stories using English-Japanese bilingual corpora together with the EDR bilingual dictionary. An associated story is obtained by aligning a Japanese term association with an English term association.</Paragraph> <Paragraph position="1"> The labeled negative stories, which are numerous, are classified into clusters with the help of the labeled positive stories. We use a semi-supervised clustering technique which combines labeled and unlabeled stories during clustering. Our goal for semi-supervised clustering is to classify negative stories into clusters such that each cluster is meaningful with respect to the class distribution provided by one cluster of positive training stories. We introduce k-means clustering, which can be viewed as an instance of the EM algorithm, and classify negative stories into clusters.
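The procedure just described (seed the clusters with labeled positive stories, run k-means as a hard-assignment instance of EM, and pick the number of clusters by BIC) might be sketched as follows. This is a minimal illustrative sketch under our own assumptions, not the authors' implementation: the function names, the toy data, and the x-means-style BIC formula are ours, not taken from the paper.

```python
# Illustrative sketch (not the paper's code): seeded k-means viewed as
# hard-assignment EM, plus an x-means-style BIC to choose the number of
# clusters k. Function names, data, and formulas are assumptions.
import numpy as np

def seeded_kmeans(X, seed_idx, seed_lab, k, n_iter=25):
    """Cluster story vectors X; labeled seed stories initialize centroids."""
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    seed_idx, seed_lab = np.asarray(seed_idx), np.asarray(seed_lab)
    for j in range(k):  # seeded centroid j = mean of its labeled stories
        if (seed_lab == j).any():
            centroids[j] = X[seed_idx[seed_lab == j]].mean(axis=0)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # E-step: assign each story to the nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # M-step: recompute each non-empty centroid as its cluster mean
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

def bic_score(X, labels, centroids):
    """BIC under identical spherical Gaussians (larger is better)."""
    n, d = X.shape
    k = len(centroids)
    sse = ((X - centroids[labels]) ** 2).sum()
    var = max(sse / (d * max(n - k, 1)), 1e-12)  # pooled MLE variance
    sizes = np.bincount(labels, minlength=k)
    nz = sizes[sizes > 0]
    log_lik = ((nz * np.log(nz / n)).sum()
               - n * d / 2 * np.log(2 * np.pi * var)
               - sse / (2 * var))
    n_params = k * (d + 1)  # k*d centroid coords, k-1 weights, 1 variance
    return log_lik - n_params / 2 * np.log(n)

def choose_k(X, seed_idx, seed_lab, k_range):
    """Run seeded k-means for each candidate k; keep the BIC maximizer."""
    runs = {k: seeded_kmeans(X, seed_idx, seed_lab, k) for k in k_range}
    best_k = max(runs, key=lambda k: bic_score(X, *runs[k]))
    return (best_k, *runs[best_k])
```

In this sketch only the positive cluster need be seeded; clusters without seeds fall back to random initialization, and BIC trades off model fit against the number of free parameters when selecting k.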
In general, the number of clusters k for the k-means algorithm is not given beforehand. We therefore use the Bayesian Information Criterion (BIC) as the splitting criterion and select the appropriate value of k.</Paragraph> </Section> </Paper>