<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2030"> <Title>Using Bilingual Comparable Corpora and Semi-supervised Clustering for Topic Tracking</Title> <Section position="6" start_page="231" end_page="234" type="metho"> <SectionTitle> 3 System Description </SectionTitle> <Paragraph position="0"> The system consists of four procedures: extracting bilingual story pairs, extracting monolingual story pairs, clustering negative stories, and tracking.</Paragraph> <Section position="1" start_page="231" end_page="232" type="sub_section"> <SectionTitle> 3.1 Extracting Bilingual Story Pairs </SectionTitle> <Paragraph position="0"> We extract story pairs, each consisting of a positive English story and its associated Japanese stories, using the TDT English corpus and the Mainichi and Yomiuri Japanese corpora. To find the optimal positive English stories and their associated Japanese stories, we combine the outputs of multiple similarity links (multiple-links). The idea comes from speech recognition, where two outputs are combined to yield a better result on average. Fig.1 illustrates multiple-links. The TDT English corpus consists of training and test stories. Training stories are further divided into positive (black box) and negative (dotted box) stories. Arrows in Fig.1 denote edges labeled with the similarity value between stories. In Fig.1, for example, whether the story J discusses the target topic and is related to E is determined not only by the value of the similarity between E and J, but also by the similarities between J and the other stories linked to E.</Paragraph> <Paragraph position="1"> Extracting story pairs is summarized as follows. Let each initial positive training story E_i be an initial node, and let each Japanese story J_j be a node or terminal node in the graph G. We calculate cosine similarities between E_i and J_j, and also calculate similarities between pairs of Japanese stories. If the value of similarity between two nodes is larger than a certain threshold, we connect them by an edge (bold arrow in Fig.1). Next, we delete every edge which is not a constituent of a maximal connected sub-graph (dotted arrow in Fig.1). After eliminating edges, we extract each pair of an initial positive English story E_i and a Japanese story J_j that remains connected as a linked story pair, and add the associated Japanese story J_j to the training stories. In Fig.1, the story pairs connected by the remaining edges are extracted.</Paragraph>
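<Paragraph> The following Python sketch, which is not part of the original paper, illustrates the graph-based story pair extraction described above. The story representation as term-weight dictionaries, the cosine() helper, the threshold value, and the simplification of the sub-graph pruning to "keep the stories reachable from a positive English story" are all illustrative assumptions.

    from itertools import combinations
    from collections import defaultdict

    def cosine(v1, v2):
        """Cosine similarity between two sparse term-weight vectors (dicts)."""
        dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
        n1 = sum(w * w for w in v1.values()) ** 0.5
        n2 = sum(w * w for w in v2.values()) ** 0.5
        return dot / (n1 * n2) if n1 and n2 else 0.0

    def extract_bilingual_pairs(english_pos, japanese, threshold=0.3):
        """english_pos / japanese map story ids to term-weight vectors.
        Returns linked (E_i, J_j) story pairs; the threshold is an assumed value."""
        nodes = {**english_pos, **japanese}
        edges = defaultdict(set)
        # Connect two stories by an edge when their similarity exceeds the
        # threshold (English-Japanese links and Japanese-Japanese links).
        for a, b in combinations(nodes, 2):
            if a in english_pos and b in english_pos:
                continue  # the graph has no English-English edges
            if cosine(nodes[a], nodes[b]) > threshold:
                edges[a].add(b)
                edges[b].add(a)
        # Simplified pruning: keep only stories reachable from a positive
        # English story, then read off the linked (E_i, J_j) pairs.
        pairs = []
        for e_id in english_pos:
            seen, stack = {e_id}, [e_id]
            while stack:
                for nxt in edges[stack.pop()] - seen:
                    seen.add(nxt)
                    stack.append(nxt)
            pairs.extend((e_id, j_id) for j_id in seen if j_id in japanese)
        return pairs
</Paragraph>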
<Paragraph position="2"> The procedure for calculating cosine similarities between E_i and J_j consists of two sub-steps: extracting terms, and estimating bilingual term correspondences.</Paragraph> <Paragraph position="3"> Extracting terms. The first step in calculating the similarity between E_i and J_j is to align a Japanese term with its associated English term using the bilingual dictionary, EDR. However, this naive method suffers from frequent failure due to the incompleteness of the bilingual dictionary. Consider the Mainichi Japanese newspaper stories: the total number of terms (words) from Oct. 1, 1998 to Dec. 31, 1998 was 528,726, and 370,013 of these terms are not included in the EDR bilingual dictionary. For example, the Japanese term for 'Endeavour', a key term for the topic 'Shuttle Endeavour mission for space station' from the TDT3 corpus, is not included in the EDR bilingual dictionary. New terms which fail to be segmented correctly during morphological analysis are also a problem when calculating similarities between stories in monolingual data. For example, the proper noun meaning 'Tokyo Metropolitan Univ.' is divided into three separate terms, 'Tokyo', 'Metropolitan', and 'Univ.'. To tackle these problems, we conducted term extraction from a large collection of English and Japanese corpora. There are several techniques for term extraction (Chen, 1996). We used an n-gram model with Church-Gale smoothing, since Chen reported that it outperforms all existing methods on bigram models produced from large training data. The length of the extracted terms does not have a fixed range (we set it to at most five noun words). We thus applied the normalization strategy shown in Eq.(1) to each term length to bring the probability value into the range [0,1], and extracted terms whose probability value is greater than a certain threshold. Words from the TDT English corpus (respectively, the Japanese newspaper corpora) are identified if they match the extracted terms.</Paragraph> <Paragraph position="4"> Estimating bilingual term correspondences. We estimated bilingual term correspondences from co-occurrence statistics over a large collection of English and Japanese data. More precisely, let E_i be an English story (1 ≤ i ≤ n), where n is the number of stories in the collection, and let S_i be the set of Japanese stories whose cosine similarity to E_i is higher than a certain threshold value th: S_i = { J_j | sim(E_i, J_j) ≥ th }. Then we concatenate the constituent Japanese stories of each S_i and construct a pseudo-parallel corpus PPC_EJ of English and Japanese stories: PPC_EJ = { (E_i, S_i) | S_i ≠ ∅ }. Suppose there are two criteria, a monolingual term t_E in an English story and a term t_J in a Japanese story: we can determine whether or not each term belongs to a particular story. Consequently, term occurrences are divided into four classes, as shown in Table 1. Based on the contingency table of co-occurrence frequencies of t_E and t_J, we extract term t_J as a pair of t_E. For the extracted English and Japanese term pairs, we conducted semi-automatic acquisition, i.e. we manually selected bilingual term pairs: since our source data is not a clean parallel corpus but an artificially generated, noisy pseudo-parallel corpus, it is difficult to compile bilingual terms fully automatically (Dagan, 1997). Finally, we align a Japanese term with its associated English term using the selected bilingual term correspondences, and again calculate cosine similarities between Japanese and English stories.</Paragraph>
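<Paragraph> As an illustration of the contingency-table step, the sketch below (not from the paper) counts co-occurrences of a candidate term pair over the pseudo-parallel corpus and ranks Japanese terms for a given English term. The paper's own scoring function is not reproduced here; the log-likelihood ratio, the function names, and the top_k cut-off are assumptions.

    import math

    def contingency(pairs, t_e, t_j):
        """pairs: list of (english_terms, japanese_terms) sets built from the
        pseudo-parallel corpus PPC_EJ.  Returns the 2x2 co-occurrence counts."""
        a = b = c = d = 0
        for e_terms, j_terms in pairs:
            in_e, in_j = t_e in e_terms, t_j in j_terms
            if in_e and in_j:
                a += 1
            elif in_e:
                b += 1
            elif in_j:
                c += 1
            else:
                d += 1
        return a, b, c, d

    def log_likelihood_ratio(a, b, c, d):
        """Association score for the contingency table (an assumed measure;
        the paper's own scoring function is not reproduced here)."""
        n = a + b + c + d
        def term(obs, row, col):
            return obs * math.log(obs * n / (row * col)) if obs else 0.0
        return 2.0 * (term(a, a + b, a + c) + term(b, a + b, b + d)
                      + term(c, c + d, a + c) + term(d, c + d, b + d))

    def candidate_translations(pairs, t_e, japanese_vocab, top_k=5):
        """Rank Japanese terms as translation candidates for the English term t_e."""
        scored = []
        for t_j in japanese_vocab:
            a, b, c, d = contingency(pairs, t_e, t_j)
            if a > 0:  # the pair must co-occur at least once
                scored.append((log_likelihood_ratio(a, b, c, d), t_j))
        return [t for _, t in sorted(scored, reverse=True)[:top_k]]

The ranked candidates would then be filtered manually, matching the semi-automatic acquisition step described above.</Paragraph>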
</Section> <Section position="2" start_page="233" end_page="233" type="sub_section"> <SectionTitle> 3.2 Extracting Monolingual Story Pairs </SectionTitle> <Paragraph position="0"> We noted above that our source data is not a clean parallel corpus. The difference in dates between bilingual stories is therefore one of the key factors for improving the performance of story pair extraction: stories closer together in the timeline are more likely to discuss related subjects. We thus applied a method that extracts bilingual story pairs only from stories close together in the timeline. However, this often works against our basic motivation for using bilingual corpora, namely that bilingual corpora help to collect more information about the target topic. We therefore also extracted monolingual (Japanese) story pairs and added them to the training stories. Extracting Japanese monolingual story pairs is quite simple: let J_j be a Japanese story extracted in the bilingual story pair extraction procedure. We calculate the cosine similarity between J_j and each remaining Japanese story, and add the stories whose similarity exceeds a certain threshold to the training stories.</Paragraph> </Section> <Section position="3" start_page="233" end_page="234" type="sub_section"> <SectionTitle> 3.3 Clustering Negative Stories </SectionTitle> <Paragraph position="0"> Our method for classifying negative stories into clusters is based on Basu et al.'s method (Basu, 2002), which uses k-means with the EM algorithm. K-means is a clustering algorithm based on iterative relocation that partitions a dataset into k clusters, locally minimizing the average squared distance between the data points and the cluster centers (centroids).</Paragraph> <Paragraph position="1"> Suppose we classify the training stories X = {x_1, ..., x_N} into k clusters: one cluster consists of the positive stories, and the other k-1 clusters consist of negative stories. Which cluster does each negative story belong to? EM is a method for finding the maximum-likelihood estimate (MLE) of the parameters of an underlying distribution from a set of observed data that has missing values, and k-means is essentially EM on a mixture of k Gaussians under certain assumptions. In standard k-means without any initial supervision, the k means are chosen randomly in the initial M-step and the stories are assigned to the nearest means in the subsequent E-step. In our setting, the initial labels of the positive training stories are kept unchanged throughout the algorithm, whereas the cluster assignments of the negative stories are re-estimated at every E-step. We select k initial centers: one is the cluster center of the positive stories, and the other k-1 are the negative stories with the k-1 smallest similarity values to that positive cluster center.</Paragraph> <Paragraph position="2"> In Basu et al.'s method, the number k is given by the user. However, for negative training stories the number of clusters is not given beforehand, so we developed an algorithm for estimating k. It goes into action after each run of k-means (we set the maximum number of k to 100 in the experiment), making decisions about which set of clusters should be chosen in order to better fit the data. The splitting decision is made by computing the Bayesian Information Criterion shown in Eq.(3):

BIC(k = l) = ll_l(X) - (p_l / 2) * log N,   (3)

where ll_l(X) is the log-likelihood of X when the number of clusters k is l, N is the total number of training stories, and p_l is the number of parameters for k = l. We set p_l to the sum of the k class probabilities, the n * k centroid coordinates, and the MLE for the variance; the log-likelihood ll_l(X) is computed over the stories assigned to their closest centroid. We choose the number k whose BIC value is highest.</Paragraph>
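<Paragraph> A minimal sketch, assuming dense NumPy story vectors, of the seeded k-means loop and a BIC-style selection of k. The fixed positive cluster and the farthest-negative initialization follow the description above; the Euclidean assignment step, the shared-variance Gaussian log-likelihood, the variance estimate, and the parameter count p_k are simplified assumptions rather than the paper's exact formulation.

    import numpy as np

    def seeded_kmeans(pos, neg, k, n_iter=50):
        """Cluster positive (pos) and negative (neg) story vectors into k clusters.
        Cluster 0 is seeded with the positive stories, whose labels never change."""
        pos_centroid = pos.mean(axis=0)
        # Initial centers: the positive centroid plus the k-1 negative stories
        # that are least similar to it (here: farthest under Euclidean distance).
        dists = np.linalg.norm(neg - pos_centroid, axis=1)
        centers = np.vstack([pos_centroid, neg[np.argsort(-dists)[:k - 1]]])
        labels = np.zeros(len(neg), dtype=int)
        for _ in range(n_iter):
            # E-step: reassign only the negative stories.
            d = np.linalg.norm(neg[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # M-step: recompute centroids; cluster 0 always contains the positives.
            new_centers = centers.copy()
            for c in range(k):
                members = neg[labels == c]
                if c == 0:
                    members = np.vstack([pos, members]) if len(members) else pos
                if len(members):
                    new_centers[c] = members.mean(axis=0)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels

    def bic(pos, neg, centers, labels, k):
        """BIC(k) = ll(X) - (p_k / 2) * log N under a shared-variance Gaussian model."""
        X = np.vstack([pos, neg])
        assign = np.concatenate([np.zeros(len(pos), dtype=int), labels])
        N, dim = X.shape
        var = max(((X - centers[assign]) ** 2).sum() / (N - k), 1e-10)
        ll = 0.0
        for c in range(k):
            idx = assign == c
            n_c = idx.sum()
            if n_c == 0:
                continue
            err_c = ((X[idx] - centers[c]) ** 2).sum()
            ll += (n_c * np.log(n_c / N)                      # class probability
                   - n_c * dim / 2.0 * np.log(2 * np.pi * var)
                   - err_c / (2.0 * var))
        p_k = k + k * dim + 1   # k class probs + n*k centroid coords + one variance
        return ll - p_k / 2.0 * np.log(N)

    def choose_k(pos, neg, k_max=100):
        """Run seeded k-means for increasing k and keep the highest-BIC clustering."""
        best = None
        for k in range(2, min(k_max, len(neg)) + 1):
            centers, labels = seeded_kmeans(pos, neg, k)
            score = bic(pos, neg, centers, labels, k)
            if best is None or score > best[0]:
                best = (score, k, centers, labels)
        return best  # (bic, k, centers, labels)

The centroids returned by choose_k can be compared against a test story by cosine similarity, which is the decision rule used in the tracking step of Section 3.4.</Paragraph>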
</Section> <Section position="4" start_page="234" end_page="234" type="sub_section"> <SectionTitle> 3.4 Tracking </SectionTitle> <Paragraph position="0"> Each story is represented as a vector of terms with tf*idf weights in an n-dimensional space, where n is the number of terms in the collection.</Paragraph> <Paragraph position="1"> Whether or not each test story is positive is judged using the similarity (measured by the cosine) between the vector representation of the test story and the centroid g of each cluster. Fig.2 illustrates the clusters and a test story in the tracking procedure; in Fig.2, the negative training stories are classified into three groups. The centroid g of each cluster C is calculated as the average of the vectors of the stories in the cluster, g = (1/|C|) * sum_{d in C} d. The test story is judged by using these centroids: if the value of cosine similarity between the test story and the centroid of the positive stories is the largest among all centroids, the test story is declared to be positive. In Fig.2, the test story is regarded as negative, since its similarity to the positive centroid is the smallest. This procedure is repeated until the last test story is judged.</Paragraph> </Section> </Section> </Paper>