<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2030"> <Title>Using Bilingual Comparable Corpora and Semi-supervised Clustering for Topic Tracking</Title> <Section position="8" start_page="234" end_page="237" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="234" end_page="234" type="sub_section"> <SectionTitle> 4.1 Creating Japanese Corpus </SectionTitle> <Paragraph position="0"> We chose the TDT3 English corpora as our gold standard. TDT3 consists of 34,600 stories with 60 manually identified topics. We then created Japanese corpora (Mainichi and Yomiuri newspapers) to evaluate the method. We annotated a total of 66,420 stories, from Oct. 1 to Dec. 31, 1998, against the 60 topics. Each story was labelled according to whether or not it discussed the topic. Not all of the topics were present in the Japanese corpora. We therefore collected one topic from TDT1 and two topics from TDT2, each of which occurred in Japan, and added them to the experiment. TDT1 is collected from a different period of dates than TDT3; the first story of the 'Kobe Japan Quake' topic starts from Jan. 16th.</Paragraph> <Paragraph position="1"> We annotated 174,384 stories of the Japanese corpora from that period for this topic. Table 2 shows the 24 topics which are included in the Japanese corpora. 'TDT' refers to the evaluation data: TDT1, 2, or 3. 'ID' denotes the topic number defined by the TDT. 'OnT.' (On-Topic) refers to the number of stories discussing the topic. Bold font indicates a topic which happened in Japan. The annotation was evaluated by three human judges.</Paragraph> <Paragraph position="2"> The classification is determined to be correct if the majority of the three judges agree.</Paragraph> </Section> <Section position="2" start_page="234" end_page="235" type="sub_section"> <SectionTitle> 4.2 Experimental Setup </SectionTitle> <Paragraph position="0"> The English data we used for extracting terms is the Reuters '96 corpus (806,791 stories), including the TDT1 and TDT3 corpora. The Japanese data was 1,874,947 stories from 14 years (1991 to 2004) of Mainichi newspapers (1,499,936 stories) and 3 years (1994, 1995, and 1998) of Yomiuri newspapers (375,011 stories). All Japanese stories were tagged with the morphological analyzer ChaSen (Matsumoto, 1997). English stories were tagged with a part-of-speech tagger (Schmid, 1995), and stop words were removed. We applied an n-gram model with Church-Gale smoothing to noun words, and selected terms whose probabilities were higher than a certain threshold. As a result, we obtained 338,554 Japanese and 130,397 English terms. We used the EDR bilingual dictionary to translate Japanese terms into English. Some of the words had no translation; for these, we estimated term correspondences. Each story is represented as a vector of terms with tf*idf weights. We calculated story similarities and extracted story pairs consisting of a positive story and its associated stories (see the sketch below). The threshold value for bilingual story pairs was 0.65, and that for monolingual pairs was 0.48. The difference in dates between bilingual stories was within ±4 days.</Paragraph> <Paragraph position="1"> For the tracking, we used the extracted terms together with all verbs, adjectives, and numbers, and represented each story as a vector of these with tf*idf weights.</Paragraph>
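As a rough illustration of the story pair extraction step described above, here is a minimal sketch. The tf*idf weighting, the similarity thresholds (0.65 bilingual, 0.48 monolingual), and the ±4-day window come from the text; the data layout and all function and variable names are hypothetical, not the authors' implementation.

```python
import math
from collections import Counter
from dataclasses import dataclass
from datetime import date

@dataclass
class Story:
    day: date
    lang: str                 # "en" or "ja"
    terms: list               # extracted terms; Japanese terms mapped to English

def tfidf_vector(story, df, n_docs):
    """tf*idf weights for one story; df maps term -> document frequency."""
    tf = Counter(story.terms)
    return {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def extract_story_pairs(positives, candidates, df, n_docs):
    """Pair each positive story with its associated stories, using the
    thresholds reported above: 0.65 for bilingual pairs, 0.48 for
    monolingual pairs, and a window of +-4 days for bilingual pairs."""
    pairs = []
    for p in positives:
        vp = tfidf_vector(p, df, n_docs)
        for c in candidates:
            bilingual = p.lang != c.lang
            if bilingual and abs((p.day - c.day).days) > 4:
                continue                    # outside the +-4 day window
            threshold = 0.65 if bilingual else 0.48
            if cosine(vp, tfidf_vector(c, df, n_docs)) >= threshold:
                pairs.append((p, c))
    return pairs
```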
<Paragraph position="2"> We used the evaluation measures from the TDT benchmark evaluations. 'Miss' denotes the miss rate: the ratio of on-topic stories (true YES) that the system failed to judge as YES. 'F/A' denotes the false alarm rate: the ratio of off-topic stories (true NO) that the system erroneously judged as YES. The DET curve plots misses against false alarms, and better performance is indicated by curves closer to the lower left of the graph. The detection cost function is defined as $C_{Det} = C_{Miss} \cdot P_{Miss} \cdot P_{target} + C_{FA} \cdot P_{FA} \cdot (1 - P_{target})$, where $C_{Miss}$, $C_{FA}$, and $P_{target}$ are the cost of a missed detection, the cost of a false alarm, and the a priori probability of finding a target, respectively; they are usually set to 10, 1, and 0.02. The normalized cost function is defined by Eq. (9) as $(C_{Det})_{Norm} = C_{Det} / \min(C_{Miss} \cdot P_{target}, C_{FA} \cdot (1 - P_{target}))$, and lower cost scores indicate better performance (a computational sketch of these measures is given later in this section). $N_t$ is the number of initial positive training stories. Recall that we used a subset of the topics defined by the TDT. We therefore implemented Allan's method (Allan et al., 1998), which is similar to our method, and compared the results. It is based on a tracking query created from the top 10 most commonly occurring features in the $N_t$ stories, with weight equal to the number of times the term occurred in those stories multiplied by its incremental idf value (this query construction is also sketched later in this section). They used a shallow tagger and selected all nouns, verbs, adjectives, and numbers. We added the extracted terms to these part-of-speech words to make their results comparable with the results of our method. 'Baseline' in Table 3 shows the best result obtained with their method among varying threshold values of similarity between queries and test stories. We can see that the performance of our method was competitive with the baseline at every $N_t$ value.</Paragraph> <Paragraph position="3"> Fig.3 shows DET curves for both our method and Allan's method (baseline) for 23 topics from TDT2 and TDT3. Fig.4 illustrates the results for 3 topics from TDT2 and TDT3 which occurred in Japan. To make some comparison possible, only the result with $N_t = 4$ is given for each. Both figures show that we have an advantage in using bilingual comparable corpora.</Paragraph> </Section> <Section position="3" start_page="235" end_page="236" type="sub_section"> <SectionTitle> 4.4 The Effect of Story Pairs </SectionTitle> <Paragraph position="0"> The contribution of the extracted story pairs, especially the use of two types of story pairs, bilingual and monolingual, is best explained by looking at two results: (i) the tracking results with two types of story pairs, with only the English and Japanese stories in question, and without story pairs, reported for $N_t = 4$, and (ii) the results of story pair extraction for varying values of $N_t$.</Paragraph> <Paragraph position="3"> As can be clearly seen from Fig.5, the result with story pairs improves the overall performance; in particular, the result with two types of story pairs was better than that with only the English and Japanese stories in question. Table 4 shows the performance of story pairs, which consist of a positive story and its associated stories. Each result denotes micro-averaged scores. 'Rec.' is the number of correct story pair assignments made by the system divided by the total number of correct assignments. 'Prec.' is the number of correct story pair assignments made by the system divided by the total number of the system's assignments. Table 4 shows that the system with two types of story pairs correctly extracted stories related to the target topic even for a small number of positive training stories, since the Prec. at $N_t = 1$ is 0.82. However, each recall value in Table 4 is low.</Paragraph>
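To make the evaluation measures described earlier in this section concrete, here is a minimal sketch of the detection cost and its normalization. The constants (10, 1, and 0.02) come from the text; the function names are hypothetical.

```python
def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.02):
    """TDT detection cost: C_Det = C_Miss*P_Miss*P_target + C_FA*P_FA*(1 - P_target)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def normalized_detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.02):
    """Normalized cost (Eq. 9): lower scores indicate better performance."""
    c_det = detection_cost(p_miss, p_fa, c_miss, c_fa, p_target)
    return c_det / min(c_miss * p_target, c_fa * (1.0 - p_target))

# Example: a run with a 20% miss rate and a 1% false alarm rate.
print(normalized_detection_cost(p_miss=0.20, p_fa=0.01))   # ~0.25
```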
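The baseline tracking query described earlier (Allan et al., 1998) can likewise be sketched: take the top 10 most commonly occurring features in the $N_t$ training stories and weight each by its in-set count times an incremental idf value. The incremental-idf bookkeeping is simplified here, and all names are hypothetical.

```python
import math
from collections import Counter

def build_tracking_query(training_stories, incremental_idf, top_n=10):
    """Build the baseline query: the top_n most common features in the N_t
    training stories, each weighted by (count in those stories) * idf."""
    counts = Counter()
    for story in training_stories:        # each story: a list of features
        counts.update(story)
    return {feat: cnt * incremental_idf(feat)
            for feat, cnt in counts.most_common(top_n)}

# A toy incremental idf over the n_seen stories observed so far
# (add-one smoothed so unseen features do not divide by zero).
def make_incremental_idf(df_so_far, n_seen):
    return lambda feat: math.log((n_seen + 1) / (df_so_far.get(feat, 0) + 1))
```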
One solution is to use an incremental approach: by repeating story pair extraction, new story pairs that were not previously extracted may be found. This is a rich space for further exploration.</Paragraph> <Paragraph position="4"> The effect of story pairs on the tracking task also depends on the performance of the bilingual term correspondences. We obtained 1,823 English and Japanese term pairs in all when the period of days was ±4. Fig.6 illustrates the results using different periods of days (±1 to ±10). For example, '±1' indicates that the difference in dates between English and Japanese story pairs is less than ±1 day. The Y-axis shows the precision, i.e., the number of correct term pairs produced by the system divided by the total number of the system's assignments. Fig.6 shows that the difference in dates between bilingual story pairs affects the overall performance.</Paragraph> </Section> <Section position="4" start_page="236" end_page="237" type="sub_section"> <SectionTitle> 4.5 The Effect of k-means with EM </SectionTitle> <Paragraph position="0"> The contribution of k-means with EM for classifying negative stories is explained by looking at the result without classifying negative stories. We calculated the centroid using all negative training stories, and a test story is judged to be negative or positive by calculating the cosine similarities between the test story and each centroid of the negative and positive stories. Further, to examine the effect of using the BIC, we compared it with choosing a pre-defined k, i.e. k=10, 50, and 100 (a sketch of BIC-based selection of k is given below). Fig.7 illustrates part of the result for k=100. We can see that the method without classifying negative stories (k=0) does not perform as well and results in a high miss rate. This result is not surprising: the number of negative training stories is large compared with that of positive ones, and therefore the test story is erroneously judged as NO. Furthermore, the result indicates that we need to run the BIC, as it was better than the results with any pre-defined k, i.e. k=10, 50, and 100. We also found that there was no correlation between the number of negative training stories for each of the 24 topics and the number of clusters k obtained by the BIC. The minimum number of clusters k was 44, and the maximum was 100.</Paragraph>
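As a rough illustration of selecting the number of clusters k with the BIC, the following sketch substitutes an EM-trained Gaussian mixture (scikit-learn) for the authors' k-means-with-EM formulation; it shows the model-selection idea under that assumption, not their implementation, and the candidate values of k are arbitrary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def choose_k_by_bic(X, candidate_ks=(10, 25, 50, 75, 100), seed=0):
    """Fit one EM-trained mixture per candidate k over the negative-story
    vectors X, and keep the k whose BIC score is lowest (BIC trades off
    likelihood against model size)."""
    best_k, best_bic = None, np.inf
    for k in candidate_ks:
        gm = GaussianMixture(n_components=k, covariance_type="diag",
                             random_state=seed).fit(X)
        bic = gm.bic(X)                # lower BIC is better
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k

# Usage (hypothetical): X is a dense array of negative-story tf*idf vectors,
# with more rows than the largest candidate k.
# k = choose_k_by_bic(X)
```
</Section> </Section> </Paper>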