File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/02/c02-1086_metho.xml
Size: 16,555 bytes
Last Modified: 2025-10-06 14:07:50
<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1086"> <Title>Implicit Ambiguity Resolution Using Incremental Clustering in Korean-to-English Cross-Language Information Retrieval</Title> <Section position="3" start_page="2" end_page="2" type="metho"> <SectionTitle> 2 Implicit ambiguity resolution using incremental clustering </SectionTitle> <Paragraph position="0"> Figure 1 shows the overall architecture of our system, which incorporates an implicit ambiguity resolution method based on query-oriented document clusters. In the system, a query in Korean is first translated into English by looking up dictionaries, and documents are retrieved for the translated query using the vector space retrieval model. For the top-ranked retrieved documents, document clusters are created incrementally, and the weight of each retrieved document is re-calculated using the cluster preferences. This phase is the core of our implicit ambiguity resolution method. Below, we describe each module of the system.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.1 Dictionary-based query translation and ambiguities </SectionTitle> <Paragraph position="0"> Queries are written in natural language in Korean. We first apply morphological analysis and part-of-speech (POS) tagging to a query, and select keywords based on the POS information. For each keyword, we look up Korean-English dictionaries, and all the English translations in the dictionaries are chosen as query terms. We used a general-purpose bilingual dictionary and technical bilingual dictionaries (Chun, 2000). All in all, they contain 282,511 Korean entries and 505,003 English translations.</Paragraph> <Paragraph position="1"> Since a term can have multiple translations, the list of translated query terms can contain terms with different meanings as well as synonyms. While synonyms can improve retrieval effectiveness, terms with different meanings produced from the same original term can severely degrade retrieval performance.</Paragraph> <Paragraph position="2"> At this stage, we can apply a statistical ambiguity resolution method based on mutual information. In the experiments below, we examine two cases, i.e. with and without ambiguity resolution at this stage. A small illustrative sketch of the translation step is given below.</Paragraph> </Section>
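As a rough illustration of this dictionary-based translation step, the following Python sketch expands each Korean keyword into every English translation found in the bilingual dictionaries; the dictionary layout, the keyword-selection stub, and the function names are illustrative assumptions, not the authors' implementation.

# Illustrative sketch of Section 2.1: expand each Korean keyword into all of
# its English dictionary translations. Keyword selection is stubbed out; a
# real system would run Korean morphological analysis and POS tagging first.
from typing import Dict, List

def select_keywords(korean_query: str) -> List[str]:
    # Stub: assume the query is already segmented into content keywords.
    return korean_query.split()

def translate_query(korean_query: str,
                    bilingual_dict: Dict[str, List[str]]) -> List[str]:
    """Return every English translation of every keyword as a query term."""
    query_terms: List[str] = []
    for keyword in select_keywords(korean_query):
        # All translations are kept, so synonyms and wrong senses both enter
        # the translated query; this is the ambiguity discussed above.
        query_terms.extend(bilingual_dict.get(keyword, []))
    return query_terms

# Toy dictionary built from the example query discussed in Section 3.3.
sample_dict = {"jadongca": ["automobile"],
               "gonggi": ["air", "atmosphere", "jackstone", "bowl"],
               "oyeom": ["pollution"]}
print(translate_query("jadongca gonggi oyeom", sample_dict))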
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.2 Document retrieval based on the vector space retrieval model </SectionTitle> <Paragraph position="0"> For the query, documents are retrieved based on the vector space retrieval method. This method simply checks the presence of query terms and calculates similarities between the query and the documents. The query-document similarity of each document is calculated by the vector inner product of the query and document vectors:

sim_{VS}(q,d) = \sum_{t} w_{q,t} \cdot w_{d,t} \quad (1)

where the query and document term weights, w_{q,t} and w_{d,t}, are calculated by the ntc-ltn weighting scheme, which yielded the best retrieval result in Lee et al. (2001) among several weighting schemes used in the SMART system (Salton, 1989).</Paragraph> <Paragraph position="1"> As the translated query can contain noise, non-relevant documents may be ranked higher than relevant documents.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.3 Query-oriented incremental clustering for implicit ambiguity resolution </SectionTitle> <Paragraph position="0"> In order to exclude non-relevant documents from the higher ranks, we take the top N documents, create clusters incrementally and dynamically, and use the similarities between the clusters and the query to re-rank the documents. The basic idea is as follows: each cluster created from the retrieved documents can be seen as giving a context for the documents belonging to it; by calculating the similarity between each cluster and the query, we can therefore identify the contexts relevant to the query; documents that belong to a more relevant context, or cluster, are more likely to be relevant to the query.</Paragraph> <Paragraph position="1"> It should be noted that static global clustering is not practical in the current setup, because it requires too much computation and the document space is too sparse (see Anick and Vaithyanathan (1997) for a comparison of static and dynamic clustering).</Paragraph> <Paragraph position="2"> We build clusters with an incremental centroid method. There are several variations of agglomerative clustering; the agglomerative centroid method joins the pair of clusters with the most similar centroids at each stage (Frakes and Baeza-Yates, 1992).</Paragraph> <Paragraph position="3"> The incremental centroid clustering method is straightforward. Documents are fed to the incremental clustering in the rank order of the top-ranked N documents resulting from vector space retrieval for a query.</Paragraph> <Paragraph position="4"> Documents and cluster centroids are represented as vectors. For the first input document (rank 1), we create one cluster whose only member is that document. For each subsequent document (rank 2, ..., N), we compute the cosine similarity between the document and the centroid of every existing cluster. If the similarity between the document and a cluster is above a threshold, the document is added to the cluster as a member and the cluster centroid is updated. Otherwise, a new cluster is created with this document. Note that one document can be a member of several clusters, as shown in Figure 2 (solid lines indicate that a document belongs to a cluster).</Paragraph> <Paragraph position="5"> Similarities between the clusters and the query, or query-cluster similarities, are calculated by combining the query inclusion ratio with the vector inner product between the query vector and the cluster centroid vector:

simC(q,c) = \frac{|c_q|}{|q|} \cdot \sum_{t} w_{q,t} \cdot w_{c,t} \quad (2)

where |c_q| is the number of query terms included in the cluster centroid, |q| is the number of query terms, and |c_q|/|q| is the query inclusion ratio for the cluster. The documents included in the same cluster have the same query-cluster similarity.</Paragraph> <Paragraph position="6"> Cluster preferences are influenced by the query inclusion ratio, which prefers clusters whose centroids include a wider variety of query terms. Incorporating this information into the weighting of each document thus adds information about the behavior of terms in documents, as well as about the association between terms and documents, to the evaluation of each document's relevance; it therefore has the effect of ambiguity resolution. A minimal code sketch of this clustering step is given below.</Paragraph>
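To make the incremental centroid clustering and the query-cluster similarity of equation (2) concrete, the following Python sketch processes the top-ranked documents as sparse term-weight vectors; the centroid update (a running mean of member vectors), the data structures, and the function names are illustrative assumptions rather than the authors' exact implementation.

# Minimal sketch of Section 2.3: incremental centroid clustering of the
# top-N retrieved documents and the query-cluster similarity of eq. (2).
# Documents, centroids, and the query are sparse term->weight dicts.
import math
from typing import Dict, List

Vector = Dict[str, float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class Cluster:
    def __init__(self, doc_id: int, vec: Vector):
        self.members = [doc_id]
        self.centroid = dict(vec)

    def add(self, doc_id: int, vec: Vector) -> None:
        self.members.append(doc_id)
        n = len(self.members)
        # Assumed centroid update: running mean of the member vectors.
        for t in set(self.centroid) | set(vec):
            self.centroid[t] = (self.centroid.get(t, 0.0) * (n - 1)
                                + vec.get(t, 0.0)) / n

def incremental_clustering(ranked_docs: List[Vector],
                           threshold: float = 0.41) -> List[Cluster]:
    clusters: List[Cluster] = []
    for doc_id, vec in enumerate(ranked_docs):      # processed in rank order
        hits = [c for c in clusters if cosine(vec, c.centroid) > threshold]
        if hits:
            for c in hits:                           # multiple membership allowed
                c.add(doc_id, vec)
        else:
            clusters.append(Cluster(doc_id, vec))
    return clusters

def query_cluster_sim(query: Vector, centroid: Vector) -> float:
    # Eq. (2): query inclusion ratio times the query-centroid inner product.
    included = sum(1 for t in query if t in centroid)
    inclusion_ratio = included / len(query) if query else 0.0
    inner = sum(w * centroid.get(t, 0.0) for t, w in query.items())
    return inclusion_ratio * inner

In this sketch a document joins every cluster whose centroid clears the threshold, which reproduces the multiple-membership behavior illustrated in Figure 2; the default threshold of 0.41 is the value learned in Section 3.2.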
<Paragraph position="7"> 2.4 Reflecting cluster information to the documents Using the query-cluster similarity, we re-calculate the relevance of each document according to the following equation:

sim(q,d) = sim_{VS}(q,d) \times simC(q,c) \quad (3)

where sim_{VS}(q,d) is the query-document similarity computed by vector space retrieval as defined in equation (1), and simC(q,c) is the query-cluster similarity, defined in equation (2), of a cluster c to which document d belongs. Since each document can be a member of several clusters, we assign the highest query-cluster similarity value to the document. The new document similarity, sim(q,d), is thus calculated by multiplying a query-cluster similarity and a query-document similarity, and the top-ranked N documents are re-ranked by this new similarity. We also tried a weighted sum of the query-document similarity and the query-cluster similarity, but the combination by multiplication showed better performance than the weighted sum.</Paragraph> <Paragraph position="8"> Through this procedure, we can effectively take into account the contexts of all the terms in a document as well as of the query terms. Thus, even a document with a low query-document similarity can obtain a high query-cluster similarity thanks to the effect of the neighboring documents in the same cluster; the reverse can be true as well. A sketch of this re-ranking step follows.</Paragraph> </Section> </Section>
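The following minimal Python sketch illustrates the re-ranking of equation (3): each document's vector-space similarity is multiplied by the highest query-cluster similarity among the clusters it belongs to. The data layout (document ids paired with precomputed similarities) and the sample numbers are hypothetical.

# Sketch of Section 2.4: combine sim_VS(q,d) with the best query-cluster
# similarity of the clusters containing d (eq. (3)) and re-rank by the product.
from typing import Dict, List, Tuple

def rerank(doc_sims: Dict[int, float],
           clusters: List[Tuple[List[int], float]]) -> List[Tuple[int, float]]:
    """doc_sims: sim_VS(q,d) per document id.
    clusters: (member document ids, query-cluster similarity simC) pairs."""
    best: Dict[int, float] = {}
    for members, sim_c in clusters:
        for d in members:
            # A document in several clusters keeps its highest simC value.
            best[d] = max(best.get(d, 0.0), sim_c)
    reranked = {d: s * best.get(d, 0.0) for d, s in doc_sims.items()}
    return sorted(reranked.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical example: document 2 is lifted above document 1 by its cluster.
print(rerank({0: 0.90, 1: 0.70, 2: 0.60},
             [([0, 2], 0.8), ([1], 0.3)]))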
<Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 Experimental environment </SectionTitle> <Paragraph position="0"> We evaluated our method on the TREC-6 CLIR test collection, which contains 242,918 English documents (AP news from 1988 to 1990) and 24 English queries. The English queries were translated into Korean manually. We use the title field of the queries, which consist of three fields: title, description, and narrative.</Paragraph> <Paragraph position="1"> In dictionary-based query translation, one query term can have multiple translations. Table 3 shows the degree of ambiguity: the number of Korean query terms is 47, the number of translated terms is 149, and the average number of translations per term is 3.2.</Paragraph> <Paragraph position="2"> In our experiments, we use only the 14 queries that consist of more than one term, in order to observe the real effects of our method. If a query consists of more than one term, a human can select the correct meaning of each term from its neighbours; but if a query consists of a single polysemous term such as 'bank', no one can resolve the ambiguity without additional external information. The remaining 10 single-term queries are used to decide the threshold for incremental clustering.</Paragraph> <Paragraph position="3"> We use the SMART system (Salton, 1989), developed at Cornell, as the vector space retrieval engine.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Results </SectionTitle> <Paragraph position="0"> Retrieval effectiveness was evaluated using the 11-point average precision metric.</Paragraph> <Paragraph position="1"> We compared our method with original English queries, with translated queries retaining their ambiguities, and with translated queries keeping only the best translation after disambiguation. The compared methods are briefly described as follows: 1) monolingual: the performance of the vector space retrieval system for the original English queries, used as the monolingual baseline.</Paragraph> <Paragraph position="2"> 2) tall_base: the performance of the vector space retrieval system for translated English queries that keep all possible translations in the bilingual dictionaries, without ambiguity resolution.</Paragraph> <Paragraph position="3"> 3) tall_rerank: the performance of the proposed method, using dynamic incremental clusters, on the documents retrieved by tall_base.</Paragraph> <Paragraph position="4"> 4) tone_base: the performance of the vector space retrieval system for translated queries keeping only the best translation for each query term after ambiguity resolution based on mutual information.</Paragraph> <Paragraph position="5"> 5) tone_rerank: the performance of the proposed method, using dynamic incremental clusters, on the documents retrieved by tone_base.</Paragraph> <Paragraph position="6"> 'tall_rerank' and 'tone_rerank' use our implicit disambiguation method. The number of top-ranked documents N used in dynamic incremental clustering is 300, and the threshold for incremental centroid clustering is set to 0.41, learned from the 10 single-term training queries, for both tall_rerank and tone_rerank.</Paragraph> <Paragraph position="7"> The main objective of this paper is to observe the performance change brought by incremental clusters for translated queries with ambiguities (tall_base and tall_rerank).</Paragraph> <Paragraph position="9"> To observe the effect of the clusters, we also compared the results after disambiguation based on mutual information (tone_base and tone_rerank). We selected the best translation among all translation candidates based on mutual information. Mutual information MI(x,y) is defined as follows (Church and Hanks, 1990):

MI(x,y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}

where P(x) and P(y) are estimated from the frequencies f(x) and f(y) of term x and term y, respectively, and P(x,y) from their co-occurrence frequency f(x,y), taken within a window of size 6 over the AP 1988 news documents.</Paragraph>
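As an illustration of this mutual-information-based selection, the sketch below picks, for each Korean term, the English translation with the highest summed MI against the translation candidates of the other query terms. The counting of co-occurrences within the 6-word window is assumed to be done beforehand, and this particular selection strategy is an assumption on our part, not necessarily the authors' exact procedure.

# Sketch of MI-based disambiguation used for tone_base: for each Korean term,
# keep the English translation with the highest total MI against the
# translation candidates of the other query terms. f, f_pair, and n (corpus
# size) are illustrative inputs assumed to be precomputed.
import math
from typing import Dict, List, Tuple

def mi(x: str, y: str, f: Dict[str, int],
       f_pair: Dict[Tuple[str, str], int], n: int) -> float:
    """MI(x,y) = log2(P(x,y) / (P(x) P(y))), probabilities estimated from counts."""
    fxy = f_pair.get((x, y), 0) or f_pair.get((y, x), 0)
    if not fxy or not f.get(x) or not f.get(y):
        return 0.0
    return math.log2((fxy / n) / ((f[x] / n) * (f[y] / n)))

def best_translations(candidates: List[List[str]],
                      f: Dict[str, int],
                      f_pair: Dict[Tuple[str, str], int],
                      n: int) -> List[str]:
    """candidates[i] holds the English translations of the i-th Korean term."""
    chosen: List[str] = []
    for i, cands in enumerate(candidates):
        others = [t for j, c in enumerate(candidates) if j != i for t in c]
        # Score each candidate by its summed MI with the other terms' candidates.
        chosen.append(max(cands, key=lambda x: sum(mi(x, y, f, f_pair, n)
                                                   for y in others)))
    return chosen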
<Paragraph position="10"> The 11-point average precision values, the ratio to the monolingual result (C/M), and the performance changes are summarized in Table 2.</Paragraph> <Paragraph position="11"> The retrieval effectiveness of tall_rerank is 0.2780, corresponding to 97.27% of the monolingual performance. tone_rerank yields 0.3026 (105.87%), which is even better than the monolingual performance.</Paragraph> <Paragraph position="12"> Our implicit ambiguity resolution method applied to all translations (tall_rerank) shows an 8.63% improvement over ambiguity resolution based on mutual information alone (tone_base). Compared with plain vector space retrieval, the proposed method achieved a 28% improvement for the all-translations queries and an 18% improvement for the best-translation queries. Our method applied after mutual-information disambiguation (tone_rerank) improved about 39.6% over vector space retrieval for the all-translations queries (tall_base).</Paragraph> <Paragraph position="13"> The cluster-based implicit disambiguation method, therefore, is more effective for performance improvement than simple query disambiguation based on mutual information; used together, they yield a further improvement.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 Result analysis </SectionTitle> <Paragraph position="0"> We examined the effect of our method on a query whose ambiguity increased after bilingual dictionary-based term translation.</Paragraph> <Paragraph position="1"> The Korean query is 'jadongca[ja-dong-cha] gonggi[gong-gi] oyeom[o-yeom]', whose original English query is 'automobile air pollution'. The query was translated with all the possible translations found in the Korean-English dictionaries. In this query, the term 'gonggi' is polysemous, with several meanings such as <air>, <atmosphere>, <jackstone>, <co-occurrence>, and <bowl>. This is what degrades system performance.</Paragraph> <Paragraph position="2"> For the 300 documents retrieved for this query, 146 clusters were created, containing 435 document memberships in total. The distribution of cluster members is shown in Figure 3. Most non-relevant documents tended to form singleton clusters, while most relevant documents formed large clusters.</Paragraph> <Paragraph position="3"> We examined the clusters to see how they help resolve ambiguity and reflect context. Cluster C4 in Figure 3 has 60 members: 56 of the 209 documents relevant to this query and 4 non-relevant documents. Its centroid includes terms closely related to the query with translation ambiguities.</Paragraph> <Paragraph position="4"> Although this centroid includes the noise term 'atmosphere', its weight is low; the other terms are appropriate to the query, being synonyms of its terms. Since all of the query terms are included in the centroid, the query inclusion ratio is 1 and all the synonyms contribute positively to the vector inner product. The cluster preference is therefore high, and the ranks of all documents in this cluster are raised; the cluster acts as a context for the documents relevant to the query. Cluster C85, in contrast, is a singleton whose centroid includes translations of only one of the three query terms: bowl (0.101) and marble (0.19). Since its query inclusion ratio is low, the cluster preference is low, and its effect on the document is weak.</Paragraph> <Paragraph position="5"> Figure 4 presents the rank changes, calculated by subtracting the ranks given by our method (tall_rerank) from those given by vector space retrieval (tall_base), for each relevant document of this ambiguous query. Most documents are ranked higher after cluster analysis, although some are ranked lower. Figure 5 shows recall/precision curves for the original English query (monolingual; 0.6783 in 11-point average precision), the translated query without disambiguation (tall_base; 0.5635), and our method (tall_rerank; 0.6622). Despite the increased query ambiguity, we achieve 97.62% of the monolingual retrieval performance.</Paragraph> <Paragraph position="6"> These results indicate that cluster analysis helps to resolve ambiguity.
Thus, we could effectively take into account the context of all the terms in a document as well as the query terms.</Paragraph> </Section> </Section> </Paper>