<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1075"> <Title>The Effect of Translation Quality in MT-Based Cross-Language Information Retrieval</Title> <Section position="5" start_page="593" end_page="594" type="metho"> <SectionTitle> 3 System Description </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="593" end_page="593" type="sub_section"> <SectionTitle> 3.1 The Rule-Based ECMT System </SectionTitle> <Paragraph position="0"> The MT system used in this research is a rule-based ECMT system whose translation quality is comparable to that of the best commercial ECMT systems. The system is based on semantic transfer (Amano et al., 1989).</Paragraph> <Paragraph position="1"> The translation resources included in the system are a large dictionary and a rule base. The rule base consists of rules with different functions, such as analysis, transfer, and generation.</Paragraph> </Section> <Section position="2" start_page="593" end_page="594" type="sub_section"> <SectionTitle> 3.2 KIDS IR System </SectionTitle> <Paragraph position="0"> KIDS is an information retrieval engine based on morphological analysis (Sakai et al., 2003). It employs the Okapi/BM25 term weighting scheme, as fully described in (Robertson & Walker, 1999; Robertson & Sparck Jones, 1997).</Paragraph> <Paragraph position="1"> To focus our study on the relationship between MT performance and retrieval effectiveness, we do not use techniques such as pseudo-relevance feedback, although they are available and are known to improve IR performance.</Paragraph> </Section> </Section> <Section position="6" start_page="594" end_page="594" type="metho"> <SectionTitle> 4 Experimental Method </SectionTitle> <Paragraph position="0"> To obtain MT systems of varying quality, we degrade the rule-based ECMT system by impairing its translation resources. We then use the degraded MT systems to translate the queries and evaluate the translation quality. Next, we submit the translated queries to the KIDS system and evaluate the retrieval performance. Finally, we calculate the correlation between the variation in translation quality and the variation in retrieval effectiveness to analyze the relationship between MT performance and CLIR performance.</Paragraph> <Section position="1" start_page="594" end_page="594" type="sub_section"> <SectionTitle> 4.1 Degradation of MT System </SectionTitle> <Paragraph position="0"> In this research, we degrade the MT system in two ways. The first is rule-based degradation, which decreases the size of the rule base by randomly removing rules from it. For the sake of simplicity, we consider only the transfer rules, which transfer the source language to the target language, and leave all other kinds of rules untouched; that is, we consider only the influence of transfer rules on translation quality. (In the remainder of this paper, &quot;rules&quot; refers to transfer rules unless explicitly stated otherwise.) We first randomly divide the rules into segments of equal size. Then we remove segments from the rule base, one more at each step, and obtain a group of degraded rule bases. Afterwards, we use MT systems with the degraded rule bases to translate the queries, obtaining groups of translated queries of different translation quality.</Paragraph> <Paragraph position="1"> The second is dictionary-based degradation, which decreases the size of the dictionary by iteratively removing a fixed number of randomly selected word entries. Function words are not removed from the dictionary. Using MT systems with the degraded dictionaries, we likewise obtain groups of translated queries of different translation quality.</Paragraph>
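<Paragraph> As an illustration of the procedure above, the following is a minimal sketch of the segmentation-and-removal scheme. It is not the authors' actual tooling: the function name and the use of Python's random module are our own, and the 27,000-rule/36-segment figures anticipate the setup reported in Section 5.5. </Paragraph> <Paragraph>
import random

def build_degraded_resources(items, n_segments, seed=0):
    """Split a translation resource (transfer rules or dictionary
    entries) into equal-sized random segments and return progressively
    degraded versions: the full resource first, then one more segment
    removed at each step, down to complete degradation."""
    items = list(items)
    random.Random(seed).shuffle(items)  # random division into segments
    size = len(items) // n_segments
    segments = [items[i * size:(i + 1) * size] for i in range(n_segments)]
    degraded = []
    for removed in range(n_segments + 1):  # 0 .. n_segments segments removed
        degraded.append([r for seg in segments[removed:] for r in seg])
    return degraded

# E.g., 27,000 transfer rules split into 36 segments of 750 rules each,
# yielding 37 rule bases from no degradation to complete degradation.
rule_bases = build_degraded_resources(range(27000), n_segments=36)
assert len(rule_bases[0]) == 27000 and len(rule_bases[-1]) == 0
</Paragraph>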
</Section> <Section position="2" start_page="594" end_page="594" type="sub_section"> <SectionTitle> 4.2 Evaluation of Performance </SectionTitle> <Paragraph position="0"> We measure the performance of the MT system by translation quality, using the NIST score as the evaluation measure (Doddington, 2002). The NIST scores reported in this paper are generated by the NIST scoring toolkit, which can be downloaded from http://www.nist.gov/speech/tests/mt/resources/scoring.htm.</Paragraph> <Paragraph position="1"> For retrieval performance, we use Mean Average Precision (MAP) as the evaluation measure (Voorhees, 2003). The MAP values reported in this paper are generated by the trec_eval toolkit, the standard tool used by TREC for evaluating an ad hoc retrieval run.</Paragraph> </Section> </Section> <Section position="7" start_page="594" end_page="598" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="594" end_page="594" type="sub_section"> <SectionTitle> 5.1 Data </SectionTitle> <Paragraph position="0"> The experiments are conducted on the TREC-5&6 Chinese collection, which consists of a document set, a topic set, and a relevance judgment file.</Paragraph> <Paragraph position="1"> The document set contains articles published in the People's Daily from 1991 to 1993 and news articles released by the Xinhua News Agency in 1994 and 1995; it includes 164,789 documents in total. The topic set contains 54 topics. The relevance judgment file gives a binary indication of relevant (1) or non-relevant (0).</Paragraph> </Section> <Section position="2" start_page="594" end_page="595" type="sub_section"> <SectionTitle> 5.2 Query Formulation & Evaluation </SectionTitle> <Paragraph position="0"> For each TREC topic, three fields are provided in both Chinese and English, as shown in figure 1: title, description, and narrative. The title field is a statement of the topic. The description field lists terms that describe the topic. The narrative field provides a complete description of document relevance for the assessors. In our experiments, we use two kinds of queries: title queries (using only the title field) and desc queries (using only the description field). We do not use the narrative field because it states the criteria used by the assessors to judge whether a document is relevant, and so it usually contains quite a number of unrelated words.</Paragraph> <Paragraph position="1"> Title queries are one-sentence queries. Using the NIST scoring tool to evaluate the translation quality of the MT system requires reference translations of the source-language sentences; the tool supports multiple references. In our experiments, we introduce two reference translations for each title query: one is the Chinese title (C-title) in the title field of the original TREC topic (reference translation 1); the other is a translation of the title query given by a human translator (reference translation 2). This alleviates the bias that a single reference translation would introduce into the translation evaluation.</Paragraph>
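<Paragraph> To make the multi-reference evaluation concrete, the sketch below computes a corpus-level NIST score with two references per query. The paper uses the NIST scoring toolkit; as an illustration we use NLTK's NIST implementation instead, which also accepts multiple references. The token sequences are invented stand-ins for real query translations. </Paragraph> <Paragraph>
from nltk.translate.nist_score import corpus_nist

# One hypothesis (MT output) per title query, each paired with two
# references: the C-title from the TREC topic and a human translation
# of the English title query (all token sequences invented here).
hypotheses = [["zai", "zhong", "guo", "de", "ji", "qi", "ren", "ji", "zhu", "yan", "jiu"]]
references = [[["zhong", "guo", "ji", "qi", "ren", "yan", "jiu"],
               ["zai", "zhong", "guo", "de", "ji", "qi", "ren", "ji", "zhu", "yan", "jiu"]]]

score = corpus_nist(references, hypotheses, n=5)  # NIST scores up to 5-grams
print(round(score, 4))
</Paragraph>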
<Paragraph position="2"> An example of a title query and its reference translations is shown in figure 2. Reference 1 is the Chinese title provided in the original TREC topic; reference 2 is the human translation of the query. For this query, the translation output generated by the MT system is &quot;Zai Zhong Guo De Ji Qi Ren Ji Zhu Yan Jiu&quot;. If only reference 1 were used as the reference translation, the system output would not be regarded as a good translation, although in fact it is a good translation of the query. Introducing reference 2 helps to alleviate this unfair evaluation.</Paragraph> <Paragraph position="3"> A desc query is not a sentence but a string of terms that describe the topic; each term is a word, a phrase, or a string of words. A desc query is thus not a proper input for the MT system, but the MT system still works: it translates the desc query term by term. When the term is a word or a phrase that exists in the dictionary, the MT system looks up the dictionary and takes the first translation in the entry as the translation of the term, without any further analysis. When the term is a string of words such as &quot;number(Shu Liang ) of(De ) infections(Gan Ran )&quot;, the system translates the term into &quot;Gan Ran Shu Liang &quot;. Besides using the Chinese description (C-desc) in the description field of the original TREC topic as the reference translation of each desc query, we also had the human translator give another reference translation for each desc query. A comparison of the two references showed that they are very similar to each other, so in our final experiments we use only one reference for each desc query: the Chinese description (C-desc) provided in the original TREC topic. An example of a desc query and its reference translation is shown in figure 3.</Paragraph> </Section> <Section position="3" start_page="595" end_page="596" type="sub_section"> <SectionTitle> 5.3 Runs </SectionTitle> <Paragraph position="0"> Previous studies (Kwok, 1997; Nie et al., 2000) showed that word indexes and n-gram indexes lead to comparable performance for Chinese IR, so in our experiments we use bi-grams as index units (a minimal sketch of bi-gram indexing appears at the end of this subsection).</Paragraph> <Paragraph position="1"> We conduct the following runs to analyze the relationship between MT performance and CLIR performance:
title-cn1: uses reference translation 1 of each title query as the Chinese query;
title-cn2: uses reference translation 2 of each title query as the Chinese query;
desc-cn: uses the reference translation of each desc query as the Chinese query.</Paragraph> <Paragraph position="2"> Among the three monolingual runs, desc-cn achieves the best performance, and title-cn1 achieves better performance than title-cn2, which indicates that directly using the Chinese title as the Chinese query performs better than using a human translation of the title query.</Paragraph>
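<Paragraph> Bi-gram indexing as used above is straightforward to sketch: overlapping character bi-grams serve as index units, a common segmentation-free strategy for Chinese IR. The following minimal illustration is our own, not part of KIDS. </Paragraph> <Paragraph>
def char_bigrams(text):
    """Return overlapping character bi-grams as index units,
    e.g. "ABCD" yields ["AB", "BC", "CD"]."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) == 1:
        return chars  # fall back to the single character
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

# Documents and (translated) queries are indexed the same way before
# Okapi/BM25 term weighting is applied.
print(char_bigrams("机器人技术"))  # ['机器', '器人', '人技', '技术']
</Paragraph>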
</Section> <Section position="4" start_page="596" end_page="596" type="sub_section"> <SectionTitle> 5.5 Results on Rule-Based Degradation </SectionTitle> <Paragraph position="0"> There are 27,000 transfer rules in the rule base in total, and we use all of them in the rule-based degradation experiment. The 27,000 rules are randomly divided into 36 segments, each containing 750 rules. To degrade the rule base, we start with no degradation and then remove one more segment at each step, up to complete degradation with all segments removed. With each segment removed from the rule base, the MT system based on the degraded rule base produces a group of translations for the input queries; even the completely degraded system, with all segments removed, can still produce a group of rough translations.</Paragraph> <Paragraph position="1"> Figures 4 and 5 show the experimental results on title queries (rule-title) and desc queries (rule-desc) respectively.</Paragraph> <Paragraph position="2"> Figure 4(a) shows the changes in translation quality of the degraded MT systems on title queries. From the result, we observe a steady change in MT performance: the fewer the rules, the worse the translation quality. The NIST score varies from 7.3548 at no degradation to 5.9155 at complete degradation. Figure 4(b) shows the changes in retrieval performance when the translations generated by the degraded MT systems are used as queries. The MAP varies from 0.3126 at no degradation to 0.2810 at complete degradation. Comparing figures 4(a) and 4(b) shows that translation quality and retrieval performance vary similarly: the better the translation quality, the better the retrieval performance.</Paragraph> <Paragraph position="3"> Figure 5(a) shows the changes in translation quality of the degraded MT systems on desc queries, and figure 5(b) shows the corresponding changes in retrieval performance. We observe a relationship between MT performance and retrieval performance similar to that for title queries. The NIST score varies from 5.0297 at no degradation to 4.8497 at complete degradation; the MAP varies from 0.2877 at no degradation to 0.2759 at complete degradation.</Paragraph> </Section> <Section position="5" start_page="596" end_page="596" type="sub_section"> <SectionTitle> 5.6 Results on Dictionary-Based Degradation </SectionTitle> <Paragraph position="0"> The dictionary contains 169,000 word entries. To make the results on dictionary-based degradation comparable to those on rule-based degradation, we degrade the dictionary so that the variation interval in translation quality is similar to that of the rule-based degradation. We randomly select 43,200 word entries for degradation; these entries do not include function words. We split them equally into 36 segments and remove one segment from the dictionary at each step until all segments are removed, obtaining 36 degraded dictionaries. We use the MT systems with the degraded dictionaries to translate the queries and observe the changes in translation quality and retrieval performance. The experimental results on title queries (dic-title) and desc queries (dic-desc) are shown in figures 6 and 7 respectively.</Paragraph> <Paragraph position="1"> From the results, we observe a relationship between translation quality and retrieval performance similar to what we observed in the rule-based degradation: for both title queries and desc queries, the larger the dictionary, the better the NIST score and the MAP. For title queries, the NIST score varies from 7.3548 at no degradation to 6.0067 at complete degradation, and the MAP varies from 0.3126 at no degradation to 0.1894 at complete degradation. For desc queries, the NIST score varies from 5.0297 at no degradation to 4.4879 at complete degradation, and the MAP varies from 0.2877 at no degradation to 0.2471 at complete degradation.</Paragraph>
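<Paragraph> For reference, the MAP figures reported here are uninterpolated mean average precision as computed by trec_eval (Section 4.2). The sketch below is our own simplified illustration, assuming the collection's binary relevance judgments; it is not trec_eval itself. </Paragraph> <Paragraph>
def average_precision(ranked_doc_ids, relevant_ids):
    """Uninterpolated AP: the mean of the precision values at the rank
    of each relevant document retrieved, normalized by the total number
    of relevant documents for the query."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: a list of (ranked_doc_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy check: relevant docs d1 and d3 retrieved at ranks 1 and 3:
print(average_precision(["d1", "d2", "d3"], {"d1", "d3"}))  # (1/1 + 2/3) / 2 = 0.8333...
</Paragraph>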
</Section> <Section position="6" start_page="596" end_page="598" type="sub_section"> <SectionTitle> 5.7 Summary of the Results </SectionTitle> <Paragraph position="0"> Here we summarize the results of the four runs.</Paragraph> </Section> </Section> <Section position="8" start_page="598" end_page="598" type="metho"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> Based on our observations, we analyze the correlations between NIST scores and MAPs, as listed in Table 3. In general, there is a strong correlation between translation quality and retrieval effectiveness: the correlations are above 95% for all four runs, which means that better MT performance generally leads to better retrieval performance.</Paragraph> <Section position="1" start_page="598" end_page="598" type="sub_section"> <SectionTitle> 6.1 Impacts of Query Format </SectionTitle> <Paragraph position="0"> For the Chinese monolingual runs, retrieval based on desc queries achieves better performance than retrieval based on title queries. This is because a desc query consists of terms that relate to the topic, i.e., all the terms in a desc query are precise query terms, whereas a title query is a sentence, which usually introduces words that are unrelated to the topic.</Paragraph> <Paragraph position="1"> The results on bilingual retrieval are just the opposite: title queries perform better than desc queries. Moreover, the MAP at no degradation for title queries is 0.3126, which is about 99.46% of the performance of the monolingual run title-cn1 and outperforms the title-cn2 run, whereas the MAP at no degradation for desc queries is 0.2877, just 81.87% of the performance of the monolingual run desc-cn.</Paragraph> <Paragraph position="2"> A comparison of the results shows that the MT system performs better on title queries than on desc queries. This is reasonable because desc queries are strings of terms, whereas the MT system is optimized for grammatically correct sentences rather than word-by-word translation. Given the correlation between translation quality and retrieval effectiveness, it is to be expected that title queries achieve better retrieval results than desc queries.</Paragraph> </Section> <Section position="2" start_page="598" end_page="598" type="sub_section"> <SectionTitle> 6.2 Impacts of Rules and Dictionary </SectionTitle> <Paragraph position="0"> Table 4 shows the fall in NIST score and MAP at complete degradation compared with the NIST score and MAP achieved at no degradation.</Paragraph> <Paragraph position="1"> A comparison of the title-query results shows that similar variations in translation quality lead to quite different variations in retrieval effectiveness. For the rule-title run, a 19.57% reduction in translation quality results in a 10.11% reduction in retrieval effectiveness, but for the dic-title run, an 18.33% reduction in translation quality results in a 39.41% reduction in retrieval effectiveness. This indicates that, for title queries, retrieval effectiveness is more sensitive to the size of the dictionary than to the size of the rule base.</Paragraph>
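<Paragraph> The reductions above are simple relative falls between the no-degradation and complete-degradation endpoints reported in Sections 5.5 and 5.6, which the sketch below reproduces. The commented lines illustrate how the Table 3 correlations could be computed over the full series of 37 measurement points, assuming a Pearson coefficient; the paper does not name the coefficient used. </Paragraph> <Paragraph>
def relative_fall(at_no_degradation, at_complete_degradation):
    """Percentage fall from the undegraded to the fully degraded system."""
    return 100.0 * (at_no_degradation - at_complete_degradation) / at_no_degradation

# Endpoints reported in Sections 5.5 and 5.6 (title queries):
print(relative_fall(7.3548, 5.9155))  # rule-title NIST: ~19.57
print(relative_fall(0.3126, 0.2810))  # rule-title MAP:  ~10.11
print(relative_fall(7.3548, 6.0067))  # dic-title NIST:  ~18.33
print(relative_fall(0.3126, 0.1894))  # dic-title MAP:   ~39.41

# Table 3 correlations over the paired (NIST, MAP) series, e.g.:
# from scipy.stats import pearsonr
# r, _ = pearsonr(nist_scores, map_scores)  # above 0.95 for all four runs
</Paragraph>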
<Paragraph position="2"> Why does dictionary-based degradation have a stronger impact on retrieval effectiveness than rule-based degradation? The reason is that retrieval systems are typically more tolerant of syntactic translation errors than of semantic ones (Fluhr, 1997). Therefore, although the syntactic errors caused by degrading the rule base reduce translation quality, they have a smaller impact on retrieval effectiveness than the word translation errors caused by degrading the dictionary.</Paragraph> <Paragraph position="3"> For desc queries, there is no big difference between dictionary-based and rule-based degradation. This is because the MT system translates desc queries term by term, so degradation of the rule base mainly results in word translation errors rather than syntactic errors. Thus, degradation of the dictionary and degradation of the rule base have similar effects on retrieval effectiveness.</Paragraph> </Section> </Section> <Section position="9" start_page="598" end_page="599" type="metho"> <SectionTitle> 7 Conclusion and Future Work </SectionTitle> <Paragraph position="0"> In this paper, we investigated the effect of translation quality in MT-based CLIR. Our study showed that the performance of the MT system and the performance of the IR system correlate highly with each other. We further analyzed two main factors in MT-based CLIR. One factor is the query format: we concluded that title queries are preferable for MT-based CLIR, because the MT system is usually optimized for translating sentences rather than words. The other factor is the translation resources included in the MT system: our observations showed that the size of the dictionary has a stronger effect on retrieval effectiveness than the size of the rule base. Therefore, to improve the retrieval effectiveness of an MT-based CLIR application, it is more effective to develop a larger dictionary than to develop more rules.</Paragraph> <Paragraph position="1"> This raises another interesting question about MT-based CLIR: how can CLIR benefit further from MT? Directly using the translations generated by the MT system may not be the best choice for the IR system. Rich features are generated during the translation procedure; would such features be helpful to CLIR? This is the question we would like to answer in future work.</Paragraph> </Section> </Paper>