<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1905"> <Title>Keyword Translation Accuracy and Cross-Lingual Question Answering in Chinese and Japanese</Title> <Section position="4" start_page="0" end_page="32" type="metho"> <SectionTitle> 3 Extension for Cross-Lingual QA </SectionTitle> <Paragraph position="0"> Because of JAVELIN's modular design, significant changes to the monolingual architecture were not required. We customized the system to handle Unicode characters and to "plug in" cross-lingual components and resources.</Paragraph> <Paragraph position="1"> For the Question Analyzer, we created the Keyword Translator, a sub-module for translating keywords. The Retrieval Strategist was adapted to search multilingual corpora. The Information Extractors use language-independent extraction algorithms. The Answer Generator uses language-specific sub-modules for normalization and a language-independent algorithm for answer ranking. The overall cross-lingual architecture is shown in Figure 2. The rest of this section explains the details of each module.</Paragraph>
Figure 1: JAVELIN Monolingual Architecture
Figure 2: JAVELIN Architecture with Cross-Lingual Extension
<Section position="1" start_page="0" end_page="31" type="sub_section"> <SectionTitle> 3.1 Question Analyzer </SectionTitle> <Paragraph position="0"> The Question Analyzer (QA) is responsible for extracting information from the input question in order to formulate a representation of the information required to answer the question.</Paragraph> <Paragraph position="1"> Input questions are processed using the RASP parser (Korhonen and Briscoe, 2004), and the module output contains three main components: a) the selected keywords; b) the answer type (e.g. numeric-expression, person-name, location); and c) the answer subtype (e.g. author, river, city). The selected keywords are words or phrases which are expected to appear in documents containing correct answers. To reduce noise in the document retrieval phase, we use stop-word lists to eliminate high-frequency terms; for example, the term "old" is not included as a keyword for "how-old" questions.</Paragraph>
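To make the analyzer's three outputs concrete, here is a minimal sketch (ours, not JAVELIN's actual data format); the field names, stop-word list, and the "age" subtype label are illustrative stand-ins:

```python
# Hypothetical illustration of the Question Analyzer's outputs;
# field names and the stop-word list are simplified stand-ins.
STOP_WORDS = {"who", "what", "when", "how", "old", "is", "the", "a", "of"}

def select_keywords(tokens):
    """Drop high-frequency stop words, keeping content-bearing terms."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

analysis = {
    "keywords": select_keywords("How old is the Eiffel Tower".split()),
    "answer_type": "numeric-expression",  # component b) above
    "answer_subtype": "age",              # hypothetical subtype label
}
print(analysis["keywords"])  # ['Eiffel', 'Tower'] -- "old" is filtered out
```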
<Paragraph position="3"> We extended the QA module with a keyword translation sub-module, so that translated keywords can be used to retrieve documents from multilingual corpora. This straightforward approach has been used by many other CLQA systems. An alternative approach is to first translate the whole question sentence from English into the target language, and then analyze the translated question. Our reasons for favoring keyword translation are two-fold. First, to translate the question into the target language and analyze it, we would have to replace the English NLP components in the Question Analyzer with their counterparts for the target language. In contrast, keyword translation decouples the question analysis from the translation, and requires no language-specific resources during question analysis. Second, machine translation is not perfect, so the resulting translation(s) of the question may be incomplete or ungrammatical, adding to the complexity of the analysis task. One could argue that translating the full sentence instead of just the keywords makes better use of state-of-the-art machine translation techniques, because more context information is available. But for our application, an accurate translation of functional words (such as prepositions or conjunctions) is less important; we focus on words that carry more content information, such as verbs and nouns. We present more detail on the use of contextual information for disambiguation in the next section.</Paragraph> <Paragraph position="5"> In some recent work (Kwok, 2005; Mori and Kawagishi, 2005), researchers have combined these two approaches, but to date no studies have compared their effectiveness.</Paragraph> </Section> <Section position="2" start_page="31" end_page="32" type="sub_section"> <SectionTitle> 3.2 Translation Module </SectionTitle> <Paragraph position="0"> The Translation Module (TM) is used by the QA module to translate keywords into the language of the target corpus. Instead of combining multiple translation candidates with a disjunctive query operator (Isozaki et al., 2005), the TM selects the best combination of translated keywords from several sources: Machine-Readable Dictionaries (MRDs), Machine Translation systems (MTs) and Web-Mining-Based keyword Translators (WBMTs) (Nagata et al., 2001; Li et al., 2003). For translation from English to Japanese, we used two MRDs, eight MTs and one WBMT. If none of them returns a translation, the word is transliterated into kana for Japanese (for details on transliteration, see Section 5.2). For translation from English to Chinese, we used one MRD, three MTs and one WBMT. After gathering all possible translations for every keyword, the TM uses a noisy channel model to select the best combination of translated keywords, estimating the model statistics from the World Wide Web. Details of the translation selection method are described in the rest of this subsection.</Paragraph> <Paragraph position="2"> The Noisy Channel Model: In the noisy channel model, an undistorted signal passes through a noisy channel and becomes distorted; given the distorted signal, we are to recover the original, undistorted signal. IBM applied the noisy channel model to the translation of sentences from aligned parallel corpora, treating the source-language sentence as the distorted signal and the target-language sentence as the original signal (Brown et al., 1990). We adopt this model for disambiguating keyword translation, with the source-language keyword terms S as the distorted signal and the target-language terms T as the original signal. The TM's job is to find the target-language terms given the source-language terms, i.e. to maximize the probability $P(T \mid S)$. Using Bayes' Rule, we can break the equation down into several components: $P(T \mid S) = \frac{P(S \mid T)\,P(T)}{P(S)}$. Because we are comparing probabilities of different translations of the same source keyword terms, $P(S)$ is constant and we can simplify the problem to:</Paragraph> </Section> </Section> <Section position="5" start_page="32" end_page="35" type="metho"> <SectionTitle> $P(T \mid S) \propto P(T)\,P(S \mid T)$ </SectionTitle> <Paragraph position="0"> We have now reduced the equation to two components: P(T) is the language model and P(S|T) is the translation model.
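As a minimal sketch of this selection procedure (our illustration, assuming the per-term translation model and the language model described below are available as functions), the TM's job amounts to an argmax over candidate combinations:

```python
import itertools

def best_combination(keywords, candidates, p_trans, lm_score):
    """Pick the target-language combination T maximizing P(T) * P(S|T).

    keywords:   source terms [s1, ..., sn]
    candidates: dict mapping each source term to its translation list
    p_trans:    function (s, t) -> P(s|t), the translation model
    lm_score:   function (t1, ..., tn) -> P(T), the language model
    """
    best, best_score = None, -1.0
    for combo in itertools.product(*(candidates[s] for s in keywords)):
        p_s_given_t = 1.0
        for s, t in zip(keywords, combo):
            # per-term product: the independence assumption discussed next
            p_s_given_t *= p_trans(s, t)
        score = lm_score(combo) * p_s_given_t  # P(T) * P(S|T)
        if score > best_score:
            best, best_score = combo, score
    return best
```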
If we assume independence among the translations of the individual terms, we can represent the translation probability of a keyword set by the product of the probabilities of the individual term translations: $P(S \mid T) = \prod_{i=1}^{n} P(s_i \mid t_i)$.</Paragraph> <Paragraph position="2"> Estimating Probabilities using the World Wide Web: To estimate the probabilities of the translation model and the language model, we gather statistics from the World Wide Web. There are three advantages to utilizing the web for gathering translation statistics: 1) it contains documents written in many different languages, 2) it has high coverage of virtually all types of words and phrases, and 3) it is constantly updated. However, we also note that the web contains a lot of noisy data, and building up web statistics is time-consuming unless one has direct access to a web search index.</Paragraph> <Paragraph position="3"> Estimating Translation Model Probabilities: We assume that terms that are translations of each other co-occur more often in mixed-language web pages than terms that are not. This assumption is analogous to Turney's work on the co-occurrence of synonyms (Turney, 2001). Let $\mathrm{hits}(q)$ be the number of web pages retrieved from a certain search engine for query $q$, and let $\mathrm{co}(s_i, t_i)$ be the number of pages containing both $s_i$ and $t_i$. We then define the translation probability of each keyword translation as $P(s_i \mid t_i) = \frac{\log \mathrm{co}(s_i, t_i)}{\log \mathrm{hits}(t_i)}$, where the logarithm is applied to adjust the counts so that translation probabilities remain comparable at higher counts.</Paragraph> <Paragraph position="8"> Estimating Language Model Probabilities: In estimating the language model, we simply obtain the hit count for a conjunction of all the candidate terms in the target language, and divide that count by the sum of the occurrences of the individual terms: $P(T) = \frac{\mathrm{hits}(t_1 \wedge t_2 \wedge \cdots \wedge t_n)}{\sum_{i=1}^{n} \mathrm{hits}(t_i)}$.</Paragraph> <Paragraph position="10"> The final score of a translation candidate for a query is the product of the translation model score P(S|T) and the language model score P(T).</Paragraph>
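The two estimators can be sketched as follows, assuming only a generic hits(query) function that returns a search engine's page count; the query syntax and the exact placement of the logarithm are our reconstruction of the description above:

```python
import math

def p_s_given_t(s, t, hits):
    """Translation model estimate from mixed-language co-occurrence."""
    co = hits(f'"{s}" "{t}"')   # pages containing both terms
    total = hits(f'"{t}"')      # pages containing the target term
    if co < 2 or total < 2:
        return 0.0
    return math.log(co) / math.log(total)  # log-adjusted ratio

def p_t(terms, hits):
    """Language model: conjunction count over summed individual counts."""
    conj = hits(" ".join(f'"{t}"' for t in terms))
    denom = sum(hits(f'"{t}"') for t in terms)
    return conj / denom if denom else 0.0
```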
<Paragraph position="11"> Smoothing and Pruning: As with most statistical calculations in language technologies, there is a data sparseness problem when calculating the language model score. Also, because statistics are gathered in real time by accessing a remote search engine over the internet, processing a single query can take a long time when there is a large number of translation candidates. We describe methods for smoothing the language model and pruning the set of translation candidates below.</Paragraph> <Paragraph position="12"> The data sparseness problem occurs when there are many terms in the query and the terms are relatively rare keywords. When calculating the language model score, it is possible that none of the translation candidates appear on any web page. To address this issue, we propose a "moving-window smoothing" algorithm (sketched in code below):
* When the co-occurrence count of all n target keywords is below a set threshold for all of the translation candidates, we use a moving window of size n-1 that "moves" through the keywords in sequence, splitting the set of keywords into two sets, each with n-1 keywords.
* If the co-occurrence count of each of these sets of keywords is above the threshold, return the product of the language model scores of the two sets as the language model score.
* If not, decrease the window size and repeat until either all of the split sets are above the threshold or n = 1.
</Paragraph> <Paragraph position="15"> The moving-window smoothing technique gradually relaxes the search constraint without losing the "connectivity" of the keywords (there is always overlap between the split parts) before finally backing off to the individual keywords. However, two issues are worth noting with this approach:
1. Moving-window smoothing assumes that keywords that are next to each other are also more semantically related, which may not always be the case.
2. Moving-window smoothing tends to give the keywords near the middle of the question more weight, which may not be desirable.
A better smoothing technique would try all possible "splits" at each stage, but this would greatly increase the time cost. We therefore chose moving-window smoothing as a trade-off between a more robust smoothing technique that tries all possible split combinations and no smoothing at all.</Paragraph> <Paragraph position="19"> The set of possible translation candidates is produced by creating all possible combinations of the translations of the individual keywords. For a question with n keywords and an average of m possible translations per keyword, the number of possible combinations is $m^n$. This quickly becomes intractable, as we would have to access a search engine at least $m^n$ times just for the language model score. Therefore, pruning is needed to cut down the number of translation candidates. We prune possible translation candidates twice during each run, using early and late pruning:
1. Early Pruning: We prune possible translations of the individual keywords before combining them into all possible translations of a query, using a very simple heuristic based on target-word frequency in a word frequency list. Very rare translations produced by a resource are not considered.
2. Late Pruning: We prune possible translation candidates of the entire set of keywords after calculating translation probabilities. Since the calculation of the translation probabilities requires little access to the web, we calculate the language model score only for the top N candidates with the highest translation score and prune the rest.
</Paragraph>
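Here is a sketch of the moving-window smoothing described above; the recursive back-off structure and the threshold value are our interpretation, and lm_count/lm_score stand for the web co-occurrence count and language model score of a keyword subset:

```python
THRESHOLD = 5  # minimum web co-occurrence count (illustrative value)

def smoothed_lm_score(terms, lm_count, lm_score):
    """Language model score with moving-window back-off.

    terms: tuple of target-language keywords.
    """
    n = len(terms)
    if n == 1 or lm_count(terms) >= THRESHOLD:
        return lm_score(terms)
    # Try successively smaller overlapping windows: size n-1, n-2, ...
    for w in range(n - 1, 1, -1):
        windows = [terms[i:i + w] for i in range(n - w + 1)]
        if all(lm_count(win) >= THRESHOLD for win in windows):
            score = 1.0
            for win in windows:  # overlap keeps the keywords "connected"
                score *= lm_score(win)
            return score
    # Final back-off: product over the individual keywords
    score = 1.0
    for t in terms:
        score *= lm_score((t,))
    return score
```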
<Paragraph position="21"> An Example of English-to-Chinese Keyword Translation Selection: Suppose we translate the following question from English to Chinese: "What if Bush leaves Iraq?" Three keywords are extracted: "Bush", "leaves", and "Iraq". Using two MT systems and an MRD, we obtain the candidate translations for each keyword. (For simplicity, smoothing and pruning are not applied in this example.) "Bush" and "leaves" each have two translations because they are ambiguous keywords, while "Iraq" is unambiguous. Translation (1,1) means "bush" as in a shrub, and translation (1,2) refers to the person named Bush. Translation (2,1) is the verb "to go away", and translation (2,2) is the noun for leaf. Note that we want translations (1,2) and (2,1), because they match the word senses intended by the user. We then create all possible combinations of the keywords in the target language. By calculating hits, we obtain the statistics and translation scores shown in Tables 2 and 3. We then use the search engine to obtain language model statistics, and, together with the translation model score, calculate the overall score. As shown in Table 6, we select the combination of translated keywords with the highest overall score (the third candidate), which is the correct translation of the English keywords.</Paragraph> <Section position="1" start_page="34" end_page="34" type="sub_section"> <SectionTitle> 3.3 Retrieval Strategies </SectionTitle> <Paragraph position="0"> The Retrieval Strategist (RS) module retrieves documents from a corpus in response to a query. For document retrieval, the RS uses the Lemur 3.0 toolkit (Ogilvie and Callan, 2001). Lemur supports structured queries using operators such as Boolean AND, Synonym, Ordered/Unordered Window and NOT.</Paragraph> <Paragraph position="1"> In formulating a structured query, the RS uses an incremental relaxation technique, starting from an initial query that is highly constrained; the algorithm searches for all the keywords and data types in close proximity to each other. The priority is based on a function of the likely answer type, the keyword type (word, proper name, or phrase) and the inverse document frequency of each keyword. The query is gradually relaxed until the desired number of relevant documents is retrieved. An illustrative structured query is sketched below.</Paragraph>
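As an illustration (ours, not the paper's original query), a structured query for the translated keywords of the Bush/Iraq example might look like the following, with pinyin standing in for the Chinese terms:

```
#AND( #OD1(bu xi) li kai #OD1(yi la ke) )
```

Here #OD1 requires the characters of a multi-character translation to appear adjacent and in order. Under incremental relaxation, the window sizes would be widened (e.g. #OD1 to #OD2) and constraints dropped step by step until the desired number of documents is retrieved.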
</Section> <Section position="2" start_page="34" end_page="35" type="sub_section"> <SectionTitle> 3.4 Information Extraction </SectionTitle> <Paragraph position="0"> In the JAVELIN system, the Information Extractor (IX) is not a single module that uses one extraction algorithm; rather, it is an abstract interface which allows different information extractor implementations to be plugged into JAVELIN. These different extractors can be used to produce different results for comparison, or the results of running them all in parallel can be merged. Here we describe just one of the extractors, currently the best-performing algorithm in our CLQA experiments: the Light IX.</Paragraph> <Paragraph position="1"> The Light IX module uses simple, distance-based algorithms to find a named entity that matches the expected answer type and is "closest" to all the keywords according to some distance measure. The algorithm considers as answer candidates only those terms that are tagged as named entities matching the desired answer type. The score for an answer candidate a is calculated as follows: $Score(a) = \alpha \cdot OccScore(a) + \beta \cdot DistScore(a)$, where $\alpha + \beta = 1$, OccScore is the occurrence score and DistScore is the distance score. Both OccScore and DistScore return a number between zero and one, and likewise Score returns a number between zero and one. Usually, $\alpha$ is much smaller than $\beta$. The occurrence score formula is $OccScore(a) = \frac{1}{n}\sum_{i=1}^{n} Exist(k_i)$, where $k_i$ is the i-th keyword, n is the number of keywords, and Exist returns 1 if the i-th keyword exists in the document, and 0 otherwise. The distance score for each answer candidate is calculated according to the following formula: $DistScore(a) = \frac{1}{n}\sum_{i=1}^{n} \frac{Exist(k_i)}{Dist(a, k_i)}$. This formula produces a score between zero and one. If the i-th keyword does not exist in the document, the term inside the summation is zero. If the i-th keyword appears more than once in the document, the occurrence closest to the answer candidate is used. An additional restriction is that the answer candidate cannot be one of the keywords. The Dist function is the distance measure, which has two definitions: 1. $Dist(a, b) = TokensApart(a, b)$; 2. $Dist(a, b) = \log(TokensApart(a, b))$. The first definition simply counts the number of tokens between two terms; the second is a logarithmic measure. TokensApart returns the number of tokens from a to b: if a and b are adjacent, the count is 1; if a and b are separated by one token, the count is 2, and so on. A token can be either a character or a word; we used character-based tokenization for E-C and word-based tokenization for E-J. Based on heuristics obtained from training results, we used the linear Dist measure for E-C and the logarithmic Dist measure for E-J in the evaluation.</Paragraph>
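A sketch of this scoring scheme, assuming token positions for the candidate and for each keyword occurrence are available; the alpha/beta values are illustrative (the paper states only that they sum to one with alpha much smaller), and the per-keyword DistScore normalization follows the reconstruction above:

```python
import math

ALPHA, BETA = 0.1, 0.9  # illustrative; only ALPHA + BETA = 1 is required

def light_ix_score(cand_pos, keyword_positions, logarithmic=False):
    """Score an answer candidate at token position cand_pos.

    keyword_positions: one list of token positions per keyword
    (empty list if the keyword does not occur in the document).
    """
    n = len(keyword_positions)
    occ = sum(1 for ps in keyword_positions if ps) / n  # OccScore
    dist_score = 0.0
    for ps in keyword_positions:
        if not ps:
            continue  # absent keyword contributes zero
        d = max(min(abs(cand_pos - p) for p in ps), 1)  # closest occurrence
        if logarithmic:
            d = math.log(d) if d > 1 else 1.0  # guard: log(1) = 0
        dist_score += 1.0 / d
    dist_score /= n
    return ALPHA * occ + BETA * dist_score
```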
<Paragraph position="4"> This algorithm is a simple statistical approach which requires no language-specific external tools beyond word segmentation and a named-entity tagger. It is not as sophisticated as approaches which perform deep linguistic analysis, but one advantage is faster adaptation to multiple languages. In our experiments, this simple algorithm performs at the same level as an FST-based approach (Nyberg et al., 2005).</Paragraph> </Section> <Section position="3" start_page="35" end_page="35" type="sub_section"> <SectionTitle> 3.5 Answer Generator </SectionTitle> <Paragraph position="0"> The task of the Answer Generator (AG) module is to produce a ranked list of answer candidates from the IX output. The AG is designed to normalize answer candidates by resolving representational differences (e.g. in how numbers, dates, etc. are expressed in text). This canonicalization makes it possible to combine answer candidates that differ only in surface form.</Paragraph> <Paragraph position="1"> Even though the AG module plays an important role in JAVELIN, we did not use its full potential in our E-C and E-J systems, since we lacked some of the language-specific resources required for multilingual answer merging.</Paragraph> </Section> </Section> <Section position="6" start_page="35" end_page="36" type="metho"> <SectionTitle> 4 Evaluation and Effect of Translation Accuracy </SectionTitle> <Paragraph position="0"> To evaluate the effect of translation accuracy on the overall performance of the CLQA system, we conducted several experiments using different translation methods. Three different runs were carried out for both the E-C and E-J systems, using the same 200-question test set and the document corpora provided by the NTCIR CLQA task. The first run was a fully automatic run using the original translation module in the CLQA system; the result is exactly the same as the one we submitted to the NTCIR5 CLQA task. For the second run, we manually translated the keywords that were selected by the Question Analyzer module; this translation was done by looking only at the selected keywords, not at the original question. For both the E-C and E-J tasks, the NTCIR organizers provided translations of the English questions, which we take as gold-standard translations. Taking advantage of this resource, in the third run we simply looked up the corresponding term for each English keyword in the gold-standard translation of the question. The results for these runs are shown in Tables 7 and 8 below.</Paragraph> <Paragraph position="1"> We found that in the NTCIR task, the supported/correct document set was not complete. Some answers judged as unsupported were in fact well supported, but the supporting document did not appear in NTCIR's correct document set. We therefore consider the Top1+U column more informative for this evaluation.</Paragraph> <Paragraph position="2"> From Tables 7 and 8, it is clear that overall performance increases as translation accuracy increases. From Run 1 to Run 2, we eliminated all the overt translation errors produced by the system, and also corrected word-sense errors. From Run 2 to Run 3, we made different lexical choices among the seemingly all-correct translations of a word. This type of inappropriateness cannot be classified as an error, but it makes a difference in QA systems, especially at the document retrieval stage. For example, the phrase "Kyoto Protocol" has two valid translations: Jing Du Xie Yi and Jing Du Yi Ding Shu. Both translations would be understandable to a human, but the second appears much more frequently than the first in the document set. This type of lexical choice is hard to make, because it requires either subtle domain-specific knowledge or knowledge about the target corpus; neither is easily obtainable.</Paragraph> <Paragraph position="3"> Comparing Runs 1 and 3 in Table 8, we see that improving keyword translation had less overall impact on the E-J system. Information extraction (including named entity identification) did not perform as well in E-J. We also compared the effect of translation on cross-lingual document retrieval (Figure 3): Run 3 retrieved supporting documents at rank 1 more frequently than Run 1 or Run 2. From these preliminary investigations, it appears that information extraction and/or answer generation must be improved for English-Japanese CLQA.</Paragraph> <Paragraph position="4"> Figure 3: Comparison of three runs: cross-lingual document retrieval performance in E-J</Paragraph> </Section> <Section position="7" start_page="36" end_page="37" type="metho"> <SectionTitle> 5 Translation Issues </SectionTitle> <Paragraph position="0"> In this section, we discuss language-specific keyword translation issues for Chinese and Japanese CLQA.</Paragraph> <Section position="1" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 5.1 Chinese </SectionTitle> <Paragraph position="0"> One prominent problem in Chinese keyword translation is word sense disambiguation.
In question answering systems, the translation results are used directly for information retrieval, which depends heavily on the lexical form of a word rather than its meaning. In other words, a translation whose lexical form differs from the corresponding term in the corpus is effectively a wrong translation. For example, to translate the word "bury" into Chinese, our system gives the translation Mai, which means "bury" in the sense of digging a hole, hiding items in the hole, and covering it with earth. But the desired translation, as it appears in the document, is Zang, which also means "bury", but specifically for burial at funerals.</Paragraph> <Paragraph position="1"> Even more challenging are regional language differences. In our system, for example, the corpora are newswire articles written in Traditional Chinese from Taiwan; if we use an MT system that produces translations in Simplified Chinese followed by conversion to Traditional Chinese, we may run into problems. The MT system generates Simplified Chinese translations first, which suggests that the translation resources it uses were written in Simplified Chinese and originate from mainland China. In mainland China and in Taiwan, people commonly use different words to describe the same thing, especially for proper nouns such as foreign names; Table 9 lists some examples. Therefore, if the MT system generates its output based on text from mainland China, it may produce a different word than the one used in Taiwan, one which may not appear in the corpora. This can lead to failure in document retrieval.</Paragraph> </Section> <Section position="2" start_page="36" end_page="37" type="sub_section"> <SectionTitle> 5.2 Japanese </SectionTitle> <Paragraph position="0"> Representational Gaps: One advantage of using structured queries and automatic query formulation in the RS is that the system can handle slight representational gaps between a translated query and the corresponding target words in the corpus. For example, Werner Spies appears as vueru na siyupisu in our preprocessed Japanese corpus, so vueruna siyupi su, which is missing a dot between the last and first name, would be a wrong translation if our retrieval module allowed only exact matches. Lemur supports an Ordered Distance Operator, where the terms within a #ODN operator must be found within N words of each other in the text in order to contribute to the document's belief value. This enables us to bridge such representational gaps: when #OD1(vueruna siyupisu) does not match any words in the corpus, #OD2(vueruna siyupisu) is formulated in the next step in order to capture vueru na siyupisu. A sketch of this relaxation follows.</Paragraph>
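A sketch of this stepwise widening (our illustration; run_query stands for a call into the Lemur retrieval back end):

```python
def retrieve_with_relaxation(name_parts, run_query, max_window=3, want=10):
    """Widen the #ODN ordered window until enough documents match,
    bridging small representational gaps such as a missing name dot."""
    docs = []
    for n in range(1, max_window + 1):
        docs = run_query(f"#OD{n}( {' '.join(name_parts)} )")
        if len(docs) >= want:
            break
    return docs

# e.g. retrieve_with_relaxation(["vueruna", "siyupisu"], run_query)
# tries #OD1(vueruna siyupisu), then #OD2(vueruna siyupisu), ...
```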
<Paragraph position="2"> Transliteration in WBMT: After detecting Japanese nouns written in romaji (e.g. Funabashi), we transliterate them into hiragana for better results in the WBMT. This is because we assume higher positive co-occurrence between kana and kanji (i.e. hunabasi and Chuan Qiao) than between romaji and kanji (i.e. funabashi and Chuan Qiao). When there are multiple transliteration candidates, we iterate through each candidate.</Paragraph> <Paragraph position="3"> Document Retrieval in Kana: Suppose we are going to transliterate Yusuke. This romaji can be mapped to kana characters with relatively little ambiguity (i.e. yusuke, yuusuke), compared to the subsequent transliteration to kanji (i.e. Xiong Jie, You Jie, You Jie, Yong Jie, Xiong Fu, etc.). Therefore, indexing kana readings in the corpus and querying in kana is sometimes a useful technique for CLQA, given the difficulty of converting romaji to kana and romaji to kanji. To implement this approach, the Japanese corpus was first preprocessed by annotating named entities and chunking morphemes. We then annotated a kana reading for each named entity. At query time, if no translation is found from other resources, the TM transliterates romaji to kana as a back-off strategy.</Paragraph> </Section> </Section> </Paper>