<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1041"> <Title>Machine Translation vs. Dictionary Term Translation: a Comparison for English-Japanese News Article Alignment</Title> <Section position="4" start_page="0" end_page="263" type="metho"> <SectionTitle> 2 MLIR Methods </SectionTitle> <Paragraph position="0"> There has recently been much interest in the MLIR task (Carbonell et al., 1997; Dumais et al., 1996; Hull and Grefenstette, 1996). MLIR differs from traditional information retrieval in several respects, which we discuss below. The most obvious is that we must introduce a translation stage between matching the query and the texts in the document collection.</Paragraph> <Paragraph position="1"> Query translation, which is currently considered to be preferable to document collection translation, introduces several new factors to the IR task: * Term transfer mistakes - analysis is far from perfect in today's MT systems and we must consider how to compensate for incorrect translations. * Unresolved lexical ambiguity - occurs when analysis cannot decide between alternative meanings of words in the target language.</Paragraph> <Paragraph position="2"> * Synonym selection - when we use an MT system to translate a query, generation will usually result in a single lexical choice, even though alternative synonyms exist. For matching texts, the MT system may not have chosen the same synonym in the translated query as the author of the matching document.</Paragraph> <Paragraph position="3"> * Vocabulary limitations - are an inevitable factor when using bilingual dictionaries.</Paragraph> <Paragraph position="4"> Most of the previous work in MLIR has used simple dictionary term translation within the vector space model (Salton, 1989). This avoids the synonym selection constraints imposed by sentence generation in machine translation systems, but fails to resolve lexical transfer ambiguity. 
Since all possible translations are generated, the correctly matching term is assumed to be contained in the list, and term transfer mistakes are not an explicit factor.</Paragraph> <Paragraph position="5"> Two important issues need to be considered in dictionary term based MLIR. The first, raised by Hull et al. (Hull and Grefenstette, 1996), is that generating multiple translations breaks the term independence assumption of the vector space model. A second issue, identified by (Davis, 1996), is whether vector matching methods can succeed given that they essentially exploit linear (term-for-term) relations between the query and target document. This becomes important for languages such as English and Japanese, where high-level transfer is necessary.</Paragraph> <Paragraph position="6"> Machine translation of the query, on the other hand, uses high-level analysis and should be able to resolve much of the lexical transfer ambiguity supplied by the bilingual dictionary, leading to significant improvements in performance over DTL, e.g. see (Davis, 1996). We assume that the MT system will select only one synonym where a choice exists, so term independence in the vector space model is not a problem. Term transfer mistakes clearly depend on the quality of analysis, but may become a significant factor when the query contains only a few terms and little surrounding context.</Paragraph> <Paragraph position="7"> Surprisingly, to the best of our knowledge, no comparison between DTL and MT in MLIR has been attempted before. This may be due either to the unreliability of MT, or because queries in MLIR tend to be short phrases or single terms and MT is considered too challenging. 
In our application of article alignment, where the query contains sentences, it is both meaningful and important to compare the two methods.</Paragraph> </Section> <Section position="5" start_page="263" end_page="263" type="metho"> <SectionTitle> 3 News Article Alignment </SectionTitle> <Paragraph position="0"> The goal of news article alignment is the same as that of MLIR: we want to find relevant matching documents in the source language corpus collection for those queries in the target language corpus collection. The main characteristics which make news article alignment different from MLIR are: * Number of query terms - the number of terms in a query is very large compared to the usual IR task; * Small search space - we can reduce the search to those documents within a fixed range of the publication date; * Free text retrieval - we cannot control the search vocabulary, as is the case in some information retrieval systems; * High precision - is required because the quality of the bilingual knowledge which we can acquire is directly related to the quality of article alignment. We expect the end product of article alignment to be a noisy-parallel corpus.</Paragraph> <Paragraph position="1"> In contrast to clean-parallel texts, we are just beginning to explore noisy-parallel texts as a serious option for corpus-based NLP, e.g. (Fung and McKeown, 1996). Noisy-parallel texts are characterised by heavy reformatting at the translation stage, including large sections of untranslated text and textual reordering. Methods which seek to align single sentences are unlikely to succeed with noisy-parallel texts, and we seek to match whole documents rather than sentences before bilingual lexical knowledge acquisition. 
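The "small search space" characteristic above can be sketched as a simple date-window filter. This is a minimal illustration only: the record format, field names, and the two-day window are our own assumptions, not taken from the paper.

```python
from datetime import date

def candidate_documents(query_date, documents, window_days=2):
    """Restrict the search space to documents published within a fixed
    range of the query article's publication date (hypothetical helper;
    `documents` is a list of (doc_id, publication_date) pairs)."""
    keep = []
    for doc_id, doc_date in documents:
        # keep documents at most `window_days` away in either direction
        if not abs((doc_date - query_date).days) > window_days:
            keep.append(doc_id)
    return keep
```

For example, given articles dated 10 and 20 January 1997, a query article dated 11 January would retain only the first as a candidate.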
The search effort required to align individual documents is considerable and makes manual alignment both tedious and time consuming.</Paragraph> </Section> <Section position="6" start_page="263" end_page="265" type="metho"> <SectionTitle> 4 System Overview </SectionTitle> <Paragraph position="0"> In our collections of English and Japanese news articles we find that the Japanese texts are much shorter than the English texts, typically only two or three paragraphs, so it was natural to translate from Japanese into English and to think of the Japanese texts as queries. The goal of article alignment can then be reformulated as an IR task: find the English document(s) in the collection (corpus) of news articles which most closely correspond to the Japanese query. The overall system is outlined in Figure 1 and discussed below.</Paragraph> <Section position="1" start_page="263" end_page="264" type="sub_section"> <SectionTitle> 4.1 Dictionary term lookup method </SectionTitle> <Paragraph position="0"> DTL takes each term in the query and performs dictionary lookup to produce a list of possible translation terms in the document collection language.</Paragraph> <Paragraph position="1"> Duplicate terms are not removed from the translation list. In our simulations we used a 65,000-term common word bilingual dictionary and 14,000 terms from a proper noun bilingual dictionary which we consider to be relevant to international news events. The disadvantage of term vector translation using DTL arises from the shallow level of analysis. This leads to the incorporation of a range of polysemes and homographs in the translated query, which reduces the precision of document retrieval. 
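The lookup procedure of 4.1 amounts to the following toy sketch. The dictionary entries are invented romanised examples for illustration, not the 65,000-term resource used in the paper; as described above, every listed translation is generated and duplicates are kept.

```python
def dtl_translate(query_terms, bilingual_dict):
    """Dictionary term lookup: expand each query term into every listed
    translation; duplicates are kept and unknown terms are skipped."""
    translated = []
    for term in query_terms:
        # all candidate translations are generated, so the correctly
        # matching term is assumed to be somewhere in this list
        translated.extend(bilingual_dict.get(term, []))
    return translated

# toy bilingual dictionary (invented entries for illustration)
toy_dict = {
    "sekai": ["world", "earth", "universe", "international"],
    "tandoku": ["independent", "single", "singlehanded", "sole"],
}
```

The flat concatenation makes the ambiguity problem visible: a two-term query already expands to eight candidate terms, most of them spurious for any given context.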
In fact, the greater the depth of coverage in the bilingual lexicon, the greater this problem becomes.</Paragraph> </Section> <Section position="2" start_page="264" end_page="264" type="sub_section"> <SectionTitle> 4.2 Machine translation method </SectionTitle> <Paragraph position="0"> Full machine translation (MT) is another option for the translation stage, and it should allow us to reduce the transfer ambiguity inherent in the DTL model through linguistic analysis. The system we use is Toshiba Corporation's ASTRANSAC (Hirakawa et al., 1991) for Japanese to English translation.</Paragraph> <Paragraph position="1"> The translation model in ASTRANSAC is the transfer method, following the standard process of morphological analysis, syntactic analysis, semantic analysis and selection of translation words. Analysis uses ATNs (Augmented Transition Networks) on a context-free grammar. We modified the system so that it uses the same dictionary resources as the DTL method described above.</Paragraph> </Section> <Section position="3" start_page="264" end_page="265" type="sub_section"> <SectionTitle> 4.3 Example query translation </SectionTitle> <Paragraph position="0"> Figure 2 shows an example sentence taken from a Japanese query together with its English translations produced by the MT and DTL methods. We see that in both translations there is missing vocabulary (e.g. 
&quot; 7,~ 4~&quot; 7~-~ ~ b&quot; is not translated); since the two methods both use the same dictionary resource, this is a constant factor and we can ignore it for comparison purposes.</Paragraph> <Paragraph position="1"> As expected, we see that MT has correctly resolved some of the lexical ambiguities, such as '~: → world', whereas DTL has included the spurious homonym terms &quot;earth, universe, world-wide, universal, international&quot;.</Paragraph> <Paragraph position="2"> Figure 2 (an example sentence taken from a Japanese query with its translations in English): Translation using MT: &quot;Although the American who aims at an independent world round by the balloon, and Mr. Y,~ 4--7&quot; :7e-set are flying the India sky on 19th, it can seem to attain a simple world round.&quot; Translation using DTL: &quot;independent individual singlehanded single separate sole alone / balloon / round one round one revolution / world earth universe world-wide international / base found ground depend / turn hang approach come draw drop cause due twist / choose call / according to based on owing to by by means of under due to through from accord owe / round one round one revolution go travel drive sail walk run / American / 7, 4--7&quot; / aim direct toward shoot for have direct / India Republic of India Rep. of India / 7 ~--- / Mr. Miss Ms. Mis. Messrs. Mrs. Mmes. Ms. Mses. Esq. / American / sky skies upper air upper regions high up in the sky up in the air an altitude a height in the sky of / over / set arrangement arrange / world earth universe world-wide universal international / simple innocent naive unsophisticated inexperienced / fly hop flight aviation / round one round one revolution go travel drive sail walk run / seem appear / encaustic / signs sign indications / attain achieve accomplish realise fulfill achievement attainment&quot;.</Paragraph> <Paragraph position="3"> In the case of synonymy we notice that MT has decided on &quot;independent&quot; as the translation of &quot;~ ~&quot;, while DTL also includes the synonyms &quot;individual, singlehanded, single, separate, sole, ...&quot;, etc. The author of the correctly matching English text actually chose the term 'singlehanded', so synonym expansion will provide us with a better match in this case. The choice of synonyms is quite dependent on author preference and style considerations, which MT cannot be expected to second-guess.</Paragraph> <Paragraph position="4"> The limitations of MT analysis give us some selection errors; for example, we see that &quot;4' ~&quot; I <~ _1=~}~ ~L77~;5&quot; is translated as &quot;flying the India sky ...&quot;, whereas the natural translation would be &quot;flying over India&quot;, even though 'over' is registered as a possible translation of '_l=~' in the dictionary.</Paragraph> </Section> </Section> <Section position="7" start_page="265" end_page="265" type="metho"> <SectionTitle> 5 Corpus </SectionTitle> <Paragraph position="0"> The English document collection consisted of Reuters daily news articles taken from the internet for the period December 1996 to May 1997. In total we have 6782 English articles, with an average of about 45 articles per day. 
After pre-processing to remove hypertext and formatting characters we are left with approximately 140,000 paragraphs of English text.</Paragraph> <Paragraph position="1"> In contrast to the English news articles, the Japanese articles, which are also produced daily by Reuters, are very short. The Japanese is a translated summary of an English article, but considerable reformatting has taken place. In many cases the Japanese translation seems to draw on multiple sources, including some which do not appear on the public newswire at all. The 1488 Japanese articles cover the same period as the English articles.</Paragraph> </Section> <Section position="8" start_page="265" end_page="265" type="metho"> <SectionTitle> 6 Implementation </SectionTitle> <Paragraph position="0"> The task of text alignment takes a list of texts {Q0, ..., Qn} in a target language and a list of texts {D0, ..., Dm} in a source language and produces a list of aligned pairs. A pair < Qx, Dy > is in the list if Qx is a partial or whole translation of Dy. In order to decide whether a source and target language text should be in the list of aligned pairs, we translate Qx into the source language to obtain Q'x using bilingual dictionary lookup. We then match texts from {Q'0, ..., Q'n} and {D0, ..., Dm} using standard models from Information Retrieval. We now describe the basic model.</Paragraph> <Paragraph position="1"> Terminology An index of t terms is generated from the document collection (English corpus) and the query set (Japanese translated articles). Each document has a description vector D = (wd1, wd2, ..., wdt), where wdk represents the weight of term k in document D. The set of documents in the collection is N, and nk represents the number of documents in which term k appears; tfdk denotes the term frequency of term k in document D. 
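Using the terminology above, the weighting and matching scheme defined in Section 6.1 below can be sketched as a small self-contained implementation. This is a toy illustration of tf.idf weighting with cosine normalisation, not the authors' code; the function names are ours and stemming is omitted.

```python
import math
from collections import Counter

def tfidf_vector(terms, doc_freq, num_docs):
    """Weight term k by tfdk x log(|N| / nk), the tf.idf score.
    `doc_freq` maps each indexed term to nk, the number of
    documents containing it; terms outside the index are dropped."""
    tf = Counter(terms)
    return {t: tf[t] * math.log(num_docs / doc_freq[t])
            for t in tf if t in doc_freq}

def cosine(q_vec, d_vec):
    """Cosine rule: inner product normalised by the two vector
    lengths, compensating for differences in document length."""
    shared = set(q_vec).intersection(d_vec)
    num = sum(q_vec[t] * d_vec[t] for t in shared)
    q_len = math.sqrt(sum(w * w for w in q_vec.values()))
    d_len = math.sqrt(sum(w * w for w in d_vec.values()))
    if q_len * d_len == 0:
        return 0.0
    return num / (q_len * d_len)
```

Because log(num_docs / doc_freq[t]) grows as a term becomes rarer in the collection, rare terms dominate the match score, which is the precision-improving effect of idf noted in Section 6.1.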
A query Q is formulated as a query description vector Q = (wq1, wq2, ..., wqt).</Paragraph> <Section position="1" start_page="265" end_page="265" type="sub_section"> <SectionTitle> 6.1 Model </SectionTitle> <Paragraph position="0"> We implemented the standard vector-space model with cosine normalisation, inverse document frequency (idf) and lexical stemming using the Porter algorithm (Porter, 1980) to remove suffix variations between surface words.</Paragraph> <Paragraph position="1"> The cosine rule is used to compensate for variations in document length and the number of terms when matching a query Q from the Japanese text collection and a document D from the English text collection.</Paragraph> <Paragraph position="2"> sim(Q, D) = sum_k (wqk x wdk) / ( sqrt(sum_k wqk^2) x sqrt(sum_k wdk^2) )</Paragraph> <Paragraph position="3"> We combined term weights in the document and query with a measure of the importance of the term in the document collection as a whole. This gives us the well-known inverse document frequency (tf.idf) score:</Paragraph> <Paragraph position="4"> wdk = tfdk x log(|N| / nk)</Paragraph> <Paragraph position="5"> Since log(|N|/nk) favours rarer terms, idf is known to improve precision.</Paragraph> </Section> </Section> </Paper>