File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/03/w03-1110_evalu.xml
Size: 5,503 bytes
Last Modified: 2025-10-06 13:59:04
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1110"> <Title>Issues in Pre- and Post-translation Document Expansion: Untranslatable Cognates and Missegmented Words</Title> <Section position="9" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> These results clearly demonstrate the significant utility of post-translation document expansion for English-Mandarin CLIR with Mandarin spoken documents, in contrast to pre-translation expansion. Not only do these results extend our understanding of the interactions of translation and expansion, but they contrast dramatically with prior work on translation</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Document Expansion </SectionTitle> <Paragraph position="0"> and query expansion - in particular, with the (McNamee and Mayfield, 2002) work emphasizing the primary importance of pre-translation expansion.</Paragraph> <Paragraph position="1"> Two main factors contribute to this contrast: first, differences between languages, and second, differences between documents and queries. The characteristics of the document and query languages play a crucial role in determining the effectiveness of pre- and post-translation document expansion. In particular, the orthography of Mandarin Chinese and the difference in writing systems between the English queries and Mandarin documents affect the expansion process. If one examines the terms contributed by post-translation expansion, one can quickly observe the utility of the enriching terms. For instance, in a document about the Iraqi oil embargo, one finds the names of Tariq Aziz and Saddam; in an article about the former Soviet republic of Georgia, one finds the name of former president Zviad Gamsakhurdia. These and many of the other useful expansion terms do not appear anywhere in the translation resource. 
Even if these terms were proposed by pre-translation expansion or existed in the original document, they would not be available in the translated result. These named entities are highly useful in many information retrieval activities but are notoriously absent from translation resources. For languages with different orthographies, these terms cannot match as cognates but must be explicitly translated or transliterated. Thus, these terms are only useful for enrichment when the translation barrier has already been passed. In contrast, the majority of the query translation experiments that demonstrate the utility of pre-translation expansion have been performed on European language pairs that share a common alphabet, making names found at any stage of expansion available for matching as cognates in retrieval even when no explicit translation is available. Recent side experiments on pre- and post-translation query expansion on the English-Chinese pair show a similar pattern of effectiveness for post-translation expansion over pre-translation expansion (Levow et al., Under Review).</Paragraph> <Paragraph position="2"> A further complication is caused by the fact that Mandarin Chinese is written without white space separating words. As a result, some segmentation process must be performed to identify words for translation, even though indexing and retrieval can be performed effectively on n-gram units (Meng et al., 2001). This segmentation process typically relies on a list of terms that may appear in legal segmentations. Just as in the case of translation, these term lists often lack good coverage of proper names.</Paragraph> <Paragraph position="3"> Thus, these terms may not be identified for translation, expansion, or even transcription by an automatic speech recognition system that also depends on word lists as models. These constraints limit the effectiveness of pre-translation expansion. 
In post-translation expansion, however, these problems are much less significant. In English, white-space delimited terms are available and largely sufficient for retrieval (especially after stemming). Even with multi-word concepts as in the name examples above, the cooccurrence of these terms in expansion documents makes it likely that they will cooccur in the list of enriching terms as well, though perhaps not in the same order. In Chinese or other typically unsegmented languages, overlapping n-grams can be used as indexing or expansion units, to bypass segmentation issues, once translation has been completed. Finally, (McNamee and Mayfield, 2002) observe that pre-translation query expansion plays a crucial role in ensuring that some terms are translatable, and post-translation expansion would have nothing to operate on if no query terms translated. This is certainly true, but this problem is much more likely to arise in the case of short queries, where only a single term may represent a topic and there are few terms in the query. As documents are typically much longer, there is often more redundancy of representation.</Paragraph> <Paragraph position="4"> This is analogous to the observation (Krovetz, 1993) that stemming has less of an impact as documents become longer because a wider variety of surface forms are likely to appear. Thus, some translatable form of a concept is more likely to appear in a long document, even without expansion and even with a poor translation resource. As a result, pre-translation expansion may be less crucial for long documents.</Paragraph> </Section> </Section> </Paper>
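
The segmentation argument in the discussion above can be illustrated with a small sketch. It is not the system used in the paper: the function names `greedy_segment` and `overlapping_ngrams`, the two-entry lexicon, and the example string are all assumptions made for illustration, and forward maximum matching stands in for whatever word-list-based segmenter a real pipeline would use. The sketch shows how a name missing from the word list is broken into single characters, while overlapping character bigrams still allow the name to be matched.

```python
def greedy_segment(text, lexicon, max_len=4):
    """Forward maximum matching over a word list (a stand-in segmenter).

    At each position, take the longest lexicon entry that matches;
    fall back to a single character when nothing in the lexicon matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def overlapping_ngrams(text, n=2):
    """Return all overlapping character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    # Toy data (hypothetical): "Iraqi president Saddam"; the lexicon
    # covers "Iraq" and "president" but lacks the proper name.
    document = "伊拉克总统萨达姆"
    lexicon = {"伊拉克", "总统"}
    name = "萨达姆"

    # The name is split into single characters by the word-list segmenter.
    print(greedy_segment(document, lexicon))

    # Every bigram of the name also occurs among the document bigrams,
    # so the name can be matched without any segmentation step.
    doc_bigrams = set(overlapping_ngrams(document))
    print(all(gram in doc_bigrams for gram in overlapping_ngrams(name)))
```

With a segmenter of this kind, the unlisted name never surfaces as a unit that could be looked up in a translation resource, whereas the bigram representation needs no word list at all.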