<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1110">
  <Title>Issues in Pre- and Post-translation Document Expansion: Untranslatable Cognates and Missegmented Words</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experimental Configuration
</SectionTitle>
    <Paragraph position="0"> Here we describe the basic experimental configuration under which contrastive document expansion experiments were carried out.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Experimental Collection
</SectionTitle>
      <Paragraph position="0"> We used the Topic Detection and Tracking (TDT) Collection for this work. TDT is an evaluation program where participating sites tackle tasks as such identifying the first time a story is reported on a given topic or grouping similar topics from audio and textual streams of newswire date. In recent years, TDT has focused on performing such tasks in both English and Mandarin Chinese.1 The task that we have performed is not a strict part of TDT because we are performing retrospective retrieval which permits knowledge of the statistics for the entire collection. Nevertheless, the TDT collection serves as a valuable resource for our work. The TDT multilingual collection includes English and Mandarin newswire text as well as (audio) broadcast news. For most of the Mandarin audio data, word-level transcriptions produced by the Dragon 1This year Arabic was added to the languages of interest. automatic speech recognition system are provided.</Paragraph>
      <Paragraph position="1"> All news stories are exhaustively tagged with event-based topic labels, which serve as the relevance judgments for performance evaluation of our cross-language spoken document retrieval work. We used a subset of the TDT-2 corpus for the experiments reported here.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Query Formulation
</SectionTitle>
      <Paragraph position="0"> TDT frames the retrieval task as query-by-example, designating 4 exemplar documents to specify the information need. For query formulation, we constructed a vector of the 180 terms that best distinguish the query exemplars from other contemporaneous (and hopefully not relevant) stories. We used a a0a2a1 test in a manner similar to that used by Sch&amp;quot;utze et al (Sch&amp;quot;utze et al., 1995) to select these terms.</Paragraph>
      <Paragraph position="1"> The pure a0 a1 statistic is symmetric, assigning equal value to terms that help to recognize known relevant stories and those that help to reject the other contemporaneous stories. We limited our choice to terms that were positively associated with the known relevant training stories. For the a0 a1 computation, we constructed a set of 996 contemporaneous documents for each topic by removing the four query examplars from a topic-dependent set of up to 1000 stories working backwards chronologically from the last English query example. Additional details may be found in (Levow and Oard, 2000).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Document Translation
</SectionTitle>
      <Paragraph position="0"> Our translation strategy implemented a word-for-word translation approach. For our original spoken documents, we used the word boundaries provided in the baseline recognizer transcripts. We next perform dictionary-based word-for-word translation, using a bilingual term list produced by merging the entries from the second release of the LDC Chinese-English term list (http://www.ldc.upenn.edu, (Huang, 1999)) and entries from the CETA file, a large human-readable Chinese-English dictionary. The resulting term list contains 195,078 unique Mandarin terms, with an average of 1.9 known English translations per Mandarin term. We select the translation with the highest target language unigram frequency, based on a side collection in the target language.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Document Expansion
</SectionTitle>
      <Paragraph position="0"> We implemented document expansion for the VOA Mandarin broadcast news stories in an effort to partially recover terms that may have been mistranscribed. Singhal et al. used document expansion for monolingual speech retrieval (Singhal and Pereira, 1999).</Paragraph>
      <Paragraph position="1"> The automatic transcriptions of the VOA Mandarin broadcast news stories and their word-for-word translations are an often noisy representation of the underlying stories. For expansion, the text of these documents was treated as a query to a comparable collection (in Mandarin before translation and English after translation), by simply combining all the terms with uniform weighting. This query was presented to the InQuery retrieval system version 3.1pl developed at the University of Massachusetts (Callan et al., 1992).</Paragraph>
      <Paragraph position="2"> Figure 1 depicts the document expansion process.</Paragraph>
      <Paragraph position="3"> The use of pre- and post-translation document expansion components was varied as part of the experimental suite described below. We selected the five highest ranked documents from the ranked retrieval list. From those five documents, we extracted the most selective terms and used them to enrich the original translations of the stories. For this expansion process we first created a list of terms from the documents where each document contributed one instance of a term to the list. We then sorted the terms by inverse document frequency (IDF). We next augmented the original documents with these terms until the document had approximately doubled in length. Doubling was computed in terms of number of whitespace delimited units. For Chinese audio documents, words were identified by the Dragon automatic speech recognizer as part of the transcription process. For the Chinese newswire text, segmentation was performed by the NMSU segmenter ( (Jin, 1998)). The expansion factor chosen here followed Singhal et al's original proposal. A proportional expansion factor is more desirable than some constant additive number of words or some selectivity threshold, as it provides a more consistent effect on documents of varying lengths; an IDF-based threshold, for example, adds disproportionately more new terms to short original documents than long ones, outweighing the original content. Prior experiments indicate little sensitivity to the exact expansion factor chosen, as long as it is proportional.</Paragraph>
      <Paragraph position="4"> This process thus relatively increased the weight of terms that occurred rarely in the document collection as a whole but frequently in related documents. The resulting augmented documents were then indexed by InQuery in the usual way.This expanded document collection formed the basis for retrieval using the translated exemplar queries.</Paragraph>
      <Paragraph position="5"> The intuition behind document expansion is that terms that are correctly transcribed will tend to be topically coherent, while mistranscription will introduce spurious terms that lack topical coherence. In other words, although some &amp;quot;noise&amp;quot; terms are randomly introduced, some &amp;quot;signal&amp;quot; terms will survive. The introduction of spurious terms degrades ranked retrieval somewhat, but the adverse effect is limited by the design of ranking algorithms that give high scores to documents that contain many query terms.</Paragraph>
      <Paragraph position="6"> Because topically related terms are far more likely to appear together in documents than are spurious terms, the correctly transcribed terms will have a disproportionately large impact on the ranking process. The highest ranked documents are thus likely to be related to the correctly transcribed terms, and to contain additional related terms. For example, a system might fail to accurately transcribe the name &amp;quot;Yeltsin&amp;quot; in the context of the (former) &amp;quot;Russian Prime Minister&amp;quot;. However, in a large contemporaneous text corpus, the correct form of the name will appear in such document contexts, and relatively rarely outside of such contexts. Thus, it will be a highly correlated and highly selective term to be added in the course of document expansion.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Document Expansion Experiments
</SectionTitle>
    <Paragraph position="0"> Our goal is to evaluate the effectiveness of pseudo-relevance feedback expansion applied at different stages of document processing and determine what factors contribute to the any differences in final retrieval effectiveness. We consider expansion before translation, after translation, and at both points. The expansion process aims to (re)introduce terminology that could have been used by the author to express the concepts in the documents. Expansion at different stages of processing addresses different causes of loss or absence of terms. At all points, it can ad- null dress terminological choice by the author.</Paragraph>
    <Paragraph position="1"> Since we are working with automatic transcriptions of spoken documents, pre-translation (posttranscription) expansion directly addresses term loss due to substitution or deletion errors in automatic recognition. In addition, as emphasized by (Mc-Namee and Mayfield, 2002), pre-translation expansion can be crucial to providing translatable terms so that there is some material for post-translation indexing and matching to operate on. In other words, by including a wider range of expressions of the document concepts, pre-translation expansion can avoid translation gaps by enhancing the possibility that some term representing a concept that appears in the original document will have a translation in the bilingual term list. Addition of terms can also serve a disambiguating effect as identified by (Ballesteros and Croft, 1997).</Paragraph>
    <Paragraph position="2"> Post-translation expansion provides an opportunity to address translation gaps even more strongly. Pre-translation expansion requires that there be some representation of the document language concept in the term list, whereas post-translation expansion can acquire related terms with no representation in the translation resources from the query language side collection. This capability is particularly desirable given both the important role of named entities (e.g. person and organization names) in many retrieval activities, in conjunction with their poor coverage in most translation resources. Finally, it provides the opportunity to introduce additional conceptually related terminology in the query language, even if the document language form of the term was not introduced by the original author to enhance the representation.</Paragraph>
    <Paragraph position="3"> We evaluate four document processing configurations: null</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. No Expansion
</SectionTitle>
    <Paragraph position="0"> Documents are translated directly as described above, based on the provided automatic speech recognition transcriptions.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. Pre-translation Expansion
</SectionTitle>
    <Paragraph position="0"> Documents are expanded as described above, using a contemporaneous Mandarin newswire text collection from Xinhua and Zaobao news agencies. These collections are segmented into words using the NMSU segmenter. The resulting documents are translated as usual. Note that translation requires that the expansion units be words.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. Post-translation Expansion
</SectionTitle>
    <Paragraph position="0"> The English document forms produced by item 1 are expanded using a contemporaneous collection of English newswire text from the New York Times and Associated Press (also part of the TDT-2 corpus).</Paragraph>
    <Paragraph position="1">  4. Pre- and Post-translation Expansion  The document forms produced by item 2 are translated in the the usual word-for-word process. The resulting English text is expanded as in item 3.</Paragraph>
    <Paragraph position="2"> After the above processing, the resulting English documents are indexed.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Results
</SectionTitle>
      <Paragraph position="0"> The results of these different expansion configurations appear in Figure 2. We observe that both post-translation expansion and combined pre- and post-translation document expansion yield highly significant improvements (Wilcoxon signed rank test, two-tailed, a0a2a1a4a3a6a5a7a3a8a3a10a9a8a11 ) in retrieval effectiveness over the unexpanded case. In contrast, although pre-translation expansion yields an 18% relative increase in mean average precision, this improvement does not reach significance. The combination of preand post-translation expansion increases effectiveness by only 3% relative over post-translation expansion, but 33% relative over pre-translation expansion alone. This combination of pre- and post-translation expansion significantly improves over pre-translation document expansion alone (a0 a1 a3a6a5a7a3a8a3a10a12 ).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>