<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0504">
  <Title>Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval</Title>
  <Section position="4" start_page="24" end_page="25" type="metho">
    <SectionTitle>
3. Multiscale Audio Indexing
</SectionTitle>
    <Paragraph position="0"> A popular approach to spoken document retrieval is to apply Large-Vocabulary Continuous Speech Recognition (LVCSR) for audio indexing, followed by text retrieval techniques. Mandarin Chinese presents a challenge for word-level indexing by LVCSR because of the ambiguity in tokenizing a sentence into words (as mentioned earlier; examples are drawn from \[Meng and Ip, 1999\]).</Paragraph>
    <Paragraph position="1"> Furthermore, LVCSR with a static vocabulary is hampered by the out-of-vocabulary (OOV) problem, especially when searching sources with topical coverage as diverse as that found in broadcast news.</Paragraph>
    <Paragraph position="2"> By virtue of the monosyllabic nature of the Chinese language and its dialects, the syllable inventory can provide complete phonological coverage for spoken documents and circumvent the OOV problem in news audio indexing, offering the potential for greater recall in subsequent retrieval. The approach thus supports searches for previously unknown query terms in the indexed audio.</Paragraph>
    <Paragraph position="3"> The pros and cons of subword indexing for an English spoken document retrieval task were studied in \[Ng, 2000\]. Ng pointed out that the exclusion of lexical knowledge when subword indexing is performed in isolation may adversely impact discrimination power for retrieval, but that some of that impact can be mitigated by modeling sequential constraints among subword units. We plan to investigate the efficacy of using both word and subword units for Mandarin audio indexing \[Meng et al., 2000\]. Although Ng found that such an approach produced little gain over words alone for English, the structure of Mandarin Chinese may produce more useful subword features.</Paragraph>
    <Section position="1" start_page="24" end_page="25" type="sub_section">
      <SectionTitle>
3.1 Modeling Syllable Sequence Constraints
</SectionTitle>
      <Paragraph position="0"> We have thus far used overlapping syllable N-grams for spoken document retrieval for two Chinese dialects - Mandarin and Cantonese.</Paragraph>
      <Paragraph position="1"> Results on a known-item retrieval task with over 1,800 error-free news transcripts \[Meng et al., 1999\] indicate that constraints from overlapping bigrams can yield significant improvements in retrieval performance over syllable unigrams, producing retrieval performance competitive with that obtained using automatically tokenized Chinese words.</Paragraph>
      <Paragraph position="2"> (The lexicon size of a typical large-vocabulary continuous speech recognizer can range from 10,000 to 100,000 word forms.)</Paragraph>
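      <Paragraph> The overlapping N-gram indexing described above can be sketched as follows (a minimal illustration in Python; the function name and the Pinyin-with-tone-number syllable encoding are our own choices, not part of the MEI system):

```python
def syllable_ngrams(syllables, n=2):
    """Return the overlapping syllable n-grams of a spoken document.

    Consecutive n-grams overlap by n - 1 syllables, so every adjacent
    run of n syllables becomes an indexing unit.
    """
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


# A Pinyin syllable transcription with tone numbers attached.
doc = ["zhong1", "hua2", "wen2", "hua4"]
unigrams = syllable_ngrams(doc, n=1)
bigrams = syllable_ngrams(doc, n=2)
```
</Paragraph>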
      <Paragraph position="3"> The study in \[Chen, Wang and Lee, 2000\] also used syllable pairs with skipped syllables in between. This is because many Chinese abbreviations are derived from skipping characters, e.g. 國家科學委員會 &quot;National Science Council&quot; can be abbreviated as 國科會 (including only the first, third and the last characters). Moreover, synonyms often differ by one or two characters, e.g. both 中華文化 and 中國文化 mean &quot;Chinese culture&quot;. Inclusion of these &quot;skipped syllable pairs&quot; also contributed to retrieval performance.</Paragraph>
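      <Paragraph> The skipped syllable pairs of \[Chen, Wang and Lee, 2000\] can be sketched in the same style; this minimal version (our own illustration, including the hypothetical max_skip parameter) pairs each syllable with the syllables one or two positions further on:

```python
def skipped_syllable_pairs(syllables, max_skip=2):
    """Return syllable pairs with 1..max_skip syllables skipped in between.

    Pairs such as (first syllable, third syllable) let an index match
    abbreviations formed by dropping characters, and synonyms that
    differ in a single character.
    """
    pairs = []
    for skip in range(1, max_skip + 1):
        for i in range(len(syllables) - skip - 1):
            pairs.append((syllables[i], syllables[i + skip + 1]))
    return pairs


# Syllables of a seven-character organization name; the pair
# (1st, 3rd) survives even if the middle syllable is abbreviated away.
name = ["guo2", "jia1", "ke1", "xue2", "wei3", "yuan2", "hui4"]
pairs = skipped_syllable_pairs(name, max_skip=1)
```
</Paragraph>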
      <Paragraph position="4"> When modeling sequential syllable constraints, lexical constraints on recognized words may be helpful. We thus plan to explore the potential for integrated sequential modeling of both words and syllables \[Meng et al., 2000\].</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="25" end_page="26" type="metho">
    <SectionTitle>
4. Multiscale Embedded Translation
</SectionTitle>
    <Paragraph position="0"> Figures 1 and 2 illustrate two translingual retrieval strategies. In query translation, English text queries are transformed into Mandarin and then used to retrieve Mandarin documents. For document translation, Mandarin documents are translated into English before they are indexed and then matched with English queries.</Paragraph>
    <Paragraph position="1"> McCarley has reported improved effectiveness from techniques that couple the two strategies \[McCarley, 1999\], but time constraints may limit us to exploring only the query translation strategy during the six-week Workshop.</Paragraph>
    <Section position="1" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
4.1 Word Translation
</SectionTitle>
      <Paragraph position="0"> While we make use of sub-word transcription to smooth out-of-vocabulary (OOV) problems in speech recognition as described above, and to alleviate the OOV problem for translation as we discuss in the next section, accurate translation generally relies on the additional information available at the word and phrase levels. Since the &quot;bag of words&quot; information retrieval techniques do not incorporate any meaningful degree of language understanding to assess similarity between queries and documents, a word-for-word (or, more generally, term-for-term) embedded translation approach can achieve a useful level of effectiveness for many translingual retrieval applications \[Oard and Diekema, 1998\].</Paragraph>
      <Paragraph position="1"> We have developed such a technique for the TDT-3 topic tracking evaluation \[Levow and Oard, 2000\]. For that work we constructed an enriched bilingual Mandarin-English term list by combining two resources: (i) a list assembled by the Linguistic Data Consortium from freely available on-line resources; and (ii) entries from the CETA file (sometimes referred to as &quot;Optilex&quot;), a Chinese-to-English translation resource that was manually compiled by a team of linguists from more than 250 text sources, including special and general-purpose print dictionaries and other sources such as newspapers. The CETA file contains over 250,000 entries, but for our lexical work we extracted a subset of those entries drawn from contemporary general-purpose sources. We also excluded definitions such as &quot;particle indicating a yes/no question.&quot; Our resulting merged Chinese-to-English bilingual term list contains translations for almost 200,000 Chinese terms, with an average of almost two translation alternatives per term. We have also used the same resources to construct an initial English-to-Chinese bilingual term list that we plan to refine before the Workshop.</Paragraph>
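      <Paragraph> Merging term lists of this kind can be sketched as follows (a minimal illustration; the dictionary shapes and the example entries are hypothetical stand-ins, not actual LDC or CETA entries):

```python
def merge_term_lists(*term_lists):
    """Merge bilingual term lists into one Chinese-to-English mapping.

    Each input maps a Chinese term to a list of English translations;
    duplicate translations across lists are collapsed, preserving the
    order in which they are first seen.
    """
    merged = {}
    for term_list in term_lists:
        for term, translations in term_list.items():
            seen = merged.setdefault(term, [])
            for t in translations:
                if t not in seen:
                    seen.append(t)
    return merged


# Hypothetical fragments standing in for the two source lists.
ldc = {"wen2hua4": ["culture"]}
ceta = {"wen2hua4": ["culture", "civilization"], "bei3": ["north"]}
merged = merge_term_lists(ldc, ceta)
```
</Paragraph>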
      <Paragraph position="2"> Three significant challenges faced by termto-term translation systems are term selection in the source language, the source language coverage of the bilingual term list, and translation selection in the target language when more than one alternative translation is known.</Paragraph>
      <Paragraph position="3"> Word segmentation is a natural by-product of large vocabulary Mandarin speech recognition, and white space provides word boundaries for the English queries. We thus plan to choose words as our basic term set, perhaps augmenting this with the multiword expressions found in the bilingual term list.</Paragraph>
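      <Paragraph> One simple way to realize this term selection, assuming the multiword expressions of the bilingual term list are available as a set of space-joined phrases, is greedy longest-match grouping (a sketch; the function name and the max_len limit are our own illustrative choices):

```python
def select_terms(words, term_list_phrases, max_len=3):
    """Group whitespace-tokenized words into query terms.

    Multiword expressions found in the bilingual term list are kept as
    single terms (longest match first); all other words stand alone.
    """
    terms, i = [], 0
    while i != len(words):
        match = None
        # Try the longest candidate phrase first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in term_list_phrases:
                match = candidate
                break
        if match is None:
            terms.append(words[i])
            i += 1
        else:
            terms.append(match)
            i += len(match.split())
    return terms
```
</Paragraph>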
      <Paragraph position="4"> Achieving adequate source language coverage is challenging in news retrieval applications of the type modelled by TDT, because proper names and technical terms that may not be present in general-purpose lexical resources often provide important retrieval cues. Parallel (translation equivalent) corpora have proven to be a useful source of translation  equivalent terms, but obtaining appropriate domain-specific parallel corpora in electronic form may not be practical in some applications.</Paragraph>
      <Paragraph position="5"> We therefore plan to investigate the use of comparable corpora to learn translation equivalents, based on techniques in \[Fung, 1998\]. Subword translation, described below, provides a complementary way of handling terms for which translation equivalents cannot be reliably extracted from the available comparable corpora.</Paragraph>
      <Paragraph position="6"> One way of dealing with multiple translations is to weight the alternatives using a statistical translation model, trained on parallel or comparable corpora, that estimates translation probability conditioned on the source language term. When such resources are not sufficiently informative, it is generally possible to back off to an unconditioned preference statistic based on the usage frequency of each possible translation in a representative monolingual corpus in the target language. In retrospective retrieval applications, the collection being searched can be used for this purpose. We have applied simple versions of this approach with good results \[Levow and Oard, 2000\].</Paragraph>
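      <Paragraph> The weighting-with-backoff idea can be sketched as follows, assuming a (possibly empty) conditional translation probability table and a table of target-language corpus frequencies; all names are illustrative:

```python
def weight_translations(alternatives, cond_prob, target_freq):
    """Weight the alternative translations of one source-language term.

    Uses a translation probability conditioned on the source term when
    one is available; otherwise backs off to the unconditioned usage
    frequency of the alternative in a target-language corpus.  The
    returned weights are normalized to sum to one.
    """
    weights = {}
    for alt in alternatives:
        if alt in cond_prob:
            weights[alt] = cond_prob[alt]
        else:
            weights[alt] = target_freq.get(alt, 0)
    total = sum(weights.values()) or 1  # avoid division by zero
    return {alt: w / total for alt, w in weights.items()}
```
</Paragraph>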
      <Paragraph position="7"> We have recently observed that a simpler technique introduced by \[Pirkola, 1998\] can produce excellent results. The key idea is to use the structure of the lexicon, in which several target language terms can represent a single source language term, to induce structure in the translated query that the retrieval system can automatically exploit. In essence, the translated query becomes a bag of bags of terms, where each smaller bag corresponds to the set of possible translations for one source-language term. We plan to implement this structured query translation approach using the Inquery \[Callan, 1992\] &quot;synonym&quot; operator in the same manner as \[Pirkola, 1998\], and to explore the potential to extend the technique to accommodate alternative recognition hypotheses and subword units as well.</Paragraph>
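      <Paragraph> A minimal sketch of this structured query translation, emitting Inquery-style #sum and #syn operators (the exact query syntax accepted by a particular Inquery installation may differ):

```python
def structured_query(source_terms, term_list):
    """Build a Pirkola-style structured query string.

    Each source-language term becomes a synonym group over all of its
    known translations, so the translated query is a bag of bags of
    terms; terms missing from the lexicon pass through untranslated.
    """
    groups = []
    for term in source_terms:
        translations = term_list.get(term, [term])
        if len(translations) == 1:
            groups.append(translations[0])
        else:
            groups.append("#syn( " + " ".join(translations) + " )")
    return "#sum( " + " ".join(groups) + " )"
```

The synonym operator makes the retrieval system treat all members of a group as occurrences of a single term, so one prolific source term does not dominate the query.</Paragraph>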
    </Section>
    <Section position="2" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
4.2 Subword Translation
</SectionTitle>
      <Paragraph position="0"> Since Mandarin spoken documents can be indexed with both words and subwords, the translation (or &quot;phonetic transliteration&quot;) of subword units is of particular interest. We plan to make use of cross-language phonetic mappings derived from English and Mandarin pronunciation rules for this purpose. This should be especially useful for handling named entities in the queries, e.g. names of people, places and organizations, which are generally important for retrieval but may not be easily translated.</Paragraph>
      <Paragraph position="1"> Chinese translations of English proper nouns may involve semantic as well as phonetic mappings. For example, &quot;Northern Ireland&quot; is translated as 北愛爾蘭 -- where the first character 北 means 'north', and the remaining characters 愛爾蘭 are pronounced as /ai4-er3-lan2/. Hence the translation is both semantic and phonetic. When Chinese translations strive to attain phonetic similarity, the mapping may be inconsistent. For example, consider the translation of &quot;Kosovo&quot; -- sampling Chinese newspapers in China, Taiwan and Hong Kong produces translations pronounced /ke1-suo3-wo4/, /ke1-suo3-fo2/, /ke1-suo3-fu1/, or /ke1-suo3-fu2/, each written with a different character sequence.</Paragraph>
      <Paragraph position="2"> As can be seen, there is no systematic mapping to the Chinese character sequences, but the translated Chinese pronunciations bear some resemblance to the English pronunciation (/k ow s ax v ow/). In order to support retrieval under these circumstances, the approach should involve approximate matches between the English pronunciation and the Chinese pronunciation. The matching algorithm should also accommodate phonological variations.</Paragraph>
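      <Paragraph> Approximate matching between phone sequences can be sketched with a normalized edit distance, as below; a real cross-language phonetic mapping would also weight substitutions by phonological similarity, which this illustration omits:

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cost = 0 if pa == pb else 1
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + cost))  # (mis)match
        prev = cur
    return prev[len(b)]


def phonetic_similarity(english_phones, mandarin_phones):
    """Score in [0, 1]; 1 means the phone sequences are identical."""
    d = edit_distance(english_phones, mandarin_phones)
    longest = max(len(english_phones), len(mandarin_phones), 1)
    return 1 - d / longest
```
</Paragraph>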
      <Paragraph position="3"> Pronunciation dictionaries, or pronunciation generation tools for both English words and Chinese words / characters will be useful for the matching algorithm. We can probably leverage off of ideas in the development of universal speech recognizers \[Cohen et al., 1997\].</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="26" end_page="27" type="metho">
    <SectionTitle>
5. Multiscale Retrieval
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="26" end_page="27" type="sub_section">
      <SectionTitle>
5.1 Coupling Words and Subwords
</SectionTitle>
      <Paragraph position="0"> We intend to use both words and subwords for retrieval. Loose coupling would involve separate retrieval runs using words and subwords, producing two ranked lists, followed by list merging using techniques such as those explored by \[Voorhees, 1995\]. Tight coupling, by  contrast, would require creation of a unified index containing both word and subword units, resulting in a single ranked list. We hope to explore both techniques during the Workshop.</Paragraph>
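      <Paragraph> Loose coupling can be sketched as a weighted linear combination of the scores from the two runs (our own simplification; \[Voorhees, 1995\] explores more sophisticated merging strategies):

```python
def merge_ranked_lists(word_run, subword_run, w=0.5):
    """Loosely couple word-based and subword-based retrieval runs.

    Each run maps a document id to a retrieval score; the merged score
    is a weighted linear combination, and the result is a single
    ranked list of document ids, best first.
    """
    docs = set(word_run) | set(subword_run)
    merged = {d: w * word_run.get(d, 0.0) + (1 - w) * subword_run.get(d, 0.0)
              for d in docs}
    return sorted(merged, key=merged.get, reverse=True)
```
</Paragraph>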
    </Section>
    <Section position="2" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
5.2 Imperfect Indexing and Translation
</SectionTitle>
      <Paragraph position="0"> It should be noted that speech recognition errors add uncertainty when indexing audio, and that translation or transliteration adds uncertainty when translating queries and/or documents. To achieve robust retrieval, we have found three techniques useful: (i) Syllable lattices were used in \[Wang, 1999\] and \[Chien et al., 2000\] for monolingual Chinese retrieval experiments; the lattices were pruned to constrain the search space, but still supported robust retrieval from imperfect recognition transcripts. (ii) Query expansion, in which syllable transcriptions were expanded to include possibly confusable syllable sequences based on a syllable confusion matrix derived from recognition errors, was used in \[Meng et al., 1999\]. (iii) We have expanded the document representation using terms extracted from similar documents in a comparable collection \[Levow and Oard, 2000\], and similar techniques are known to work well in the case of query translation \[Ballesteros and Croft, 1997\]. We hope to add to this set of techniques by exploring the potential for query expansion based on cross-language phonetic mapping.</Paragraph>
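      <Paragraph> Query expansion from a syllable confusion matrix, technique (ii) above, might look like the following sketch (the dictionary encoding of the confusion matrix and the top_k cutoff are our own illustrative choices):

```python
def expand_query_syllables(syllables, confusion, top_k=2):
    """Expand a syllable transcription with confusable alternatives.

    `confusion` maps a syllable to a dict of confusable syllables and
    their probabilities, estimated from recognition errors; the top_k
    most probable confusions are added alongside the original
    syllable, giving per-position alternative sets for retrieval.
    """
    expanded = []
    for s in syllables:
        conf = confusion.get(s, {})
        ranked = sorted(conf, key=conf.get, reverse=True)
        expanded.append([s] + ranked[:top_k])
    return expanded
```
</Paragraph>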
      <Paragraph position="1"> 6. Using the TDT-3 Collection
We plan to use the TDT-2 collection for development testing and the TDT-3 collection for evaluation. Both collections provide documents from two English newswire sources, six English broadcast news audio sources, two Mandarin Chinese newswire sources, and one Mandarin broadcast news source (Voice of America). Manually established story boundaries are available for all audio collections, and we plan to exploit that information to simplify our experiment design.</Paragraph>
      <Paragraph position="2"> The TDT-2 collection includes complete relevance assessments for 20 topics, and the TDT-3 collection provides the same for 60 additional topics, 56 of which have at least one relevant audio story. For each topic, at least four English stories and four Chinese stories are known to be relevant.</Paragraph>
      <Paragraph position="3"> We plan to automatically derive text queries based on one or more English stories that are presented as exemplars, and to use those queries to search the Mandarin audio collection.</Paragraph>
      <Paragraph position="4"> Manually constructed queries will provide a contrastive condition. Unlike the TDT &amp;quot;topic tracking&amp;quot; task in which stories must be declared relevant or not relevant in the order of their arrival, we plan to perform retrospective retrieval experiments in which all documents are known when the query is issued. By relaxing the temporal ordering of the TDT topic tracking task, we can meaningfully search for Mandarin Chinese stories that may have arrived before the exemplar story or stories. We thus plan to report ranked retrieval measures of effectiveness such as average precision in addition to the detection statistics (miss and false alarm) typically reported in TDT.</Paragraph>
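      <Paragraph> For the ranked retrieval measures, the standard uninterpolated average precision for a single topic can be computed as:

```python
def average_precision(ranked_docs, relevant):
    """Uninterpolated average precision for one ranked retrieval run.

    Precision is accumulated at the rank of each relevant document
    retrieved, then averaged over the total number of relevant
    documents for the topic.
    """
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, 1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```
</Paragraph>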
    </Section>
  </Section>
</Paper>