<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0504">
  <Title>Mandarin-English Information (MEI): Investigating Translingual Speech Retrieval</Title>
  <Section position="3" start_page="0" end_page="24" type="intro">
    <SectionTitle>
2. Background
2.1 Translingual Information Retrieval
</SectionTitle>
    <Paragraph position="0"> The earliest work on large-vocabulary cross-language information retrieval from free text (i.e., without manual topic indexing) was reported in 1990 [Landauer and Littman, 1990], and the topic has received increasing attention over the last five years [Oard and Diekema, 1998]. Work on large-vocabulary retrieval from recorded speech is more recent, with some initial work reported in 1995 using subword indexing [Wechsler and Schauble, 1995], followed by the first TREC Spoken Document Retrieval (SDR) evaluation [Garofolo et al., 2000]. The Topic Detection and Tracking (TDT) evaluations, which started in 1998, fall within our definition of speech retrieval for this purpose, differing from other evaluations principally in the criteria that human assessors use when assessing the relevance of a news story to an information need. In TDT, stories are assessed for relevance to an event, while in TREC stories are assessed for relevance to an explicitly stated information need that is often subject- rather than event-oriented.</Paragraph>
    <Paragraph position="1"> The TDT-3 evaluation marked the first case of translingual speech retrieval: the task of finding information in a collection of recorded speech based on evidence of the information need that might be expressed (at least partially) in a different language. Translingual speech retrieval thus merges two lines of research that have developed separately until now. In the TDT-3 topic tracking evaluation, recognizer transcripts, which contain recognition errors, were available, and it appears that every team made use of them. This provides a valuable point of reference for investigating techniques that couple speech recognition more tightly with translingual retrieval. We plan to explore one way of doing this in the Mandarin-English Information (MEI) project.</Paragraph>
    <Section position="1" start_page="23" end_page="24" type="sub_section">
      <SectionTitle>
2.2 The Chinese Language
</SectionTitle>
      <Paragraph position="0"> In order to retrieve Mandarin audio documents, we should consider a number of linguistic characteristics of the Chinese language: The Chinese language has many dialects.</Paragraph>
      <Paragraph position="1"> Different dialects are characterized by their differences in phonetics, vocabulary and syntax. Mandarin, also known as Putonghua (&quot;the common language&quot;), is the most widely used dialect. Another major dialect is Cantonese, predominant in Hong Kong, Macau, South China and many overseas Chinese communities.</Paragraph>
      <Paragraph position="2"> Chinese is a syllable-based language, where each syllable carries a lexical tone.</Paragraph>
      <Paragraph position="3"> Mandarin has about 400 base syllables and four lexical tones, plus a &quot;light&quot; tone for reduced syllables. There are about 1,200 distinct tonal syllables for Mandarin. Certain syllable-tone combinations are non-existent in the language.</Paragraph>
      <Paragraph position="4"> The acoustic correlates of the lexical tone include the syllable's fundamental frequency (pitch contour) and duration. However, these acoustic features are also highly dependent on prosodic variations of spoken utterances.</Paragraph>
      <Paragraph position="5"> The structure of Mandarin (base) syllables is (CG)V(X), where (CG) is the syllable onset (C is the initial consonant and G is the optional medial glide), V is the nuclear vowel, and X is the coda (which may be a glide, alveolar nasal or velar nasal). Syllable onsets and codas are optional.</Paragraph>
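      The (CG)V(X) structure can be illustrated with a minimal sketch that splits a pinyin syllable into its optional initial consonant, the remainder of the syllable, and its tone. Several things here are assumptions and not taken from the paper: the pinyin spellings of the initial consonants, the use of a trailing digit for the tone (with 5 standing for the light tone), and the name split_syllable. The glide, vowel and coda are deliberately left together, and the orthographic letters y and w are not treated specially.

```python
import re

# Pinyin spellings of the Mandarin initial consonants (an assumption of this
# sketch). Two-letter initials are listed first so that "zh", "ch" and "sh"
# are tried before "z", "c" and "s".
INITIAL_CONSONANTS = [
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s",
]

# Optional initial consonant, then the rest of the syllable (glide, vowel,
# coda), then a tone digit. Digit 5 for the light tone is an assumption.
SYLLABLE = re.compile(
    "^(" + "|".join(INITIAL_CONSONANTS) + ")?([a-z]+)([1-5])$"
)


def split_syllable(syllable):
    """Split a tone-numbered pinyin syllable into (initial, rest, tone)."""
    match = SYLLABLE.match(syllable)
    if match is None:
        raise ValueError("not a tone-numbered pinyin syllable: " + syllable)
    initial, rest, tone = match.groups()
    # A syllable without an onset yields an empty initial.
    return (initial or "", rest, int(tone))


if __name__ == "__main__":
    for example in ["hang2", "xing2", "shu4", "an1"]:
        print(example, split_syllable(example))
```

      Because the onset is optional, a syllable such as an1 comes back with an empty initial, which mirrors the statement that syllable onsets are optional.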
      <Paragraph position="6"> Generally C is known as the syllable initial, and the rest (GVX) as the syllable final. Mandarin has approximately 21 initials and 39 finals. In its written form, Chinese is a sequence of characters. A word may contain one or more characters. Each character is pronounced as a tonal syllable. The character-syllable mapping is degenerate. On one hand, a given character may have multiple syllable pronunciations: for example, the character 行 may be pronounced as /hang2/, /hang4/, or /xing2/ (see note 6). On the other hand, a given tonal syllable may correspond to multiple characters. Consider the two-syllable pronunciation /fu4 shu4/, which corresponds to a two-character word. Possible homophones include 富庶 (meaning &quot;rich&quot;), 负数 (&quot;negative number&quot;), 复数 (&quot;complex number&quot; or &quot;plural&quot;), and 复述 (&quot;repeat&quot;) (see note 7). Aside from homographs and homophones, another source of ambiguity in the Chinese language is the definition of a Chinese word.</Paragraph>
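      The two directions of this degenerate mapping can be sketched as a pair of indexes built from (written form, pronunciation) pairs. This is only an illustration: the pronunciations are the ones quoted in the paragraph above, but the written forms are replaced by placeholder labels and English glosses, and the names PAIRS and build_indexes are hypothetical.

```python
from collections import defaultdict

# (written form, pronunciation) pairs. Pronunciations come from the examples
# in the text; written forms are stood in for by a placeholder label and by
# English glosses, which is a simplification made for this sketch.
PAIRS = [
    ("CHARACTER_WITH_THREE_READINGS", "hang2"),
    ("CHARACTER_WITH_THREE_READINGS", "hang4"),
    ("CHARACTER_WITH_THREE_READINGS", "xing2"),
    ("rich", "fu4 shu4"),
    ("negative number", "fu4 shu4"),
    ("complex number or plural", "fu4 shu4"),
    ("repeat", "fu4 shu4"),
]


def build_indexes(pairs):
    """Return (readings, homophones) for a list of written/spoken pairs."""
    readings = defaultdict(set)    # written form -> set of pronunciations
    homophones = defaultdict(set)  # pronunciation -> set of written forms
    for written, pronunciation in pairs:
        readings[written].add(pronunciation)
        homophones[pronunciation].add(written)
    return readings, homophones


if __name__ == "__main__":
    readings, homophones = build_indexes(PAIRS)
    # One written form with several pronunciations.
    print(sorted(readings["CHARACTER_WITH_THREE_READINGS"]))
    # One pronunciation shared by several written forms.
    print(sorted(homophones["fu4 shu4"]))
```

      A recognizer that outputs syllables has to resolve the second index, while a text-to-pronunciation step has to resolve the first.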
      <Paragraph position="7"> Words have no delimiters, and the distinction between a word and a phrase is often vague. The lexical structure of the Chinese word is very different from that of English. Inflectional forms are minimal, while morphology and word derivation abide by a different set of rules.</Paragraph>
      <Paragraph position="8"> 6 These are Mandarin pinyin transcriptions; the number encodes the tone of the syllable.</Paragraph>
      <Paragraph position="9"> 7 Example drawn from [Leung, 1999].</Paragraph>
      <Paragraph position="10"> For example, 红 means red (a noun or an adjective), 色 means color (a noun), and 红色 together means &quot;the color red&quot; (a noun) or simply &quot;red&quot; (an adjective). Alternatively, a word may take on totally different characteristics of its own, e.g. 东 means east (a noun or an adjective), 西 means west (a noun or an adjective), and 东西 together means thing (a noun). Yet another case is where the component characters of a word do not form independent lexical entries in isolation, e.g. D~ means fancy (a verb), but its characters do not occur individually. Possible ways of deriving new words from characters are legion. The problem of identifying the word string in a character sequence is known as the segmentation / tokenization problem. Consider the syllable string:</Paragraph>
      <Paragraph position="12"> The corresponding character string has three possible segmentations. All are correct, but each involves a distinct set of words: (Meaning: It will take place tonight as usual.) (Meaning: The evening banquet will take place as usual.) (Meaning: If this evening banquet takes place frequently...) The above considerations lead to a number of techniques we plan to use for our task. We concentrate on three equally critical problems related to our theme of translingual speech retrieval: (i) indexing Mandarin Chinese audio with word and subword units, (ii) translating variable-size units for cross-language information retrieval, and (iii) devising effective retrieval strategies for English text queries and Mandarin Chinese news audio.</Paragraph>
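      The segmentation ambiguity discussed in this section can be sketched with a small enumerator that lists every way to split an undelimited string into lexicon words. This is not the segmentation method of the MEI project: the function name segmentations, the lexicon TOY_LEXICON and the string abcd are hypothetical, with each Latin letter standing in for one Chinese character.

```python
# Toy lexicon: each letter stands in for one character, and each entry is a
# "word" of one or two characters. All entries are hypothetical.
TOY_LEXICON = {"a", "ab", "bc", "c", "cd", "d"}


def segmentations(chars, lexicon):
    """Return every split of `chars` into a sequence of lexicon words."""
    if not chars:
        # Exactly one way to segment the empty string: the empty sequence.
        return [[]]
    results = []
    # Try every prefix as the first word, then segment what is left.
    for end in range(1, len(chars) + 1):
        word = chars[:end]
        if word in lexicon:
            for rest in segmentations(chars[end:], lexicon):
                results.append([word] + rest)
    return results


if __name__ == "__main__":
    for words in segmentations("abcd", TOY_LEXICON):
        print(" / ".join(words))
```

      With this toy lexicon the four-character string has three valid segmentations, each built from a different set of words, which is the situation described for the example above. Plain enumeration grows quickly with string length, so a practical system would need some way to rank or prune the candidates.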
    </Section>
  </Section>
</Paper>