<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1061">
  <Title>Robust Knowledge Discovery from Parallel Speech and Text Sources</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. PROJECT OBJECTIVES
</SectionTitle>
    <Paragraph position="0"> The first objective is to enhance multi-lingual information systems by exploiting the processing capabilities for resource-rich languages to enhance the capabilities for resource-impoverished languages. The second objective is to advance information retrieval and knowledge information systems by providing them with considerably improved multi-lingual speech recognition capabilities. Our research plan proceeds in several steps: (i) collect and (ii) align multi-lingual parallel speech and text sources, (iii) exploit parallelism for improving ASR within a language, and (iv) exploit parallelism for improving ASR across languages. The main information flows involved in aligning and exploiting parallel sources are illustrated in Figure 1. We will initially focus on German, English and Czech language sources. This section summarizes the major components of our project.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Parallel Speech and Text Sources
</SectionTitle>
      <Paragraph position="0"> The monolingual speech and text collections that we will use to develop techniques to exploit parallelism for improving ASR within a language are readily available. For instance, the North American News Text corpus of parallel news streams from 16 US newspapers and newswire is available from LDC. A 3-year period yields over 350 million words of multi-source news text.</Paragraph>
      <Paragraph position="1"> In addition to data developed within the TIDES and other HLT programs, we are in the process of identifying and creating our own multilingual parallel speech and text sources.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
FBIS TIDES Multilingual Newstext Collection
</SectionTitle>
    <Paragraph position="0"> For the purposes of developing multilingual alignment techniques, we intend to use the 240-day, contemporaneous, multilingual news text collection made available to TIDES projects by FBIS.</Paragraph>
    <Paragraph position="1"> This corpus contains news in our initial target languages of English, German, and Czech. The collections are highly parallel, in that many of the stories are direct translations.</Paragraph>
    <Paragraph position="2"> Radio Prague Multilingual Speech and Text Corpus Speech and news text from Radio Prague was collected under the direction of J. Psutka with the consent of Radio Prague. The collection contains speech and text in 5 languages: Czech, English, German, French, and Spanish. The collection began June 1, 2000 and continued for approximately 3 months. The text collection contains the news scripts used for the broadcast; the broadcasts more or less follow the scripts. The speech is about 3 minutes per day in each language, which should yield a total of about 5 hours of speech per language.</Paragraph>
    <Paragraph position="3"> Our initial analysis of the Radio Prague corpus suggests that only approximately 5% of the stories coincide in topic, and that there is little, if any, direct translation of stories. We anticipate that this sparseness will make this corpus significantly harder to analyze than a more highly-parallel corpus. However, we expect this is the sort of difficulty that will likely be encountered in processing 'real-world' multilingual news sources.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Story-level Alignment
</SectionTitle>
      <Paragraph position="0"> Once we have the multiple streams of information, we must be able to align them by story. A story is the description of one or more events that happened in a single day and that are reported in a single article by a daily news source the next day. We expect to use the same techniques used in the Topic Detection and Tracking (TDT) field ([5]). Independently of the specific details of the alignment procedure, there is now substantial evidence that related stories from parallel streams can be identified using standard statistical Information Retrieval (IR) techniques.</Paragraph>
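As a rough illustration, the standard statistical IR approach mentioned above can be sketched as TF-IDF weighting plus cosine similarity between stories from two streams. This is a minimal, pure-Python sketch; the function names, the toy tokenized stories, and the 0.2 acceptance threshold are illustrative assumptions, not details of the TDT systems cited.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple TF-IDF vectors for a list of tokenized documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def align_stories(stream_a, stream_b, threshold=0.2):
    """Pair each story in stream_a with its best-scoring match in stream_b,
    keeping only pairs whose similarity clears the threshold."""
    vecs = tfidf_vectors(stream_a + stream_b)
    va, vb = vecs[:len(stream_a)], vecs[len(stream_a):]
    pairs = []
    for i, u in enumerate(va):
        best, j = max((cosine(u, v), j) for j, v in enumerate(vb))
        if best >= threshold:
            pairs.append((i, j))
    return pairs
```

In practice the cited TDT work uses richer weighting and incremental thresholding, but the core signal is the same lexical-overlap score shown here.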
      <Paragraph position="1"> Sentence Alignment As part of the infrastructure needed to incorporate cross-lingual information into language models, we are employing statistical MT systems to generate English/German and English/Czech alignments of sentences in the FBIS Newstext Collection. For the English/German sentence- and single-word-based alignments, we plan to use statistical models ([4], [3]) which generate both sentence and word alignments. For English/Czech sentence alignment, we will employ the statistical models trained as part of the Czech-English MT system developed during the 1999 Johns Hopkins Summer Workshop ([2]).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Multi-Source Automatic Speech Recognition
</SectionTitle>
      <Paragraph position="0"> The scenario we propose is extraction of information from parallel text followed by repeated recognition of parallel broadcasts, resulting in a gradual lowering of the WER. The first pass is performed in order to find the likely topics discussed in the story and to identify the topics relevant to the query. In this process, the acoustic model will be improved by deriving pronunciation specifications for out-of-vocabulary words and fixed phrases extracted from the parallel stories. The language model will be improved by extending the coverage of the underlying word and phrase vocabulary, and by specializing the model's statistics to the narrow topic at hand. As long as a round of recognition yields new information, the corresponding improvement is incorporated into the recognizer modules and bootstrapping of the system continues.</Paragraph>
      <Paragraph position="1"> Story-specific Language Models from Parallel Speech and Text Our goal is to create language models combining specific but sparse statistics, derived from relevant parallel material, with reliable but unspecific statistics obtainable from large general corpora. We will create special n-gram language models from the available text, related or parallel to the spoken stories. We can then interpolate this special model with a larger pre-existing model, possibly derived from training text associated with the topic of the story. Our recent STIMULATE work demonstrated success in constructing topic-specific language models from hierarchically topic-organized corpora [8].</Paragraph>
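A minimal sketch of the interpolation idea, assuming simple bigram counts. The class name, the fixed interpolation weight, and the uniform fallback for unseen histories are illustrative choices, not the STIMULATE models of [8].

```python
from collections import Counter

class InterpolatedBigramLM:
    """Linearly interpolate a sparse story-specific bigram model with a
    larger general model: p(w|h) = lam * p_story(w|h) + (1 - lam) * p_gen(w|h).
    """
    def __init__(self, story_text, general_text, lam=0.5):
        self.lam = lam
        self.story = self._counts(story_text)
        self.general = self._counts(general_text)
        self.vocab = set(story_text.split()) | set(general_text.split())

    @staticmethod
    def _counts(text):
        toks = text.split()
        # (bigram counts, history counts) for maximum-likelihood estimates
        return Counter(zip(toks, toks[1:])), Counter(toks[:-1])

    def _p(self, model, h, w):
        bigrams, hist = model
        if hist[h] == 0:
            return 1.0 / len(self.vocab)   # uniform fallback, unseen history
        return bigrams[(h, w)] / hist[h]

    def prob(self, h, w):
        """Interpolated bigram probability p(w | h)."""
        return (self.lam * self._p(self.story, h, w)
                + (1 - self.lam) * self._p(self.general, h, w))
```

A production system would tune the interpolation weight on held-out data and use smoothed higher-order n-grams, but the combination of sparse-specific and broad-general statistics is as shown.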
      <Paragraph position="2"> Unlike models built from parallel texts, story-specific language models trained on recognized speech are also affected by recognition errors in the data used for language modeling. Confidence measures can be used to estimate the correctness of individual words or phrases in the recognizer output.</Paragraph>
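A minimal sketch of using such confidence scores to keep only reliable n-gram events. The (word, confidence) pair format and the 0.8 threshold are illustrative assumptions about the decoder output, not a specification of it.

```python
from collections import Counter

def confident_ngrams(recognized, n=2, threshold=0.8):
    """Extract n-gram counts from recognizer output, keeping only n-grams
    in which every word's confidence clears the threshold.

    recognized: list of (word, confidence) pairs, the assumed output
    format of a confidence-annotating decoder.
    """
    ngrams = Counter()
    for i in range(len(recognized) - n + 1):
        window = recognized[i:i + n]
        # discard any event touching a low-confidence (likely wrong) word
        if all(conf >= threshold for _, conf in window):
            ngrams[tuple(word for word, _ in window)] += 1
    return ngrams
```

The surviving counts can then be folded into the story-specific model without feeding recognition errors back into the system.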
      <Paragraph position="3"> Using this information, n-gram statistics can be extracted from the recognizer output by selecting those events which are likely to be correct and which can therefore be used to adjust the original language model without introducing new errors into the recognition system. Language Models with Cross-Lingual Lexical Triggers A trigger language model ([6], [7]) will be constructed for the target language from the text corpus, where the lexical triggers are drawn not from the word history in the target language but from the aligned recognized stories in the source language. The trigger information becomes most important in those cases in which the baseline n-gram model in the target language does not supply sufficient information to predict a word. We expect that content words in the source language are good predictors for content words in the target language, that these words are difficult to predict using the target language alone, and that the mutual information techniques used to identify trigger pairs will be useful here.</Paragraph>
      <Paragraph position="4"> Once a spoken source-language story has been recognized, the words found there will be used as triggers in the language model for the recognition of the target-language news broadcasts.</Paragraph>
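The mutual-information identification of cross-lingual trigger pairs could be sketched as scoring source/target word pairs by pointwise mutual information over aligned story pairs. The PMI variant, the toy aligned stories, and the zero threshold are illustrative assumptions, not the formulation of [6] or [7].

```python
import math
from collections import Counter

def trigger_pairs(aligned_stories, min_mi=0.0):
    """Score (source word, target word) pairs by pointwise mutual
    information over aligned story pairs: words that co-occur in aligned
    stories more often than chance are candidate cross-lingual triggers.

    aligned_stories: list of (source_tokens, target_tokens) pairs.
    """
    n = len(aligned_stories)
    src_df, tgt_df, joint = Counter(), Counter(), Counter()
    for src, tgt in aligned_stories:
        s, t = set(src), set(tgt)
        src_df.update(s)
        tgt_df.update(t)
        joint.update((a, b) for a in s for b in t)
    scored = {}
    for (a, b), c in joint.items():
        # PMI = log p(a, b) / (p(a) p(b)), with story-level probabilities
        pmi = math.log((c * n) / (src_df[a] * tgt_df[b]))
        if pmi > min_mi:
            scored[(a, b)] = pmi
    return scored
```

High-scoring pairs would then feed the trigger component of the target-language model, boosting content words confirmed by the recognized source-language story.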
    </Section>
  </Section>
</Paper>