<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1055"> <Title>Position Specific Posterior Lattices for Indexing Speech</Title> <Section position="3" start_page="443" end_page="444" type="intro"> <SectionTitle> 2 Previous Work </SectionTitle> <Paragraph position="0"> The main research effort aimed at spoken document retrieval (SDR) was centered on the SDR-TREC evaluations (Garofolo et al., 2000), although there is a large body of work in this area prior to the SDR-TREC evaluations, as well as more recent work outside this community. Most notable are the contributions of (Brown et al., 1996) and (James, 1995).</Paragraph> <Paragraph position="1"> One problem with work published prior to or outside the SDR-TREC community is that it does not always evaluate performance from a document retrieval point of view -- using a metric such as Mean Average Precision (MAP), see trec_eval (NIST, www) -- but rather uses word-spotting measures, which are technology- rather than user-centric. We believe that ultimately it is the document retrieval performance that matters and that word-spotting accuracy is just an indicator of how an SDR system might be improved.</Paragraph> <Paragraph position="2"> The TREC-SDR 8/9 evaluations -- (Garofolo et al., 2000), Section 6 -- focused on Broadcast News speech from various sources: CNN, ABC, PRI, and Voice of America. About 550 hours of speech were segmented manually into 21,574 stories, each comprising about 250 words on average. The approximate manual transcriptions -- closed captioning for video -- used to compare SDR systems with text-only retrieval performance had a fairly high word error rate (WER): 14.5% for video and 7.5% for radio broadcasts.</Paragraph> <Paragraph position="3"> ASR systems tuned to the Broadcast News domain were evaluated on detailed manual transcriptions and were able to achieve 15-20% WER, not far from the accuracy of the approximate manual transcriptions.
To evaluate the accuracy of retrieval systems, search queries -- &quot;topics&quot; -- along with binary relevance judgments were compiled by human assessors.</Paragraph> <Paragraph position="4"> SDR systems indexed the ASR 1-best output, and their retrieval performance -- measured in terms of MAP -- was found to be flat with respect to ASR WER variations in the 15%-30% range. Simply having a common task and an evaluation-driven collaborative research effort represents a huge gain for the community. There are, however, shortcomings to the SDR-TREC framework.</Paragraph> <Paragraph position="5"> It is well known that ASR systems are very brittle under mismatched training/test conditions, and it is unrealistic to expect error rates in the 10-15% range when decoding speech mismatched with respect to the training data. It is thus very important to consider ASR operating points with higher WER.</Paragraph> <Paragraph position="6"> Also, the out-of-vocabulary (OOV) rate was very low, below 1%. Since the &quot;topics&quot;/queries were long and stated in plain English rather than following the keyword search paradigm, the query-side OOV (Q-OOV) rate was very low as well, an unrealistic situation in practice. (Woodland et al., 2000) evaluates the effect of the Q-OOV rate on retrieval performance by reducing the ASR vocabulary size so that the Q-OOV rate comes closer to 15%, a much more realistic figure since search keywords are typically rare words. They show severe degradation in MAP performance -- 50% relative, from 44 to 22.</Paragraph> <Paragraph position="7"> The most common approach to dealing with OOV query words is to represent both the query and the spoken document using sub-word units -- typically phones or phone n-grams -- and then match sequences of such units. In his thesis, (Ng, 2000) shows the feasibility of sub-word SDR and advocates tighter integration between ASR and IR technology.
Similar conclusions are drawn by the excellent work in (Siegler, 1999).</Paragraph> <Paragraph position="8"> As pointed out in (Logan et al., 2002), word-level indexing and querying would still be more accurate, were it not for the OOV problem. The authors argue in favor of a combination of word- and sub-word-level indexing. Another problem pointed out by the paper is the abundance of word-spotting false positives in the sub-word retrieval case, somewhat masked by the MAP measure.</Paragraph> <Paragraph position="9"> A similar approach is taken by (Seide and Yu, 2004). One interesting feature of this work is a two-pass system whereby an approximate match is carried out at the document level, after which the costly detailed phonetic match is carried out on only 15% of the documents in the collection.</Paragraph> <Paragraph position="10"> More recently, (Saraclar and Sproat, 2004) shows improvement in word-spotting accuracy by using lattices instead of the 1-best output. An inverted index from symbols -- word or phone -- to links makes it possible to evaluate the adjacency of query words, but more general proximity information is harder to obtain -- see Section 4. Although no formal comparison has been carried out, we believe our approach should yield a more compact index.</Paragraph> <Paragraph position="11"> Before discussing our architectural design decisions, it is probably useful to give a brief presentation of a state-of-the-art text document retrieval engine that uses the keyword search paradigm.</Paragraph> </Section> </Paper>