<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1053"> <Title>Towards Spoken-Document Retrieval for the Internet: Lattice Indexing For Large-Scale Web-Search Architectures</Title> <Section position="3" start_page="415" end_page="416" type="metho"> <SectionTitle> 2 Indexing Speech Lattices, Internet Scale </SectionTitle> <Paragraph position="0"> Substantial investments are necessary to create and operate a web search engine: in software development and optimization, in infrastructure, and in operation and maintenance processes. This poses constraints on what can practically be done when integrating speech-indexing capabilities into such an engine.</Paragraph> <Section position="1" start_page="415" end_page="415" type="sub_section"> <SectionTitle> 2.1 Requirements </SectionTitle> <Paragraph position="0"> We have identified the following special requirements for speech indexing: * realize the best possible accuracy - speech alternates must be indexed, with scores; * provide time information for individual hits - to facilitate easy audio preview and navigation in the UI; * encode the necessary information for phrase matching - phrase matching is a basic function of a search engine and an important feature for document ranking. This is non-trivial because boundaries of recognition alternates are generally not aligned.</Paragraph> <Paragraph position="1"> None of these capabilities is provided by text search engines. To add them to an existing web engine, we face practical constraints. First, the structure of the index store cannot be changed fundamentally, but we can reinterpret existing fields. We also assume that the index attaches a few auxiliary bits to each word hit; e.g., this is done in (early) Google (Brin, 1998) and MSN Search. These can be used for additional data that needs to be stored.</Paragraph> <Paragraph position="2"> Secondly, computation and disk access should remain of a similar order of magnitude as for text search. Extra CPU cycles for phrase-matching loops are possible as long as disk access remains the dominating factor. The index size cannot be excessively larger than for indexing text. This precludes direct inversion of lattices (and unfortunately also the use of phonetic lattices).</Paragraph> <Paragraph position="3"> Last, while local code changes are possible, the overall architecture and dataflow cannot be changed. E.g., this forbids the use of a two-stage method as in (Yu, HLT2005).</Paragraph> </Section> <Section position="2" start_page="415" end_page="415" type="sub_section"> <SectionTitle> 2.2 Approach </SectionTitle> <Paragraph position="0"> We take a three-step approach. First, following (Chelba, 2005), we use a posterior-probability representation, as posteriors are resilient to approximations and can be quantized with only a few bits. Second, we reduce the inherent redundancy of speech lattices by merging word hypotheses with the same word identity and similar time boundaries, hence the name &quot;Time-based Merging for Indexing&quot; (TMI). Third, the resulting hypothesis set is represented in the index by reinterpreting existing data fields and repurposing auxiliary bits.</Paragraph> </Section> <Section position="3" start_page="415" end_page="416" type="sub_section"> <SectionTitle> 2.3 System Architecture </SectionTitle> <Paragraph position="0"> Fig. 1 shows the overall architecture of a search engine for audio/video search.
At indexing time, a media decoder first extracts the raw audio data from the different audio formats found on the Internet. A music detector prevents music from being indexed. The speech is then fed into a large-vocabulary continuous-speech recognizer (LVCSR), which outputs word lattices. The lattice indexer converts the lattices into the TMI representation, which is then merged into the inverted index. Available textual metadata is also indexed.</Paragraph> <Paragraph position="1"> At search time, all query terms are looked up in the index. For each document containing all query terms (determined by intersection), individual hit lists of each query term are retrieved and fed into a phrase matcher to identify full and partial phrase hits. Using this information, the ranker computes relevance scores. To achieve acceptable response times, a full-scale web engine would split this process up for parallel execution on multiple servers.</Paragraph> <Paragraph position="2"> Finally, the result presentation module creates snippets for the returned documents and composes the result page. [Fig. 1 (schematic): media decoder, music detector, speech recognizer, lattice indexer producing the TMI representation, and text indexer for metadata feed the inverted index; at search time, index lookup, phrase matching, ranking, and result presentation produce the result page.] In audio search, snippets would contain time information for individual word hits to allow easy navigation and preview.</Paragraph> </Section> </Section> <Section position="4" start_page="416" end_page="418" type="metho"> <SectionTitle> 3 Time-based Merging for Indexing </SectionTitle> <Paragraph position="0"> Our previous work (Yu, IEEE2005) has shown that in a word-spotting task, ranking by phrase posteriors is in theory optimal if (1) a search hit is considered relevant if the query phrase was indeed said there, and (2) the user expects a ranked list of results such that the cumulative relevance of the top-$n$ entries of the list, averaged over a range of $n$, is maximized. In the following, we first recapitulate the lattice notation and how phrase posteriors are calculated from the lattice. We then introduce time-based merging, which leads to an approximate representation of the original lattice. We describe two merging strategies, one that directly clusters word hypotheses (arc-based merging) and one that groups lattice nodes (node-based merging).</Paragraph> <Section position="1" start_page="416" end_page="416" type="sub_section"> <SectionTitle> 3.1 Posterior Lattice Representation </SectionTitle> <Paragraph position="0"> A lattice $L = (N, A, n_{start}, n_{end})$ is a directed acyclic graph (DAG) with $N$ being the set of nodes, $A$ the set of arcs, and $n_{start}, n_{end} \in N$ the unique initial and unique final node, respectively. Nodes represent times and possibly context conditions, while arcs represent word or phoneme hypotheses.1 Each node $n \in N$ has an associated time $t[n]$ and possibly an acoustic or language-model context condition. Arcs are 4-tuples $a = (S[a], E[a], I[a], w[a])$. $S[a], E[a] \in N$ denote the start and end node of the arc. $I[a]$ is the word identity. Last, $w[a]$ is a weight assigned to the arc by the recognizer.
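For concreteness, the lattice structures just defined can be sketched as follows (a minimal Python illustration; the class and field names are ours, not taken from the paper's implementation):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    time: float    # t[n]: node time in seconds
    # acoustic / language-model context conditions omitted in this sketch

@dataclass
class Arc:
    start: int     # S[a]: index of the start node
    end: int       # E[a]: index of the end node
    word: str      # I[a]: word identity
    weight: float  # w[a]: weight assigned by the recognizer

@dataclass
class Lattice:
    nodes: List[Node]
    arcs: List[Arc]
    n_start: int   # unique initial node
    n_end: int     # unique final node
```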
Specifically, $w[a] = p_{ac}(a)^{1/\lambda} \cdot P_{LM}(a)$, with acoustic likelihood $p_{ac}(a)$, LM probability $P_{LM}(a)$, and LM weight $\lambda$.</Paragraph> <Paragraph position="1"> 1Alternative definitions of lattices are possible, e.g. nodes representing words and arcs representing word transitions.</Paragraph> <Paragraph position="2"> In addition, we define paths $\pi = (a_1, \ldots, a_K)$ as sequences of connected arcs. We use the symbols $S$, $E$, $I$, and $w$ for paths as well to represent the respective properties of entire paths, i.e. the path start node $S[\pi] = S[a_1]$, path end node $E[\pi] = E[a_K]$, path label sequence $I[\pi] = (I[a_1], \ldots, I[a_K])$, and total path weight $w[\pi] = \prod_{k=1}^{K} w[a_k]$.</Paragraph> <Paragraph position="3"> Based on this, we define arc posteriors $P_{arc}[a]$ and node posteriors $P_{node}[n]$ as</Paragraph> <Paragraph position="5"> $$P_{arc}[a] = \frac{\alpha_{S[a]} \cdot w[a] \cdot \beta_{E[a]}}{\beta_{n_{start}}}, \qquad P_{node}[n] = \frac{\alpha_n \cdot \beta_n}{\beta_{n_{start}}},$$ with forward-backward probabilities $\alpha_n$, $\beta_n$ defined as</Paragraph> <Paragraph position="7"> $$\alpha_n = \sum_{\pi:\, S[\pi]=n_{start},\, E[\pi]=n} w[\pi], \qquad \beta_n = \sum_{\pi:\, S[\pi]=n,\, E[\pi]=n_{end}} w[\pi],$$ where $\alpha_{n_{start}} = \beta_{n_{end}} = 1$. $\alpha_n$ and $\beta_n$ can be conveniently computed using the well-known forward-backward recursion, e.g. (Wessel, 2000).</Paragraph> <Paragraph position="8"> With this, an alternative equivalent representation is possible by using word posteriors as arc weights. The posterior lattice representation stores four fields with each arc: $S[a]$, $E[a]$, $I[a]$, and $P_{arc}[a]$, and two fields with each node: $t[n]$ and $P_{node}[n]$.</Paragraph> <Paragraph position="9"> With the posterior lattice representation, the phrase posterior of a query string $Q = (q_1, \ldots, q_K)$ is computed as</Paragraph> <Paragraph position="11"> $$P(Q, t_s, t_e) = \sum_{\pi:\, I[\pi]=Q,\, t[S[\pi]]=t_s,\, t[E[\pi]]=t_e} \frac{\prod_{k=1}^{K} P_{arc}[a_k]}{\prod_{k=2}^{K} P_{node}[S[a_k]]}. \qquad (1)$$ This posterior representation is lossless. Its advantage is that posteriors are much more resilient to approximations than acoustic likelihoods. This paves the way for lossy approximations aiming at reducing lattice size.</Paragraph> </Section> <Section position="2" start_page="416" end_page="417" type="sub_section"> <SectionTitle> 3.2 Time-based Merging for Indexing </SectionTitle> <Paragraph position="0"> First, (Yu, HLT2005) has shown that node posteriors can be replaced by a constant $p_{node}$, with no negative effect on search accuracy. This approximation simplifies the denominator in Eq. 1 to $p_{node}^{K-1}$.</Paragraph> <Paragraph position="1"> We now merge all nodes associated with the same time points. As a result, the connection condition for two arcs depends only on the boundary time point. This operation gives the method its name, Time-based Merging for Indexing.</Paragraph> <Paragraph position="2"> TMI stores arcs with start and end time, while discarding the original node information that encoded dependency on LM state and phonetic context. This form is used, e.g., by (Wessel, 2000). Lattices are viewed as sets of items $h = (t_s[h], dur[h], I[h], P[h])$, with $t_s[h]$ being the start time, $dur[h]$ the duration, $I[h]$ the word identity, and $P[h]$ the posterior probability. Arcs with the same word identity and time boundaries but different start/end nodes are merged, their posteriors being summed up.</Paragraph> <Paragraph position="3"> These item sets can be organized in an inverted index, similar to a text index, for efficient search. A text search engine stores at least two fields with each word hit: word position and document identity. For TMI, two more fields need to be stored: duration and posterior. Start times can be stored by repurposing the word-position information.</Paragraph> <Paragraph position="4"> Posterior and duration go into auxiliary bits, as illustrated in the sketch below.
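As an illustration of how the two extra fields could share a fixed bit budget alongside the repurposed word-position field, consider the following sketch; the bit widths and resolutions are our assumptions, not the actual layout of any engine:

```python
# Pack a TMI hit into a 32-bit payload: 20 bits start time (10 ms units),
# 8 bits duration (10 ms units, capped), 4 bits quantized posterior.
# All widths are illustrative assumptions.

def quantize_posterior(p: float, bits: int = 4) -> int:
    """Map a posterior in (0, 1] to a small integer bucket."""
    levels = (1 << bits) - 1
    return max(0, min(levels, round(p * levels)))

def pack_hit(start_time_s: float, dur_s: float, posterior: float) -> int:
    ts = min((1 << 20) - 1, int(start_time_s * 100))   # 10 ms resolution
    dur = min((1 << 8) - 1, int(dur_s * 100))
    q = quantize_posterior(posterior)
    return (ts << 12) | (dur << 4) | q

def unpack_hit(payload: int):
    q = payload & 0xF
    dur = (payload >> 4) & 0xFF
    ts = payload >> 12
    return ts / 100.0, dur / 100.0, q / 15.0

# Example: a hit starting at 12.34 s, lasting 0.41 s, posterior 0.72.
payload = pack_hit(12.34, 0.41, 0.72)
print(unpack_hit(payload))   # (12.34, 0.41, 0.733...)
```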
If the index has the ability to store side information for documents, bits can be saved in the main index by recording all time points in a look-up table, and storing start times and durations as table indices instead of absolute times. This works because the actual time values are only needed for result presentation. Note that the TMI index is really an extension of a linear-text index, and the same code base can easily accommodate indexing both speech content and textual metadata.</Paragraph> <Paragraph position="5"> With this, a multi-word phrase match is defined as a sequence of items $h_1, \ldots, h_K$ matching the query string $Q = (q_1, \ldots, q_K)$, i.e.</Paragraph> <Paragraph position="7"> $$I[h_k] = q_k, \quad k = 1, \ldots, K; \qquad t_s[h_{k+1}] = t_s[h_k] + dur[h_k], \quad k = 1, \ldots, K-1,$$ and its phrase posterior is calculated (using the approximate denominator) as</Paragraph> <Paragraph position="9"> $$P(Q, t_s, t_e) = \sum_{h_1, \ldots, h_K} \frac{\prod_{k=1}^{K} P[h_k]}{p_{node}^{K-1}},$$ summing over all item sequences with $t_s = t_s[h_1]$ and $t_e = t_s[h_K] + dur[h_K]$.</Paragraph> <Paragraph position="10"> Regular text search engines cannot directly support this, but the code modification and additional CPU cost are small. The major factor is disk access, which is still linear in the index size.</Paragraph> <Paragraph position="11"> We call this index representation &quot;TMI-base.&quot; It provides a substantial reduction in the number of index entries compared to the original lattices. However, it is obviously an approximate representation. In particular, there are now conditions under which two word hypotheses can be matched as part of a phrase although they were not connected in the original lattice. This approximation seems sensible, though, as the words involved are still required to have precisely matching word boundaries. In fact, it has been shown that this representation can be used for direct word-error minimization during decoding (Wessel, 2000). For further reduction of the index size, we now relax the merging condition. The next two sections introduce two alternate ways of merging.</Paragraph> </Section> <Section position="3" start_page="417" end_page="417" type="sub_section"> <SectionTitle> 3.3 Arc-Based Merging </SectionTitle> <Paragraph position="0"> A straightforward way is to allow a tolerance on time boundaries. Practically, this is done by the following bottom-up clustering procedure: * collect arcs with the same word identity; * find the arc $a^*$ with the best posterior, and set the resulting item's time boundaries to those of $a^*$; * merge into this item all overlapping arcs $a$ whose boundaries lie within a tolerance $\Delta$ of those of $a^*$, i.e. $t[S[a^*]] - \Delta \le t[S[a]] \le t[S[a^*]] + \Delta$ and $t[E[a^*]] - \Delta \le t[E[a]] \le t[E[a^*]] + \Delta$, summing their posteriors; repeat with the remaining arcs.</Paragraph> <Paragraph position="2"> We call this method &quot;TMI-arc&quot; to denote its origin in the direct clustering of arcs.</Paragraph> <Paragraph position="3"> Note that the resulting structure can generally not be directly represented as a lattice anymore, as formerly connected hypotheses may now have slightly mismatching time boundaries. To compensate for this, the item connection condition in phrase matching needs to be relaxed as well: $t_s[h_{i+1}] - \Delta \le t_s[h_i] + dur[h_i] \le t_s[h_{i+1}] + \Delta$. The storage cost for each TMI-arc item is the same as for TMI-base, while the number of items is reduced.</Paragraph> </Section> <Section position="4" start_page="417" end_page="418" type="sub_section"> <SectionTitle> 3.4 Node-Based Merging </SectionTitle> <Paragraph position="0"> An alternative way is to group ranges of time points, and then merge hypotheses whose time boundaries were grouped together.</Paragraph> <Paragraph position="1"> The simplest possibility is to quantize time points into fixed intervals, such as 250 ms. Hypotheses are merged if their quantized time boundaries are identical, as in the sketch below.
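A minimal sketch of this quantize-and-merge step (all names are ours; 0.25 s corresponds to the 250 ms quantization interval mentioned above):

```python
from collections import defaultdict

def merge_by_time_quantization(items, quantum=0.25):
    """Merge TMI items whose word identity and quantized time
    boundaries coincide, summing their posteriors.
    items: iterable of (start_s, dur_s, word, posterior) tuples."""
    merged = defaultdict(float)
    for ts, dur, word, p in items:
        qs = round(ts / quantum)            # quantized start
        qe = round((ts + dur) / quantum)    # quantized end
        merged[(word, qs, qe)] += p
    return [(qs * quantum, (qe - qs) * quantum, word, p)
            for (word, qs, qe), p in merged.items()]

# Two hypotheses with slightly different boundaries collapse into one item:
hyps = [(1.02, 0.40, "grammar", 0.35), (0.98, 0.46, "grammar", 0.20)]
print(merge_by_time_quantization(hyps))
# [(1.0, 0.5, 'grammar', 0.55)]
```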
We call this method &quot;TMI-timequant.&quot; Besides reducing index size by allowing more item merging, TMI-timequant has another important property: since start times and durations are heavily quantized, the number of bits used for storing this information with the items in the index can be significantly reduced.</Paragraph> <Paragraph position="2"> The disadvantage of this method is that loops are frequently generated this way (quantized duration of 0), providing sub-optimal phrase-matching constraints.</Paragraph> <Paragraph position="3"> To alleviate this problem, we modify the merging by forbidding loops to be created: two time points can be grouped together if (1) their difference is below a threshold (like 250 ms); and (2) there is no word hypothesis starting and ending in the same group. As a refinement, the second condition is relaxed by a pruning threshold, in that hypotheses with posteriors below the threshold do not block node merging.</Paragraph> <Paragraph position="5"> Amongst the manifold of groupings that satisfy these two conditions, the one leading to the smallest number of groups is considered the optimal solution. It can be found using dynamic programming: * line up all existing time boundaries in ascending order, $t_i < t_{i+1}$, $i = 1, \ldots, N$; * for each time point $t_i$, find the furthest time point it can be grouped with given the constraints, denoting its index as $T[t_i]$; * set group counts $C[t_0] = 1$; $C[t_i] = \infty$, $i > 0$; * set backpointers $B[t_0] = -1$; $B[t_i] = t_i$, $i > 0$; * for $i = 1, \ldots, N$: - for $j = i+1, \ldots, T[t_i]$: if $C[t_{j+1}] > C[t_i] + 1$: set $C[t_{j+1}] = C[t_i] + 1$ and $B[t_{j+1}] = t_i$; * trace back and merge nodes: - set $k = N$ and repeat until $k = -1$: group the time points from $B[t_k]$ to $t_{k-1}$, then set $k = B[t_k]$.</Paragraph> <Paragraph position="6"> This method can be applied to the TMI-base representation, or alternatively directly to the posterior lattice; in the latter case, the above algorithm needs to be adapted to operate on nodes rather than time points. We call this method &quot;TMI-node.&quot; If, as mentioned before, times and durations are stored as indices into a look-up table, TMI-node is highly space efficient. In most cases, the index difference between end and start point is 1, and in practical terms the index difference can be capped by a small number below 10.</Paragraph> </Section> </Section> <Section position="5" start_page="418" end_page="419" type="metho"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="418" end_page="419" type="sub_section"> <SectionTitle> 4.1 Setup </SectionTitle> <Paragraph position="0"> We have evaluated our system on three different corpora, in an attempt to represent popular types of audio currently found on the Internet: * podcasts: short clips ranging from mainstream media like ABC and CNN to non-professionally produced edge content; * video clips, acquired from MSN Video; * online lectures: a subset of the MIT iCampus lecture collection (Glass, 2004).</Paragraph> <Paragraph position="1"> In relation to our goal of web-scale indexing, the podcast and video sets are minuscule in size (about 1.5 hours each). Nevertheless, they are suitable for investigating the effectiveness of the TMI method w.r.t. phrase-spotting accuracy. Experiments on relevance ranking were conducted only on the much larger lecture set (170 hours).
For the iCampus lecture corpus, the same set of queries was used as in (Chelba, 2005), which had been collected from a group of users. Example keywords are computer science and context free grammar. On the other two sets, an automatic procedure described in (Seide, 2004) was used to select keywords. Example keywords are playoffs, beach Florida, and American Express financial services.</Paragraph> <Paragraph position="2"> A standard speaker-independent trigram LVCSR system was used to generate the raw speech lattices. For video and podcasts, models were trained on a combination of telephone conversations (Switchboard), broadcast news, and meetings, downsampled to 8 kHz, to accommodate a wide range of audio types and speaking styles. For lectures, an older setup was used, based on a dictation engine without adaptation to the lecture task. Due to the larger corpus size, lattices for lectures were pruned much more sharply. Word error rates (WER) and corpus setups are listed in Table 1. It should be noted that the word error rates vary greatly within the podcast and video corpora, ranging from 30% (clean broadcast news) to over 80% (accented, reverberated speech with a cheering crowd).</Paragraph> <Paragraph position="3"> Each indexing method is evaluated by a phrase-spotting task and a document retrieval task.</Paragraph> <Paragraph position="4"> We use the &quot;Figure Of Merit&quot; (FOM) metric defined by NIST for word-spotting evaluations. In its original form, FOM is the detection rate averaged over the range of [0..10] false alarms per hour per keyword. We generalized this metric to the spotting of phrases, which can be multi-word or single-word. A multi-word phrase is matched if all of its words match in order.</Paragraph> <Paragraph position="5"> Since automatic word alignment can be troublesome for long audio files in the presence of errors in the reference transcript, we reduced the time resolution of the FOM metric and used the sentence as the basic time unit.</Paragraph> <Paragraph position="6"> A phrase hit is considered correct if an actual occurrence of the phrase is found in the same sentence. Multiple hits of the same phrase within one sentence are counted as a single hit, their posterior probabilities being summed up for ranking.</Paragraph> <Paragraph position="7"> The segmentation of the audio files is based on the reference transcript. Segments are on average about 10 seconds long. In a real system, sentence boundaries are of course unknown, but previous experiments have shown that the actual segmentation does not have a significant impact on the results.</Paragraph> <Paragraph position="8"> The choice and optimization of a relevance-ranking formula is a difficult problem that is beyond the scope of this paper. We chose a simple document ranking method as described in (Chelba, 2005): Given query $Q = (q_1, \ldots, q_L)$, for each document $D$, expected term frequencies (ETF) of all sub-strings $(q_i, \ldots, q_j)$, $1 \le i \le j \le L$, are computed by summing phrase posteriors over all hits:</Paragraph> <Paragraph position="10"> $$ETF(q_i \ldots q_j \,|\, D) = \sum_{(t_s, t_e)} P(q_i \ldots q_j, t_s, t_e).$$ A document is returned if all query words are present. The relevance score is calculated as</Paragraph> <Paragraph position="12"> $$S(D, Q) = \sum_{1 \le i \le j \le L} w_{j-i+1} \cdot ETF(q_i \ldots q_j \,|\, D),$$ where the weights $w_\ell$ serve to give higher weight to longer sub-strings. They were chosen as $w_\ell = 1 + 1000 \cdot \ell$; no further optimization was performed.</Paragraph> <Paragraph position="13"> Only the lecture set is used for the document retrieval evaluation. The whole set consists of 169 documents, with an average of 391 segments per document.
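As an illustration, the document scoring just described can be sketched as follows; the `etf` helper stands in for the ETF computation defined above, and all names are our assumptions rather than the paper's implementation:

```python
def relevance_score(query_words, doc, etf, weight=lambda l: 1 + 1000 * l):
    """Score a document by weighted expected term frequencies (ETF)
    of all query sub-strings; returns None if a query word is absent."""
    L = len(query_words)
    # A document is returned only if every query word is present.
    if any(etf(doc, (q,)) == 0.0 for q in query_words):
        return None
    score = 0.0
    for i in range(L):
        for j in range(i, L):
            sub = tuple(query_words[i:j + 1])
            score += weight(j - i + 1) * etf(doc, sub)
    return score

# Toy usage with a made-up ETF table (w_1 = 1001, w_2 = 2001, w_3 = 3001):
table = {("context",): 1.8, ("free",): 2.1, ("grammar",): 1.5,
         ("context", "free"): 0.9, ("free", "grammar"): 0.7,
         ("context", "free", "grammar"): 0.6}
etf = lambda doc, sub: table.get(sub, 0.0)
print(relevance_score(["context", "free", "grammar"], None, etf))
# -> 10407.6 (up to float rounding)
```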
The evaluation metric is the mean average precision (mAP) as computed by the standard trec_eval package used in the TREC evaluations (NIST, 2005). Since actual relevance judgements were not available for this corpus, we use the output of a state-of-the-art text retrieval engine on the ground-truth transcripts as the reference. The idea is that if human judgements are not available, the next best thing to do is to assess how close our spoken-document retrieval system gets to a text engine applied to reference transcripts. Although one should take the absolute mAP scores with a pinch of salt, we believe that comparing the relative changes of these mAP scores is meaningful.</Paragraph> </Section> <Section position="2" start_page="419" end_page="419" type="sub_section"> <SectionTitle> 4.2 Lattice Search and Best Path Baseline </SectionTitle> <Paragraph position="0"> Table 2 lists the word-spotting and document retrieval results of direct search in the original raw lattices, as well as of searching the top-1 path. Results are listed separately for single- and multi-word queries. For the phrase-spotting task, a consistent improvement of about 15% is observed on all sets, re-emphasizing the importance of searching alternates. For document retrieval, the accuracy (mAP) is also significantly improved, from 53% to 62%.</Paragraph> <Paragraph position="1"> Table 3 compares the different indexing methods with respect to search accuracy and index size. We only show results for multi-word queries, as it can be shown that results for single-word queries must be identical. The index size is measured in index entries per spoken word, i.e. it does not reflect that different indexing methods may require different numbers of bits in the actual index store. In addition to the four types of TMI methods, we include an alternative posterior-lattice indexing method in our comparison called PSPL (position-specific posterior lattices) (Chelba, 2005). A PSPL index is constructed by enumerating all paths through a lattice, representing each path as a linear text, and adding each text to the index, each time starting over from word position 1. Each word hypothesis on each path is assigned the posterior probability of the entire path. Instances of the same word occurring at the same text position are merged, accumulating their posterior probabilities. This way, each index entry represents the posterior probability that a word occurs at a particular position in the actual spoken word sequence.</Paragraph> <Paragraph position="2"> PSPL is an attractive alternative to the work presented in this paper because it continues to use the notion of a word position instead of time, with the advantage that existing implementations of phrase-matching conditions apply without modification. A sketch of the PSPL construction is given below.</Paragraph> <Paragraph position="3"> The results show that, compared with the direct raw-lattice search, all indexing methods have only a slight impact on both word-spotting and document retrieval accuracies. Against our expectation, in many cases improved accuracies are observed. These are caused by the creation of additional paths compared to the original lattice, improving recall. It is not yet clear how to exploit this in a systematic manner.</Paragraph>
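For comparison, the PSPL construction described above can be sketched as follows (a naive depth-first path enumeration, exponential in the worst case and meant only to illustrate the definition; it treats each arc weight as a conditional probability so that a path's posterior is their product, which is an assumption of this sketch):

```python
from collections import defaultdict

def build_pspl(arcs, n_start, n_end):
    """arcs: list of (start_node, end_node, word, prob) tuples.
    Returns {(word, position): accumulated posterior}."""
    out = defaultdict(list)
    for s, e, w, p in arcs:
        out[s].append((e, w, p))
    index = defaultdict(float)

    def walk(node, path, path_prob):
        if node == n_end:
            # Each word on the path gets the whole path's posterior,
            # merged by (word, position).
            for k, word in enumerate(path):
                index[(word, k + 1)] += path_prob
            return
        for nxt, word, p in out[node]:
            walk(nxt, path + [word], path_prob * p)

    walk(n_start, [], 1.0)
    return dict(index)

# Tiny lattice: paths "a b" (posterior 0.6) and "a c" (posterior 0.4).
arcs = [(0, 1, "a", 1.0), (1, 2, "b", 0.6), (1, 2, "c", 0.4)]
print(build_pspl(arcs, 0, 2))
# {('a', 1): 1.0, ('b', 2): 0.6, ('c', 2): 0.4}
```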
<Paragraph position="5"> W.r.t. storage efficiency, the TMI merging methods produce about five times fewer index entries than the original lattices for lectures (and an order of magnitude fewer for podcasts and videos, which were recognized with rather wasteful pruning thresholds). This can be further improved by pruning.</Paragraph> <Paragraph position="6"> Index size and accuracy can be balanced by pruning low-scoring index entries. Experiments have shown that the optimal pruning strategy differs slightly from method to method. For the TMI set, the index is pruned by removing all entries with posterior probabilities below a fixed threshold. In addition, for TMI-node we enforce that the best path is not pruned. For PSPL, an index entry at a particular word position is removed if its posterior is worse by a fixed factor than the best index entry for the same word position. This also guarantees that the best path is never pruned.</Paragraph> <Paragraph position="7"> Fig. 2 depicts the trade-off of size and accuracy for the different indexing methods. TMI-node provides the best trade-off. The last block of Table 3 shows results for all indexing methods when pruned, with the respective pruning thresholds adjusted such that the number of index entries is approximately five times that of the top-1 transcript. We chose this size because reducing the index to this size still has limited impact on accuracy (0.5 points for podcasts, 3.5 for videos, and none for lectures) while keeping operating characteristics (storage size, CPU, disk) within an order of magnitude of text search.</Paragraph> </Section> </Section> <Section position="6" start_page="419" end_page="419" type="metho"> <SectionTitle> 5 The System </SectionTitle> <Paragraph position="0"> The presented technique was implemented in a research prototype, shown in Fig. 3. About 780 hours of audio documents, including video clips from MSN Video and audio files from the most popular podcasts, were indexed. The index is disk-based; its size is 830 MB, using a somewhat wasteful XML representation for research convenience.</Paragraph> <Paragraph position="1"> Typically, searches are executed within 0.5 seconds.</Paragraph> <Paragraph position="2"> The user interface resembles a typical text search engine. A media player is embedded for immediate within-page playback. Snippets are generated for previewing the search results. Each word in a snippet has its original time point associated with it, and a click on it positions the media player at the corresponding time in the document.</Paragraph> </Section> </Paper>