<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1049"> <Title>Hierarchical Non-Emitting Markov Models</Title> <Section position="7" start_page="383" end_page="384" type="concl"> <SectionTitle> 5 Conclusion </SectionTitle> <Paragraph position="0"> The power of the non-emitting model comes from its ability to represent additional information in its state distribution. In the proof of Lemma 3.1 above, we used the state distribution to represent a long-distance dependency. We conjecture, however, that the empirical success of the non-emitting model is due to its ability to remember to ignore (i.e., to forget) a misleading history at a point of apparent independence. A point of apparent independence occurs when we have adequate statistics for two strings $x^{n-1}$ and $y^n$ but not yet for their concatenation $x^{n-1}y^n$. In the most extreme case, the frequencies of $x^{n-1}$ and $y^n$ are high, but the frequency of even the medial bigram $x_{n-1}y_1$ is low. In such a situation, we would like to ignore the entire history $x^{n-1}$ when predicting $y^n$, because all $\delta_i(y_j \mid x^{n-1}y^{j-1})$ will be close to zero for $i &lt; n$. To simplify the example, we assume that</Paragraph> <Paragraph position="2"> In such a situation, the interpolated model must repeatedly transition past some suffix of the history $x^{n-1}$ for each of the next $n-1$ predictions, and so the total probability assigned to $p_c(y^n \mid \epsilon)$ by the interpolated model is a product of $n(n-1)/2$ probabilities. In contrast, the non-emitting model will immediately transition to the empty context in order to predict the first symbol $y_1$, and then it need never again transition past any suffix of $x^{n-1}$. Consequently, the total probability assigned to $p_\epsilon(y^n \mid \epsilon)$ by the non-emitting model is a product of only $n-1$ probabilities.</Paragraph> <Paragraph position="3"> Given the same state transition probabilities, note that (4) must be considerably less than (5) because probabilities lie in [0, 1].
Thus, we believe that the empirical success of the non-emitting model comes from its ability to effectively ignore a misleading history rather than from its ability to remember distant events.</Paragraph> <Paragraph position="4"> Finally, we note that the use of hierarchical non-emitting transitions is a general technique that may be employed in any time series model, including context models and backoff models.</Paragraph> </Section> </Paper>