An Information-Theoretic Empirical Analysis of Dependency-Based Feature Types for Word Prediction Models

2 Framework

2.1 Features for Language Models

A language model predicts a given word based on its history. By the chain rule of conditional probabilities, a language model can be represented in left-to-right fashion as

$$p(S) = \prod_{i=1}^{n} p(w_i \mid h_i),$$

where S denotes a sequence of words w_0, w_1, ..., w_n, and h_i denotes the history of w_i (0 < i <= n).

In order to construct a language model, the individual probabilities p(w_i | h_i) should be estimated from the training set. Since there are far too many possible histories for the evidence available in the training set, several feature types must be used to divide the space of possible histories into equivalence classes via a map

$$\Phi : h_i \mapsto (f_1, f_2, \ldots, f_K) = [h_i]$$

to make the model feasible to implement. In speech recognition, these feature types are most often based on fixed physical positions, as in N-gram models: the feature types can be the words before the predicted word, or the parts of speech of the words before the predicted word. In order to remedy the linguistic implausibility of N-gram models and their inefficient use of the training set, we would like to incorporate grammatically based feature types into the language model, which could exploit the predictive power of words that lie outside N-gram range, as proposed in earlier work on grammatically structured language models. However, we would like to do so without sacrificing the known performance advantages of N-gram models [9]. We follow the general approach of those authors in taking dependency grammar as a framework, since it extends N-gram models more naturally than stochastic context-free grammars do.

The feature types studied in this paper are combinations of the fixed physical distance features and the grammatically based features listed in Table 1 and depicted graphically in Figure 1.

To understand the feature types, consider the task of predicting "作业 (zuo4 ye4, assignment)" in the example sentence shown in Figure 2. We denote this word by O, which stands for "observed". The word bigram feature B is the nearest preceding word of O, in this case "英文 (ying1 wen2, English)". The nearest word modifying O is denoted by M, and is also "英文 (ying1 wen2, English)" in this case. Conversely, the nearest preceding word modified by O is denoted by R, here "做 (zuo4, do)". BP is the part of speech of "英文 (ying1 wen2, English)", in this case "n (noun)". Similarly, MP is the POS of "英文 (ying1 wen2, English)", and RP is the POS "v (verb)" of "做 (zuo4, do)". The modifying type, or dependency relation, between "英文 (ying1 wen2, English)" and "作业 (zuo4 ye4, assignment)" is denoted by MT, in this case "np (noun phrase)". RT is the modifying type between "做 (zuo4, do)" and "作业 (zuo4 ye4, assignment)", here "vp (verb phrase)".
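The following is a minimal sketch of how these feature types could be read off a dependency-annotated sentence. The token representation, attachment indices, and labels are illustrative assumptions made for the example, not the paper's data format.

```python
# Illustrative sketch: extracting the Table 1 feature types for a predicted
# word O from a dependency-annotated sentence (data layout is assumed).
from collections import namedtuple

Token = namedtuple("Token", "index word pos head relation")
# head = index of the word this token modifies (-1 for the root);
# relation = modifying type / dependency relation label (e.g. "np", "vp").

def extract_features(tokens, o_index):
    """Return the B/BP, M/MP/MT, R/RP/RT features for the word at o_index."""
    feats = {}
    # B / BP: the nearest preceding word (word bigram) and its part of speech.
    if o_index > 0:
        prev = tokens[o_index - 1]
        feats["B"], feats["BP"] = prev.word, prev.pos
    # M / MP / MT: the nearest preceding word that modifies O, its POS,
    # and the modifying type of that dependency.
    modifiers = [t for t in tokens[:o_index] if t.head == o_index]
    if modifiers:
        m = modifiers[-1]                      # nearest preceding modifier
        feats["M"], feats["MP"], feats["MT"] = m.word, m.pos, m.relation
    # R / RP / RT: the nearest preceding word that O itself modifies,
    # its POS, and the modifying type of that dependency.
    o = tokens[o_index]
    if 0 <= o.head < o_index:
        r = tokens[o.head]
        feats["R"], feats["RP"], feats["RT"] = r.word, r.pos, o.relation
    return feats

# Toy version of the Figure 2 example: 做 (do) 英文 (English) 作业 (assignment),
# where 英文 modifies 作业 (np) and 作业 modifies 做 (vp).
sentence = [
    Token(0, "做",   "v", -1, "root"),
    Token(1, "英文", "n",  2, "np"),
    Token(2, "作业", "n",  0, "vp"),
]
print(extract_features(sentence, 2))
# {'B': '英文', 'BP': 'n', 'M': '英文', 'MP': 'n', 'MT': 'np',
#  'R': '做', 'RP': 'v', 'RT': 'vp'}
```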
Faced with so many feature types, one of the dilemmas for language modeling is which feature types, or feature type combinations, should be used. Experience has shown that feature types should not be selected by intuition alone.

In order to obtain a more reliable reference to guide the addition of structural features to a stochastic language model, our objective is to establish in principle the amount of information available from various long-distance dependency features and feature combinations. This can be regarded as an upper bound on the improvement that could be obtained by augmenting a language model with the corresponding features. We evaluate the informativeness of several feature types drawn from bigram context and dependency grammatical structure from the viewpoint of information theory. The experiments draw some conclusions on which feature types should or should not be selected given specific baseline assumptions, and provide a ranking of the feature types according to their importance from this viewpoint.

2.2 Information-based Model for Feature Type Analysis

We now introduce some relevant concepts from information theory that we adopt as a foundation for analyzing feature types.

Information quantity (IQ). The information quantity of a feature type F with respect to the predicted word O is defined using the standard definition of average mutual information [10]; we define IQ as the average mutual information between F and O:

$$IQ(F;O) = \sum_{F,O} p(F,O)\,\log\frac{p(F,O)}{p(F)\,p(O)}$$

Information gain (IG). The information gain of adding F_2 on top of a baseline model that already employs F_1 for predicting word O is defined as the average mutual information between the predicted word O and feature type F_2, given that feature type F_1 is known:

$$IG(F_2;O \mid F_1) = \sum_{F_1,F_2,O} p(F_1,F_2,O)\,\log\frac{p(F_2,O \mid F_1)}{p(F_2 \mid F_1)\,p(O \mid F_1)}$$

Information redundancy (IR). The above two definitions lead naturally to a complementary concept of information redundancy. IR(F_1,F_2;O) denotes the redundant information between F_1 and F_2 in predicting O, defined as the difference between IQ(F_2;O) and IG(F_2;O|F_1), or equivalently the difference between IQ(F_1;O) and IG(F_1;O|F_2):

$$IR(F_1,F_2;O) = IQ(F_2;O) - IG(F_2;O \mid F_1) = IQ(F_1;O) - IG(F_1;O \mid F_2)$$

We shall use IG to select the feature type series, and use IR to analyze the degree of overlap between a variant and the baseline.

3 The Corpus Used in the Experiments

The training corpus used in our experiments is a treebank consisting of Chinese primary school texts [12]. The basic statistics characterizing the training set are summarized in Table 2.

In the experiments, we use 80% of the above corpus as a training set for estimating the various co-occurrence probabilities, while 10% of the corpus is used as a test set to compute the information gain, information quantity, and information redundancy. The feature types we use in the experiments are those shown in Table 1.
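Returning to the definitions of Section 2.2, the following is a minimal sketch of how IQ, IG, and IR could be estimated from co-occurrence counts gathered on such a train/test split. The event representation (tuples of strings) and the unsmoothed maximum-likelihood estimates are simplifying assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch: estimating IQ, IG, and IR from raw co-occurrence counts
# of (F1, F2, O) events. Maximum-likelihood estimates, no smoothing (assumed).
from collections import Counter
from math import log2

def iq(pairs):
    """IQ(F;O): average mutual information between feature F and word O."""
    n = len(pairs)
    c_fo = Counter(pairs)
    c_f = Counter(f for f, _ in pairs)
    c_o = Counter(o for _, o in pairs)
    return sum(c / n * log2((c / n) / ((c_f[f] / n) * (c_o[o] / n)))
               for (f, o), c in c_fo.items())

def ig(triples):
    """IG(F2;O|F1): average mutual information between F2 and O given F1."""
    n = len(triples)
    c_123 = Counter(triples)
    c_1 = Counter(f1 for f1, _, _ in triples)
    c_12 = Counter((f1, f2) for f1, f2, _ in triples)
    c_1o = Counter((f1, o) for f1, _, o in triples)
    total = 0.0
    for (f1, f2, o), c in c_123.items():
        p_joint_given_f1 = c / c_1[f1]
        p_f2_given_f1 = c_12[f1, f2] / c_1[f1]
        p_o_given_f1 = c_1o[f1, o] / c_1[f1]
        total += c / n * log2(p_joint_given_f1 / (p_f2_given_f1 * p_o_given_f1))
    return total

def ir(triples):
    """IR(F1,F2;O) = IQ(F2;O) - IG(F2;O|F1)."""
    return iq([(f2, o) for _, f2, o in triples]) - ig(triples)

# Tiny toy data: (F1, F2, O) = (bigram word B, modifying word M, observed word O).
data = [("英文", "英文", "作业"), ("中文", "中文", "作业"),
        ("英文", "英文", "课本"), ("做", "中文", "作业")]
print(iq([(f2, o) for _, f2, o in data]), ig(data), ir(data))
```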