<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0618">
  <Title>An Information-Theoretic Empirical Analysis of Dependency-Based Feature Types for Word Prediction Models</Title>
  <Section position="6" start_page="140" end_page="145" type="evalu">
    <SectionTitle>
4 Experimental Results and Analysis
</SectionTitle>
    <Paragraph position="0"> Our experiments aim to quantitatively establish the amount of information intrinsically present in each feature type, and the information gain of each feature type on top of various baselines. We were led to a number of conclusions on the predictive power of various feature types and feature type combinations, some in support of traditional linguistic intuition and some more surprising. These observations provide guidelines for language modeling.</Paragraph>
    <Paragraph position="1"> Below, we warm up with a well-known observation, and then move on to more focussed analysis.</Paragraph>
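The metrics IQ, IG, and IR are used throughout this section without being restated here. Assuming, as the reported values suggest (for example, 1.581 - 0.683 = 0.898 below), that IQ(F;O) is the empirical mutual information between a feature type F and the predicted word O, that IG(F;O|B) is the conditional mutual information given a baseline feature type B, and that IR(B,F;O) is their difference, a minimal maximum-likelihood sketch of how such quantities could be estimated from co-occurrence counts follows; the function names are illustrative and are not taken from the paper.

from collections import Counter
from math import log2

def information_quantity(pairs):
    # IQ(F;O): empirical mutual information, in bits, estimated from (feature, word) pairs.
    n = len(pairs)
    joint = Counter(pairs)
    f_marg = Counter(f for f, _ in pairs)
    o_marg = Counter(o for _, o in pairs)
    return sum((c / n) * log2(c * n / (f_marg[f] * o_marg[o]))
               for (f, o), c in joint.items())

def information_gain(triples):
    # IG(F;O|B): empirical conditional mutual information, estimated from (b, f, o) triples.
    n = len(triples)
    bfo = Counter(triples)
    bf = Counter((b, f) for b, f, _ in triples)
    bo = Counter((b, o) for b, _, o in triples)
    b_marg = Counter(b for b, _, _ in triples)
    return sum((c / n) * log2(c * b_marg[b] / (bf[(b, f)] * bo[(b, o)]))
               for (b, f, o), c in bfo.items())

def information_redundancy(iq_f, ig_f_given_b):
    # IR(B,F;O) = IQ(F;O) - IG(F;O|B), matching the relation among the figures quoted below.
    return iq_f - ig_f_given_b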
    <Section position="1" start_page="140" end_page="142" type="sub_section">
      <SectionTitle>
4.1 Grammatically motivated feature types do not easily yield as much predictive information as simple bigrams.
</SectionTitle>
      <Paragraph position="1"> From a traditional linguistics viewpoint, R (the nearest preceding word modified by the predicted word O) should be more significant for word prediction than the bigram predictor B (the nearest preceding word of the predicted word O).</Paragraph>
      <Paragraph position="2"> Consider the sentence shown in Figure 3, where O is "di4tu2/map", B is the aspectual marker "zhe0", and R is "gua4/hang". It seems somehow obvious that R ("gua4/hang") should be more predictive for O ("di4tu2/map") than B (the aspectual marker "zhe0"). However, as is well known in speech recognition and statistical NLP research, the opposite turns out to be true. This is corroborated by the empirical information quantities shown in Table 3, which shows that B has the largest information quantity of all the feature types. That bigram features outperform the grammatically based features is commonly attributed to the predictive power of the immediately preceding lexical context. Similarly, from a traditional linguistics viewpoint, M (the nearest preceding word modifying the predicted word O) should be more significant for word prediction than B (the nearest preceding word of the predicted word O).</Paragraph>
      <Paragraph position="3"> For example, consider the sentence shown in Figure 4, "cong2/from gong1yuan2/garden zou3/walk dao4/to hai3bian1/seaside" (He walks from the garden to the seaside), where O is "zou3/walk", B is "gong1yuan2/garden", and M is "cong2/from". Again, it seems that M ("cong2/from") ought to be more predictive of O ("zou3/walk") than B ("gong1yuan2/garden"), but from Table 3 we see that the opposite is true.</Paragraph>
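As a concrete reading of the feature definitions used above, the sketch below extracts B, R, and M for one position of a dependency-parsed sentence. The heads-array representation, the reading of R as the parent of O (taken only when it precedes O), and the parse of the Figure 4 sentence are all assumptions for illustration, not the paper's implementation.

def extract_features(words, heads, i):
    # B: the word immediately preceding position i (the bigram predictor).
    # R: the word that words[i] modifies (its parent), taken here only if it precedes i.
    # M: the nearest preceding word that modifies words[i] (a preceding child).
    B = words[i - 1] if i > 0 else None
    parent = heads[i]
    R = words[parent] if parent is not None and i > parent else None
    preceding_children = [j for j in range(i) if heads[j] == i]
    M = words[max(preceding_children)] if preceding_children else None
    return B, R, M

# Figure 4 sentence (pinyin only), with an assumed parse in which cong2 modifies
# zou3 and gong1yuan2 modifies cong2; heads[j] is the index of the word token j modifies.
words = ["cong2", "gong1yuan2", "zou3", "dao4", "hai3bian1"]
heads = [2, 0, None, 2, 3]
print(extract_features(words, heads, 2))  # -> ('gong1yuan2', None, 'cong2')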
      <Paragraph position="4"> From a linguistic viewpoint, the explanation for the fact that R (IQ(R;O)=1.581) is less predictive than B (IQ(B;O)=3.826) may be as follows. Within a sentence, every word has exactly one B and one R feature. On the one hand, the B feature always lies to the left of O, since it is by definition the preceding word; on the other hand, R generally lies to the right of O in Chinese sentences (with a few notable exceptions such as prepositional phrases). When R is not in the history preceding O, it cannot be used to predict O.</Paragraph>
      <Paragraph position="7"> Similarly, a possible factor in the fact that M (IQ(M;O)=2.237) is less predictive than B is that M sometimes lies to the right of O. Another factor in the case of M is that none of the leaf nodes in a dependency tree have an M.</Paragraph>
    </Section>
    <Section position="2" start_page="142" end_page="142" type="sub_section">
      <SectionTitle>
4.2 Although R (the word modified by the predicted word) is less effective than M (the word modifying the predicted word) when they are used individually for word prediction, R is more effective than M when they are used on top of a standard bigram model (the feature B).
</SectionTitle>
      <Paragraph position="1"> Consider the following measurements from our experiments: IQ(R;O)=1.581 bits, which is less than IQ(M;O)=2.237 bits, whereas IG(R;O|B)=0.683 bits, which is greater than IG(M;O|B)=0.541 bits. That is, given a baseline bigram model employing only B features, augmenting the model with R features brings more information than augmenting it with M features. Therefore, in principle, a language model which incorporates the bigram feature and feature type R can achieve higher performance than a model which incorporates the bigram feature and M.</Paragraph>
      <Paragraph position="2"> We believe this is because there is more information redundancy between M and B than between R and B. From the above data, we see that there is large information redundancy both between B and R (IR(B,R;O)=0.898) and between B and M (IR(B,M;O)=1.696). One explanation is that B and M are often in fact the same word, where the nearest preceding word modifies the predicted word. For example, consider the sentence in Figure 5, where "zuo4ye4/assignment" is the predicted O, and B and M are the same word "ying4wen2/English".</Paragraph>
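(The redundancy values quoted above are consistent with reading IR as the difference between a feature type's standalone information quantity and its gain over the baseline, a relation inferred here from the reported numbers rather than restated in this section: IR(B,R;O) = IQ(R;O) - IG(R;O|B) = 1.581 - 0.683 = 0.898 bits, and IR(B,M;O) = IQ(M;O) - IG(M;O|B) = 2.237 - 0.541 = 1.696 bits.)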
      <Paragraph position="3"> It is also possible that B and R are the same word, where the nearest preceding word is modified by the predicted word. For example, the dependency grammatical structure of the phrase "zai4/in jiao4shi4/classroom" is shown in Figure 6. Here, "jiao4shi4/classroom" is the predicted O, and B and R are the same word "zai4/in". In Chinese (as well as in English), the head word typically lies at the end of the phrase. This makes B more likely to be M than R, so the information redundancy between B and M is larger than that between B and R.</Paragraph>
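One way to probe this head-position explanation empirically would be to count, over a dependency-parsed corpus, how often the preceding word B is also M (it modifies O) versus also R (it is modified by O). The sketch below is illustrative, assuming the same heads-array representation as in the earlier sketch; it is not a procedure reported in the paper.

def b_collision_rates(sentences):
    # sentences: iterable of (words, heads) pairs, where heads[j] is the index of
    # the word that token j modifies, or None for the root of the sentence.
    same_as_m = same_as_r = total = 0
    for words, heads in sentences:
        for i in range(1, len(words)):
            total += 1
            if heads[i - 1] == i:      # the preceding word modifies O, so B coincides with M
                same_as_m += 1
            if heads[i] == i - 1:      # O modifies the preceding word, so B coincides with R
                same_as_r += 1
    return same_as_m / total, same_as_r / total

If the explanation holds for a head-final language, the first rate should be noticeably larger than the second.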
    </Section>
    <Section position="3" start_page="142" end_page="144" type="sub_section">
      <SectionTitle>
4.3 If M (the nearest preceding word modifying the predicted word O) is one of the feature types of the baseline, MT (the modifying type between M and O) will bring little additional information gain for word prediction.
</SectionTitle>
      <Paragraph position="1"> We are interested in knowing how much non-redundant information is present in MT if M is included in the baseline. To assess this, we conducted the following experiment, which focuses directly on the relationship between MT and the two words involved.</Paragraph>
      <Paragraph position="2"> We measured the information gain of MT over M to be only IG(MT;O|M)=0.110 bits, while the information redundancy of MT and M is a much larger IR(MT,M;O)=0.861 bits. This means that the prediction information for O in M (which at IQ(M;O)=2.237 bits is much larger, incidentally, than that in MT at IQ(MT;O)=0.971 bits) contains almost all the prediction information for O in MT. The corresponding linguistic explanation may be as follows. The lexical identities of the predicted word O and its modifying word M, the two words involved in the dependency relation, determine to a large extent the type of modification relation MT that holds between them.</Paragraph>
      <Paragraph position="3"> Consider the sentence in Figure 7. In the phrase "zuo2tian1/yesterday xia4wu3/afternoon", just knowing the identities of the two words "zuo2tian1/yesterday" and "xia4wu3/afternoon" is enough to predict with near certainty that the relation between them is time phrase (tp), giving the dependency structure shown in Figure 7.</Paragraph>
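The determinism claimed here can be checked directly: for each observed modifier-head word pair, measure how often the relation type that actually occurs is the pair's most frequent one. The sketch below is a hypothetical check over treebank triples, not a procedure from the paper.

from collections import Counter, defaultdict

def relation_determinism(observations):
    # observations: iterable of (modifier, head, relation_type) triples from a treebank.
    by_pair = defaultdict(Counter)
    for m, o, mt in observations:
        by_pair[(m, o)][mt] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_pair.values())
    total = sum(sum(counts.values()) for counts in by_pair.values())
    return correct / total  # values near 1.0 mean the word pair nearly determines the relation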
    </Section>
    <Section position="4" start_page="144" end_page="145" type="sub_section">
      <SectionTitle>
4.4 If R (the nearest preceding word modified by the predicted word O) is one of the feature types of the baseline, RT (the modifying type between R and O) will bring little additional information gain for word prediction.
</SectionTitle>
      <Paragraph position="1"> This simply mirrors the immediately preceding point, except that R is the modified word (parent) instead of the modifying word (child). In this case, we measured the information gain of RT over R to be only IG(RT;O|R)=0.271 bits, while the information redundancy of RT and R is a much larger IR(RT,R;O)=0.683 bits. This means that the information in R (IQ(R;O)=1.581 bits) contains almost all the information in RT (IQ(RT;O)=0.954 bits). The corresponding linguistic explanation is as follows. The lexical identities of the words (R, O) involved in a dependency relation determine to a large extent the type of modification relation RT that holds between O and the word it modifies, R.</Paragraph>
      <Paragraph position="2"> Consider the sentence in Figure 8: the identities of the words "xie3/write" and "lun4wen2/paper" determine with near certainty that their relationship is verb phrase (vp).</Paragraph>
    </Section>
    <Section position="5" start_page="144" end_page="145" type="sub_section">
      <SectionTitle>
4.5 Among the feature types in {B, BP, M, MP, MT, R, RP, RT}, the preference order for selecting feature types is B, R, M, RT, MT, BP, RP, MP.
</SectionTitle>
      <Paragraph position="3"> We used the metric IG to obtain a ranking of feature types according to their predictiveness. This ranking only considers information gain; it ignores complexity (for a practical application, we would also consider the complexity of the model at the same time). To obtain this order, we performed a greedy search where at each step we selected the next most informative feature type (i.e., the feature type that has the largest information gain). The empirical information gain measurements at each search step are shown in Table 4, where the feature type with the boldface IG in each column is the one selected in that step, and IG(F;O|Null)=IQ(F;O).</Paragraph>
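The greedy search can be written compactly as below; ig is a stand-in for the empirical information-gain measurements summarized in Table 4 (a hypothetical callback, not an interface from the paper).

def greedy_feature_order(candidates, ig):
    # ig(f, selected): empirical IG(f;O|selected feature types),
    # with ig(f, ()) equal to IQ(f;O) when nothing has been selected yet.
    remaining, order = set(candidates), []
    while remaining:
        best = max(remaining, key=lambda f: ig(f, tuple(order)))
        order.append(best)
        remaining.remove(best)
    return order

# With the measurements reported in Table 4, this procedure yields the order
# B, R, M, RT, MT, BP, RP, MP over {"B", "BP", "M", "MP", "MT", "R", "RP", "RT"}.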
      <Paragraph position="4"> This preference ordering can serve as a guideline for selecting feature type combinations in a language model. That is to say, given the  feature type set {B, BP, M, MP, MT, R, RP, RT}, if a language model uses only one feature type, feature type B should be used; if a language model uses two feature types, the feature type combination {B, R} should be used, and so on.</Paragraph>
      <Paragraph position="5"> However, we can see from Figure 9 that the additional information gain falls off rapidly when more than three feature types are selected.</Paragraph>
    </Section>
  </Section>
</Paper>