<?xml version="1.0" standalone="yes"?>
<Paper uid="P00-1060">
  <Title>An Information-Theory-Based Feature Type Analysis for the Modelling of Statistical Parsing SUI Zhifang, ZHAO Jun, Dekai WU</Title>
  <Section position="4" start_page="121" end_page="121" type="metho">
    <SectionTitle>
F_1, F_2, ..., F_n
</SectionTitle>
    <Paragraph position="0"> We consider a set of feature types F_1, F_2, ..., F_n, depending on which we can divide the contextual condition S into equivalence classes.</Paragraph>
    <Paragraph position="2"/>
    <Paragraph position="4"> According to the equation of (2) and (3), we have the following equation:</Paragraph>
    <Paragraph position="6"> In this way, we can get a unite expression of probabilistic evaluation model for statistical syntactic parsing. The difference among the different parsing models lies mainly in that they use different feature types or feature type combination to divide the contextual condition into equivalent classes. Our ultimate aim is to determine which combination of feature types is optimal for the probabilistic evaluation model of statistical syntactic parsing. Unfortunately, the state of knowledge in this regard is very limited.</Paragraph>
    <Paragraph position="7"> Many probabilistic evaluation models have been published inspired by one or more of these feature types [Black, 1992] [Briscoe, 1993] [Charniak, 1997] [Collins, 1996] [Collins, 1997] [Magerman, 1995] [Eisner, 1996], but discrepancies between training sets, algorithms, and hardware environments make it difficult, if not impossible, to compare the models objectively. In the paper, we propose an information-theory-based feature type analysis model by which we can quantitatively analyse the predictive power of different feature types or feature type combinations for syntactic structure in a systematic way. The conclusion is expected to provide reliable reference for feature type selection in the probabilistic evaluation modelling for statistical syntactic parsing.</Paragraph>
  </Section>
  <Section position="5" start_page="121" end_page="121" type="metho">
    <SectionTitle>
3 The information-theory-based feature type analysis model for statistical syntactic parsing
</SectionTitle>
    <Paragraph position="0"> In the prediction of stochastic events, entropy and conditional entropy can be used to evaluate the predictive power of different feature types. If the entropy of an event is much larger than its conditional entropy given a particular feature type, that feature type captures some of the important information about the predicted event.</Paragraph>
    <Paragraph position="1"> According to the above idea, we build the information-theory-based feature type analysis model, which is composed of four concepts: predictive information quantity, predictive information gain, predictive information redundancy and predictive information summation.</Paragraph>
    <Paragraph position="2"> G7A Predictive Information Quantity (PIQ) );( RFPIQ , the predictive information quantity of feature type F to predict derivation rule R, is defined as the difference between the entropy of R and the conditional entropy of R on condition that the feature type F is known.</Paragraph>
    <Paragraph position="3">  Predictive information quantity can be used to measure the predictive power of a feature type in feature type analysis.</Paragraph>
  </Section>
  <Section position="6" start_page="121" end_page="121" type="metho">
    <SectionTitle>
Predictive Information Gain (PIG)
</SectionTitle>
    <Paragraph position="0"> For the prediction of rule R,</Paragraph>
    <Paragraph position="2"> ), the predictive information gain of taking F x as a variant model on top of a baseline model employing F</Paragraph>
    <Paragraph position="4"> as feature type combination, is defined as the difference between the conditional entropy of predicting R based on feature type combination F</Paragraph>
    <Paragraph position="6"> and the conditional entropy of predicting R based on feature type combination F</Paragraph>
    <Paragraph position="8"> Based on the above two definitions, we can further draw the definition of predictive information redundancy as follows.</Paragraph>
    <Paragraph position="10"> Predictive information redundancy can be used as a measure of the redundancy between the predictive information of a feature type and that of a feature type combination.</Paragraph>
    <Section position="1" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
4.1 The classification of the feature
types
</SectionTitle>
      <Paragraph position="0"> The predicted event of our experiment is the derivation rule to extend the current non-terminal node. The feature types for prediction can be classified into two classes, history feature types and objective feature types. In the following, we will take the parsing tree shown in Figure-1 as the example to explain the classification of the feature types.</Paragraph>
      <Paragraph position="1"> In Figure-1, the current predicted event is the derivation rule to extend the framed non-terminal node VP, the part connected by the solid line belongs to history feature types, which is the already derived partial parsing tree, representing the structural environment of the current non-terminal node. The part framed by the larger rectangle belongs to the objective feature types, which is the word sequence containing the leaf nodes of the partial parsing tree rooted by the current node, representing the final objectives to be derived from the current node.</Paragraph>
    </Section>
    <Section position="2" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
4.2 The corpus used in the experiment
</SectionTitle>
      <Paragraph position="0"> The experimental corpus is derived from Penn TreeBank[Marcus,1993]. We semi-automatically assign a headword and a POS tag to each non-terminal node. 80% of the corpus (979,767 words) is taken as the training set, used for estimating the various co-occurrence probabilities, 10% of the corpus (133,814 words) is taken as the testing set, used to calculate predictive information quantity, predictive information gain, predictive information redundancy and predictive information summation. The other 10% of the corpus (133,814 words) is taken as the held-out set. The grammar rule set is composed of 8,126 CFG rules extracted from Penn TreeBank.</Paragraph>
      <Paragraph position="2"> where )(rc is the total number of time that r has been seen in the corpus.</Paragraph>
      <Paragraph position="3"> According to the escape mechanism in [Bell, 1992], we define the weights w</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="121" end_page="121" type="metho">
    <SectionTitle>
5 The information-theory-based feature type analysis
</SectionTitle>
    <Paragraph position="0"> The experiments led to a number of interesting conclusions on the predictive power of various feature types and feature type combinations, which are expected to provide a reliable reference for the modelling of probabilistic parsing.</Paragraph>
    <Section position="1" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
5.1 The analysis of the predictive information quantities of lexical feature types, part-of-speech feature types and constituent label feature types
</SectionTitle>
    </Section>
  </Section>
  <Section position="8" start_page="121" end_page="121" type="metho">
    <SectionTitle>
Goal
</SectionTitle>
    <Paragraph position="0"> One of the most important variation in statistical parsing over the last few years is that statistical lexical information is incorporated into the probabilistic evaluation model. Some statistical parsing systems show that the performance is improved after the lexical information is added.</Paragraph>
    <Paragraph position="1"> Our research aims at a quantitative analysis of the differences among the predictive information quantities provided by the lexical feature types, part-of-speech feature types and constituent label feature types from the view of information theory.</Paragraph>
  </Section>
  <Section position="9" start_page="121" end_page="121" type="metho">
    <SectionTitle>
Data
</SectionTitle>
    <Paragraph position="0"> The experiment is conducted on the history feature types of the nodes whose structural distance to the current node is within 2.</Paragraph>
    <Paragraph position="1"> In Table-1, &amp;quot;Y&amp;quot; in PIQ(X of Y; R) represents the node, &amp;quot;X&amp;quot; represents the constitute label, the headword or POS of the headword of the node.</Paragraph>
    <Paragraph position="2"> In the following, the units of PIQ are bits.</Paragraph>
  </Section>
  <Section position="10" start_page="121" end_page="121" type="metho">
    <SectionTitle>
Conclusion
</SectionTitle>
    <Paragraph position="0"> Among the feature types in the same structural position of the parsing tree, the predictive information quantity of lexical feature type is larger than that of part-of-speech feature type, and the predictive information quantity of part-of-speech feature type is larger than that of the constituent label feature type.</Paragraph>
    <Paragraph position="1"> Table-1: The predictive information quantity of the history feature type candidates PIQ(X of Y; R) X= constituent label X= headword X= POS of the headword Y= the current node 2.3609 3.7333 2.7708 Y= the parent 1.1598 2.3253 1.1784 Y= the grandpa 0.6483 1.6808 0.6612 Y= the first right brother of the current node 0.4730 1.1525 0.7502 Y= the first left brother of the current node 0.5832 2.1511 1.2186 Y= the second right brother of the current node 0.1066 0.5044 0.2525 Y= the second left brother of the current node 0.0949 0.6171 0.2697 Y= the first right brother of the parent 0.1068 0.3717 0.2133 Y= the first left brother of the parent 0.2505 1.5603 0.6145</Paragraph>
    <Section position="1" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
5.2 The analysis of the influence of the structural relation and the structural distance on the predictive information quantities of the history feature types
</SectionTitle>
      <Paragraph position="0"> Goal: In this experiment, we wish to find out how the structural relation and the structural distance between the current node and the node to which a given feature type is related influence the predictive information quantity of that feature type.</Paragraph>
    </Section>
  </Section>
  <Section position="11" start_page="121" end_page="121" type="metho">
    <SectionTitle>
Data:
</SectionTitle>
    <Paragraph position="0"> In Table-2, SR represents the structural relation between the current node and the node that the given feature type related to. SD represents the structural distance between the current node and the node that the given feature type related to.</Paragraph>
    <Paragraph position="1"> Table-2: The predictive information quantity of the selected history feature types  Among the history feature types which have the same structural relation with the current node (the relations are both parent-child relation, or both brother relation, etc), the one which has closer structural distance to the current node will provide larger predictive information quantity; Among the history feature types which have the same structural distance to the current node, the one which has parent relation with the current node will provide larger predictive information quantity than the one that has brother relation or mixed parent and brother relation to the current node (such as the parent's brother node).</Paragraph>
    <Section position="1" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
5.3 The analysis of the predictive information quantities of the history feature types and the objective feature types
</SectionTitle>
    </Section>
  </Section>
  <Section position="12" start_page="121" end_page="121" type="metho">
    <SectionTitle>
Goal
</SectionTitle>
    <Paragraph position="0"> Many of the existing probabilistic evaluation models prefer to use history feature types other than objective feature types. We select some of history feature types and objective feature types, and quantitatively compare their predictive information quantities.</Paragraph>
  </Section>
  <Section position="13" start_page="121" end_page="121" type="metho">
    <SectionTitle>
Data
</SectionTitle>
    <Paragraph position="0"> The history feature type we use here is the headword of the parent, which has the largest predictive information quantity among all the history feature types. The objective feature types are selected stochastically, which are the first word and the second word in the objective word sequence of the current node (Please see 4.1 and Figure-1 for detailed descriptions on the selected feature types).</Paragraph>
    <Paragraph position="1"> Table-3: The predictive information quantity of the selected history and objective feature types Class Feature type PIQ(Y;R) History feature type Y= headword of the parent 2.3253 Y= the first word in the objective word sequence 3.2398Objective feature type Y= the second word in the objective word sequence 3.0071</Paragraph>
  </Section>
  <Section position="14" start_page="121" end_page="121" type="metho">
    <SectionTitle>
Conclusion
</SectionTitle>
    <Paragraph position="0"> Either of the predictive information quantity of the first word and the second word in the objective word sequence is larger than that of the headword of the parent node which has the largest predictive information quantity among all of the history feature type candidates. That is to say, objective feature types may have larger predictive power than that of the history feature type.</Paragraph>
    <Paragraph position="1"> 5.4 The analysis to the predictive information quantities of the objective features types selected respectively on the physical position information, the heuristic information of headword and modifier, and the exact headword information</Paragraph>
  </Section>
  <Section position="15" start_page="121" end_page="121" type="metho">
    <SectionTitle>
Goal
</SectionTitle>
    <Paragraph position="0"> Not alike the structural history feature types, the objective feature types are sequential. Generally, the candidates of the objective feature types are selected according to the physical position.</Paragraph>
    <Paragraph position="1"> However, from the linguistic viewpoint, the physical position information can hardly grasp the relations between the linguistic structures.</Paragraph>
    <Paragraph position="2"> Therefore, besides the physical position information, our research try to select the objective feature types respectively according to the exact headword information and the heuristic information of headword and modifier. Through the experiment, we hope to find out what influence the exact headword information, the heuristic information of headword and modifier, and the physical position information have respectively to the predictive information quantities of the feature types.</Paragraph>
  </Section>
  <Section position="16" start_page="121" end_page="121" type="metho">
    <SectionTitle>
Data:
</SectionTitle>
    <Paragraph position="0"> Table-4 gives the evidence for the claim.</Paragraph>
    <Paragraph position="1"> Table-4: the predictive information quantity of the selected objective feature types the information used to select the objective feature types PIQ(Y;R) the physical position information 3.2398 (Y= the first word in the objective word sequence) Heuristic information 1: determine whether a word has the possibility to act as the headword of the current constitute according to its POS  (Y= the first word in the objective word sequence which has the possibility to act as the headword of the current constitute) Heuristic information 2: determine whether a word has the possibility to act as the modifier of the current constitute according to its POS  (Y= the first word in the objective word sequence which has the possibility to act as the modifier of the current constitute) Heuristic information 3: given the current headword, determine whether a word has the possibility to modify the headword  (Y= the first word in the objective word sequence which has the possibility to modify the headword) the exact headword information 3.7333 (Y= the headword of the current constitute)</Paragraph>
  </Section>
  <Section position="17" start_page="121" end_page="121" type="metho">
    <SectionTitle>
Conclusion
</SectionTitle>
    <Paragraph position="0"> The predictive information quantity of the headword of the current node is larger than that of a feature type selected according to the selected heuristic information of headword or modifier, and larger than that of a feature type selected according to the physical positions; The predictive information quantity of a feature type selected according to the physical positions is larger than that of a feature types selected according to the selected heuristic information of headword or modifier.</Paragraph>
    <Section position="1" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
5.5 The selection of the feature type combination with the optimal predictive information summation
</SectionTitle>
      <Paragraph position="0"> Goal: We aim to propose a method for selecting the feature type combination that has the optimal predictive information summation for prediction.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>