<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0729">
  <Title>Chunking with Maximum Entropy Models</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Maximum Entropy models
</SectionTitle>
    <Paragraph position="0"> Maximum Entropy (MaxEnt) models (Jaynes, 1957) are exponential models that implement the intuition that if there is no evidence to favour one alternative solution above another, both alternatives should be equally likely. In order to accomplish this, as much information as possible about the process you want to model must be collected. This information consists of frequencies of events relevant to the process.</Paragraph>
    <Paragraph position="1"> The frequencies of relevant events are considered to be properties of the process. When building a model we have to constrain our attention to models with these properties. In most cases the process is only partially described. The MaxEnt framework now demands that from all the models that satisfy these constraints, we choose the model with the flattest probability distribution. This is the model with the highest entropy (given the fact that the constraints are met). When we are looking for a conditional model P(w\]h), the MaxEnt solution has the form:</Paragraph>
    <Paragraph position="3"> where fi(h,w) refers to a (binary valued) feature function that describes a certain event; Ai is a parameter that indicates how important feature fi is for the model and Z(h) is a normalisation factor.</Paragraph>
    <Paragraph position="4"> In the last few years there has been an increasing interest in applying MaxEnt models for NLP applications (Ratnaparkhi, 1998; Berger et al., 1996; Rosenfeld, 1994; Ristad, 1998). The attraction of the framework lies in the ease with which different information sources used in the modelling process are combined and the good results that are reported with the use of these models. Another strong point of this framework is the fact that general software can easily be applied to a wide range of problems. For these experiments we have used off-the-shelf software (Maccent) (Dehaspe, 1997).</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="140" type="metho">
    <SectionTitle>
3 A MaxEnt chunker
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="139" type="sub_section">
      <SectionTitle>
3.1 Attributes used
</SectionTitle>
      <Paragraph position="0"> First need to be decided which information sources might help to predict the chunk tag. We need to work with the information that is included in the WSJ corpus, so the choice is first  * POS tags of surrounding words All these sources will be used, but in case of the information sources using surrounding words we will have to decide how much context is taken into account. I did not perform exhaustive tests on finding the best configuration, but following (Tjong Kim Sang and Veenstra, 1999; Ratnaparkhi, 1997) I only used very local context. In these experiments I used a left context of three words and a right context of two words. Experiments described in (Mufioz et al., 1999) successfully used larger contexts, but the few tests that I performed to confirm this did not give evidence that we could benefit significantly by extending the context. Apart from information  given by the WSJ corpus, information generated by the model itself will also be used: * Chunk tags of previous words It would of course sometimes be desirable to use the chunk tags of the following words also, but these are not instantly available and therefore we will need a cascaded approach. I have experimented with a cascaded chunker, but I did not improve the results significantly.</Paragraph>
      <Paragraph position="1"> In order to use previously predicted chunk tags, the evaluation part of the Maccent software had to be modified. The evaluation program needs a previously created file with all the attributes and the actual class, but the chunk tag of the previous two words cannot be provided beforehand as they are produced in the process of evaluation. A ca~scaded approach where after the first run the predicted tags are added to the file with test data is also not completely satisfactory as the provided tags are then predicted on basis of all the other attributes, but not the previous chunk tags. Ideally the information about the tags of the previous words would be added during evaluation. This required some modification of the evaluation script.</Paragraph>
    </Section>
    <Section position="2" start_page="139" end_page="140" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> The experiments are evaluated using the follow- null In all the experiments a left context of 3 words and a right context of 2 words was used. The part of speech tags of the surrounding words and the word itself were all used as atomic features. The lexical information used consisted of the previous word, the current word and the next word. The word W-2 was omitted because it did not seem to improve the model. Using only these atomic features, the model scored an tagging accuracy of about 95.5% and a F-score of about 90.5 %. Well below the reported results in the literature. Adding i~atures combining POS tags improved the results significantly to just below state of the art scores. Finally 2 complex features involving NP chunk tags predicted for previous words were added. The most successful set of features used in our experiments is given in figure 1. It is not claimed that this is the best  set of features possible for this task. Trying new feature combinations, by adding them manually and testing the new configuration is a time consuming and not very interesting activity. Especially when the scores are close to the best published scores, adding new features have little impact on the behaviour of the model. An algorithm that discovers the interaction between features and suggests which features could be combined to improve the model would be very helpful here. I did not include any complex features involving lexical information. It might be  useful to include more features with lexical information if more training data is available (for example the full R&amp;M data set consisting of section 2-21 of WSJ).</Paragraph>
      <Paragraph position="1"> For feature selection a simple count cut-off was used. I experimented with several combinations of thresholds and the number of iterations used to train the model. When the threshold was set to 2, unique contexts (can be problematic during training of the model; see (Ratnaparkhi, 1998)) did not occur very frequently anymore and an upper bound on the number of iterations did not seem to be necessary. It was found that (using a threshold of 2 for every single feature) after about 100 iterations the model did not improve very much anymore. Using the feature setup given in figure 1 a threshold of 2 for all the features and allowing the model to train over 100 iterations, the scores given in table 1 were obtained.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="140" end_page="140" type="metho">
    <SectionTitle>
4 Concluding remarks
</SectionTitle>
    <Paragraph position="0"> The first observation that I would like to make here, is the fact that it was relatively easy to get results that are comparable with previously published results. Even though some improvement is to be expected when more detailed features, more context and/or more training data is used, it seems to be necessary to incorporate other sources of information to improve significantly on these results.</Paragraph>
    <Paragraph position="1"> Further, it is not satisfactory to find out what attribute combinations to use by trying new combinations and testing them. It might be worth to examine ways to automatically detect which feature combinations are promising (Mikheev, forthcoming; Della Pietra et al., 1997).</Paragraph>
  </Section>
class="xml-element"></Paper>