<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1018">
  <Title>Evaluation and Extension of Maximum Entropy Models with Inequality Constraints</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The maximum entropy model (Berger et al., 1996; Pietra et al., 1997) has attained great popularity in the NLP field due to its power, robustness, and successful performance in various NLP tasks (Ratnaparkhi, 1996; Nigam et al., 1999; Borthwick, 1999).</Paragraph>
    <Paragraph position="1"> In the ME estimation, an event is decomposed into features, which indicate the strength of certain aspects in the event, and the most uniform model among the models that satisfy:</Paragraph>
    <Paragraph position="3"> in the training data (empirical expectation), and E</Paragraph>
    <Paragraph position="5"> ] is the expectation with respect to the model being estimated. A powerful and robust estimation is possible since the features can be as specific or general as required and does not need to be independent of each other, and since the most uniform model avoids overfitting the training data.</Paragraph>
    <Paragraph position="6"> In spite of these advantages, the ME model still suffers from a lack of data as long as it imposes the equality constraint (1), since the empirical expectation calculated from the training data of limited size is inevitably unreliable. A careful treatment is required especially in NLP applications since the features are usually very sparse. In this study, text categorization is used as an example of such tasks with sparse features.</Paragraph>
    <Paragraph position="7"> Previous work on NLP proposed several solutions for this unreliability such as the cut-off, which simply omits rare features, the MAP estimation with the Gaussian prior (Chen and Rosenfeld, 2000), the fuzzy maximum entropy model (Lau, 1994), and fat constraints (Khudanpur, 1995; Newman, 1977).</Paragraph>
    <Paragraph position="8"> Currently, the Gaussian MAP estimation (combined with the cut-off) seems to be the most promising method from the empirical results. It succeeded in language modeling (Chen and Rosenfeld, 2000) and text categorization (Nigam et al., 1999). As described later, it relaxes constraints like E</Paragraph>
    <Paragraph position="10"> is the model's parameter.</Paragraph>
    <Paragraph position="11"> This study follows this line, but explores the following box-type inequality constraints: A</Paragraph>
    <Paragraph position="13"> Here, the equality can be violated by the widths A</Paragraph>
    <Paragraph position="15"> . We refer to the ME model with the above inequality constraints as the inequality ME model. This inequality constraint falls into a type of fat constraints, a</Paragraph>
    <Paragraph position="17"> , as suggested by (Khudanpur, 1995). However, as noted in (Chen and Rosenfeld, 2000), this type of constraint has not yet been applied nor evaluated for NLPs.</Paragraph>
    <Paragraph position="18"> The inequality ME model differs from the Gaussian MAP estimation in that its solution becomes sparse (i.e., many parameters become zero) as a result of optimization with inequality constraints. The features with a zero parameter can be removed from the model without changing its prediction behavior.</Paragraph>
    <Paragraph position="19"> Therefore, we can consider that the inequality ME model embeds feature selection in its estimation.</Paragraph>
    <Paragraph position="20"> Recently, the sparseness of the solution has been recognized as an important concept in constructing robust classifiers such as SVMs (Vapnik, 1995). We believe that the sparse solution improves the robustness of the ME model as well.</Paragraph>
    <Paragraph position="21"> We also extend the inequality ME model so that the constraint widths can move using slack variables. If we penalize the slack variables by their 2norm, we obtain a natural integration of the inequality ME model and the Gaussian MAP estimation.</Paragraph>
    <Paragraph position="22"> While it incorporates the quadratic stabilization of the parameters as in the Gaussian MAP estimation, the sparseness of the solution is preserved.</Paragraph>
    <Paragraph position="23"> We evaluate the inequality ME models empirically, using two text categorization datasets. The results show that the inequality ME models outperform the cut-off and the Gaussian MAP estimation. Such high accuracies are achieved with a fairly small number of active features, indicating that the sparse solution can effectively enhance the performance. In addition, the 2-norm extended model is shown to be more robust in several situations.</Paragraph>
  </Section>
class="xml-element"></Paper>