<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1018">
  <Title>Evaluation and Extension of Maximum Entropy Models with Inequality Constraints</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Maximum Entropy Model
</SectionTitle>
    <Paragraph position="0"> The ME estimation of a conditional model p(y|x) from the training examples {(x</Paragraph>
    <Paragraph position="2"> as the following optimization problem.</Paragraph>
    <Paragraph position="4"> To be precise, we have also the constraints</Paragraph>
    <Paragraph position="6"> 1=0 x [?]X. Note that although we explain using a conditional model throughout the paper, the discussion can be applied easily to a joint model by considering the condition x is fixed.</Paragraph>
    <Paragraph position="7"> The empirical expectations and model expectations in the equality constraints are defined as follows.</Paragraph>
    <Paragraph position="9"> where c(*) indicates the number of times * occurred in the training data, and L is the number of training examples.</Paragraph>
    <Paragraph position="10"> By the Lagrange method, p(y|x) is found to have the following parametric form:</Paragraph>
    <Paragraph position="12"> (x,y)). The dual objective function becomes:</Paragraph>
    <Paragraph position="14"> (x,y)).</Paragraph>
    <Paragraph position="15"> The ME estimation becomes the maximization of L(l). And it is equivalent to the maximization of the log-likelihood: LL(l)=log</Paragraph>
    <Paragraph position="17"> This optimization can be solved using algorithms such as the GIS algorithm (Darroch and Ratcliff, 1972) and the IIS algorithm (Pietra et al., 1997). In addition, gradient-based algorithms can be applied since the objective function is concave.</Paragraph>
    <Paragraph position="18"> Malouf (2002) compares several algorithms for the ME estimation including GIS, IIS, and the limited-memory variable metric (LMVM) method, which is a gradient-based method, and shows that the LMVM method requires much less time to converge for real NLP datasets. We also observed that the LMVM method converges very quickly for the text categorization datasets with an improvement in accuracy. Therefore, we use the LMVM method (and its variant for the inequality models) throughout the experiments. Thus, we only show the gradient when mentioning the training. The gradient of the objective function (8) is computed as:</Paragraph>
    <Paragraph position="20"/>
  </Section>
  <Section position="5" start_page="0" end_page="4" type="metho">
    <SectionTitle>
3 The Inequality ME Model
</SectionTitle>
    <Paragraph position="0"> The maximum entropy model with the box-type inequality constraints (2) can be formulated as the following optimization problem:</Paragraph>
    <Paragraph position="2"> By using the Lagrange method for optimization problems with inequality constraints, the following parametric form is derived.</Paragraph>
    <Paragraph position="4"> are the Lagrange multipliers corresponding to constraints (10) and (11). The Karush-Kuhn-Tucker conditions state that, at the optimal point,</Paragraph>
    <Paragraph position="6"> These conditions mean that the equality constraint is maximally violated when the parameter is non-zero, and if the violation is strictly within the widths, the parameter becomes zero. We call a feature upper active when a</Paragraph>
    <Paragraph position="8"> negationslash=0, we call that feature active.</Paragraph>
    <Paragraph position="9">   Inactive features can be removed from the model without changing its behavior. Since A</Paragraph>
    <Paragraph position="11"> feature should not be upper active and lower active at the same time.</Paragraph>
    <Paragraph position="12">  The inequality constraints together with the constraints null summationtext y p(y|x)[?]1=0define the feasible region in the original probability space, on which the entropy varies and can be maximized. The larger the widths, the more the feasible region is enlarged. Therefore, it can be implied that the possibility of a feature becoming inactive (the global maximal point is strictly within the feasible region with respect to that feature's constraints) increases if the corresponding widths become large.</Paragraph>
    <Paragraph position="13">  The term 'active' may be confusing since in the ME research, a feature is called active when f i (x,y) &gt; 0 for an event. However, we follow the terminology in the constrained optimization.</Paragraph>
    <Paragraph position="14">  This is only achieved with some tolerance in practice. The solution for the inequality ME model would become sparse if the optimization determines many features as inactive with given widths. The relation between the widths and the sparseness of the solution is shown in the experiment.</Paragraph>
    <Paragraph position="15"> The dual objective function becomes:</Paragraph>
    <Paragraph position="17"> Thus, the estimation is formulated as:</Paragraph>
    <Paragraph position="19"> L(a,b).</Paragraph>
    <Paragraph position="20"> Unlike the optimization in the standard maximum entropy estimation, we now have bound constraints on parameters which state that parameters must be non-negative. In addition, maximizing L(a,b) is no longer equivalent to maximizing the log-likelihood LL(a,b). Instead, we maximize:</Paragraph>
    <Paragraph position="22"> Although we can use many optimization algorithms to solve this dual problem since the objective function is still concave, a method that supports bounded parameters must be used. In this study, we use the BLMVM algorithm (Benson and Mor'e, ), a variant of the limited-memory variable metric (LMVM) algorithm, which supports bound constraints.</Paragraph>
    <Paragraph position="23">  The gradient of the objective function is:</Paragraph>
    <Paragraph position="25"/>
  </Section>
  <Section position="6" start_page="4" end_page="6" type="metho">
    <SectionTitle>
4 Soft Width Extension
</SectionTitle>
    <Paragraph position="0"> In this section, we present an extension of the inequality ME model, which we call soft width. The soft width allows the widths to move as A</Paragraph>
    <Paragraph position="2"> using slack variables, but with some penalties in the objective function. This soft width extension is analogous to the soft margin extension of the SVMs, and in fact, the mathematical discussion is similar. If we penalize the slack variables  Although we consider only the gradient-based method here as noted earlier, an extension of GIS or IIS to support bounded parameters would also be possible.</Paragraph>
    <Paragraph position="3"> by their 2-norm, we obtain a natural combination of the inequality ME model and the Gaussian MAP estimation. We refer to this extension using 2-norm penalty as the 2-norm inequality ME model. As the Gaussian MAP estimation has been shown to be successful in several tasks, it should be interesting empirically, as well as theoretically, to incorporate the Gaussian MAP estimation into the inequality model.</Paragraph>
    <Paragraph position="4"> We first review the Gaussian MAP estimation in the following, and then we describe our extension.</Paragraph>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.1 The Gaussian MAP estimation
</SectionTitle>
      <Paragraph position="0"> In the Gaussian MAP ME estimation (Chen and Rosenfeld, 2000), the objective function is:</Paragraph>
      <Paragraph position="2"> which is derived as a consequence of maximizing the log-likelihood of the posterior probability, using a Gaussian distribution centered around zero with the variance s</Paragraph>
      <Paragraph position="4"> as a prior on parameters. The gradient becomes:</Paragraph>
      <Paragraph position="6"> =0.</Paragraph>
      <Paragraph position="7"> Therefore, the Gaussian MAP estimation can also be considered as relaxing the equality constraints. The significant difference between the inequality ME model and the Gaussian MAP estimation is that the parameters are stabilized quadratically in the Gaussian MAP estimation (16), while they are stabilized linearly in the inequality ME model (14).</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="6" type="sub_section">
      <SectionTitle>
4.2 2-norm penalty extension
</SectionTitle>
      <Paragraph position="0"> Our 2-norm extension to the inequality ME model is as follows.</Paragraph>
      <Paragraph position="2"> It is also possible to impose 1-norm penalties in the objective function. It yields an optimization problem which is identical to the inequality ME model except that the parameters are upper-bounded as 0 [?] a  is the penalty constants. The parametric form is identical to the inequality ME model (12). However, the dual objective function becomes:</Paragraph>
      <Paragraph position="4"> It can be seen that this model is a natural combination of the inequality ME model and the Gaussian MAP estimation. It is important to note that the solution sparseness is preserved in the above model.</Paragraph>
      <Paragraph position="5"> 5 Calculation of the Constraint Width</Paragraph>
      <Paragraph position="7"> , in the inequality constraints are desirably widened according to the unreliability of the feature (i.e., the unreliability of the calculated empirical expectation). In this paper, we examine two methods to determine the widths.</Paragraph>
      <Paragraph position="8"> The first is to use a common width for all features fixed by the following formula.</Paragraph>
      <Paragraph position="9"> A</Paragraph>
      <Paragraph position="11"> where W is a constant, width factor, to control the widths. This method can only capture the global reliability of all the features. That is, only the reliability of the training examples as a whole can be captured. We call this method single.</Paragraph>
      <Paragraph position="12"> The second, which we call bayes, is a method that determines the widths based on the Bayesian framework to differentiate between the features depending on their reliabilities.</Paragraph>
      <Paragraph position="13"> For many NLP applications including text categorization, we use the following type of features.</Paragraph>
      <Paragraph position="15"> can be interpreted as follows.</Paragraph>
      <Paragraph position="17"> This is only for estimating the unreliability, and is not used to calculate the actual empirical expectations in the constraints.</Paragraph>
      <Paragraph position="18"> Here, a source of unreliability is ~p(y|h</Paragraph>
      <Paragraph position="20"> terior distribution of th from the training examples by Bayesian estimation and utilize the variance of the distribution. With the uniform distribution as the prior, k times out of n trials give the posterior distribution: p(th)=Be(1+k,1+n[?]k), where Be(a,b) is the beta distribution. The variance is calculated as follows.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="6" end_page="10" type="metho">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"> For the evaluation, we use the &amp;quot;Reuters-21578, Distribution 1.0&amp;quot; dataset and the &amp;quot;OHSUMED&amp;quot; dataset.</Paragraph>
    <Paragraph position="1"> The Reuters dataset developed by David D. Lewis is a collection of labeled newswire articles.</Paragraph>
    <Paragraph position="2">  We adopted &amp;quot;ModApte&amp;quot; split to split the collection, and we obtained 7,048 documents for training, and 2,991 documents for testing. We used 112 &amp;quot;TOP-ICS&amp;quot; that actually occurred in the training set as the target categories.</Paragraph>
    <Paragraph position="3"> The OHSUMED dataset (Hersh et al., 1994) is a collection of clinical paper abstracts from the MED-LINE database. Each abstract is manually assigned MeSH terms. We simplified a MeSH term, like &amp;quot;A/B/C mapsto- A&amp;quot;, and used the most frequent 100 simplified terms as the target categories. We extracted 9,947 abstracts for training, and 9,948 abstracts for testing from the file &amp;quot;ohsumed.91.&amp;quot; A documents is converted to a bag-of-words vector representation with TFIDF values, after the stop  Available from http://www.daviddlewis.com/resources/ words are removed and all the words are downcased. Since the text categorization task requires that multiple categories are assigned if appropriate, we constructed a binary categorizer, p</Paragraph>
    <Paragraph position="5"> greater than 0.5, the category is assigned. To construct a conditional maximum entropy model, we used the feature function of the form (22), where</Paragraph>
    <Paragraph position="7"> (d) returns the TFIDF value of the i-th word of the document vector.</Paragraph>
    <Paragraph position="8"> We implemented the estimation algorithms as an extension of an ME estimation tool, Amis,  using the Toolkit for Advanced Optimization (TAO) (Benson et al., 2002), which provides the LMVM and the BLMVM optimization modules. For the inequality ME estimation, we added a hook that checks the KKT conditions after the normal convergence test.  We compared the following models:  For the inequality ME models, we compared the two methods to determine the widths, single and bayes, as described in Section 5. Although the Gaussian MAP estimation can use different s</Paragraph>
    <Paragraph position="10"> ture, we used a common variance s for gaussian.</Paragraph>
    <Paragraph position="11"> Thus, gaussian roughly corresponds to single in the way of dealing with the unreliability of features.</Paragraph>
    <Paragraph position="12"> Note that, for inequality models, we started with all possible features and rely on their ability to remove unnecessary features automatically by solution sparseness. The average maximum number of features in a categorizer is 63,150.0 for the Reuters dataset and 116,452.0 for the OHSUMED dataset.</Paragraph>
    <Paragraph position="13">  The tolerance for the normal convergence test (relative improvement) and the KKT check is 10 [?]4 . We stop the training if the KKT check has been failed many times and the ratio of the bad (upper and lower active) features among the active features is lower than 0.01.</Paragraph>
    <Paragraph position="14">  Here, we fix the penalty constants C</Paragraph>
  </Section>
class="xml-element"></Paper>