File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-3223_metho.xml

Size: 8,353 bytes

Last Modified: 2025-10-06 14:09:31

<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3223">
  <Title>Incremental Feature Selection and ℓ1 Regularization for Relaxed Maximum-Entropy Modeling</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Bounded Constraint Relaxation for
Maximum Entropy Estimation
</SectionTitle>
    <Paragraph position="0"> As shown by Lebanon and Lafferty (2001), in terms of convex duality, a regularization term for the dual problem corresponds to a &amp;quot;potential&amp;quot; on the constraint values in the primal problem. For a dual problem of regularized likelihood estimation for log-linear models, the corresponding primal problem is a maximum entropy problem subject to relaxed constraints. Let H(p) denote the entropy with respect to probability function p, and g : IRn - IR be a convex potential function, and ~p[*] and p[*] be expectations with respect to the empirical distribution ~p(x,y) = 1m summationtextmj=1 d(xj,x)d(yj,y) and the model distribution p(x|y)~p(y). The primal problem can then be stated as Maximize H(p)[?]g(c) subject to p[fi][?] ~p[fi] = ci,i = 1,... ,n Constraint relaxation is achieved in that equality of the feature expectations is not enforced, but a certain amount of overshooting or undershooting is allowed by a parameter vector c [?] IRn whose potential is determined by a convex function g(c) that is combined with the entropy term H(p).</Paragraph>
    <Paragraph position="1"> In the case of lscript2 regularization, the potential function for the primal problem is a quadratic penalty of the form 12g summationtexti c2i for g = 1s2</Paragraph>
    <Paragraph position="3"> (Lebanon and Lafferty, 2001). In order to recover the specific form of the primal problem for our case, we have to start from the given dual problem. Following Lebanon and Lafferty (2001), the dual function for regularized estimation can be expressed in terms of the dual function L(pl,l) for the unregularized case and the convex conjugate g[?](l) of the potential function g(c). In our case the negative of L(pl,l) corresponds to the likelihood term L(l), and the negative of the convex conjugate g[?](l) is the lscript1 regularizer. Thus our dual problem can be stated as</Paragraph>
    <Paragraph position="5"> Since for convex and closed functions, the conjugate of the conjugate is the original function, i.e.</Paragraph>
    <Paragraph position="6"> g[?][?] = g (Boyd and Vandenberghe, 2004), the potential function g(c) for the primal problem can be recovered by calculating the conjugate g[?][?] of the conjugate g[?](l) = gbardbllbardbl11. In our case, we get</Paragraph>
    <Paragraph position="8"> where bardblcbardbl[?] = max{|c1|,... ,|cn|}. A proof for this proposition is given in the Appendix. The resulting potential function g(c) is the indicator function on the interval [[?]g,g]. That is, it restricts the allowable amount of constraint relaxation to at most +-g. From this perspective, increasing g means to allow for more slack in constraint satisfaction, which in turn allows to fit a more uniform, less overfitting distribution to the data. For features that are included in the model, the parameter values have to be adjusted away from zero to meet the constraints |p[fi][?] ~p[fi] |[?] g, i = 1,... ,n (3) Initialization: Initialize selected features S to [?], and zero-weighted features Z to the full feature set, yielding the uniform distribution pl(0),S(0).</Paragraph>
    <Paragraph position="9"> n-best grafting: For steps t = 1,... ,T,  For features that meet the constraints without parameter adjustment, parameter values can be kept at zero, effectively discarding the features. Note that equality of equations 3 and 1 connects the maximum entropy problem to likelihood regularization.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Standardization
</SectionTitle>
    <Paragraph position="0"> Note that the Ohmp regularizer presented above penalizes the model parameters uniformly, corresponding to imposing a uniform variance onto all model parameters. This motivates a normalization of input data to the same scale. A standard technique to achieve this is to linearly rescale each feature count to zero mean and standard deviation of one over all training data. The same rescaling has to be done for training and application of the model to unseen data. As we will see in the experimental evaluation presented below, a standardization of input data can also dramatically improve convergence behavior in unregularized optimization . Furthermore, parameter values estimated from standardized feature counts are directly interpretable to humans. Combined with feature selection, interpretable parameter weights are particularly useful for error analysis of the model's feature design.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Incremental n-best Feature Selection
</SectionTitle>
    <Paragraph position="0"> The basic idea of the &amp;quot;grafting&amp;quot; (for &amp;quot;gradient feature testing&amp;quot;) algorithm presented by (Perkins et al., 2003) is to assume a tendency of lscript1 regularization to produce a large number of zero-valued parameters at the function's optimum, thus to start with all-zero weights, and incrementally add features to the model only if adjusting their parameter weights away from zero sufficiently decreases the optimization criterion. This idea allows for efficient, incremental feature selection, and at the same time avoids numerical problems caused by the discontinuity of the gradient in lscript1 regularization. Furthermore, the regularizer is incorporated directly into a criterion for feature selection, based on the observation made above: It only makes sense to add a feature to the model if the regularizer penalty is outweighed by the reduction in negative log-likelihood. Thus features considered for selection have to pass the following test:</Paragraph>
    <Paragraph position="2"> In the grafting procedure suggested by (Perkins et al., 2003), this gradient test is applied to each feature, and at each step the feature passing the test with maximum magnitude is added to the model.</Paragraph>
    <Paragraph position="3"> Adding one feature at a time effectively discards noisy and irrelevant features, however, the overhead introduced by grafting can outweigh the gain in efficiency if there is a moderate number of noisy and truly redundant features. In such cases, it is beneficial to add a number of n &gt; 1 features at each step, where n is adjusted by cross-validation or on a held-out data set. In the experiments on maximum-entropy parsing presented below, a feature set of linguistically motivated features is used that exhibits only a moderate amount of redundancy. We will see that for such cases, n-best feature selection considerably improves computational complexity, and also achieves slightly better generalization performance.</Paragraph>
    <Paragraph position="4"> After adding n [?] 1 features to the model in a grafting step, the model is optimized with respect to all parameters corresponding to currently included features. This optimization is done by calling a gradient-based general purpose optimization routine for the regularized objective function. We use a conjugate gradient routine for this purpose (Minka, 2001; Malouf, 2002)2. The gradient of our criterion with respect to a parameter li is:</Paragraph>
    <Paragraph position="6"> 2Note that despite gradient feature testing, the parameters for some features can be driven to zero in conjugate gradient optimization of the lscript1-regularized objective function. Care has to be taken to catch those features and prune them explicitly to avoid numerical instability.</Paragraph>
    <Paragraph position="7"> The sign of li decides if g is added or subtracted from the gradient for feature fi. For a feature that is newly added to the model and thus has weight li = 0, we use the feature gradient test to determine the sign. If [?]L(l)[?]li &gt; g, we know that [?]C(l)[?]li &gt; 0, thus we let sign(li) = [?]1 in order to decrease C.</Paragraph>
    <Paragraph position="8"> Following the same rationale, if [?]L(l)[?]li &lt; [?]g we set sign(li) = +1. An outline of an n-best grafting algorithm is given in Fig. 1.</Paragraph>
  </Section>
class="xml-element"></Paper>