<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3223">
  <Title>Incremental Feature Selection and ℓ1 Regularization for Relaxed Maximum-Entropy Modeling</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Train and Test Data
</SectionTitle>
      <Paragraph position="0"> In the experiments presented in this paper, we evaluate ℓ2, ℓ1, and ℓ0 regularization on the task of stochastic parsing with maximum-entropy models. For our experiments, we used a stochastic parsing system for LFG that we trained on sections 02-21 of the UPenn Wall Street Journal treebank (Marcus et al., 1993) by discriminative estimation of a conditional maximum-entropy model from partially labeled data (see Riezler et al. (2002)). For estimation and best-parse searching, efficient dynamic-programming techniques over feature forests are employed (see Kaplan et al. (2004)). For the setup of discriminative estimation from partially labeled data, we found that restricting the training data to sentences with a relatively low ambiguity rate was possible at no loss in accuracy compared to training on all sentences. Furthermore, the data were restricted to sentences of which a discriminative learner can possibly take advantage, i.e., sentences where the set of parses assigned to the labeled string is a proper subset of the parses assigned to the unlabeled string. Together with a restriction to examples that could be parsed by the full grammar and did not have to use a backoff mechanism of fragment parses, this resulted in a training set of 10,000 examples with at most 100 parses each. Evaluation was done on the PARC 700 dependency bank, which is an LFG annotation of 700 examples randomly extracted from section 23 of the UPenn WSJ treebank. To tune regularization parameters, we split the PARC 700 into heldout and test sets of equal size.</Paragraph>
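The selection criteria described above (a full-grammar parse, labeled parses forming a proper subset of all parses, and a low ambiguity rate) can be sketched as follows. This is a minimal illustration; the corpus format and the function name `select_training_examples` are hypothetical, not from the paper.

```python
# Sketch of the training-data filters described in the text.
# Each corpus entry is (sentence_id, parses_of_unlabeled_string,
# parses_of_labeled_string), with parses represented as sets.

def select_training_examples(corpus, max_parses=100):
    """Keep only examples a discriminative learner can exploit."""
    selected = []
    for sentence, parses_unlabeled, parses_labeled in corpus:
        # Require a full-grammar parse (no fragment/backoff parses).
        if not parses_unlabeled:
            continue
        # The labeled parses must be a *proper* subset of all parses;
        # otherwise the partial label carries no discriminative signal.
        if not (parses_labeled and parses_labeled < parses_unlabeled):
            continue
        # Restrict to low-ambiguity sentences (at most 100 parses).
        if len(parses_unlabeled) <= max_parses:
            selected.append((sentence, parses_unlabeled, parses_labeled))
    return selected

corpus = [
    ("s1", {"p1", "p2", "p3"}, {"p1"}),   # kept
    ("s2", {"p1", "p2"}, {"p1", "p2"}),   # dropped: not a proper subset
    ("s3", set(), set()),                 # dropped: no full-grammar parse
]
print([s for s, _, _ in select_training_examples(corpus)])  # → ['s1']
```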
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Feature Construction
</SectionTitle>
      <Paragraph position="0"> Table 1 shows the 11 feature templates that were used in our experiments to create 60,109 features.</Paragraph>
      <Paragraph position="1"> On the roughly 300,000 parses for the 10,000 sentences in our final training set, 10,986 features were active, resulting in a matrix of active features times parses with 66 million non-zero entries. The scale of this experiment is comparable to experiments where much larger, but sparser feature sets are employed.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Local Templates
</SectionTitle>
      <Paragraph position="0"> The feature templates of Table 1 are:
cs label (label): constituent label is present in parse
cs adj label (parent label, child label): constituent child label is child of constituent parent label
cs right branch: constituent has right child
cs conj nonpar (depth): non-parallel conjuncts within depth levels
fs attrs (attrs): f-structure attribute is one of attrs
fs attr value (attr, value): attribute attr has value value
fs attr subsets (attr): sum of cardinalities of subsets of attr
lex subcat (pred, args sets): verb pred has one of args sets as arguments
Non-local templates:
cs embedded (label, size): chain of size constituents labeled label embedded into one another
cs sub label (ancestor label, descendant label): constituent descendant label is descendant of ancestor label
fs aunt subattr (aunts, parents, descendants): one of descendants is descendant of one of parents which is a sister of one of aunts</Paragraph>
      <Paragraph position="1"> The reason why the matrix of non-zeroes is less sparse in our case is that most of our feature templates are instantiated to linguistically motivated cases, and only a few feature templates encode all possible conjunctions of simple feature tests. Redundant features are introduced mostly by the latter templates, whereas the former features are generalizations over possible combinations of grammar constants. We conjecture that feature sets like this are typical of natural language applications.</Paragraph>
      <Paragraph position="2"> Efficient feature detection is achieved by a combination of hashing and dynamic programming on the packed representation of c- and f-structures (Maxwell and Kaplan, 1993). Features can be described as local and non-local, depending on the size of the graph that has to be traversed in their computation. For each local template, one of the parameters is selected as a key for hashing. Non-local features are treated as two (or more) local sub-features. Packed structures are traversed depth-first, visiting each node only once. Only the features keyed on the label of the current node are considered for matching. For each non-local feature, the contexts of matching sub-features are stored at the respective nodes, propagated upward in dynamic-programming fashion, and conjoined with the contexts of the other sub-features of the feature. Fully matched features are associated with the corresponding contexts, resulting in a feature-annotated and/or-forest. (Footnote 4, attached to the scale comparison above: a matrix that has 55 million entries for a shallow parsing experiment where 260,000 features were employed.)</Paragraph>
      <Paragraph position="3"> This annotated and/or-forest is exploited for dynamic-programming computation in estimation and best-parse selection.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.3 Experimental Results
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the results of an evaluation of five different systems on the test split of the PARC 700 dependency bank. The systems presented are unregularized maximum-likelihood estimation of a log-linear model including the full feature set (mle), standardized maximum-likelihood estimation as described in Sect. 4 (std), ℓ0 regularization using a frequency-based cutoff, ℓ1 regularization using n-best grafting, and ℓ2 regularization using a Gaussian prior. All ℓp regularization runs use a standardization of the feature space. The regularization parameters were adjusted on the heldout split, resulting in a cutoff threshold of 16, and penalization factors of 20 and 100 for ℓ1 and ℓ2 regularization respectively, with an optimal choice of 100 features to be added in each n-best grafting step.</Paragraph>
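The n-best grafting selection step can be sketched as follows, assuming the standard grafting criterion from the literature (add a feature when the magnitude of its log-likelihood gradient exceeds the ℓ1 penalty factor). The function name and data are illustrative, not the paper's implementation.

```python
# Sketch of one n-best grafting selection step: among currently inactive
# features, pick up to n whose gradient magnitude exceeds the l1 penalty.

import numpy as np

def grafting_step(grad, active, penalty, n_best):
    """Return indices of up to `n_best` inactive features whose absolute
    log-likelihood gradient exceeds the l1 penalty factor."""
    candidates = [(abs(g), i) for i, g in enumerate(grad)
                  if i not in active and abs(g) > penalty]
    candidates.sort(reverse=True)          # strongest gradients first
    return [i for _, i in candidates[:n_best]]

grad = np.array([0.5, -3.0, 1.2, -0.1, 2.5])   # toy gradient vector
print(grafting_step(grad, active=set(), penalty=1.0, n_best=2))  # → [1, 4]
```

With `n_best=1` this reduces to 1-best grafting; the paper's runs add 100 features per step.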
      <Paragraph position="1"> Performance of these systems is evaluated first with respect to F-score on matching dependency relations. Note that the F-score values on the PARC 700 dependency bank range between a lower bound of 68.0% for averaging over all parses and an upper bound of 83.6% for the parses producing the best possible matches. Furthermore, compression of the full feature set by feature selection, the number of conjugate gradient iterations, and computation time (in hours:minutes of elapsed time) are reported. (Footnote 5: All experiments were run on one CPU of a dual-processor AMD Opteron 244 with a 1.8 GHz clock speed and 4 GB of main memory.)</Paragraph>
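F-score on matching dependency relations is the harmonic mean of precision and recall over dependency triples. A minimal sketch, with illustrative gold and predicted triples (not from the PARC 700):

```python
# Minimal dependency-relation F-score: precision/recall over matching
# (relation, head, dependent) triples, combined as their harmonic mean.

def f_score(gold, predicted):
    matched = len(gold & predicted)
    precision = matched / len(predicted)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("subj", "see", "John"), ("obj", "see", "Mary"),
        ("adjunct", "see", "today")}
pred = {("subj", "see", "John"), ("obj", "see", "Mary"),
        ("adjunct", "run", "today")}
print(round(f_score(gold, pred), 3))  # → 0.667
```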
      <Paragraph position="2"> Table 2: F-score, compression of the feature set, number of conjugate gradient iterations, and elapsed time for unregularized and standardized maximum-likelihood estimation, and ℓ0, ℓ1, and ℓ2 regularization on the test split of the PARC 700 dependency bank.</Paragraph>
      <Paragraph position="3">
           mle     std     ℓ0      ℓ2      ℓ1
F-score    77.9    78.1    78.1    78.9    79.3
compr.     0       0       18.4    0       82.7
cg its.    761     371     372     34      226
time       129:12  66:41   60:47   6:19    5:25
Unregularized maximum-likelihood estimation using the full feature set exhibits severe overtraining problems, as the relation of F-score to the number of conjugate gradient iterations shows. Standardization of the input data alleviates this problem by improving convergence behavior to half the number of conjugate gradient iterations. ℓ0 regularization achieves its maximum on the heldout data for a threshold of 16, which results in an estimation run that is slightly faster than standardized estimation using all features, due to a compression of the full feature set by 18%. ℓ2 regularization benefits from a very tight prior (standard deviation of 0.1, corresponding to penalty 100) that was chosen on the heldout set. Although no reduction of the full feature set is achieved, this estimation run increases the F-score to 78.9% and improves computation time by a factor of 20 compared to unregularized estimation using all features. ℓ1 regularization with n-best grafting improves even upon this result, increasing the F-score to 79.3% and further decreasing computation time to 5:25 hours, at a compression of the full feature set of 83%.</Paragraph>
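The quoted correspondence between a Gaussian-prior standard deviation of 0.1 and a penalty factor of 100 is consistent with an ℓ2 penalty coefficient of 1/σ², a parameterization assumed here for illustration (the text itself only states the correspondence):

```python
# Sketch of the l2 (Gaussian prior) penalty and its gradient, under the
# assumed parameterization gamma = 1 / sigma**2 (so sigma = 0.1 -> 100).

import numpy as np

def l2_penalty(lam, sigma):
    gamma = 1.0 / sigma ** 2
    return 0.5 * gamma * np.sum(lam ** 2)

def l2_penalty_grad(lam, sigma):
    return (1.0 / sigma ** 2) * lam    # pulls each weight toward zero

lam = np.array([0.2, -0.1])
print(round(l2_penalty(lam, 0.1), 6))  # 0.5 * 100 * 0.05 → 2.5
```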
      <Paragraph position="4"> [Figure 2: ... conjugate gradient iterations.]</Paragraph>
      <Paragraph position="5"> As shown in Fig. 2, for feature selection from linguistically motivated feature sets with only a moderate amount of truly redundant features, it is crucial to choose the right number n of features to be added in each grafting step. The number of conjugate gradient iterations decreases rapidly with the number of features added at each step, whereas the F-score evaluated on the test set does not decrease (or increases slightly) until more than 100 features are added in each step. 100-best grafting thus reduces estimation time by a factor of 10 at no loss in F-score compared to 1-best grafting. Further increasing n results in a significant drop in F-score, while smaller n is computationally expensive and also shows slight overtraining effects.</Paragraph>
      <Paragraph position="6"> Table 3: F-score, compression, number of conjugate gradient iterations, and elapsed time for gradient-based incremental feature selection without regularization, and with ℓ2 and ℓ1 regularization, on the test split of the PARC 700 dependency bank.</Paragraph>
      <Paragraph position="7">
           mle-ifs  ℓ2-ifs  ℓ1
F-score    78.8     79.1    79.3
compr.     88.1     81.7    82.7
cg its.    310      274     226
time       6:04     6:56    5:25
In another experiment we tried to assess the relative contributions of regularization and incremental feature selection to the ℓ1-grafting technique. The results of this experiment are shown in Table 3. Here we applied incremental feature selection using the gradient test described above to unregularized maximum-likelihood estimation (mle-ifs) and to ℓ2-regularized maximum-likelihood estimation (ℓ2-ifs). Threshold parameters g are adjusted on the heldout set, in addition to and independently of regularization parameters such as the variance of the Gaussian prior. Results are compared to ℓ1-regularized grafting as presented above. For all runs, 100 features are added in each grafting step. The best result for the mle-ifs run is achieved at a threshold of 25, yielding an F-score of 78.8%. This shows that incremental feature selection is a powerful tool to avoid overfitting. A further improvement in F-score to 79.1% is achieved by combining incremental feature selection with the ℓ2 regularizer at a variance of 0.1 for the Gaussian prior and a threshold of 15. Both runs provide excellent compression rates and convergence times.</Paragraph>
      <Paragraph position="8"> However, they are still outperformed by the ℓ1 run, which achieves a slight improvement in F-score to 79.3% and a slightly better runtime. Furthermore, by integrating regularization naturally into the thresholding for feature selection, a separate thresholding parameter is avoided in ℓ1-based incremental feature selection.</Paragraph>
      <Paragraph position="9"> A theoretical account of the savings in computational complexity that can be achieved by n-best grafting can be given as follows. Perkins et al. (2003) assess the computational complexity of standard gradient-based optimization with the full feature set as ≈ cmp²t, for a multiple c of p line minimizations over p derivatives and m data points, each of which has cost t. In contrast, for grafting, the cost is assessed by adding up the costs of feature testing and optimization over s grafting steps, giving ≈ (msp + (1/3)cms³)t. For n-best grafting as proposed in this paper, the number of steps can be decomposed as s = n · t for n features added at each of t steps. This results in a cost of ≈ mtp for feature testing, and ≈ (1/3)cmn²t³t for optimization. If we assume that t ≪ n ≪ s, this indicates considerable savings compared to both 1-best grafting and standard gradient-based optimization.</Paragraph>
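A rough numeric sketch of these cost assessments, with illustrative constants (not the paper's actual measurements; c, m, p, s, n follow the symbols in the text, and the per-item cost is fixed at 1):

```python
# Cost assessments from the text: full optimization ~ c*m*p^2*t, 1-best
# grafting ~ (m*s*p + (1/3)*c*m*s^3)*t, and n-best grafting with s/n steps.

def cost_full(c, m, p, t):
    return c * m * p ** 2 * t

def cost_grafting(c, m, p, s, t):
    return (m * s * p + c * m * s ** 3 / 3.0) * t

def cost_nbest(c, m, p, s, n, t):
    steps = s / n   # n features added at each step, s selected in total
    return (m * steps * p + c * m * n ** 2 * steps ** 3 / 3.0) * t

# Illustrative constants: c=2, m=10,000 examples, p=60,109 features,
# s=5,000 selected features, n=100 features per step, unit per-item cost.
c, m, p, s, n, t = 2, 10_000, 60_109, 5_000, 100, 1
print(cost_nbest(c, m, p, s, n, t) < cost_grafting(c, m, p, s, t))  # → True
```

The cubic term in the number of steps is what makes n-best grafting cheap: with n features per step, the optimization term shrinks from s³ to n²(s/n)³ = s³/n.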
    </Section>
  </Section>
</Paper>