File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/p06-1071_intro.xml
Size: 5,577 bytes
Last Modified: 2025-10-06 14:03:34
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1071"> <Title>A Progressive Feature Selection Algorithm for Ultra Large Feature Spaces</Title> <Section position="4" start_page="561" end_page="562" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"> Before presenting the PFS algorithm, we first give a brief review of conditional maximum entropy modeling, its training process, and the SGC algorithm. This provides the background and motivation for our PFS algorithm.</Paragraph> <Section position="1" start_page="561" end_page="561" type="sub_section"> <SectionTitle> 2.1 Conditional Maximum Entropy Model </SectionTitle> <Paragraph position="0"> The goal of CME is to find the most uniform conditional distribution of y given observation x, $p(y|x)$, subject to constraints specified by a set of features $f_i(x, y)$, where features typically take the value of either 0 or 1 (Berger et al., 1996). More precisely, we want to maximize
$$H(p) = -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) \quad (1)$$
subject to the constraints
$$E_{\tilde{p}}(f_i) = E_p(f_i), \quad (2)$$
where $E_{\tilde{p}}(f_i) = \sum_{x,y} \tilde{p}(x, y)\, f_i(x, y)$ is the empirical expected feature count from the training data and $E_p(f_i) = \sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_i(x, y)$ is the feature expectation from the conditional model $p(y|x)$.</Paragraph> <Paragraph position="1"> This results in the following exponential model:
$$p(y|x) = \frac{1}{Z(x)} \exp\Big(\sum_j \lambda_j f_j(x, y)\Big) \quad (3)$$
where $\lambda_j$ is the weight corresponding to the feature $f_j$, and $Z(x)$ is a normalization factor.</Paragraph> <Paragraph position="2"> A variety of different phenomena in natural language processing tasks, including lexical, structural, and semantic aspects, can be expressed in terms of features. For example, a feature can be whether the word in the current position is a verb, or whether the word is a particular lexical item. A feature can also be about a particular syntactic subtree, or a dependency relation (e.g., Charniak and Johnson, 2005).</Paragraph> </Section> <Section position="2" start_page="561" end_page="562" type="sub_section"> <SectionTitle> 2.2 Selective Gain Computation Algorithm </SectionTitle> <Paragraph position="0"> In real-world applications, the number of possible features can be in the millions or beyond. Including all the features in a model may lead to data over-fitting, as well as poor efficiency and memory overflow. Good feature selection algorithms are therefore required to produce efficient, high-quality models, and this has led to a good amount of work in the area (Ratnaparkhi et al., 1994; Berger et al., 1996; Pietra et al., 1997; Zhou et al., 2003; Riezler and Vasserman, 2004).</Paragraph> <Paragraph position="1"> In the most basic approach, such as Ratnaparkhi et al. (1994) and Berger et al. (1996), training starts with a uniform distribution over all values of y and an empty feature set. For each candidate feature in a predefined feature space, the algorithm computes the likelihood gain achieved by including the feature in the model. The feature that maximizes the gain is selected and added to the current model. This process is repeated until the gain from the best candidate feature gives only marginal improvement. The process is very slow, because the gain has to be re-computed for every feature at each selection stage, and the computation of a parameter using Newton's method becomes expensive, considering that it has to be repeated many times.</Paragraph>
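To make the procedure concrete, the following is a minimal Python sketch of this basic greedy selection loop; it is an illustration under stated assumptions, not code from the cited papers. The `likelihood_gain` callback stands in for fitting the candidate feature's single weight (e.g., with Newton's method) and returning the resulting increase in conditional log-likelihood, and `min_gain` is a hypothetical stopping threshold.

```python
# Schematic sketch of the basic incremental feature selection (IFS) loop
# described by Ratnaparkhi et al. (1994) and Berger et al. (1996).
# `likelihood_gain(model, feature, data)` is a placeholder: it would fit the
# candidate feature's weight (e.g., with Newton's method) and return the
# resulting increase in conditional log-likelihood on the training data.

def select_features_ifs(candidates, data, likelihood_gain, min_gain=1e-4):
    """Greedily add the feature with the largest likelihood gain."""
    model = []                       # features selected so far
    candidates = set(candidates)
    while candidates:
        # Plain IFS re-computes the gain of EVERY remaining candidate
        # against the current model at every selection stage.
        gains = {f: likelihood_gain(model, f, data) for f in candidates}
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:   # best feature gives only marginal improvement
            break
        model.append(best)
        candidates.remove(best)
    return model
```

Because `likelihood_gain` is called for every remaining candidate at every iteration, the cost grows quickly with the size of the feature space, which motivates the SGC algorithm described next.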
<Paragraph position="2"> The idea behind the SGC algorithm (Zhou et al., 2003) is to use the gains computed in the previous step as approximate upper bounds for the subsequent steps. The gain for a feature needs to be re-computed only when the feature reaches the top of a priority queue ordered by gain, in other words, when the feature is the top candidate for inclusion in the model. If the re-computed gain is smaller than that of the next candidate in the list, the feature is re-ranked according to its newly computed gain, and the feature now at the top of the list goes through the same gain re-computation process.</Paragraph> <Paragraph position="3"> This heuristic comes from evidence that the gains become smaller and smaller as more and more good features are added to the model. This can be explained as follows: assume that Maximum Likelihood (ML) estimation leads to the best model, which attains a certain ML value; this ML value is an upper bound on the likelihood. Since the gains need to be positive for the process to proceed, the difference between the likelihood of the current model and the ML value becomes smaller and smaller; in other words, the possible gain each feature may add to the model gets smaller. Experiments in Zhou et al. (2003) also confirm the prediction that the gains become smaller as more and more features are added to the model, and that the gains do not become unexpectedly bigger or smaller as the model grows. Furthermore, the experiments in Zhou et al. (2003) show no significant advantage in looking ahead beyond the first element in the feature list. The SGC algorithm runs hundreds to thousands of times faster than the original IFS algorithm without degrading classification performance. We used this algorithm because it enables us to find high-quality CME models quickly.</Paragraph> <Paragraph position="4"> The original SGC algorithm uses a technique proposed by Darroch and Ratcliff (1972) and elaborated by Goodman (2002): when considering a feature $f_i$, it modifies only the unnormalized conditional probabilities $\exp\big(\sum_j \lambda_j f_j(x, y)\big)$ of the training instance pairs (x, y) with $f_i(x, y) = 1$, and subsequently adjusts the corresponding normalizing factors Z(x) in (3). An implementation often uses a mapping table, which maps features to the training instance pairs (x, y).</Paragraph>
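As a concrete illustration of this lazy re-ranking scheme, here is a minimal Python sketch, not the authors' implementation: it keeps candidate features in a max-heap keyed by their most recently computed (possibly stale) gains and re-computes a gain only when the feature reaches the top. The `likelihood_gain` callback and `min_gain` threshold are the same illustrative placeholders as in the previous sketch.

```python
import heapq

# Sketch of the selective gain computation (SGC) idea (Zhou et al., 2003):
# a feature's previously computed gain is kept as an approximate upper bound,
# and its gain is re-computed only when the feature reaches the top of a
# priority queue ordered by gain.

def select_features_sgc(candidates, data, likelihood_gain, min_gain=1e-4):
    model = []
    # Max-heap via negated gains; initial gains are computed against the
    # empty model.  The index i is only a tie-breaker for the heap.
    heap = [(-likelihood_gain(model, f, data), i, f)
            for i, f in enumerate(candidates)]
    heapq.heapify(heap)
    while heap:
        neg_stale, i, f = heapq.heappop(heap)
        fresh = likelihood_gain(model, f, data)   # re-compute only for the top feature
        next_best = -heap[0][0] if heap else float("-inf")
        if fresh >= next_best:
            # Still the top candidate under its newly computed gain.
            if fresh < min_gain:                  # only marginal improvement left
                break
            model.append(f)
        else:
            # Re-rank this feature by its fresh gain; the feature now at the
            # top of the queue goes through the same re-computation step.
            heapq.heappush(heap, (-fresh, i, f))
    return model
```

A full implementation would combine this loop with the mapping-table optimization described above, so that evaluating a candidate $f_i$ touches only the unnormalized probabilities and normalizing factors Z(x) of the pairs (x, y) with $f_i(x, y) = 1$.
</Section> </Section> </Paper>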