<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0430"> <Title>Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Conditional Random Fields </SectionTitle>
<Paragraph position="0"> Conditional Random Fields (CRFs) (Lafferty et al., 2001) are undirected graphical models used to calculate the conditional probability of values on designated output nodes given values assigned to other designated input nodes.</Paragraph>
<Paragraph position="1"> In the special case in which the output nodes of the graphical model are linked by edges in a linear chain, CRFs make a first-order Markov independence assumption, and thus can be understood as conditionally-trained finite state machines (FSMs). In the remainder of this section we introduce the likelihood model, inference and estimation procedures for CRFs.</Paragraph>
<Paragraph position="2"> Let $o = \langle o_1, o_2, \ldots, o_T \rangle$ be some observed input data sequence, such as a sequence of words in a document (the values on $T$ input nodes of the graphical model). Let $S$ be a set of FSM states, each of which is associated with a label $l \in L$ (such as ORG). Let $s = \langle s_1, s_2, \ldots, s_T \rangle$ be some sequence of states (the values on $T$ output nodes). By the Hammersley-Clifford theorem, CRFs define the conditional probability of a state sequence given an input sequence to be</Paragraph>
<Paragraph position="3"> $$P_\Lambda(s \mid o) = \frac{1}{Z_o} \exp\left( \sum_{t=1}^{T} \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) \right),$$</Paragraph>
<Paragraph position="4"> where $Z_o$ is a normalization factor over all state sequences, $f_k(s_{t-1}, s_t, o, t)$ is an arbitrary feature function over its arguments, and $\lambda_k$ is a learned weight for each feature function. A feature function may, for example, be defined to have value 0 in most cases, and have value 1 if and only if $s_{t-1}$ is state #1 (which may have label OTHER), $s_t$ is state #2 (which may have label LOCATION), and the observation at position $t$ in $o$ is a word appearing in a list of country names. Higher $\lambda$ weights make their corresponding FSM transitions more likely, so the weight $\lambda_k$ in this example should be positive. More generally, feature functions can ask arbitrarily powerful questions about the input sequence, including queries about previous words, next words, and conjunctions of all these, and $f_k(\cdot)$ can range over $-\infty \ldots \infty$.</Paragraph>
<Paragraph position="5"> CRFs define the conditional probability of a label sequence based on the total probability over the corresponding state sequences, $P_\Lambda(l \mid o) = \sum_{s : l(s) = l} P_\Lambda(s \mid o)$, where $l(s)$ is the sequence of labels corresponding to the states in sequence $s$.</Paragraph>
<Paragraph position="6"> Note that the normalization factor, $Z_o$, is the sum of the "scores" of all possible state sequences, $Z_o = \sum_{s} \exp\left( \sum_{t=1}^{T} \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) \right)$, and that the number of state sequences is exponential in the input sequence length, $T$. In arbitrarily-structured CRFs, calculating the normalization factor in closed form is intractable, but in linear-chain-structured CRFs, as in forward-backward for hidden Markov models (HMMs), the probability that a particular transition was taken between two CRF states at a particular position in the input sequence can be calculated efficiently by dynamic programming. We define slightly modified forward values, $\alpha_t(s_i)$, to be the "unnormalized probability" of arriving in state $s_i$ given the observations $\langle o_1, \ldots, o_t \rangle$. We set $\alpha_0(s)$ equal to the probability of starting in each state $s$, and recurse:</Paragraph>
<Paragraph position="7"> $$\alpha_{t+1}(s) = \sum_{s'} \alpha_t(s') \exp\left( \sum_k \lambda_k f_k(s', s, o, t) \right).$$</Paragraph>
<Paragraph position="8"> The backward procedure and the remaining details of Baum-Welch are defined similarly. $Z_o$ is then $\sum_s \alpha_T(s)$. The Viterbi algorithm for finding the most likely state sequence given the observation sequence can be correspondingly modified from its HMM form.</Paragraph>
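<Paragraph position="9"> To make the scoring and forward computations concrete, the following is a minimal Python sketch (ours, for illustration; it is not the paper's implementation, and all function and variable names are our own). It computes the unnormalized forward values $\alpha_t(s)$ and the normalizer $Z_o$ for a linear-chain CRF, given a list of feature functions $f_k$ and matching weights $\lambda_k$. A production implementation would work in log space to avoid underflow. </Paragraph>

import math

def forward(states, features, weights, obs, start):
    """Compute unnormalized forward values alpha_t(s) and Z_o for a
    linear-chain CRF.  Illustrative sketch only.

    features : list of functions f_k(s_prev, s, obs, t)
    weights  : matching lambda_k values
    start    : dict mapping each state to its starting probability
    """
    T = len(obs)
    # alpha_0(s) is the probability of starting in state s
    alpha = [dict(start)]
    for t in range(T):
        nxt = {}
        for s in states:
            # alpha_{t+1}(s) = sum_{s'} alpha_t(s') exp(sum_k lambda_k f_k(s', s, o, t))
            nxt[s] = sum(
                alpha[t][sp] * math.exp(
                    sum(lam * f(sp, s, obs, t) for f, lam in zip(features, weights)))
                for sp in states)
        alpha.append(nxt)
    z_o = sum(alpha[T].values())   # Z_o = sum_s alpha_T(s)
    return alpha, z_o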
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Training CRFs </SectionTitle>
<Paragraph position="0"> The weights of a CRF, $\Lambda = \{\lambda_k\}$, are set to maximize the conditional log-likelihood of labeled sequences in some training set, $D = \{\langle o, l \rangle^{(j)} : j = 1 \ldots N\}$:</Paragraph>
<Paragraph position="1"> $$L_\Lambda = \sum_{j=1}^{N} \log P_\Lambda\left(l^{(j)} \mid o^{(j)}\right) - \sum_k \frac{\lambda_k^2}{2\sigma^2},$$</Paragraph>
<Paragraph position="2"> where the second sum is a Gaussian prior over parameters (with variance $\sigma^2$) that provides smoothing to help cope with sparsity in the training data.</Paragraph>
<Paragraph position="3"> When the training labels make the state sequence unambiguous (as they often do in practice), the likelihood function in exponential models such as CRFs is convex, so there are no local maxima, and thus finding the global optimum is guaranteed. It has recently been shown that quasi-Newton methods, such as L-BFGS, are significantly more efficient than traditional iterative scaling and even conjugate gradient (Malouf, 2002; Sha and Pereira, 2003). This method approximates the second-derivative of the likelihood by keeping a running, finite-sized window of previous first-derivatives.</Paragraph>
<Paragraph position="4"> L-BFGS can simply be treated as a black-box optimization procedure, requiring only that one provide the first-derivative of the function to be optimized. Assuming that the training labels on instance $j$ make its state path unambiguous, let $s^{(j)}$ denote that path. The first-derivative of the log-likelihood is then</Paragraph>
<Paragraph position="5"> $$\frac{\partial L_\Lambda}{\partial \lambda_k} = \sum_{j=1}^{N} \left( C_k(s^{(j)}, o^{(j)}) - \sum_{s} P_\Lambda(s \mid o^{(j)})\, C_k(s, o^{(j)}) \right) - \frac{\lambda_k}{\sigma^2},$$</Paragraph>
<Paragraph position="6"> where $C_k(s, o)$ is the "count" for feature $k$ given $s$ and $o$, equal to $\sum_{t=1}^{T} f_k(s_{t-1}, s_t, o, t)$, the sum of $f_k(s_{t-1}, s_t, o, t)$ values for all positions $t$ in the sequence $s$. The first two terms correspond to the difference between the empirical expected value of feature $f_k$ and the model's expected value, $(\tilde{E}[f_k] - E_\Lambda[f_k]) N$. The last term is the derivative of the Gaussian prior.</Paragraph>
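<Paragraph position="7"> As a concrete illustration of this gradient, here is a short Python sketch (ours, not the paper's code; all names are our own). For clarity it computes the model expectation by brute-force enumeration over all state sequences, which is only feasible for toy problems; a real implementation would use the forward-backward marginals instead. The resulting gradient vector is exactly what a black-box quasi-Newton optimizer such as L-BFGS requires. </Paragraph>

import itertools
import math

def log_likelihood_gradient(states, features, weights, data, sigma2):
    """Gradient of the CRF conditional log-likelihood with a Gaussian
    prior: empirical feature counts minus model-expected counts, minus
    lambda_k / sigma^2.  Illustrative sketch only.

    data : list of (obs, path) pairs, where path is the labeled state
           sequence s^(j).
    """
    K = len(weights)
    grad = [0.0] * K

    def count(seq, obs, k):
        # C_k(s, o) = sum_t f_k(s_{t-1}, s_t, o, t); s_{-1} taken as None
        return sum(features[k](seq[t - 1] if t > 0 else None, seq[t], obs, t)
                   for t in range(len(obs)))

    for obs, path in data:
        # Brute-force enumeration of all state sequences (exponential in T)
        seqs = list(itertools.product(states, repeat=len(obs)))
        scores = [math.exp(sum(weights[k] * count(s, obs, k) for k in range(K)))
                  for s in seqs]
        z_o = sum(scores)                                      # Z_o
        for k in range(K):
            empirical = count(path, obs, k)                    # ~E[f_k] term
            expected = sum(sc / z_o * count(s, obs, k)         # E_Lambda[f_k] term
                           for s, sc in zip(seqs, scores))
            grad[k] += empirical - expected
    for k in range(K):
        grad[k] -= weights[k] / sigma2                         # Gaussian prior term
    return grad

</Section> </Section>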
<Section position="4" start_page="0" end_page="2" type="metho"> <SectionTitle> 3 Efficient Feature Induction for CRFs </SectionTitle>
<Paragraph position="0"> Typically the features $f_k$ are based on some number of hand-crafted atomic observational tests (such as word is capitalized, word is "said", or word appears in a lexicon of country names), and a large collection of features is formed by making conjunctions of the atomic tests in certain user-defined patterns (for example, the conjunctions consisting of all tests at the current sequence position conjoined with all tests at the position one step ahead; specifically, for instance, current word is capitalized and next word is "Inc"). There can easily be over 100,000 atomic tests (mostly based on tests for the identity of words in the vocabulary) and ten or more shifted-conjunction patterns, resulting in several million features (Sha and Pereira, 2003). This large number of features can be prohibitively expensive in memory and computation; furthermore, many of these features are irrelevant, while other relevant ones are excluded.</Paragraph>
<Paragraph position="1"> In response, we wish to use just those time-shifted conjunctions that will significantly improve performance.</Paragraph>
<Paragraph position="2"> We start with no features, and over several rounds of feature induction: (1) consider a set of proposed new features, (2) select for inclusion those candidate features that will most increase the log-likelihood of the correct state path $s^{(j)}$, and (3) train weights for all included features. The proposed new features are based on the hand-crafted observational tests, consisting of singleton tests and binary conjunctions of tests with each other and with features currently in the model. The latter allows arbitrary-length conjunctions to be built. The fact that not all singleton tests are included in the model gives the designer great freedom to use a very large variety of observational tests and a large window of time shifts.</Paragraph>
<Paragraph position="3"> To consider the effect of adding a new feature, define the new sequence model with additional feature $g$, having weight $\mu$, to be</Paragraph>
<Paragraph position="4"> $$P_{\Lambda + g\mu}(s \mid o) = \frac{P_\Lambda(s \mid o)\, \exp\left( \sum_{t=1}^{T} \mu\, g(s_{t-1}, s_t, o, t) \right)}{Z_o(\Lambda, g, \mu)},$$</Paragraph>
<Paragraph position="5"> where $Z_o(\Lambda, g, \mu)$ in the denominator is simply the additional portion of normalization required to make the new function sum to 1 over all state sequences.</Paragraph>
<Paragraph position="6"> Following (Della Pietra et al., 1997), we efficiently assess many candidate features in parallel by assuming that the $\lambda$ parameters on all included features remain fixed while estimating the gain, $G(g)$, of a candidate feature $g$, based on the improvement in log-likelihood it provides: $G_\Lambda(g) = \max_\mu G_\Lambda(g, \mu) = \max_\mu L_{\Lambda + g\mu} - L_\Lambda$, where $L_{\Lambda + g\mu}$ includes the prior term $-\mu^2 / 2\sigma^2$.</Paragraph>
<Paragraph position="7"> In addition, we make this approach tractable for CRFs with two further reasonable and mutually-supporting approximations. (1) We avoid dynamic programming for inference in the gain calculation with a mean-field approximation, removing the dependence among states. (Thus we transform the gain from a sequence problem into a token classification problem. However, the original posterior distribution over states given each token, $P_\Lambda(s \mid o) = \alpha_t(s \mid o)\, \beta_{t+1}(s \mid o) / Z_o$, is still calculated by dynamic programming without approximation.) Furthermore, we can calculate the gain of aggregate features irrespective of transition source, $g(s_t, o, t)$, and expand them after they are selected. (2) In many sequence problems, the great majority of the tokens are correctly labeled even in the early stages of training. We gain significant efficiency by including in the gain calculation only those tokens that are mislabeled by the current model. Let $\{o^{(i)} : i = 1 \ldots M\}$ be those tokens, and $o(i)$ be the input sequence in which the $i$th error token occurs at position $t(i)$. Then algebraic simplification using these approximations and the previous definitions gives</Paragraph>
<Paragraph position="8"> $$G_\Lambda(g, \mu) = \sum_{i=1}^{M} \left[ \mu\, g(s^{(i)}, o(i), t(i)) - \log Z_{o(i)}(\Lambda, g, \mu) \right] - \frac{\mu^2}{2\sigma^2},$$</Paragraph>
<Paragraph position="9"> where $s^{(i)}$ is the correct state at error token $i$, and $Z_{o(i)}(\Lambda, g, \mu)$ (with non-bold $o$) is simply $\sum_s P_\Lambda(s \mid o^{(i)}) \exp(\mu\, g(s, o(i), t(i)))$. The optimal values of the $\mu$'s cannot be solved in closed form, but Newton's method finds them all in about 12 quick iterations.</Paragraph>
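<Paragraph position="10"> The per-candidate Newton optimization is simple enough to sketch in a few lines of Python (ours, for illustration; the data layout and names are our own assumptions, not the paper's code). Under the mean-field approximation, each mislabeled token contributes $\mu\, g(s^{(i)}, o(i), t(i)) - \log Z_{o(i)}(\Lambda, g, \mu)$ to the gain, so the first and second derivatives in $\mu$ are the usual mean and (negated) variance of $g$ under the per-token posterior. </Paragraph>

import math

def candidate_gain(posteriors, truths, g_values, sigma2, iters=12):
    """Estimate the gain G(g) = max_mu G(g, mu) of one candidate feature
    via Newton's method on mu.  Illustrative sketch only.

    posteriors[i] : dict, P_Lambda(s | o) at the i-th mislabeled token
    truths[i]     : the correct state s^(i) at that token
    g_values[i]   : dict, g(s, o(i), t(i)) for each state s
    """
    mu = 0.0
    for _ in range(iters):
        d1 = -mu / sigma2              # derivative of the -mu^2/(2 sigma^2) prior
        d2 = -1.0 / sigma2
        for post, truth, g in zip(posteriors, truths, g_values):
            # Z_o(i) = sum_s P(s|o) exp(mu * g(s, o(i), t(i)))
            z = sum(p * math.exp(mu * g[s]) for s, p in post.items())
            e1 = sum(p * math.exp(mu * g[s]) * g[s] for s, p in post.items()) / z
            e2 = sum(p * math.exp(mu * g[s]) * g[s] ** 2 for s, p in post.items()) / z
            d1 += g[truth] - e1        # gradient of the per-token log-likelihood
            d2 -= e2 - e1 ** 2         # minus the posterior variance of g
        mu -= d1 / d2                  # Newton step
    gain = -mu ** 2 / (2 * sigma2)
    for post, truth, g in zip(posteriors, truths, g_values):
        z = sum(p * math.exp(mu * g[s]) for s, p in post.items())
        gain += mu * g[truth] - math.log(z)
    return gain, mu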
<Paragraph position="11"> There are two additional important modeling choices: (1) Because we expect our models to still require several thousand features, we save time by adding many of the highest-gain features in each round of induction rather than just one (including a few redundant features is not harmful). (2) Because even models with a small, select number of features can still severely overfit, we train the model with just a few BFGS iterations (not to convergence) before performing the next round of feature induction. Details are in (McCallum, 2003).</Paragraph> </Section>
<Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Web-augmented Lexicons </SectionTitle>
<Paragraph position="0"> Some general-purpose lexicons, such as surnames and location names, are widely available; however, many natural language tasks benefit from more task-specific lexicons, such as lists of soccer teams, political parties, NGOs and English counties. Creating new lexicons entirely by hand is tedious and time-consuming.</Paragraph>
<Paragraph position="1"> Using a technique we call WebListing, we build lexicons automatically from HTML data on the Web. Previous work has built lexicons from fixed corpora by determining linguistic patterns for the context in which relevant words appear (Collins and Singer, 1999; Jones et al., 1999). Rather than mining a small corpus, we gather data from nearly the entire Web; rather than relying on fragile linguistic context patterns, we leverage robust formatting regularities on the Web. WebListing finds co-occurrences of seed terms that appear in an identical HTML formatting pattern, and augments a lexicon with other terms on the page that share the same formatting. Our current implementation uses GoogleSets, which we understand to be a simple implementation of this approach based on using HTML list items as the formatting regularity. We are currently building a more sophisticated replacement. A toy sketch of the list-item variant follows.</Paragraph>
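<Paragraph position="2"> The following Python sketch is ours, not the paper's system, and uses only HTML list items as the formatting regularity; a serious implementation would generalize to arbitrary repeated formatting patterns and handle messy real-world markup. It harvests the items of each page's lists, and if enough seed terms appear among them, adds the remaining items to the lexicon. </Paragraph>

from html.parser import HTMLParser

class ListHarvester(HTMLParser):
    """Toy sketch of the list-item variant of WebListing: collect the
    text of every li element on a page.  Nested lists and malformed
    markup are not handled; illustrative only."""
    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []
    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.items.append("")
            self.in_item = True
    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False
    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data

def augment_lexicon(pages, seeds, min_seed_hits=2):
    """Add to the lexicon any list item from a page whose list items
    already contain at least min_seed_hits seed terms."""
    lexicon = set(seeds)
    for html in pages:
        parser = ListHarvester()
        parser.feed(html)
        items = {item.strip() for item in parser.items if item.strip()}
        if len(items & set(seeds)) >= min_seed_hits:
            lexicon |= items
    return lexicon

</Section> </Paper>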