File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/w02-2013_intro.xml
Size: 6,210 bytes
Last Modified: 2025-10-06 14:01:40
<?xml version="1.0" standalone="yes"?> <Paper uid="W02-2013"> <Title>Named Entity Extraction with Conditional Markov Models and Classifiers</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 3 Extraction </SectionTitle>
<Paragraph position="0"> The extraction component discards the specific labels LOC, ORG, etc. (from now on we refer to these as sort labels) and predicts only whether a token is at the beginning of (B), inside (I), or outside (O) a phrase (we will call these bare labels phrase tags). While this move is not unproblematic, we determined empirically that overall performance was higher using only bare phrase tags without sort labels than with a single-phase approach that tries to predict phrase tags and sort labels together using a single (conditional or joint) Markov model.</Paragraph>
<Paragraph position="1"> The underlying rationale was to let the extractor concentrate on morpho-syntactic regularities that hold across different sorts of phrases without having to determine the sort label yet, which may require more context: for example, Spanish named entities can contain de, and this is the case across all sorts; and certain names like Holanda are ambiguous between LOC and ORG depending on whether they refer to countries on the one hand, or to their governments or national soccer teams on the other. In light of this it makes sense to delay the assignment of sort labels and concentrate on extracting candidate phrases first.</Paragraph>
<Paragraph position="2"> Our extraction approach uses conditional Markov models, and we illustrate it here with a first order model; generalizations to higher order models are straightforward. The problem we are trying to solve is this: we want to find a sequence of phrase tags t given a sequence of words w. We find the optimal $t^*$ as</Paragraph>
<Paragraph position="3"> \[ t^* = \arg\max_t P(t \mid w) = \arg\max_t \frac{G(t, w)}{W(w)} \]</Paragraph>
<Paragraph position="4"> where the conditional model P is expressed in terms of a joint generative model G of tags and words, and a language model W.</Paragraph>
<Paragraph position="5"> Since t and w have the same length n, we regard the training data as a sequence of pairs, rather than a pair of sequences (the two representations are isomorphic via a zip operation familiar to Python or Haskell programmers), and decompose the generative model G using a first order Markov assumption:</Paragraph>
<Paragraph position="6"> \[ G(\langle (w_1,t_1), \dots, (w_n,t_n) \rangle) = S(w_1,t_1) \prod_{i=2}^{n} G((w_i,t_i) \mid (w_{i-1},t_{i-1})) \]</Paragraph>
<Paragraph position="7"> Doing the same for W and using a designated start event $(w_0, t_0)$ instead of the start distribution S, we obtain:</Paragraph>
<Paragraph position="8"> \[ P(t \mid w) = \prod_{i=1}^{n} \frac{G((w_i,t_i) \mid (w_{i-1},t_{i-1}))}{W(w_i \mid w_{i-1})} \]</Paragraph>
<Paragraph position="9"> We further decompose the conditional distribution</Paragraph>
<Paragraph position="10"> \[ G((w_i,t_i) \mid (w_{i-1},t_{i-1})) = T(t_i \mid w_i, w_{i-1}, t_{i-1}) \, M(w_i \mid w_{i-1}, t_{i-1}) \]</Paragraph>
<Paragraph position="11"> In addition to the first order assumption above, the only other assumption we make is that</Paragraph>
<Paragraph position="12"> \[ M(w_i \mid w_{i-1}, t_{i-1}) = W(w_i \mid w_{i-1}), \] so that these factors cancel and $P(t \mid w) = \prod_{i=1}^{n} T(t_i \mid w_i, w_{i-1}, t_{i-1})$.</Paragraph>
<Paragraph position="13"> This is starting to look familiar: T is a conditional distribution over a finite set of phrase tags, so in principle any probabilistic classifier that uses (features derived from) the variables that T is conditioned on could be substituted in its place. Approaches like this have apparently been used informally in practice for some time, perhaps with a classifier in place of T that does not necessarily return a proper probability distribution over tags. Probability models that predict the next tag conditioned on the current tag and an observed word have been criticized for a weakness known as the Label Bias Problem (Lafferty et al., 2001); on the other hand, the practical effectiveness of approaches like the one proposed here on a very similar task was demonstrated by Punyakanok and Roth (2001).</Paragraph>
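For concreteness, here is a minimal sketch (our illustration, not the authors' code) of the quantity the model assigns to a candidate tag sequence once the M and W factors cancel. The NextTagModel interface, the sentinel start symbols, and the function name sequence_log_prob are hypothetical choices; any probabilistic classifier over the phrase tags with this signature could stand in for T.

    import math
    from typing import Callable, Dict, Sequence

    # Hypothetical interface for T(t_i | w_i, w_{i-1}, t_{i-1}): given the current
    # word and the preceding word and tag, return a probability distribution over
    # the phrase tags B, I, O.
    NextTagModel = Callable[[str, str, str], Dict[str, float]]

    START_WORD, START_TAG = "<s>", "<s>"  # the designated start event (w_0, t_0)

    def sequence_log_prob(words: Sequence[str], tags: Sequence[str],
                          T: NextTagModel) -> float:
        """log P(t | w) = sum_i log T(t_i | w_i, w_{i-1}, t_{i-1})."""
        assert len(words) == len(tags)
        w_prev, t_prev = START_WORD, START_TAG
        logp = 0.0
        for w, t in zip(words, tags):  # the 'zip' view of a tagged sentence
            logp += math.log(T(w, w_prev, t_prev)[t])
            w_prev, t_prev = w, t
        return logp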
<Paragraph position="14"> Finding the optimal tag sequence for a given sequence of words can be done in the usual fashion using Viterbi decoding. Training is fully supervised, since we have labeled training data, but could in principle be extended to the (partly) unsupervised case. We only implemented supervised training, which is mostly trivial. When using a simple conditional next-tag model it is especially important to have good estimates of $T(t_i \mid w_i, w_{i-1}, t_{i-1})$. We use the standard strategy of backing off to less and less informative contexts. In the worst case, $T(t_i \mid t_{i-1})$ can be estimated very reliably from the training data (in fact, good estimates for much longer tag histories can be found). When conditioning on words, the situation is rather different. For example, we see relatively few events of the form $(w_i, w_{i-1}, t_{i-1})$ in the training data (out of the space of all possible events of that form), and so we may back off to $(w_i, u_{i-1}, t_{i-1})$, where $u_{i-1}$ is binary valued and indicates whether the preceding word started with an upper case letter. We have not determined an optimal back-off strategy, and for now we use an intuitively plausible strategy that tries to use as much conditioning information as possible and backs off to strictly less informative histories (see the illustrative sketch at the end of this section). In all cases it is important to always condition on the preceding tag $t_{i-1}$, or else we would be left with no information about likely tag sequences.</Paragraph>
<Paragraph position="21"> We used first and second order models of this form and manually searched for good parameter settings on a held-out portion of the training data. It turns out that the second order model performs about the same as the first order model, but is at a disadvantage because of data sparseness.</Paragraph>
<Paragraph position="22"> Therefore we only consider first order models in the rest of this paper. The performance of the first order model on the development data sets is summarized in Table 1. Note that these figures can be obtained for any system by first piping its output through sed using the command s/-\(LOC\|MISC\|ORG\|PER\)/-FOO/g.</Paragraph>
<Paragraph position="23"> As will become clearer below, within each language it so happens that the extraction component performs better than the classification component, i.e., for now the performance bottleneck is the classification component.</Paragraph>
<Paragraph position="24"> Table 1: Precision, recall, and F_{β=1} of the extraction component on the development data sets for the two languages used in this shared task.</Paragraph>
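To make the preceding description concrete, the following self-contained sketch (our illustration under stated assumptions, not the authors' implementation) puts the pieces together: supervised relative-frequency training, a back-off chain of the shape described above that always keeps the preceding tag, and first order Viterbi decoding. The class and function names, the sentinel start symbols, the add-0.1 smoothing, and the uniform fallback are hypothetical placeholders rather than the tuned strategy described above.

    import math
    from collections import defaultdict
    from typing import Dict, List, Sequence, Tuple

    TAGS = ("B", "I", "O")
    START_WORD, START_TAG = "<s>", "<s>"  # designated start event (w_0, t_0)

    def starts_upper(word: str) -> bool:
        """u_{i-1}: does the preceding word start with an upper case letter?"""
        return word[:1].isupper()

    class BackedOffNextTag:
        """Relative-frequency estimate of T(t_i | w_i, w_{i-1}, t_{i-1}) with a
        back-off chain over strictly less informative histories:
        (w_i, w_{i-1}, t_{i-1}) -> (w_i, u_{i-1}, t_{i-1}) -> (t_{i-1}).
        The preceding tag t_{i-1} is kept in every context."""

        def __init__(self) -> None:
            # one table of context -> tag counts per back-off level
            self.tables = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]

        @staticmethod
        def _contexts(w: str, w_prev: str, t_prev: str):
            return [(w, w_prev, t_prev), (w, starts_upper(w_prev), t_prev), (t_prev,)]

        def update(self, w: str, w_prev: str, t_prev: str, t: str) -> None:
            for table, ctx in zip(self.tables, self._contexts(w, w_prev, t_prev)):
                table[ctx][t] += 1

        def __call__(self, w: str, w_prev: str, t_prev: str) -> Dict[str, float]:
            # use the richest context observed in training; add-0.1 smoothing
            # keeps every tag possible under that context
            for table, ctx in zip(self.tables, self._contexts(w, w_prev, t_prev)):
                seen = table[ctx]
                total = sum(seen.values())
                if total > 0:
                    return {t: (seen[t] + 0.1) / (total + 0.1 * len(TAGS)) for t in TAGS}
            return {t: 1.0 / len(TAGS) for t in TAGS}  # completely unseen context

    def train(corpus: Sequence[Tuple[Sequence[str], Sequence[str]]]) -> BackedOffNextTag:
        """Supervised training: a single sweep over the zipped (word, tag) pairs."""
        T = BackedOffNextTag()
        for words, tags in corpus:
            w_prev, t_prev = START_WORD, START_TAG
            for w, t in zip(words, tags):
                T.update(w, w_prev, t_prev, t)
                w_prev, t_prev = w, t
        return T

    def viterbi(words: Sequence[str], T: BackedOffNextTag) -> List[str]:
        """argmax_t sum_i log T(t_i | w_i, w_{i-1}, t_{i-1}) by dynamic programming."""
        # hyps maps each tag to (log-probability, best tag history ending in that tag)
        hyps: Dict[str, Tuple[float, List[str]]] = {START_TAG: (0.0, [])}
        w_prev = START_WORD
        for w in words:
            new_hyps: Dict[str, Tuple[float, List[str]]] = {}
            for t in TAGS:
                lp, hist = max(
                    ((prev_lp + math.log(T(w, w_prev, t_prev)[t]), prev_hist)
                     for t_prev, (prev_lp, prev_hist) in hyps.items()),
                    key=lambda pair: pair[0])
                new_hyps[t] = (lp, hist + [t])
            hyps, w_prev = new_hyps, w
        return max(hyps.values(), key=lambda pair: pair[0])[1]

A trained model would then be used as tags = viterbi(words, train(labelled_sentences)); replacing BackedOffNextTag with any other probabilistic next-tag classifier leaves the decoder unchanged.
</Section> </Paper>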