<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1141">
  <Title>An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition</Title>
  <Section position="5" start_page="1121" end_page="1123" type="metho">
    <SectionTitle>
3 Label Consistency
</SectionTitle>
    <Paragraph position="0"> The intuition for modeling label consistency is that within a particular document, different occur- null a certain label and the other occurrence is given a certain label. We show these counts both within documents, as well as over the whole corpus. As we would expect, most pairs of the same entity sequence are labeled the same(i.e. the diagonal has most of the density) at both the document and corpus levels. These statistics are from the CoNLL 2003 English training set.  entity label, and the token subsequence is assigned a certain entity label. We show these counts both within documents, as well as over the whole corpus. Rows correspond to sequences, and columns to subsequences. These statistics are from the CoNLL 2003 English training set.</Paragraph>
    <Paragraph position="1"> rences of a particular token sequence (or similar token sequences) are unlikely to have different entity labels. While this constraint holds strongly at the level of a document, there exists additional value to be derived by enforcing this constraint less strongly across different documents. We want to model label consistency as a soft and not a hard constraint; while we want to encourage different occurrences of similar token sequences to get labeled as the same entity, we do not want to force this to always hold, since there do exist exceptions, as can be seen from the off-diagonal entries of tables 1 and 2.</Paragraph>
    <Paragraph position="2"> A named entity recognition system modeling this structure would encourage all the occurrences of the token sequence to the same entity type, thereby sharing evidence among them. Thus, if the system has strong evidence about the label of a given token sequence, but is relatively unsure about the label to be assigned to another occurrence of a similar token sequence, the system can gain significantly by using the information about the label assigned to the former occurrence, to label the relatively ambiguous token sequence, leading to accuracy improvements.</Paragraph>
    <Paragraph position="3"> The strength of the label consistency constraint, can be seen from statistics extracted from the CoNLL 2003 English training data. Table 1 shows the counts of entity labels pairs assigned for each pair of identical token sequences both within a document and across the whole corpus. As we would expect, inconsistent labelings are relatively rare and most pairs of the same entity sequence are labeled the same(i.e. the diagonal has most of the density) at both the document and corpus levels. A notable exception to this is the labeling of the same text as both organization and location within the same document and across documents.</Paragraph>
    <Paragraph position="4"> This is a due to the large amount of sports news in the CoNLL dataset due to which city and country names are often also team names. We will see that our approach is capable of exploiting this as well, i.e. we can learn a model which would not penalize an Organization-Location inconsistency as strongly as it penalizes other inconsistencies.</Paragraph>
    <Paragraph position="5"> In addition, we also want to model subsequence constraints: having seen Albert Einstein earlier in a document as a person is a good indicator that a subsequent occurrence of Einstein should also be labeled as a person. Here, we would expect that a subsequence would gain much more by knowing the label of a supersequence, than the other way around.</Paragraph>
    <Paragraph position="6"> However, as can be seen from table 2, we find that the consistency constraint does not hold nearly so strictly in this case. A very common case of this in the CoNLL dataset is that of documents containing references to both The China Daily, a newspaper, and China, the country (Finkel et al., 2005). The first should be labeled as an organization, and second as a location. The counts of sub-sequence labelings within a document and across documents listed in Table 2, show that there are many off-diagonal entries: the China Daily case is among the most common, occurring 328 times in the dataset. Just as we can model off-diagonal pat- null terns with exact token sequence matches, we can also model off-diagonal patterns for the token sub-sequence case.</Paragraph>
    <Paragraph position="7"> In addition, we could also derive some value by enforcing some label consistency at the level of an individual token. Obviously, our model would learn much lower weights for these constraints, when compared to label consistency at the level of token sequences.</Paragraph>
  </Section>
  <Section position="6" start_page="1123" end_page="1124" type="metho">
    <SectionTitle>
4 Our Approach to Handling Non-Local Dependencies
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1123" end_page="1124" type="sub_section">
      <Paragraph position="0"> To handle the non-local dependencies between same and similar token sequences, we define three sets of feature pairs where one member of the feature pair corresponds to a function of aggregate statistics of the output of the first CRF at the document level, and the other member corresponds to a function of aggregate statistics of the output of the first CRF over the whole test corpus.</Paragraph>
      <Paragraph position="1"> Thus this gives us six additional feature types for the second round CRF, namely Document-level Token-majority features, Document-level Entity-majority features, Document-level Superentity-majority features, Corpus-level Token-majority features, Corpus-level Entity-majority features and Corpus-level Superentity-majority features.</Paragraph>
      <Paragraph position="2"> These feature types are described in detail below.</Paragraph>
      <Paragraph position="3"> All these features are a function of the output labels of the first CRF, where predictions on the test set are obtained by training on all the data, and predictions on the train data are obtained by 10 fold cross-validation (details in the next section).</Paragraph>
      <Paragraph position="4"> Our features fired based on document and corpus level statistics are: * Token-majority features: These refer to the majority label assigned to the particular token in the document/corpus. Eg: Suppose we have three occurrences of the token Australia, such that two are labeled Location and one is labeled Organization, our tokenmajority feature would take value Location for all three occurrences of the token. This feature can enable us to capture some dependence between token sequences corresponding to a single entity and having common tokens. null * Entity-majority features: These refer to the majority label assigned to the particular entity in the document/corpus. Eg: Suppose we have three occurrences of the entity sequence (we define it as a token sequence labeled as a single entity by the first stage CRF) Bank of Australia, such that two are labeled Organization and one is labeled Location, our entitymajority feature would take value Organization for all tokens in all three occurrences of the entity sequence. This feature enables us to capture the dependence between identical entity sequences. For token labeled as not a Named Entity by the first CRF, this feature returns the majority label assigned to that token when it occurs as a single token named entity.</Paragraph>
      <Paragraph position="5"> * Superentity-majority features: These refer to the majority label assigned to supersequences of the particular entity in the document/corpus. By entity supersequences, we refer to entity sequences, that strictly contain within their span, another entity sequence.</Paragraph>
      <Paragraph position="6"> For example, if we have two occurrences of Bank of Australia labeled Organization and one occurrence of Australia Cup labeled Miscellaneous, then for all occurrences of the entity Australia, the superentity-majority feature would take value Organization. This feature enables us to take into account labels assigned to supersequences of a particular entity, while labeling it. For token labeled as not a Named Entity by the first CRF, this feature returns the majority label assigned to all entities containing the token within their span.</Paragraph>
      <Paragraph position="7"> The last feature enables entity sequences to benefit from labels assigned to entities which are entity supersequences of it. We attempted to add subentity-majority features, analogous to the superentity-majority features to model dependence on entity subsequences, but got no benefit from it. This is intuitive, since the basic sequence model would usually be much more certain about labels assigned to the entity supersequences, since they are longer and have more contextual information. As a result of this, while there would be several cases in which the basic sequence model would be uncertain about labels of entity subsequences but relatively certain about labels of token supersequences, the converse is very unlikely. Thus, it is difficult to profit from labels of entity subsequences while labeling entity sequences. We also attempted using more fine  grained features corresponding to the majority label of supersequences that takes into account the position of the entity sequence in the entity supersequence(whether the entity sequence occurs in the start, middle or end of the supersequence), but could obtain no additional gains from this.</Paragraph>
      <Paragraph position="8"> It is to be noted that while deciding if token sequences are equal or hold a subsequencesupersequence relation, we ignore case, which clearly performs better than being sensitive to case. This is because our dataset contains several entities in allCaps such as AUSTRALIA, especially in news headlines. Ignoring case enables us to model dependences with other occurrences with a different case such as Australia.</Paragraph>
      <Paragraph position="9"> It may appear at first glance, that our framework can only learn to encourage entities to switch to the most popular label assigned to other occurrences of the entity sequence and similar entity sequences. However this framework is capable of learning interesting off-diagonal patterns as well.</Paragraph>
      <Paragraph position="10"> To understand this, let us consider the example of different occurrences of token sequences being labeled Location and Organization. Suppose, the majority label of the token sequence is Location.</Paragraph>
      <Paragraph position="11"> While this majority label would encourage the second CRF to switch the labels of all occurrences of the token sequence to Location, it would not strongly discourage the CRF from labeling these as Organization, since there would be several occurrences of token sequences in the training data labeled Organization, with the majority label of the token sequence being Location. However it would discourage the other labels strongly. The reasoning is analogous when the majority label is Organization.</Paragraph>
      <Paragraph position="12"> In case of a tie (when computing the majority label), if the label assigned to a particular token sequence is one of the majority labels, we fire the feature corresponding to that particular label being the majority label, instead of breaking ties arbitrarily. This is done to encourage the second stage CRF to make its decision based on local information, in the absence of compelling non-local information to choose a different label.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="1124" end_page="1125" type="metho">
    <SectionTitle>
5 Advantages of our approach
</SectionTitle>
    <Paragraph position="0"> With our two-stage approach, we manage to get improvements on the F1 measure over existing approaches that model non-local dependencies. At the same time, the simplicity of our two-stage approach keeps inference time down to just the inference time of two sequential CRFs, when compared to approaches such as those of Finkel et al.</Paragraph>
    <Paragraph position="1"> (2005) who report that their inference time with Gibbs sampling goes up by a factor of about 30, compared to the Viterbi algorithm for the sequential CRF.</Paragraph>
    <Paragraph position="2"> Below, we give some intuition about areas for improvement in existing work and explain how our approach incorporates the improvements.</Paragraph>
    <Paragraph position="3"> * Most existing work to capture labelconsistency, has attempted to create all parenleftbign2parenrightbig pairwise dependencies between the different occurrences of an entity, (Finkel et al., 2005; Sutton and McCallum, 2004), where n is the number of occurrences of the given entity. This complicates the dependency graph making inference harder. It also leads to the penalty for deviation in labeling to grow linearly with n, since each entity would be connected to Th(n) entities. When an entity occurs several times, these models would force all occurrences to take the same value. This is not what we want, since there exist several instances in real-life data where different entities like persons and organizations share the same name. Thus, our approach makes a certain entity's label depend on certain aggregate information of other labels assigned to the same entity, and does not enforce pairwise dependencies.</Paragraph>
    <Paragraph position="4"> * We also exploit the fact that the predictions of a learner that takes non-local dependencies into account would have a good amount of overlap with a sequential CRF, since the sequence model is already quite competitive.</Paragraph>
    <Paragraph position="5"> We use this intuition to approximate the aggregate information about labels assigned to other occurrences of the entity by the non-local model, with the aggregate information about labels assigned to other occurrences of the entity by the sequence model. This intuition enables us to learn weights for non-local dependencies in two stages; we first get predictions from a regular sequential CRF and in turn use aggregate information about predictions made by the CRF as extra features to train a second CRF.</Paragraph>
    <Paragraph position="6"> * Most work has looked to model non-local dependencies only within a document (Finkel  et al., 2005; Chieu and Ng, 2002; Sutton and McCallum, 2004; Bunescu and Mooney, 2004). Our model can capture the weaker but still important consistency constraints across the whole document collection, whereas previous work has not, for reasons of tractability. Capturing label-consistency at the level of the whole test corpus is particularly helpful for token sequences that appear only once in their documents, but occur a few times over the corpus, since they do not have strong non-local information from within the document.</Paragraph>
    <Paragraph position="7"> * For training our second-stage CRF, we need to get predictions on our train data as well as test data. Suppose we were to use the same train data to train the first CRF, we would get unrealistically good predictions on our train data, which would not be reflective of its performance on the test data. One option is to partition the train data. This however, can lead to a drop in performance, since the second CRF would be trained on less data. To overcome this problem, we make predictions on our train data by doing a 10-fold cross validation on the train data. For predictions on the test data, we use all the training data to train the CRF. Intuitively, we would expect that the quality of predictions with 90% of the train data would be similar to the quality of predictions with all the training data. It turns out that this is indeed the case, as can be seen from our improved performance.</Paragraph>
  </Section>
  <Section position="8" start_page="1125" end_page="1126" type="metho">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1125" end_page="1125" type="sub_section">
      <SectionTitle>
6.1 Dataset and Evaluation
</SectionTitle>
      <Paragraph position="0"> We test the effectiveness of our technique on the CoNLL 2003 English named entity recognition dataset downloadable from http://cnts.uia.ac.be/conll2003/ner/. The data comprises Reuters newswire articles annotated with four entity types: person (PER), location (LOC), organization (ORG), and miscellaneous (MISC). The data is separated into a training set, a development set (testa), and a test set (testb). The training set contains 945 documents, and approximately 203,000 tokens and the test set has 231 documents and approximately 46,000 tokens. Performance on this task is evaluated by measuring the precision and recall of annotated entities (and not tokens), combined into an F1 score. There is no partial credit for labeling part of an entity sequence correctly; an incorrect entity boundary is penalized as both a false positive and as a false negative.</Paragraph>
    </Section>
    <Section position="2" start_page="1125" end_page="1126" type="sub_section">
      <SectionTitle>
6.2 Results and Discussion
</SectionTitle>
      <Paragraph position="0"> It can be seen from table 3, that we achieve a 12.6% relative error reduction, by restricting ourselves to features approximating non-local dependency within a document, which is higher than other approaches modeling non-local dependencies within a document. Additionally, by incorporating non-local dependencies across documents in the test corpus, we manage a 13.3% relative error reduction, over an already competitive baseline. We can see that all three features approximating non-local dependencies within a document yield reasonable gains. As we would expect the additional gains from features approximating non-local dependencies across the whole test corpus are relatively small.</Paragraph>
      <Paragraph position="1"> We use the approximate randomization test (Yeh, 2000) for statistical significance of the difference between the basic sequential CRF and our second round CRF, which has additional features derived from the output of the first CRF. With a 1000 iterations, our improvements were statistically significant with a p-value of 0.001. Since this value is less than the cutoff threshold of 0.05, we reject the null hypothesis.</Paragraph>
      <Paragraph position="2"> The simplicity of our approach makes it easy to incorporate dependencies across the whole corpus, which would be relatively much harder to incorporate in approaches like (Bunescu and Mooney, 2004) and (Finkel et al., 2005). Additionally, our approach makes it possible to do inference in just about twice the inference time with a single sequential CRF; in contrast, approaches like Gibbs Sampling that model the dependencies directly can increase inference time by a factor of 30 (Finkel et al., 2005).</Paragraph>
      <Paragraph position="3"> An analysis of errors by the first stage CRF revealed that most errors are that of single token entities being mislabeled or missed altogether followed by a much smaller percentage of multiple token entities mislabelled completely. All our features directly encode information that is useful to reducing these errors. The widely prevalent boundary detection error is that of missing a single-token entity (i.e. labeling it as Other(O)). Our approach helps correct many such errors based on occurrences of the token in other  performance against (Bunescu and Mooney, 2004) and (Finkel et al., 2005) and find that we manage higher relative improvement than existing work despite starting from a very competitive baseline CRF. named entities. Other kinds of boundary detection errors involving multiple tokens are very rare. Our approach can also handle these errors by encouraging certain tokens to take different labels.</Paragraph>
      <Paragraph position="4"> This together with the clique features encoding the markovian dependency among neighbours can correct some multiple-token boundary detection errors.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="1126" end_page="1127" type="metho">
    <SectionTitle>
7 Related Work
</SectionTitle>
    <Paragraph position="0"> Recent work looking to directly model non-local dependencies and do approximate inference are that of Bunescu and Mooney (2004), who use a Relational Markov Network (RMN) (Taskar et al., 2002) to explicitly model long-distance dependencies, Sutton and McCallum (2004), who introduce skip-chain CRFs, which add additional non-local edges to the underlying CRF sequence model (which Bunescu and Mooney (2004) lack) and Finkel et al. (2005) who hand-set penalties for inconsistency in labels based on the training data and then use Gibbs Sampling for doing approximate inference where the goal is to obtain the label sequence that maximizes the product of the CRF objective function and their penalty. Unfortunately, in the RMN model, the dependencies must be defined in the model structure before doing any inference, and so the authors use heuristic part-of-speech patterns, and then add dependencies between these text spans using clique templates. This generates an extremely large number of overlapping candidate entities, which renders necessary additional templates to enforce the constraint that text subsequences cannot both be different entities, something that is more naturally modeled by a CRF. Another disadvantage of this approach is that it uses loopy belief propagation and a voted perceptron for approximate learning and inference, which are inherently unstable algorithms leading to convergence problems, as noted by the authors. In the skip-chain CRFs model, the decision of which nodes to connect is also made heuristically, and because the authors focus on named entity recognition, they chose to connect all pairs of identical capitalized words. They also utilize loopy belief propagation for approximate learning and inference. It is hard to directly extend their approach to model dependencies richer than those at the token level.</Paragraph>
    <Paragraph position="1"> The approach of Finkel et al. (2005) makes it possible a to model a broader class of long-distance dependencies than Sutton and McCallum (2004), because they do not need to make any initial assumptions about which nodes should be connected and they too model dependencies between whole token sequences representing entities and between entity token sequences and their token supersequences that are entities. The disadvantage of their approach is the relatively ad-hoc selection of penalties and the high computational cost of running Gibbs sampling.</Paragraph>
    <Paragraph position="2"> Early work in discriminative NER employed two stage approaches that are broadly similar to ours, but the effectiveness of this approach appears to have been overlooked in more recent work.</Paragraph>
    <Paragraph position="3"> Mikheev et al. (1999) exploit label consistency information within a document using relatively ad hoc multi-stage labeling procedures. Borth- null wick (1999) used a two-stage approach similar to ours with CMM's where Reference Resolution features which encoded the frequency of occurrences of other entities similar to the current token sequence, were derived from the output of the first stage. Malouf (2002) and Curran and Clark (2003) condition the label of a token at a particular position on the label of the most recent previous instance of that same token in a previous sentence of the same document. This violates the Markov property and therefore instead of finding the maximum likelihood sequence over the entire document (exact inference), they label one sentence at a time, which allows them to condition on the maximum likelihood sequence of previous sentences.</Paragraph>
    <Paragraph position="4"> While this approach is quite effective for enforcing label consistency in many NLP tasks, it permits a forward flow of information only, which can result in loss of valuable information. Chieu and Ng (2002) propose a solution to this problem: for each token, they define additional features based on known information, taken from other occurrences of the same token in the document. This approach has the advantage of allowing the training procedure to automatically learn good weights for these &amp;quot;global&amp;quot; features relative to the local ones. However, it is hard to extend this to incorporate other types of non-local structure.</Paragraph>
  </Section>
class="xml-element"></Paper>