<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1045">
  <Title>Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling</Title>
  <Section position="4" start_page="365" end_page="365" type="metho">
    <SectionTitle>
4 Datasets and Evaluation
</SectionTitle>
    <Paragraph position="0"> We test the effectiveness of our technique on two established datasets: the CoNLL 2003 English named entity recognition dataset, and the CMU Seminar Announcements information extraction dataset.</Paragraph>
    <Section position="1" start_page="365" end_page="365" type="sub_section">
      <SectionTitle>
4.1 The CoNLL NER Task
</SectionTitle>
      <Paragraph position="0"> This dataset was created for the shared task of the Seventh Conference on Computational Natural Language Learning (CoNLL),4 which concerned named entity recognition. The English data is a collection of Reuters newswire articles annotated with four entity types: person (PER), location (LOC), organization (ORG), and miscellaneous (MISC). The data is separated into a training set, a development set (testa), and a test set (testb). The training set contains 945 documents, and approximately 203,000 tokens. The development set has 216 documents and approximately 51,000 tokens, and the test set has 231 documents and approximately 46,000 tokens.</Paragraph>
      <Paragraph position="1"> We evaluate performance on this task in the manner dictated by the competition so that results can be properly compared. Precision and recall are evaluated on a per-entity basis (and combined into an F1 score). There is no partial credit; an incorrect entity  boundary is penalized as both a false positive and as a false negative.</Paragraph>
    </Section>
    <Section position="2" start_page="365" end_page="365" type="sub_section">
      <SectionTitle>
4.2 The CMU Seminar Announcements Task
</SectionTitle>
      <Paragraph position="0"> This dataset was developed as part of Dayne Freitag's dissertation research Freitag (1998).5 It consists of 485 emails containing seminar announcements at Carnegie Mellon University. It is annotated for four fields: speaker, location, start time, and end time. Sutton and McCallum (2004) used 5-fold cross validation when evaluating on this dataset, so we obtained and used their data splits, so that results can be properly compared. Because the entire dataset is used for testing, there is no development set. We also used their evaluation metric, which is slightly different from the method for CoNLL data. Instead of evaluating precision and recall on a per-entity basis, they are evaluated on a per-token basis. Then, to calculate the overall F1 score, the F1 scores for each class are averaged.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="365" end_page="367" type="metho">
    <SectionTitle>
5 Models of Non-local Structure
</SectionTitle>
    <Paragraph position="0"> Our models of non-local structure are themselves just sequence models, defining a probability distribution over all possible state sequences. It is possible to flexibly model various forms of constraints in a way that is sensitive to the linguistic structure of the data (e.g., one can go beyond imposing just exact identity conditions). One could imagine many ways of defining such models; for simplicity we use the form</Paragraph>
    <Paragraph position="2"> where the product is over a set of violation types L, and for each violation type l we specify a penalty parameter thl. The exponent #(l,s,o) is the count of the number of times that the violation l occurs in the state sequence s with respect to the observation sequence o. This has the effect of assigning sequences with more violations a lower probability. The particular violation types are defined specifically for each task, and are described in the following two sections.</Paragraph>
    <Paragraph position="3"> This model, as defined above, is not normalized, and clearly it would be expensive to do so. This  a token sequence is labeled as different entity types in the same document. Taken from the CoNLL training set.</Paragraph>
    <Paragraph position="4">  labeled differently from an occurrence of a subsequence of it elsewhere in the document. Rows correspond to sequences, and columns to subsequences. Taken from the CoNLL training set. doesn't matter, however, because we only use the model for Gibbs sampling, and so only need to compute the conditional distribution at a single position i (as defined in Equation 1). One (inefficient) way to compute this quantity is to enumerate all possible sequences differing only at position i, compute the score assigned to each by the model, and renormalize. Although it seems expensive, this computation can be made very efficient with a straightforward memoization technique: at all times we maintain data structures representing the relationship between entity labels and token sequences, from which we can quickly compute counts of different types of violations.</Paragraph>
    <Section position="1" start_page="366" end_page="367" type="sub_section">
      <SectionTitle>
5.1 CoNLL Consistency Model
</SectionTitle>
      <Paragraph position="0"> Label consistency structure derives from the fact that within a particular document, different occurrences of a particular token sequence are unlikely to be labeled as different entity types. Although any one occurrence may be ambiguous, it is unlikely that all instances are unclear when taken together.</Paragraph>
      <Paragraph position="1"> The CoNLL training data empirically supports the strength of the label consistency constraint. Table 3 shows the counts of entity labels for each pair of identical token sequences within a document, where both are labeled as an entity. Note that inconsistent labelings are very rare.6 In addition, we also 6A notable exception is the labeling of the same text as both organization and location within the same document. This is a consequence of the large portion of sports news in the CoNLL want to model subsequence constraints: having seen Geoff Woods earlier in a document as a person is a good indicator that a subsequent occurrence of Woods should also be labeled as a person. However, if we examine all cases of the labelings of other occurrences of subsequences of a labeled entity, we find that the consistency constraint does not hold nearly so strictly in this case. As an example, one document contains references to both The China Daily, a newspaper, and China, the country.</Paragraph>
      <Paragraph position="2"> Counts of subsequence labelings within a document are listed in Table 4. Note that there are many off-diagonal entries: the China Daily case is the most common, occurring 328 times in the dataset.</Paragraph>
      <Paragraph position="3"> The penalties used in the long distance constraint model for CoNLL are the Empirical Bayes estimates taken directly from the data (Tables 3 and 4), except that we change counts of 0 to be 1, so that the distribution remains positive. So the estimate of a PER also being an ORG is 53151; there were 5 instance of an entity being labeled as both, PER appeared 3150 times in the data, and we add 1 to this for smoothing, because PER-MISC never occured. However, when we have a phrase labeled differently in two different places, continuing with the PER-ORG example, it is unclear if we should penalize it as PER that is also an ORG or an ORG that is also a PER. To deal with this, we multiply the square roots of each estimate together to form the penalty term. The penalty term is then multiplied in a number of times equal to the length of the offending entity; this is meant to &amp;quot;encourage&amp;quot; the entity to shrink.7 For example, say we have a document with three entities, Rotor Volgograd twice, once labeled as PER and once as ORG, and Rotor, labeled as an ORG. The likelihood of a</Paragraph>
    </Section>
    <Section position="2" start_page="367" end_page="367" type="sub_section">
      <SectionTitle>
5.2 CMU Seminar Announcements
Consistency Model
</SectionTitle>
      <Paragraph position="0"> Due to the lack of a development set, our consistency model for the CMU Seminar Announcements is much simpler than the CoNLL model, the numbers where selected due to our intuitions, and we did not spend much time hand optimizing the model.</Paragraph>
      <Paragraph position="1"> Specifically, we had three constraints. The first is that all entities labeled as start time are normalized, and are penalized if they are inconsistent. The second is a corresponding constraint for end times.</Paragraph>
      <Paragraph position="2"> The last constraint attempts to consistently label the speakers. If a phrase is labeled as a speaker, we assume that the last word is the speaker's last name, and we penalize for each occurrance of that word which is not also labeled speaker. For the start and end times the penalty is multiplied in based on how many words are in the entity. For the speaker, the penalty is only multiplied in once. We used a hand selected penalty of exp[?]4.0.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="367" end_page="367" type="metho">
    <SectionTitle>
6 Combining Sequence Models
</SectionTitle>
    <Paragraph position="0"> In the previous section we defined two models of non-local structure. Now we would like to incorporate them into the local model (in our case, the trained CRF), and use Gibbs sampling to find the most likely state sequence. Because both the trained CRF and the non-local models are themselves sequence models, we simply combine the two models into a factored sequence model of the following form</Paragraph>
    <Paragraph position="2"> where M is the local CRF model, L is the new non-local model, and F is the factored model.8 In this form, the probability again looks difficult to compute (because of the normalizing factor, a sum over all hidden state sequences of length N). However, since we are only using the model for Gibbs sampling, we never need to compute the distribution explicitly. Instead, we need only the conditional probability of each position in the sequence, which can be computed as PF(si|s[?]i,o) [?] PM(si|s[?]i,o)PL(si|s[?]i,o). (7) 8This model double-generates the state sequence conditioned on the observations. In practice we don't find this to be a problem.</Paragraph>
    <Paragraph position="3">  the results from Sutton and McCallum (2004) for comparison. At inference time, we then sample from the Markov chain defined by this transition probability.</Paragraph>
  </Section>
</Paper>