<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1060">
  <Title>Multi-Field Information Extraction and Cross-Document Fusion</Title>
  <Section position="3" start_page="0" end_page="485" type="metho">
    <SectionTitle>
2 Training by Automatic Annotation
</SectionTitle>
    <Paragraph position="0"> Typically, statistical extraction systems (such as HMMs and CRFs) are trained using hand-annotated data. Annotating the necessary data by hand is time-consuming and brittle, since it may require large-scale re-annotation when the annotation scheme changes. For the special case of Rote extractors, a more attractive alternative has been proposed by Brin (1998), Agichtein and Gravano (2000), and Ravichandran and Hovy (2002).</Paragraph>
    <Paragraph position="1">  Essentially, for any text snippet of the form A1pA2qA3, these systems estimate the probability that a relationship r(p,q) holds between entities p and q, given the interstitial context, as2</Paragraph>
    <Paragraph position="3"> That is, the probability of a relationship r(p,q) is the number of times that pattern xA2y predicts any relationship r(x,y) in the training set T. c(.) is the count. We will refer to x as the hook3 and y as the target. In this paper, the hook is always an individual. Training a Rote extractor is straightforward given a set T of example relationships r(x,y). For each hook, download a separate set of relevant documents (a hook corpus, Dx) from the Web.4 Then for any particular pattern A2 and an element x, count how often the pattern xA2 predicts y and how often it retrieves a spurious -y.5 This annotation method extends to training other statistical models with positive examples, for example a Na&amp;quot;ive Bayes (NB) unigram model. In this model, instead of looking for an exact A2 pattern as above, each individual word in the pattern A2 is used to predict the presence of a relationship.</Paragraph>
    <Paragraph position="5"> We perform add-lambda smoothing for out-of-vocabulary words and thus assign a positive probability to any sequence. As before, a set of relevant  estimate much more reliable, but it is possible to use this method of estimation even when this constraint does not hold. documents is downloaded for each particular hook.</Paragraph>
    <Paragraph position="6"> Then every hook and target is annotated. From that markup, we can pick out the interstitial A2 patterns and calculate the necessary probabilities.</Paragraph>
    <Paragraph position="7"> Since the NB model assigns a positive probability to every sequence, we need to pick out likely targets from those proposed by the NB extractor. We construct a background model which is a basic unigram language model, P(A2) = producttexta[?]A2 P(a). We then pick targets chosen by the confidence estimate</Paragraph>
    <Paragraph position="9"> However, this confidence estimate does not workwell in our dataset.</Paragraph>
    <Paragraph position="10"> We propose to use negative examples to estimate</Paragraph>
    <Paragraph position="12"> each relationship, we define the target set Er to be all potential targets and model it using regular expressions.7 In training, for each relationship r(p,q), we markup the hook p, the target q, and all spurious targets (-q [?] {Er [?]q}) which provide negative examples. Targets can then be chosen with the following confidence estimate</Paragraph>
    <Paragraph position="14"> We call this NB+E in the following experiments.</Paragraph>
    <Paragraph position="15"> The above process describes a general method for automatically annotating a corpus with positive and negative examples, and this corpus can be used to train statistical models that rely on annotated data.8 In this paper, we test automatic annotation using Conditional Random Fields (CRFs) (Lafferty et al., 2001) which have achieved high performance for information extraction. CRFs are undirected graphical models that estimate the conditional probability of a state sequence given an output sequence</Paragraph>
    <Paragraph position="17"> parenrightbigg 6-r stands in for all other possible relationships (including no relationship) between p and q. P(A2  |-r(p,q)) is estimated as P(A2  |r(p,q)) is, except with spurious targets.</Paragraph>
    <Paragraph position="18"> 7e.g., Ebirthyear = {\d\d\d\d}. This is the only source of human knowledge put into the system and required only around 4 hours of effort, less effort than annotating an entire corpus or writing information extraction rules.</Paragraph>
    <Paragraph position="19">  ship r(p,q) from a sentence pA2q. Left: CRF Extraction with a background model (B). Right: CRF+E As before but with spurious target prediction (pA2-q).</Paragraph>
    <Paragraph position="20"> We use the Mallet system (McCallum, 2002) for training and evaluation of the CRFs. In order to examine the improvement by using negative examples, we train CRFs with two topologies (Figure 1). The first, CRF, models the target relationship and background sequences and is trained on a corpus where targets (positive examples) are annotated. The second, CRF+E, models the target relationship, spurious targets and background sequences, and it is trained on a corpus where targets (positive examples) as well as spurious targets (negative examples) are annotated.</Paragraph>
    <Section position="1" start_page="484" end_page="485" type="sub_section">
      <SectionTitle>
Experimental Results
</SectionTitle>
      <Paragraph position="0"> To test the performance of the different extractors, we collected a set of 152 semi-structured mini-biographies from an online site (www.infoplease.com), and used simple rules to extract a biographic fact database of birthday and month (henceforth birthday), birth year, occupation, birth place, and year of death (when applicable).</Paragraph>
      <Paragraph position="1"> An example of the data can be found in Table 1. In our system, we normalized birthdays, and performed capitalization normalization for the remaining fields. We did no further normalization, such as normalizing state names to their two letter acronyms (e.g., California - CA). Fifteen names were set aside as training data, and the rest were used for testing. For each name, 150 documents were downloaded from Google to serve as the hook corpus for either training or testing.9 In training, we automatically annotated documents using people in the training set as hooks, and in testing, tried to get targets that exactly matched what was present in the database. This is a very strict method of evaluation for three reasons. First, since the facts were automatically collected, they contain  entry contains incomplete information about various celebrities. Here, Aaron Neville's birth state is missing, and Frank Zappa could be equally well described as a guitarist or rock-star. errors and thus the system is tested against wrong answers.10 Second, the extractors might have retrieved information that was simply not present in the database but nevertheless correct (e.g., someone's occupation might be listed as writer and the retrieved occupation might be novelist). Third, since the retrieved targets were not normalized, there system may have retrieved targets that were correct but were not recognized (e.g., the database birthplace is New York, and the system retrieves NY).</Paragraph>
      <Paragraph position="2"> In testing, we rejected candidate targets that were not present in our target set models Er. In some cases, this resulted in the system being unable to find the correct target for a particular relationship, since it was not in the target set.</Paragraph>
      <Paragraph position="3"> Before fusion (Section 3), we gathered all the facts extracted by the system and graded them in isolation. We present the per-extraction precision Pre-Fusion Precision = # Correct Extracted Targets# Total Extracted Targets We also present the pseudo-recall, which is the average number of times per person a correct target was extracted. It is difficult to calculate true recall without manual annotation of the entire corpus, since it cannot be known for certain how many times the document set contains the desired information.11</Paragraph>
      <Paragraph position="5"> The precision of each of the various extraction methods is listed in Table 2. The data show that on average the Rote method has the best precision, 10These deficiencies in testing also have implications for training, since the models will be trained on annotated data that has errors. The phenomenon of missing and inaccurate data was most prevalent for occupation and birthplace relationships, though it was observed for other relationships as well.</Paragraph>
      <Paragraph position="6"> 11It is insufficient to count all text matches as instances that the system should extract. To obtain the true recall, it is necessary to decide whether each sentence contains the desired relationship, even in cases where the information is not what the biographies have listed.</Paragraph>
      <Paragraph position="7">  while the NB+E extractor has the worst. Training the CRF with negative examples (CRF+E) gave better precision in extracted information then training it without negative examples. Table 3 lists the pseudo-recall or average number of correctly extracted targets per person. The results illustrate that the Rote has the worst pseudo-recall, and the plain CRF, trained without negative examples, has the best pseudo-recall.</Paragraph>
      <Paragraph position="8"> To test how the extraction precision changes as more documents are retrieved from the ranked results from Google, we created retrieval sets of 1, 5, 15, 30, 75, and 150 documents per person and repeated the above experiments with the CRF+E extractor. The data in Figure 2 suggest that there is a gradual drop in extraction precision throughout the corpus, which may be caused by the fact that documents further down the retrieved list are less relevant, and therefore less likely to contain the relevant  fusion precision drops.</Paragraph>
      <Paragraph position="9"> However, even though the extractor's precision drops, the data in Figure 3 indicate that there continue to be instances of the relevant biographic data.  are added.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="485" end_page="486" type="metho">
    <SectionTitle>
3 Cross-Document Information Fusion
</SectionTitle>
    <Paragraph position="0"> The per-extraction performance was presented in Section 2, but the final task is to find the single correct target for each person.12 In this section, we examine two basic methodologies for combining candidate targets. Masterson and Kushmerick (2003) propose Best which gives each candidate a score equal to its highest confidence extraction:</Paragraph>
    <Paragraph position="2"> Voting, which counts the number of times each candidate x was extracted: Vote(x) = |C(x) &gt; 0|.</Paragraph>
    <Paragraph position="3"> Each of these methods ranks the candidate targets by score and chooses the top-ranked one.</Paragraph>
    <Paragraph position="4"> The experimental setup used in the fusion experiments was the same as before: training on 15 people, and testing on 137 people. However, the post-fusion evaluation differs from the pre-fusion evaluation. After fusion, the system returns one consensus target for each person and thus the evaluation is on the accuracy of those targets. That is, missing tar12This is a simplifying assumption, since there are many cases where there might exist multiple possible values, e.g., a person may be both a writer and a musician.</Paragraph>
    <Paragraph position="5"> 13C(x) is either the confidence estimate (NB+E) or the probability score (Rote,CRF,CRF+E).</Paragraph>
    <Paragraph position="6">  Additionally, since the targets are ranked, we also calculated the mean reciprocal rank (MRR).15 The data in Table 4 show the average system performance with the different fusion methods. Frequency voting gave anywhere from a 2% to a 20% improvement over picking the highest confidence candidate.  CRF+E has best average performance (67.8%).</Paragraph>
    <Paragraph position="7"> Table 5 shows the results of using each of these extractors to extract correct relationships from the top 150 ranked documents downloaded from the 14For year of death, we only graded cases where the person had died.</Paragraph>
    <Paragraph position="8"> 15The reciprocal rank = 1 / the rank of the correct target. Web. CRF+E was a top performer in 3/5 of the cases. In the other 2 cases, the NB+E was the most successful, perhaps because NB+E's increased recall was more useful than CRF+E's improved precision. null Retrieval Set Size and Performance As with pre-fusion, we performed a set of experiments with different retrieval set sizes and used the CRF+E extraction system trained on 150 documents per person. The data in Figure 4 show that performance improves as the retrieval set size increases. Most of the gains come in the first 30 documents, where average performance increased from 14% (1 document) to 63% (30 documents). Increasing the retrieval set size to 150 documents per person yielded an additional 5% absolute improvement.</Paragraph>
    <Paragraph position="9">  person Post-fusion errors come from two major sources.</Paragraph>
    <Paragraph position="10"> The first source is the misranking of correct relationships. The second is the case where relevant information is not retrieved at all, which we measure as Post-Fusion Missing = # Missing Targets# People The data in Figure 5 suggest that the decrease in missing targets is a significant contributing factor to the improvement in performance with increased document size. Missing targets were a major problem for Birthplace, constituting more than half the errors (32% at 150 documents).</Paragraph>
  </Section>
  <Section position="5" start_page="486" end_page="487" type="metho">
    <SectionTitle>
4 Cross-Field Bootstrapping
</SectionTitle>
    <Paragraph position="0"> Sections 2 and 3 presented methods for training separate extractors for particular relationships and for doing fusion across multiple documents. In this section, we leverage data interdependencies to improve performance.</Paragraph>
    <Paragraph position="1"> The method we propose is to bootstrap across fields and use knowledge of one relationship to improve performance on the extraction of another. For  (f) indicates that the best fused result was taken. birth year(f) means birth years were annotated using the system that discovered the most accurate birth years.</Paragraph>
    <Paragraph position="2"> example, to extract birth year given knowledge of the birthday, in training we mark up each hook corpus Dx with the known birthday b : birthday(x,b) and the target birth year y : birthyear(x,y) and add an additional feature to the CRF that indicates whether the birthday has been seen in the sentence.16 In testing, for each hook, we first find the birthday using the methods presented in the previous sections, annotate the corpus with the extracted birthday, and then apply the birth year CRF (see Figure 6 next page).</Paragraph>
    <Paragraph position="3"> 16The CRF state model doesn't change. When bootstrapping from multiple fields, we add the conjunctions of the fields as features.</Paragraph>
    <Paragraph position="4"> Table 6 shows the effect of using this bootstrapped data to estimate other fields. Based on the relative performance of each of the individual extraction systems, we chose the following schedule for performing the bootstrapping: 1) Birthday, 2) Birth year, 3) Occupation, 4) Birthplace. We tried adding in all knowledge available to the system at each point in the schedule.17 There are gains in accuracy for birth year, occupation and birthplace by using cross-field bootstrapping. The performance of the plain CRF+E averaged across all five fields is 67.4%, while for the best bootstrapped system it is 74.6%, a gain of 7%.</Paragraph>
    <Paragraph position="5"> Doing bootstrapping in this way improves for people whose information is already partially correct. As a result, the percentage of people who have completely correct information improves to 37% from 13.8%, a gain of 24% over the non-bootstrapped CRF+E system. Additionally, erroneous extractions do not hurt accuracy on extraction of other fields. Performance in the bootstrapped system for birthyear, occupation and birth place when the birthday is wrong is almost the same as performance in the non-bootstrapped system.</Paragraph>
  </Section>
  <Section position="6" start_page="487" end_page="488" type="metho">
    <SectionTitle>
5 Training Set Size Reduction
</SectionTitle>
    <Paragraph position="0"> One of the results from Section 2 is that lower ranked documents are less likely to contain the relevant biographic information. While this does not have an dramatic effect on the post-fusion accuracy (which improves with more documents), it suggests that training on a smaller corpus, with more relevant documents and more sentences with the desired information, might lead to equivalent or improved performance. In a final set of experiments we looked at system performance when the extractor is trained on fewer than 150 documents per person.</Paragraph>
    <Paragraph position="1"> The data in Figure 7 show that training on 30 documents per person yields around the same performance as training on 150 documents per person. Average performance when the system was trained on 30 documents per person is 70%, while average performance when trained on 150 documents per per-son is 68%. Most of this loss in performance comes from losses in occupation, but the other relationships 17This system has the extra knowledge of which fused method is the best for each relationship. This was assessed by inspection.</Paragraph>
    <Paragraph position="2">  Frank Zappa was born on December 21.</Paragraph>
    <Paragraph position="3"> 1. BirthdayZappa : December 21, 1940.</Paragraph>
    <Paragraph position="4"> 2. Birthyear1. Birthday 2. Birthyear 3. Birthplace  December 21, is extracted and the text marked. In step 2, cooccurrences with the discovered birthday make 1940 a better candidate for birthyear. In step (3), the discovered birthyear appears in contexts where the discovered birthday does not and improves extraction of birth place.</Paragraph>
    <Paragraph position="5">  have either little or no gain from training on additional documents. There are two possible reasons why more training data may not help, and even may hurt performance.</Paragraph>
    <Paragraph position="6"> One possibility is that higher ranked retrieved documents are more likely to contain biographical facts, while in later documents it is more likely that automatically annotated training instances are in fact false positives. That is, higher ranked documents are cleaner training data. Pre-Fusion precision results (Figure 8) support this hypothesis since it appears that later instances are often contaminating earlier models.</Paragraph>
    <Paragraph position="7">  creased training documents.</Paragraph>
    <Paragraph position="8"> The data in Figure 9 suggest an alternate possibility that later documents also shift the prior toward a model where it is less likely that a relationship is observed as fewer targets are extracted.</Paragraph>
  </Section>
  <Section position="7" start_page="488" end_page="489" type="metho">
    <SectionTitle>
6 Related Work
</SectionTitle>
    <Paragraph position="0"> The closest related work to the task of biographic fact extraction was done by Cowie et al. (2000) and Schiffman et al. (2001), who explore the problem of biographic summarization.</Paragraph>
    <Paragraph position="1"> There has been rather limited published work in multi-document information extraction. The closest work to what we present here is Masterson and Kushmerick (2003), who perform multi-document information extraction trained on manually annotated training data and use Best Confidence to resolve each particular template slot. In summarizarion, many systems have examined the multi-document case. Notable systems are SUMMONS (Radev and McKeown, 1998) and RIPTIDE (White et al., 2001), which assume perfect extracted information and then perform closed domain summarization. Barzilay et al. (1999) does not explicitly extract facts, but instead picks out relevant repeated elements and combines them to obtain a summary which retains the semantics of the original.</Paragraph>
    <Paragraph position="2"> In recent question answering research, information fusion has been used to combine multiple candidate answers to form a consensus answer.</Paragraph>
    <Paragraph position="3"> Clarke et al. (2001) use frequency of n-gram occurrence to pick answers for particular questions. Another example of answer fusion comes in (Brill et al., 2001) which combines the output of multiple question answering systems in order to rank answers. Dalmas and Webber (2004) use a WordNet cover heuristic to choose an appropriate location from a large candidate set of answers.</Paragraph>
    <Paragraph position="4"> There has been a considerable amount of work in training information extraction systems from annotated data since the mid-90s. The initial work in the field used lexico-syntactic template patterns learned using a variety of different empirical approaches (Riloff and Schmelzenbach, 1998; Huffman, 1995;  Soderland et al., 1995). Seymore et al. (1999) use HMMs for information extraction and explore ways to improve the learning process.</Paragraph>
    <Paragraph position="5"> Nahm and Mooney (2002) suggest a method to learn word-to-word relationships across fields by doing data mining on information extraction results. Prager et al. (2004) uses knowledge of birth year to weed out candidate years of death that are impossible. Using the CRF extractors in our data set, this heuristic did not yield any improvement. More distantly related work for multi-field extraction suggests methods for combining information in graphical models across multiple extraction instances (Sutton et al., 2004; Bunescu and Mooney, 2004) .</Paragraph>
  </Section>
class="xml-element"></Paper>