<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1082"> <Title>A Methodology for Extrinsically Evaluating Information Extraction Performance</Title> <Section position="3" start_page="2" end_page="656" type="intro"> <SectionTitle> 2 Methodology </SectionTitle> <Paragraph position="0"> Figure 1 gives an overview of the methodology. The left portion of the figure shows source documents provided both to a system and to a human to produce two extraction databases, one corresponding to SERIF's automated performance and one corresponding to double-annotated, human accuracy. By merging portions of those two sources in varying degrees (&quot;blends&quot;), one can derive several extracted databases ranging from machine quality, through varying percentages of improved performance, up to human accuracy. This method of blending databases provides a means of answering hypothetical questions (e.g., what if the state of the art were x% closer to human accuracy?) with a single set of answer keys.</Paragraph> <Paragraph position="1"> A person using a given extraction database performs a task, in our case QA. The measures of effectiveness in our study were time to complete the task and percent of questions answered correctly. An extrinsic measure of the value of improved IE performance is obtained by rotating users through different extraction databases and question sets.</Paragraph> <Paragraph position="2"> In our preliminary study, databases of fully automated IE and manual annotation (the gold standard) were populated with entities, relationships, and co-reference links from 946 documents. The two initial databases, representing machine extraction and human extraction respectively, were then blended to produce a continuum of database qualities from machine to human performance. ACE Value Scores were measured for each database. Pilot studies were conducted to develop questions for a QA task.</Paragraph> <Paragraph position="3"> Each participant answered four sets of questions, each with a different extraction database representing a different level of IE accuracy. An answer capture tool recorded the time to answer each question, along with additional data to confirm that the participant followed the study protocol. The answers were then evaluated for accuracy, and the relationship between QA performance and IE quality was established.</Paragraph> <Paragraph position="4"> Each experiment used four databases. The first experiment used databases spanning the range from solely machine extraction to solely human extraction. Based on the results of this experiment, two further experiments focused on smaller ranges of database quality to study the relationship between IE and QA performance.</Paragraph> <Section position="1" start_page="652" end_page="653" type="sub_section"> <SectionTitle> 2.1 Source Document Selection, Annotation, and Extraction </SectionTitle> <Paragraph position="0"> Source documents were selected based on the availability of manual annotation. We identified 946 broadcast news and newswire articles from recent ACE efforts, all annotated by the LDC according to the ACE guidelines for the relevant year (2002, 2003, or 2004).
Entities, relations, and within-document co-reference were marked.</Paragraph> <Paragraph position="1"> Inter-document co-reference annotation was added by BBN. The 946 news articles comprised 363 articles (187,720 words) from newswire and 583 articles (122,216 words) from broadcast news. With some corrections to deal with errors and changes in guidelines, the annotations were loaded as the human (DB-quality 100) database.</Paragraph> <Paragraph position="2"> The 2004 ACE evaluation plan, available at http://www.nist.gov/speech/tests/ace/ace04/doc/ace04-evalplanv7.pdf, contains a full description of the scoring metric used in the evaluation. Entity type weights were 1, and the level weights were NAM=1.0, NOM=0.5, and PRO=0.1.</Paragraph> <Paragraph position="3"> SERIF, BBN's automatic IE system, which builds on its predecessor SIFT (Miller, 2000), was run on the 946 ACE documents to create the machine (DB-quality 0) database. SERIF is a statistically trained system that automatically performs entity, co-reference, and relationship extraction.</Paragraph> <Paragraph position="4"> Intermediate IE performance was simulated by blending the human and automatically generated databases in varying degrees using an interpolation algorithm developed specifically for this study. To create a blended database, DB-quality n, all of the entities, relationships, and co-reference links common to the human and automatically generated databases are copied into a new database. Then, n% of the entity mentions in the human database (100), but not in the automatic IE system output (0), are copied, and (100 - n)% of the entity mentions in the automatically generated database, but not in the human database, are copied. Next, the relationships for which both of the constituent entity mentions have been copied are also copied to the blended database. Finally, co-reference links and entities for the already copied entity mentions are copied into the blended database.</Paragraph> <Paragraph position="5"> For the first experiment, two intermediate extraction databases were created: DB-qualities 33 and 67. These were both created using the 0 and 100 databases as seeds. For the second experiment, two additional databases were created: 16.5 and 50. The 16.5 database was created by mixing the 0 and 33 databases in a 50% blend, and the 50 database was created by doing the same with the 33 and 67 databases. For Experiment 3, the 41 and 58 databases were created by mixing the 33 and 50 databases and the 50 and 67 databases, respectively. To validate the interpolation algorithm and blending procedure, we applied NIST's 2004 ACE Scorer to the eight extraction databases.</Paragraph> <Paragraph position="6"> Polynomial approximations were fitted against both the entity and relation extraction curves.</Paragraph> <Paragraph position="7"> Entity performance was found to vary linearly with DB blend (R = .9853), and relation performance was found to vary with the square of DB blend (R = .9961). Table 1 shows the scores for each blend, and Table 2 shows the counts of entities, relationships, and descriptions.</Paragraph> </Section> <Section position="2" start_page="653" end_page="653" type="sub_section"> <SectionTitle> 2.2 Question Answering Task </SectionTitle> <Paragraph position="0"> Extraction effectiveness was measured by how well a person could answer questions given a database of facts, entities, and documents. Participants answered four sets of questions using four databases.
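The interpolation algorithm described in Section 2.1 above can be summarized in a short sketch. The Python below is a minimal illustration, assuming each extraction database is reduced to a set of entity mentions, a set of relations over mention pairs, and a set of mention-to-entity co-reference links; the data structures and function names are hypothetical, not BBN's implementation.

```python
import random
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Mention:
    doc_id: str
    span: tuple        # (start, end) character offsets in the document
    entity_id: str     # identifier of the entity this mention refers to

@dataclass
class ExtractionDB:
    mentions: set = field(default_factory=set)   # set of Mention
    relations: set = field(default_factory=set)  # set of (Mention, Mention, rel_type)
    coref: set = field(default_factory=set)      # set of (Mention, entity_id)

def blend(human: ExtractionDB, machine: ExtractionDB, n: float, seed: int = 0) -> ExtractionDB:
    """Build a DB-quality-n database: n=0 approximates machine output, n=100 human annotation."""
    rng = random.Random(seed)
    out = ExtractionDB()

    # 1. Entity mentions common to both databases are always copied.
    out.mentions |= human.mentions & machine.mentions

    # 2. Copy n% of the human-only mentions and (100 - n)% of the machine-only mentions.
    human_only = sorted(human.mentions - machine.mentions, key=repr)
    machine_only = sorted(machine.mentions - human.mentions, key=repr)
    out.mentions |= set(rng.sample(human_only, round(len(human_only) * n / 100)))
    out.mentions |= set(rng.sample(machine_only, round(len(machine_only) * (100 - n) / 100)))

    # 3. A relationship is copied only if both of its constituent mentions were copied.
    for a, b, rel_type in human.relations | machine.relations:
        if a in out.mentions and b in out.mentions:
            out.relations.add((a, b, rel_type))

    # 4. Co-reference links (and hence entities) for the already copied mentions.
    for mention, entity_id in human.coref | machine.coref:
        if mention in out.mentions:
            out.coref.add((mention, entity_id))
    return out
```

Under these assumptions, blend(human_db, machine_db, n=33) would produce a DB-quality 33 database, while n=0 and n=100 reproduce the machine and human databases, respectively.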
They accessed the database using BBN's FactBrowser (Miller, 2001) and recorded their answers and source citations in a separate tool developed for this study, AnswerPad.</Paragraph> <Paragraph position="1"> Each database represented a different database quality. In some databases, facts were missing or incorrect facts were recorded.</Paragraph> <Paragraph position="2"> Consequently, answers were more accessible in some databases than in others, and participants had to vary their question-answering strategy depending on the database.</Paragraph> <Paragraph position="3"> Participants were given five minutes to answer each question. To ensure that they had actually located the answer rather than relied on world knowledge, they were required to provide source citations for every answer. The instructions emphasized that the investigation was a test of the system, not of their world knowledge or web search skills. Compliance with these instructions was high; users resorted to knowledge-based proper-noun searches only one percent of the time. In addition, keyword search was disabled to force participants to rely on the database features.</Paragraph> </Section> <Section position="3" start_page="653" end_page="654" type="sub_section"> <SectionTitle> 2.3 Participants </SectionTitle> <Paragraph position="0"> Study participants were recruited through local web lists and at local colleges and universities.</Paragraph> <Paragraph position="1"> Participants were restricted to college students and recent graduates with PC (not Mac) experience, without reading disabilities, and whose native language was English. No other screening was necessary because the design called for each participant to serve as his or her own control, and because opportunities to use world knowledge in answering the questions were minimized through the interface and procedures. During the first two months of the study, 23 participants helped develop questions, participant criteria, and the overall test procedure. Then, experiments were conducted comparing the 0, 33, 67, and 100 database blends (Experiment 1, 20 subjects); the 0, 16.5, 33, and 50 database blends (Experiment 2, 20 subjects); and the 33, 41, 50, and 58 database blends (Experiment 3, 24 subjects).</Paragraph> </Section> <Section position="4" start_page="654" end_page="655" type="sub_section"> <SectionTitle> 2.4 Question Selection and Validation </SectionTitle> <Paragraph position="0"> Questions were developed over two months of pilot studies. The goal was to find a set of questions that would be differentially supported by the 0, 33, 67, and 100 databases. We explored both &quot;random&quot; and &quot;engineered&quot; approaches. The random approach called for creating questions using only the documents, without reference to the kind of information extracted. Using a list of keywords, one person generated 86 questions involving relationships and entities pertaining to politics and the military by scanning the 946 ACE documents for references to each keyword and devising questions based on the information she found.</Paragraph> <Paragraph position="1"> The alternative, engineered approach involved eliminating questions that were not supported by the types of information extracted by SERIF, and generating additional questions to fit the desired pattern of increasing support with increased human annotation.
This approach ensured that the question sets reflected the structural differences assumed to exist in the databases, and it produced psychophysical data that link the degree of QA support to human performance parameters. The IE results from four of the databases (0, 33, 67, and 100) were used to develop questions that received differential support from the different-quality databases. For example, such a question could be answered using the automatically extracted results, but might be more straightforwardly answered given human annotation.</Paragraph> <Paragraph position="2"> Sixty-four questions, plus an additional ten practice questions, were created using the engineered approach. Additional criteria followed in creating the question sets were: 1) questions had to contain at least one reasonable entry hook into all four databases (e.g., the terms U.S. and America were considered too broad to be reasonable); and 2) for ease of scoring, list-type questions had to specify the number of answers required. Alternative criteria were considered but rejected because they correlated with the aforementioned set. An example engineered question concerned municipal elections between two Shiite groups in the year 1998.</Paragraph> <Paragraph position="3"> Two question lists, one with 86 questions generated by the random procedure and one with 64 questions generated by the engineered procedure, were analyzed with respect to the degree of support afforded by each of the four databases as viewed through FactBrowser. Four a priori criteria were established to assess the degree of support (or its opposite, the degree of expected difficulty) for each question in each of the four databases. Ranked from easiest to hardest, they are listed in Table 3.</Paragraph> <Paragraph position="4"> The question can be answered...</Paragraph> <Paragraph position="5"> 1. directly with a fact or description (the answer is highlighted in the FactBrowser citation); 2. indirectly with a fact or description (the answer is not highlighted); 3. with a name mentioned in the question (a long list of mentions without context); or 4. via database crawling. Table 4 shows the question difficulty levels for both question types, for each of the four databases. Analysis of the engineered set was done on all 64 questions. Analysis of the randomly generated questions was done on a random sample of 44 of the 86 questions; fifteen of these did not meet the question criteria, leaving 29.</Paragraph> <Paragraph position="6"> The randomly generated questions showed a statistically significant, but small, variation in expected difficulty, in part due to the number of unanswerable questions. Although the questions were written with respect to information found in the documents, the process did not consider the types of extracted entities and relations. This problem might have been mitigated by limiting the search to questions involving entities and relations that were part of the extraction task. By contrast, the engineered question set showed a highly significant decrease in expected difficulty as the percentage of human annotation in the database increased (p < 0.0001 by chi-square analysis). This result is not surprising, given that the questions were constructed with reference to the list of entities in the four databases.
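The chi-square analysis of expected difficulty reported above can be illustrated with a brief sketch. The contingency counts below are invented for illustration only; they are not the study's data, and only the form of the test (difficulty level by database blend) mirrors the analysis described here.

```python
# Hypothetical illustration of a chi-square test of expected difficulty vs. database blend.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: a priori difficulty levels 1-4 from Table 3 (easiest to hardest).
# Columns: DB-quality 0, 33, 67, 100. Counts are made up for illustration.
counts = np.array([
    [10, 18, 26, 34],   # answered directly with a fact or description
    [12, 16, 18, 16],   # answered indirectly with a fact or description
    [22, 18, 12,  9],   # answered via a name mentioned in the question
    [20, 12,  8,  5],   # answered via database crawling
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2g}")
```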
The analysis confirms that the experimental manipulation of different degrees of support provided by the four databases was achieved for this question set.</Paragraph> </Section> <Section position="5" start_page="655" end_page="655" type="sub_section"> <SectionTitle> Function of Database Quality </SectionTitle> <Paragraph position="0"> Preliminary human testing with both question sets suggested that the a priori difficulty indicators predict human question-answering performance. Experiments with the randomly generated questions, therefore, were unlikely to reveal much about the databases or about human question-answering performance. On the other hand, an examination of how different levels of database quality affect human performance, in a psychophysical experiment where structure is varied systematically, promised to address the question of how much support is needed for good performance. Based on the question difficulties and on pilot-study timing and performance results, the 64 questions were grouped into four balanced 16-question sets.</Paragraph> </Section> <Section position="6" start_page="655" end_page="655" type="sub_section"> <SectionTitle> 2.5 Procedure </SectionTitle> <Paragraph position="0"> Participants were tested individually at our site, in sessions lasting roughly four hours. Training prior to the test lasted approximately half an hour and consisted of a walk-through of the interface features followed by guided practice with sample questions. The test consisted of four question sets, each with a different database. Participants were informed that they would be using a different database for each question set and that some might be easier to use than others.</Paragraph> <Paragraph position="1"> Questions were automatically presented and responses were captured in AnswerPad, a software tool designed for the study. AnswerPad is shown in Figure 2.</Paragraph> <Paragraph position="2"> Key features of the tool include: * Limiting the view to the current question set, disallowing participants to view previous question sets</Paragraph> </Section> <Section position="7" start_page="655" end_page="656" type="sub_section"> <SectionTitle> Answer Capture Interface </SectionTitle> <Paragraph position="0"> Participants were given written documentation as part of their training. They were instructed to cut and paste question answers and document citations from source documents into AnswerPad.</Paragraph> <Paragraph position="1"> Extracted facts, entities, and source documents were accessed through FactBrowser.</Paragraph> <Paragraph position="2"> FactBrowser, shown in Figure 3, is web-browser based and is invoked via a button in AnswerPad.</Paragraph> <Paragraph position="3"> FactBrowser allows one to enter a string, which is matched against the database of entity mentions. Entities with at least one mention partially matching the string (e.g., &quot;Laura Bush&quot;) are returned, along with an icon indicating the entity type and the number of documents in which the entity appears.</Paragraph> <Paragraph position="4"> Clicking on an entity in the left panel causes the top right panel to display all of the descriptions, facts, and mentions for that entity. Selecting one of these displays citations in which the description, fact, or mention occurs. Clicking on a citation opens a document view in the lower right corner of the screen and highlights the extracted information in the text.
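The FactBrowser string-matching behavior just described can be made concrete with a small sketch. The index layout, entity entries, and function name below are hypothetical stand-ins, not BBN's implementation.

```python
# Toy mention index: entity_id -> (entity_type, list of (doc_id, mention_string)).
# Contents are invented for illustration.
INDEX = {
    "E1": ("PER", [("doc01", "Laura Bush"), ("doc07", "Mrs. Bush")]),
    "E2": ("ORG", [("doc02", "World Bank"), ("doc02", "the Bank")]),
}

def lookup(query: str):
    """Return entities with at least one mention partially matching the query,
    along with the entity type and the number of documents the entity appears in."""
    query = query.lower()
    results = []
    for entity_id, (etype, mentions) in INDEX.items():
        if any(query in mention.lower() for _, mention in mentions):
            doc_count = len({doc for doc, _ in mentions})
            results.append((entity_id, etype, doc_count))
    return results

print(lookup("bush"))   # -> [('E1', 'PER', 2)]
```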
When a document is displayed, all of the entities detected in the document are listed down the left side of the screen. The browsing tool was instrumented to record command invocations so that the path a participant took to answer a question could be recreated and the participant's adherence to protocol could be verified. Furthermore, the find function (Ctrl-F) was disabled to prevent users from performing ad hoc searches of the documents instead of using the extracted data.</Paragraph> <Paragraph position="5"> The order of question sets and the order of database conditions were counterbalanced across participants, so that, for every four participants, every question set and database appeared once in every ordinal position, and every question set was paired once with every database. This avoided carryover effects from question-set and database order.</Paragraph> </Section> <Section position="8" start_page="656" end_page="656" type="sub_section"> <SectionTitle> 2.6 Data Collected </SectionTitle> <Paragraph position="0"> Based on the initial results from Experiment 1, a 70% target-effectiveness threshold was found to fall between the 33 and 67 database blends. To refine and verify this finding, Experiment 2 examined the 0, 16.5, 33, and 50 database blends, and Experiment 3 examined the 33, 41, 50, and 58 database blends.</Paragraph> <Paragraph position="1"> AnswerPad collected participant-provided answers to the questions and the corresponding citations. In addition, AnswerPad recorded the time spent answering each question; a limit of five minutes per question was imposed based on pilot-study results. The browsing tool logged the commands invoked while the user searched the fact base for question answers. Questions were manually scored based on the answers in the provided corpus. No partial credit was given. The maximum score for each database condition was 16, for a total maximum score of 64.</Paragraph> </Section> </Section> </Paper>