<?xml version="1.0" standalone="yes"?> <Paper uid="M98-1003"> <Title>Analyzing the Complexity of a Domain With Respect To An Information Extraction Task</Title> <Section position="1" start_page="0" end_page="7" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> In this paper we describe a method of classifying facts (information) into categories or levels, where each level signifies a different degree of syntactic complexity related to a fact. Based on this classification mechanism, we also propose a method of evaluating a domain by assigning to it a &quot;domain number&quot; based on the levels of a set of standard facts present in the articles of that domain.</Paragraph> <Paragraph position="1"> Introduction The Message Understanding Conferences (MUCs) have been held with the goal of qualitatively evaluating message understanding systems. While the MUCs held thus far have been quite successful at providing such an evaluation, very little work has been done in analyzing the difficulty of understanding a text in a particular domain, either independently or in comparison with understanding a text in some other domain.</Paragraph> <Paragraph position="2"> The organizers of MUC-5 attempted to compare the difficulty of the EJV (English Joint Ventures) task in MUC-5 to the terrorist task of MUC-3 and MUC-4. The criteria used for comparing these two tasks included the vocabulary size, the average sentence length, the average number of sentences per text, the number of texts, etc. [Sundheim 1993]. The organizers of MUC-6 did not attempt to compare the difficulty of the MUC-6 task to the previous MUC tasks, saying that &quot;the problem of coming up with a reasonable, objective way of measuring relative task difficulty has not been adequately addressed&quot; [Sundheim 1995]. 
In this paper we describe a method of classifying facts (information) into categories or levels, where each level signifies a different degree of syntactic complexity related to a fact. Based upon this classification mechanism, we also propose a method of evaluating a domain by assigning to it a &quot;domain number&quot; based on the levels of a set of standard facts present in the articles of that domain. In addition, using the proposed classification mechanism, we analyze the complexity of the MUC-7 information extraction task and compare it to the complexities of the information extraction tasks of MUC-4, MUC-5, and MUC-6.</Paragraph> <Section position="1" start_page="0" end_page="2" type="sub_section"> <SectionTitle> Definitions </SectionTitle> <Paragraph position="0"> Network: A network consists of a collection of nodes interconnected by an accompanying set of arcs. Each node denotes an object and each arc represents a binary relation between the objects [Hendrix 1979]. A Partial Network: A partial network is a collection of nodes interconnected by an accompanying set of arcs where the collection of nodes is a subset of a collection of nodes forming a network, and the accompanying set of arcs is a subset of the set of arcs accompanying the set of nodes which form the network. Figure 1 shows a sample network for the following piece of text: &quot;The Extraditables,&quot; or the Armed Branch of the Medellin Cartel have claimed responsibility for the murder of two employees of Bogota's daily El Espectador on Nov 15. The murders took place in Medellin.</Paragraph> <Paragraph position="1"> The Level of A Fact The level of a fact, F, in a piece of text is defined by the following algorithm: 1. Build a network, S, for the piece of text.</Paragraph> <Paragraph position="2"> 2. 
Suppose the fact, F, consists of several nodes {x_1, x_2, ..., x_n}. We define the level of the fact, F, with respect to the network, S, to be equal to k, the number of arcs linking the nodes which comprise the fact F.</Paragraph> <Paragraph position="3"> Observations Given the definition of the level of a fact, the following observations can be made: • The level of a fact is related to the concept of &quot;semantic vicinity&quot; defined by Schubert et al. [Schubert 1979]. The semantic vicinity of a node in a network consists of the nodes and the arcs reachable from that node by traversing a small number of arcs. The fundamental assumption used here is that &quot;the knowledge required to perform an intellectual task generally lies in the semantic vicinity of the concepts involved in the task&quot; [Schubert 1979].</Paragraph> <Paragraph position="4"> The level of a fact is equal to the number of arcs that one needs to traverse to reach all the concepts (nodes) which comprise the fact of interest.</Paragraph> <Paragraph position="5"> • The level of a fact depends on the network constructed for the piece of text. Therefore, the level of a fact with respect to a network built at the word level (i.e. words represent objects and the relationships between the objects) will be greater than the level of a fact with respect to a network built at the phrase level (i.e. noun groups represent objects while verb groups and preposition groups represent the relationships between the objects).</Paragraph> </Section> <Section position="2" start_page="2" end_page="4" type="sub_section"> <SectionTitle> Examples </SectionTitle> <Paragraph position="0"> Let S be the network shown in Figure 1. 
S has been built at the phrase level.</Paragraph> <Paragraph position="1"> • The city mentioned, in S, is an example of a level-0 fact because the &quot;city&quot; fact consists only of one node, &quot;Medellin.&quot; • The type of attack, in S, is an example of a level-1 fact.</Paragraph> <Paragraph position="2"> We define the type of attack in the network to be an attack designator such as &quot;murder,&quot; &quot;bombing,&quot; or &quot;assassination&quot; with one modifier giving the victim, perpetrator, date, location, or other information. In this case the type of attack fact is composed of the &quot;the murder&quot; and the &quot;two employees&quot; nodes and their connector. This makes the type of attack a level-1 fact.</Paragraph> <Paragraph position="3"> The type of attack could appear as a level-0 fact as in &quot;the Medellin bombing&quot; (assuming that the network is built at the phrase level) because in this case both the attack designator (bombing) and the modifier (Medellin) occur in the same node. The type of attack fact occurs as a level-2 fact in the following sentence (once again assuming that the network is built at the phrase level): &quot;10 people were killed in the offensive which included several bombings.&quot; In this case there is no direct connector between the attack designator (several bombings) and its modifier (10 people). They are connected by the intermediate &quot;the offensive&quot; node, thereby making the type of attack a level-2 fact. 
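The level-2 reading of this sentence can be checked with a small sketch. It assumes a phrase-level network stored as (node, connector, node) triples and handles only a fact made of two nodes, for which the level is taken to be the fewest arcs linking them. The function name arcs_between and the triple layout are hypothetical choices for illustration, not part of the method as published; a fact with more than two nodes would need the arcs of the whole connecting partial network to be counted.

```python
from collections import deque


def arcs_between(arcs, start, goal):
    """Return the fewest arcs linking start to goal, or None if unconnected."""
    # Treat every arc as undirected: only connectivity matters for the level.
    neighbours = {}
    for left, _connector, right in arcs:
        neighbours.setdefault(left, set()).add(right)
        neighbours.setdefault(right, set()).add(left)

    # Breadth-first search, recording how many arcs were crossed so far.
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return distance[node]
        for nxt in neighbours.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return None


# Phrase-level network for the example sentence (noun groups as nodes).
arcs = [
    ("10 people", "were killed in", "the offensive"),
    ("the offensive", "which included", "several bombings"),
]

level = arcs_between(arcs, "several bombings", "10 people")
print(level)  # 2
```

When the designator and the modifier share a node, as in "the Medellin bombing", the same function returns 0, which matches the level-0 case described above.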
The type of attack can also appear at higher levels.</Paragraph> <Paragraph position="4"> • In S, the date of the murder of the two employees is an example of a level-2 fact.</Paragraph> <Paragraph position="5"> This is because the attack designator (the murder) along with its modifier (two employees) accounts for one level and the arc to &quot;Nov 15&quot; accounts for the second level.</Paragraph> <Paragraph position="6"> The date of the attack, in this case, is not a level-1 fact (as the two nodes &quot;the murder&quot; and &quot;Nov 15&quot; alone would suggest) because the phrase &quot;the murder on Nov 15&quot; does not tell one that an attack actually took place. The article could have been talking about a seminar on murders that took place on Nov 15 and not about the murder of two employees which took place then.</Paragraph> <Paragraph position="7"> • In S, the location of the murder of the two employees is an example of a level-2 fact.</Paragraph> <Paragraph position="8"> The same argument as for the date of the murder of the two employees applies here.</Paragraph> <Paragraph position="9"> • The complete information, in S, about the victims is an example of a level-2 fact because to know that two employees of Bogota's Daily El Espectador were victims, one has to know that they were murdered.</Paragraph> <Paragraph position="10"> The attack designator (the murder) with its modifier (two employees) accounts for one level, while the connector between &quot;two employees&quot; and &quot;Bogota's Daily El Espectador&quot; accounts for the other. • Similarly, the complete information, in S, about the perpetrators of the murder of the two employees is an example of a level-5 fact. 
The breakdown of the 5 levels is as follows: the fact that two employees were murdered accounts for one level; the fact that &quot;The Extraditables&quot; have claimed responsibility for the murders accounts for two additional levels; and the fact that the Extraditables are the &quot;armed branch of the Medellin Cartel&quot; accounts for the remaining two levels.</Paragraph> <Paragraph position="11"> Justification of the Methodology The level of a fact quantifies the &quot;spread&quot; in the information that makes up the fact. Therefore, the higher the level of a fact, the greater is the &quot;spread&quot; in the information that makes up the fact. This means that more processing has to be done to identify and link all the individual pieces of information that make up the fact. In fact, an exploratory study done by Beth Sundheim during MUC-3 showed &quot;a degradation in correctness of message processing as the information distribution in the message became more complex, that is, as slot fills were drawn from larger portions of the message&quot; [Hirschman 1992].</Paragraph> <Paragraph position="12"> An argument can be made that there are other factors, apart from the spread of information, which influence the difficulty of extracting a fact from text. Some of these factors include the amount of training done on an information extraction system, the quality of training, and the frequency of occurrence of the patterns that a system has been trained on. While these factors do influence the performance of an information extraction system and they do give some indication as to how difficult it was for a particular system to extract the fact, they do not give a system-independent way of determining the complexity of extracting the fact.</Paragraph> <Paragraph position="13"> In [Hirschman 1992], Lynette Hirschman proposed the following hypothesis: there are facts that are simply harder to extract, across all systems. 
Based on our definition of the level of a fact, we analyzed the performances of several information extraction systems on the MUC-4 terrorist reports domain. Our analysis shows that all the systems consistently did much worse on higher level facts. In addition to confirming Hirschman's hypothesis, the analysis also shows that higher level facts are indeed harder to extract. [Bagga 1998] gives the complete details about the analysis.</Paragraph> <Paragraph position="14"> Building the Networks As mentioned earlier, the level of a fact for a piece of text depends on the network constructed for the text. Since there is no unique network corresponding to a piece of text, care has to be taken so that the networks are built consistently.</Paragraph> <Paragraph position="15"> For the set of experiments described in the rest of the paper we used the following algorithm to build the networks: 1. Every article was broken up into a non-overlapping sequence of noun groups (NGs), verb groups (VGs), and preposition groups (PGs). The rules employed to identify the NGs, VGs, and PGs were almost the same as the ones employed by SRI's FASTUS system.</Paragraph> <Paragraph position="16"> 2. The nodes of the network consisted of the NGs while the transitions between the nodes consisted of the VGs and the PGs.</Paragraph> <Paragraph position="17"> 3. Identification of coreferent nodes and prepositional phrase attachments was done manually.</Paragraph> <Paragraph position="18"> Obviously, if one were to employ a different algorithm for building the networks, one would get different numbers for the level of a fact. 
But, if the algorithm were employed consistently across all the facts of interest and across all articles in a domain, the numbers on the level of a fact would be consistently different and one would still be able to analyze the relative complexity of extracting that fact from a piece of text in the domain.</Paragraph> <Paragraph position="19"> We wish to thank Jerry Hobbs of SRI for providing us with the rules of their partial parser.</Paragraph> <Paragraph position="20"> Analysis of the MUC-7 Domain Based on our definition of the level of a fact, we analyzed the MUC-7 domain, which consisted of reports on air vehicle launches. We selected a set of standard facts from the official MUC-7 template that we felt captured most of the information in the template. This set consists of: Using the algorithm described in the previous section, we built the networks for 32 of the 64 relevant articles in the 100 article MUC-7 test set. From the network for each article, we calculated the level of each of the standard facts mentioned in the article. The standard facts appeared 619 times in the 32 articles analyzed. Figure 2 shows the level distribution of all the standard facts. In addition, Figures 3, 4, 5, and 6 show the level distributions of each of the standard facts.</Paragraph> <Paragraph position="21"> Figures 4 and 5 show that the curves for the Vehicle and the Vehicle Type facts, and the Payload and Payload Type facts are nearly the same. The main reason is that the phrases (in the text) which describe the vehicles and the payloads, in most cases, also mention their types. For example, the phrase &quot;Long March 3B Rocket&quot; describes the air launch vehicle and also mentions its type.</Paragraph> <Paragraph position="22"> The Vehicle Owner and Manufacturer facts, and the Payload Owner and Manufacturer facts all occur at higher levels than the Vehicle and the Payload facts, indicating that they are harder to extract. 
The Payload recipient fact appears to be the most complex fact, occurring most frequently at level-9 (Figure 6). Evaluating the Difficulty of the MUC-7 Domain We extended our analysis to analyze the difficulty of understanding a text in the MUC-7 domain. Obviously, the difficulty of understanding a text in a domain depends directly on the expected level of a fact in that domain. We define this expected level of a fact in a domain to be the domain number of the domain. The domain number is measured in level units (LUs). Two domains can therefore be compared on the basis of their domain numbers.</Paragraph> <Paragraph position="23"> The formula used to calculate the domain number is: $$\text{Domain Number} = \frac{\sum_{l} l \cdot x_l}{\sum_{l} x_l}$$</Paragraph> <Paragraph position="25"> where $x_l$ is the number of times one of the standard facts appeared at level-l in the articles of the domain.</Paragraph> <Paragraph position="26"> Based on the levels of the standard facts in the MUC-7 test set, we calculated the domain number of the air vehicle launch domain to be 2.44 LUs. The fact that the domain number of this domain is greater than 2 is to be expected given the fact that most curves in Figures 3, 4, 5, and 6 peak at level-2 or higher.</Paragraph> </Section> <Section position="3" start_page="4" end_page="7" type="sub_section"> <SectionTitle> Analysis of MUC-4 </SectionTitle> <Paragraph position="0"> The MUC-4 domain consisted of articles reporting terrorist activities in Latin America. Based on the official MUC-4 template, we selected a set of standard facts that we felt captured most of the information in the template. They are: (The full definition of each fact is not included here.) We then built the networks (using the algorithm described earlier) for the relevant articles from the MUC-4 TST3 set of 100 articles. From the network for each article, we calculated the levels of each of the five standard facts. The level distribution of the five facts for the MUC-4 TST3 set is shown in Figure 7. 
The level distribution of the five facts combined is shown in Figure 8.</Paragraph> <Paragraph position="1"> Based on the data collected above, we made the following observations: • There were 69 relevant articles in the MUC-4 TST3 set of 100 articles, each reporting one or more terrorist attacks.</Paragraph> <Paragraph position="2"> • The five facts of interest appeared 570 times in the 69 articles.</Paragraph> <Paragraph position="3"> • A number of articles reported the same fact at two different places and at two different levels in the same article. The first occurrence was usually in the first paragraph of the text, which reported the attack without giving too many details, and the second came later in the article, when the attack was reported with all the details.</Paragraph> <Paragraph position="4"> As one would expect, the level of the first occurrence of a fact in an article is usually less than or equal to the level of the second occurrence of that fact in the same article.</Paragraph> <Paragraph position="5"> • From Figure 8, we can see that almost 50% of the five facts were at level-1. This is not surprising because four out of the five standard facts most frequently occur as level-1 facts (Figure 7). Based on the levels of the five standard facts in the MUC-4 TST3 set of articles, we calculated the domain number of the terrorist domain to be 1.87 LUs. We are assuming that the set of 100 randomly chosen articles in the MUC-4 TST3 set is representative of the domain. 
This assumption may not necessarily hold, but, given the large number of articles we analyzed, we hope that the domain number calculated is close to the actual domain number of the terrorist domain.</Paragraph> </Section> <Section position="4" start_page="7" end_page="7" type="sub_section"> <SectionTitle> Analysis of MUC-5 </SectionTitle> <Paragraph position="0"> Because two different domains were used in MUC-5 (each in two different languages), we decided to focus only on the English Joint Ventures (EJV) domain. Once again, the set of standard facts was selected from the official MUC-5 template and was chosen such that it contained most of the information in the template. They are: (The full definition of each fact is not included here.) Due to the unavailability of the official test set used for the MUC-5 EJV evaluation, we used a set of 50 articles used by the systems for training on the EJV domain. Using the algorithm described earlier, we then built the networks for the relevant articles. Out of the 50 articles, 47 were relevant and the five standard facts appeared 209 times in these articles. The level distribution of each of the five facts is shown in Figure 9. The level distribution of the five facts combined is shown in Figure 10. Based on Figure 9 one can deduce that the MUC-5 EJV domain is harder than the MUC-4 terrorist domain because three out of the five standard facts most frequently occur as level-2 facts. Figure 10 peaks at level-2, giving further indication that the domain number for this domain is more than 2 LUs.</Paragraph> <Paragraph position="1"> Based on the levels of the standard set of facts, we calculated the domain number of the MUC-5 EJV domain to be 2.67 LUs. This domain number is almost 1 LU higher than the domain number for the MUC-4 terrorist attack domain and it indicates that the MUC-5 EJV task was much harder than the MUC-4 task. 
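As a rough illustration of how a domain number of this kind is obtained, the sketch below computes the expected level from a level distribution, writing x_l (our notation) for the number of times a standard fact occurs at level l. The function name domain_number and the counts are hypothetical; the counts are invented purely for illustration and are not the MUC-4 or MUC-5 distributions, which appear only in the figures.

```python
def domain_number(level_counts):
    """Expected level of a fact: sum of l * x_l divided by sum of x_l."""
    total = sum(level_counts.values())
    if total == 0:
        raise ValueError("no fact occurrences to average over")
    weighted = sum(level * count for level, count in level_counts.items())
    return weighted / total


# Hypothetical distribution (NOT data from the paper):
# level -> number of times a standard fact occurred at that level.
counts = {0: 10, 1: 40, 2: 30, 3: 20}

print(domain_number(counts))  # 1.6
```

Because the result is a weighted mean of levels, two domains with different numbers of articles or fact occurrences can still be compared directly in level units.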
In comparison, an analysis done by Beth Sundheim, using the features described earlier, shows that the MUC-5 EJV task is, by its nature, approximately twice as hard as the MUC-4 task [Sundheim 1993].</Paragraph> <Paragraph position="2"> Analysis of MUC-6 The domain used for MUC-6 consisted of articles regarding changes in corporate executive management personnel. As in the case of our analyses of the previous two MUCs, we selected a set of standard facts based on the official MUC-6 template. This set consisted of the following facts: (The full definition of each fact is not included here.) We analyzed the levels of the standard set of facts in the official MUC-6 test set by building the networks for the relevant articles in the test set (using the algorithm described earlier). This test set consisted of 100 articles, 56 of which were relevant. The six standard facts appeared 478 times in the relevant articles. The level distribution of each of these six facts is shown in Figure 11. The level distribution of these six facts combined is shown in Figure 12.</Paragraph> <Paragraph position="3"> We calculated the domain number for the MUC-6 domain to be 2.47 LUs. Our analysis therefore indicates that the MUC-6 domain is almost as hard as the MUC-5 EJV domain.</Paragraph> <Paragraph position="4"> Comparing the MUC Information Extraction Tasks Figure 13 shows the domain numbers for the MUC tasks that have been analyzed. For each of the MUCs, the figure also shows the highest P&amp;R F-Measure achieved by a system (for the information extraction task). Our analysis clearly separates the MUC-4 task from the later ones. The tasks for the later MUCs, however, have surprisingly similar complexity profiles: 20 to 30 percent level-1 facts, a substantially higher share of level-2 facts, and decreasing values for higher level facts. 
Given that the analysis has only been done for 4 tasks, we do not want to infer too much from the difference between the complexity of the MUC-5 task and those of the MUC-6 and MUC-7 tasks (which have roughly the same complexities).</Paragraph> </Section> </Section> </Paper>