<?xml version="1.0" standalone="yes"?>
<Paper uid="M98-1022">
  <Title>DESCRIPTION OF THE UPENN CAMP SYSTEM AS USED FOR COREFERENCE</Title>
  <Section position="1" start_page="0" end_page="3" type="metho">
    <SectionTitle>
DESCRIPTION OF THE UPENN CAMP SYSTEM AS USED
FOR COREFERENCE
</SectionTitle>
    <Paragraph position="0"> In this paper we present some advances made to the CAMP system since its inception for MUC-6. Although the infrastructure has been completely re-implemented, the architecture has remained fundamentally the same; consequently, we will focus on some of the advances we have made in our understanding of coreference and then discuss the performance of the system.</Paragraph>
    <Section position="1" start_page="0" end_page="3" type="sub_section">
      <SectionTitle>
Scoring Coreference Output
</SectionTitle>
      <Paragraph position="0"> Scoring the output of a system is an extremely important aspect of evaluating coreference algorithm performance.</Paragraph>
      <Paragraph position="1"> The score for a particular run is the single strongest measure of how well the system is performing, and it can strongly determine directions for further improvements. In this paper, we present several different scoring algorithms and detail their respective strengths and weaknesses for varying classes of processing. In particular, we describe and analyze the coreference scoring algorithm used to evaluate the coreference systems in the sixth Message Understanding Conference (MUC-6) [MUC-6, 95]. We also present two shortcomings of this algorithm. In addition, we present a new coreference scoring algorithm, our B-CUBED algorithm, which was designed to overcome the shortcomings of the MUC-6 algorithm.</Paragraph>
      <Paragraph position="2"> Scoring in MUC-6/7: Vilain et al.</Paragraph>
      <Paragraph position="3"> Prior to Vilain et al.'s coreference scoring algorithm [Vilain, 95] there had been a graph-based scoring algorithm (Sundheim et al.) which produced unintuitive results for even very simple cases. [Vilain, 95] substituted a model-theoretic scoring algorithm which produced very intuitive results for the type of scoring desired in MUC-6. This algorithm computes the recall error by taking each equivalence class S (defined by the links in the answer key) and determining the number of coreference links m that would have to be added to the response to place all the entities in S into the same equivalence class in the response.</Paragraph>
      <Paragraph position="4"> Recall error then is the sum of m's divided by the number of links in the key. Precision error is computed by reversing the roles of the answer key and the response.</Paragraph>
      <Paragraph position="5"> The full details of the algorithm are discussed next.</Paragraph>
      <Paragraph position="6"> The Model Theoretic Approach To The Vilain et al. Algorithm
In the description of the model-theoretic algorithm, the terms "key" and "response" are defined in the following way: key refers to the manually annotated coreference chains (the truth).</Paragraph>
      <Paragraph position="7"> response refers to the coreference chains output by a system.</Paragraph>
      <Paragraph position="8"> An equivalence set is the transitive closure of a coreference chain. The algorithm computes recall in the following way.</Paragraph>
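The equivalence classes discussed throughout this section can be built from the pairwise links of a coreference chain by taking the transitive closure. Below is a minimal sketch (not from the paper) of that step, assuming links are given as pairs of mention identifiers:

```python
# Build equivalence classes from pairwise coreference links by transitive
# closure, using a small union-find structure.
from collections import defaultdict

def equivalence_classes(links):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)           # union the two chains

    classes = defaultdict(set)
    for x in parent:
        classes[find(x)].add(x)
    return list(classes.values())

# For example, the key of Figure 1 given as chains of adjacent links:
key_links = [('1', '2'), ('2', '3'), ('3', '4'), ('4', '5'),
             ('6', '7'),
             ('8', '9'), ('9', 'A'), ('A', 'B'), ('B', 'C')]
print(equivalence_classes(key_links))
# [{'1','2','3','4','5'}, {'6','7'}, {'8','9','A','B','C'}] (order may vary)
```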
      <Paragraph position="9"> First, let S be an equivalence set generated by the key, and let R_1, ..., R_m be the equivalence classes generated by the response. Then we define the following functions over S:
- p(S) is a partition of S relative to the response. Each subset of S in the partition is formed by intersecting S and those response sets R_i that overlap S. Note that the equivalence classes defined by the response may include implicit singleton sets - these correspond to elements that are mentioned in the key but not in the response. For example, say the key generates the equivalence class S = {A, B, C, D}, and the response is simply the single link A-B. The relative partition p(S) is then {A, B}, {C}, and {D}.
- c(S) is the minimal number of "correct" links necessary to generate the equivalence class S. It is clear that c(S) is one less than the cardinality of S, i.e., c(S) = |S| - 1.
- m(S) is the number of links necessary to fully reunite any components of the p(S) partition. We note that this is simply one fewer than the number of elements in the partition, that is, m(S) = |p(S)| - 1.</Paragraph>
      <Paragraph position="15"> Looking in isolation at a single equivalence class in the key, the recall error for that class is just the number of missing links divided by the number of correct links, i.e., m(S) / c(S), so the recall for that class is (c(S) - m(S)) / c(S) = (|S| - |p(S)|) / (|S| - 1). Finally, extending this measure from a single key equivalence class to an entire set of key equivalence classes T simply requires summing over the key equivalence classes. That is, R_T = sum_i (|S_i| - |p(S_i)|) / sum_i (|S_i| - 1), where the S_i are the equivalence classes in T.</Paragraph>
      <Paragraph position="17"> For example, let the key contain 3 equivalence classes as shown in Figure 1, and suppose Figure 2 shows a response. From Figure 3(I), the partitions of the three equivalence classes in the truth, S_1, S_2, and S_3, with respect to the response, shown in Figure 3(II), are {1, 2, 3, 4, 5}, {6, 7}, and {8, 9, A, B, C} respectively. Using equation 2, the recall can now be calculated in the following way: R = [(5 - 1) + (2 - 1) + (5 - 1)] / [(5 - 1) + (2 - 1) + (5 - 1)] = 9/9 = 100%. Similarly, the partitions of the two equivalence classes in the response, R_1 and R_2, with respect to the key are {1, 2, 3, 4, 5} and [{6, 7}, {8, 9, A, B, C}] respectively (Figure 3(III)). The precision can now be calculated as: P = [(5 - 1) + (7 - 2)] / [(5 - 1) + (7 - 1)] = 9/10 = 90%.</Paragraph>
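To make the computation above concrete, here is a minimal sketch (not the MUC evaluation software) of the Vilain et al. scorer applied to the example key and response; the mention identifiers mirror Figures 1-3:

```python
# Vilain et al. (MUC) scorer: recall of `response` against `key`;
# swapping the arguments gives precision.
def muc_score(key, response):
    numerator = 0
    denominator = 0
    for s in key:
        # p(S): intersect S with each overlapping response class; key mentions
        # absent from every response class count as implicit singletons.
        parts = [s & r for r in response if s & r]
        covered = set().union(*parts) if parts else set()
        partition_size = len(parts) + len(s - covered)
        numerator += len(s) - partition_size     # c(S) - m(S)
        denominator += len(s) - 1                # c(S)
    return numerator / denominator if denominator else 0.0

key = [{'1', '2', '3', '4', '5'}, {'6', '7'}, {'8', '9', 'A', 'B', 'C'}]
response = [{'1', '2', '3', '4', '5'}, {'6', '7', '8', '9', 'A', 'B', 'C'}]
print(muc_score(key, response))   # recall    = 9/9  = 1.0
print(muc_score(response, key))   # precision = 9/10 = 0.9
```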
      <Paragraph position="19"> Shortcomings of the Vilain et al. Algorithm
Despite the advances of the model-theoretic scorer, it yields unintuitive results for some tasks. There are two main reasons.</Paragraph>
      <Paragraph position="20"> 1. The algorithm does not give any credit for separating out singletons (entities that occur in chains consisting of only one element, the entity itself) from other chains which have been identified. This follows from the convention in coreference annotation of not identifying those entities that are markable as possibly coreferent with other entities in the text. Rather, entities are only marked as being coreferent if they actually are coreferent with other entities in the text. This potential shortcoming could be easily enough overcome with different annotation conventions and with minor changes to the algorithm, but the decision to annotate singletons is a bit of a philosophical issue. On the one hand, singletons do form equivalence classes, and those equivalence classes are significant in that they are NOT coreferent with another phrase in the text, and they may play an important role in other equivalence classes outside the immediate text (as in cross-document coreference). On the other hand, if coreference is viewed as being about the relations between entities, then perhaps it makes little sense to annotate and score singletons.</Paragraph>
      <Paragraph position="21"> 2. All errors are considered to be equal. The MUC scoring algorithm penalizes the precision numbers equally for all types of errors. It is our position that, for certain tasks, some coreference errors do more damage than others.</Paragraph>
      <Paragraph position="22"> Consider the following examples: suppose the truth contains two large coreference chains and one small one (Figure 1), and suppose Figures 2 and 4 show two different responses. We will explore two different precision errors. The first error connects one of the large coreference chains with the small one (Figure 2). The second error occurs when the two large coreference chains are related by the errant coreferent link (Figure 4). It is our position that the second error is more damaging because, compared to the first error, the second error makes more entities coreferent that should not be. This distinction is not reflected in the [Vilain, 95] scorer, which scores both responses as having a precision score of 90% (Figure 6).</Paragraph>
      <Paragraph position="23"> Revisions to the Algorithm: Our B-CUBED Algorithm
Our B-CUBED algorithm was designed to overcome the two shortcomings of the Vilain et al. algorithm. Instead of looking at the links produced by a system, our algorithm looks at the presence/absence of entities relative to each of the other entities in the equivalence classes produced. Therefore, we compute the precision and recall numbers for each entity in the document, which are then combined to produce final precision and recall numbers for the entire output. The formal model-theoretic version of our algorithm is discussed in the next section.</Paragraph>
      <Paragraph position="25"> For an entity i, we define the precision and recall with respect to that entity in Figure 5.</Paragraph>
      <Paragraph position="26"> The final precision and recall numbers are computed by the following two formulae: Final Precision = sum_{i=1}^{N} w_i * Precision_i and Final Recall = sum_{i=1}^{N} w_i * Recall_i, where N is the number of entities in the document and w_i is the weight assigned to entity i in the document.</Paragraph>
      <Paragraph position="29"> It should be noted that the B-CUBED algorithm implicitly overcomes the first shortcoming of the Vilain et al. algorithm by calculating the precision and recall numbers for each entity in the document (irrespective of whether an entity is part of a coreference chain).</Paragraph>
      <Paragraph position="30"> Different weighting schemes produce different versions of the algorithm. The choice of the weighting scheme is determined by the task for which the algorithm is going to be used.</Paragraph>
      <Paragraph position="31"> When coreference (or cross-document coreference) is used for an information extraction task, where information about every entity in an equivalence class is important, the weighting scheme assigns equal weights to every entity i. For example, the weight assigned to each entity in Figure 1 is 1/12. As shown in Figure 6, the precision scores for the responses in Figures 2 and 4 are 16/21 (76%) and 7/12 (58%) respectively, using equal weights for all entities. Recall for both responses is 100%. It should be noted that the algorithm penalizes the precision numbers more for the error made in Figure 4 than for the one made in Figure 2. As evident from the two examples, this version of the B-CUBED algorithm (using equal weights for each entity) is a precision-oriented algorithm, i.e., it is sensitive to precision errors.</Paragraph>
      <Paragraph position="32"> But, for an information retrieval (IR) task, or a web search task, where a user is presented with classes of documents that pertain to the same entity, the weighting scheme assigns equal weights to each equivalence class. The weight for each entity within an equivalence class is computed by dividing the weight of the equivalence class by the number of entities in that class. Recall is calculated by assigning equal weights to each equivalence class in the truth, while precision is calculated by assigning equal weights to each equivalence class in the response. For example, in Figure 2, the weighting scheme assigns a weight of 1/10 to each entity in the first equivalence class, and a weight of 1/14 to each entity in the second equivalence class, when calculating precision. Using this weighting scheme, the precision scores for the responses in Figures 2 and 4 are 39/49 (approximately 80%) and 3/4 (75%) respectively. Comparing these numbers to the ones obtained by using the version of the algorithm which assigns equal weights to each entity, one can see that the current version is much less sensitive to precision errors. Although the current version of the algorithm does penalize the precision numbers for the error in Figure 4 more than for the error made in Figure 2, it is less severe than the earlier version.</Paragraph>
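A minimal sketch (not the authors' implementation) of B-CUBED precision under the two weighting schemes just described; recall can be obtained by reversing the roles of the key and the response, as noted in the model-theoretic section below:

```python
# B-CUBED precision with either equal weight per entity or equal weight per
# response equivalence class. Response entities absent from the key are
# treated as key singletons (an assumption; the paper's examples do not
# exercise this case).
def b_cubed_precision(key, response, weighting="entity"):
    key_of = {e: s for s in key for e in s}          # entity -> its key class
    n_entities = sum(len(r) for r in response)
    total = 0.0
    for r in response:
        for e in r:
            correct = len(r & key_of.get(e, {e}))    # correct entities in e's response class
            p_i = correct / len(r)                   # per-entity precision
            w = 1.0 / n_entities if weighting == "entity" else 1.0 / (len(response) * len(r))
            total += w * p_i
    return total

key = [{'1', '2', '3', '4', '5'}, {'6', '7'}, {'8', '9', 'A', 'B', 'C'}]
resp_fig2 = [{'1', '2', '3', '4', '5'}, {'6', '7', '8', '9', 'A', 'B', 'C'}]
resp_fig4 = [{'1', '2', '3', '4', '5', '8', '9', 'A', 'B', 'C'}, {'6', '7'}]
print(b_cubed_precision(key, resp_fig2))            # 16/21 ~ 0.76
print(b_cubed_precision(key, resp_fig4))            # 7/12  ~ 0.58
print(b_cubed_precision(key, resp_fig2, "class"))   # 39/49 ~ 0.80
print(b_cubed_precision(key, resp_fig4, "class"))   # 3/4   = 0.75
```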
      <Paragraph position="33"> The Model Theoretic Approach To The B-CUBED Algorithm
Let S be an equivalence set generated by the key, and let R_1, ..., R_m be the equivalence classes generated by the response. Then we define the following functions over S:
- p(S) is a partition of S with respect to the response, i.e., p(S) is the set of subsets of S formed by intersecting S with those response sets R_i that overlap S. For an entity j in S, let p_j(S) denote the subset of the partition that contains j; it is formed by intersecting S with the response class containing j and, hence, is a subset of S.
- m_j(S) is the number of entities missing from the response relative to entity j, i.e., m_j(S) = |S| - |p_j(S)|. Since the B-CUBED algorithm looks at the presence/absence of entities relative to each of the other entities, the number of missing entities in an entire equivalence set is calculated by adding the number of missing entities with respect to each entity in that equivalence set. Therefore, the number of missing entities for the entire set S is sum_{j in S} m_j(S) = sum_{j in S} (|S| - |p_j(S)|).</Paragraph>
      <Paragraph position="39"> The recall error with respect to an entity j is simply the number of missing entities divided by the number of entities in the equivalence class, i.e., m_j(S)/|S|. Therefore, the recall R_i for an equivalence class S_i is the average of the per-entity recalls, R_i = sum_{j in S_i} |p_j(S_i)| / |S_i|^2.</Paragraph>
      <Paragraph position="43"> The recall numbers calculated for each class can now be combined in various ways to produce the final recall. Different versions of the algorithm are obtained by using different combination strategies. If equal weights are assigned to each class, the version of the algorithm produced is exactly the same as the version of the informal algorithm which assigns equal weights to each class, as described in the previous section. In other words, the final recall is an average of the recall numbers for each equivalence class, i.e., Recall = (1/N) * sum_i R_i, where N is the number of equivalence classes in the key. To obtain the version of the informal algorithm which assigns equal weights to each entity, the final recall is computed by calculating the weighted average of the recall numbers for each equivalence class, where the weights are determined by the number of entities in each class, i.e., Recall = sum_i (|S_i| * R_i) / sum_i |S_i|.</Paragraph>
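The per-class recall and the two combination strategies above can be sketched directly (again an illustration under the stated definitions, not the system's scorer); class_recall computes R_i and b_cubed_recall combines the class recalls with equal or size-proportional class weights:

```python
# Model-theoretic B-CUBED recall: R_i = sum_{j in S_i} |p_j(S_i)| / |S_i|^2,
# combined either with equal class weights or with weights proportional to
# |S_i| (which reproduces the equal-entity-weight version).
def class_recall(s, response):
    resp_of = {e: r for r in response for e in r}
    overlap = sum(len(s & resp_of.get(j, {j})) for j in s)   # sum of |p_j(S)|
    return overlap / (len(s) ** 2)

def b_cubed_recall(key, response, equal_class_weights=True):
    recalls = [class_recall(s, response) for s in key]
    if equal_class_weights:
        return sum(recalls) / len(key)
    sizes = [len(s) for s in key]
    return sum(r * n for r, n in zip(recalls, sizes)) / sum(sizes)
```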
      <Paragraph position="47"> Finally, as in the case of the Vilain et al. algorithm, the precision numbers are calculated by reversing the roles of the key and the response in the above formulation.
Task Relative Strengths and Weaknesses of the Two Algorithms
The Vilain et al. algorithm is useful for applications/tasks that use single coreference relations at a time rather than the resulting equivalence classes. For our development in the coreference task, the two algorithms provide distinct perspectives on system performance. Vilain et al. provide a strong diagnostic for errors that reflect pairwise decisions made by the system. Our visual display techniques emphasize just this sort of processing.</Paragraph>
      <Paragraph position="48"> Our total score under the Vilain algorithm, with a somewhat fuzzier extent requirement and a stricter requirement for links, is 81% precision and 45% recall.</Paragraph>
      <Paragraph position="49"> The same files scored using the B3 algorithm resulted in 78% precision and 31% recall. The precision numbers are comparable, which indicates that our goal of high precision is supported under both views of the data. The 14% drop in recall was, however, unexpected. The reason is fairly straightforward: our system is not doing a good job of relating large equivalence classes. This is the converse of penalizing the system for positing incorrect links that result in larger equivalence classes rather than smaller ones.</Paragraph>
      <Paragraph position="50"> The drop in recall under the B3 scorer also suggests a distinct class of coreference resolution procedure that we could investigate: growing large equivalence classes via an entity merging model which eschews the standard left-to-right processing strategy of most coreference resolution systems. If such a procedure can reliably grow medium-sized equivalence classes into large ones, then the recall figures will improve under the B3 scorer. The Vilain et al. scorer notes no difference between correctly relating two singleton equivalence classes and correctly relating two large equivalence classes.</Paragraph>
      <Paragraph position="51"> Since large equivalence classes tend to include topically significant entities for documents, correctly identifying them is perhaps crucial to applications like summarization and information extraction.</Paragraph>
      <Paragraph position="52"> Developing with the Vilain et al. Algorithm
The analysis below reflects how we assessed the individual contributions of the components during development. Since the B3 algorithm was not yet implemented, we did not use it for development.</Paragraph>
      <Paragraph position="53"> Our explicit goal was to maximize recall at a precision level of 80%. We feel that this level of precision provides enough accuracy to drive a range of coreference-dependent applications; most important for us was query-sensitive text summarization. Our overall approach was to break down coreference resolution into concrete subprograms that each resolved a limited class of coreference well. Each component could be scored separately by either running it in isolation, or by blocking coreference from subsequent processes. Below we discuss each component in the order of execution.</Paragraph>
      <Paragraph position="54"> Genre Specific Coreference
A problematic aspect of any new genre of data is the existence of idiosyncratic classes of coreference, and the MUC-7 data was particularly troubling since very oddly formatted text was fair game for coreference. For example, the strings 'HUGHES' and 'FCC' in '<SLUG fv=tia-z> BC-HUGHES-FCC-BLOOM </SLUG>' are coreferent with the same strings in '<PREAMBLE>BC-HUGHES-FCC-BLOOM...', which was outside the scope of our linguistic tools. Simple programs were written to recognize this sort of coreference. The performance by the Vilain scorer is 4.2% recall and 67.5% precision.</Paragraph>
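A hedged sketch of the kind of 'simple program' described above: harvest the all-caps tokens from the SLUG line and link identical strings that reappear in the PREAMBLE. The regular expressions and tag handling are assumptions based on the example shown, not the system's actual code:

```python
import re

def slug_coreference(text):
    """Return (token, (start, end)) pairs linking SLUG tokens to PREAMBLE matches."""
    slug = re.search(r'<SLUG[^>]*>(.*?)</SLUG>', text, re.S)
    preamble = re.search(r'<PREAMBLE>(.*?)(?:</PREAMBLE>|$)', text, re.S)
    if not slug or not preamble:
        return []
    # Slug bodies look like "BC-HUGHES-FCC-BLOOM"; split on hyphens/whitespace.
    tokens = {t for t in re.split(r'[-\s]+', slug.group(1)) if t.isupper() and len(t) > 1}
    links = []
    for tok in tokens:
        for m in re.finditer(re.escape(tok), preamble.group(1)):
            links.append((tok, (preamble.start(1) + m.start(), preamble.start(1) + m.end())))
    return links
```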
      <Paragraph position="55"> This performance is well below what we observed in training data, where the precision was 85-90% for similarly sized collections. Perhaps part of the problem was that we never quite grasped why some, but not all, of these all-caps strings were annotated as coreferent.</Paragraph>
      <Paragraph position="56"> La Hack
La Hack is a carryover from our original MUC-6 system, and it is responsible for the identification of proper noun coreference. This component is indirectly helped by IBM's named entity tool 'Textract', which finds the extents of named entities in addition to assigning them properties like 'is person' and 'is company'. It is the foundation upon which our coreference annotation is built; mistakes here can be devastating for the rest of the system. In MUC-6, La Hack performed at 29% recall and 86% precision, but it fared somewhat worse in MUC-7, with 24.0% precision and 80.0% recall.</Paragraph>
      <Paragraph position="57"> We observed that the New York Times data had far less regular honorific use and corporate designator use than the MUC-6 corpus, which was based on the Wall Street Journal. As a result, there were fewer reliable indicators of proper names.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
Highly Syntactic Coreference
</SectionTitle>
      <Paragraph position="0"> This component asserts coreference between phrases that are in appositive relations or that are in predicate nominal relations. We were quite surprised at how poorly this component performed, since we expected performance to be above the 80% precision cutoff. Our actual performance is 3.3% precision and 64.0% recall.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
Quoted Speech
</SectionTitle>
      <Paragraph position="0"> Quoted speech has idiosyncratic patterns of use that are better solved outside the scope of our standard coreference resolution module. We expected performance to be above 90% precision and were pleased with 2.6% recall and 86.8% precision. This module is a good example of how the coreference problem can be fruitfully broken up into sub-parts of individually high precision.</Paragraph>
    </Section>
  </Section>
  <Section position="2" start_page="3" end_page="3" type="metho">
    <SectionTitle>
CogNIAC Proper Noun Resolution
</SectionTitle>
    <Paragraph position="0"> CogNIAC is the most general-purpose coreference resolution component of the system. It features a fairly sophisticated salience model and property confidence model to preorder the set of candidate antecedents. The importance of the preorder is that it allows ties between equally salient antecedents, and in the case of ties the anaphor is not resolved.</Paragraph>
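A hedged sketch of the tie-handling behavior just described: candidate antecedents are preordered by salience, and an anaphor is resolved only when a single compatible candidate outranks the rest. The Candidate fields here are illustrative, not CogNIAC's actual data structures:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    mention: str
    salience: float        # higher means more salient
    compatible: bool       # agrees with the anaphor's properties (gender, number, type)

def resolve(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the unique most-salient compatible antecedent, or None on a tie."""
    viable = sorted((c for c in candidates if c.compatible),
                    key=lambda c: c.salience, reverse=True)
    if not viable:
        return None
    if len(viable) > 1 and viable[0].salience == viable[1].salience:
        return None          # tie between equally salient antecedents: leave unresolved
    return viable[0]
```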
    <Paragraph position="1"> When deficiencies were noted with the output of La Hack, the simplest solution was to add a proper noun resolution component to CogNIAC. In the end this addition added a bit of recall, but with fairly low precision: 1.2% recall and 65.2% precision.</Paragraph>
  </Section>
  <Section position="3" start_page="3" end_page="3" type="metho">
    <SectionTitle>
CogNIAC Common Noun Resolution
</SectionTitle>
    <Paragraph position="0"> Common noun coreference is an important part of coreference, but it is very difficult to accurately resolve.</Paragraph>
    <Paragraph position="1"> Our MUC-6 system had fairly poor performance, with 10% recall and a precision of 48%. We were surprised by an increase in performance over the training data (78% precision), with 7.1% recall and 90.7% precision.</Paragraph>
    <Paragraph position="2"> Common noun anaphora is probably one of the most trying classes of coreference to annotate as a human. This is due to the many difficult judgment calls required on the part of the human judges, and this was reflected in the consistency of annotation in the training data. We found it challenging to develop on the training data because the system would find what we considered to be reasonable instances of coreference that the annotator had not made coreferent. We believe that common noun anaphora is a large source of inter-annotator disagreement.</Paragraph>
    <Paragraph position="3"> CogNIAC Pronouns
The pronominal system performed under our goal of 80% precision. In training, we found that we were constantly balancing the ability of pronouns to i) refer uniquely, and ii) refer to entities with the correct properties. We adopted a property confidence model that encouraged recall over precision. This meant that a proper noun like 'Mrs. Fields' would potentially be an antecedent both to feminine pronouns and to pronouns that referred to companies. A salience model was then applied to these overloaded entities, and pronominal resolution turned out to be a word-sense disambiguation problem in addition to a coreference resolution problem. Our performance was 4.5% recall and 70.0% precision.</Paragraph>
    <Paragraph position="4"> Conclusions
One of the stronger conclusions that we have come to regarding coreference is that there is an apparent linear trade-off between precision and recall, given the performance of other systems on the coreference task. Our suspicion is that the same can be said with the B3 scorer, but that will have to await experimentation. This is a positive result in itself because we can now choose from multiple types of coreference systems depending on our task. We consider high precision systems to be more useful for the types of systems that we build, but it has not been clear that high precision systems were possible.</Paragraph>
    <Paragraph position="5"> We also believe that the space of high precision 'contributors' to coreference is not exhausted. We doubt that there are any 10% recall / 80% precision subcomponents that we have not already explored, but there are certainly 1-5% recall opportunities. How well they will sum to the recall of the entire system is unknown, but there is room for improvement.</Paragraph>
  </Section>
</Paper>