<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1633"> <Title>BESTCUT: A Graph Algorithm for Coreference Resolution</Title> <Section position="4" start_page="275" end_page="279" type="metho"> <SectionTitle> 2 BESTCUT Coreference Resolution </SectionTitle> <Paragraph position="0"> For each entity type (PERSON, ORGANIZATION, LOCATION, FACILITY or GPE, as defined by (NIST, 2003)) we create a graph in which the nodes represent all the mentions of that type in the text, the edges correspond to all pairwise coreference relations, and the edge weights are the confidences of the coreference relations. We divide this graph repeatedly by cutting the links between subgraphs until a previously learned stop model tells us to stop cutting. The end result is a partition that approximates the correct division of the text into entities.</Paragraph> <Paragraph position="1"> We consider this graph approach to clustering a more accurate representation of the relations between mentions than a tree-based approach that treats only anaphora resolution, trying to connect mentions with candidate referents that appear earlier in the text. We believe that a correct resolution has to tackle cataphora resolution as well, by taking into account referents that appear in the text after the anaphors. Furthermore, we believe that a graph representation of mentions in a text is more adequate than a tree representation because the coreference relation is symmetrical in addition to being transitive. A greedy bottom-up approach does not make full use of this property. A graph-based clusterization starts with a complete overall view of all the connections between mentions, so local errors are much less likely to influence the correctness of the outcome. If two mentions are strongly connected, and one of them is strongly connected with a third, all three of them will most probably be clustered together even if the third edge is not strong enough, and this holds for any order in which the mentions appear in the text.</Paragraph> <Section position="1" start_page="275" end_page="275" type="sub_section"> <SectionTitle> 2.1 Learning Algorithm </SectionTitle> <Paragraph position="0"> The coreference confidence values that become the weights in the starting graphs are provided by a maximum entropy model, trained on the training datasets of the corpora used in our experiments. For maximum entropy classification we used a maxent tool. Based on the data seen, a maximum entropy model (Berger et al., 1996) offers an expression (1) for the probability that there exists coreference C between a mention m_i and a mention m_j:</Paragraph> <Paragraph position="1"> P(C | m_i, m_j) = exp( Σ_k λ_k g_k(m_i, m_j, C) ) / Z(m_i, m_j)   (1)</Paragraph> <Paragraph position="2"> where g_k(m_i, m_j, C) is a feature and λ_k is its weight; Z(m_i, m_j) is a normalizing factor.</Paragraph> <Paragraph position="3"> We created the training examples in the same way as (Luo et al., 2004), by pairing all mentions of the same type, obtaining their feature vectors and taking the outcome (coreferent/non-coreferent) from the key files.</Paragraph>
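As a concrete illustration of expression (1), the following minimal Python sketch computes the coreference probability for a binary outcome; the feature functions and weights shown are hypothetical stand-ins, not the model actually trained here.

```python
import math

def coref_probability(m_i, m_j, features, weights):
    """Conditional maxent probability of coreference C between two
    mentions (expression 1): the exponentiated weighted feature sum,
    normalized by Z(m_i, m_j) computed over both outcomes."""
    def unnormalized(outcome):
        return math.exp(sum(weights[name] * g(m_i, m_j, outcome)
                            for name, g in features.items()))
    z = unnormalized("coref") + unnormalized("non-coref")  # Z(m_i, m_j)
    return unnormalized("coref") / z

# Toy usage with a single hypothetical feature function:
features = {"head-match": lambda a, b, c: 1.0 if a == b and c == "coref" else 0.0}
weights = {"head-match": 2.0}
print(coref_probability("john", "john", features, weights))  # ~0.88
```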
</Section> <Section position="2" start_page="275" end_page="276" type="sub_section"> <SectionTitle> 2.2 Feature Representation </SectionTitle> <Paragraph position="0"> We duplicated the statistical model used by (Luo et al., 2004), with three differences. First, no feature combination was used, to prevent long running times on the large amount of ACE data. Second, through an analysis of the validation data, we implemented seven new features, presented in Table 1. Third, as opposed to (Luo et al., 2004), who quantized all numerical features, we translated each numerical feature into a set of binary features that express whether the value falls in certain intervals. This transformation was necessary because our maximum entropy tool performs better on binary features. (Luo et al., 2004)'s features are not reproduced here for lack of space; please refer to that paper for details.</Paragraph> <Paragraph position="1">
Table 1: The added features.
Category | Feature name | Feature description
lexical | head-match | true if the two heads are identical
lexical | type-pair | for each mention: its type if a name; NOUN if a noun; its spelling if a pronoun
lexical | name-alias | true if one mention is an alias of the other
syntactic | same-governing-category | true if both mentions are covered by the same type of node, e.g. NP, VP, PP
syntactic | path | the parse tree path from m_2 to m_1
syntactic | coll-comm | true if either mention collocates with a communication verb
grammatical | gn-agree | true if the two mentions agree in gender and number
</Paragraph> </Section> <Section position="3" start_page="276" end_page="278" type="sub_section"> <SectionTitle> 2.3 Clusterization Method: BESTCUT </SectionTitle> <Paragraph position="0"> We start with five initial graphs, one for each entity type, each containing all the mentions of that type and their weighted connections. This initial division is correct because no mentions of different entity types corefer. Furthermore, this division avoids unnecessary confusion in the program's decisions and decreases its running time. Each of these initial graphs is cut repeatedly until the resulting partition is satisfactory. In each cut, we eliminate from the graph the edges between subgraphs that have a very weak connection, and whose mentions are most likely not part of the same entity.</Paragraph> <Paragraph position="1"> Formally, the graph model can be defined as follows. Let M = {m_i : i = 1..n} be the n mentions in the document and E = {e_j : j = 1..m} be the m entities. Let g : M → E be the map from a mention m_i ∈ M to an entity e_j ∈ E. Let c : M × M → [0,1] be the confidence the learning algorithm attaches to the coreference between two mentions m_i, m_j ∈ M. Let T = {t_k : k = 1..p} be the set of entity types or classes. Then we attach to each entity class t_k an undirected, edge-weighted graph G_k(V_k, E_k), where V_k = {m_i | g(m_i).type = t_k} and E_k = {(m_i, m_j, c(m_i, m_j)) | m_i, m_j ∈ V_k}.</Paragraph> <Paragraph position="2"> The partitioning of the graph is based at each step on the cut weight. As a starting point, we used the Min-Cut algorithm, presented and proved correct in (Stoer and Wagner, 1994). In this simple and efficient method, the weight of a cut of a graph into two subgraphs is the sum of the weights of the edges crossing the cut, and the partition that minimizes the cut weight is the one chosen. The main procedure of the algorithm computes cuts-of-the-phase repeatedly and selects the one with the minimum cut value (cut weight). We adapted this algorithm to our coreference situation. To decide the minimum cut (from here on called the BESTCUT), we use as cut weight the number of mentions that are correctly placed in their set. The method for calculating this correctness score is presented in Figure 1.</Paragraph>
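Figure 1 is not reproduced in this extraction. The sketch below is one plausible reading of the correctness score, inferred from the worked example in Section 2.4 (corrects-avg and corrects-max are averaged, with 0.5 as the default weight of a single mention); the edge-weight dictionary keyed by mention pairs is an assumed representation, not the authors' data structure.

```python
def correctness_score(weights, side_a, side_b, default=0.5):
    """Count mentions "correctly placed" by a proposed cut (side_a, side_b):
    a mention counts as correct when its edges crossing the cut stay below
    the 0.5 default weight of a single mention. An average-based and a
    max-based count are computed and averaged, matching the
    3.5 = (4 + 3) / 2 score in the worked example of Section 2.4.
    `weights` maps frozenset({m, n}) pairs to confidence values."""
    corrects_avg = corrects_max = 0
    for side, other in ((side_a, side_b), (side_b, side_a)):
        for m in side:
            crossing = [weights[frozenset((m, n))] for n in other
                        if frozenset((m, n)) in weights]
            if not crossing or sum(crossing) / len(crossing) < default:
                corrects_avg += 1
            if not crossing or max(crossing) < default:
                corrects_max += 1
    return (corrects_avg + corrects_max) / 2.0
```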
<Paragraph position="3"> The BESTCUT at one stage is the cut-of-the-phase with the highest correctness score.</Paragraph> <Paragraph position="4"> An additional learning model was trained to decide whether cutting a set of mentions is better than keeping the mentions together. The model was optimized to maximize the ECM-F score. We denote by S the larger part of the cut and by T the smaller one. C.E is the set of edges crossing the cut, and G is the current graph before the cut. S.V and T.V are the sets of vertexes in S and in T, respectively; S.E is the set of edges from S, while T.E is the set of edges from T. The features for stopping the cut are presented in Table 2. The model was trained using 10-fold cross-validation on the training set. In order to learn when to stop the cut, we generated a list of positive and negative examples from the training files. Each training example is associated with a certain cut (S, T). Since we want to learn a stop function, the positive examples must describe situations when the cut must not be performed, and the negative examples situations when the cut must be performed. Let us consider that the list of entities from a text is E = {e_j : j = 1..m}, with e_j = {m_i1, m_i2, ..., m_ik} the list of mentions that refer to e_j. We generated a negative example for each pair (S = {e_i}, T = {e_j}) with i ≠ j: each entity must be separated from any other entity. We also generated negative examples for all pairs (S = {e_i}, T = E \ S): each entity must be separated from all the other entities considered together. To generate positive examples, we simulated the cut on a graph corresponding to a single entity e_j: every partial cut of the mentions of e_j was considered a positive example for our stop model.</Paragraph> <Paragraph position="5">
Table 2: The features for stopping the cut.
Feature name | Feature description
st-ratio | |S.V| / |T.V| - the ratio between the cut parts
ce-ratio | |C.E| / |G.E| - the proportion of the cut from the entire graph
c-min | min(C.E) - the smallest edge crossing the cut
c-max | max(C.E) - the largest edge crossing the cut
c-avg | avg(C.E) - the average of the edges crossing the cut
c-hmean | hmean(C.E) - the harmonic mean of the edges crossing the cut
c-hmeax | hmeax(C.E) - a variant of the harmonic mean: hmeax(C.E) = 1 - hmean(C.E'), where each edge in E' has weight equal to 1 minus the corresponding edge in E
lt-c-avg-ratio | how many edges from the cut are less than the average of the cut (as a ratio)
lt-c-hmean-ratio | how many edges from the cut are less than the harmonic mean of the cut (as a ratio)
st-avg | avg(S.E + T.E) - the average of the edges from the graph when the edges from the cut are not considered
g-avg | avg(G.E) - the average of the edges from the graph
st-wrong-avg-ratio | how many vertexes are in the wrong part of the cut, using the average measure for "wrong" (as a ratio)
st-wrong-max-ratio | how many vertexes are in the wrong part of the cut, using the max measure for "wrong" (as a ratio)
r1/r2 | r1 is the ratio of the edges from C.E that are smaller than the average of the cut; r2 is the ratio of the edges from S.E + T.E that are smaller than the average of the cut
g-avg > st-avg | 1 if avg(G.E) > avg(S.E + T.E), and 0 otherwise
</Paragraph>
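A sketch of how a few of the Table 2 features could be computed from raw vertex and edge-weight lists follows; it assumes edge weights strictly between 0 and 1 and is only meant to make the notation concrete.

```python
from statistics import harmonic_mean  # Python 3.6+

def stop_features(S_V, T_V, S_E, T_E, C_E, G_E):
    """A few of the Table 2 stop-model features for a proposed cut (S, T).
    The *_V arguments are vertex lists; the *_E arguments are lists of
    edge weights, assumed to lie strictly between 0 and 1."""
    feats = {
        "st-ratio": len(S_V) / len(T_V),
        "ce-ratio": len(C_E) / len(G_E),
        "c-min": min(C_E),
        "c-max": max(C_E),
        "c-avg": sum(C_E) / len(C_E),
        "c-hmean": harmonic_mean(C_E),
        # hmeax: 1 minus the harmonic mean of the complemented weights
        "c-hmeax": 1 - harmonic_mean([1 - w for w in C_E]),
        "st-avg": sum(S_E + T_E) / len(S_E + T_E),
        "g-avg": sum(G_E) / len(G_E),
    }
    feats["lt-c-avg-ratio"] = sum(w < feats["c-avg"] for w in C_E) / len(C_E)
    feats["g-avg>st-avg"] = int(feats["g-avg"] > feats["st-avg"])
    return feats
```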
<Paragraph position="6"> We chose not to include pronouns in the initial BESTCUT graphs. Since most features are oriented towards Named Entities and common nouns, the learning algorithm (maxent) links pronouns with very high probability to many possible antecedents, not all of which are in the same chain. In the clusterization phase, the pronouns would therefore act as a bridge between different entities that should not be linked. To prevent this, we resolved the pronouns separately, at the end of the clusterization.</Paragraph> <Paragraph position="7"> The BESTCUT algorithm (Figure 2) has as input a weighted graph with a vertex for each mention considered, and outputs the list of entities created. In each stage, a cut is proposed for every subgraph in the queue. If StopTheCut decides that the cut must be performed on the subgraph, the two sides of the cut are added to the queue (lines 10-11); if the graph is well connected and breaking it in two would be harmful, the current graph is used to create a single entity (line 8). The algorithm ends when the queue becomes empty. ProposeCut (Figure 3) returns a cut of the graph obtained with an algorithm similar to the Min-Cut procedure MinimumCut. The differences between our algorithm and the Min-Cut procedure are that the most tightly connected vertex z in each step of the ProposeCutPhase procedure is found using expression (2):
z = argmax_{z ∉ A} Σ_{v ∈ A} c(v, z)   (2)
and that a lighter test function uses the correctness score presented before: the partial cut with the larger correctness score is better. The ProposeCutPhase function is presented in Figure 4.</Paragraph>
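Since Figures 2-4 are not reproduced in this extraction, the following minimal sketch captures the queue-driven control flow just described; ProposeCut, StopTheCut and the graph interface are assumptions standing in for the procedures of the figures and the trained stop model.

```python
from collections import deque

def bestcut(graph, propose_cut, stop_the_cut):
    """The queue-driven BESTCUT loop described above. `graph` and the
    subgraphs returned by propose_cut are assumed to expose a `mentions`
    attribute; propose_cut and stop_the_cut stand in for the procedures
    of Figures 2-4 and the SVM stop model."""
    entities, queue = [], deque([graph])
    while queue:
        g = queue.popleft()
        s, t = propose_cut(g)          # best cut-of-the-phase for g
        if stop_the_cut(g, s, t):      # well connected: keep as one entity
            entities.append(g.mentions)
        else:
            queue.extend((s, t))       # perform the cut, recurse on parts
    return entities
```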
</Section> <Section position="4" start_page="278" end_page="279" type="sub_section"> <SectionTitle> 2.4 An Example </SectionTitle> <Paragraph position="0"> Let us consider an example of how the BESTCUT algorithm works on two simple sentences (Figure 5): "Mary1 has a brother2, John3. The boy4 is older than the girl5." The entities present in this example are {Mary1, the girl5} and {a brother2, John3, The boy4}. Since they are all PERSONs, the algorithm works on a single graph. The initial graph is illustrated in Figure 6, with the coreference relation marked through a different coloring of the nodes; each node number corresponds to the mention with the same index in the example. The strongest confidence score is between a brother2 and John3, because they are connected through an apposition relation. The graph was simplified by eliminating the edges that have an insignificant weight, e.g. the edges between John3 and the girl5 or between Mary1 and a brother2.</Paragraph> <Paragraph position="1"> Function BESTCUT starts with the whole graph. The first cut of the phase, obtained by function ProposeCutPhase, is the one in Figure 7.a. This cut separates node 2 from the rest of the graph. In calculating the score of the cut (using the algorithm from Figure 1), we obtain an average number of three correctly placed mentions. This can be verified intuitively on the drawing: mentions 1, 2 and 5 are correctly placed, while 3 and 4 are not. The score of this cut is therefore 3. The second, third and fourth cuts of the phase, in Figures 7.b, 7.c and 7.d, have the scores 4, 5 and 3.5 respectively. An interesting thing to note about the fourth cut is that its score is no longer an integer. This happens because it is calculated as an average between corrects-avg = 4 and corrects-max = 3: the two methods disagree about the placement of mention 1. The average of the outgoing weights of mention 1 is 0.225, less than 0.5 (the default weight assigned to a single mention), so the first method declares it correctly placed. The second method considers only the maximum; 0.6 is greater than 0.5, so the mention appears more strongly connected with the outside than the inside. The contradiction arises from the uneven distribution of the weights of the outgoing edges.</Paragraph> <Paragraph position="2"> The first proposed cut is the cut-of-the-phase with the greatest correctness score; any further cut will be ignored, because the machine learning algorithm that was trained to decide when to stop a cut will always declare against further cuts. In the end, the cut returned by function BESTCUT is the correct one: it divides mentions Mary1 and the girl5 from mentions a brother2, John3 and The boy4.</Paragraph> </Section> </Section> <Section position="5" start_page="279" end_page="280" type="metho"> <SectionTitle> 3 Mention Detection </SectionTitle> <Paragraph position="0"> Because our BESTCUT algorithm relies heavily on knowing entity types, we developed a method for recognizing entity types for nominal mentions. Our statistical approach uses maximum entropy classification with a few simple lexical and syntactic features, making extensive use of WordNet (Fellbaum, 1998) hierarchy information. We used the ACE corpus, which is annotated with mention and entity information, as data in a supervised machine learning method to detect nominal mentions and their entity types. We assigned six entity types: PERSON, ORGANIZATION, LOCATION, FACILITY, GPE and UNK (for mentions that fall in none of the former categories) and two genericity outcomes: GENERIC and SPECIFIC. We only considered the intended value of the mentions from the corpus. This is motivated by the fact that we need to classify mentions according to the context in which they appear, not in a general way; only contextual information is useful further on in coreference resolution. We discovered experimentally that word sense disambiguation improves the performance tremendously (a boost in score of 10%), so all the features use the word senses from a previously applied word sense disambiguation program, taken from (Mihalcea and Csomai, 2005).</Paragraph> <Paragraph position="1"> For creating training instances, we associated an outcome with each markable (NP) detected in the training files: the markables that were present in the key files took their outcome from the key file annotation, while all the other markables were associated with the outcome UNK. We then created a training example for each of the markables, with the feature vector described below and the outcome as target function. This outcome can be of one of three types: the entity type (one member of the set PERSON, ORGANIZATION, LOCATION, FACILITY, GPE and UNK), the genericity information (GENERIC or SPECIFIC), or a combination of the two (pairwise combinations of the entity type set and the genericity set, e.g. PERSON SPECIFIC).</Paragraph>
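A minimal sketch of this instance-creation step follows; the extract_features helper and the key_types map from annotated spans to entity types are hypothetical names, not part of the paper.

```python
def make_training_instances(markables, key_types, extract_features):
    """Pair each detected markable with its outcome: the entity type from
    the key annotation when present, UNK otherwise. `markables` is a list
    of NP spans, `key_types` maps annotated spans to entity types, and
    `extract_features` is the (hypothetical) feature extractor."""
    instances = []
    for np in markables:
        outcome = key_types.get(np, "UNK")
        instances.append((extract_features(np), outcome))
    return instances
```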
<Paragraph position="2"> The feature set consists of WordNet features, lexical features, syntactic features and intelligent context features, briefly described in Table 3. With the WordNet features we introduce the WordNet equivalent concept. A WordNet equivalent concept for an entity type is a word-sense pair from WordNet whose gloss is compatible with the definition of that entity type. Figure 8 enumerates a few WordNet equivalent concepts for entity class PERSON (e.g. CHARACTER#1), with their hierarchy of hyponyms (e.g. Frankenstein#2). The lexical feature is useful because some words are almost always of a certain type (e.g. "company"). The intelligent context features are an improvement on basic context features, which use the stems of the words within a window of a certain size around the word. In addition to this set of features, we created more features by combining them into pairs, each pair containing two features from two different classes. For instance, we will have features like is-a-PERSON ∧ in-apposition-with(PERSON).</Paragraph> <Paragraph position="3">
Table 3: Features for mention detection.
Category | Feature name | Feature description
WordNet | is-a-TYPE | true if the mention is of entity type TYPE; five features
WordNet | WN-eq-concept-hyp | true if the mention is in the hyponym set of WN-eq-concept; 42 features
WordNet | WN-eq-concept-syn | true if the mention is in the synonym set of WN-eq-concept; 42 features
lexical | stem-sense | pair of the stem of the word and the WN sense of the word given by the WSD
syntactic | pos | part of speech of the word given by the POS tagger
syntactic | is-modifier | true if the mention is a modifier in another noun phrase
syntactic | modifier-to-TYPE | true if the mention is a modifier to a TYPE mention
syntactic | in-apposition-with | TYPE of the mention our mention is in apposition with
intelligent context | all-mods | the nominal, adjectival and pronominal modifiers in the mention's parse tree
intelligent context | preps | the prepositions right before and after the mention's parse tree
</Paragraph> <Paragraph position="4"> All these features apply to the "true head" of a noun phrase: if the noun phrase is a partitive construction ("five students", "a lot of companies", "a part of the country"), we extract the "true head", the whole entity that the part was taken out of ("students", "companies", "country"), and apply the features to that "true head" instead of the partitive head.</Paragraph> <Paragraph position="5"> For combining the mention detection module with the BESTCUT coreference resolver, we also generated classifications for Named Entities and pronouns, using the same set of features minus the WordNet ones (which apply only to nominal mentions). For the Named Entity classifier, we added the feature Named-Entity-type as obtained by the Named Entity Recognizer. We generated a list of all the markable mentions and their entity types and presented it as input to the BESTCUT resolver instead of the list of perfect mentions. Note that this mention detection does not contain complete anaphoricity information: only the mentions that are part of the five considered classes are treated as anaphoric and clustered, while the UNK mentions are ignored, even if an outside anaphoricity classifier might categorize some of them as anaphoric.</Paragraph>
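To make the WordNet features concrete, here is a hedged sketch of the is-a-TYPE test using NLTK's WordNet interface; the equivalent-concept table below is hypothetical (the paper's Figure 8 lists concepts such as CHARACTER#1 for PERSON), and the sense index is assumed to come from the WSD step.

```python
from nltk.corpus import wordnet as wn

# Hypothetical equivalent-concept table; Figure 8 of the paper lists
# concepts such as CHARACTER#1 for entity class PERSON.
EQ_CONCEPTS = {
    "PERSON": [wn.synset("person.n.01"), wn.synset("character.n.01")],
}

def is_a_type(head, sense, entity_type):
    """is-a-TYPE: true if the disambiguated head noun is (a hyponym of)
    one of the entity type's WordNet equivalent concepts."""
    synset = wn.synsets(head, pos=wn.NOUN)[sense - 1]
    ancestors = set(synset.closure(lambda s: s.hypernyms()))
    return any(eq == synset or eq in ancestors
               for eq in EQ_CONCEPTS.get(entity_type, []))

print(is_a_type("girl", 1, "PERSON"))  # True: girl#1 is a hyponym of person#1
```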
</Section> <Section position="6" start_page="280" end_page="281" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> The clusterization algorithms that we implemented for comparison with our method are (Luo et al., 2004)'s Belltree and Link-Best (best-first clusterization) from (Ng and Cardie, 2002). The features used were described in Section 2.2. We experimented on the ACE Phase 2 (NIST, 2003) and MUC6 (MUC-6, 1995) corpora. Since we aimed to measure the performance of coreference, the metrics used for evaluation are the ECM-F (Luo et al., 2004) and the MUC P, R and F scores (Vilain et al., 1995).</Paragraph> <Paragraph position="1"> In our first experiment, we tested the three coreference clusterization algorithms on the development-test set of the ACE Phase 2 corpus, first on true mentions (i.e. the mentions annotated in the key files), then on detected mentions (i.e. the mentions output by our mention detection system presented in Section 3) and finally without any prior knowledge of the mention types. The results are tabulated in Table 4 (the learning algorithms are maxent for coreference and SVM for stopping the cut in BESTCUT; in turn, we obtain the mentions from the key files, detect them with our mention detection algorithm, or use no information about them). As can be observed, when it has prior knowledge of the mention types, BESTCUT performs significantly better than the other two systems in the ECM-F score and slightly better in the MUC metrics. The more knowledge it has about the mentions, the better it performs. This is consistent with the fact that the first stage of the algorithm divides the graph into subgraphs corresponding to the five entity types. If BESTCUT has no information about the mentions, its performance ranks significantly under the Link-Best and Belltree algorithms in ECM-F and MUC R. Surprisingly, the Belltree algorithm, a globally optimized algorithm, performs similarly to Link-Best in most of the scores.</Paragraph> <Paragraph position="2"> Despite not being as dramatically affected as BESTCUT, the other two algorithms also decrease in performance as the available mention information decreases, which empirically proves that mention detection is a very important module for coreference resolution. Even with an F-score of 77.2% for detecting entity types, our mention detection system boosts the scores of all three algorithms compared to the case where no information is available.</Paragraph> <Paragraph position="3"> It is apparent that the MUC score does not vary significantly between systems. This only shows that none of them is particularly poor, but it is not a relevant way of comparing methods: the MUC metric has been found too indulgent by researchers ((Luo et al., 2004), (Baldwin et al., 1998)). The MUC scorer counts the common links between the annotation keys and the system output, while the ECM-F metric aligns the detected entities with the key entities so that the number of common mentions is maximized.</Paragraph>
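For reference, the link-based MUC recall of (Vilain et al., 1995) can be sketched as follows; precision is obtained by swapping the key and response partitions, and the entity sets shown in the usage example are illustrative only.

```python
def muc_recall(key, response):
    """Link-based MUC recall (Vilain et al., 1995): for each key entity S,
    |S| - |p(S)| links are recovered, where p(S) is the partition of S
    induced by the response entities."""
    num = den = 0
    for entity in key:
        partitions = set()
        for mention in entity:
            owner = next((i for i, resp in enumerate(response)
                          if mention in resp), None)
            # an unresolved mention forms its own singleton partition
            partitions.add(("solo", mention) if owner is None else owner)
        num += len(entity) - len(partitions)
        den += len(entity) - 1
    return num / den if den else 0.0

key = [{"Mary1", "girl5"}, {"brother2", "John3", "boy4"}]
resp = [{"Mary1", "girl5"}, {"brother2", "John3"}, {"boy4"}]
print(muc_recall(key, resp))  # (1 + 1) / (1 + 2) = 0.67
```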
The ECM-F scorer overcomes two shortcomings of the MUC scorer: not considering single mentions and treating every error as equally important (Baldwin et al., 1998), which makes the ECM-F a more adequate measure of coreference.</Paragraph> <Paragraph position="6"> Our second experiment evaluates the impact that the different categories of our added features have on the performance of the BESTCUT system. The experiment was performed with a max-ent classifier on the MUC6 corpus, which was priorly converted into ACE format, and employed mention information from the key annotations.</Paragraph> <Paragraph position="7"> CUT on MUC6. Baseline system has the (Luo et al., 2004) features. The system was tested on key mentions.</Paragraph> <Paragraph position="8"> From Table 5 we can observe that the lexical features (head-match, type-pair, name-alias) have the most influence on the ECM-F and MUC scores, succeeded by the syntactic features (samegoverning-category, path, coll-comm). Despite what intuition suggests, the improvement the grammatical feature gn-agree brings to the system is very small.</Paragraph> </Section> <Section position="8" start_page="281" end_page="282" type="metho"> <SectionTitle> 5 Related Work </SectionTitle> <Paragraph position="0"> It is of interest to discuss why our implementation of the Belltree system (Luo et al., 2004) is comparable in performance to Link-Best (Ng and Cardie, 2002). (Luo et al., 2004) do the clusterization through a beam-search in the Bell tree using either a mention-pair or an entity-mention model, the first one performing better in their experiments. Despite the fact that the Bell tree is a complete representation of the search space, the search in it is optimized for size and time, while potentially losing optimal solutions- similarly to a Greedy search. Moreover, the fact that the two implementations are comparable is not inconceivable once we consider that (Luo et al., 2004) never compared their system to another coreference resolver and reported their competitive results on true mentions only.</Paragraph> <Paragraph position="1"> (Ng, 2005) treats coreference resolution as a problem of ranking candidate partitions generated by a set of coreference systems. The overall performance of the system is limited by the performance of its best component. The main difference between this approach and ours is that (Ng, 2005)'s approach takes coreference resolution one step further, by comparing the results of multiple systems, while our system is a single resolver; furthermore, he emphasizes the global optimization of ranking clusters obtained locally, whereas our focus is on globally optimizing the clusterization method inside the resolver.</Paragraph> <Paragraph position="2"> (DaumeIII and Marcu, 2005a) use the Learning as Search Optimization framework to take into account the non-locality behavior of the coreference features. In addition, the researchers treat mention detection and coreference resolution as a joint problem, rather than a pipeline approach like we do. By doing so, it may be easier to detect the entity type of a mention once we have additional clues (expressed in terms of coreference features) about its possible antecedents. For example, labeling Washington as a PERSON is more probable after encountering George Washington previously in the text. However, the coreference problem does not immediately benefit from the joining.</Paragraph> </Section> class="xml-element"></Paper>