<?xml version="1.0" standalone="yes"?> <Paper uid="W98-0203"> <Title>Coreference as the Foundations for Link Analysis over Free Text Databases</Title> <Section position="5" start_page="0" end_page="19" type="metho"> <SectionTitle> 3 Within Document Coreference </SectionTitle> <Paragraph position="0"> We have been developing the within document coreference component of CAMP since 1995, when the system was developed to participate in the Sixth Message Understanding Conference (MUC-6) coreference task. Below we illustrate the classes of coreference that the system annotates.</Paragraph>
<Paragraph position="1"> Coreference breaks down into several readily identified areas based on the form of the phrase being resolved and the method of calculating coreference.</Paragraph>
<Paragraph position="2"> We will proceed in the approximate order of the system's execution of components. A more detailed analysis of the classes of coreference can be found in (Bagga, 98a).</Paragraph>
<Section position="1" start_page="0" end_page="19" type="sub_section"> <SectionTitle> 3.1 Highly Syntactic Coreference </SectionTitle> <Paragraph position="0"> There are several readily identified syntactic constructions that reliably indicate coreference. First are appositive relations, as hold between 'John Smith' and 'chairman of General Electric' in: John Smith, chairman of General Electric, resigned yesterday.</Paragraph>
<Paragraph position="1"> Identifying this class of coreference requires some syntactic knowledge of the text and an analysis of the properties of the individual phrases, to avoid finding coreference in examples like: John Smith, 47, resigned yesterday.</Paragraph>
<Paragraph position="2"> Smith, Jones, Woodhouse and Fife announced a new partner.</Paragraph>
<Paragraph position="3"> To avoid these sorts of errors we apply a mutual exclusion test to such candidate coreference links to prevent nonsensical annotations.</Paragraph>
<Paragraph position="4"> Another class of highly syntactic coreference exists in the form of predicate nominal constructions, as between 'John' and 'the finest juggler in the world' in: John is the finest juggler in the world.</Paragraph>
<Paragraph position="5"> As in the appositive case, mutual exclusion tests are required to prevent incorrect resolutions in examples like: John is tall.</Paragraph>
<Paragraph position="6"> They are blue.</Paragraph>
<Paragraph position="7"> These classes of highly syntactic coreference can play a very important role in bridging phrases that we would normally be unable to relate. For example, it is unlikely that our software would be able to relate the coreferent noun phrases in a text like: The finest juggler in the world visited Philadelphia this week. John Smith pleased crowds every night in the Annenberg theater.</Paragraph>
<Paragraph position="8"> This is because we do not have sufficiently sophisticated knowledge sources to determine that jugglers are very likely to be in the business of pleasing crowds. But the recognition of the predicate nominal will allow us to connect a chain of 'John Smith', 'Mr. Smith', 'he' with a chain of 'the finest juggler in the world', 'the juggler' and 'a juggling expert'.</Paragraph> </Section>
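To make the mutual exclusion tests above concrete, here is a minimal sketch in Python (hypothetical code, not CAMP's actual implementation) of a filter that rejects the age and name-list readings while accepting true appositives:

    import re

    def is_age_or_number(phrase):
        # '47' in "John Smith, 47, resigned yesterday." is an age, not a description.
        return re.fullmatch(r"\d{1,3}", phrase.strip()) is not None

    def looks_like_proper_name(phrase):
        # Crude capitalization test: "Jones" passes, "chairman of General Electric" does not.
        words = phrase.split()
        return bool(words) and words[0][0].isupper() and len(words) <= 2

    def plausible_appositive(np1, np2):
        # Mutual-exclusion test for a candidate appositive 'NP1, NP2, ...'.
        if is_age_or_number(np2):
            return False   # "John Smith, 47, ..." : no coreference
        if looks_like_proper_name(np2):
            return False   # "Smith, Jones, Woodhouse and Fife ..." : a conjoined list
        return True

    print(plausible_appositive("John Smith", "chairman of General Electric"))  # True
    print(plausible_appositive("John Smith", "47"))                            # False
    print(plausible_appositive("Smith", "Jones"))                              # False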
<Section position="2" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 3.2 Proper Noun Coreference </SectionTitle> <Paragraph position="0"> Names of people, places, products and companies are referred to in many different variations. In journalistic prose, the full name of an entity typically appears once, and throughout the rest of the article there are elided references to the same entity. Some name variations are: * Mr. James Dabah <- James <- Jim <- Dabah * Minnesota Mining and Manufacturing <- 3M Corp. <- 3M * Washington D.C. <- WASHINGTON <- Washington <- D.C. <- Wash. * New York <- New York City <- NYC <- N.Y.C. This class of coreference forms a solid foundation over which we resolve the remaining coreference in the document. One reason for this is that we learn important properties about the phrases by virtue of the coreference resolution. For example, we may not know whether 'Dabah' is a person name, male name, female name, company or place, but upon resolution with 'Mr. James Dabah' we know that it refers to a male person.</Paragraph>
<Paragraph position="1"> We resolve such coreferences with partial string matching subroutines coupled with lists of honorifics, corporate designators and acronyms. A substantial problem in resolving these names is avoiding overgeneration, like relating 'Washington' the place with the name 'Consuela Washington'. We control the string matching with a range of salience functions and restrictions on the kinds of partial string matches we are willing to tolerate.</Paragraph> </Section>
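As an illustration of the kind of controlled partial string matching involved, here is a hypothetical sketch (the lists shown are illustrative stubs, not CAMP's actual resources):

    HONORIFICS = {"mr.", "mrs.", "ms.", "dr."}
    CORP_DESIGNATORS = {"corp.", "inc.", "co.", "ltd."}

    def content_tokens(name):
        # Lowercase the name and strip honorifics and corporate designators.
        return [t for t in name.lower().split()
                if t not in HONORIFICS and t not in CORP_DESIGNATORS]

    def names_may_corefer(longer_name, shorter_name):
        # Partial-match test: every token of the shorter name must appear in
        # the longer one ('Mr. James Dabah' <- 'Dabah'). Acronym pairs such as
        # 'Minnesota Mining and Manufacturing' <- '3M' need separate acronym lists.
        longer = set(content_tokens(longer_name))
        shorter = set(content_tokens(shorter_name))
        return bool(shorter) and shorter.issubset(longer)

    print(names_may_corefer("Mr. James Dabah", "Dabah"))           # True
    print(names_may_corefer("Washington D.C.", "Wash."))           # False: needs abbreviation handling
    # Note: "Consuela Washington" <- "Washington" also passes this test, which is
    # exactly the overgeneration that the salience and type restrictions must block.
    print(names_may_corefer("Consuela Washington", "Washington"))  # True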
<Section position="3" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 3.3 Common Noun Coreference </SectionTitle> <Paragraph position="0"> A very challenging area of coreference annotation involves coreference between common nouns like 'a shady stock deal' and 'the deal'. Fundamentally the problem is that even very conservative approaches based on exact and partial string matching overgenerate badly.</Paragraph>
<Paragraph position="1"> Some examples of actual chains are: * his dad's trophies <- those trophies * those words <- the last words * the risk <- the potential risk * its accident investigation <- the investigation We have adopted a range of matching heuristics and salience strategies to try to recognize a small, but accurate, subset of these coreferences.</Paragraph> </Section>
<Section position="4" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 3.4 Pronoun Coreference </SectionTitle> <Paragraph position="0"> The pronominal resolution component of the system is perhaps the most advanced of all the components.</Paragraph>
<Paragraph position="1"> It features a sophisticated salience model designed to produce high-accuracy coreference in highly ambiguous texts. It is capable of noticing ambiguity in a text, and will decline to resolve pronouns in such circumstances. For example, the system will not resolve 'he' in the following example: Earl and Ted were working together when suddenly he fell into the threshing machine.</Paragraph>
<Paragraph position="2"> We resolve pronouns like 'they', 'it', 'he', 'hers', 'themselves' to proper nouns, common nouns and other pronouns. Depending on the genre of data being processed, this component can resolve 60-90% of the pronouns in a text with very high accuracy.</Paragraph> </Section>
<Section position="5" start_page="19" end_page="19" type="sub_section"> <SectionTitle> 3.5 The Overall Nexus of Coreference in a Document </SectionTitle> <Paragraph position="0"> Once all the coreference in a document has been computed, we have a good approximation of which sentences are strongly related to other sentences in the document, obtained by counting the number of coreference links between the sentences. We know which entities are mentioned most often, and which other entities are involved in the same sentences or paragraphs. This sort of information has been used to generate very effective summaries of documents and as a foundation for a simple visualization interface to texts.</Paragraph> </Section> </Section>
<Section position="6" start_page="19" end_page="22" type="metho"> <SectionTitle> 4 Cross Document Coreference </SectionTitle> <Paragraph position="0"> Cross-document coreference occurs when the same person, place, event, or concept is discussed in more than one text source. Figure 1 shows the architecture of the cross-document module of CAMP.</Paragraph>
<Paragraph position="1"> This module takes as input the coreference chains produced by CAMP's within document coreference module. Details about each of the main steps of the cross-document coreference algorithm are given below.</Paragraph>
<Paragraph position="2"> * First, for each article, the within document coreference module of CAMP is run on that article. It produces coreference chains for all the entities mentioned in the article. For example, consider the two extracts in Figures 2 and 4. The coreference chains output by CAMP for the two extracts are shown in Figures 3 and 5.</Paragraph>
<Paragraph position="3"> * Next, for the coreference chain of interest within each article (for example, the coreference chain that contains &quot;John Perry&quot;), the SentenceExtractor module extracts all the sentences that contain the noun phrases which form the coreference chain. In other words, the SentenceExtractor module produces a &quot;summary&quot; of the article with respect to the entity of interest. These summaries are a special case of the query-sensitive techniques being developed at Penn using CAMP. Therefore, for doc.36 (Figure 2), since at least one of the three noun phrases (&quot;John Perry,&quot; &quot;he,&quot; and &quot;Perry&quot;) in the coreference chain of interest appears in each of the three sentences in the extract, the summary produced by SentenceExtractor is the extract itself. On the other hand, the summary produced by SentenceExtractor for the coreference chain of interest in doc.38 is only the first sentence of the extract, because the only element of the coreference chain appears in this sentence.</Paragraph>
<Paragraph position="4"> * Finally, for each article, the VSM-Disambiguate module uses the summary extracted by the SentenceExtractor and computes its similarity with the summaries extracted from each of the other articles. The VSM-Disambiguate module uses a standard vector space model (used widely in information retrieval) (Salton, 89) to compute the similarities between the summaries. Summaries having similarity above a certain threshold are considered to be regarding the same entity.</Paragraph>
[Figure 4 (extract from doc.38): Oliver &quot;Biff&quot; Kelly of Weymouth succeeds John Perry as president of the Massachusetts Golf Association. &quot;We will have continued growth in the future,&quot; said Kelly, who will serve for two years. &quot;There's been a lot of changes and there will be continued changes as we head into the year 2000.&quot;]
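The following is a minimal sketch of the comparison step (hypothetical Python; raw term-frequency vectors and plain cosine similarity stand in for the full vector space model of (Salton, 89), and the 0.15 default threshold is the value reported for the John Smith data set in Section 4.2):

    import math
    from collections import Counter

    def cosine(summary_a, summary_b):
        # Bag-of-words cosine similarity between two summaries.
        va = Counter(summary_a.lower().split())
        vb = Counter(summary_b.lower().split())
        dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
        norm = (math.sqrt(sum(c * c for c in va.values())) *
                math.sqrt(sum(c * c for c in vb.values())))
        return dot / norm if norm else 0.0

    def link_same_entity(summaries, threshold=0.15):
        # Pairs of summaries whose similarity exceeds the threshold are
        # taken to be about the same entity.
        links = []
        for i in range(len(summaries)):
            for j in range(i + 1, len(summaries)):
                if cosine(summaries[i], summaries[j]) > threshold:
                    links.append((i, j))
        return links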
<Section position="1" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 4.1 Experiments and Results </SectionTitle> <Paragraph position="0"> We tested our cross-document system on two highly ambiguous test sets. The first set contained 197 articles from the 1996 and 1997 editions of the New York Times, while the second set contained 219 articles from the 1997 edition of the New York Times. The sole criterion for including an article in either set was the presence of a string matching the regular expression /John.*?Smith/ or /resign/, respectively.</Paragraph>
<Paragraph position="1"> The goal for the first set was to identify cross-document coreference chains about the same John Smith, and the goal for the second set was to identify cross-document coreference chains about the same &quot;resign&quot; event. The answer keys were manually created, but the scoring was completely automated.</Paragraph>
<Paragraph position="2"> There were 35 different John Smiths in the first set. Of these, 24 were involved in chains of size 1.</Paragraph>
<Paragraph position="3"> The other 173 articles were about the 11 remaining John Smiths. Descriptions of a few of the John Smiths are: chairman and CEO of General Motors, assistant track coach at UCLA, the legendary explorer and main character in Disney's Pocahontas, and the former leader of the Labour Party of Britain. In the second set, there were 97 different &quot;resign&quot; events. Of these, 60 were involved in chains of size 1. The articles concerned the resignations of several different people, including Ted Hobart of ABC Corp., Dick Morris, Speaker Jim Wright, and the possible resignation of Newt Gingrich.</Paragraph> </Section>
<Section position="2" start_page="21" end_page="22" type="sub_section"> <SectionTitle> 4.2 Scoring and Results </SectionTitle> <Paragraph position="0"> In order to score the cross-document coreference chains output by the system, we had to map the cross-document coreference scoring problem to a within-document coreference scoring problem. This was done by creating a meta document consisting of the file names of each of the documents that the system was run on. Assuming that each of the documents in the two data sets was about a single John Smith, or about a single &quot;resign&quot; event, the cross-document coreference chains produced by the system could now be evaluated by scoring the corresponding within-document coreference chains in the meta document.</Paragraph>
<Paragraph position="1"> Precision and recall are the measures used to evaluate the chains output by the system. For an entity, i, we define the precision and recall with respect to that entity in Figure 6: Precision_i = (number of correct elements in the output chain containing entity_i) / (number of elements in the output chain containing entity_i), and Recall_i = (number of correct elements in the output chain containing entity_i) / (number of elements in the truth chain containing entity_i).</Paragraph>
<Paragraph position="2"> The final precision and recall numbers are computed by the following two formulae: Final Precision = sum_{i=1}^{N} w_i x Precision_i, and Final Recall = sum_{i=1}^{N} w_i x Recall_i, where N is the number of entities in the document, and w_i is the weight assigned to entity i in the document. For the results discussed in this paper, equal weights were assigned to each entity in the meta document; in other words, w_i = 1/N for all i.</Paragraph>
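A sketch of this scoring computation, under the simplifying assumption that each entity's output and truth chains are represented as sets of mention identifiers (hypothetical code):

    def entity_scores(output_chain, truth_chain):
        # Precision_i and Recall_i for the chains containing entity i.
        correct = len(output_chain & truth_chain)
        return correct / len(output_chain), correct / len(truth_chain)

    def final_scores(chain_pairs):
        # chain_pairs holds one (output_chain, truth_chain) pair per entity.
        # Equal weights w_i = 1/N, as used for the results in this paper.
        n = len(chain_pairs)
        per_entity = [entity_scores(o, t) for o, t in chain_pairs]
        final_precision = sum(p for p, _ in per_entity) / n
        final_recall = sum(r for _, r in per_entity) / n
        return final_precision, final_recall

    pairs = [({"d1", "d2", "d3"}, {"d1", "d2"}), ({"d4"}, {"d4", "d5"})]
    print(final_scores(pairs))  # (0.8333..., 0.75)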
<Paragraph position="3"> Figure 7 shows the precision, recall, and F-Measure (with equal weights for both) statistics for the John Smith data set. [Figure 7 caption fragment: Our Algorithm for the John Smith Data Set] The best precision and recall achieved by the system on this data set were 93% and 77% respectively, when the threshold for the vector space model was set to 0.15. Similarly, Figure 8 shows the same three statistics for the &quot;resign&quot; data set. [Figure 8 caption fragment: Our Algorithm for the &quot;resign&quot; Data Set] The best precision and recall achieved by the system on this data set were 94% and 81% respectively; this occurred when the threshold for the vector space model was set to 0.2. The results show that the system was very successful in resolving cross-document coreference.</Paragraph> </Section> </Section>
<Section position="7" start_page="22" end_page="22" type="metho"> <SectionTitle> 5 Possible Generalizations About Large Data Collections Derived From Coreference Annotations </SectionTitle> <Paragraph position="0"> Crucial to the entire process of visualizing large document collections is relating the same individual or event across multiple documents. This single aspect of our system establishes its viability for large collection analysis. It allows the drops of information held in each document to be merged into a larger, well-organized pool.</Paragraph>
<Section position="1" start_page="22" end_page="22" type="sub_section"> <SectionTitle> 5.1 The Primary Display of Information </SectionTitle> <Paragraph position="0"> Two display techniques immediately suggest themselves for accessing the coreference annotations in a document collection. The first is to take the identified entities as atomic and link them to other entities which co-occur in the same document. This might reveal a relation between individuals and events, or individuals and other individuals. For example, such a linking might indicate that no newspaper article ever mentioned both Clark Kent and Superman, but that almost all other famous individuals tended to overlap in some article or another. In the positive case, individuals may, over time, tend to congregate in media stories, or events may turn out to be more tightly linked than otherwise expected.</Paragraph>
<Paragraph position="1"> The second technique would be to take the documents as atomic and relate via links other documents that contain mention of the same entity. With a temporal dimension, the role of individuals and events could be assessed as time moves forward.</Paragraph> </Section>
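A sketch of the first technique (hypothetical code; the entity sets per document would come from the cross-document coreference chains described in Section 4):

    from collections import defaultdict
    from itertools import combinations

    def build_entity_graph(doc_entities):
        # doc_entities maps a document id to the set of entity ids mentioned
        # in it; edge weights count the documents two entities share.
        weights = defaultdict(int)
        for doc, entities in doc_entities.items():
            for a, b in combinations(sorted(entities), 2):
                weights[(a, b)] += 1
        return weights

    docs = {"doc.36": {"John Perry", "Massachusetts Golf Association"},
            "doc.38": {"John Perry", "Oliver Kelly"}}
    for (a, b), w in build_entity_graph(docs).items():
        print(a, "--", b, ":", w)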
<Section position="2" start_page="22" end_page="22" type="sub_section"> <SectionTitle> 5.2 Finer Grained Analysis of the Documents </SectionTitle> <Paragraph position="0"> The fact that two entities coexisted in the same sentence in a document is noteworthy for correlational analysis. Links could be restricted to those between entities that coexisted in the same sentence or paragraph. Additional filtering is possible with constraints on the sorts of verbs that appear in the sentence. A more sophisticated version of the above is to access the argument structure of the document.</Paragraph>
<Paragraph position="1"> CAMP software provides a limited predicate argument structure that allows subjects/verbs/objects to be identified. This ability moves our annotation closer to the fixed-record data structure of a traditional database. One could select an event and its object, for instance 'X sold arms to Iraq', and see what the fillers for X were in a link analysis (see the sketch at the end of this section).</Paragraph>
<Paragraph position="2"> There are limitations to predicate argument structure matching; for instance, getting the correct pattern for all the variations of selling arms is quite difficult.</Paragraph>
<Paragraph position="3"> In any case, there appear to be a myriad of applications for link analysis in the domain of large text databases.</Paragraph> </Section> </Section>
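As a sketch of the selection described in Section 5.2 (hypothetical code; CAMP's actual predicate-argument representation is not reproduced here), one might filter subject/verb/object triples for a fixed verb and object and collect the subject fillers:

    def fillers_for(triples, verb, obj):
        # Collect the subjects X from (subject, verb, object) triples that
        # match a pattern such as 'X sold arms to Iraq'.
        return [s for s, v, o in triples if v == verb and o == obj]

    triples = [("Acme Corp.", "sold", "arms to Iraq"),
               ("a middleman", "sold", "arms to Iraq"),
               ("Acme Corp.", "bought", "oil")]
    print(fillers_for(triples, "sold", "arms to Iraq"))  # ['Acme Corp.', 'a middleman']

</Paper>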