<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0202"> <Title>Is Hillary Rodham Clinton the President? Disambiguating Names across Documents Yael RAVIN</Title> <Section position="5" start_page="11" end_page="12" type="metho"> <SectionTitle> 3 Splitting Names </SectionTitle> <Paragraph position="0"> We address the &quot;splitting&quot; problem first. The heuristics for splitting names within the document [WRC97] fail to address two kinds of combined names. First, there is a residue of names containing and, such as Hoechst and Schering A.G., in which the and may or may not be part of the organization name. The cross-document algorithm for handling these is similar to the intra-document one: iterate over the name string; break it into component strings at commas and at the word and; verify that each component corresponds to an independently existing canonical string. If all do, split the name. The difference is that at the collection level, there are more canonical strings available for this verification. If the name is split, we repair the cross-document statistics by folding the occurrence statistics of the combined form with those of each of the parts. On the collection level, we split strings like AT&amp;T Wireless and Microsoft Network and AT&amp;T Worldnet, for which there was not enough evidence within the document.</Paragraph> <Paragraph position="1"> More complex is the case of organization names of the form X of Y or X in Y, where Y is a place, such as Fox News Channel in New York City or Prudential Securities in Shanghai. The intra-document heuristic that splits names if their components occur on their own within the document is not appropriate here: the short form may be licensed in the document only because the full form serves as its antecedent. We need evidence that the short form occurs by itself in other contexts. First, we sort these names and verify that there are no ambiguities. 
For example, it may appear that Union Bank of Switzerland in San Francisco is a candidate for splitting, since Union Bank of Switzerland occurs as a canonical name, but the existence of Union Bank of Switzerland in New York signals an ambiguity: there are several distinct entities whose names start with Union Bank of Switzerland, and so no splitting applies. Similar ambiguity is found with Federal District Court in New York, Federal District Court in Philadelphia, etc. [Footnote 2: Note that this definition of ambiguity is dependent on names found in the collection. For example, in the [NYT98] collection, the only Prudential Securities in/of ... found was Prudential Securities in Shanghai.]</Paragraph> </Section> <Section position="6" start_page="12" end_page="12" type="metho"> <SectionTitle> 4 Merging Names </SectionTitle> <Paragraph position="0"> As discussed in [BB98], a promising approach to determining whether names corefer is the comparison of their contexts. However, since comparing contexts for all similar canonical strings would be prohibitively expensive, we have devised means of defining compatible names that are good candidates for coreference, based on knowledge obtained during intra-document processing. Our algorithm sorts names with common substrings from least to most ambiguous. For example, PR names are sorted by identical last names. The least ambiguous ones also contain a first name and middle name, followed by ones containing a first name and middle initial, then ones containing only a first name, then ones containing only a first initial, and finally ones with just a last name. PR names may also carry gender information, determined either on the basis of the first name (e.g. Bill but not Jamie) or a gender prefix (e.g. Mr., but not President) of the canonical form or one of its variants. PL names are sorted by common initial strings. The least ambiguous have the pattern of &lt;small place, big place&gt;. 
By comparing the internal structure of these sorted groups, we are able to divide them into mutually exclusive sets (ES), whose incompatible features prevent any merging, and a residue of mergeable names (MN), which are compatible with some or all of the exclusive ones. For some of the mergeable names, we are able to stipulate coreference with the exclusive names without any further tests.</Paragraph> <Paragraph position="1"> For others, we need to compare contexts before reaching a conclusion.</Paragraph> <Paragraph position="2"> To illustrate with an example, we collected the following sorted group for last name Clinton 3: There is too much ambiguity (or uncertainty) to stipulate coreference among the members of this sorted group. There is, however, one stipulated merge we apply to Bill Clinton [PR] and Bill Clinton [PR?]. We have found that when the canonical string is identical, a weak entity type can safely combine with a strong one. There are many cases of PR? to PR merging, some of PL? to ORG (e.g., Digital City), and a fair number of PL? to PR, as in Carla Hills, U.S. and Mrs. Carla Hills. We discuss merging involving context comparison in the following section.</Paragraph> </Section> <Section position="7" start_page="12" end_page="13" type="metho"> <SectionTitle> Footnote 3 </SectionTitle> <Paragraph position="0"> Intra-document analysis identified President Clinton once as referring to a male, since President Clinton and Mr. Clinton were merged within the document(s); another time as referring to a female, since only President Clinton and Mrs. Clinton appeared in the document(s) in question and were merged; and a third President Clinton, based on documents where there was insufficient evidence for gender.</Paragraph> </Section> </Paper>