
<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0405">
  <Title>Unsupervised Personal Name Disambiguation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1000 Web Pages
</SectionTitle>
    <Paragraph position="0"> Another topic of recent interest is in producing biographical summaries from corpora (Schiffman et al., 2001). Along with disambiguation, our system simultaneously collects biographic information (Table 1). The relevant biographical attributes are depicted along with a clustering which shows the distinct referents (Section 4.1).</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Robust Extraction of Categorical
Biographic Data
</SectionTitle>
    <Paragraph position="0"> Past work on this task (e.g. Bagga and Baldwin, 1998) has primarily approached personal name disambiguation using document context profiles or vectors, which recognize and distinguish identical name instances based on partially indicative words in context such as computer or car in the Clark case. However, in the specialized case of personal names, there is more precise information available.</Paragraph>
    <Paragraph position="1"> In particular, information extraction techniques can add high precision, categorical information such as approximate age/date-of-birth, nationality and occupation. This categorical data can support or exclude a candidate name$referent matches with higher confidence and greater pinpoint accuracy than via simple context vector-style features alone.</Paragraph>
    <Paragraph position="2"> Another major source of disambiguation information for proper nouns is the space of associated names. While these names could be used in a undifferentiated vector-based bag-of-words model, further accuracy can be gained by extracting specific types of association, such as familial relationships (e.g. son, wife), employment relationships (e.g.</Paragraph>
    <Paragraph position="3"> manager of), and nationality as distinct from simple term co-occurrence in a window. The Jim Clark married to &amp;quot;Vickie Parker-Clark&amp;quot; is likely not the same Jim Clark married to &amp;quot;Patty Clark&amp;quot;. Additionally, information about one's associates can help predict information about the person in question.</Paragraph>
    <Paragraph position="4"> Someone who frequently associates with Egyptians is likely to be Egyptian, or at the very least, has a close connection to Egypt.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Generating Extraction Patterns
</SectionTitle>
      <Paragraph position="0"> One standard method for generating extraction patterns is simply to write them by hand. In this paper, we have experimented with generating patterns automatically from data. This has the advantage of being more flexible, portable and scalable, and potentially having higher precision and recall. It also has the advantage of being applicable to new languages for which no developer with sufficient knowledge of the language is available.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Templates and Web Pages
</SectionTitle>
      <Paragraph position="0"> In the late 90s, there was a substantial body of research on learning information extraction patterns from templates (Huffman, 1995; Brin, 1998; Califf and Mooney, 1998; Freitag and McCallum, 1999; Yangarber et al., 2000; Ravichandran and Hovy, 2002). These techniques provide a way to bootstrap information extraction patterns from a set of example extractions or seed facts, where a tuple with the filled roles for the desired pattern are given. For the task of extracting biographical information, each example would include the personal name and the biographic feature. For example, training data for the pattern born in might be (&amp;quot;Wolfgang Amadeus Mozart&amp;quot;,1756). Given this set of examples, each method generates patterns differently.</Paragraph>
      <Paragraph position="1"> In this paper, we employ and extend the method described by Ravichandran and Hovy (2002) shown in Figure 1. For each seed fact pair for a given template (such as (Mozart,1756)), a web query is made which in turn leads to sentences in which the roles are observed in nearby association (e.g. &amp;quot;Mozart was born in 1756&amp;quot;). All substrings from these sentences are then extracted. The substrings are then subject to simple generalization, to produce candidate patterns: Mozart is replaced by &lt;name&gt;, 1756 is replaced by &lt;birth year&gt;, and all digits are replaced by #. These substring templates can  then serve as extraction patterns for previously unknown fact pairs, and their precision in fact extraction can be calculated with respect to a set of currently known facts.</Paragraph>
      <Paragraph position="2"> We examined a subset of the available and desirable extracted information. We learned patterns for birth year and occupation, and hand-coded patterns for birth location, spouse, birthday, familial relationships, collegiate affiliations and nationality. Other potential patterns currently under investigation include employer/employee and place of residence.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Multilingual Information Extraction
</SectionTitle>
      <Paragraph position="0"> We adapted the information extraction pattern generation techniques described above to multiple languages. In particular, the methodology proposed by Ravichandran and Hovy (2002) requires no parsing or other language specific resources, so is an ideal candidate for multilingual use. In this paper, we conducted an initial test test of the viability of inducing these information extraction patterns across languages. To test, we constructed a initial database of 5 people and their birthdays, and used this to induce the English patterns. We then increased the database to 50 people and birthdays and induced patterns for Spanish, presenting the results above. Figure 2 shows the top precision patterns extracted for English and for Spanish.</Paragraph>
      <Paragraph position="1"> It can be seen that the Spanish patterns are of the same length, with similar estimated precision, as well as similar word and punctuation distribution as the English ones. In fact, the purely syntactic patterns look identical. The only difference being that to generate equivalent Spanish data, a database of training examples an order of magnitude larger was required. This may be because for each database entry more pages were available on English websites than on Spanish websites.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Using Unsupervised Clustering to
Identify the Referents of Personal Names
</SectionTitle>
    <Paragraph position="0"> Identify the Referents of Personal Names This section examines clustering of web pages which containing an ambiguous personal name (with multiple real referents). The cluster method we employed is bottom-up centroid agglomerative clustering. In this method, each document is assigned a vector of automatically extracted features. At each stage of the clustering, the two most similar vectors are merged, to produce a new cluster, with a vector equal to the centroid of the vectors in the cluster. This step is repeated until all documents are clustered.</Paragraph>
    <Paragraph position="1"> To generate the vectors for each document, we explored a variety of methods:  1. Baseline : All words (plain) or only Proper Nouns (nnp) 2. Most Relevant words (mi and tf-idf) 3. Basic biographical features (feat) 4. Extended biographical Features (extfeat)  mation with the document collection and all of extended feature words for DAVIS/HARRELSON pseudoname null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Baseline Models
</SectionTitle>
      <Paragraph position="0"> In our baseline models, we used term vectors composed either of all words (minus a set of closed class &amp;quot;stop&amp;quot; words) or of only proper nouns. To assess similarity between vectors we utilized standard cosine similarity (cos(a;b) = a bjjajj jjbjj).</Paragraph>
      <Paragraph position="1"> We experimentally determined that the use of proper nouns alone led to more pure clustering. As a result, for the remainder of the experiments, we used only proper nouns in the vectors, except for those common words introduced by the various feature sets.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Relevant Words (mi and tf-idf)
</SectionTitle>
      <Paragraph position="0"> Selective term weighting has been shown to be highly effective for information retrieval. For this study, we investigated both the use of standard TF-IDF weighting and weighting based on the mutual information, where given a document collection c, for each word w, we calculate I(w; c) = p(wjc)p(w) .</Paragraph>
      <Paragraph position="1"> From these, we select words which appear more than 1 = 20 times in the collection, and have a I(w; c) greater than 2 = 10. These words are to the document's feature vector with a weight equal to log(I(w; c)).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Extracted Biographical Features (feat)
</SectionTitle>
      <Paragraph position="0"> The next set of models use the features extracted using the methodology described in Section 2. Biographical information such as birth year, and occupation, when found, is quite useful in connecting documents. If a document connects a name with a birth year, and another document connects the same name with the same birth year, typically, those two documents refer to the same person.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
DAVIS/HARRELSON pseudoname
</SectionTitle>
    <Paragraph position="0"> These extracted features were used to categorically cluster documents in which they appeared. Because of their high degree of precision and specificity, documents which contained similar extracted features are virtually guaranteed to have the same referent. By clustering these documents first, large high quality clusters formed, which then then provided an anchor for the remaining pages. By examining the dendrogram in Figure 3, it is clear that the clusters start with documents with matching features, and then the other documents cluster around this core.</Paragraph>
    <Paragraph position="1"> In addition to improving disambiguation performance, these extracted features help distinguish the different clusters, and provide information about the different people.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Extended Biographical Features (extfeat)
</SectionTitle>
      <Paragraph position="0"> Another method for using these extracted features is to give higher weight to words which have ever been seen as filling a pattern. For example, if 1756 is extracted as a birth year from a syntactic-based pattern for the polysemous name, then whenever 1756 is observed anywhere in context (outside an extraction pattern), it is given a higher weighting and added to the document vector as a potential biographic feature. In our experiments, we did this only for words which appeared as values for a feature more than a threshold of 4 times. Then, whenever the word was seen in a document, it was given a weight equal to the log of the number of times the word was seen as an extracted feature.</Paragraph>
      <Paragraph position="1"> actor comedy  |spouse:Demi Moore  |Woody Harrelson</Paragraph>
      <Paragraph position="3"/>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Cluster Refactoring
</SectionTitle>
      <Paragraph position="0"> Ideally, the raw unsupervised clustering would yield a top level distinction between the different referents. However, this is rarely the case. With this type of agglomerative clustering, the most similar pages are clustered first, and outliers are assigned as stragglers at the top levels of the cluster tree.</Paragraph>
      <Paragraph position="1"> This typically leads to a full clustering where the top-level clusters are significantly less discriminative than those at the roots. In order to compensate for this effect, we performed a type of tree refactoring, which attempted to pick out and utilize seed clusters from within the entire clustering.</Paragraph>
      <Paragraph position="2"> In the refactoring, the clustering is stopped before it runs to completion, based on the percentage of documents clustered and the relative size of the clusters achieved. At this intermediate stage, relatively large and high-precision clusters are found (e.g. Figure 2). These automatically-induced clusters are then used as seeds for the next stage, where the unclustered documents are assigned to the seed with the closest distance measure (Figure 3).</Paragraph>
      <Paragraph position="3"> An alternative to this form of cluster refactoring would be to initially cluster only pages with extracted features. This would yield a set of cluster seeds, divided by features, which could then be used for further clustering. However, this method relies on having a number of pages with extracted features that overlap from each referent. This can only be actor comedy  |spouse:Demi Moore  |Woody Harrelson</Paragraph>
      <Paragraph position="5"> ization for DAVIS/HARRELSON pseudoname assured when the feature set is rich, or a large document space is assumed.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>