<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1002">
  <Title>Using Encyclopedic Knowledge for Named Entity Disambiguation</Title>
  <Section position="4" start_page="10" end_page="11" type="metho">
    <SectionTitle>
3 A Dictionary of Named Entities
</SectionTitle>
    <Paragraph position="0"> We organize all named entities from Wikipedia into a dictionary structure BW, where each string entry CS BE BW is mapped to the set of entities CSBMBX that can be denoted by CS in Wikipedia. The first step is to identify named entities, i.e. entities with a proper name title. Because every title in Wikipedia must begin with a capital letter, the decision whether a title is a proper name relies on the following sequence of heuristic steps:  1. If CTBMD8CXD8D0CT is a multiword title, check the capitalization of all content words, i.e. words other than prepositions, determiners, conjunctions, relative pronouns or negations.</Paragraph>
    <Paragraph position="1"> Consider CT a named entity if and only if all content words are capitalized.</Paragraph>
    <Paragraph position="2"> 2. If CTBMD8CXD8D0CT is a one word title that contains at least two capital letters, then CT is a named entity. Otherwise, go to step 3.</Paragraph>
    <Paragraph position="3"> 3. Count how many times CTBMD8CXD8D0CT occurs in the  text of the article, in positions other than at the beginning of sentences. If at least BJBHB1 of these occurrences are capitalized, then CT is a named entity.</Paragraph>
    <Paragraph position="4"> The combined heuristics extract close to half a million named entities from Wikipedia. The second step constructs the actual dictionary BW as follows: null AF The set of entries in BW consists of all strings that may denote a named entity, i.e. if CTBEBX is a named entity, then its title name CTBMD8CXD8D0CT, its redirect names CTBMCA, and its disambiguation names CTBMBW are all added as entries in BW. AF Each entry string CSBEBW is mapped to CSBMBX, the set of entities that CS may denote in Wikipedia. Consequently, a named entity CT is included in CSBMBX if and only if CS BP CTBMD8CXD8D0CT, CS BE CTBMCA, or CSBECTBMBW.</Paragraph>
  </Section>
  <Section position="5" start_page="11" end_page="12" type="metho">
    <SectionTitle>
4 Named Entity Disambiguation
</SectionTitle>
    <Paragraph position="0"> As illustrated in Section 1, the same proper name may refer to more than one named entity. The named entity dictionary from Section 3 and the hyperlinks from Wikipedia articles provide a dataset of disambiguated occurrences of proper names, as described in the following. As shown in Section 2.4, each link contains the title name of an entity, and the proper name (the display string) used to refer to it. We use the term query to denote the occurrence of a proper name inside a Wikipedia article. If there is a dictionary entry matching the proper name in the query D5 such that the set of denoted entities D5BMBX contains at least two entities, one of them the true answer entity D5BMCT, then the query D5 is included in the dataset. More exactly, if</Paragraph>
    <Paragraph position="2"> the dataset will be augmented with D2 pairs CWD5BNCT</Paragraph>
    <Paragraph position="4"> The field D5BMCC contains all words occurring in a limit length window centered on the proper name.</Paragraph>
    <Paragraph position="5"> The window size is set to 55, which is the value that was observed to give optimum performance in the related task of cross-document coreference (Gooi and Allan, 2004). The Kronecker delta function AEB4CT</Paragraph>
    <Paragraph position="7"> is the same as the entity D5BMCT referred in the link. Table 2 lists the query pairs created for the three John Williams queries from Section 1.1, assuming only three entities in Wikipedia correspond to this name.</Paragraph>
    <Paragraph position="8">  The application of this procedure on Wikipedia results into a dataset of 1,783,868 disambiguated queries.</Paragraph>
    <Section position="1" start_page="11" end_page="12" type="sub_section">
      <SectionTitle>
4.1 Context-Article Similarity
</SectionTitle>
      <Paragraph position="0"> Using the representation from the previous section, the name entity disambiguation problem can be cast as a ranking problem. Assuming that an appropriate scoring function D7CRD3D6CTB4D5BNCT</Paragraph>
      <Paragraph position="2"> able, the named entity corresponding to query D5 is defined to be the one with the highest score:</Paragraph>
      <Paragraph position="4"> If CMCT BP D5BMCT then CMCT represents a hit, otherwise CMCT is a miss. Disambiguation methods will then differ based on the way they define the scoring function.</Paragraph>
      <Paragraph position="5"> One ranking function that is evaluated experimentally in this paper is based on the cosine similarity between the context of the query and the text of the article:  BMCC are represented in the standard vector space model, where each component corresponds to a term in the vocabulary, and the term weight is the standard D8CU A2 CXCSCU score (Baeza-Yates and Ribeiro-Neto, 1999). The vocabulary CE is created by reading all Wikipedia  articles and recording, for each word stem DB, its document frequency CSCUB4DBB5 in Wikipedia. Stop-words and words that are too frequent or too rare are discarded. A generic document CS is then represented as a vector of length CYCE CY, with a position for each vocabulary word. If CUB4DBB5 is the frequency of word DB in document CS, and C6 is the total number of Wikipedia articles, then the weight of word DBBECE in the D8CU A2 CXCSCU representation of CS is:</Paragraph>
    </Section>
    <Section position="2" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
4.2 Taxonomy Kernel
</SectionTitle>
      <Paragraph position="0"> An error analysis of the cosine-based ranking method reveals that, in many cases, the pair CWD5BNCTCX fails to rank first, even though words from the query context unambiguously indicate CT as the actual denoted entity. In these cases, cue words from the context do not appear in CT's article due to two  main reasons: 1. The article may be too short or incomplete. 2. Even though the article captures most of the  relevant concepts expressed in the query context, it does this by employing synonymous words or phrases.</Paragraph>
      <Paragraph position="1"> The cosine similarity between D5 and CT</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="12" end_page="12" type="metho">
    <SectionTitle>
CZ
</SectionTitle>
    <Paragraph position="0"> can be seen as an expression of the total degree of correlation between words from the context of query D5 and a given named entity CT  , it is worth considering the correlation between context words and the categories to which CT</Paragraph>
  </Section>
  <Section position="7" start_page="12" end_page="13" type="metho">
    <SectionTitle>
CZ
</SectionTitle>
    <Paragraph position="0">  belongs. For illustration, consider the two queries for the name John Williams from Figure 1. To avoid clutter, Figure 1 depicts only two entities with the name JohnWilliams in Wikipedia: the composer and the wrestler. On top of each entity, the figure shows one of their Wikipedia categories (Film score composers and Professional wrestlers respectively), together with some of their ancestor categories in the Wikipedia taxonomy. The two query contexts are shown at the bottom of the figure. In the context on the left, words such as conducted and concert denote concepts that are highly correlated with the Musicians, Composers and Filmscore composers categories. On the other hand, their correlation with other categories in Figure 1 is considerably lower. Consequently, a  goal of this paper is to design a disambiguation method that 1) learns the magnitude of these correlations, and 2) uses these correlations in a scoring function, together with the cosine similarity. Our intuition is that, given the query context on the left, such a ranking function has a better chance of ranking the &amp;quot;composer&amp;quot; entity higher than the &amp;quot;wrestler&amp;quot; entity, when compared with the simple cosine similarity baseline.</Paragraph>
    <Paragraph position="1"> We consider using a linear ranking function as follows:  The weight vector DB models the magnitude of each word-category correlation, and can be learned by training on the query dataset described at the beginning of Section 4. We used the kernel version of the large-margin ranking approach from (Joachims, 2002) which solves the optimization  problem in Figure 2. The aim of this formulation is to find a weight vector DB such that 1) the number of ranking constraints DB A8B4D5BND5BMCTB5 AL DB A8B4D5BNCT</Paragraph>
    <Paragraph position="3"> from the training data that are violated is minimized, and 2) the ranking function DB A8B4D5BNCT  product between the number of common words in the contexts of the two queries and the number of categories common to the two named entities, plus the product of the two cosine similarities. The corresponding ranking kernel is:  In (McCallum et al., 1998), a statistical technique called shrinkage is used in order to improve the accuracy of a naive Bayes text classifier. Accordingly, one can take advantage of a hierarchy of classes by combining parameter estimates of parent categories into the parameter estimates of a child category. The taxonomy kernel is very related to the same technique - one can actually regard it as a distribution-free analogue of shrinkage.</Paragraph>
    <Section position="1" start_page="13" end_page="13" type="sub_section">
      <SectionTitle>
4.3 Detecting Out-of-Wikipedia Entities
</SectionTitle>
      <Paragraph position="0"> The two disambiguation methods discussed above (Sections 4.1 and 4.2) implicitly assume that Wikipedia contains all entities that may be denoted by entries from the named entity dictionary.</Paragraph>
      <Paragraph position="1"> Taking for example the name John Williams, both methods assume that in any context, the referred entity is among the 22 entities listed on the disambiguation page in Wikipedia. In practice, there may be contexts where John Williams refers to an  is not a popular entity. These out-of-Wikipedia entities are accommodated in the ranking approach to disambiguation as follows. A special entity CT D3D9D8 is introduced to denote any entity not covered by Wikipedia. Its attributes are set to null values (e.g., the article text CT</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="13" end_page="13" type="metho">
    <SectionTitle>
BMCC BP BN,
</SectionTitle>
    <Paragraph position="0"> and the set of categories CT D3D9D8 BMBV BP BN). The ranking in Equation 1 is then updated so that it returns the Wikipedia entity with the highest score, if this score is greater then a fix threshold AS, otherwise it  If the scoring function is implemented as a weighted combination of feature functions, as in Equation 3, then the modification shown above results into a new feature AU  The associated weight AS is learned along with the weights for the other features (as defined in Equation 5).</Paragraph>
  </Section>
  <Section position="9" start_page="13" end_page="15" type="metho">
    <SectionTitle>
5 Experimental Evaluation
</SectionTitle>
    <Paragraph position="0"> The taxonomy kernel was trained using the SVMD0CXCVCWD8 package (Joachims, 1999). As described in Section 4, through its hyperlinks, Wikipedia provides a dataset of 1,783,868 ambiguous queries that can be used for training a named entity disambiguator. The apparently high number of queries actually corresponds to a moderate size dataset, given that the space of parameters includes one parameter for each word-category combination. However, assuming SVMD0CXCVCWD8 does not run out of memory, using the entire dataset for training and testing is extremely  those that differ from the true answer CT in terms of their categories from BV  is restricted to all categories under People by occupation. Each category must have at least 200 articles to be retained,which results in a total of 540 categories out of the 8202 categories under People by occupation. The query dataset is generated as in the first scenario by replacing BV  ). In order to make the task more realistic, all queries from the initial Wikipedia dataset are considered as follows. For each query D5, out of all matching entities that do not have a category under People by occupation, one is randomly selected as an out-of-Wikipedia entity. Then, out of all queries for which the true answer is an out-of-Wikipedia entity, a subset is randomly selected such that, in the end, the number of queries with out-of-Wikipedia true answers is BDBCB1 of the total number of queries. In other words, the scenario assumes the task is to detect if a name denotes an entity belonging to the People by occupation taxonomy and, in the positive cases, to disambiguate between multiple entities under People by occupation that have the same name.</Paragraph>
    <Paragraph position="1"> The dataset for each scenario is split into a training dataset and a testing dataset which are disjoint in terms of the query names used in their examples. For instance, if a query for the name John Williams is included in the training dataset, then all other queries with this name are allocated for learning (and consequently excluded from testing). Using a disjoint split is motivated by the fact that many Wikipedia queries that have the same true answer also have similar contexts, containing rare words that are highly correlated, almost exclusively, with the answer. For example, query names that refer to singers often contain album or song names, query names that refer to writers often contain book names, etc. The taxonomy kernel can easily &amp;quot;memorize&amp;quot; these associations, especially when the categories are very fine-grained. In the current framework, the unsupervised method of context-article similarity does not utilize the correlations present in the training data. Therefore, for the sake of comparison, we decided to prohibit the taxonomy kernel from using these correlations by training and testing on a disjoint split. Section 6 describes how the training queries could be used in the computation of the context-article similarity, which has the potential of boosting the accuracy for both disambiguation methods.</Paragraph>
    <Paragraph position="2"> Table 3 shows a number of relevant statistics for each scenario: #CAT represents the number of Wikipedia categories, #SV is the number of support vectors, TK(A) and Cos(A) are the accuracy of the Taxonomy Kernel and the Cosine similarity respectively. The training and testing datasets are characterized in terms of the number of queries and query-answer pairs. The number of ranking contraints (as specified in Figure 2) is also included for the training data in column #CONSTR.</Paragraph>
    <Paragraph position="3"> The size of the training data is limited so that learning in each scenario takes within three days on a Pentium 4 CPU at 2.6 GHz. Furthermore,</Paragraph>
  </Section>
  <Section position="10" start_page="15" end_page="15" type="metho">
    <SectionTitle>
in CB
BG
</SectionTitle>
    <Paragraph position="0"> , the termination error criterion AF is changed from its default value of BCBMBCBCBD to BCBMBCBD. Also, the threshold AS for detecting out-of-Wikipedia entities when ranking with cosine similarity is set to the value that gives highest accuracy on training data.</Paragraph>
    <Paragraph position="1"> As can be seen in the last two columns, the Taxonomy Kernel significantly outperforms the Cosine similarity in the first three scenarios, confirming our intuition that correlations between words from the query context and categories from Wikipedia taxonomy provide useful information for disambiguating named entities. In the last scenario, which combines detection and disambiguation, the gain is not that substantial. Most queries in the corresponding dataset have only two possible answers, one of them an out-of-Wikipedia answer, and for these cases the cosine is already doing well at disambiguation. We conjecture that a more significant impact would be observed if the dataset queries were more ambiguous.</Paragraph>
  </Section>
  <Section position="11" start_page="15" end_page="15" type="metho">
    <SectionTitle>
6 Future Work
</SectionTitle>
    <Paragraph position="0"> The high number of support vectors - half the number of query-answer pairs in training data suggests that all scenarios can benefit from more training data. One method for making this feasible is to use the weight vector DB explicitely in a linear SVM. Because much of the computation time is spent on evaluating the decision function, using DB explicitely may result in a significant speed-up.</Paragraph>
    <Paragraph position="1"> The dimensionality of DB (by default CYCE CY A2 CYBVCY) can be reduced significantly by considering only word-category pairs whose frequency in the training data is above a predefined threshold.</Paragraph>
    <Paragraph position="2"> A complementary way of using the training data is to augment the article of each named entity with the contexts from all queries for which this entity is the true answer. This method has the potential of improving the accuracy of both methods when the training and testing datasets are not disjoint in terms of the proper names used in their queries.</Paragraph>
    <Paragraph position="3"> Word-category correlations have been used in (Ciaramita et al., 2003) to improve word sense disambiguation (WSD), although with less substantial gains. There, a separate model was learned for each of the 29 ambiguous nouns from the Senseval 2 lexical sample task. While creating a separate model for each named entity is not feasible there are 94,875 titles under People by occupation - named entity disambiguation can nevertheless benefit from correlations between Wikipedia categories and features traditionally used in WSD such as bigrams and trigrams centered on the proper name occurrence, and syntactic information.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML