<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2809"> <Title>A proposal to automatically build and maintain gazetteers for Named Entity Recognition by using Wikipedia</Title> <Section position="4" start_page="56" end_page="58" type="intro"> <SectionTitle> 2 Approach </SectionTitle> <Paragraph position="0"> In this section we present our approach to automatically building and maintaining dictionaries of proper nouns. In a nutshell, we analyse the entries of an encyclopedia with the aid of a noun hierarchy. Our motivation is that the proper nouns that form entities can be obtained from the entries of an encyclopedia, and that features of their definitions can help to classify them into the correct entity category.</Paragraph> <Paragraph position="1"> The encyclopedia used is Wikipedia1.</Paragraph> <Paragraph position="2"> According to the English version of Wikipedia2, Wikipedia is a multilingual, web-based, free-content encyclopedia which is updated continuously in a collaborative way. The reasons why we have chosen this encyclopedia are the following: * It is a large source of information. As of December 2005, it had over 2,500,000 definitions. The English version alone has more than 850,000 entries.</Paragraph> <Paragraph position="3"> * Its content is freely licensed, meaning that it will always be available for research without restrictions and without the need to acquire any license.</Paragraph> <Paragraph position="4"> * It is a general knowledge resource. Thus, it can be used to extract information for open-domain systems.</Paragraph> <Paragraph position="5"> * Its data has some degree of formality and structure (e.g. categories), which helps to process it.</Paragraph> <Paragraph position="6"> * It is a multilingual resource. 
Thus, if we are able to develop a language-independent system, it can be used to create gazetteers for any language for which Wikipedia is available.</Paragraph> <Paragraph position="7"> * It is continuously updated. This is very important for the maintenance of the gazetteers.</Paragraph> <Paragraph position="8"> The noun hierarchy used is that of WordNet (Miller, 1995), a widely used resource for NLP tasks. Although WordNet was initially a monolingual resource for the English language, a later project called EuroWordNet (Vossen, 1998) provided wordnet-like hierarchies for a set of languages of the European Union. In addition, EuroWordNet defines a language-independent index called the Inter-Lingual-Index (ILI), which makes it possible to establish relations between words in the wordnets of different languages. The ILI also facilitates the development of wordnets for other languages.</Paragraph> <Paragraph position="9"> From this noun hierarchy we consider the nodes (called synsets in WordNet) which in our opinion most accurately represent the kinds of entities we are working with (location, organization and person). For example, we consider synset 6026 as corresponding to the entity class Person. This is the information contained in synset number 6026: person, individual, someone, somebody, mortal, human, soul -- (a human being; &quot;there was too much for one person to do&quot;) Given an entry from Wikipedia, a PoS tagger (Carreras et al., 2004) is applied to the first sentence of its definition. As an example, the first sentence of the entry Portugal in the Simple English Wikipedia3 is presented here: For every noun in a definition we obtain the WordNet synset that contains its first sense4. We follow the hyperonymy branch of this synset until we reach either a synset we have considered as belonging to an entity class or the root of the hierarchy. 
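The hypernym climb just described can be sketched as follows. This is a minimal, self-contained sketch: the toy hypernym table stands in for the real WordNet noun hierarchy, and all names and table entries are illustrative assumptions, not the paper's implementation.

```python
# Toy stand-in for the WordNet hyperonymy branch: each noun maps to its
# (first) hypernym. The real system walks WordNet itself.
HYPERNYM = {
    'portugal': 'country',
    'country': 'administrative_district',
    'administrative_district': 'location',
    'europe': 'continent',
    'continent': 'location',
    'location': 'entity',   # 'entity' plays the role of the hierarchy root
}

# Synsets chosen as representing the entity classes (the paper uses, e.g.,
# synset 6026 for Person; plain strings are used here for illustration).
ENTITY_CLASS = {
    'person': 'PERSON',
    'location': 'LOCATION',
    'organization': 'ORGANIZATION',
}

def classify_noun(noun):
    """Follow the hypernymy branch of a noun's first sense until an
    entity-class synset or the root of the hierarchy is reached."""
    current = noun.lower()
    while True:
        if current in ENTITY_CLASS:
            return ENTITY_CLASS[current]
        if current not in HYPERNYM:   # reached the root (or unknown noun)
            return 'NONE'
        current = HYPERNYM[current]

print(classify_noun('country'))      # LOCATION
print(classify_noun('south-west'))   # NONE
print(classify_noun('europe'))       # LOCATION
```

With this toy hierarchy the function reproduces the example classifications given for the nouns of the Portugal definition.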
If we reach one of the considered synsets, we assign that noun to the entity class of the corresponding synset. The following example may clarify this explanation:</Paragraph> <Paragraph position="10"> country --> LOCATION south-west --> NONE europe --> LOCATION As stated in the abstract, the application of a PoS tagger is optional. The algorithm performs considerably faster with it, since with PoS data only the nouns need to be processed. If a PoS tagger is not available for a language, the algorithm can still be applied; the only drawback is that it performs slower, as it needs to process all the words. However, our experiments show that the results do not change significantly.</Paragraph> <Paragraph position="11"> Finally, we apply a weighting algorithm which takes into account the number of nouns in the definition identified as belonging to each of the considered entity types, and decides to which entity type the entry belongs. This algorithm has a constant Kappa which allows us to increase or decrease the difference required between categories in order to assign an entry to a given class. The value of Kappa is the minimum difference in the number of occurrences between the first and second most frequent categories of an entry required to assign the entry to the first category. In our example, for any value of Kappa lower than 4, the algorithm would assign the entry Portugal to the location entity type.</Paragraph> <Paragraph position="12"> On top of this basic approach we apply different heuristics which we expect to improve the results; their effect is analysed in the results section.</Paragraph> <Paragraph position="13"> The first heuristic, called is instance, tries to determine whether the entries from Wikipedia are instances (e.g. Portugal) or word classes (e.g. country). This is done because named entities comprise only instances. 
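The weighting algorithm with the Kappa constant can be sketched as follows. This is a hedged sketch: the function name and the illustrative class counts are our own assumptions, not the paper's implementation.

```python
from collections import Counter

def classify_entry(noun_classes, kappa=1):
    """Decide the entity type of an entry from the entity classes assigned
    to the nouns of its definition, e.g. ['LOCATION', 'NONE', 'LOCATION'].
    The winner must beat the runner-up by at least kappa occurrences."""
    counts = Counter(c for c in noun_classes if c != 'NONE')
    if not counts:
        return 'NONE'
    ranked = counts.most_common()
    if len(ranked) == 1:          # only one entity class present
        return ranked[0][0]
    (best, n1), (_, n2) = ranked[0], ranked[1]
    return best if n1 - n2 >= kappa else 'NONE'

# Illustrative counts (not the paper's): three LOCATION nouns against one
# ORGANIZATION noun; with kappa=2 the margin of 2 suffices.
print(classify_entry(['LOCATION', 'LOCATION', 'LOCATION', 'ORGANIZATION'],
                     kappa=2))   # LOCATION
```

Raising kappa makes the classifier more conservative: entries whose top two categories are close are left unassigned rather than forced into a class.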
Therefore, we are not interested in word classes. We consider a Wikipedia entry to be an instance when it has an associated entry in WordNet and that entry is an instance. The procedure to determine whether a WordNet entry is an instance or a word class is similar to the one used in (Magnini et al., 2002).</Paragraph> <Paragraph position="14"> The second heuristic, called is in wordnet, simply determines whether the entries from Wikipedia have an associated entry in WordNet. If so, we may use the information from WordNet to determine their category.</Paragraph> </Section> </Paper>