<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2809">
  <Title>A proposal to automatically build and maintain gazetteers for Named Entity Recognition by using Wikipedia</Title>
  <Section position="5" start_page="58" end_page="58" type="relat">
    <SectionTitle>
3 Experiments and results
</SectionTitle>
    <Paragraph position="0"> We tested our approach on 3517 randomly selected entries of the Simple English Wikipedia, which were manually tagged with the expected entity category. Table 1 shows the distribution by entity class. The Person and Location categories are balanced, but Organization is not: there are very few instances of this type. This is understandable, since an encyclopedia typically defines locations and people but rarely organizations.</Paragraph>
    <Paragraph position="1"> Following section 2, we evaluated the heuristics described there in two experiments. The first applies only the is instance heuristic. The second applies both heuristics from section 2 (is instance and is in wordnet). We do not report results without the first heuristic because, in our experiments, it increased both recall and precision for every entity category.</Paragraph>
    <Paragraph position="2"> For each experiment we considered two values of the constant Kappa used in our algorithm: 0 and 2. We found experimentally that these values give the highest recall and the highest precision, respectively. Table 2 shows the results of the first experiment and table 3 those of the second.</Paragraph>
    <Paragraph position="3"> As these tables show, the best recall for all classes is obtained in experiment 2 with Kappa 0 (table 3), while the best precision is obtained in experiment 1 with Kappa 2 (table 2).</Paragraph>
    <Paragraph position="4"> In our opinion, the results for the location and person categories are good enough for building and maintaining good-quality gazetteers after manual supervision. However, the results for the organization class are very low. This is mainly due to the high interaction between this category and location, combined with the near absence of traditional organization entities such as companies. This interaction is visible in the in-depth results presented below.</Paragraph>
    <Paragraph position="5"> To clarify these results, we present more in-depth data in tables 4 and 5. These tables give an error analysis, showing the false positives, false negatives, true positives and true negatives across all categories for the configuration with the highest recall (experiment 2 with Kappa 0) and for the one with the highest precision (experiment 1 with Kappa 2).</Paragraph>
    <Paragraph position="6"> Tables 4 and 5 show that interaction between classes (occurrences tagged as one class other than NONE and guessed as a different class other than NONE) is low. The only significant case is between location and organization. In table 5, 12 entities tagged as organization are classified as LOC, while 20 entities tagged as organization are guessed with the correct type. Conversely, 5 entities tagged as location were classified as organization. This is because countries and related entities such as &quot;European Union&quot; can be considered either organizations or locations, depending on their role in a text.</Paragraph>
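    <Paragraph position="7"> As an illustration of how the error analysis relates to precision and recall, the following minimal Python sketch derives true positives, false positives, false negatives and true negatives for one class from a confusion matrix. It is not the code used in the experiments. Only three cells come from the text above (12 organization entries guessed as LOC, 20 guessed correctly, and 5 location entries guessed as ORG); every other count, the label PER and the function names are hypothetical placeholders, so the printed figures are not results of this paper.

```python
# confusion[gold][guessed] = number of entries with that gold tag and that guess.
# Only ORG-LOC (12), ORG-ORG (20) and LOC-ORG (5) are taken from the text;
# all other counts are made-up placeholders for illustration.
confusion = {
    "LOC":  {"LOC": 100, "PER": 0,   "ORG": 5,  "NONE": 20},
    "PER":  {"LOC": 0,   "PER": 100, "ORG": 0,  "NONE": 20},
    "ORG":  {"LOC": 12,  "PER": 0,   "ORG": 20, "NONE": 10},
    "NONE": {"LOC": 10,  "PER": 10,  "ORG": 2,  "NONE": 500},
}


def error_counts(confusion, label):
    """Return (tp, fp, fn, tn) for one class of the confusion matrix."""
    total = sum(sum(row.values()) for row in confusion.values())
    tp = confusion[label][label]
    # Everything guessed as `label`, minus the correct guesses.
    fp = sum(confusion[gold][label] for gold in confusion) - tp
    # Everything tagged as `label`, minus the correct guesses.
    fn = sum(confusion[label].values()) - tp
    tn = total - tp - fp - fn
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for label in ("LOC", "PER", "ORG"):
    tp, fp, fn, tn = error_counts(confusion, label)
    p, r = precision_recall(tp, fp, fn)
    print(f"{label}: tp={tp} fp={fp} fn={fn} tn={tn} "
          f"precision={p:.2f} recall={r:.2f}")
```

In this sketch, the 12 organization entries guessed as LOC count as false negatives for ORG and as false positives for LOC, which is how a location/organization interaction lowers the scores of both classes.</Paragraph>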
  </Section>
</Paper>