<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0108">
  <Title>A confidence-based framework for disambiguating geographic terms</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Many questions about geographic term disambiguation are standardly handled in a statistical framework. For example, we can ask with what probability, in the absence of contextual information, the word Madison refers to a person (e.g. James Madison), an organization (e.g. Madison Guaranty Savings and Loan), or a place (e.g. Madison, Wisconsin). If no other disambiguation alternative exists, we can expect these three numbers to sum to 1 (i.e. to behave like probabilities).</Paragraph>
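As a minimal sketch of the normalization described above, the Python snippet below turns context-free counts for the three readings of Madison into prior probabilities that sum to 1. The counts and the function name `prior_probabilities` are hypothetical and chosen only for illustration; they are not taken from the paper's data.

```python
from typing import Dict


def prior_probabilities(counts: Dict[str, int]) -> Dict[str, float]:
    """Normalize raw sense counts into context-free prior probabilities."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one sense must have a nonzero count")
    return {sense: n / total for sense, n in counts.items()}


# Hypothetical counts for the word "Madison" (illustrative only).
madison_counts = {"person": 50, "organization": 10, "place": 40}
madison_priors = prior_probabilities(madison_counts)
```

Because the three readings are assumed to be exhaustive, dividing each count by the total is enough to make the values behave like probabilities.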
    <Paragraph position="1"> However, there are many other questions where a strictly probability-based framework is less appropriate.</Paragraph>
    <Paragraph position="2"> In particular, much of the information that could be used to disambiguate spatial references in natural language text is strongly non-local in character, and as we increase the amount of this background information, eventually we reach the point when the amount of training data per parameter is so low that there is no repeatable experiment to base probabilities on.</Paragraph>
    <Paragraph position="3"> In such cases, &quot;probabilities&quot; are effectively used as a stand-in for what is really our confidence in one judgment or another. In this paper we describe some of the methods used in a purely confidence-based geographic term disambiguation system that crucially relies on the notion of &quot;positive&quot; and &quot;negative&quot; context.</Paragraph>
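To make the idea of positive and negative context concrete, here is a toy Python sketch. It is not the system's actual scoring rule, which this introduction does not specify. It assumes a fixed additive step per cue word, and the cue lists, the step size, and the name `adjust_confidence` are all hypothetical.

```python
from typing import Iterable, Set


def adjust_confidence(base: float,
                      context_words: Iterable[str],
                      positive: Set[str],
                      negative: Set[str],
                      step: float = 0.2) -> float:
    """Raise confidence for each positive cue, lower it for each negative cue."""
    score = base
    for word in context_words:
        token = word.lower()
        if token in positive:
            score += step
        elif token in negative:
            score -= step
    # Clamp to [0, 1]; the result is a confidence, not a calibrated probability.
    return max(0.0, min(1.0, score))


# Hypothetical cue lists for the place reading of "Madison".
place_positive = {"wisconsin", "city", "mayor"}
place_negative = {"james", "president", "guaranty"}

# Confidence in the place reading under two different contexts.
conf_in_wisconsin_text = adjust_confidence(
    0.4, "the mayor of Madison , Wisconsin".split(),
    place_positive, place_negative)
conf_in_president_text = adjust_confidence(
    0.4, "President James Madison".split(),
    place_positive, place_negative)
```

The scores for competing readings are adjusted independently here, so they need not sum to 1. That is the practical difference between a confidence and a probability in this sketch.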
    <Paragraph position="4"> Far more information is contained in unstructured text (such as the Web and message traffic) than in structured databases, so automatically processing ambiguous geographic references unlocks a large amount of information.</Paragraph>
    <!-- Figure 1 (caption fragment): "... showing query results ranked and plotted on a map." -->
    <Paragraph position="5"> Adding spatial dimensions to document search systems requires new algorithms for determining the relevance of documents. We describe methods for combining confidence-based disambiguation with measures of relevance to a user's query.</Paragraph>
    <Paragraph position="6"> It has become clear after several decades of artificial intelligence research that automated general natural language understanding is not feasible yet. However, we have been able to make progress by restricting our effort to the well-defined domain of geographic concepts, using statistical methods on extremely large corpora. To cope with billions of documents, we have built fast algorithms for extracting and disambiguating geographic information and fast database algorithms specifically for information which has a spatial component.</Paragraph>
    <!-- Figure 2 (caption fragment): "... identify geographic areas that it is relevant to. This example shows the distribution of the word wine in Europe." -->
    <Paragraph position="7"> One form of information retrieval made possible by extracting geographic meaning in large corpora is geographic text search. Users are presented with an interface containing a traditional text search form combined with a map. They can zoom in on areas of the world that are of interest, and results of textual queries are plotted on the map (Figure 1). Other forms of data exploration are also made possible, such as exploring the spatial density pattern of documents satisfying a textual query (Figure 2). In Section 2 we explore challenges of finding geographic meaning in natural language texts and give examples of typical ambiguities. In Section 3 we introduce some of our methods for determining geographic meaning in natural language. In Section 4 we describe some of the methods of determining geographic meaning during real-time processing. In Section 5 we describe some of our training methods. In Section 6 we describe methods for combining confidence-based disambiguation with measures of relevance to a user's query.</Paragraph>
  </Section>
</Paper>