<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0108">
  <Title>A confidence-based framework for disambiguating geographic terms</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Challenges of finding geographic meaning in natural language text
</SectionTitle>
    <Paragraph position="0"> Like other references in natural language text, geographic references are often highly under-specified and ambiguous. To take an extreme example, when encountering a reference to Al Hamra, the task is to determine which of the 65 places in the world with that name is being referred to, or even whether a place is being referred to at all, for the phrase also means red in Arabic. The same applies to the more than two dozen US towns named Madison. In fact, the majority of references to places are ambiguous in this way.</Paragraph>
    <Paragraph position="1"> Human beings have a remarkable ability to derive useful information from ambiguous and under-specified references using real-world knowledge and experience, by deriving fuzzy rules from experience and knowing when to apply them. MetaCarta imitates this process using combinations of heuristics and data mining. For example, when encountering a mention of Al Hamra, a human analyst may notice that the rest of the document is focused on a region of Oman. Even if there is no mention of Oman itself, a mention of the nearby place Safil in the same document makes it likely that the Al Hamra in Oman is referred to. Even though there is another place named Safil in Iran, the towns of Safil and Al Hamra in Oman are close to each other, while there is no Al Hamra close to Safil, Iran.</Paragraph>
    <Paragraph position="2"> People also apply real-world knowledge gained in other contexts: they know, for example, that a reference to a place called Madison, in the absence of a state, is more likely to refer to Madison, Wisconsin than the smaller Madison, Iowa; and they know that James Madison and the Madison family do not refer to places at all. Similarly they know that Ishihara does not refer to a place, even though there is a Japanese town of that name, if a government minister named Ishihara is being mentioned. Moreover, much of the information people use to disambiguate references is not contained within the document itself, but is in the form of experience gained from reading many other documents. When encountering a name, people have various associations with the uses of this name they have seen before, and have a rough idea of how often it referred to places.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Methods for determining geographic meaning in natural language
</SectionTitle>
    <Paragraph position="0"> MetaCarta has been able to imitate many aspects of this common-sense process because of the well-defined, low-dimensional space of geographic concepts. We begin with a gazetteer containing several million name-point and name-region pairs, and the enclosure relationship between regions and points. A given name n may refer to several points or regions, or refer to a non-geographic concept. To deal with ambiguity, for every potential reference of a name n to a point p, we estimate c(p,n), the confidence that n really refers to p. The relevance of the document to each mentioned location must also be determined, in order to present the results that best satisfy the need for both correctness and relevance to a query, as described in Section 6.</Paragraph>
    <Paragraph position="1"> There are two main phases of processing involved in the extraction of geographic information: training on large corpora, and real-time processing of a document.</Paragraph>
    <Paragraph position="2"> In order to index large volumes of documents in a reasonable time, documents must be processed at a rate of at least a hundred documents per second on a single workstation. This constraint affects the choice of heuristics used. Some of the methods of determining geographic meaning during real-time processing are described in Section 4.</Paragraph>
    <Paragraph position="3"> The training phase requires some seed system capable of extracting the geographic information or, in the limiting case, some manually grounded documents. The quality of training depends on the quality of the seed, so as the system for the real-time processing of documents improves, we iterate the training process. Some details of the training process are described in Section 5.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Real-time processing of documents
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Identifying candidate places
</SectionTitle>
      <Paragraph position="0"> When processing a document, we begin by identifying potentially geographic references. For each, we identify all known candidates for the meaning of that reference.</Paragraph>
      <Paragraph position="1"> For example, a reference to 'Madison' can potentially mean any of 22 points with that name, or none of them.</Paragraph>
      <Paragraph position="2"> The main source of geographic references is the set of names in the high-quality MetaCarta gazetteer. See (Axelrod, 2003) for the process of building and updating this gazetteer. The procedure used to obtain realistic initial confidences associated with the gazetteer names is described in Section 5.1.</Paragraph>
      <Paragraph position="3"> We mention some of the alternative sources of potentially geographic references here. We can match US postal addresses and pass them to third-party geolocation software, which produces a coordinate for each address.</Paragraph>
      <Paragraph position="4"> Coordinates such as 38°01'10.5&quot;N 121°44'48.8&quot;W or 56.51°N 25.86°E are matched. We also match some JINTACCS (Department of the Army, 1990) message traffic formats, such as 163940N 1062920E (meaning 16°39'40&quot;N 106°29'20&quot;E).</Paragraph>
      <Paragraph position="5"> The matches are then assigned initial confidences, and disambiguated using local and non-local information within the document.</Paragraph>
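      <Paragraph position="6"> As an illustration of the compact coordinate format mentioned above, the following minimal sketch converts a string such as 163940N 1062920E into signed decimal degrees. The regular expression, the function names and the restriction to a two-digit latitude and three-digit longitude field are assumptions made for this example; they are not taken from the JINTACCS specification or from the MetaCarta system.

```python
import re

# Hypothetical pattern for the compact form DDMMSS[N|S] DDDMMSS[E|W],
# e.g. "163940N 1062920E"; real message formats may allow more variants.
COMPACT = re.compile(
    r"\b(\d{2})(\d{2})(\d{2})([NS])\s+(\d{3})(\d{2})(\d{2})([EW])\b"
)


def to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes, seconds and a hemisphere letter to signed decimal degrees."""
    value = int(degrees) + int(minutes) / 60 + int(seconds) / 3600
    # Southern and western hemispheres are negative by convention.
    return -value if hemisphere in "SW" else value


def parse_compact(text):
    """Return (latitude, longitude) for the first compact coordinate in text, or None."""
    match = COMPACT.search(text)
    if match is None:
        return None
    lat_d, lat_m, lat_s, lat_h, lon_d, lon_m, lon_s, lon_h = match.groups()
    return (
        to_decimal(lat_d, lat_m, lat_s, lat_h),
        to_decimal(lon_d, lon_m, lon_s, lon_h),
    )
```

      Calling parse_compact on the example string from the text yields a latitude of 16 + 39/60 + 40/3600 degrees north and a longitude of 106 + 29/60 + 20/3600 degrees east.</Paragraph>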
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Geographic disambiguation by local linguistic context
</SectionTitle>
      <Paragraph position="0"> Like other statistical NLP efforts, we use the local document context in which a potentially geographic name occurs. For example, the words city of or mayor of preceding a name like Madison, or the words community college following it, are strong positive indicators of the geographic nature of this name. At the same time, the words Mr., Dr., or a common first name preceding a potential city name, or the words will arrive following it, are strong indicators that the name in question is not geographic. We use a mixture of the data mining procedures described in Section 5.2 and domain knowledge repositories containing context strings such as first names to form the sets of contexts we use and to determine their strength as positive and negative indicators.</Paragraph>
      <Paragraph position="1"> Heuristics then adjust the confidence cgeo(n) that n refers to any geographic location (though not whether it refers to one of several locations sharing that name) according to the nature and strength of these indicators.</Paragraph>
      <Paragraph position="2"> Other local clues, such as the absence of upper-case letters in the name itself or the resemblance of the name to an acronym, have also proven useful in further adjusting the values of cgeo.</Paragraph>
      <Paragraph position="3"> The values of cgeo are then modified by non-local information as described below.</Paragraph>
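      <Paragraph position="4"> The local-context adjustment described above can be sketched as follows. This is only an illustration: the indicator tables, their weights, the additive update and the function names are all assumptions, since the text does not give the actual heuristics or the mined strengths.

```python
# Hypothetical indicator tables; the weights are illustrative, not mined values.
PRECEDING = {"city of": 0.3, "mayor of": 0.3, "mr.": -0.4, "dr.": -0.4}
FOLLOWING = {"community college": 0.3, "will arrive": -0.4}


def clamp(value):
    """Keep a confidence inside the closed interval [0, 1]."""
    return max(0.0, min(1.0, value))


def adjust_cgeo(initial, before, after):
    """Shift an initial confidence using the text just before and just after a name."""
    before = before.lower().strip()
    after = after.lower().strip()
    delta = 0.0
    # Positive weights mark geographic contexts, negative weights non-geographic ones.
    for phrase, weight in PRECEDING.items():
        if before.endswith(phrase):
            delta += weight
    for phrase, weight in FOLLOWING.items():
        if after.startswith(phrase):
            delta += weight
    return clamp(initial + delta)
```

      With these made-up weights, a name preceded by mayor of gains confidence, while a name preceded by Mr. and followed by will arrive is pushed down to zero.</Paragraph>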
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Geographic disambiguation by spatial patterns of geographic references in documents
</SectionTitle>
      <Paragraph position="0"> We have found that there is a high degree of spatial correlation in geographic references that are in textual proximity. This applies not only to points that are nearby, such as Madison and Milwaukee, but also to the situation when points are enclosed by regions, e.g. Madison and Wisconsin. This correlation between geographic and textual distance is considered in estimating the confidence that a name refers to a point.</Paragraph>
      <Paragraph position="1"> Some of our heuristics increase c(p,n) based on how many and which points (and enclosing regions) are mentioned in the same document as n and their proximity.</Paragraph>
      <Paragraph position="2"> We make use of the characteristics of the nearby locations, and weight their influence as a decreasing function of geographic relationships to p and textual relationships to n. c(p,n) is then increased by a saturating function of these influences.</Paragraph>
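      <Paragraph position="3"> One way to picture the spatial heuristic above is the following sketch. The exponential decay with geographic and textual distance, the scale constants, the saturating form 1 - exp(-x) and all function names are assumptions chosen for illustration; the text only states that influence decreases with distance and that the increase saturates.

```python
import math


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points in degrees (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def boosted_confidence(c, candidate, neighbours, geo_scale=100.0, text_scale=500.0):
    """Raise confidence c for a candidate point given other mentions in the document.

    neighbours is a list of ((lat, lon), character_offset_from_the_name) pairs.
    The decay and saturation functions are assumptions, not the published method.
    """
    influence = sum(
        math.exp(-distance_km(candidate, point) / geo_scale)
        * math.exp(-abs(offset) / text_scale)
        for point, offset in neighbours
    )
    # Saturating update: the result approaches 1 but never exceeds it.
    return c + (1.0 - c) * (1.0 - math.exp(-influence))
```

      In this sketch a mention of a nearby place close by in the text raises c(p,n) noticeably, a distant place has almost no effect, and a document with no other geographic references leaves the confidence unchanged.</Paragraph>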
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Domain knowledge: population heuristics
</SectionTitle>
      <Paragraph position="0"> Population data in the gazetteer is also used.</Paragraph>
      <Paragraph position="1"> A place with a high population is more likely to be mentioned than a place with a lower one. Thus when disambiguating multiple referents with the same name, the population of each is considered. The confidence of a place p is decreased by an amount proportional to the magnitude of the logarithm of the ratio of the population of p to the total population of all places with the name n, so that places holding a small share of that total are penalized most.</Paragraph>
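      <Paragraph position="2"> The population heuristic can be written down as a short sketch. It reads the rule above as a penalty that grows as the share of p in the total population of its namesakes shrinks; the constant k, the function name and the population figures used in the usage note are hypothetical.

```python
import math


def population_penalty(populations, place, k=0.05):
    """Amount subtracted from a place's confidence for one shared name.

    populations maps each candidate place to its population; k is a
    hypothetical proportionality constant.
    """
    total = sum(populations.values())
    # total / population is at least 1, so the penalty is never negative.
    return k * math.log(total / populations[place])
```

      For two invented namesakes with populations 900 and 100, the larger place receives a penalty of about 0.005 and the smaller one about 0.115, so the larger place keeps the higher confidence.</Paragraph>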
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.5 Relative references
</SectionTitle>
      <Paragraph position="0"> So far we have discussed the processing of stand-alone geographic references. We also process relative geographic references such as 15 miles northeast of Portland. This relative reference is resolved in accordance with the disambiguation of its anchor reference, Portland. If we decide that Portland refers to Portland, Oregon with confidence c, then we assume that 15 miles northeast of Portland refers to the point 15 miles northeast of Portland, Oregon with confidence f(c), where f(c) is greater than c, since the presence of a well-defined relative reference serves as an additional linguistic clue.</Paragraph>
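      <Paragraph position="1"> A relative reference of this kind can be resolved with standard spherical geometry, as in the sketch below. The destination-point formula is the usual one for a sphere; the bearing table, the Earth radius constant and in particular the form of f(c) are assumptions, since the text only requires that f(c) be greater than c.

```python
import math

EARTH_RADIUS_MILES = 3958.8
# Compass directions mapped to bearings in degrees clockwise from north.
BEARINGS = {"north": 0.0, "northeast": 45.0, "east": 90.0, "southeast": 135.0,
            "south": 180.0, "southwest": 225.0, "west": 270.0, "northwest": 315.0}


def offset_point(lat, lon, miles, direction):
    """Point reached by travelling the given miles from (lat, lon) along a compass direction."""
    bearing = math.radians(BEARINGS[direction])
    d = miles / EARTH_RADIUS_MILES  # angular distance in radians
    lat1, lon1 = math.radians(lat), math.radians(lon)
    lat2 = math.asin(math.sin(lat1) * math.cos(d)
                     + math.cos(lat1) * math.sin(d) * math.cos(bearing))
    lon2 = lon1 + math.atan2(math.sin(bearing) * math.sin(d) * math.cos(lat1),
                             math.cos(d) - math.sin(lat1) * math.sin(lat2))
    return math.degrees(lat2), math.degrees(lon2)


def relative_confidence(c, gain=0.2):
    """Hypothetical f(c): moves c part of the way toward 1, so f(c) > c whenever c is below 1."""
    return c + gain * (1.0 - c)
```

      Taking roughly (45.52, -122.68) as the location of Portland, Oregon, offset_point(45.52, -122.68, 15, "northeast") returns a point slightly to the north and east of the city, and relative_confidence maps an anchor confidence of 0.6 to 0.68.</Paragraph>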
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.6 Temporal information
</SectionTitle>
      <Paragraph position="0"> While not strictly a geographic issue, we mention here that the system also extracts temporal information from natural language documents. Currently we recognize military date/time group Zulu formats (Com-</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Training
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Determining the geographic significance of gazetteer names
</SectionTitle>
      <Paragraph position="0"> The methods for disambiguating geographic terms described above can also be exploited at the level of the corpus, despite the fact that the data used for training are untagged and therefore noisy. Since the real-time document processing system is high throughput, it can be applied to a training corpus consisting of a few hundred million documents.</Paragraph>
      <Paragraph position="1"> If a name n is often given a high confidence of referring to a point p, then n is likely to refer to p even in the absence of other evidence in the document. Thus, each name-point pair (n, p) is given an initial confidence equal to the average confidence assigned to its instances in the training corpus.</Paragraph>
      <Paragraph position="2"> This initial confidence is then used as a starting point and modified by the other heuristics described above to obtain confidence for a name instance in a specific document during real-time document processing. Thus the training process is iterative.</Paragraph>
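      <Paragraph position="3"> The averaging step described above amounts to a simple aggregation over the output of the seed system. The sketch below assumes that each processed instance is available as a (name, point, confidence) triple; that record layout and the function name are assumptions made for the example.

```python
from collections import defaultdict


def initial_confidences(observations):
    """Average the confidences seen for each (name, point) pair.

    observations is an iterable of (name, point, confidence) triples, as a
    seed system might emit while processing a training corpus.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for name, point, confidence in observations:
        totals[(name, point)] += confidence
        counts[(name, point)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}
```

      The resulting table would then serve as the starting point for the next round of real-time processing, which in turn yields new observations for the next training iteration.</Paragraph>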
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Data mining of geographically significant local
linguistic contexts
</SectionTitle>
      <Paragraph position="0"> We currently use data mining on tagged corpora to learn the contexts in which geographic and non-geographic references occur, that is, the words and phrases leading up to and trailing the name n. The tagged corpora were obtained using the Alembic tagger (Day et al., 1997). The accumulated statistics allow us to determine whether a specific context is a positive or negative indicator of a term being geographic, and the strength of this particular indicator. For any context C, an adjustment is applied to the confidence which is a nonlinear function of the probability of a geographic reference occurring in C in the tagged corpus.</Paragraph>
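      <Paragraph position="1"> The context statistics described above can be illustrated with a small sketch. The input layout of (context, is_geographic) pairs, the smoothing term and the scaled log-odds used as the nonlinear function are all assumptions; the text does not specify which nonlinear function is applied.

```python
import math
from collections import Counter


def context_probabilities(tagged_instances, smoothing=1.0):
    """Estimate P(geographic | context) from (context, is_geographic) pairs.

    Additive smoothing is an assumption that keeps rare contexts away from 0 and 1.
    """
    geo, total = Counter(), Counter()
    for context, is_geographic in tagged_instances:
        total[context] += 1
        if is_geographic:
            geo[context] += 1
    return {c: (geo[c] + smoothing) / (total[c] + 2 * smoothing) for c in total}


def context_adjustment(probability, scale=0.1):
    """Hypothetical nonlinear adjustment: scaled log-odds, positive above 0.5, negative below."""
    return scale * math.log(probability / (1.0 - probability))
```

      A context that mostly precedes geographic names, such as city of, receives a positive adjustment under this scheme, while a context such as Mr. receives a negative one of similar size.</Paragraph>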
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Relevance
</SectionTitle>
    <Paragraph position="0"> The addition of geographic dimensions to information retrieval means that in addition to the relevance of documents to a textual query, the relevance to the places mentioned in those documents must also be considered in order to rank the documents. The two kinds of relevance, traditional textual query relevance Rw and georelevance Rg, must be properly balanced to return documents relevant to a user's query. The traditional textual query relevance is obtained using standard techniques (Robertson and Jones, 1997).</Paragraph>
    <Paragraph position="1"> Georelevance is based on both the geographic confidence of the place names used to place the document on the map, and the emphasis of the place name in the document. Emphasis is affected by the position Pn of the name in the document, and the prominence Bn. The latter is a function of whether it is in the title or header, whether it is emphasized or rendered in a large font, and other clues related to the nature and formatting of a document.</Paragraph>
    <Paragraph position="2"> This is similar to term relevance heuristics in information retrieval (Robertson and Jones, 1997), but the pattern of emphasis of geographic references is somewhat different.</Paragraph>
    <Paragraph position="3"> The emphasis component that depends on in-document position is assigned by a function that differs somewhat from those usually used. It decreases from a maximum at the beginning of the document to a low value near the end of a long document, but increases near the bottom of the document to account for the increased relevance of information in footers. The frequency of the name Fn in the document is considered in a similar way to standard information retrieval techniques (Robertson and Jones, 1997).</Paragraph>
    <Paragraph position="4"> Emphasis is also a function of the number of other geographic references S in the document. This is based on the assumption that a document does not have an unlimited amount of relevance to &quot;spend&quot; on places. Thus, a place mentioned in a document with many others is likely to be less relevant. Once emphasis E(Pn,Bn,Fn,S) is calculated, it is multiplied by geoconfidence Cg to obtain the georelevance Rg.</Paragraph>
    <Paragraph position="5"> We also compute a georelevance-like function for each location that could be referenced by a document. It varies as a function of character position in the document and is independent of geoconfidence.</Paragraph>
    <Paragraph position="6"> Finally, the textual query relevance and georelevance are balanced as follows. The more terms m the user's query contains, the higher the weight Ww we assign to the term component of the query; however, Ww is a function that saturates at a maximal weight M (.5 &lt; M &lt; 1).</Paragraph>
    <Paragraph position="7"> The term relevance weight is defined as Ww(m) = .5 + ((m - 1)/m)(M - .5). Georelevance Rg and term relevance Rw are then combined as (1 - Ww(m))Rg + Ww(m)Rw.</Paragraph>
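    <Paragraph position="8"> The balancing rule can be checked numerically with the short sketch below, which takes the weight as Ww(m) = .5 + ((m - 1)/m)(M - .5) and the combined score as (1 - Ww(m))Rg + Ww(m)Rw. The value M = 0.8 is an arbitrary choice inside the stated range, and the function names are introduced only for this example.

```python
def term_weight(m, max_weight=0.8):
    """Ww(m) = 0.5 + ((m - 1) / m) * (M - 0.5) for a query with m terms (m of at least 1)."""
    return 0.5 + (m - 1) / m * (max_weight - 0.5)


def georelevance(emphasis, geoconfidence):
    """Rg is the emphasis E(Pn, Bn, Fn, S) multiplied by the geoconfidence Cg."""
    return emphasis * geoconfidence


def combined_relevance(m, rg, rw, max_weight=0.8):
    """Blend georelevance Rg and textual relevance Rw using the weight Ww(m)."""
    ww = term_weight(m, max_weight)
    return (1.0 - ww) * rg + ww * rw
```

    With a single query term the two kinds of relevance are weighted equally, since Ww(1) = .5; as the number of terms grows, Ww(m) approaches M and the textual relevance dominates without ever fully displacing georelevance.</Paragraph>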
  </Section>
</Paper>