<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0107">
  <Title>Bootstrapping toponym classifiers</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> Dividing the corpora into training and test data, we train Naive Bayes classifiers on all examples of disambiguated toponyms in the training set. Although it is not uncommon for two places in the same state, for example, to share a name, we define disambiguation for purposes of these experiments as finding the correct U.S. state or foreign country. This asymmetry is reflected in the U.S. news and historical text of the training data, where toponyms are specified by U.S. states or by foreign countries. We then run the classifiers on the test text after removing disambiguating labels, such as state or country names that immediately follow the city name.</Paragraph>
    <Paragraph position="1"> Since not all toponyms in the test set will have been seen in training, we also train backoff classifiers to guess the states and countries related to a story. If, for example, we cannot find a classifier for &quot;Oxford&quot;, but can tell that a story is about Mississippi, we will still be able to disambiguate. We use a gazetteer to restrict the set of candidate states and countries for a given place name. In trying to disambiguate &quot;Portland&quot;, we would thus consider Oregon, Maine, and England, among other options, but not Maryland. As in the word sense disambiguation task as usually defined, we are classifying names and not clustering them. This approach is practical for geographic names, for which broad-coverage gazetteers exist, though less so for personal names (Mann and Yarowsky, 2003). System performance is measured with reference to the naive baseline where each ambiguous toponym is guessed to be the most commonly occurring place. London, England, would thus always be guessed rather than London, Ontario. Bootstrapping methods similar to ours have been shown to be competitive in word sense disambiguation (Yarowsky and Florian, 2003; Yarowsky, 1995).</Paragraph>
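    <Paragraph position="2"> The setup above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system used in the experiments: the toy gazetteer, the training examples, the bag-of-context-words features, the add-one smoothing, and the names ToponymNB, disambiguate, and baseline are all hypothetical. One classifier is trained per toponym; a backoff classifier trained on whole stories guesses the state or country when a toponym was not seen in training; the gazetteer restricts the candidates in both cases; and the baseline always returns the referent seen most often in training.</Paragraph>
    <Paragraph position="3"><![CDATA[
```python
import math
from collections import Counter, defaultdict

# Hypothetical toy gazetteer: toponym -> candidate states and countries.
GAZETTEER = {
    "Portland": {"Oregon", "Maine", "England"},
    "Oxford": {"Mississippi", "England"},
}


class ToponymNB:
    """Naive Bayes over context words; add-one smoothing is an assumption."""

    def __init__(self):
        self.label_counts = Counter()            # label -> number of examples
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    def train(self, examples):
        for words, label in examples:
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def log_score(self, words, label):
        prior = self.label_counts[label] / sum(self.label_counts.values())
        # One extra slot in the denominator stands for unseen words.
        denom = sum(self.word_counts[label].values()) + len(self.vocab) + 1
        return math.log(prior) + sum(
            math.log((self.word_counts[label][w] + 1) / denom) for w in words
        )

    def classify(self, words, candidates):
        # Only gazetteer candidates that occurred in training can be scored.
        seen = sorted(c for c in candidates if self.label_counts[c] > 0)
        if not seen:
            return None
        return max(seen, key=lambda c: self.log_score(words, c))

    def most_common(self, candidates):
        seen = sorted(c for c in candidates if self.label_counts[c] > 0)
        return max(seen, key=lambda c: self.label_counts[c]) if seen else None


def disambiguate(name, words, per_name, backoff):
    """Use the toponym's own classifier if one exists, else the backoff."""
    candidates = GAZETTEER.get(name, set())
    return per_name.get(name, backoff).classify(words, candidates)


def baseline(name, per_name):
    """Naive baseline: always guess the most frequent training referent."""
    clf = per_name.get(name)
    return clf.most_common(GAZETTEER.get(name, set())) if clf else None


# Invented training examples: (context words, state or country).
portland = ToponymNB()
portland.train([
    (["trail", "blazers", "columbia"], "Oregon"),
    (["rain", "columbia"], "Oregon"),
    (["lobster", "harbor"], "Maine"),
])
backoff = ToponymNB()
backoff.train([
    (["ole", "miss", "delta"], "Mississippi"),
    (["thames", "university"], "England"),
    (["lobster"], "Maine"),
])
per_name = {"Portland": portland}

print(disambiguate("Portland", ["lobster", "boats"], per_name, backoff))
print(disambiguate("Oxford", ["delta", "blues"], per_name, backoff))
print(baseline("Portland", per_name))
```
]]></Paragraph>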
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Difficulty of the Task
</SectionTitle>
    <Paragraph position="0"> Our ability to disambiguate place names should be weighed against the ease or difficulty of the task. In a world where most toponyms referred unambiguously to one place, we would not be impressed by near-perfect performance.</Paragraph>
    <Paragraph position="1"> Before considering how toponyms are used in text, we can examine the inherent ambiguity of place names in isolation. The Getty Thesaurus of Geographic Names, with over a million toponyms, not only synthesizes many contemporary gazetteers but also contains a wealth of historical names. In table 2, we summarize for each continent the proportion of places that have multiple names and of names that can refer to more than one place. Although these proportions are dependent on the names and places selected for inclusion in this gazetteer, the relative rankings are suggestive. In areas with more copious historical records--such as Asia, Africa, and Europe--a place may be called by many names over time, but individual names are often distinct. With the increasing tempo of settlement in modern times, however, many places may be called by the same name, particularly by nostalgic colonists in the New World. Other ambiguities arise when people and places share names. [Footnote 1: Our annotated data also includes disambiguated texts of Herodotus' Histories and Caesar's Gallic War, but toponyms in the ancient (especially Greek) world do not show enough ambiguity with personal names or with each other to be interesting.] [Table 2 caption fragment: ... to more than one place in the Getty Thesaurus of Geographic Names]</Paragraph>
    <Paragraph position="2"> Very few Greek and Latin place names are also personal names.2 This is less true of Britain, where surnames (and surnames used as given names) are often taken from place names; in America, the confusion grows as numerous towns are named after prominent or obscure people. What may be called a lack of imagination in the many 41 Oxfords, 73 Springfields, 91 Washingtons, and 97 Georgetowns seems to plague the very area -- North America -- covered by our corpora.</Paragraph>
    <Paragraph position="3"> If, however, one Washington or Portland predominates in actual usage, things are not as bad as they seem. At the very worst, for a baseline system, one can always guess the predominant referent. We quantify the level of uncertainty in our corpora using entropy and average conditional entropy. As stated above, we have simplified the disambiguation problem to finding the state or country to which a place belongs. For our training corpora, we can thus measure the entropy of the classification and the average conditional entropy of the classification given the specific place name (table 3). These entropies were calculated using unsmoothed relative frequencies. The conditional entropy, not surprisingly, is fairly low, given that the percentage of toponyms that refer to more than one place in the training data is quite low. Since training data do not perfectly predict test data, however, we have to smooth these probabilities, and entropy goes up.</Paragraph>
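    <Paragraph position="4"> The two quantities can be illustrated with a short Python sketch. It assumes the standard definitions, with the entropy of the classification taken as the negative sum over states and countries c of p(c) log p(c), and the average conditional entropy taken as the sum over place names n of p(n) times the entropy of the classification among occurrences of n, both estimated from unsmoothed relative frequencies. The logarithm base (2, giving bits), the function names, and the toy data are assumptions made for illustration and are not taken from the experiments.</Paragraph>
    <Paragraph position="5"><![CDATA[
```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Entropy in bits of the unsmoothed relative frequencies in `counts`."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


def classification_entropy(pairs):
    """H(C): uncertainty about the state or country, ignoring the name."""
    return entropy(Counter(label for _, label in pairs))


def conditional_entropy(pairs):
    """H(C | N): average uncertainty about the label once the name is known."""
    by_name = defaultdict(Counter)
    for name, label in pairs:
        by_name[name][label] += 1
    total = len(pairs)
    return sum(
        (sum(labels.values()) / total) * entropy(labels)
        for labels in by_name.values()
    )


# Invented (place name, state or country) occurrences.
pairs = [
    ("Portland", "Oregon"),
    ("Portland", "Maine"),
    ("Boston", "Massachusetts"),
    ("Boston", "Massachusetts"),
]

print(classification_entropy(pairs))
print(conditional_entropy(pairs))
```
]]></Paragraph>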
  </Section>
</Paper>