<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1046"> <Title>Disambiguating Toponyms in News</Title>
<Section position="3" start_page="363" end_page="364" type="metho"> <SectionTitle> 2 Quantifying Toponym Ambiguity </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="363" end_page="363" type="sub_section"> <SectionTitle> 2.1 Data </SectionTitle>
<Paragraph position="0"> We used a month's worth of articles from the New York Times (September 2001), part of the English Gigaword (LDC 2003). This corpus consisted of 7,739 documents and, after SGML stripping, 6.51 million word tokens (36.4 MB in total).</Paragraph>
<Paragraph position="1"> We tagged the corpus using a list of place names from the USGS Concise Gazetteer (GNIS). The resulting corpus is called MAC1, for &quot;Machine Annotated Corpus 1&quot;. GNIS covers cities, states, and counties in the U.S., which are classified as &quot;civil&quot; and &quot;populated place&quot; geographical entities. A geographical entity is an entity on the Earth's surface that can be represented by some geometric specification in a GIS, for example as a point, line, or polygon. GNIS also covers 53 other types of geo-entities, e.g., &quot;valley,&quot; &quot;summit,&quot; &quot;water,&quot; and &quot;park.&quot; GNIS has 37,479 entries, with 27,649 distinct toponyms, of which 13,860 had multiple entries in GNIS (i.e., were ambiguous according to GNIS). Table 1 shows the entries in GNIS for an ambiguous toponym.</Paragraph> </Section>
<Section position="2" start_page="363" end_page="364" type="sub_section"> <SectionTitle> 2.2 Analysis </SectionTitle>
<Paragraph position="0"> Let E be a set of elements, and let F be a set of features. We define a feature g in F to be a disambiguator for E iff g is single-valued and, for every pair of distinct elements e1, e2 in E, g(e1) ≠ g(e2). As an example, consider the GNIS entries in Table 1 and let F = {U.S. County, U.S. State, Lat-Long, Elevation}. We can see that each feature in F is a disambiguator for the set of entries in Table 1.</Paragraph>
<Paragraph position="1"> Let us now characterize the mapping between texts and gazetteers. A string s1 in a text is said to be a discriminator within a window w for another string s2 no more than w words away if s1 matches a disambiguator d for s2 in a gazetteer. For example, &quot;MT&quot; is a discriminator within a window of 5 for the toponym &quot;Acton&quot; in &quot;Acton, MT,&quot; since &quot;MT&quot; occurs within a ±5-word window of &quot;Acton&quot; and matches, via an abbreviation, &quot;Montana&quot;, the value of the GNIS disambiguator U.S. State (here the tokenized words are &quot;Acton&quot;, &quot;,&quot;, and &quot;MT&quot;).</Paragraph>
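To make the disambiguator and discriminator definitions concrete, here is a minimal Python sketch. It assumes a toy abbreviation table and toy gazetteer entries rather than GNIS itself, and the function and feature names are illustrative, not the paper's LexScan implementation.

# Minimal sketch of the "disambiguator" and "discriminator" notions above.
# The abbreviation table and the gazetteer entries are toy stand-ins.

ABBREV = {"MT": "montana", "MA": "massachusetts", "CA": "california"}

def normalize(token):
    token = token.strip(".,;:")
    return ABBREV.get(token, token).lower()

def is_disambiguator(entries, feature):
    """g is a disambiguator for E iff g is single-valued and assigns
    distinct values to every pair of distinct entries."""
    values = [e.get(feature) for e in entries]
    if any(v is None for v in values):
        return False
    return len(set(values)) == len(values)

def has_local_discriminator(tokens, i, entries, features, window=5):
    """True if some token within ±window of the toponym at position i matches
    (after abbreviation expansion) a value of a disambiguating feature."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    context = {normalize(t) for t in tokens[lo:hi]}
    for g in features:
        if is_disambiguator(entries, g):
            if any(normalize(str(e[g])) in context for e in entries):
                return True
    return False

# "Acton , MT": "MT" expands to Montana, the U.S. State value of one entry.
acton = [
    {"state": "Montana", "county": "Yellowstone"},
    {"state": "Massachusetts", "county": "Middlesex"},
    {"state": "California", "county": "Los Angeles"},
]
print(has_local_discriminator(["Acton", ",", "MT"], 0, acton, ["state", "county"]))  # True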
<Paragraph position="2"> A trie-based lexical lookup tool (called LexScan) was used to match each toponym in GNIS against the corpus MAC1. Of the 27,649 distinct toponyms in GNIS, only 4,553 were found in the corpus (note that GNIS has only U.S. toponyms). Of these 4,553 toponyms, 2,911 (63.94%) were &quot;bare&quot; toponyms, lacking a local discriminator within a ±5-word window that could resolve the name.</Paragraph>
<Paragraph position="3"> Of the 13,860 toponyms that were ambiguous according to GNIS, 1,827 were found in MAC1, of which only 588 had discriminators within a ±5-word window (i.e., discriminators which matched gazetteer features that disambiguated the toponym). Thus, 67.82% of the 1,827 toponyms found in MAC1 that were ambiguous in GNIS lacked a discriminator.</Paragraph>
<Paragraph position="4"> This 67.82% proportion is only an estimate of true toponym ambiguity, even for the sample MAC1. There are several sources of error in this estimate: (i) World cities, capitals, and countries were not yet considered, since GNIS only covered U.S. toponyms. (ii) In general, a single feature (e.g., County or State) may not be sufficient to disambiguate a set of entries. It is of course possible for two different places named by a common toponym to be located in the same county of the same state; however, there were no toponyms with this property in GNIS. (iii) A string in MAC1 tagged by GNIS lexical lookup as a toponym may not have been a place name at all (e.g., &quot;Lord Acton lived ...&quot;). Of the toponyms that were spurious, most were judged by us to be common words and person names. This should not be surprising, as 5,341 toponyms in GNIS are also person names according to the U.S. Census Bureau (www.census.gov/genealogy/www/freqnames.html). (iv) LexScan was not perfect, for the following reasons. First, it sought only exact matches. Second, the matching relied on expansion of standard abbreviations; due to non-standard abbreviations, the number of true U.S. toponyms in the corpus likely exceeded 4,553. Third, the matches were all case-sensitive: while case-insensitivity caused numerous spurious matches, case-sensitivity missed a more predictable set, i.e., all-caps dateline toponyms or lower-case toponyms in Internet addresses.</Paragraph>
<Paragraph position="5"> Note that the 67.82% proportion is just an estimate of local ambiguity. Of course, there are often non-local discriminators (outside the ±5-word windows); for example, an initial place name reference could have a local discriminator, with subsequent references in the article lacking local discriminators while being coreferential with the initial reference. To estimate this, we selected cases where a toponym was discriminated on its first mention. In those cases, we counted the number of times the toponym was repeated in the same document without the discriminator. We found that 73% of the repetitions lacked a local discriminator, suggesting an important role for coreference (see Sections 4 and 5).</Paragraph> </Section> </Section>
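Both estimates above (the 67.82% and the 73% figures) reduce to simple bookkeeping over tagged mentions. The sketch below is only an illustration; it assumes hypothetical per-mention tuples in document order rather than the paper's actual data structures.

# Sketch of the two Section 2.2 estimates over hypothetical mention records
# (doc_id, toponym, ambiguous_in_gazetteer, has_local_discriminator),
# listed in document order. The field layout is an assumption.

def ambiguity_estimates(mentions):
    ambiguous = [m for m in mentions if m[2]]
    bare = [m for m in ambiguous if not m[3]]
    pct_bare = 100.0 * len(bare) / len(ambiguous) if ambiguous else 0.0

    # Among repetitions of a toponym whose first mention in the same document
    # was locally discriminated, how many lack a local discriminator?
    first_was_discriminated = {}
    repeats = bare_repeats = 0
    for doc_id, name, _, discriminated in mentions:
        key = (doc_id, name)
        if key not in first_was_discriminated:
            first_was_discriminated[key] = discriminated
        elif first_was_discriminated[key]:
            repeats += 1
            if not discriminated:
                bare_repeats += 1
    pct_bare_repeats = 100.0 * bare_repeats / repeats if repeats else 0.0
    return pct_bare, pct_bare_repeats

mentions = [
    ("d1", "Acton", True, True),    # "Acton, MT": discriminated first mention
    ("d1", "Acton", True, False),   # later repetition with no local discriminator
]
print(ambiguity_estimates(mentions))  # (50.0, 100.0)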
<Section position="4" start_page="364" end_page="366" type="metho"> <SectionTitle> 3 Knowledge Sources for Automatic Disambiguation </SectionTitle>
<Paragraph position="0"> To prepare a toponym disambiguator, we required a gazetteer as well as corpora for training and testing it.</Paragraph>
<Section position="1" start_page="364" end_page="364" type="sub_section"> <SectionTitle> 3.1 Gazetteer </SectionTitle>
<Paragraph position="0"> To obtain a gazetteer that covered worldwide information, we harvested countries, country capitals, and populous world cities from two websites (ATLAS and GAZ) to form a consolidated gazetteer (WAG) with four features G1,...,G4 based on geographical inclusion, and three classes, as shown in Table 2. As an example, an entry for Aberdeen could be the following feature vector: G1=United States, G2=Maryland, G3=Harford County, G4=Aberdeen, CLASS=ppl.</Paragraph>
<Paragraph position="1"> We now briefly discuss the merging of ATLAS and GAZ to produce WAG. ATLAS provided a simple list of countries and their capitals. GAZ recorded the country as well as the population of 700 cities of at least 500,000 people. If a city was in both sources, we allowed two entries but ordered them in WAG to make the more specific type (e.g., &quot;capital&quot;) the default sense, the one that LexScan would use. Accents and diacritics were stripped from WAG toponyms by hand, and aliases were associated with standard forms. Finally, we merged GNIS state names with these, as well as abbreviations discovered by our abbreviation expander.</Paragraph> </Section>
<Section position="2" start_page="364" end_page="366" type="sub_section"> <SectionTitle> 3.2 Corpora </SectionTitle>
<Paragraph position="0"> We selected a corpus consisting of 15,587 articles from the complete Gigaword Agence France Presse, May 2002. LexScan was used to tag, case-insensitively, all WAG toponyms found in this corpus with the attributes in Table 2. If there were multiple entries in WAG for a toponym, LexScan tagged only the preferred sense, discussed below. The resulting tagged corpus, called MAC-DEV, was used as a development corpus for feature exploration. To disambiguate the sense of a toponym that was ambiguous in WAG, we used two preference heuristics. First, we searched MAC1 for two dozen highly frequent ambiguous toponym strings (e.g., &quot;Washington&quot;) and observed by inspection which sense predominated in MAC1, preferring the predominant sense for each of these frequently mentioned toponyms. For example, in MAC1, &quot;Washington&quot; was predominantly a Capital. Second, for toponyms outside this most frequent set, we used the following specificity-based preference: Cap. > Ppl > Civil. In other words, we prefer the more specific sense; since there are fewer Capitals than Populated Places, we prefer Capitals to Populated Places.</Paragraph>
<Paragraph position="1"> For machine learning, we used the Gigaword and the June 2002 New York Times from the English Gigaword, with the first author tagging a random 28, 88, and 49 documents, respectively, from each. Each tag in the resulting human-annotated corpus (HAC) carried the WAG attributes from Table 2, all of which were manually corrected. A summary of the corpora, their sources, and annotation status is shown in Table 3.</Paragraph> </Section> </Section>
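A minimal sketch of the sense-preference heuristics described in Section 3.2 follows. The entry format and the predominant-sense table are assumptions standing in for WAG and for the two dozen frequent toponyms inspected in MAC1; this is an illustration, not the paper's LexScan implementation.

# Sketch of the Section 3.2 preference heuristics with toy data.

SPECIFICITY = {"capital": 0, "ppl": 1, "civil": 2}   # prefer the more specific class
PREDOMINANT = {"washington": "capital"}              # toy predominant-sense table

def preferred_entry(toponym, wag_entries):
    """Pick the single WAG entry whose sense LexScan would tag."""
    name = toponym.lower()
    if name in PREDOMINANT:
        for entry in wag_entries:
            if entry["class"] == PREDOMINANT[name]:
                return entry
    # Fall back to the specificity preference Cap. > Ppl > Civil.
    return min(wag_entries, key=lambda e: SPECIFICITY[e["class"]])

entries = [{"name": "Washington", "class": "civil"},
           {"name": "Washington", "class": "capital"}]
print(preferred_entry("Washington", entries)["class"])  # capital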
<Section position="6" start_page="366" end_page="367" type="metho"> <SectionTitle> 4 Feature Exploration </SectionTitle>
<Paragraph position="0"> We used the tagged toponyms in MAC-DEV to explore useful features for disambiguating the classes of toponyms. We identified single-word terms that co-occurred significantly with classes within a k-word window (we tried k = ±3 and k = ±20). These terms were scored for pointwise mutual information (MI) with the classes. Terms with average tf.idf of less than 4 in the collection were filtered out, as these tended to be personal pronouns, articles, and prepositions.</Paragraph>
<Paragraph position="1"> To identify which terms helped select for particular classes of toponyms, the 48 terms whose MI scores were above a threshold (-11, chosen by inspection) were filtered using Student's t-statistic, based on an idea in (Church and Hanks 1991). The t-statistic was used to compare the distribution of a term with one class of toponym to its distribution with the other classes, to assess whether the underlying distributions were significantly different with at least 95% confidence. The results are shown in Table 4, where scores for a term that occurred jointly in a window with at least one other class label are shown in bold. A t-score above 1.645 indicates a significant difference with 95% confidence. However, because joint evidence was scarce, we eventually chose not to eliminate Table 4 terms such as 'city' (t = 1.19) as features for machine learning. Some of the terms were significant disambiguators between only one pair of classes, e.g., 'yen,' 'attack,' and 'capital,' but we kept them on that basis.</Paragraph>
<Paragraph position="2"> Based on the discovered terms in experiments with different window sizes, and an examination of MAC1 and MAC-DEV, we identified a final set of features that seemed potentially useful for machine learning experiments. These are shown in Table 5. [Table 5. Features for Machine Learning (excerpt). TagDiscourse: the set of classes tagged by all toponyms in the document, e.g. {civil, capital, ppl}. CorefClass: the CLASS, if any, of a prior mention of the toponym in the document, or none.] The features Abbrev and All-caps describe evidence internal to the toponym: an abbreviation may indicate a state (Mass.), territory (N.S.W.), country (U.K.), or some other civil place; an all-caps toponym might be a capital or ppl in a dateline. The feature sets LeftPos and RightPos target the ±k positions in each window as ordered tokens, but note that only windows containing an MI term are considered. The domain of WkContext is the window of ±k tokens around a toponym that contains an MI-collocated term.</Paragraph>
<Paragraph position="3"> We now turn to the global discourse-level features. The domain of TagDiscourse is the whole document, which is evaluated for the set of toponym classes present: this information may reflect the discourse topic, e.g., a discussion of U.S. sports teams will favor mentions of cities over states or capitals. The feature CorefClass implements a one-sense-per-discourse strategy, motivated by our earlier observation (from Section 2) that 73% of subsequent mentions of a toponym that was discriminated on first mention were expressed without a local discriminator.</Paragraph> </Section>
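As a rough illustration of how a single mention might be coded with these features, here is a Python sketch. The paper does not spell out its exact encoding, so the value representations, the Abbrev heuristic, and the MI term list below are assumptions.

# Sketch of coding one toponym mention with the Table 5 features.

MI_TERMS = {"city", "county", "state", "capital", "yen", "attack"}  # illustrative subset of Table 4

def code_features(tokens, i, k, doc_classes, coref_class):
    """Feature vector for the toponym at token position i, window size ±k."""
    toponym = tokens[i]
    lo = max(0, i - k)
    hi = min(len(tokens), i + k + 1)
    left, right = tokens[lo:i], tokens[i + 1:hi]
    has_mi_term = any(t.lower().strip(".,") in MI_TERMS for t in left + right)
    return {
        "Abbrev": "." in toponym or (toponym.isupper() and len(toponym) in (2, 3)),  # crude heuristic
        "AllCaps": toponym.isupper(),
        # LeftPos/RightPos and WkContext are only filled for windows with an MI term.
        "LeftPos": left if has_mi_term else [],
        "RightPos": right if has_mi_term else [],
        "WkContext": left + right if has_mi_term else [],
        "TagDiscourse": set(doc_classes),     # classes of all toponyms in the document
        "CorefClass": coref_class or "none",  # class of a prior mention, if any
    }

toks = "The state capital of ACTON announced".split()
print(code_features(toks, 4, 3, {"civil", "ppl"}, None)["AllCaps"])  # True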
<Section position="7" start_page="367" end_page="368" type="metho"> <SectionTitle> 5 Machine Learning </SectionTitle>
<Paragraph position="0"> The features in Table 5 were used to code feature vectors for a statistical classifier. The results are shown in Table 6. As an example, when the Ripper classifier (Cohen 1996) was trained on MAC-ML with a window of k = ±3 word tokens, the predictive accuracy when tested using cross-validation was 88.39%. In Table 6, the majority class (Civil) had the predictive accuracy shown in parentheses; when testing was on a different set from the training set, cross-validation was not used. Ripper reports a confusion matrix for each class; Recall, Precision, and F-measure for these classes are shown, along with their average across classes.</Paragraph>
<Paragraph position="1"> In all cases, Ripper is significantly better in predictive accuracy than the majority class. When testing using cross-validation on the same machine-annotated corpus the classifier was trained on, performance is comparable across corpora and is in the high 80s, e.g., 88.39% on MAC-ML (k = ±3). Performance drops substantially when we train on machine-annotated corpora but test on the human-annotated corpus (HAC) (the unsupervised approach), or when we both train and test on HAC (the supervised approach). The noise in the auto-generated classes in the machine-annotated corpus is a likely cause of the lower accuracy of the unsupervised approach. The poor performance of the supervised approach can be attributed to the lack of human-annotated training data: HAC is a small corpus. TagDiscourse was a critical feature; ignoring it during learning dropped the accuracy nearly 9 percentage points. This indicates that prior mention of a class increases the likelihood of that class. (Note that when inducing a rule involving a set-valued feature, Ripper tests whether an element is a member of that set-valued feature, selecting the test that maximizes information gain for a set of examples.) Increasing the window size only lowered accuracy when testing on the same corpus (using cross-validation); for example, an increase from ±3 words to ±20 words (intervening sizes are not shown for reasons of space) lowered the predictive accuracy by 5.7 percentage points on MAC-DEV. However, increasing the training set size was effective, and the gain was more substantial for larger window sizes: combining MAC-ML with MAC-DEV improved accuracy on HAC by about 4.5% for k = ±3, but by 13% for k = ±20. In addition, F-measure for the classes was steady or increased; as Table 6 shows, this was largely due to the increase in recall on the non-majority classes. The best performance when training Ripper on the machine-annotated MAC-DEV+MAC-ML and testing on the human-annotated corpus HAC was 78.30%.</Paragraph>
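For readers who want to reproduce this kind of experiment, the sketch below shows one way to vectorize Table 5 style feature dictionaries and run cross-validation. Ripper itself is not assumed to be available, so a scikit-learn decision tree stands in as the rule-style learner; the flattening scheme and all names here are assumptions, not the paper's actual pipeline.

# Cross-validation over Table 5 style feature dictionaries, with a scikit-learn
# decision tree as a stand-in for Ripper.
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def flatten(fv):
    """Expand set-/list-valued features (TagDiscourse, LeftPos, ...) into
    binary membership indicators, mirroring Ripper's set-membership tests."""
    flat = {}
    for name, value in fv.items():
        if isinstance(value, (set, list, tuple)):
            for v in value:
                flat[f"{name}={v}"] = 1.0
        elif isinstance(value, bool):
            flat[name] = float(value)
        else:
            flat[name] = value   # string values are one-hot encoded by DictVectorizer
    return flat

def evaluate(feature_dicts, labels, folds=10):
    """Mean predictive accuracy under k-fold cross-validation."""
    model = make_pipeline(DictVectorizer(), DecisionTreeClassifier(random_state=0))
    X = [flatten(d) for d in feature_dicts]
    return cross_val_score(model, X, labels, cv=folds).mean()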
<Paragraph position="2"> Another learner we tried, the SMO support-vector machine from WEKA (Witten and Frank 2005), was marginally better, showing 81.0% predictive accuracy when training and testing on MAC-DEV+MAC-ML (ten-fold cross-validation, k = ±20) and 78.5% predictive accuracy when training on MAC-DEV+MAC-ML and testing on HAC (k = ±20).</Paragraph>
<Paragraph position="3"> Ripper rules are of course more transparent: example rules learned from MAC-DEV are shown in Table 7, along with their coverage of feature vectors and accuracy on the test set HAC.</Paragraph> </Section> </Paper>