<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1046"> <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 363-370, Vancouver, October 2005. ©2005 Association for Computational Linguistics. Disambiguating Toponyms in News</Title> <Section position="2" start_page="0" end_page="363" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Place names, or toponyms, are ubiquitous in natural language texts. In many applications, including Geographic Information Systems (GIS), it is necessary to interpret a given toponym mention as a particular entity in a geographical database or gazetteer. Thus the mention &quot;Washington&quot; in &quot;He visited Washington last year&quot; will need to be interpreted as a reference to either the city of Washington, DC or the U.S. state of Washington, and &quot;Berlin&quot; in &quot;Berlin is cold in the winter&quot; could mean Berlin, New Hampshire or Berlin, Germany, among other possibilities. While there has been a considerable body of work on distinguishing toponyms from other kinds of names (e.g., person names), there has been relatively little work on resolving which place a toponym refers to, and what kind of place it is, given a gazetteer's classification of kinds of places.</Paragraph> <Paragraph position="1"> Disambiguated toponyms can be used in a GIS to highlight a position on a map corresponding to the coordinates of the place, or to draw a polygon representing its boundary.</Paragraph> <Paragraph position="2"> In this paper, we describe a corpus-based method for disambiguating toponyms. To establish the difficulty of the problem, we began by quantifying the degree of ambiguity of toponyms in a corpus with respect to a U.S. gazetteer. We then carried out a corpus-based investigation of features that could help disambiguate toponyms. 
Given the scarcity of human-annotated data, our method used unsupervised machine learning to develop disambiguation rules. Toponyms were automatically tagged with information about them found in a gazetteer. A toponym that was ambiguous in the gazetteer was automatically disambiguated based on preference heuristics. This automatically tagged data was used to train the machine learner. We compared this method with a supervised machine learning approach trained on a corpus annotated and disambiguated by hand.</Paragraph> <Paragraph position="3"> Our investigation targeted toponyms that name cities, towns, counties, states, countries or national capitals. We sought to classify each toponym as a national capital, a civil political/administrative region, or a populated place (administration unspecified). In the vector model of GIS, the type of place crucially determines the geometry chosen to represent it (e.g., point, line or polygon) as well as any reasoning about geographical inclusion. The class of the toponym can be useful in &quot;grounding&quot; the toponym to latitude and longitude coordinates, but it can also go beyond grounding to support spatial reasoning. For example, if a province is merely grounded as a point in the data model (e.g., if the gazetteer states that the centroid of the province is located at a particular latitude-longitude), then, without the class information, the inclusion of a city within that province cannot be established. 
Also, resolving multiple cities or a unique capital to a political region mentioned in the text can be a useful adjunct to a map that lacks political boundaries or whose boundaries are dated.</Paragraph> <Paragraph position="4"> It is worth noting that our classification is more fine-grained than efforts like the EDT task in the Automatic Content Extraction program (Mitchell and Strassel 2002), which distinguishes between toponyms that are a Facility (e.g., &quot;Alfredo Kraus Auditorium&quot;), a Location (e.g., &quot;the Hudson River&quot;), and Geo-Political Entities, which include territories (e.g., &quot;U.S. heartland&quot;) and metonymic or other derivative place references (e.g., &quot;Russians&quot;, &quot;China (offered)&quot;, &quot;the U.S. company&quot;). Our classification, being gazetteer-based, is more suited to GIS-based applications.</Paragraph> </Section> </Paper>