<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0109"> <Title>Geographic reference analysis for geographic document querying</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> The work presented in this paper concerns Information Retrieval from geographical documents, i.e. documents with a major geographic component. The final aim is to answer a user's informational query with a ranked list of relevant passages from selected documents, allowing text browsing within them.</Paragraph> <Paragraph position="1"> In this paper we consider the spatial component of the texts and the queries. The idea is to perform an off-line linguistic analysis of the document, extracting spatial expressions (i.e. expressions denoting geographical localisations).</Paragraph> <Paragraph position="2"> The point is that such expressions are, in general, much more complex than simple place names. We present a linguistic analyser which recognises them, performs a semantic analysis and computes symbolic representations of their &quot;content&quot;. These representations, stored in the text as XML annotations, will act as indexes of the passages against which queries are compared. Matching queries with text expressions is a complex process that requires several kinds of numeric and symbolic computation. A prospective outline of this process is described.</Paragraph> <Paragraph position="3"> 1 Presentation of the GeoSem project. Passage extraction from geographical documents. The work presented in this paper concerns Information Retrieval (IR) from geographical documents, i.e. documents with a major geographic component. Let us make clear at once that we are mainly interested in human geography, where the phenomena under consideration are of a social or economic nature.
Such documents are produced and consumed on a massive scale by academics as well as state organisations, marketing services of private companies and so on. The final aim is, in response to a user's informational query, to return not only a set of documents (taken as wholes) from the available collection, but also a list of relevant passages allowing text browsing within them.</Paragraph> <Paragraph position="4"> Geographical information is spatialised information: information anchored, so to speak, in a geographical space. This characteristic is immediately visible in geographical documents, which describe how some phenomena (often quantified, either numerically or qualitatively) are related to a spatial and, often, a temporal localisation. Figure 1 gives an example of this informational structure, extracted from our favourite corpus (Herin, 1994), which deals with the educational system in France. As a consequence, a natural way to query documents is through a 3-dimensional topic, Phenomenon-Space-Time, as shown in Figure 2. The goal is to select passages that fulfil all of these criteria and to return them to the user in order of relevance.</Paragraph> <Paragraph position="5"> The system we have designed and are currently developing for that purpose is divided into two tasks: an off-line one, devoted to linguistic analysis of the text, and an online one, concerning querying itself. Let us give an overall view of the process, focusing on the spatial dimension of texts and analysis. Other aspects of the project, in particular the analysis of expressions denoting phenomena, the techniques used to link the three components of information (Space, Time, Phenomena) and implementation issues, can be found in (Bilhaut, 2003).</Paragraph> <Paragraph position="6"> Concerning text analysis, the goal is to locate, extract and analyse the expressions which refer to some geographical localisation [1] so that they act as indexes of text passages.
The first remark to make is that we have to cope (in general) with complex nominal expressions, not only named geographical entities, as exemplified in Figure 3. Footnote 1: Temporal expressions (expressing temporal localisation) are treated in a similar manner.</Paragraph> <Paragraph position="7"> De 1965 à 1985, le nombre de lycéens a augmenté de 70%, mais selon des rythmes et avec des intensités différents selon les académies et les départements. Faible dans le Sud-Ouest et le Massif Central, modérée en Bretagne et à Paris, l'augmentation a été considérable dans le Centre-Ouest, et en Alsace. [...] Intervient aussi l'allongement des scolarités, qui a été plus marqué dans les départements où, au milieu des années 1960, la poursuite des études après l'école primaire était loin d'être la règle.</Paragraph> <Paragraph position="8"> From 1965 to 1985, the number of high-school students increased by 70%, but at different rhythms and intensities depending on the academies and departments. Low in the South-West and the Massif Central, moderate in Brittany and Paris, the rise was considerable in the Centre-West and Alsace. [...] Another factor is the lengthening of schooling, which was more marked in departments where, in the mid-1960s, continuing one's studies after primary school was far from being the rule.</Paragraph> <Paragraph position="9"> Indeed, the collection of (proper) place names cannot constitute an adequate index: a mention of &quot;north of Paris&quot; or &quot;north of France&quot; obviously does not have the same meaning as &quot;Paris&quot; or &quot;France&quot;, not to mention &quot;south of a Bordeaux-Genève line&quot;. Moreover, some expressions (&quot;industrial towns&quot; or &quot;rural departments&quot;...) [2] involve a &quot;qualitative&quot; (demographic, sociological, economic...)
characterisation of the selected areas, which requires some knowledge of this kind.</Paragraph> <Paragraph position="10"> The conclusion is that a literal matching of &quot;queries&quot; against &quot;text expressions&quot; simply will not do. Expressions (and queries) must receive a linguistic analysis that discovers their structure and produces some kind of semantic representation. This is the goal of the off-line text processing step. A linguistic analyser of spatial expressions (nominal and prepositional phrases) has been designed, which recognises them and produces a symbolic representation of their &quot;content&quot;. These representations are attached to the text by means of XML annotation, and constitute the index against which queries will be compared. The linguistic analysis is described in section 2.</Paragraph> <Paragraph position="11"> Footnote 2: In France, &quot;departments&quot; denotes administrative districts, roughly equivalent to &quot;counties&quot;. Find the passages which concern: - Le retard scolaire dans l'Ouest de la France depuis les années 1950.</Paragraph> <Paragraph position="12"> - Educational difficulties in the West of France since the 1950s.</Paragraph> <Paragraph position="13"> - L'évolution des effectifs dans l'enseignement secondaire à Paris / dans la région parisienne.</Paragraph> <Paragraph position="14"> - Variations of the number of pupils in secondary school in Paris / in the Paris area. - L'évolution des effectifs scolaires dans les régions rurales.</Paragraph> <Paragraph position="15"> - Variations of the number of pupils in rural areas.</Paragraph> <Paragraph position="16"> - Les mutations du personnel enseignant dans les académies du Sud.</Paragraph> <Paragraph position="17"> - Transfers of the teaching staff to southern academies. Assuming that such an analysis is performed, we are ready for querying. Clearly, the easiest way for a user to formulate a query is also to use natural language.
The first step will be to apply the same linguistic analysis, producing a symbolic representation of the same nature as the one extracted from the text. We then have to perform some matching between (the representations of) the query and the text. This is not a trivial task, as the reader can guess from the expressions and queries in Figures 2 and 3. To achieve this task, we will use referential information associated with named geographical entities (longitude-latitude coordinates) together with some computation exploiting the symbolic representations produced by the linguistic analysis. A (prospective) sketch of this process is described in section 3.</Paragraph> <Paragraph position="18"> To situate the project among current research: the goals are those of Document Retrieval, but at an intra-document level, selecting passages (Callan, 1994). The methods, however, are rather (though not exclusively) those of Information Extraction in the sense of the MUC conferences (Pazienza, 1997), and we are quite close to Answer Extraction in the sense of (Molla, 2000). In particular, the spatial component of geographical texts needs much more than access to geographical resources such as gazetteers: it needs both a specific semantic analysis of complex linguistic expressions, and some symbolic and numeric spatial computation for matching the query with the text. Let us now consider these two aspects in turn.</Paragraph> <Paragraph position="19">
QUANT : TYPE (administrative) : TYPE (qualification) : ZONE (position) : ZONE (named geo. entity)
(1) : : : : à Paris
(2) : : : au nord de : la France
(3) Quelques : villes : maritimes : :
(4a) Le quart des : : : :
(4b) Tous les : départements : : du nord de : la France
(4c) Quelques : : : :
(4d) Quinze : : : :
(5) Quelques : villes : maritimes : : de la Normandie
(6) Les : départements : les plus ruraux : situés au sud de : la Loire
(1): in Paris
(2): in the north of France
(3): some seaboard towns
(4a/b/c/d): a quarter of / all / some / fifteen districts of the north of France
(5): some seaboard towns of Normandy
(6): the most rural districts situated south of the Loire
Table 1: Structure of spatial expressions
- Paris - Les villes industrielles d'Île de France.</Paragraph> <Paragraph position="20"> - Industrial towns in Île de France.</Paragraph> <Paragraph position="21"> - La moitié nord de la France.</Paragraph> <Paragraph position="22"> - The northern half of France.</Paragraph> <Paragraph position="23"> - Les départements ruraux du nord de la France.</Paragraph> <Paragraph position="24"> - Rural departments in the north of France.</Paragraph> <Paragraph position="25"> - Au sud d'une ligne Bordeaux-Genève.</Paragraph> <Paragraph position="26"> - South of a Bordeaux-Genève line.</Paragraph> </Section> </Paper>