File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/w03-0106_intro.xml
Size: 4,048 bytes
Last Modified: 2025-10-06 14:01:54
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0106"> <Title>InfoXtract location normalization: a hybrid approach to geographic references in information extraction [?]</Title> <Section position="4" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"> The location normalization module is designed and implemented as an integral part of Cymfony's core information extraction (IE) engine, InfoXtract. InfoXtract extracts and normalizes entities, relationships and events from natural language text.</Paragraph> <Paragraph position="1"> Figure 1 shows the overall system architecture of InfoXtract, which arranges multiple modules in a pipeline.</Paragraph> <Paragraph position="2"> InfoXtract covers a spectrum of linguistic processing and relationship/event extraction. In its current state, the engine comprises over 100 levels of processing and 12 major components. Some components are based on hand-crafted pattern matching rules, some are statistical models or procedures, and others are hybrid (e.g. NE, Co-reference, Location Normalization). The most basic information extraction task is named entity (NE) tagging [Krupka and Hausman 1998; Srihari et al. 2000]. The NE tagger identifies and classifies proper names of types PERSON, ORGANIZATION, PRODUCT, NAMED-EVENTS and LOCATION (LOC), as well as numerical expressions such as MEASUREMENT (e.g. MONEY, LENGTH, WEIGHT) and time expressions (e.g. TIME, DATE, MONTH).</Paragraph> <Paragraph position="3"> In parallel with location normalization, InfoXtract also performs time normalization and measurement normalization.</Paragraph> <Paragraph position="5"> InfoXtract combines a Maximum Entropy Model (MaxEnt) and a Hidden Markov Model for NE tagging [Srihari et al. 2000]. The Maximum Entropy Model incorporates local contextual evidence to handle ambiguity in information drawn from a location gazetteer.
The Tipster Location Gazetteer used by InfoXtract contains many common words, such as I, A, June and Friendship. There is also a large overlap between person names and location names, such as Clinton and Jordan.</Paragraph> <Paragraph position="6"> Using MaxEnt, a system can learn in which contexts a word is a location name, but determining the correct sense of an ambiguous location name remains very difficult. The NE tagger in InfoXtract assigns only the location super-type tag LOC to identified location words, leaving location sub-type tagging (such as CITY or STATE) and its disambiguation to the subsequent Location Normalization module.</Paragraph> <Paragraph position="7"> Beyond NE, the major information objects extracted by InfoXtract are Correlated Entity (CE) relationships (e.g. AFFILIATION and POSITION); Entity Profile (EP), a collection of extracted entity-centric information; Subject-Verb-Object (SVO), which refers to dependency links between a logical subject/object and its verb governor; General Event (GE), covering who did what, when and where; and Predefined Event (PE) such as Management</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Succession and Company Acquisition. </SectionTitle> <Paragraph position="0"> It is believed that these information objects capture the key content of the processed text. When normalized location, time and measurement NEs are associated with information objects (events, in particular) based on parsing, co-reference and/or discourse propagation, these events are stamped. The processing results are stored in the IE Repository, a dynamic knowledge warehouse used to support cross-document consolidation, text mining for hidden patterns and IE applications.
For example, location-stamped events can support information visualization on maps (Figure 2); time-stamped information objects can support visualization along a timeline; measurement-stamped objects will allow advanced retrieval such as find all Company Acquisition events that involve money amount greater</Paragraph> </Section> </Section> </Paper>