<?xml version="1.0" standalone="yes"?>
<Paper uid="X98-1010">
  <Title>COREFERENCE RESOLUTION STRATEGIES FROM AN APPLICATION PERSPECTIVE</Title>
  <Section position="4" start_page="46" end_page="47" type="metho">
    <SectionTitle>
3. DOCUMENT ZONING
</SectionTitle>
    <Paragraph position="0"> During the development of the drug seizure application, it became apparent that knowledge of the structure of the document would be of help in limiting the coreference resolution to semantically related zones. Often, a document is sent to convey information on multiple topics and/or locations. If the text processing system does not recognize a topic shift, it may incorrectly relate unrelated information.</Paragraph>
    <Paragraph position="1"> The challenge is to zone the document before information extraction begins. As with most text-processing problems, zoning must be determined via both structure and meaning, i.e. the syntax and semantics of the document.</Paragraph>
    <Paragraph position="2"> Authors often use visual cues, such as skipped lines and indentation to alert the reader to shifts into new topics. Since text processing techniques have been developed for character streams, it is difficult for them to interpret the visual cues that are two dimensional in nature, rather than linear. Our current work is seeking to apply image understanding techniques to this problem by constructing an auxiliary grid representation of the text and applying two dimensional pattern matching, in order to extract the nature of the document's structure.</Paragraph>
    <Paragraph position="3">  Knowledge of the structure must then be supplemented with knowledge of the semantics of the structure. This will make it possible for the text processing system to go beyond the structural components of paragraph and table to find the semantic zones of the document which tie structural components together. For example, a single word at the beginning of a paragraph may have significance only because of the fact that it is a country name.</Paragraph>
  </Section>
  <Section position="5" start_page="47" end_page="47" type="metho">
    <SectionTitle>
SPAIN
LOCAL POLICE HAVE SEIZED ...
</SectionTitle>
    <Paragraph position="0"> Depending on the source of the document, the author may insert outline characters to help the reader interpret the structure of the story.</Paragraph>
    <Paragraph position="1"> A. PUERTO RICO: ON JULY 5, MARITIME OFFICERS ...</Paragraph>
    <Paragraph position="2"> Th~ outline styles are often standard forms which have been tailored by the author, so an automatic system must be able to interpret varying styles. In the following, the location will span several blocks of text, marked alphabetically.</Paragraph>
  </Section>
  <Section position="6" start_page="47" end_page="48" type="metho">
    <SectionTitle>
1. CALIFORNIA
</SectionTitle>
    <Paragraph position="0"> A. JULY 5, OFFICERS SUSPECTED...</Paragraph>
    <Paragraph position="1"> B. JULY 7, LOCAL LAW ENFORCEMENT SEIZED ... Pattern matching techniques are being applied at the tokenizer level to identify structure markers and look for the outline patterns.</Paragraph>
    <Paragraph position="2"> Zoning is a topic of ongoing research which is also being applied to the problem of tabular information.</Paragraph>
    <Paragraph position="3"> BlockFinder Image understanding techniques have been applied to the problem of recognizing the structure of text. BlockFinder, a prototype of a new NLToolset tool, uses two-dimensional patterns to find the edges in a grid of text. It converts an input text file into a list of blocks separated by white space.</Paragraph>
    <Paragraph position="4"> The BlockFinder is the component of NLToolset that looks at a text file from a two dimensional perspective. Characters from the file are arranged in a two dimensional grid where the rows of the grid are separated by newline characters. By treating this character grid as an image, it is possible to find sections of text which are isolated by white space. Now that the computer has a representation of a file that reflects how the characters would appear on a page, it is feasible to look &amp;quot;above&amp;quot; and &amp;quot;below&amp;quot; characters in a file to find boundaries where text meets white space. In this way the BlockFinder can pick up on zoning cues which are obvious to a human reader but which have proved elusive for computers. Here is an overview of the BlockFinder algorithm.</Paragraph>
    <Paragraph position="5">  1. Characters from the file stream are inserted into a 2-D array. A newline character starts a new line of the array. Tab characters insert white space up to the next eight character tab stop.</Paragraph>
    <Paragraph position="6"> 2. Each character is classified as text, punctuation, or white space.</Paragraph>
    <Paragraph position="7"> 3. White space consistent with normal word spacing within a block of text is filtered out. These space characters are reclassified as text.</Paragraph>
    <Paragraph position="8"> 4. Punctuation consistent with standard English is filtered out. These punctuation characters are reclassified as text.</Paragraph>
    <Paragraph position="9"> 5. The boundaries between text and non-text characters are marked as edges.</Paragraph>
    <Paragraph position="10"> 6. Adjacent edges are linked together to  form longer straight edges.</Paragraph>
    <Paragraph position="11"> 7. The long straight edges are grouped to trace the boundaries of text blocks.</Paragraph>
    <Paragraph position="12"> 8. These text blocks are flagged as document zones.</Paragraph>
    <Paragraph position="13"> Most of the blocks detected by the BlockFinder are sections and paragraphs within a document. These blocks are continuous; they can be represented by a start and end position in the file stream. The BlockFinder finds other (non-continuous) blocks as well. A primary example of a non-continuous block is a column of a table. Work is underway to have the BlockFinder isolate and organize these column blocks into a table structure that would allow the NLToolset to interpret tabular data.</Paragraph>
    <Paragraph position="14"> Whether the identified blocks represent tabular columns, paragraphs, or sections of a document, they contain important clues to the document's organization. These clues help the human reader to understand the document. The BlockFinder allows the NLToolset to use these same clues to break the document into logical zones which should, in turn, improve the quality of the coreferences generated.</Paragraph>
    <Section position="1" start_page="48" end_page="48" type="sub_section">
      <SectionTitle>
Outline Matching
</SectionTitle>
      <Paragraph position="0"> The author of a document has an almost infinite variety of conventions from which to choose to indicate text grouping. Sections, sub-sections, or paragraphs can be separated by blank lines, or by outline symbols, or by some arbitrary indentation with no blank lines. A system that can also use outline characters and indentation as well as blocks will be more successful than one that works with blocks alone.</Paragraph>
      <Paragraph position="1"> The outline hierarchy of a document is indicated by the order in which the outline symbols appear.</Paragraph>
      <Paragraph position="2"> In our prototype, during tokenization, an outline label (a letter or number) is recognized as one or two digits or letters followed by a &amp;quot;.&amp;quot; at the beginning of a line. The tokenizer then inserts the title &amp;quot;outlineletter,&amp;quot; &amp;quot;outline-number,&amp;quot; or &amp;quot;outline-roman. ''2 Pattern matching is used to determine hierarchy by position in the file. If the first occurrence of an outline title is a Roman numeral, then we know that Roman numerals are being used as the top outline level. Similarly, if the second type of outline title to appear (that is not Roman) is an outline letter and lastly an outline number, then we have identified the style. This pattern matching is used to create new labels that indicate hierarchy: outlinel, outline2, outline3, etc. Next, we simply find and group each outline label and the text associated with it into component objects.</Paragraph>
      <Paragraph position="3"> Internal structures are used to group the Outline components into parent-child relationships that represent zone structure.</Paragraph>
      <Paragraph position="4"> Indentation Next, we intend to look at indentation to indicate the breaking of a block of text into smaller units (i.e. paragraph). In our prototype, if the indentation is greater at the break point than the indentation at the start of the containing block, the new units will be grouped as children of the containing block.</Paragraph>
      <Paragraph position="5"> Otherwise, the new units are sibling s to the containing block.</Paragraph>
      <Paragraph position="6"> Semantic Zones The idea behind using the blocks, outlines, and indentation, is to create the basic document structure 2 Exceptions are made for single digit Roman numerals, I, V, X, etc., which can either represent Roman numerals or letters, depending on the context. first, refining at each step, using the new information. Once the structure has been built, we will use semantic pattern matching to determine the meaning of the structure based on prior information concerning the document style. For example, in the case of the application under discussion, document sections are sometimes marked with location names that have either no outline labels or an outlinel label. So, the collection of components that start with a location name is a document section, and it and all its children can be treated as a single zone.</Paragraph>
      <Paragraph position="7"> When the document sections and subsections have been identified, the code can verify that the reference token retrieved is in the same document section as the current token. If it is not, than it is not an accurate reference. Also, since the location in the section header has been identified, it is clearly the default location for any event found in that section.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="48" end_page="48" type="metho">
    <SectionTitle>
4. CONCLUSIONS AND FUTURE
WORK
</SectionTitle>
    <Paragraph position="0"> Research is ongoing to expand the capability of the NLToolset's coreference resolution module.</Paragraph>
    <Paragraph position="1"> Location Merging The location template for the drug seizure application contains slots to hold information about the locale (a descriptive, such as Highway 40), the city, state, country, latitude/longitude, region or body of water. To extract a complete representation of the location for an event, the NLToolset must collect all location references and merge them into a complete description of the location. In the following example, a pattern to extract seizure information may pick up both the city, San Felipe, and the locale, Highway 32. These must then be merged into one template, based on the knowledge that they are related via the seizure extraction pattern. Additionally, the country information must then be added. While it is possible to use the gazetteer to look up city names in order to find the associated country, sometimes a city name has been used in more than one country, and other information, such as zoning information, must be used to disambiguate. Another problem is that not all events occur in large cities; small towns are not usually listed in the gazetteer.</Paragraph>
  </Section>
  <Section position="8" start_page="48" end_page="48" type="metho">
    <SectionTitle>
1. BOLIVIA
</SectionTitle>
    <Paragraph position="0"> A. LA PAZ: ON JULY 8 COAST GUARD PATROLS SIGHTED...</Paragraph>
    <Paragraph position="1"> B. SAN FELIPE: JULY 7, LOCAL LAW ENFORCEMENT</Paragraph>
  </Section>
  <Section position="9" start_page="48" end_page="49" type="metho">
    <SectionTitle>
OFFICERS SEIZED 2 TONS OF COCAINE ON HIGHWAY 32.
</SectionTitle>
    <Paragraph position="0"> Location merging capability, based on event and zoning information, will be added to the NLToolset in the near future.</Paragraph>
    <Section position="1" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
Event Coreference Resolution
</SectionTitle>
      <Paragraph position="0"> During development of the application prototype, event coreference resolution was identified as a necessary technique to better the accuracy of the system. The following example illustrates the problem.</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="49" end_page="49" type="metho">
    <SectionTitle>
17 KG. OF COCAINE WERE SEIZED IN MIAMI. THE OPERATION WAS CONDUCTED BY A TEAM CONSISTING OF THE FBI, THE COAST GUARD, AND LOCAL AUTHORITIES.
</SectionTitle>
    <Paragraph position="1"> In order to tie in the seizing organizations to the seizure event, the system must be able to identify the referent of operation as the entire seizure event. This is coreference resolution at a later stage of processing than that for entities; it must occur after the main events have been identified. The plan is to apply patterns which match nominalized event forms, and to link them to the known events, based on zoning information.</Paragraph>
    <Section position="1" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
Event Merging
</SectionTitle>
      <Paragraph position="0"> Event merging is a challenging part of extracting complex scenario templates. Authors usually spread information across several sentences, depending on the understanding of the reader to link the related information. The following example illustrates this point.</Paragraph>
    </Section>
  </Section>
  <Section position="11" start_page="49" end_page="49" type="metho">
    <SectionTitle>
17 KG OF COCAINE WAS SEIZED ON THE HMS PINAFORE.
THE VESSEL HAD EMBARKED FROM CALI AND WAS
HEADED FOR MIAMI.
</SectionTitle>
    <Paragraph position="0"> Thoroughly understanding the text is not something that automatic text processing systems currently do successfully. In fact, the most successful information extraction systems long ago gave up the goal of completely understanding free text. Targeted extraction of relevant information has been the most fruitful strategy, thus far.</Paragraph>
    <Paragraph position="1"> To continue in this tradition, our TIPSTER research has identified two techniques to investigate as solutions to the event merging problem. The first is entity-based event merging. This technique is based on the observation that the entity coreference resolution can act as a vehicle for linking secondary information. In the previous example, having linked the vessel with the platform HMS Pinafore would allow the origin and destination of the vessel to migrate back to the extracted seizure event via the coreference chain.</Paragraph>
    <Paragraph position="2"> The second technique to be developed is based on the idea that a particular event is usually composed of a finite set of predictable activities. For example, a successful Coast Guard seizure operation may be composed of patrolling, boarding, arresting, and seizing activities. This is not a new idea in the field of Artificial Intelligence.</Paragraph>
    <Paragraph position="3"> Since extracting isolated event information is something that the NLToolset does very well, it is thought that a profile of an event can be modeled.</Paragraph>
    <Paragraph position="4"> The profile would consist of a main event and its associated events. The NLToolset could then merge the extracted information based on the compatibility of its participating entities and zoning information. Something like this was developed on a limited basis for the joint venture scenario template of the original TIPSTER program. In that case, the LISP version of the NLToolset allowed ownership information to be merged into the main event of joint venture based on entity compatibility.</Paragraph>
    <Paragraph position="5"> The differences between the two techniques, entity-based and profile-based event merging, are subtle. Both require the construction of patterns for extracting associated event information. The main difference is that, in the former, the associated information, e.g. vessel destination, is tied to the entities involved. This method does not preclude the possibility that an entity may be involved in more than one event; however, event merging, as a step after event extraction, is not required.</Paragraph>
    <Paragraph position="6"> With profile-based event merging, the entity information is kept associated with the extracted event and merging takes place after all events have been extracted. As the application is expanded to handle more than one type of main event, there may be overlaps among the profiled subevents.</Paragraph>
    <Paragraph position="7"> Both techniques will be investigated under the remainder of the current TIPSTER research effort.</Paragraph>
    <Paragraph position="8"> Summary This paper has discussed the evolution of the coreference resolution techniques of the NLToolset, as they have been applied to an information extraction application, similar to the MUC Scenario Template tasks. It has also discussed current work on understanding document structure, as well as future work on improving information merging techniques.</Paragraph>
  </Section>
class="xml-element"></Paper>