<?xml version="1.0" standalone="yes"?>
<Paper uid="M92-1031">
  <Title>CRL/NMSU and Brandeis : Description of the MucBruce System as Used for MUC-4</Title>
  <Section position="2" start_page="0" end_page="224" type="metho">
    <SectionTitle>
OVERVIEW OF THE TEMPLATE FILLING PROCESS
</SectionTitle>
    <Paragraph position="0"> The overall system architecture is shown in Figure 1. Three independent processes operate on an input text. One, the Text Tagger, marks a variety of strings with semantic information. The other two, the Relevant Template Filter and the Relevant Paragraph Filter, perform word frequency analysis to determine whether a text should be allowed to generate templates for particular incident types and which paragraphs are specifically related to each incident type. These predictions are used by the central process in the system, the Template Constructor, which uses a variety of heuristics to extract template information from the tagged text. A skeleton template structure is then passed to the final process, the Template Formatter, which performs some consistency checking, creates cross references, and attempts to expand any names found in the template to the longest form in which they occur in the text. Each of the above processes is described in more detail below.</Paragraph>
    <Paragraph position="1"> Relevancy Filters. We have developed a procedure for detecting document types in any language. The system requires training texts for the types of documents to be classified and is developed on a sound statistical basis using probabilistic models of word occurrence [Guthrie and Walker 1991]. It may operate on letter grams of appropriate size or on actual words of the language being targeted, and it develops optimal detection algorithms from automatically generated &quot;word&quot; lists. The system depends on the availability of appropriate training texts. So far the method has been applied to English, discriminating between Tipster and MUC texts, and to Japanese, discriminating between Tipster texts and translations of ACM proceedings. In both cases the classification scheme developed was correct 99% of the time.</Paragraph>
    <Paragraph position="2"> The method has now been extended to the identification of relevant paragraphs and relevant template types for the MUC documents. This is a more complex problem due to the non-homogeneous nature of the texts and the difficulty of deriving training sets of text. Each process uses two sets of words, one which occurs with high probability in the texts of interest, and the other which occurs in the 'non-interesting' texts. Due to the complexity of separating relevant from non-relevant information for the MUC texts, we actually use three filters: two trained on sets of non-relevant and relevant paragraphs and one trained on sets of relevant and non-relevant texts. The lists of relevant and non-relevant paragraphs were derived using the templates of the 1300-text test corpus. Any paragraph which contributed two or more string fills to a particular template was used as part of the relevant training set; paragraphs contributing only one string fill were regarded as of dubious accuracy and were not placed in either set; and all other paragraphs were considered non-relevant. Word lists were derived automatically by finding those words in the relevant training set which occurred within a threshold of most frequently occurring words in the relevant paragraphs and not in the non-relevant paragraphs, and vice versa to obtain a set of non-relevant words.</Paragraph>
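    <Paragraph> The training-set and word-list rules described above can be sketched as follows. This is a minimal illustration, not the MucBruce code: the function names, the `fill_counts` input format, and the `top_n` parameter (standing in for the frequency threshold, whose value is not stated) are all assumptions, as is reading "not in the non-relevant paragraphs" as zero occurrences.

```python
from collections import Counter


def split_training_paragraphs(fill_counts):
    """Partition paragraph ids by the number of string fills they contributed.

    fill_counts maps a paragraph id to its string-fill count for one
    template (a hypothetical input format chosen for this sketch).
    """
    relevant, non_relevant = [], []
    for para_id, fills in fill_counts.items():
        if fills >= 2:
            relevant.append(para_id)        # two or more fills: relevant
        elif fills == 0:
            non_relevant.append(para_id)    # no fills: non-relevant
        # exactly one fill: dubious, so placed in neither set
    return relevant, non_relevant


def derive_word_list(target_paras, other_paras, top_n=100):
    """Words among the top_n most frequent in target_paras that never
    occur in other_paras.

    top_n is an assumed stand-in for the unspecified threshold.
    """
    target_counts = Counter(
        word for para in target_paras for word in para.lower().split()
    )
    other_vocab = {word for para in other_paras for word in para.lower().split()}
    return [
        word
        for word, _ in target_counts.most_common(top_n)
        if word not in other_vocab
    ]
```

Calling `derive_word_list(relevant, non_relevant)` gives a relevant word list; swapping the two arguments gives the corresponding non-relevant list, which mirrors the "vice versa" step in the text.</Paragraph>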
    <Paragraph position="3"> The relevant template marker consists of two processes, the first trained on a set of texts consisting of paragraphs from the MUC corpus which produced two or more string fills against text consisting of paragraphs which generated no string fills.</Paragraph>
    <Paragraph position="4"> These allow us to determine, based on word counts taken at paragraph level, whether the whole text should be checked for specific template types. The second stage is activated if any single paragraph in the text is found to be 'relevant'. This stage is trained on the set of texts which generated a particular template type against texts which produced no templates. There are separate relevant and non-relevant lists of words used to determine each template type.</Paragraph>
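    <Paragraph> The two-stage decision can be illustrated with a short sketch. The scoring rule used here (hits on the relevant list minus hits on the non-relevant list, compared against zero) is an assumption made for illustration; the text says only that the decision is based on word counts. All function and parameter names are hypothetical.

```python
def paragraph_score(paragraph, relevant_words, non_relevant_words):
    """Relevant-list hits minus non-relevant-list hits (assumed scoring)."""
    tokens = paragraph.lower().split()
    hits = sum(1 for token in tokens if token in relevant_words)
    misses = sum(1 for token in tokens if token in non_relevant_words)
    return hits - misses


def template_types_for_text(paragraphs, general_lists, type_lists):
    """Map each template type to True or False for one text.

    general_lists: (relevant_words, non_relevant_words) for stage one.
    type_lists: {template_type: (relevant_words, non_relevant_words)}.
    """
    rel, non = general_lists
    # Stage one: is any single paragraph of the text relevant at all?
    any_relevant = any(paragraph_score(p, rel, non) > 0 for p in paragraphs)
    decisions = {}
    for template_type, (type_rel, type_non) in type_lists.items():
        if not any_relevant:
            decisions[template_type] = False
            continue
        # Stage two: score the whole text against the per-type word lists.
        total = sum(paragraph_score(p, type_rel, type_non) for p in paragraphs)
        decisions[template_type] = total > 0
    return decisions
```

Stage two runs only when stage one finds at least one relevant paragraph, so a text with no relevant paragraphs is rejected for every template type without consulting the per-type lists.</Paragraph>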
    <Paragraph position="5"> The result is a vector represented as a Prolog fact which determines whether the texts will be allowed to generate templates of a particular type. Thus:</Paragraph>
  </Section>
  <Section position="10" start_page="225" end_page="225" type="metho">
    <SectionTitle>
[Word list tables; columns: WORD, FREQUENCY; surviving entries: ELN, BOMB, KIDNAPPED, ELECTIONS, DAMAGED, WERE]
</SectionTitle>
    <Paragraph position="0"> slot(4, ['NO', 'ARSON', 'NO', 'ATTACK', 'YES', 'BOMBING', 'NO', 'KIDNAPPING', 'NO', 'ROBBERY', 'NO', 'DUMMY']).</Paragraph>
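    <Paragraph> The slot fact interleaves a YES/NO decision with the incident type it applies to. The sketch below, which is illustrative and not part of the original system, converts such a flat list to a mapping and back. The helper names are hypothetical, and treating the first argument (4 in the example) as a text identifier is an assumption.

```python
def decode_slot(flat):
    """Turn ['NO', 'ARSON', 'YES', 'BOMBING', ...] into {'ARSON': False, ...}."""
    if len(flat) % 2 != 0:
        raise ValueError("expected alternating decision/type entries")
    return {flat[i + 1]: flat[i] == "YES" for i in range(0, len(flat), 2)}


def encode_slot(text_id, decisions):
    """Render a decision mapping as a Prolog-style slot/2 fact string.

    text_id is assumed to identify the text; the source does not say
    what the first argument of slot/2 means.
    """
    items = []
    for template_type, allowed in decisions.items():
        items.append("'YES'" if allowed else "'NO'")
        items.append("'" + template_type + "'")
    return "slot(" + str(text_id) + ", [" + ", ".join(items) + "])."
```

Decoding the example fact yields a mapping in which only BOMBING is allowed, and encoding that mapping with `text_id` 4 reproduces the fact as printed.</Paragraph>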
    <Paragraph position="1"> The relevant paragraph filter is the final stage and uses word lists which were derived from relevant and non-relevant paragraphs for each template type.</Paragraph>
    <Paragraph position="2"> Once again this operates at the paragraph level and produces a list of paragraph numbers for each template type. These paragraph lists are only used if the relevant template filter has also predicted a template of that type. This stage produces a vector of relevant paragraphs. Thus: rel_paras([[1,3,5],'ARSON', [1,2,3,4,5],'ATTACK', [1,3],'BOMBING', [],'KIDNAPPING', [],'ROBBERY', [],'DUMMY']).</Paragraph>
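    <Paragraph> The gating rule, under which paragraph lists count only for template types that the relevant template filter has also predicted, can be sketched as below. The function names and the `allowed_types` mapping (template type to True or False) are assumptions introduced for this illustration.

```python
def pair_up(flat):
    """Convert [[1, 3, 5], 'ARSON', [], 'ROBBERY', ...] to a dict keyed by type."""
    return {flat[i + 1]: flat[i] for i in range(0, len(flat), 2)}


def usable_paragraphs(rel_paras, allowed_types):
    """Keep paragraph lists only for types the template filter allowed.

    rel_paras: flat list alternating paragraph-number lists and type names.
    allowed_types: hypothetical mapping of template type to True/False.
    """
    by_type = pair_up(rel_paras)
    return {
        template_type: paras
        for template_type, paras in by_type.items()
        if allowed_types.get(template_type, False)
    }
```

With the rel_paras vector shown above and a template filter that allows only BOMBING, the ARSON and ATTACK paragraph lists are discarded even though they are non-empty, leaving paragraphs 1 and 3 for BOMBING.</Paragraph>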
    <Paragraph position="3"> The two stages can be thought of as first distinguishing relevant texts for a particular template type from among all texts and second, given a relevant text, distinguishing between the relevant and non-relevant paragraphs within that text for the template type.</Paragraph>
    <Paragraph position="4"> Partial word lists for relevant and non-relevant texts are given in Tables 1 and 2. The full lists contain 124 and 117 words respectively. Partial relevant word lists for BOMBING at the text level (relevant template) and the paragraph level are given in Tables 3 and 4. The full lists contain 176 and 51 words respectively. Semantic Tagging. A key question for the Tipster and MUC tasks is the correct identification of place names, company and organization names, and the names of individuals. We now have available to us several sources of geographic, company and personal name information. In addition the templates provided for MUC also supplied name information. These have been incorporated in a set of tagging files which provide lexical information as a pre-processing stage for every text.</Paragraph>
    <Paragraph position="5"> The details of the Text Tagger are shown in Figure 2, which is a screen dump of an interface which allows examination of the operation of each stage in the filter. The text window on the left shows the state of a text after the group dates process has converted dates to standard form and on the right after the temporary tags placed to identify date constituents have been removed. Each stage, apart from the last, marks the text with tags in the form:</Paragraph>
  </Section>
</Paper>