<?xml version="1.0" standalone="yes"?>
<Paper uid="M92-1014">
  <Title></Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> The Computing Research Laboratory (New Mexico State University) and the Computer Science Department (Brandeis University) are collaborating on the development of a system (DIDEROT) to perform data extraction for the Tipster project. This system is still far from fully developed, but as many of the techniques being used are domain independent, and in many cases language independent, we have assembled them in a preliminary manner to produce a prototype system (MucBruce [1]), which handles the MUC-4 texts.</Paragraph>
    <Paragraph position="1"> The overall system architecture is shown in Figure 1.</Paragraph>
    <Paragraph position="2"> The development of the software and data used for MucBruce has been carried out over a three month period beginning at the end of February, 1992. The present version of the system relies extensively on statistically-based measures of relevance made both at the text and the paragraph level. Texts are tagged for a variety of features by a pipeline of processes. The marked texts and the paragraph relevancy information are used to allow a scan around keywords for appropriate slot filling strings. The system has been augmented since the dry run with a parser which processes sentences that contain a word with an associated Generative Lexical Semantic (GLS) definition. This component was added by Brandeis late in the development process and has access to approximately 20 lexical definitions.</Paragraph>
    <Paragraph position="3"> Our results reflect the extremely simplistic approach to identifying the slot fills in a text. We feel confident, however, that an expansion of the coverage of our GLS entries and the addition of further constraints to prevent template overgeneration will produce significant improvements. We have created a set of tagging and statistical techniques which will apply to any text type, given appropriate training data.</Paragraph>
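    <Paragraph position="4"> The scan around keywords for slot filling strings, described earlier in this section, might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the MucBruce code itself (which is written in C, Lex and Prolog): the Token class, the scan_for_fillers function, the window size and the tag names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str      # surface string, e.g. "FPMR"
    tag: str = ""  # hypothetical tag name such as "organ" or "human"; "" if untagged


def scan_for_fillers(tokens, keywords, wanted_tag, window=5):
    """Collect tagged strings of type wanted_tag that occur within
    `window` tokens of any keyword (assumed heuristic, for illustration)."""
    fillers = []
    for i, tok in enumerate(tokens):
        if tok.text not in keywords:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for cand in tokens[lo:hi]:
            # Keep each candidate once, in order of first appearance.
            if cand.tag == wanted_tag and cand.text not in fillers:
                fillers.append(cand.text)
    return fillers


# Example sentence, loosely based on the sample text in Figure 1.
words = "A FLAG FROM THE FPMR WAS FOUND AT THE SCENE OF THE EXPLOSION".split()
sentence = [Token(w, "organ" if w == "FPMR" else "") for w in words]
print(scan_for_fillers(sentence, {"EXPLOSION"}, "organ", window=8))
```

A wider window finds more candidate fillers at the cost of more spurious ones, which is one reason the paragraph relevancy information is needed to restrict where the scan is applied.</Paragraph>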
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SYSTEM FEATURES
</SectionTitle>
    <Paragraph position="0"> The system consists of three front-end components, all of which are C or Lex programs:
 * A text relevancy marker
 * A paragraph relevancy marker
 * A text tagging pipeline
and two MUC-specific Prolog programs:
 * A template constructor
 * A template formatter
[1] We seem to have adopted a philosophical stance for our system nomenclature, and this particular Australian philosopher seemed to embody some of the ad hoc notions which, at the moment, glue our system together.
&lt;\endorgan&gt; THAT PROMOTES &quot;ALL FORMS OF STRUGGLE&quot; AGAINST THE &lt;\organ&gt; MILITARY [type(['MILITARY','NOUN'])] &lt;\endorgan&gt; &lt;\organ&gt; GOVERNMENT [type(['GOVERNMENT','NOUN'])] &lt;\endorgan&gt; HEADED BY &lt;\human&gt; GENERAL [type(['MILITARY','NOUN','RANK'])] &lt;\endhuman&gt; AUGUSTO PINOCHET.</Paragraph>
    <Paragraph position="1"> &lt;\human&gt; POLICE Itype( ('LAW ENFORCEMENT,'NOUN'DI &lt;\endhuman&gt; SOURCES HAV E REPORTED THAT THE EXPLOSION CAUSED SERIOUS A FLAG FROM THE &lt;\organ&gt; MANUE LRODRIGUEZ PATRIOTIC FRONT ItypeIITERRORIST , 'NAME' fl I &lt;\endorgan&gt; (&lt;\organ&gt; FPM RItype([TERRORIST','NAME' DI &lt;\endorgan&gt;) WA S FOUN DAT THE SCENE OF THE EXPLOSION. THE &lt;\organ &gt; FPMR Itype([TERRORIST', 'NAME' ])I &lt;\endorgan&gt;IS A CLANDESTINE LEFTIST &lt;\organ&gt; GROUP ltype(['OTHER','NOUN' D I&lt;\endorgan&gt; THAT PROMOTES &amp;quot;ALL FORMS O F STRUGGLE&amp;quot; AGAINST THE &lt;\organ&gt; MILITAR YItype(['MILITARY', 'NOUN' DI &lt;\endorgan&gt; &lt;\organ&gt; GOVERNMENT Itype(['GOVERNME N ','NOUN' DI &lt;\endorgan&gt; ==uHEADED= B Y&lt;\human&gt; GENERAL Itype(('MILITARY'. 'NOUN', 'RANK'DI &lt;\endhuman&gt; ==nAUGUSTO===uPINOCHET=.</Paragraph>
    <Paragraph position="2"> &lt;\human&gt; POLICE; [type(['LAW ENFORCEMENT, Figure 1 : MucBruce - System Overvie w  One of our principal intentions is to automate as much as possible all the processes associated with th e creation of a text extraction system . Our statistical techniques for relevant text recognition use word list s which are automatically derived from training data . Our text tagger uses proper name information derive d from the key templates and other taggers for human names and dates are largely domain independent . We intend to derive the entire core lexicon for the system from Machine Readable Dictionaries and then to tun e it against appropriate corpora .</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
OFFICIAL RESULTS
</SectionTitle>
    <Paragraph position="0"> Our results are shown in Tables 1 and 2. The results for test 4 are much poorer than those for test 3. We have not established any specific causes for this difference. For most of the individual slots we see some improvement in recall and a greater improvement in precision over the results of the dry run test. The MucBruce system has no parameter that directly trades recall against precision; to change these we would need to modify the parameters given to the text statistics programs. For MUC-4 we tried to improve precision at the expense of some recall. It is extremely difficult to measure the accuracy of the template predicting programs, as their performance can easily be masked by errors occurring in the template producing sections of the system. We need to run separate tests of these components to establish exactly how the text statistics, text marking and template producing components each contribute to performance. We have not yet, however, had time to carry out these tests on the new MUC-4 data.</Paragraph>
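    <Paragraph position="1"> For readers unfamiliar with the two measures, the Python sketch below computes recall and precision for a single slot from a set of key fills and a set of system fills. It is a deliberately simplified illustration: it uses exact string matching only and ignores the partial-credit rules of the official MUC scoring program, and the function name and example fills are hypothetical.

```python
def recall_precision(key_fills, response_fills):
    """Recall = correct / number of key fills;
    precision = correct / number of fills the system produced."""
    key = set(key_fills)
    response = set(response_fills)
    correct = len(key.intersection(response))
    recall = correct / len(key) if key else 0.0
    precision = correct / len(response) if response else 0.0
    return recall, precision


# Overgeneration (extra, wrong fills) lowers precision but not recall.
key = ["FPMR", "BOMB"]
response = ["FPMR", "GUERRILLAS", "POLICE"]
print(recall_precision(key, response))
```

In this simplified picture, constraints that suppress spurious fills raise precision, while any correct fills they also suppress lower recall.</Paragraph>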
  </Section>
  <Section position="4" start_page="0" end_page="121" type="metho">
    <SectionTitle>
EFFORT SPENT
</SectionTitle>
    <Paragraph position="0"> Approximately ten people have worked at one time or another on the MUC-4 system over the last three months. They were all, however, also working on other projects over this period. A rough estimate of the time involved would be six person-months. The major areas of work were in developing and refining the statistical techniques, designing and developing the tagging software and implementing a system which could use our current incomplete set of components. Work also went into designing and implementing an appropriate form for the Generative Lexical Semantic entries.</Paragraph>
    <Paragraph position="1"> Our limiting factor was definitely time. In the last month we generalized the lexical entries in our tagging file. This meant our system was often likely to recognize partial strings as being appropriate fillers (e.g. GUERILLAS). We intended to avoid this problem by incorporating the BBN part of speech tagger (POST) into our MUC-4 system and writing code to glue together noun phrases occurring around our new general tags. All this code was written and tested just before the MUC-4 final test, but we were unable to incorporate it in time.</Paragraph>
    <Paragraph position="2"> The training texts were used to generate our statistical information and word lists. The methods used are automatic and require only the setting of thresholds for word selection.</Paragraph>
    <Paragraph position="3"> The system has improved its performance slightly since the dry run test. Many of our changes in isolation are detrimental and require the addition of other techniques to establish their usefulness.</Paragraph>
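    <Paragraph position="4"> The idea of gluing together the noun phrase around a generally tagged word, mentioned earlier in this section as the intended remedy for partial string fills, can be sketched in Python as follows. This is not the code that was written for MucBruce: the part of speech labels are Penn-style tags chosen for illustration (the tag set actually produced by POST may differ), and the function name and expansion rules are assumptions.

```python
# Assumed part of speech labels; hypothetical, for illustration only.
NP_LEFT = {"DT", "JJ", "NN", "NNS", "NNP"}  # may precede the head inside a noun phrase
NP_RIGHT = {"NN", "NNS", "NNP"}             # may follow the head inside a noun phrase


def glue_noun_phrase(tagged, head):
    """tagged: list of (word, pos) pairs; head: index of the word that
    received a general semantic tag. Returns the surrounding noun phrase."""
    start = head
    while start > 0 and tagged[start - 1][1] in NP_LEFT:
        start -= 1
    end = head
    while len(tagged) > end + 1 and tagged[end + 1][1] in NP_RIGHT:
        end += 1
    return " ".join(word for word, _ in tagged[start:end + 1])


sentence = [("THE", "DT"), ("URBAN", "JJ"), ("GUERRILLAS", "NNS"),
            ("ATTACKED", "VBD"), ("THE", "DT"), ("EMBASSY", "NN")]
print(glue_noun_phrase(sentence, 2))
```

With such a step the filler becomes the whole phrase around the tagged word rather than the single word that happened to match a generalized lexical entry.</Paragraph>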
  </Section>
</Paper>