<?xml version="1.0" standalone="yes"?> <Paper uid="M92-1014"> <Title>&lt;\human> POLICE |type(['LAW ENFORCEMENT','NOUN'])| &lt;\endhuman> SOURCES HAVE REPORTED THAT THE EXPLOSION CAUSED SERIOUS A FLAG FROM THE &lt;\organ> MANUEL RODRIGUEZ PATRIOTIC FRONT |type(['TERRORIST','NAME'])| &lt;\endorgan> (&lt;\organ> FPMR |type(['TERRORIST','NAME'])| &lt;\endorgan>) WAS FOUND AT THE SCENE OF THE EXPLOSION. THE &lt;\organ> FPMR |type(['TERRORIST','NAME'])| &lt;\endorgan> IS A CLANDESTINE LEFTIST &lt;\organ> GROUP |type(['OTHER','NOUN'])| &lt;\endorgan> THAT PROMOTES &quot;ALL FORMS OF</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> The Computing Research Laboratory (New Mexico State University) and the Computer Science Department (Brandeis University) are collaborating on the development of a system (DIDEROT) to perform data extraction for the Tipster project. This system is still far from fully developed, but as many of the techniques being used are domain-independent and, in many cases, language-independent, we have assembled them in a preliminary manner to produce a prototype system (MucBruce¹), which handles the MUC-4 texts.</Paragraph> <Paragraph position="1"> The overall system architecture is shown in Figure 1.</Paragraph> <Paragraph position="2"> The development of the software and data used for MucBruce has been carried out over a three-month period beginning at the end of February 1992. The present version of the system relies extensively on statistically based measures of relevance made at both the text and the paragraph level. Texts are tagged for a variety of features by a pipeline of processes. The marked texts and the paragraph relevancy information are then used to scan around keywords for appropriate slot-filling strings. 
Since the dry run, the system has been augmented with a parser that processes sentences containing a word with an associated Generative Lexical Semantic (GLS) definition. This component was added by Brandeis late in the development process and has access to approximately 20 lexical definitions.</Paragraph> <Paragraph position="3"> Our results reflect the extremely simplistic approach to identifying the slot fills in a text. We feel confident, however, that an expansion of the coverage of our GLS entries and the addition of further constraints to prevent template overgeneration will produce significant improvements. We have created a set of tagging and statistical techniques which will apply to any text type, given appropriate training data.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> SYSTEM FEATURES </SectionTitle> <Paragraph position="0"> The system consists of three front-end components, all of which are C or Lex programs: * A text relevancy marker * A paragraph relevancy marker * A text tagging pipeline and two MUC-specific Prolog programs: * A template constructor * A template formatter ¹ We seem to have adopted a philosophical stance for our system nomenclature, and this particular Australian philosopher seemed to embody some of the ad hoc notions which, at the moment, glue our system together. &lt;\endorgan> THAT PROMOTES &quot;ALL FORMS OF STRUGGLE&quot; AGAINST THE &lt;\organ> MILITARY |type(['MILITARY', 
'NOUN'])| &lt;\endorgan> &lt;\organ> GOVERNMENT |type(['GOVERNMENT', 'NOUN'])| &lt;\endorgan> HEADED BY &lt;\human> GENERAL |type(['MILITARY', 'NOUN', 'RANK'])| &lt;\endhuman> AUGUSTO PINOCHET.</Paragraph> <Paragraph position="1"> &lt;\human> POLICE |type(['LAW ENFORCEMENT','NOUN'])| &lt;\endhuman> SOURCES HAVE REPORTED THAT THE EXPLOSION CAUSED SERIOUS A FLAG FROM THE &lt;\organ> MANUEL RODRIGUEZ PATRIOTIC FRONT |type(['TERRORIST','NAME'])| &lt;\endorgan> (&lt;\organ> FPMR |type(['TERRORIST','NAME'])| &lt;\endorgan>) WAS FOUND AT THE SCENE OF THE EXPLOSION. THE &lt;\organ> FPMR |type(['TERRORIST','NAME'])| &lt;\endorgan> IS A CLANDESTINE LEFTIST &lt;\organ> GROUP |type(['OTHER','NOUN'])| &lt;\endorgan> THAT PROMOTES &quot;ALL FORMS OF STRUGGLE&quot; AGAINST THE &lt;\organ> MILITARY |type(['MILITARY','NOUN'])| &lt;\endorgan> &lt;\organ> GOVERNMENT |type(['GOVERNMENT','NOUN'])| &lt;\endorgan> ==uHEADED= BY &lt;\human> GENERAL |type(['MILITARY','NOUN','RANK'])| &lt;\endhuman> ==nAUGUSTO===uPINOCHET=.</Paragraph> <Paragraph position="2"> Figure 1: MucBruce - System Overview. One of our principal intentions is to automate as much as possible all the processes associated with the creation of a text extraction system. Our statistical techniques for relevant text recognition use word lists which are automatically derived from training data. Our text tagger uses proper-name information derived from the key templates, and the other taggers, for human names and dates, are largely domain independent. We intend to derive the entire core lexicon for the system from Machine Readable Dictionaries and then to tune it against appropriate corpora.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> OFFICIAL RESULTS </SectionTitle> <Paragraph position="0"> Our results are shown in Tables 1 and 2. The results for test 4 are much poorer than those for test 3. We have not established any specific causes for this difference. 
For most of the individual slots we see some improvement in recall and a greater improvement in precision over the results of the dry run test. The MucBruce system is not parameterized in any way to affect recall or precision. Changing these would require modifying the parameters given to the text statistics programs. For MUC-4 we tried to improve precision at the expense of some recall. It is extremely difficult to measure the accuracy of the template-predicting programs, as their performance can easily be masked by errors occurring in the template-producing sections of the system. We need to run separate tests of these components to establish exactly how the text statistics, text marking and template-producing components each affect performance. We have not yet, however, had time to carry out these tests on the new MUC-4 data.</Paragraph> </Section> <Section position="4" start_page="0" end_page="121" type="metho"> <SectionTitle> EFFORT SPENT </SectionTitle> <Paragraph position="0"> Approximately ten people have worked at one time or another on the MUC-4 system over the last three months. They were all, however, also working on other projects over this period. A rough estimate of the time involved would be six person-months. The major areas of work were developing and refining the statistical techniques, designing and developing the tagging software, and implementing a system which could use our current incomplete set of components. Work also went into designing and implementing an appropriate form for the Generative Lexical Semantic entries.</Paragraph> <Paragraph position="1"> Our limiting factor was definitely time. In the last month we generalized the lexical entries in our tagging file. This meant our system was often likely to recognize partial strings as being appropriate fillers (e.g. GUERILLAS). 
We intended to avoid this problem by incorporating the BBN part-of-speech tagger (POST) into our MUC-4 system and writing code to glue together noun phrases occurring around our new general tags. All this code was written and tested just before the MUC-4 final test, but we were unable to incorporate it in time.</Paragraph> <Paragraph position="2"> The training texts were used to generate our statistical information and word lists. The methods used are automatic and require only the setting of thresholds for word selection.</Paragraph> <Paragraph position="3"> The system has improved its performance slightly since the dry run test. Many of our changes in isolation are detrimental and require the addition of other techniques to establish their usefulness.</Paragraph> </Section> </Paper>