<?xml version="1.0" standalone="yes"?>
<Paper uid="M91-1034">
  <Title>Messages Indexing Module System Vocabulary Indexed Training Set Messages Training Set Key Template Learning Modules. Concept Rule Vectors Test Message Filtering Module Message Relevance Template Filler Module Slot fill vs Paragraph Relevance Filled Templates Rule</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
DESCRIPTION OF THE UNL/USL SYSTEM USED
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
BACKGROUND
</SectionTitle>
    <Paragraph position="0"> The MUC-3 task consists of generating a database of filled templates for messages that belong to a general topical domain. In particular, for the current phase, the message collection belongs to the domain of terrorist activities. First, a decision must be made as to the relevance of a message to a specified class of terrorist events. If the message is relevant, a predefined set of facts is to be extracted and placed as fills for the appropriate slots of the template(s) created for this message. If it is not relevant, a template having a '*' as the fill in all but one slot is created (see Appendix A for details). Some aspects of the MUC-3 task are amenable to techniques typically employed in information retrieval (IR). These techniques are designed to be applicable to any domain. In contrast, other aspects of the problem may require a great deal of language understanding and thus natural language processing (NLP) techniques. For the most part, NLP techniques may be considered domain dependent.</Paragraph>
    <Paragraph position="1"> The primary thrust of our effort has been to design and implement a system that employs techniques typically found in the IR literature, augmented by basic search techniques available in file management systems. An important goal (for the time being) is to ensure that the system is domain independent to the greatest extent possible. Consequently, certain slots that are not suited to the chosen techniques are not filled.</Paragraph>
    <Paragraph position="2"> In the context of the MUC-3 task, slots fall into one of four categories depending on the type of fill applicable to them. Our system is designed to handle slots whose values are from a set-list. More specifically, we process TYPE OF INCIDENT, CATEGORY OF INCIDENT, PERPETRATOR: CONFIDENCE, PHYSICAL TARGET TYPE, HUMAN TARGET TYPE, INSTRUMENT: TYPE(S), EFFECT ON PHYSICAL TARGET(S) and EFFECT ON HUMAN TARGET(S). In addition, two slots whose fills are of string type are also processed. These are PERPETRATOR: ID OF ORG(S) and LOCATION OF INCIDENT.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="234" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> As will be explained later, the system consists of an Indexing Module, a Learning Module, a Filtering Module, and a Template Filler Module. We had previously developed and experimentally validated indexing and learning techniques for use in information retrieval and classification.</Paragraph>
    <Paragraph position="1"> These techniques were adapted to build the indexing and learning modules for MUC-3. In addition, the other modules were developed and implemented. This site did not participate in either MUC-1 or MUC-2.</Paragraph>
  </Section>
  <Section position="4" start_page="234" end_page="237" type="metho">
    <SectionTitle>
OVERVIEW OF THE SYSTEM
</SectionTitle>
    <Paragraph position="0"> A popular strategy in IR is to formulate the problem of identifying items relevant to a subject area as one of conceptual categorization. Each subject area of interest is treated as a concept or a class.</Paragraph>
    <Paragraph position="1"> Example items relevant to a certain concept are assumed to be given. Based on this information, and using techniques for learning from examples, a concept characterization rule that is optimal in a certain precise sense is derived. In other words, retrieval of relevant items is viewed as a &quot;recognition&quot; problem.</Paragraph>
    <Paragraph position="2"> Our system employs the above idea by mapping possible fill values of set-list type slots to concepts of interest. For example, for the TYPE OF INCIDENT slot, fill values such as ARSON, MURDER, BOMBING, etc. are the concepts to be learned. Note that the question of whether the concept ARSON is applicable to a message is equivalent to deciding whether the message belongs to the message class identified by the label ARSON. Thus, both the template filling task and the decision of whether a message is relevant to the MUC-3 task are treated as problems of conceptual categorization.</Paragraph>
    <Paragraph position="3"> For each concept that the system considers applicable to a message, the system also keeps track of the extent to which each paragraph in the message contributed to this decision.</Paragraph>
    <Paragraph position="4"> Judicious use of this information enables several important activities, such as choosing the &quot;best&quot; fill for a slot from among alternatives, linking fills to templates when more than one template must be generated for the same message, and filling the two string type slots.</Paragraph>
    <Paragraph position="5"> The general architecture of the system is presented in Figure 1. There are four major subsystems: the Indexing Module, the Learning Module, the Filtering Module, and the Template Filler Module. Each of these subsystems is outlined next.</Paragraph>
    <Section position="1" start_page="234" end_page="237" type="sub_section">
      <SectionTitle>
Indexing Module
</SectionTitle>
      <Paragraph position="0"> The function of the Indexing Module is to generate a representation for each message. A message is represented by a vector of weights. Each weight either indicates the presence or absence of a term in the message or expresses the importance of the term to the message. A term is either a single-order term or a high-order term (i.e., a single word or a word combination representing a phrase).</Paragraph>
      <Paragraph position="1"> For the assignment of single terms to messages, the indexing module from the SMART Retrieval System [1] is used. This module uses a stop list to filter out common words, and the &quot;no stemming&quot; option is chosen. All terms that this module assigns a weight larger than a threshold are retained in the message representation vector.</Paragraph>
      <Paragraph position="2"> For phrase extraction, a modified version of the INDEX software, developed and implemented by Jones, et al. [2,3], is used. INDEX is used mainly to extract all possible substrings that are within certain minimum and maximum length specifications and are not substrings of other previously selected substrings. Several strategies for filtering these to identify &quot;good&quot; phrases are provided as part of the software developed for the MUC-3 project.</Paragraph>
      <Paragraph position="3"> Thus, each element of the vector representing a message corresponds to either a single term or a phrase. Phrase identification is expected to be important as a precision-improving device. This module also generates the system vocabulary, which consists of all the distinct single terms and phrases used in representing the messages.</Paragraph>
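The message representation described above can be illustrated with a minimal sketch. It does not reproduce the SMART or INDEX software: the stop list, the regular-expression tokenizer, the use of raw counts as weights, and the restriction of phrases to runs of adjacent non-stop words are all assumptions made for illustration, and the names `index_message` and `build_vocabulary` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop list; the actual SMART stop list is much larger.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "was", "by"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_message(text, threshold=0.0, max_phrase_len=2):
    """Return a sparse vector {term or phrase: weight} for one message.

    Weights are raw occurrence counts (an assumption). Entries whose weight
    is not larger than `threshold` are dropped.
    """
    tokens = tokenize(text)
    weights = Counter(t for t in tokens if t not in STOP_WORDS)
    # High-order terms: adjacent word runs that contain no stop word.
    for n in range(2, max_phrase_len + 1):
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if not any(w in STOP_WORDS for w in window):
                weights[" ".join(window)] += 1
    return {term: w for term, w in weights.items() if w > threshold}


def build_vocabulary(vectors):
    """System vocabulary: all distinct single terms and phrases, sorted."""
    return sorted(set().union(*vectors))
```

Calling `index_message` on each message and passing the resulting vectors to `build_vocabulary` yields the single shared term list over which the concept rule vectors are defined.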
      <Paragraph position="4"> Learning Module. The function of the Learning Module is to derive the concept categorization rules for the various concepts of interest. Each rule is a vector of numeric weights whose elements correspond to the terms in the system vocabulary.</Paragraph>
      <Paragraph position="5"> This module also includes components for selecting a training set from the development set, identifying the concepts for which the training set has at least a minimum number of positive examples (i.e., the learnable concepts), and preparing the grid file, which shows, for each message in the training set, which of the learnable concepts are applicable. The source for this information is the set of key templates manually generated for the 1300 messages in the development set.</Paragraph>
      <Paragraph position="6"> The concept rule vectors are derived by employing the perceptron learning algorithm [4]. The algorithm is simple and efficient. The procedure is incremental in that the rule can be updated as new examples become available. As long as a decision boundary separating the positive from the negative examples exists, this algorithm is guaranteed to find one and terminate.</Paragraph>
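As an illustration of the learning step, the following is a minimal sketch of the classical perceptron update on sparse message vectors. It is a textbook formulation, not the system's actual code: the bias term, the learning rate, the epoch limit, and the function names are assumptions.

```python
def activation(rule, message):
    """Dot product of a concept rule vector and a sparse message vector."""
    return sum(w * rule.get(term, 0.0) for term, w in message.items())


def train_perceptron(examples, epochs=100, rate=1.0):
    """Learn one concept rule vector from (message_vector, is_positive) pairs.

    Returns (rule, bias). Training stops early once a full pass over the
    examples produces no mistakes, which is guaranteed to happen eventually
    only when the examples are linearly separable.
    """
    rule = {}
    bias = 0.0  # assumption: the paper does not mention a bias term
    for _ in range(epochs):
        mistakes = 0
        for message, is_positive in examples:
            predicted = activation(rule, message) + bias > 0
            if predicted != is_positive:
                # Move the boundary toward (or away from) the misclassified example.
                sign = 1.0 if is_positive else -1.0
                for term, w in message.items():
                    rule[term] = rule.get(term, 0.0) + rate * sign * w
                bias += rate * sign
                mistakes += 1
        if mistakes == 0:
            break
    return rule, bias
```

Because each update touches only the terms present in the misclassified message, new examples can be folded in later by continuing from the stored rule, which matches the incremental behavior noted above.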
      <Paragraph position="7"> Usually, the decision boundary constructed is a hyperplane. However, since the system vocabulary includes phrases, and phrases incorporate dependency information between single terms, our result is equivalent to constructing a non-linear boundary. In the terminology of connectionist networks, we employ a single-layer, high-order perceptron. The single-layer option facilitates fast learning, while the high-order option enables the use of more powerful separation boundaries. Furthermore, the concept rule vectors are connectionist rather than symbolic in nature. Such rules are more attractive when a large number of features are involved and when robustness against noise in the features is crucial.</Paragraph>
      <Paragraph position="8"> In addition to the concepts associated with slot fills, another concept known as &quot;optimal-query&quot; is also derived. This rule vector distinguishes messages that are not relevant to the MUC-3 task from those that generate at least one template. The system is set up in such a way that the training set of messages for deriving this optimal-query vector can be different from that used for the other concepts.</Paragraph>
      <Paragraph position="9"> Filtering Module. The Filtering Module is responsible for identifying the concepts applicable to a set of test messages and deciding whether a message is relevant to the MUC-3 task. The major subsystems of this module are concerned with test message indexing, assessment of concept relevance, and the evaluation of a rule base by means of an inference engine.</Paragraph>
      <Paragraph position="10"> Test message indexing determines which of the single terms and phrases in the system vocabulary are contained in the message. This process generates a message vector that is matched against each of the concept rule vectors to determine the corresponding activation values. The distribution of the activation values for the test set of messages relative to each concept is analyzed to determine a threshold. A concept is considered relevant to a message if the corresponding activation value exceeds the threshold chosen for that concept. Depending on the concepts applicable to a message, the inference engine activates the appropriate rules of the rule base, whose terminal symbols correspond to the various concepts acquired. The rule base expresses the requirements, in terms of concept combinations, that imply that a message is relevant to the MUC-3 task when they are present in it. The module also identifies, for each message, the extent to which its paragraphs contributed to the activation values relative to the different concepts. This result is referred to as the concept vs. paragraph relevance vector.</Paragraph>
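The filtering decision can be sketched as follows, under stated assumptions. Thresholds are taken as given (the analysis that derives them from the activation distribution is not reproduced), the rule base is modeled as a Boolean AND/OR tree with 0/1 weights, and the tree shape, the rule weights, and the function names are hypothetical rather than taken from the system.

```python
def dot(rule, message):
    """Activation value: dot product of a rule vector and a message vector."""
    return sum(w * rule.get(term, 0.0) for term, w in message.items())


def activated_concepts(message, rules, thresholds):
    """Return the set of concepts whose activation exceeds their threshold."""
    activations = {concept: dot(rule, message) for concept, rule in rules.items()}
    return {c for c, value in activations.items() if value > thresholds[c]}


def is_relevant(node, active):
    """Evaluate an AND/OR tree whose leaves are concept names.

    A node is either a concept name (leaf) or a pair (operator, children)
    with operator "AND" or "OR".
    """
    if isinstance(node, str):
        return node in active
    operator, children = node
    results = [is_relevant(child, active) for child in children]
    return all(results) if operator == "AND" else any(results)


# Hypothetical rule base: relevant if optimal-query fires, or if an
# incident type fires together with TERRORIST ACT.
RULE_BASE = ("OR", ["optimal-query",
                    ("AND", [("OR", ["BOMBING", "MURDER"]), "TERRORIST ACT"])])
```

A message is then deemed relevant when `is_relevant(RULE_BASE, activated_concepts(...))` evaluates to true at the root of the tree.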
      <Paragraph position="11"> For slots of string fill type, a database of possible fill values, grouped by slot name, is provided as input to this module. For each string in the database for which at least one match is found in the message, the paragraphs in which a match is found and the frequency of its occurrence in each paragraph are determined.</Paragraph>
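A minimal sketch of this string lookup is given below. Case-insensitive substring matching and the shape of the returned mapping are assumptions, and `match_fill_strings` is a hypothetical name; the matching procedure actually used by the system is not specified here.

```python
def match_fill_strings(paragraphs, fill_database):
    """Locate candidate string fills in a message.

    paragraphs: list of paragraph texts for one message.
    fill_database: {slot name: [candidate fill strings]}.
    Returns {(slot, candidate): {paragraph index: occurrence count}},
    containing only candidates that match at least once.
    """
    lowered = [text.lower() for text in paragraphs]
    matches = {}
    for slot, candidates in fill_database.items():
        for candidate in candidates:
            needle = candidate.lower()
            counts = {}
            for index, text in enumerate(lowered):
                occurrences = text.count(needle)  # non-overlapping matches
                if occurrences > 0:
                    counts[index] = occurrences
            if counts:
                matches[(slot, candidate)] = counts
    return matches
```

The per-paragraph counts give each string fill a paragraph profile that can be compared with the paragraph profiles of the activated concepts.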
      <Paragraph position="12"> Template Filler Module. This module is responsible for generating one or more templates for each message determined to be relevant by the Filtering Module and for filling the slots on the basis of the concepts and string fills that are activated.</Paragraph>
      <Paragraph position="13"> For each relevant message, either the optimal-query concept is activated, or one or more incident types are recognized along with a desired combination of concepts, or both. In the case where exactly one incident type is recognized, the following is performed for each of the other slots. If several concepts are activated for the slot and only one value is permitted, the one with the highest activation value is chosen; otherwise, all values are filled.</Paragraph>
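The single-incident filling rule can be written as a short sketch. Which slots permit only one value is passed in as a flag because that information is not given here; the function name and the sorted output order are assumptions.

```python
def fill_slot(activations, single_valued):
    """Choose the fills for one slot.

    activations: {fill value: activation value} for the fills activated
    for this slot. If the slot permits only one value, keep the fill with
    the highest activation; otherwise keep every activated fill.
    """
    if not activations:
        return []
    if single_valued:
        return [max(activations, key=activations.get)]
    return sorted(activations)
```

For example, with activations `{"SOME DAMAGE": 2.1, "DESTROYED": 3.4}`, a single-valued slot keeps only `DESTROYED`, while a multi-valued slot keeps both fills.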
      <Paragraph position="14"> In the case where more than one incident type is activated, the system must decide, for each activated concept, which incident type it is closest to. For this purpose, the concept versus paragraph relevance vector is used. This vector contains the contributions of the various paragraphs in a message to the activation value of the slot fill relative to this message. The paragraph relevance vector of an activated slot fill, say CIVILIAN (from HUMAN TARGET TYPE), is compared to the vector associated with each of the activated incident types, say KIDNAPPING and MURDER. The strength of this match is then used to decide whether the fill CIVILIAN will be used in the KIDNAPPING or the MURDER template.</Paragraph>
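The linking of a fill to one of several incident templates might be sketched as below. The measure of match strength is not stated in the text, so cosine similarity is used purely as an assumption; the function names and the example numbers are hypothetical.

```python
import math


def paragraph_relevance(rule, paragraph_vectors):
    """Contribution of each paragraph to a concept's activation value."""
    return [sum(w * rule.get(term, 0.0) for term, w in paragraph.items())
            for paragraph in paragraph_vectors]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm > 0 else 0.0


def assign_fill(fill_vector, incident_vectors):
    """Return the incident type whose paragraph profile best matches the fill."""
    return max(incident_vectors,
               key=lambda name: cosine(fill_vector, incident_vectors[name]))
```

With made-up profiles `{"KIDNAPPING": [3.0, 0.0, 0.5], "MURDER": [0.0, 2.0, 1.0]}` and a CIVILIAN profile of `[0.2, 1.5, 0.8]`, the fill is assigned to the MURDER template because both are supported mainly by the second paragraph.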
      <Paragraph position="15"> If a message is deemed relevant only because of optimal-query, the other slots having activated fills are still filled, even though no TYPE OF INCIDENT may have been activated.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="237" end_page="237" type="metho">
    <SectionTitle>
SYSTEM WALKTHROUGH
</SectionTitle>
    <Paragraph position="0"> The system walkthrough explains how the message TST1-MUC3-0099 is processed. The result obtained corresponds to the parameter settings used in our Option 4 (see the report UNL/USL: MUC-3 Test Results and Analysis). In this option, Training Set 2 is used for determining the rule vector for optimal-query and Training Set 3 is used for the other concepts. The threshold used for deciding whether a concept is activated is based on an analysis of the distribution of the activation values of this concept relative to the test set messages (threshold setting T1).</Paragraph>
    <Paragraph position="1"> Table 1 lists all set-list type fills and those that are actually learnable on the basis of Training Set 3. The concept rule vectors for each of these fills are constructed by using the indexing and learning modules. The test message is indexed and the dot product of its representation vector with each of the concept rule vectors is computed. The activation values so obtained are compared to the corresponding threshold values. Table 2 shows that, for the current message, the following five concepts are activated: BOMBING, TERRORIST ACT, TRANSPORT VEHICLE, SOME DAMAGE, and the optimal-query.</Paragraph>
    <Paragraph position="2"> These concepts activate the appropriate leaf nodes of the AND/OR tree associated with the rule base shown in Table 4. This results in the root node getting the value &quot;true&quot; and, therefore, this message is deemed relevant. For the current testing, the rule base is defined with all the concept weights being either 0 or 1. The inference engine is, however, capable of handling any numeric weights between 0 and 1. The vector representations of the paragraphs in the message are also multiplied by the concept rule vectors to obtain the paragraph vs. concept relevance vector (Table 3). This paragraph information is not useful in this case, since there is no slot permitting only one fill for which several fills are activated, nor is there an indication, in terms of INCIDENT TYPE activations, that multiple templates should be created.</Paragraph>
    <Paragraph position="3"> For the two string fill slots, the matching strings along with their occurrence frequencies in the various paragraphs are shown in Table 5. The paragraph vector for BOMBING is found to match the paragraph vector of &quot;POLICE&quot; better (wrong decision!). All 3 incident locations have a positive activation value with BOMBING. Since the location slot permits multiple fills, all three may be retained. However, since &quot;PRC&quot; is not one of the South American countries, it is discarded.</Paragraph>
    <Paragraph position="4"> The filled template for this message is shown in Table 6. This template most closely matches the key template numbered 2 (see Appendix H). The paragraph relevance vector matching technique needs to be refined, as evidenced by the choice of &quot;POLICE&quot; as the perpetrator organization. Furthermore, the Template Filler Module should be refined to automatically determine and incorporate in the filling process various dependencies between template fills. For example, &quot;POLICE&quot; is inconsistent with CATEGORY OF INCIDENT being TERRORIST ACT.</Paragraph>
  </Section>
  <Section position="6" start_page="237" end_page="238" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> By proper modification of the stop list used during phrase extraction, phrases such as NO INJURY could be extracted. The optimal-query vector identifies relevant passages fairly accurately. Careful, detailed analysis of individual instances should lead to many ideas for improvement.</Paragraph>
  </Section>
</Paper>