<?xml version="1.0" standalone="yes"?>
<Paper uid="X98-1014">
  <Title>ALGORITHMS THAT LEARN TO EXTRACT INFORMATION - BBN: TIPSTER PHASE III</Title>
  <Section position="4" start_page="0" end_page="75" type="metho">
    <SectionTitle>
STATISTICAL EXTRACTION OF
ENTITIES AND RELATIONS
</SectionTitle>
    <Paragraph position="0"> The SIFT system (&quot;Statistically-derived Information From Text&quot;) combines a sentence-level model with message-level processing to merge elements and identify cross-sentence relations.</Paragraph>
    <Paragraph position="1"> At the sentence level, SIFT employs a unified statistical process to map from words to semantic structures. That is, part-of-speech determination, name-finding, parsing, and relationship-finding  all happen as part of the same process. This allows each element of the model to influence the others, and avoids the assembly-line trap of having to commit to a particular part-of-speech choice, say, early on in the process, when only local information is available to inform the choice.</Paragraph>
    <Paragraph position="2"> The SIFT sentence-level model was trained from two sources: * General knowledge of English sentence structure was learned from the Penn Treebank corpus of one million words of Wall Street Journal text.</Paragraph>
    <Paragraph position="3"> * Specific knowledge about how the target entities and relations are expressed in English was learned from about 500 K words of on-domain text annotated with named entities, descriptors, and semantic relations.</Paragraph>
    <Paragraph position="4"> In the on-domain training data, the names and descriptors of relevant items (persons, organizations, locations, and artifacts) are marked, as well as the target relationships between them that are signaled syntactically. For example, in the phrase &quot;GTE Corp. of Stamford&quot;, the annotation would record a &quot;location-of&quot; connection between the company and the city. The model can thus learn the structures that are typically used in English to convey the target relationships. Doing extraction in a new domain would require fresh semantically annotated training data appropriate to the new domain, but the general syntactic knowledge acquired from the Penn Treebank would still be applicable.</Paragraph>
    <Paragraph position="5"> After the sentence-level model has identified names, descriptors, and relationships that are syntactically signaled within each sentence, further message-level processing is required to link up entities mentioned more than once or in different sentences, and to try to identify cross-sentence relationships or those not syntactically signaled. After the names, descriptors, and local relationships have been extracted from the sentence-level decoder's output, a merging process is applied to link multiple occurrences of the same name or of alternative forms of the name from different sentences. A second, cross-sentence model is then invoked to try to identify relationships that were not picked up by the decoder, such as when the two entities do not occur in the same sentence. Finally, some additional fields required by the MUC answer specification are filled in using heuristic tests and a gazetteer database, and output filters are applied to select which of the proposed internal structures should be included in the output. We are actively exploring ways of integrating this message-level processing more closely with the sentence-level model, since an integrated statistical model is the only way in which to make every choice in a nuanced way, based on all the available information.</Paragraph>
    <Paragraph position="6"> The following sections describe the sentence-level and message-level processing of the SIFT system in more detail.</Paragraph>
  </Section>
  <Section position="5" start_page="75" end_page="83" type="metho">
    <SectionTitle>
SIFT's Sentence-Level Model
</SectionTitle>
    <Paragraph position="0"> Figure 1 is a block diagram of the sentence-level model showing the main components and data paths. Two types of annotations are used to train the model: syntactic annotations for learning about the general structure of English, and semantic annotations for learning about the target entities and relations. From these annotations, the training program estimates the parameters of a unified statistical model that accounts for both syntax and semantics. Later, when presented with a new sentence, the search program explores the statistical model to find the most likely combined semantic and syntactic interpretation.</Paragraph>
    <Section position="1" start_page="75" end_page="76" type="sub_section">
      <SectionTitle>
Training Data
</SectionTitle>
      <Paragraph position="0"> Our source for syntactically annotated training data was the Penn Treebank (Marcus et al., 1993). Significantly, we do not require that syntactic annotations be from the same source, or cover the same domain, as the target task. For example, while the Penn Treebank consists of Wall Street Journal text, the target source for this evaluation was New York Times newswire.</Paragraph>
      <Paragraph position="1"> Similarly, although the Penn Treebank domain covers general and financial news, the target domain for the MUC-7 evaluation was space technology. The ability to use syntactic training from a different source and domain than the target is an important feature of our model.</Paragraph>
      <Paragraph position="2"> Since the Penn Treebank serves as our syntactically annotated training corpus, we need only create a semantically annotated corpus.</Paragraph>
      <Paragraph position="3"> Stated generally, semantic annotations serve to denote the entities and relations of interest in the  target domain. More specifically, entities are marked as either names or descriptors, with co-reference between entities marked as well.</Paragraph>
      <Paragraph position="4"> Figure 2 shows a semantically annotated fragment of a typical sentence.</Paragraph>
      <Paragraph position="5"> From only these simple semantic annotations, the system can be trained to work in a new domain. To train SIFT for MUC-7, we annotated approximately 500,000 words of New York Times newswire text, covering the domains of air disasters and space technology. (We have not yet run experiments to see how performance varies with more/less training data.)</Paragraph>
    </Section>
    <Section position="2" start_page="76" end_page="77" type="sub_section">
      <SectionTitle>
Semantic/Syntactic Structure
</SectionTitle>
      <Paragraph position="0"> While our semantic annotations are quite simple, the internal model of sentence structure is substantially more complicated, since this combined model must account for syntactic structure as well as for entities and semantic relations. Our underlying training algorithm requires examples of these internal structures in order to estimate the parameters of the unified semantic/syntactic model. However, we do not wish to incur the high cost of annotating parse trees. Instead, we use the following multi-step training procedure, exploiting the Penn Treebank: 1) Train the sentence-level model on the purely syntactic parse trees in the Treebank. Once this step is complete, the model will function as a state-of-the-art statistical parser.</Paragraph>
      <Paragraph position="1"> 2) For each sentence in the semantically annotated corpus: a) Apply the sentence level model to syntactically parse the sentence, constraining the model to produce only parses that are consistent with the semantic annotation.</Paragraph>
      <Paragraph position="2"> b) Augment the resulting parse tree to reflect semantic structure as well as syntactic structure.</Paragraph>
      <Paragraph position="3"> 3) Retrain the sentence-level model on the augmented parse trees produced in step 2.</Paragraph>
      <Paragraph position="4"> Once this step is complete, we have an integrated model of semantics and syntax.</Paragraph>
      <Paragraph position="5"> Details of the statistical model will be discussed  later. For now, we turn our attention to (a) constraining the decoder and (b) augmenting the parse trees with semantic structure.</Paragraph>
      <Paragraph position="6"> Constraints are simply bracketing boundaries that may not be crossed by any parse constituent. There are two types of constraints: hard constraints that cannot be violated under any conditions, and soft constraints, that may be violated only if enforcing them would result in no plausible parse. All named entities and descriptors are treated as hard constraints; the model is prohibited from producing any constituents that overlap either edge of the span of these elements. In addition, we attempt to keep possible appositives together through soft constraints. Whenever there is a co-referential relation between two entities that are either adjacent or separated by only a comma, we posit an appositive and introduce a soft constraint to encourage the parser to keep the elements together.</Paragraph>
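      <Paragraph> The bracketing constraints described above can be illustrated with a minimal Python sketch. It is not the SIFT implementation: the half-open token spans, the function names crosses and allowed, and the enforce_soft flag (a parser might retry with it set to False when no parse survives) are all assumptions made for illustration.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # half-open token span (start, end); an assumed representation


def crosses(constituent: Span, constraint: Span) -> bool:
    """True if the spans overlap without either one containing the other."""
    a, b = constituent
    c, d = constraint
    # Either the constituent starts first and ends inside the constraint,
    # or the constraint starts first and ends inside the constituent.
    return (c > a and b > c and d > b) or (a > c and d > a and b > d)


def allowed(constituent: Span,
            hard: Iterable[Span],
            soft: Iterable[Span],
            enforce_soft: bool = True) -> bool:
    """Reject constituents that cross a hard constraint, and (optionally) a soft one."""
    if any(crosses(constituent, h) for h in hard):
        return False
    if enforce_soft and any(crosses(constituent, s) for s in soft):
        return False
    return True


# Tokens: 0 "Lt."  1 "Cmdr."  2 "David"  3 "Edwin"  4 "Lewis"
hard_constraints = [(0, 2), (2, 5)]   # descriptor span and name span
soft_constraints = []                 # no appositive in this phrase

whole_phrase_ok = allowed((0, 5), hard_constraints, soft_constraints)
straddling_ok = allowed((1, 3), hard_constraints, soft_constraints)
```

 Here a constituent covering the whole phrase is permitted, while one covering "Cmdr. David" is rejected because it overlaps the edge of both annotated spans.</Paragraph>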
      <Paragraph position="7"> Once a constrained parse is found, it must be augmented to reflect the semantic structure.</Paragraph>
      <Paragraph position="8"> Augmentation is a five-step process.</Paragraph>
      <Paragraph position="9"> 1) Nodes are inserted into the parse tree to distinguish names and descriptors that are not bracketed in the parse. For example, the parser produces a single noun phrase with no internal structure for &quot;Lt. Cmdr. David Edwin Lewis&quot;. Additional nodes must be inserted to distinguish the descriptor, &quot;Lt. Cmdr.,&quot; and the name, &quot;David Edwin Lewis.&quot; 2) Semantic labels are attached to all nodes that correspond to names or descriptors. These labels reflect the entity type, such as person, organization, or location, as well as whether the node is a proper name or a descriptor.</Paragraph>
      <Paragraph position="10"> 3) For relations between entities, where one entity is not a syntactic modifier of the other, the lowermost parse node that spans both entities is identified. A semantic tag is then added to that node denoting the relationship. For example, in the sentence &quot;Mary Fackler Schiavo is the inspector general of the U.S. Department of Transportation,&quot; a co-reference semantic label is added to the S node spanning the name, &quot;Mary Fackler Schiavo,&quot; and the descriptor, &quot;the inspector general of the U.S.</Paragraph>
    </Section>
    <Section position="3" start_page="77" end_page="77" type="sub_section">
      <SectionTitle>
Department of Transportation.&quot;
</SectionTitle>
      <Paragraph position="0"> 4) Nodes are inserted into the parse tree to distinguish the arguments to each relation.</Paragraph>
      <Paragraph position="1"> In cases where there is a relation between two entities, and one of the entities is a syntactic modifier of the other, the inserted node serves to indicate the relation as well as the argument. For example, in the phrase &quot;Lt. Cmdr. David Edwin Lewis,&quot; a node is inserted to indicate that &quot;Lt. Cmdr.&quot; is a descriptor for &quot;David Edwin Lewis.&quot; 5) Whenever a relation involves an entity that is not a direct descendant of that relation in the parse tree, semantic pointer labels are attached to all of the intermediate nodes.</Paragraph>
      <Paragraph position="2"> These labels serve to form a continuous chain between the relation and its argument.</Paragraph>
      <Paragraph position="3"> [Figure caption fragment; the figure and the start of its caption were lost in extraction.] ... ending in &quot;-r&quot; mark MUC reportable names and descriptors.</Paragraph>
    </Section>
    <Section position="4" start_page="77" end_page="80" type="sub_section">
      <SectionTitle>
Statistical Model
</SectionTitle>
      <Paragraph position="0"> In SIFT's statistical model, augmented parse trees are generated according to a process similar to that described in Collins (1996, 1997). For each constituent, the head is generated first, followed by the modifiers, which are generated from the head outward. Head words, along with their part-of-speech tags and features, are generated for each modifier as soon as the modifier is created. Word features are introduced primarily to help with unknown words, as in Weischedel et al. (1993).</Paragraph>
      <Paragraph position="1"> We illustrate the generation process by walking through a few of the steps of the parse shown in the figure. At each step, a choice is made from a statistical distribution, with the probability of each possible selection dependent on particular features of previously-generated elements. We pick up the derivation just after the topmost S and its head word, said, have been produced. The next steps are to generate in order: 1. A head constituent for the S, in this case a VP.</Paragraph>
      <Paragraph position="2"> 2. Pre-modifier constituents for the S. In this case, there is only one: a PER/NP.</Paragraph>
      <Paragraph position="3"> 3. A head part-of-speech tag for the PER/NP,  in this case PER/NNP.</Paragraph>
      <Paragraph position="4"> 4. A head word for the PER/NP, in this case nance.</Paragraph>
      <Paragraph position="5"> 5. Word features for the head word of the PER/NP, in this case capitalized.</Paragraph>
      <Paragraph position="6"> 6. A head constituent for the PER/NP, in this case a PER-R/NP.</Paragraph>
      <Paragraph position="7"> 7. Pre-modifier constituents for the PER/NP.  In this case, there are none.</Paragraph>
      <Paragraph position="8"> 8. Post-modifier constituents for the PER/NP. First a comma, then an SBAR structure, and then a second comma are each generated in turn.</Paragraph>
      <Paragraph position="9"> This generation process is continued until the entire tree has been produced.</Paragraph>
      <Paragraph position="10"> We now briefly summarize the probability structure of the model. The categories for head constituents, $c_h$, are predicted based solely on the category of the parent node, $c_p$: $P(c_h \mid c_p)$, e.g. $P(\text{vp} \mid \text{s})$. Modifier constituent categories, $c_m$, are predicted based on their parent node, $c_p$, the head constituent of their parent node, $c_{hp}$, the previously generated modifier, $c_{m-1}$, and the head word of their parent, $w_p$. Separate probabilities are maintained for left (pre) and right (post) modifiers:</Paragraph>
      <Paragraph position="12"> $P_L(c_m \mid c_p, c_{hp}, c_{m-1}, w_p)$, e.g. $P_L(\text{per/np} \mid \text{s}, \text{vp}, \text{null}, \text{said})$, and $P_R(c_m \mid c_p, c_{hp}, c_{m-1}, w_p)$, e.g. $P_R(\text{null} \mid \text{s}, \text{vp}, \text{null}, \text{said})$. Part-of-speech tags, $t_m$, for modifiers are predicted based on the modifier, $c_m$, the part-of-speech tag of the head word, $t_h$, and the head word itself, $w_h$:</Paragraph>
      <Paragraph position="14"> $P(t_m \mid c_m, t_h, w_h)$, e.g. $P(\text{per/nnp} \mid \text{per/np}, \text{vbd}, \text{said})$. Head words, $w_m$, for modifiers are predicted based on the modifier, $c_m$, the part-of-speech tag of the modifier word, $t_m$, the part-of-speech tag of the head word, $t_h$, and the head word itself, $w_h$: $P(w_m \mid c_m, t_m, t_h, w_h)$, e.g.</Paragraph>
      <Paragraph position="15"> $P(\text{nance} \mid \text{per/np}, \text{per/nnp}, \text{vbd}, \text{said})$. Finally, word features, $f_m$, for modifiers are predicted based on the modifier, $c_m$, the part-of-speech tag of the modifier word, $t_m$, the part-of-speech tag of the head word, $t_h$, the head word itself, $w_h$, and whether or not the modifier head word, $w_m$, is known or unknown:</Paragraph>
      <Paragraph position="16"> $P(f_m \mid c_m, t_m, t_h, w_h, \text{known}(w_m))$, e.g.</Paragraph>
      <Paragraph position="17"> $P(\text{cap} \mid \text{per/np}, \text{per/nnp}, \text{vbd}, \text{said}, \text{true})$. The probability of a complete tree is the product of the probabilities of generating each element in the tree. If we generalize the tree components (constituent labels, words, tags, etc.) and treat them all as simply elements, $e$, and treat all the conditioning factors as the history, $h$, we can write the tree probability as: $P(\text{tree}) = \prod_{e \in \text{tree}} P(e \mid h)$.</Paragraph>
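      <Paragraph> As a small illustration of the product over generation events, the following Python sketch scores a handful of the steps from the walk-through. The probability values are invented purely for illustration (the source reports none), and summing logarithms rather than multiplying raw probabilities is a common numerical choice, not something the text specifies.

```python
import math
from typing import Dict, List, Tuple

# An event pairs a generated element with the history it is conditioned on.
Event = Tuple[str, Tuple[str, ...]]


def tree_log_prob(events: List[Event], prob: Dict[Event, float]) -> float:
    """Sum of log P(e | h) over all generation events, i.e. log of their product."""
    return sum(math.log(prob[event]) for event in events)


# Events taken from the walk-through in the text; the numbers are hypothetical.
events: List[Event] = [
    ("vp", ("s",)),                                         # head constituent
    ("per/np", ("s", "vp", "null", "said")),                # left modifier
    ("per/nnp", ("per/np", "vbd", "said")),                 # modifier head tag
    ("nance", ("per/np", "per/nnp", "vbd", "said")),        # modifier head word
    ("cap", ("per/np", "per/nnp", "vbd", "said", "true")),  # word feature
]
hypothetical_prob = dict(zip(events, [0.5, 0.1, 0.8, 0.01, 0.9]))

log_p = tree_log_prob(events, hypothetical_prob)
p = math.exp(log_p)
```

 A full tree would contribute one such event for every constituent, tag, word, and word feature it contains.</Paragraph>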
      <Paragraph position="19"> Training the Model: Maximum likelihood estimates for all model probabilities are obtained by observing frequencies in the training corpus. However, because these estimates are too sparse to be relied upon, they must be smoothed by mixing in lower-dimensional estimates. We determine the mixture weights using the Witten-Bell smoothing method.</Paragraph>
      <Paragraph position="20"> For modifier constituents, the mixture components are: $P'(c_m \mid c_p, c_{hp}, c_{m-1}, w_p) = \lambda_1 P(c_m \mid c_p, c_{hp}, c_{m-1}, w_p) + \lambda_2 P(c_m \mid c_p, c_{hp}, c_{m-1})$. For part-of-speech tags, the mixture components are: $P'(t_m \mid c_m, t_h, w_h) = \lambda_1 P(t_m \mid c_m, w_h) + \lambda_2 P(t_m \mid c_m, t_h) + \lambda_3 P(t_m \mid c_m)$. For head words, the mixture components are: $P'(w_m \mid c_m, t_m, t_h, w_h) = \lambda_1 P(w_m \mid c_m, t_m, w_h) + \lambda_2 P(w_m \mid c_m, t_m, t_h) + \lambda_3 P(w_m \mid c_m, t_m) + \lambda_4 P(w_m \mid t_m)$. Finally, for word features, the mixture components are: [equation lost in extraction]</Paragraph>
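      <Paragraph> The smoothing step can be sketched for the two-level modifier-constituent mixture. The text names Witten-Bell smoothing but does not give the weight formula, so this sketch assumes the textbook form, in which the weight on a context is n / (n + t), with n the number of times the context was seen and t the number of distinct outcomes seen after it. The class name WittenBellLevel, the function smoothed, and the toy training events are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence


class WittenBellLevel:
    """Counts for one conditioning context (one term of the mixture)."""

    def __init__(self) -> None:
        self.context_count: Counter = Counter()
        self.joint_count: Counter = Counter()
        self.distinct = defaultdict(set)

    def observe(self, context: Hashable, outcome: str) -> None:
        self.context_count[context] += 1
        self.joint_count[(context, outcome)] += 1
        self.distinct[context].add(outcome)

    def ml(self, context: Hashable, outcome: str) -> float:
        """Maximum likelihood estimate; 0.0 for an unseen context."""
        n = self.context_count[context]
        return self.joint_count[(context, outcome)] / n if n else 0.0

    def weight(self, context: Hashable) -> float:
        """Textbook Witten-Bell weight n / (n + t); 0.0 for an unseen context."""
        n = self.context_count[context]
        if n == 0:
            return 0.0
        return n / (n + len(self.distinct[context]))


def smoothed(levels: List[WittenBellLevel],
             contexts: Sequence[Hashable],
             outcome: str) -> float:
    """Interpolate from the least specific level up to the most specific (index 0)."""
    p = 0.0
    for level, ctx in zip(reversed(levels), reversed(contexts)):
        lam = level.weight(ctx)
        p = lam * level.ml(ctx, outcome) + (1.0 - lam) * p
    return p


# Toy training events: (parent, head constituent, previous modifier, head word) -> modifier.
data = ([(("s", "vp", "null", "said"), "per/np")] * 3
        + [(("s", "vp", "null", "said"), "np")]
        + [(("s", "vp", "null", "announced"), "np")] * 2)

full, backoff = WittenBellLevel(), WittenBellLevel()
for ctx, cat in data:
    full.observe(ctx, cat)
    backoff.observe(ctx[:3], cat)   # drop the head word for the lower-order estimate

seen = ("s", "vp", "null", "said")
unseen = ("s", "vp", "null", "stated")
p_seen = smoothed([full, backoff], [seen, seen[:3]], "per/np")
p_unseen = smoothed([full, backoff], [unseen, unseen[:3]], "per/np")
```

 With an unseen head word the estimate falls back entirely on the lower-order term. In this sketch the lowest level backs off to zero, so the smoothed values need not sum to one over outcomes.</Paragraph>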
      <Paragraph position="22"> Searching the Model: Given a sentence to be analyzed, the search program must find the most likely semantic and syntactic interpretation. More concretely, it must find the most likely augmented parse tree.</Paragraph>
      <Paragraph position="23"> Although mathematically the model predicts tree elements in a top-down fashion, we search the space bottom-up using a chart based search. The search is kept tractable through a combination of CKY-style dynamic programming and pruning of low probability elements.</Paragraph>
      <Paragraph position="24"> Dynamic Programming: Whenever two or more constituents are equivalent relative to all possible later parsing decisions, we apply dynamic programming, keeping only the most likely constituent in the chart. Two constituents are considered equivalent if:  1. They have identical category labels.</Paragraph>
      <Paragraph position="25"> 2. Their head constituents have identical labels. 3. They have the same head word.</Paragraph>
      <Paragraph position="26"> 4. Their leftmost modifiers have identical labels.</Paragraph>
      <Paragraph position="27"> 5. Their rightmost modifiers have identical  labels.</Paragraph>
      <Paragraph position="28"> Pruning: Given multiple constituents that cover identical spans in the chart, only those constituents with probabilities within a threshold of the highest scoring constituent are maintained; all others are pruned. For purposes of pruning, and only for purposes of pruning, the prior probability of each constituent category is multiplied by the generative probability of that  constituent (Goodman, 1997). We can think of this prior probability as an estimate of the probability of generating a subtree with the constituent category, starting at the topmost node. Thus, the scores used in pruning can be considered as the product of: 1. The probability of generating a constituent of the specified category, starting at the topmost node.</Paragraph>
      <Paragraph position="29"> 2. The probability of generating the structure  beneath that constituent, having already generated a constituent of that category. The outcome of the search process is a tree structure that encodes both the syntactic and semantic structure of the sentence, so that the TE entities and local TR relations can be directly extracted from these sentential trees.</Paragraph>
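      <Paragraph> The two devices that keep the search tractable can be sketched for a single chart cell. This is an illustration under assumptions, not the SIFT search code: the Constituent record, the signature key, and the reading of "within a threshold" as a ratio to the best score are choices made here, and the prior and probability values are invented.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Constituent:
    span: Tuple[int, int]
    label: str        # category label, e.g. "per/np"
    head_label: str   # label of the head constituent
    head_word: str
    left_mod: str     # label of the leftmost modifier
    right_mod: str    # label of the rightmost modifier
    prob: float       # generative probability of the subtree


def signature(c: Constituent) -> tuple:
    """The five equivalence criteria from the text, plus the span (chart cell)."""
    return (c.span, c.label, c.head_label, c.head_word, c.left_mod, c.right_mod)


def dedupe(cands: List[Constituent]) -> List[Constituent]:
    """Dynamic programming step: keep only the most likely of equivalent constituents."""
    best: Dict[tuple, Constituent] = {}
    for c in cands:
        key = signature(c)
        if key not in best or c.prob > best[key].prob:
            best[key] = c
    return list(best.values())


def prune(cands: List[Constituent],
          prior: Dict[str, float],
          threshold: float) -> List[Constituent]:
    """Keep constituents whose prior-weighted score is within a ratio of the best."""
    if not cands:
        return []
    scored = [(prior[c.label] * c.prob, c) for c in cands]
    top = max(score for score, _ in scored)
    return [c for score, c in scored if score >= top * threshold]


cell = [
    Constituent((0, 3), "per/np", "per-r/np", "nance", "null", "null", 0.02),
    Constituent((0, 3), "per/np", "per-r/np", "nance", "null", "null", 0.01),
    Constituent((0, 3), "np", "np", "nance", "null", "null", 0.0001),
]
hypothetical_prior = {"per/np": 0.1, "np": 0.3}

survivors = prune(dedupe(cell), hypothetical_prior, threshold=0.1)
```

 The prior enters only the pruning score; the probability stored on each surviving constituent is still the plain generative probability.</Paragraph>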
      <Paragraph position="30">  The sentence-level model in SIFT predicts names, descriptors, and relationships that are cued by the local sentence structure, but it considers each sentence in isolation. Merging such information between sentences is an important and difficult problem in information extraction. The information that indicates the presence of a template relation is often distributed across multiple sentences, and this merging problem would naturally become even more severe when trying to extract more complex structures like full scenario templates. We have explored various approaches to this merging problem in our TIPSTER research.</Paragraph>
      <Paragraph position="31"> Our overall goal is to use trained and integrated models where possible, particularly for all of the language understanding. For some portions of SIFT's message-level processing, we used hand-written rules combined with external sources like gazetteers. The MUC-7 deadlines caused us to use an existing alias process for merging names rather than implementing a statistical alias procedure. In the current system, simple heuristic code handles the filling of the type and country fields that are required by the MUC specification, and the distinction between substantial and non-substantial descriptors. (The MUC guidelines call for ignoring certain descriptors like &quot;the company&quot;.) A trained cross-sentence relation model is used to identify template relations that link entities across different sentences. This model was trained on 200 articles annotated with full MUC answer keys, so that even non-local relations were marked. (That level of semantic annotation was available for only a small subset of the data used to train the sentence-level model.) The model applies a set of structural and contextual features that help to indicate when such a relation might be present. Feature counts from the training data are used to estimate the probability of a relationship between each possible pair of entities mentioned in separate sentences in the text.</Paragraph>
      <Paragraph position="32"> While the cross-sentence model is currently applied as a separate step after the sentence-level decoding is complete, we are exploring various approaches toward integrating the two models more closely, and also toward doing more of the named entity merging and type field prediction by means of trained models.</Paragraph>
    </Section>
    <Section position="5" start_page="80" end_page="83" type="sub_section">
      <SectionTitle>
Merging Named Entities
</SectionTitle>
      <Paragraph position="0"> The first step in merging the results of the sentence-level model is to group together the different mentions of the same named entity. In SIFT, a set of heuristic rules was used for this.</Paragraph>
      <Paragraph position="1"> Different mentions of the same name (say, different mentions of &quot;IBM&quot;) would be grouped, as would strings that were related in certain predictable ways, for example, by initials (linking &quot;IBM&quot; with &quot;International Business Machines&quot;) or by the addition of a corporate designator (linking &quot;International Business Machines&quot; with &quot;International Business Machines, Inc.&quot;). This merging process also tested whether one name was a prefix of the other, linking &quot;Legg Mason Wood Walker, Inc.&quot; with &quot;Legg Mason&quot;.</Paragraph>
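      <Paragraph> The three alias tests named above (initials, corporate designator, prefix) can be sketched as a pairwise check. The actual SIFT rules are not given in the text, so the designator list, the tokenization, and the function names below are assumptions for illustration only.

```python
import re
from typing import List

# Hypothetical list of corporate designators; the real list is not given in the text.
DESIGNATORS = {"inc", "corp", "co", "ltd"}


def tokens(name: str) -> List[str]:
    """Split a name on whitespace and commas."""
    return [t for t in re.split(r"[\s,]+", name.strip()) if t]


def strip_designator(toks: List[str]) -> List[str]:
    """Drop a trailing corporate designator such as 'Inc.'."""
    if toks and toks[-1].rstrip(".").lower() in DESIGNATORS:
        return toks[:-1]
    return toks


def initials(toks: List[str]) -> str:
    return "".join(t[0].upper() for t in toks)


def same_entity(a: str, b: str) -> bool:
    """Heuristic alias test: exact match, initials, or token-prefix match."""
    ta, tb = strip_designator(tokens(a)), strip_designator(tokens(b))
    if not ta or not tb:
        return False
    la, lb = [t.lower() for t in ta], [t.lower() for t in tb]
    if la == lb:                       # same name, ignoring designator and case
        return True
    if len(ta) == 1 and len(tb) > 1 and ta[0].upper() == initials(tb):
        return True                    # "IBM" vs "International Business Machines"
    if len(tb) == 1 and len(ta) > 1 and tb[0].upper() == initials(ta):
        return True
    short, long_ = sorted((la, lb), key=len)
    return long_[:len(short)] == short  # one name is a prefix of the other
```

 A full merging step would apply such a test across all name mentions in a message and group the mentions it links; a prefix test this permissive would also link any single shared first word, which a real rule set would presumably restrict.</Paragraph>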
      <Paragraph position="2"> The Cross-Sentence Relation Model: The cross-sentence model then uses structural and contextual clues to hypothesize template relations between two elements that are not mentioned within the same sentence. Since 80-90% of the relations found in the answer keys connect two elements that are mentioned in the same sentence, the cross-sentence model has a narrow target to shoot for. Very few of the pairs of entities seen in different sentences turn out to be actually related. This model uses features extracted from related pairs in training data to try to identify those cases.</Paragraph>
      <Paragraph position="3">  It is a classifier model that considers all pairs of entities in a message whose types are compatible with a given relation; for example, a person and an organization would suggest a possible employment relation. For the three MUC-7 relations, it turned out to be somewhat advantageous to build in a functional constraint, so that the model would not consider, for example, a possible employment relation for a person already known from the sentence-level model to be employed elsewhere.</Paragraph>
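      <Paragraph position="4"> Given the measured features for a possible relation, the probability of a relation holding or not holding can be computed as follows: $p(\text{rel} \mid \text{feats}) = \frac{p(\text{feats} \mid \text{rel})\, p(\text{rel})}{p(\text{feats})}$ and $p(\neg\text{rel} \mid \text{feats}) = \frac{p(\text{feats} \mid \neg\text{rel})\, p(\neg\text{rel})}{p(\text{feats})}$.</Paragraph>
      <Paragraph position="6"> If the ratio of those two probabilities, computed as follows, is greater than 1, the model predicts a relation: $\frac{p(\text{rel} \mid \text{feats})}{p(\neg\text{rel} \mid \text{feats})} = \frac{p(\text{feats} \mid \text{rel})\, p(\text{rel})}{p(\text{feats} \mid \neg\text{rel})\, p(\neg\text{rel})}$. We approximate this ratio by assuming feature independence and taking the product of the contributions for each feature: $\frac{p(\text{rel} \mid \text{feats})}{p(\neg\text{rel} \mid \text{feats})} \approx \frac{p(\text{rel}) \prod_i p(\text{feat}_i \mid \text{rel})}{p(\neg\text{rel}) \prod_i p(\text{feat}_i \mid \neg\text{rel})}$.</Paragraph>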
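      <Paragraph> The decision rule above amounts to a naive Bayes odds computation, sketched below. The feature names, the two distance values, and every probability in the tables are hypothetical; they are chosen only so that the arithmetic can be followed by hand.

```python
from typing import Dict, Hashable


def relation_odds(features: Dict[str, Hashable],
                  p_feat_rel: Dict[str, Dict[Hashable, float]],
                  p_feat_norel: Dict[str, Dict[Hashable, float]],
                  prior_rel: float) -> float:
    """Prior odds times the product of per-feature likelihood ratios."""
    odds = prior_rel / (1.0 - prior_rel)
    for name, value in features.items():
        odds *= p_feat_rel[name][value] / p_feat_norel[name][value]
    return odds


# Hypothetical conditional probability tables for two structural features.
p_given_rel = {
    "distance": {"near": 0.7, "far": 0.3},
    "topic_sentence": {True: 0.6, False: 0.4},
}
p_given_norel = {
    "distance": {"near": 0.1, "far": 0.9},
    "topic_sentence": {True: 0.3, False: 0.7},
}

close_pair = {"distance": "near", "topic_sentence": True}
distant_pair = {"distance": "far", "topic_sentence": False}

odds_close = relation_odds(close_pair, p_given_rel, p_given_norel, prior_rel=0.1)
odds_distant = relation_odds(distant_pair, p_given_rel, p_given_norel, prior_rel=0.1)
predict_close = odds_close > 1.0
predict_distant = odds_distant > 1.0
```

 Because only the ratio is compared with 1, the shared denominator p(feats) never has to be computed.</Paragraph>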
      <Paragraph position="8"> The cross-sentence feature model applies to entities found by the sentence-level model, which is run over all of the sentence-like portions of the text. An initial heuristic procedure checks for sections of the preamble or trailer that look like sentential material, that should be treated like the body text. There is also a separate handwritten procedure that searches the preamble text for any byline, and, if one is found, instantiates an appropriate employee relationship.</Paragraph>
      <Paragraph position="9"> Model Features: Two classes of features were used in this model: structural features that reflect properties of the text surrounding references to the entities involved in the suggested relation, and content features based on the actual entities and relations encountered in the training data.</Paragraph>
      <Paragraph position="10"> Structural Features: The structural features exploit simple characteristics of the text surrounding references to the possibly-related entities. The most powerful structural feature, not surprisingly, was distance, reflecting the fact that related elements tend to be mentioned in close proximity, even when they are not mentioned in the same sentence. Given a pair of entity references in the text, the distance between them was quantized into one of three possible values [the list of values was lost in extraction]. For each pair of possibly-related elements, the distance feature value was defined as the minimum distance between some reference in the text to the first element and some reference to the second.</Paragraph>
      <Paragraph position="11"> A second structural feature grew out of the intuition that entities mentioned in the first sentence of an article often play a special topical role throughout the article. The &quot;Topic Sentence&quot; feature was defined to be true if some reference to one of the two entities involved in the suggested relation occurred in the first sentence of the text-field body of the article.</Paragraph>
      <Paragraph position="12"> Other structural features that were considered but not implemented included the count of the number of references to each entity.</Paragraph>
      <Paragraph position="13"> Content Features: While the structural features learn general facts about the patterns in which related references occur and the text that surrounds them, the content features learn about the actual names and descriptors of entities seen to be related in the training data. The three content features in current use test for a similar relationship in training by name or by descriptor or for a conflicting relationship in training by name.</Paragraph>
      <Paragraph position="14"> The simplest content feature tests using names whether the entities in the proposed relationship have ever been seen before to be related. To test this feature, the model maintains a database of all the entities seen to be related in training, and of the names used to refer to them. The &quot;by name&quot; content feature is true if, for example, a person in some training message who shared at least one name string with the person in the proposed relationship was employed in that training message by an organization that shared at least one name string with the organization in the proposed relationship. A somewhat weaker feature makes the same kind of test for a previously seen relationship using descriptor strings. This feature fires when an entity that shares a descriptor string with the first argument of the suggested relation was related in training to an entity that shares a name with the second argument. Since titles like &quot;General&quot; count as descriptor strings, one effect of this feature is to increase the likelihood of generals being employed by armies. Observing such examples, but noting that the training didn't include all the reasonable combinations of titles and organizations, the training for this feature was seeded by adding a virtual message constructed from a list of such titles and organizations, so that any reasonable such pair would turn up in training.</Paragraph>
      <Paragraph position="15"> The third content feature was a kind of inverse of the first &quot;by name&quot; feature, which was true if some entity sharing a name with the first argument of the proposed relation was related to an entity that did not share a name with the second argument. Using the employment relation again as an example, it is less likely (though still possible) that a person who was known in another message to be employed by a different organization should be reported here as employed by the suggested one.</Paragraph>
      <Paragraph position="16"> Training: Given enough fully annotated data, with both sentence-level semantic annotation and message-level answer keys recorded along with the connections between them, training the features would be quite straightforward. For each possibly-related pair of entities mentioned in a document, one would just count up the 2x2 table showing how many of them exhibited the given structural feature and how many of them were actually related. The training issues that did arise stemmed from the limited supply of answer keys and from the fact that the keys were not connected to the sentence-level annotations.</Paragraph>
      <Paragraph position="17"> The government training and dry run data provided 200 messages' worth of TE and TR answer keys. Those answer keys, however, contained strings without recording where in the text they were found. In order to train structural features from that data, we needed the locations of references within the text. A heuristic string matching process was used to make that connection, with a special check for names to ensure that the shorter version of a name did not match a string in the text that also matched a longer version of the same name.</Paragraph>
      <Paragraph position="18"> Training the content features, on the other hand, did not require positional information about the references. The plain answer keys could be used in combination with a database of the name and descriptor strings for entities related in training to count up the feature probabilities for actually related and non-related pairs. The string database was collected first, and one-out training was then used, so that the rest of the training corpus provided the string database for training the feature counts on each particular message. The additional training data that was semantically annotated for training the sentence-level model but for which answer keys were not available could still also be used in building up the string database for the content features.</Paragraph>
      <Paragraph position="19"> The probabilities based on the final feature counts were smoothed by mixing them with 0.01% of a uniform model.</Paragraph>
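      <Paragraph> The counting and smoothing described above can be sketched for a single binary feature. The text says only that the probabilities were mixed with 0.01% of a uniform model, so the linear mixture with weight 0.0001 over the two feature values is an assumed reading, and the function name and toy counts are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def feature_probs(pairs: Iterable[Tuple[bool, bool]],
                  uniform_weight: float = 0.0001) -> Dict[Tuple[bool, bool], float]:
    """Estimate p(feature value | related?) from (fired, related) observations.

    Returns a dict keyed by (fired, related). Each estimate is the relative
    frequency from the 2x2 table mixed with a uniform distribution over the
    two feature values (0.0001 corresponds to 0.01%).
    """
    counts = {(f, r): 0 for f in (True, False) for r in (True, False)}
    for fired, related in pairs:
        counts[(fired, related)] += 1

    probs = {}
    for r in (True, False):
        total = counts[(True, r)] + counts[(False, r)]
        for f in (True, False):
            ml = counts[(f, r)] / total if total else 0.5
            probs[(f, r)] = (1.0 - uniform_weight) * ml + uniform_weight * 0.5
    return probs


# Toy observations: 4 related pairs (feature fired for 3), 10 unrelated pairs (never fired).
observations = ([(True, True)] * 3 + [(False, True)]
                + [(False, False)] * 10)
probs = feature_probs(observations)
```

 The practical effect of the uniform component is that a feature value never seen with one class still receives a small nonzero probability, so the likelihood ratio stays finite.</Paragraph>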
      <Paragraph position="20"> Other Message Level Processing: After the cross-sentence model has been applied, some further heuristic message-level processing is done before generating the answers in MUC template form. In one step, those portions of the preamble of the message, which includes the title and by-line, that are not English sentences are searched for a possible employment relation between the article author and the organization holding the copyright. A limited form of voting was also applied across messages, so that if the same name was identified by the sentence-level model as, say, an organization in one case and a person in another, only the plurality type is actually output. Heuristic models are used to fill in some additional required fields, distinguishing, for instance, between civilian, military, and government organizations; this could have been trained, but time did not permit. Identifying the type and country of locations is a simple process, benefiting greatly from gazetteer lookup.</Paragraph>
      <Paragraph position="21"> Finally, a heuristic choice is made whether or not to output each element. For example, a descriptor that was not paired by the sentence-level processing with any named entity could either actually be an isolated descriptor or it could be one where the true link with a named entity was missed by the sentence-level model. Lacking at this point any trained model to distinguish those two cases, SIFT plays it safe by not outputting such entities.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="83" end_page="85" type="metho">
    <SectionTitle>
SIFT System Examples
</SectionTitle>
    <Paragraph position="0"> The main determinant of SIFT's performance is the sentence-level model and the semantic structures that it produces. Secondary but still significant effects on performance come from the message-level processing steps that derive TE and TR output from the sentence-level decoder. This section presents examples from the output for one of the MUC-7 test messages, illustrating these different effects. Example 1 shows a case where everything worked as planned.</Paragraph>
    <Paragraph position="1"> Here the decoder correctly recognized a person name (PER/NPA) bound to a person descriptor (PER-DESC/NP-R). That descriptor contains an organization (ORG/NP), which in turn is linked to a location. The LINK and PTR nodes connect the descriptor with the person, the organization with the person descriptor (and thus indirectly with the person), and the location with the organization. In the post-processing, the person name is extracted, the descriptor text is linked to it, the organization name is extracted, and the employment relationship is noted. The organization is also linked to the nested location; of the two location elements in the LOC phrase, the first is taken as the LOCALE field filler, while the second is looked up in the gazetteer to identify a country in which the locale value is then looked up.</Paragraph>
    <Paragraph position="2"> Example 2 shows the effect of a decoder error. Here the sentence-level decoder linked both organization descriptors back to the top-level named organization, while the correct reading would have attached the second descriptor to the nested &amp;quot;Bloomberg L.P.&amp;quot;. The post-processing therefore also links both descriptor phrases to &amp;quot;Bloomberg Information Television&amp;quot; internally. Only the longest descriptor, however, is actually output, which in this case results in output of only the mistaken value.</Paragraph>
    <Paragraph position="3"> Not surprisingly, a number of the decoder errors that affected output stemmed from conjunctions. In another paragraph, for example, the manufacturer organization name &amp;quot;Lockheed Space and Strategic Missiles&amp;quot; was incorrectly broken at the conjunction, causing the location relation with Bethesda to be missed.</Paragraph>
    <Paragraph position="4"> The cross-sentence model is the system component that tries to find further relations beyond those identified by the sentence-level model. In the walk-through article, that component did not happen to succeed in finding any such relations. Example 3 shows the sort of relation that we would like that model to be able to find. There the sentence-level decoder did link Rubenstein to the organization descriptor &amp;quot;company&amp;quot;, but since that descriptor was never linked to &amp;quot;News Corporation&amp;quot;, the employee relation was missed. However, since News Corporation is mentioned both in that sentence and in the following sentence, an improved cross-sentence model would be one way of attacking such examples.</Paragraph>
    <Paragraph position="5"> The last step in processing is the output filter, which heuristically determines whether a proposed constituent should be included in the output. Example 4 shows two cases where this filter overrode correct decoder structure. Here the decoder correctly identified both of the artifact descriptors &amp;quot;A Chinese rocket&amp;quot; and &amp;quot;an Intelsat satellite&amp;quot;, but the output filter chose not to include them. That choice was made because of frequent cases where an indefinite artifact descriptor not linked to any named artifact should not be output; an example from elsewhere in this message is &amp;quot;the last rocket I'd recommend&amp;quot;. But this example shows that the decision not to output such cases sometimes cost the system points.</Paragraph>
    <Paragraph position="6"> SIFT System Results and Summary
The SIFT system worked by first applying the sentence-level model to each sentence in the message and then extracting entities, descriptors, and relations from the resulting trees, heuristically merging TE elements, applying the cross-sentence model to identify non-local relations, and finally filtering and formatting TE and TR templates for output. In the MUC-7 evaluation, the system's score on the TE task was 83% recall with 84% precision, for an F of 83.49%. Its score on TR was 64% recall with 81% precision, for an F of 71.23%.</Paragraph>
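For readers who want to relate the recall, precision, and F figures above, the following sketch computes the balanced F-measure, assuming the usual definition as the harmonic mean of precision and recall. The function name and the `beta` parameter are illustrative. Note that plugging in the rounded percentages gives roughly 83.5 for TE and 71.5 for TR; the reported 83.49 and 71.23 were presumably computed from unrounded recall and precision.

```python
# Hypothetical sketch: F-measure from recall and precision.
def f_measure(recall, precision, beta=1.0):
    """Return the F-measure; beta=1.0 weights recall and precision equally."""
    if recall + precision == 0:
        return 0.0
    beta_squared = beta * beta
    return (1.0 + beta_squared) * precision * recall / (beta_squared * precision + recall)


# Usage with the rounded figures quoted in the text.
te_f = f_measure(recall=0.83, precision=0.84)
tr_f = f_measure(recall=0.64, precision=0.81)
```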
    <Paragraph position="7"> Because most of the relations in the answer keys were locally signaled, the cross-sentence model in this application adds only a small boost to the performance of the sentence-level model. When measured before the evaluation on 10 randomly selected messages from the airplane crash domain training data, the cross-sentence model improved TR scores by 5 points. It proved a bit less effective on the 100 messages of the MUC-7 test set, improving scores there by only 2 points. (The F score on the formal test set with the cross-sentence model component disabled was
For identifying named entities in text, BBN has developed the IdentiFinder(TM) trained named entity extraction system (Bikel et al., 1997), which uses an HMM to recognize the entities present in the text.</Paragraph>
    <Paragraph position="8"> The HMM labels each word either with one of the desired classes (e.g., person, organization, etc.) or with the label NOT-A-NAME (to represent &amp;quot;none of the desired classes&amp;quot;). The states of the HMM fall into regions, one region for each desired class plus one for NOT-A-NAME. (See Figure 4.) The HMM thus has a model of each desired class and of the other text. Note that the implementation is not confined to the seven name classes used in the NE task; the particular classes to be recognized can be easily changed via a parameter.</Paragraph>
    <Paragraph position="9"> Within each of the regions, we use a statistical bigram language model, and emit exactly one word upon entering each state. Therefore, the number of states in each of the name-class regions is equal to the vocabulary size, |V|.</Paragraph>
    <Paragraph position="10"> Additionally, there are two special states, the</Paragraph>
  </Section>
  <Section position="7" start_page="85" end_page="85" type="metho">
    <SectionTitle>
START-OF-SENTENCE and END-OF-SENTENCE
</SectionTitle>
    <Paragraph position="0"> states. In addition to generating the word, states may also generate features of that word.</Paragraph>
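The state space described above can be pictured with a small sketch that enumerates one state per (region, word) pair plus the two special states. This is only an illustration under stated assumptions: the function name, the tuple representation of a state, and the example class names are hypothetical, and it assumes the NOT-A-NAME region has one state per vocabulary word just like the name-class regions.

```python
# Hypothetical sketch: enumerate the HMM states, one per (region, word) pair.
from itertools import product

SPECIAL_STATES = ("START-OF-SENTENCE", "END-OF-SENTENCE")


def build_states(name_classes, vocabulary):
    """Return the list of states for the given name classes and vocabulary.

    Each region (every name class plus NOT-A-NAME) contributes one state per
    vocabulary word; the two special states carry no word.
    """
    regions = list(name_classes) + ["NOT-A-NAME"]
    states = list(product(regions, sorted(vocabulary)))
    states.extend((special, None) for special in SPECIAL_STATES)
    return states


# Usage: 2 name classes + NOT-A-NAME over a 3-word vocabulary.
states = build_states(["PERSON", "ORGANIZATION"], {"mr", "jones", "said"})
```

Under these assumptions the total number of states is the number of regions times the vocabulary size, plus two.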
    <Paragraph position="1"> Features used in the MUC-7 version of the system include several features pertaining to numeric expressions, capitalization, and membership in lists of important words (e.g.</Paragraph>
  </Section>
</Paper>