<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1105"> <Title>Semantic Tagging using a Probabilistic Context Free Grammar *</Title> <Section position="4" start_page="39" end_page="39" type="metho"> <SectionTitle> &quot;RESTOR INDUSTRIES Inc.&quot; } </SectionTitle> <Paragraph position="0"> This paper's work attacks problem (1) alone, and is restricted to recovery of the IN, OUT and POST slots.</Paragraph> <Section position="1" start_page="39" end_page="39" type="sub_section"> <SectionTitle> 1.2 Previous Work </SectionTitle> <Paragraph position="0"> The majority of systems at MUC-6, including SRI's system FASTUS (Appelt et al. 93), and the best-performing system, from NYU (Grishman 95), used cascaded finite-state transducers, which were built by hand. The domain independent transducers tokenize the text, recognise person and company names, &quot;chunk&quot; noun and verb groups, and finally build some higher level, complete clauses.</Paragraph> <Paragraph position="1"> The domain specific rules then extract the slots from the sentence, using patterns such as the example on page 244 of (Appelt et al. 93): &quot;Company hires or recruits person from company as position&quot;.</Paragraph> <Paragraph position="2"> There have been a number of machine learning approaches to the sentence-level stage of information extraction. The AutoSlog system (Riloff 93; Riloff 96) automatically learned &quot;concept node&quot; definitions for use on the MUC-4 terrorist events domain. A concept node specifies a trigger word, usually a verb, and maps syntactic roles with respect to this trigger to semantic slots - for example, a concept node might specify: if trigger = &quot;destroyed&quot; and syntax = direct-object then concept = Damaged-Object (Damaged-Object is the name of the slot in this case). A concept node may also specify hard or soft constraints on the slot-fillers. The system uses the CIRCUS parser (Lehnert et al. 93) to find the syntactic roles in relation to the trigger. AutoSlog learns concept nodes given input-output pairs like those in figure 1, so the indicator words do not need to be specified.</Paragraph> <Paragraph position="3"> Experiments showed that running AutoSlog followed by 5 hours of filtering the rules by hand gave a system that performed as well as a hand-crafted system.</Paragraph> <Paragraph position="4"> The CRYSTAL system (Soderland et al. 95) also learns rules that map syntactic frames to semantic roles.</Paragraph> <Paragraph position="5"> The triggers can be more complicated than those in AutoSlog, in that they can specify whole sequences of words, or restrict patterns by specifying words or classes in the surrounding context. CRYSTAL learns patterns by initially specifying a maximally detailed pattern for each training example, then progressively simplifying and merging patterns until some error bound is exceeded.</Paragraph> <Paragraph position="6"> CRYSTAL uses the BADGER sentence analyzer to give syntactic information.</Paragraph> <Paragraph position="7"> (Califf and Mooney 97) describe a system for extraction of information about job postings from a newsgroup. Relational learning is used to learn rule-based patterns that specify: 1) a pre-filler pattern that matches the text before the slot; 2) a pattern that must match the actual slot filler; and 3) a post-filler pattern that matches the text after the slot. The patterns can involve parts of speech, semantic classes of words, or the words themselves. An example pattern from (Califf and Mooney 97) for identifying locations is pre-filler = in, filler = 2 or fewer words all proper nouns, post-filler = word1 is &quot;,&quot;, word2 is a state. This matches phrases like &quot;in Kansas City, Missouri&quot; or &quot;in Atlanta, Georgia&quot;. The learning algorithm starts with the most specific rule for each training example, then generalizes by merging similar rules.</Paragraph>
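<Paragraph> To make the pre-filler / filler / post-filler format concrete, the following sketch checks a tokenized phrase against a rule of roughly the shape just described. It is purely illustrative: the rule representation and the is_proper_noun and is_state helpers are hypothetical stand-ins, not the representation learned by (Califf and Mooney 97).

# Illustrative sketch of a pre-filler / filler / post-filler rule check.
# The helpers are hypothetical; the real system learns richer constraints
# (parts of speech, semantic word classes) by relational learning.

US_STATES = {"Missouri", "Georgia", "Kansas"}   # tiny stand-in list

def is_proper_noun(token):
    return token[:1].isupper()

def is_state(token):
    return token in US_STATES

def match_location_rule(tokens, start):
    """Match: pre-filler 'in', filler = 2 or fewer proper nouns,
    post-filler = ',' followed by a state.  Returns the filler or None."""
    if tokens[start] != "in":
        return None
    i = start + 1
    filler = []
    while i < len(tokens) and len(filler) < 2 and is_proper_noun(tokens[i]):
        filler.append(tokens[i])
        i += 1
    if not filler:
        return None
    if i + 1 < len(tokens) and tokens[i] == "," and is_state(tokens[i + 1]):
        return " ".join(filler)
    return None

print(match_location_rule("based in Kansas City , Missouri .".split(), 1))
# -> 'Kansas City'
</Paragraph>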
<Paragraph position="8"> A major difference between the approaches described in (Riloff 93; Riloff 96; Soderland et al. 95) and the approach in this paper is that (Riloff 93; Riloff 96; Soderland et al. 95) rely on a syntactic parser to produce at least a shallow syntactic analysis. The approach described in this paper builds a system from a set of training examples, with only a part-of-speech tagger and a morphological analyzer as additional resources. The system in (Califf and Mooney 97) does not require a parser, but the patterns it uses are quite local (the pre-filler and post-filler patterns are adjacent to the slot). It is not clear that this method would work well for the management successions domain, where there are often many &quot;noise&quot; words between the slots and the indicator. Another major difference between the methods is that the PCFG based method is probabilistic. This may be an advantage when the sentence-level stage of processing is combined with the later merging and coreference stages, as it gives a principled way of combining evidence from the different stages of processing: an uncertainty at the sentence level may, for example, be resolved at the merging stage -- in this case it is useful for the sentence level system to be capable of giving a list of candidate analyses with associated probabilities.</Paragraph> </Section> </Section> <Section position="5" start_page="39" end_page="40" type="metho"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="39" end_page="40" type="sub_section"> <SectionTitle> 2.1 The Problem </SectionTitle> <Paragraph position="0"> We assume the following definitions: 1. A sentence, W, consists of n words, w1, w2, ... wn. 2. A template, T, is a non-empty set of slots, where each slot is a label together with a tuple giving the start and end point of the slot in the sentence. For example, T = {IN = (3, 4), OUT = (5, 6)} means there is an IN slot spanning words 3 to 4 inclusive, and an OUT slot spanning words 5 to 6 inclusive. In the management succession domain there are three possible slots, IN, OUT and POST (abbreviated to I, O and P respectively). IN is the string denoting the person who is filling the post, OUT is the person who is leaving the post, and POST is the name of the post.</Paragraph> <Paragraph position="1"> 3.
In addition, we assume that each template contains an additional indicator slot, which is the verb or noun used to express the template.</Paragraph> <Paragraph position="2"> For example, a (W, T) pair might be W = Last week Hensley West, 59 years old, joined the company as president, a surprising development.</Paragraph> <Paragraph position="4"> As alternative notation in this paper we either list the strings in the template, for example T = {IN = &quot;Hensley West&quot;, POST = &quot;president&quot;, IND = &quot;joined&quot;}, or we show the (W, T) pair as a bracketed sentence: Last week (IN Hensley West), 59 years old, (IND joined) the company as (POST president), a surprising development.</Paragraph> <Paragraph position="5"> Table 1 shows more examples from the management succession domain. The machine learning task is to learn a function that maps an arbitrary sentence W to a template T, given a training set of N pairs (Wi, Ti), 1 <= i <= N.</Paragraph> <Paragraph position="6"> A test set of (W, T) pairs is used to evaluate the model.</Paragraph> <Paragraph position="7"> In addition to a training set, we assume the following resources: 1. A part of speech (POS) tagger. The POS tagger described in (Ratnaparkhi 96) was used to tag both training and test data.</Paragraph> <Paragraph position="8"> 2. A lexicon which maps each indicator word in training data to a class, for example the morphological variants &quot;join&quot;, &quot;joins&quot;, &quot;joined&quot; and &quot;joining&quot; could all be mapped to the JOIN class. This can be done automatically by a morphological analyzer as in (Karp et al. 94), or by hand. (This resource is not strictly necessary, but will help to reduce sparse data problems.)</Paragraph> <Paragraph position="9"> A probabilistic approach defines a conditional probability P(T | W) or a joint probability P(T, W) for every candidate template for a sentence. The most likely template for a sentence W is then Tbest = argmax_T P(T | W) = argmax_T P(T, W).</Paragraph> <Paragraph position="11"> The major part of this paper will be concerned with formalizing a stochastic model that defines P(T, W).</Paragraph> </Section> </Section> <Section position="6" start_page="40" end_page="41" type="metho"> <SectionTitle> 3 A Probabilistic Model </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="40" end_page="40" type="sub_section"> <SectionTitle> 3.1 A Naive Approach -- Finite State Tagging </SectionTitle> <Paragraph position="0"> It is useful to note that a (W, T) pair can be represented as a tagged sentence w1/t1, w2/t2, ... wn/tn, where T = t1, t2 ... tn is the sequence of tags denoting the semantic type for each word in the sentence. For example, the tags could be I, O, P, IND for the 3 slots and the indicator, and N for other (noise) words, as in Last/N week/N Hensley/I West/I ,/N 59/N years/N old/N ,/N joined/IND the/N company/N as/N president/P ,/N a/N surprising/N development/N ./N</Paragraph> <Paragraph position="2"> As a straw man we consider using a standard bigram tagging model to tag test-set sentences. (Church 88) used this approach to recover part-of-speech tags; a related approach described in (Bikel et al. 97) gives a useful decomposition of P(T, W) into two terms: P(T, W) = P(L1 ... Lm) * prod_{i=1..m} P(Wi | Li), where L1 ... Lm is the sequence of labels spanning the sentence -- in the above example m = 7 and the sequence is {N, I, N, IND, N, P, N}. Wi is the string of words under label Li, for example W1 = {Last, week}, W2 = {Hensley, West}.</Paragraph> <Paragraph position="5"> The two terms are then simplified, using bigram Markov independence assumptions, to be P(L1 ... Lm) = prod_{i=1..m} P(Li | Li-1) (3) and P(Wi | Li) = prod_{j} P(wj | wj-1, Li) (4), where the product in (4) is over the words wj spanned by label Li, with START and END boundary symbols for each segment.</Paragraph> <Paragraph position="7"> This finite state approach has been highly effective for part of speech tagging (Church 88) and name finding (Bikel et al. 97). However, the next section considers the characteristics of the task in more detail, and argues that a finite-state tagger is a poor model for the task.</Paragraph> </Section>
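<Paragraph> As a minimal sketch of the decomposition above, the fragment below scores one labeled segmentation under equations (3) and (4), using tiny hand-set probability tables. The tables, the unseen-event floor and the toy sentence are hypothetical placeholders; they only illustrate how P(T, W) factors into a label-sequence term and a per-segment word term, not the estimates used in the paper.

# Toy parameters, set by hand purely for illustration.
P_LABEL = {                       # P(L_i | L_i-1), equation (3)
    ("START", "I"): 0.2, ("I", "IND"): 0.3,
    ("IND", "P"): 0.4, ("P", "END"): 0.5,
}
P_WORD = {                        # P(w_j | w_j-1, L_i), equation (4)
    ("<s>", "Smith", "I"): 0.1, ("Smith", "</s>", "I"): 0.6,
    ("<s>", "joins", "IND"): 0.3, ("joins", "</s>", "IND"): 0.9,
    ("<s>", "president", "P"): 0.2, ("president", "</s>", "P"): 0.8,
}
UNSEEN = 1e-6                     # crude floor standing in for real smoothing

def segment_prob(words, label):
    """P(W_i | L_i) as a word bigram model within one labeled segment."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= P_WORD.get((prev, cur, label), UNSEEN)
    return p

def joint_prob(segments):
    """P(T, W) = P(L_1 ... L_m) * product over segments of P(W_i | L_i)."""
    labels = ["START"] + [lab for lab, _ in segments] + ["END"]
    p = 1.0
    for prev, cur in zip(labels, labels[1:]):
        p *= P_LABEL.get((prev, cur), UNSEEN)
    for lab, words in segments:
        p *= segment_prob(words, lab)
    return p

# Toy sentence "Smith joins president" segmented as I / IND / P.
print(joint_prob([("I", ["Smith"]), ("IND", ["joins"]), ("P", ["president"])]))
</Paragraph>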
<Section position="2" start_page="40" end_page="41" type="sub_section"> <SectionTitle> 3.2 More about the task </SectionTitle> <Paragraph position="0"> In developing an intuition for the task, and motivating the choices made in modeling it, it is useful to consider the types of information that may be useful to a system.</Paragraph> <Paragraph position="1"> Consider the following 5 points: 1. There are 7 possible templates corresponding to the 7 non-empty subsets of {I, O, P}. The distribution over these alternatives is by no means uniform - see table 2 for the distribution.</Paragraph> <Paragraph position="2"> 2. The different slots tend to contain quite different lexical items or strings - for example, the IN and OUT slots are most likely to contain a proper name or a personal pronoun, whereas the POST slot contains strings such as &quot;president&quot;, &quot;chairman&quot; etc. 3. The choice of indicator word depends greatly on the choice of template. For example &quot;name&quot; is very likely to be used to express an event involving a {I, P} template; &quot;succeed&quot; is very likely to express an {I, O, P} or {I, O} template. See the final column of table 2 for more examples.</Paragraph> <Paragraph position="3"> 4. The relative order of the slots and indicator in the text varies considerably depending on the choice of indicator. For example, given the template {I} and the verb &quot;join&quot; the order is most likely to be {I Indicator} (e.g. IN joined the company); whereas given the verb &quot;hire&quot; the order is usually {Indicator I} (e.g. the company hired IN).</Paragraph> <Paragraph position="4"> 5. In addition to the central indicator, there are often secondary indicators - mainly prepositions - which are strong signals of particular slots. For example, given the verb is &quot;named&quot; or &quot;succeeded&quot;, the post is very likely to be preceded by the preposition &quot;as&quot; (e.g., the company named her as president, he succeeds Jim Smith as president).</Paragraph> <Paragraph position="5"> By considering points 1-5 we can see that the finite-state tagging approach is deficient for the semantic tagging task. The lexical probabilities in equation (4) are probably sufficient to capture the lexical differences between different states (the preference of the IN slot to generate proper names, of the POST slot to generate words like &quot;president&quot; and so on).
But the Markov approximation in equation (3) is deficient in many ways: it fails to capture the non-uniform distribution over the 7 possible templates; worse still, it can label more than one substring with the same slot label; and it fails to capture the dependence of the slot order on the indicator word, or the dependence between the template and the indicator.</Paragraph> </Section> <Section position="3" start_page="41" end_page="41" type="sub_section"> <SectionTitle> 3.3 A Probabilistic Context-Free Grammar </SectionTitle> <Paragraph position="0"> Our proposal is to replace the Markov assumption in (3) with a probabilistic context-free grammar, that is, we assume that the label sequence has been generated by the application of r context-free rules LHSj -> RHSj, 1 <= j <= r (LHS stands for left hand side, RHS stands for right hand side), and that P(L1 ... Lm) = prod_{j=1..r} P(RHSj | LHSj).</Paragraph> <Paragraph position="2"> Each LHS is a single non-terminal, and RHS is a string of one or more non-terminals. So for each non-terminal LHS in the grammar there is a distribution over possible RHSs which sums to one. Counts of context-free rules can be extracted from a training set of context-free trees, and used to estimate the parameters P(RHSj | LHSj). (Table 2 shows the distribution over templates and the most likely indicator for each template. For example, 45.6% of the templates are {I,P}, and in these 45.6% of the cases &quot;name&quot; is chosen 67% of the time.)</Paragraph> <Paragraph position="6"> Given a test data sentence, the most likely tree (and hence the most likely template) can be recovered efficiently using a variant of the CKY algorithm.</Paragraph> </Section> </Section> <Section position="7" start_page="41" end_page="43" type="metho"> <SectionTitle> 4 The Grammar </SectionTitle> <Paragraph position="0"> This section describes the underlying context-free structure2 that we assume has generated the labels, and motivates it in terms of the observations in section 3.2.</Paragraph> <Paragraph position="1"> The context-free structure (the tree topology, and the choice of non-terminal labels within the tree) is deterministically derived from the initial labeling of the sentences -- so given a set of labeled sentences, the context-free structures can be recovered and the parameters can be estimated.</Paragraph> <Section position="1" start_page="41" end_page="41" type="sub_section"> <SectionTitle> 4.1 The Leaf Categories </SectionTitle> <Paragraph position="0"> The tagging model as applied in the above example assumed five tags - for the IN, OUT, and POST slots, the indicator, and for noise (other words). In fact, we used rather more categories, which are listed in table 3. These labels can still be deterministically recovered from the labeled sentence though, given the additional information of a mapping from indicator words to their morphological stem (for example, the mapping &quot;joined&quot; -> JOIN). The example sentence would have the following underlying leaf labeling: [PREN Last week] [I Hensley West] [NOISE+ , 59 years old ,] [IND(JOIN) joined] [NOISE- the company] [P.Prep-(JOIN) as] [P president] [POSTN , a surprising development .]</Paragraph> </Section> <Section position="2" start_page="41" end_page="41" type="sub_section"> <SectionTitle> 4.2 The Context-Free Component -- a Brief Sketch </SectionTitle> <Paragraph position="0"> The PCFG model assumes the pre-terminal3 label sequence {L1, L2 ... Lm} has been generated by a stochastic process with the following steps: 2 The structures in this paper are non-recursive, and could, therefore, be equivalently handled by a hierarchy of finite-state transducers, or even a single equivalent non-deterministic finite-state automaton.
However, it is quite possible that extensions to the models could require recursive structures.</Paragraph> <Paragraph position="1"> 3 By pre-terminal, we mean a non-terminal that dominates words rather than other non-terminals.</Paragraph> <Paragraph position="2"> 1. Decide whether to have noise words (PREN) before the template TEMP. 2. Decide whether to have noise words (POSTN) after the template TEMP.</Paragraph> <Paragraph position="3"> 3. Decide which slots to have (one of the 7 subsets of {I, O, P}).</Paragraph> <Paragraph position="4"> 4. Decide the class of indicator words.</Paragraph> <Paragraph position="5"> 5. Decide the order of the slots and indicator word. 6. For each slot, choose whether to have noise between it and the indicator (NOISE+ or NOISE-).</Paragraph> <Paragraph position="6"> 7. For each slot, choose whether to have a preposition directly preceding or following it.</Paragraph> <Paragraph position="7"> Figure 2 gives an example tree and describes the context-free rules within it. The next section describes the grammar in more detail, showing how these 7 types of decision can be encoded as context-free rules.</Paragraph> </Section> <Section position="3" start_page="41" end_page="43" type="sub_section"> <SectionTitle> 4.3 The Context-Free Component in Detail </SectionTitle> <Paragraph position="0"> This section describes the top-down derivation of a sequence of leaves within a PCFG framework.</Paragraph> <Paragraph position="1"> Choosing noise at the start/end of the sentence This level of the model chooses whether to have noise preceding or following the text which expresses the succession information.</Paragraph> <Paragraph position="2">
TOP -> PREN TEMP1    # there is noise at the start
TOP -> TEMP1         # or there isn't
TEMP1 -> TEMP POSTN  # there is noise following
TEMP1 -> TEMP        # or there isn't
The TEMP non-terminal covers the span of the succession information, in the above example &quot;Hensley ... president&quot;. P(PREN TEMP1 | TOP) can be interpreted as the probability of having noise at the beginning of the sentence, P(TEMP POSTN | TEMP1) is the probability of having post-noise.</Paragraph> <Paragraph position="4"> Choosing the slots TEMP first re-writes in one of seven ways, corresponding to the 7 possible templates (TEMP -> T.I | T.O | T.P | T.IO | T.IP | T.OP | T.IOP). The T. non-terminal encodes the slots that will be generated below it, for example T.IO would generate an IN and an OUT slot below it. So P(RHS | TEMP) will mirror the distribution in column 2 of table 2.</Paragraph> <Paragraph position="5"> Table 3: the leaf categories.
I, O, P: the IN, OUT and POST non-terminals.
PREN: &quot;noise&quot; words before the template.
POSTN: &quot;noise&quot; words after the template.
NOISE+: &quot;noise&quot; between a slot and the indicator, for a slot which comes before the indicator.
NOISE-: &quot;noise&quot; between a slot and the indicator, for a slot which comes after the indicator.
IND(class): leaf dominating the indicator; class can be any one of the morphological stems seen in training data. For example, IND(join) could dominate &quot;join&quot;, &quot;joins&quot;, &quot;joined&quot; or &quot;joining&quot;.
I.Prep-(class), O.Prep-(class), P.Prep-(class): prepositions for the I, O and P slots, for a particular class, which follow the indicator. For example, P.Prep-(join) would be a preposition for the POST slot, with an indicator in the join class, and would most likely be &quot;as&quot;.
I.Prep+(class), O.Prep+(class), P.Prep+(class): prepositions for the I, O and P slots, for a particular class, which precede the indicator.</Paragraph>
<Paragraph position="9"> The next step is to choose the Class of indicator that is used to express the transaction. Each Class is a set of words with the same morphological stem, for example the JOIN class would include join, joins, joined and joining. P(Class | Slots) is implemented in a CFG fragment at this level of the grammar: the IND non-terminal encodes which slots need to be generated, and the Class used to express the transaction, and each T. rule can re-write in N ways, where N is the number of classes.</Paragraph> <Paragraph position="11"> Having chosen the Slots to be generated, and the Class used to express the event, there are many possible orders in which the slots and class can appear. In the above example (Slots = {I, P}, Class = JOIN) there are 6 permutations ({JOIN I P}, {I JOIN P}, {I P JOIN} and so on). It is necessary to estimate a distribution over these alternatives. The order is parameterized using a binary branching, context-free fragment; part of this fragment (the rules for the case Slots = {I, P}) is described below.</Paragraph> <Paragraph position="13"> The notation is: * IND keeps track of which slots still need to be generated. For example IND.IP[Class] means that the IN and POST slots need to be generated. * The I2, O2, and P2 non-terminals will eventually generate the IN, OUT and POST leaves. The &quot;2&quot; stands for level 2 - more on why this is necessary in the next section. * &quot;+&quot; means the slot appears before the head-word, &quot;-&quot; means it appears after. The Class is propagated to the I2, O2 and P2 non-terminals. Propagation of the Class and direction (+ or -) is important because the identity of any prepositions is conditioned on this information. Each binary rule expresses a choice of which of the remaining slots to generate next, and which direction to generate it in. So IND.IP[Class] can re-write in 4 ways: either the IN or POST slot can be generated either to the left or right of the head-word itself.</Paragraph> <Paragraph position="14"> Choosing to generate noise between the slots Noise can appear after any slot preceding the indicator, or before any slot following the indicator. The CFG rules at this level encode the decision to have noise in a gap or not, for an IN slot generated before or after the indicator. The rules for OUT and POST are similar.</Paragraph> <Paragraph position="16"> Choosing to generate a preposition (or other indicator) linked to a slot Any of the slots can have an adjacent &quot;indicator&quot;, usually a preposition. The rules at this level encode the binary decision of whether to include an indicator for an IN slot - the OUT and POST cases are similar.</Paragraph> <Paragraph position="18"> Again, for each I1, O1 or P1 non-terminal there are two possible re-writes, one binary, one unary, encoding whether or not to generate a preposition. The I.Prep, O.Prep and P.Prep non-terminals then generate the indicator with a bigram model. The non-terminal encodes whether the slot appears before or after the head-word (&quot;+&quot; or &quot;-&quot;), and the Class of the head-word.</Paragraph> </Section> </Section>
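<Paragraph> The sketch below spells out how a derivation of this kind is scored: the probability of a tree is the product of P(RHS | LHS) over the rules used to build it. The first four rules follow the fragments given above; the two ordering rules (IND.IP[JOIN] and below) are abbreviated guesses at the internal naming, and every probability value is a made-up placeholder rather than an estimate from the paper's data.

# Probability of one derivation under the PCFG: product of P(RHS | LHS).
RULE_PROB = {
    ("TOP",          ("PREN", "TEMP1")):            0.7,   # noise at the start
    ("TEMP1",        ("TEMP", "POSTN")):            0.4,   # noise at the end
    ("TEMP",         ("T.IP",)):                    0.456, # choose slots {I, P}
    ("T.IP",         ("IND.IP[JOIN]",)):            0.2,   # choose the JOIN class
    ("IND.IP[JOIN]", ("I2+[JOIN]", "IND.P[JOIN]")): 0.5,   # IN generated before the indicator
    ("IND.P[JOIN]",  ("IND(join)", "P2-[JOIN]")):   0.6,   # POST generated after the indicator
}

def derivation_prob(rules):
    """P(derivation) = product over rule applications of P(RHS | LHS)."""
    p = 1.0
    for lhs, rhs in rules:
        p *= RULE_PROB[(lhs, rhs)]
    return p

example_derivation = list(RULE_PROB.keys())   # the six rules above, applied top-down
print(derivation_prob(example_derivation))    # about 0.0077
</Paragraph>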
<Section position="8" start_page="43" end_page="44" type="metho"> <SectionTitle> 5 Training the Model </SectionTitle> <Paragraph position="0"> There are two steps to training the model: first, recovering the underlying tree structure from the training data labels; second, deriving counts of the CF rule applications and bigram sequences and using these to estimate the parameters of the model.</Paragraph> <Section position="1" start_page="44" end_page="44" type="sub_section"> <SectionTitle> 5.1 Deriving the Tree Structures in Training Data </SectionTitle> <Paragraph position="0"> While the tree structure described in section 4.3 may seem complex, it is important to realise that it can be deterministically derived from an annotator's labeling of the Slots and Indicator. This section describes how the structure is derived in a bottom up fashion using the following annotated sentence as example input to the process: Last week [I Hensley West], 59 years old, [IND joined] the company as [P president], a surprising development.</Paragraph> <Paragraph position="1"> The 6 stages are as follows: 1. Identify the class of the indicator, and add this information to the IND label. Mark any prepositions adjacent to the slots. Label &quot;noise&quot; words with either PREN, POSTN, NOISE+ or NOISE-. The output from this stage would be: [PREN Last week] [I Hensley West] [NOISE+ , 59 years old ,] [IND(JOIN) joined] [NOISE- the company] [P.Prep-(JOIN) as] [P president] [POSTN , a surprising development .] 2. Build level 1 of the slots, by including attached prepositions, or just building a unary rule.</Paragraph> <Paragraph position="3"> 3. Build level 2 of the slots, by attaching NOISE+ or NOISE- leaves to the slots.</Paragraph> <Paragraph position="5"/> </Section> </Section> <Section position="9" start_page="44" end_page="44" type="metho"> <SectionTitle> 4. Build the binary-branching context-free structure </SectionTitle> <Paragraph position="0"> that defines the order of the slots and indicator.</Paragraph> <Section position="1" start_page="44" end_page="44" type="sub_section"> <SectionTitle> 5.2 Context Free Rule Probabilities </SectionTitle> <Paragraph position="0"> Once the training data is processed to have full context-free trees, the grammar can be automatically read from these trees, and event counts can be extracted and used to estimate the parameters of the model. The maximum likelihood estimate for a CF rule LHS -> RHS is P(RHS | LHS) = C(LHS -> RHS) / C(LHS), where C(z) is the number of times event z has been seen in training data. This estimate can be unreliable, particularly for low values of C(LHS). So we smooth this estimate with a &quot;backed off&quot; estimate Pb: P(RHS | LHS) = lambda * C(LHS -> RHS) / C(LHS) + (1 - lambda) * Pb, where 0 < lambda < 1. The backed off estimate Pb = C(RHSb, LHSb) / C(LHSb) is based on a subset of the context; the estimate is more robust, but is less detailed. For example, P(Class | Slots) can be backed off to an estimate P(Class) that ignores the identity of the slots when choosing the class of indicators. This method borrows heavily from smoothing techniques in language modeling for speech recognition -- (Jelinek 90) describes methods for estimating lambda.</Paragraph> </Section>
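<Paragraph> The estimator just described is easy to state in code. The sketch below computes the interpolated estimate from rule counts, using a crude backed-off context that simply drops the slot detail of the left hand side; the example rules, the choice of backed-off context and the fixed lambda are all placeholders (the paper estimates lambda as in (Jelinek 90)).

from collections import Counter

# Rule applications observed in hypothetical training trees: (LHS, RHS) pairs.
observed_rules = [
    ("T.IP", ("IND.IP[JOIN]",)), ("T.IP", ("IND.IP[NAME]",)),
    ("T.IP", ("IND.IP[NAME]",)), ("T.IO", ("IND.IO[SUCCEED]",)),
]
rule_count = Counter(observed_rules)
lhs_count = Counter(lhs for lhs, _ in observed_rules)

def backoff(lhs):
    # One possible reduced context: drop the slot detail ("T.IP" -> "T.").
    return lhs.split(".")[0] + "."

backed_rule_count = Counter((backoff(lhs), rhs) for lhs, rhs in observed_rules)
backed_lhs_count = Counter(backoff(lhs) for lhs, _ in observed_rules)

LAMBDA = 0.8   # placeholder; in practice estimated, e.g. by deleted interpolation

def rule_prob(lhs, rhs):
    """Smoothed P(RHS | LHS) = lambda * ML estimate + (1 - lambda) * backed-off estimate."""
    p_ml = rule_count[(lhs, rhs)] / lhs_count[lhs] if lhs_count[lhs] else 0.0
    blhs = backoff(lhs)
    p_bo = backed_rule_count[(blhs, rhs)] / backed_lhs_count[blhs] if backed_lhs_count[blhs] else 0.0
    return LAMBDA * p_ml + (1 - LAMBDA) * p_bo

print(rule_prob("T.IP", ("IND.IP[NAME]",)))   # 0.8 * 2/3 + 0.2 * 2/4 = 0.633...
</Paragraph>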
<Section position="2" start_page="44" end_page="44" type="sub_section"> <SectionTitle> 5.3 Bigram Probabilities </SectionTitle> <Paragraph position="0"> The bigram model is used at the leaves of the tree to generate the words themselves, for example to estimate P(the president | P). The most obvious way to estimate this is as P(the | START, P) * P(president | the, P) * P(END | president, P), with smoothing being implemented by interpolation between P(w | w-1, State), P(w | State) and 1/V, where V is the vocabulary size. Unfortunately we do not have space to go into the full details of the smoothing here (in the final implementation part-of-speech information was also used to smooth the estimates).</Paragraph> </Section> </Section> <Section position="10" start_page="44" end_page="46" type="metho"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"> This section describes experiments on the management successions domain. Before giving the results, we discuss how to deal with sentences that have more than one indicator.</Paragraph> <Section position="1" start_page="44" end_page="45" type="sub_section"> <SectionTitle> 6.1 Dealing with Sentences that have more than one Indicator </SectionTitle> <Paragraph position="0"> Thus far the model has assumed that there is only one indicator per sentence. However, training data frequently has more than one indicator, as in Mr. Smith was named president of the company, succeeding Fred Jones.</Paragraph> <Paragraph position="1"> There are two events in this sentence, one centered around named, the other centered around succeeding. The solution is to transform sentences in both training and test data to give one sentence per indicator; in this case the sentence would be expanded to give two sentences: Mr. Smith was *named* president of the company, succeeding Fred Jones.</Paragraph> <Paragraph position="2"> Mr. Smith was named president of the company, *succeeding* Fred Jones.</Paragraph> <Paragraph position="3"> The first sentence is for the named event, the second is for succeeding. The indicator is replaced with *indicator* to show that it is the indicator of interest -- when decoding test data the model either recognises *named* as a potential indicator, but ignores succeeding, or ignores named and recognises *succeeding*. If the sentence appeared in training data it would be transformed to give two training data trees. We should stress that this process is completely automatic once the indicators have been identified in the text.</Paragraph> </Section>
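<Paragraph> A small sketch of this expansion step is given below; the *indicator* marking follows the example above, while the whitespace tokenization and the way indicator positions are supplied are simplifications of whatever the real pre-processing does.

def expand_by_indicator(tokens, indicator_positions):
    """Produce one copy of the sentence per indicator, with only that
    indicator starred (as in the *named* / *succeeding* example above)."""
    copies = []
    for pos in indicator_positions:
        copy = list(tokens)
        copy[pos] = "*" + copy[pos] + "*"
        copies.append(" ".join(copy))
    return copies

sentence = "Mr. Smith was named president of the company , succeeding Fred Jones .".split()
for s in expand_by_indicator(sentence, indicator_positions=[3, 9]):
    print(s)
# Mr. Smith was *named* president of the company , succeeding Fred Jones .
# Mr. Smith was named president of the company , *succeeding* Fred Jones .
</Paragraph>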
<Section position="2" start_page="45" end_page="45" type="sub_section"> <SectionTitle> 6.2 Results </SectionTitle> <Paragraph position="0"> The model was trained on 563 sentences, and tested on another 356 sentences. (That is, 563/356 sentences after producing one sentence per indicator as described in section 6.1.) The sentences were taken from the &quot;Who's News&quot; section of the Wall Street Journal, which is almost exclusively about management successions. The training sentences were taken from 219 Who's News articles in the 1996 section, the test sentences were taken from 131 articles in the 1995 section. The sentence level annotation was part of an annotation effort for the full extraction task, which therefore also marked the relevant coreference relationships and the complete output template as in figure 1.</Paragraph> <Paragraph position="1"> The test data sentences always contain an event, and have all indicators marked as *indicator* -- only those indicators that have 1 or more slots attached to them are marked. This is an idealization, in that we avoid problems of false positives, cases where a potential indicator is not used to express an event. See section 6.4 for suggestions about how to extend the model to deal with false positives.</Paragraph> <Paragraph position="2"> The results are shown in table 4. We define precision and recall when comparing to the annotated test set answers (gold standard) as Precision = (Number of correct slots) / (Number of slots proposed) and Recall = (Number of correct slots) / (Number of slots in the gold standard). In addition we report the standard &quot;F-Measure&quot;, which is a combination of precision and recall: F = (2 * Precision * Recall) / (Precision + Recall).</Paragraph> <Paragraph position="4"> The results are quoted for the IN, OUT and POST slots (the IND slot is not scored, as it is marked in test data and would score 100% recall/precision, inflating the scores).</Paragraph> <Paragraph position="5"> The number of &quot;correct&quot; slots varies depending on how partial matches are scored - a partial match is where an output slot does not match a gold standard slot exactly, but does partially overlap. For example, in Bill Smith was elected vice president, human resources.</Paragraph> <Paragraph position="6"> the gold standard might designate the slot as &quot;vice president, human resources&quot;, whereas the program output might just mark &quot;vice president&quot;. We present three precision/recall scores -- where a partial match scores 0, 0.5 or 1.0, and Number of correct slots = Number of exact matches + (Score for a partial match) x (Number of partial matches).</Paragraph> </Section>
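<Paragraph> The scoring rule is simple enough to state directly in code; the sketch below computes precision, recall and F for a given partial-match score. The counts in the example call are placeholders, not the results reported in table 4.

def scores(exact, partial, proposed, gold, partial_score=0.5):
    """Precision / recall / F with partial matches scored 0, 0.5 or 1.0."""
    correct = exact + partial_score * partial
    precision = correct / proposed if proposed else 0.0
    recall = correct / gold if gold else 0.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f

# Hypothetical counts: 70 exact matches, 10 partial matches,
# 100 slots proposed, 105 slots in the gold standard.
for ps in (0.0, 0.5, 1.0):
    print(ps, scores(exact=70, partial=10, proposed=100, gold=105, partial_score=ps))
</Paragraph>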
<Section position="3" start_page="45" end_page="46" type="sub_section"> <SectionTitle> 6.3 Analysis of the results </SectionTitle> <Paragraph position="0"> In this section we look at the errors the system makes in more detail. There are two categories of error: precision errors (incorrect slots); and recall errors (slots the system failed to propose). For these tests we ran experiments on the training data, jack-knifing (i.e. using cross-validation) it into 4 sections, in each case training on three-quarters of the training set and testing on the other quarter. Tables 5 and 6 show the results on this data set.</Paragraph> <Paragraph position="1"> The precision errors were classified by hand into four categories. These four categories were: 1) Semantically Plausible. Here the model has selected a slot-filler that looks good semantically, but is ruled out for other reasons (usually syntactic). For example, The appointment puts (IN Mr. Zwirn), 41 years old, in line to succeed the unit's president, Frank R. Bakos, 58, who is (IND retiring) at year end.</Paragraph> <Paragraph position="2"> Here &quot;Mr. Zwirn&quot; is semantically a good filler for &quot;retiring&quot;, but syntactically this is almost impossible. We sub-divided this class into 3 sub-categories: problems with relative clauses, as in the example above; problems with non-relativized subjects, for example &quot;Brandon Sweitzer, 53, succeeds (IN Mr. Wakefield) as president of Guy Carpenter and also (IND becomes) (POST the unit's CEO), succeeding Richard Blum, 56&quot;; and problems that fell outside these categories.</Paragraph> <Paragraph position="3"> 2) &quot;Correct&quot;. These slots were not seen in the gold standard, but were deemed pretty much correct, in that they would not hurt (and might even help) the score of a full system. They fall into two sub-categories: &quot;good alternative&quot;, where the model's output is different from the gold standard but still looks reasonable, either because the sentence has more than one reasonable answer, or the gold standard is simply wrong; and &quot;> 1 reference&quot;, where there is more than one reference to the slot filler in the sentence, and the model has chosen a different one from the gold standard. For example, (OUT Mr. Johnson), 52, said he resigned (POST his positions as chief executive officer). Here the model marked &quot;Mr. Johnson&quot; as OUT, the annotator marked &quot;he&quot;, and both are in some sense correct. 3) Bad Lexical Information. In these cases the model selected a slot filler that is clearly bad for lexical reasons, for example Mr. Broeksmit is the (OUT latest) in a string of employees to (IND leave) the firm ...</Paragraph> <Paragraph position="4"> 4) Other. Miscellaneous errors which do not fall into the above three categories.</Paragraph> <Paragraph position="5"> Of the 356 test-set sentences, 330 (92.7%) were processed by the system to give some output -- no output was produced for 26 cases. This accounts for the recall figures in table 4 being lower than the precision figures (for example, with a score of 0 for partials, precision = 80.6%, recall = 74.6%, and 92.7% * 80.6% = 74.7%). Of these 26 cases, 24 involved an indicator word that had never been seen in training data. The other 2 cases involved an unusual usage of &quot;succeed&quot; that had never been seen in training data and was peculiar enough for the system to fail to get an analysis (we set a probability threshold such that the machine gives up if it fails to find an analysis above this probability).</Paragraph> </Section> <Section position="4" start_page="46" end_page="46" type="sub_section"> <SectionTitle> 6.4 Dealing with False Positives </SectionTitle> <Paragraph position="0"> This work has made a simplifying assumption, that test sentences were marked with indicators that had one or more slots. This section considers how this process could be automated.</Paragraph> <Paragraph position="1"> A first step would be to identify in test data morphological variants of words that had been seen as indicators in training data. However this would inevitably lead to false positives -- that is, potential indicators appearing in cases where they don't indicate an event. We could see two potential approaches for filtering out these spurious cases: first, word-sense disambiguation methods similar to those in (Yarowsky 95); second, we could extend the model to have an eighth, empty, template as a possibility; the model should then learn how often null templates occur, and what kind of lexical items tend to produce them.</Paragraph> <Paragraph position="2"> We leave this to future work.
At least in this dataset (Who's News articles) we believe that the false positive problem will not be severe, as the articles contain information almost exclusively on management successions, and most of the indicators are unambiguous within this sub-domain.</Paragraph> <Paragraph position="3"> The models have also made the assumption that an indicator is used to express each event. This may not be the case in all information extraction tasks; in some there may not be clear indicator words. Again, we leave dealing with this limitation to future work.</Paragraph> </Section> </Section> <Section position="11" start_page="46" end_page="46" type="metho"> <SectionTitle> 7 Future Work </SectionTitle> <Paragraph position="0"> We anticipate two directions for future work: first, refining the current model to improve its performance, and second, extending the current model to encompass the complete information extraction task.</Paragraph> <Section position="1" start_page="46" end_page="46" type="sub_section"> <SectionTitle> 7.1 Refining the Model </SectionTitle> <Paragraph position="0"> When deciding on the direction of future work, it is useful to consider the error analysis in table 7. The majority of errors (the &quot;semantically plausible&quot; class) were cases where the model picked a slot that was semantically plausible, but syntactically impossible. It is unlikely that this problem can be solved with the approach described here, even with vastly increased amounts of training data. Our feeling is that a full syntactic parser as a first stage could radically improve performance. An improved approach might be to fully integrate the recovery of syntactic structure and semantic labelings, in a similar way to the approach used in BBN's SIFT system (Miller et al. 98).</Paragraph> </Section> <Section position="2" start_page="46" end_page="46" type="sub_section"> <SectionTitle> 7.2 Extending the Model </SectionTitle> <Paragraph position="0"> As discussed in section 1.1, the standard approach to information extraction involves three stages of processing: sentence level pattern matching, coreference, and template merging. Of these stages, our current work addresses only sentence level pattern matching. However, we believe that the generative statistical framework described in this paper could be extended advantageously to the complete information extraction problem. In extending the framework, we envision that the information extraction task would be performed using an inverted &quot;information production&quot; model.</Paragraph> <Paragraph position="1"> We can think of this model as approximating, to some degree, the process by which text is produced by an author. Specifically, we assume that each message is produced according to a four stage process: 1) First, the author decides what facts to express. For example, the text in figure 1 can be thought of as expressing two succession events: IN = &quot;Hensley E. West&quot;, OUT = &quot;John Bradley&quot;, POST = &quot;president&quot;, COMPANY = &quot;RESTOR INDUSTRIES Inc.&quot;, and OUT = &quot;Hensley E. West&quot;, POST = &quot;group vice president&quot;, COMPANY = &quot;DSC Communications Corp.&quot;. This process can be modeled as a prior probability distribution over sets of templates.
In this example, the model would give the prior probability of a message containing exactly two succession templates: one containing slots IN, OUT, POST, COMPANY and the other containing slots OUT, POST, COMPANY.</Paragraph> <Paragraph position="2"> 2) After deciding what facts to express, the author must decompose them into one or more component events. For example, the succession event IN = &quot;Hensley E. West&quot;, OUT = &quot;John Bradley&quot;, POST = &quot;president&quot;, COMPANY = &quot;RESTOR INDUSTRIES Inc.&quot; is decomposed into two smaller events: IN = &quot;Hensley E. West&quot;, POST = &quot;president&quot;, COMPANY = &quot;RESTOR INDUSTRIES Inc.&quot; and IN = &quot;Hensley E. West&quot;, OUT = &quot;John Bradley&quot;. This process can be modeled as a probability distribution over &quot;template splitting operations&quot;, conditioned on the full template being expressed. Template splitting operations are thus the generative analogue of the merging operations used in most information extraction systems.</Paragraph> <Paragraph position="3"> 3) Next, each component event must be expressed as a linguistic pattern. For example, the event IN = &quot;Hensley E. West&quot;, POST = &quot;president&quot;, COMPANY = &quot;RESTOR INDUSTRIES Inc.&quot; is expressed as the linguistic pattern &quot;IN ... was named POST of COMPANY&quot;, and the event IN = &quot;Hensley E. West&quot;, OUT = &quot;John Bradley&quot; is expressed as the linguistic pattern &quot;IN ... fills a vacancy created by the retirement ... of OUT&quot;. This process can be modeled as a probability distribution over linguistic patterns, conditioned on the partial template being expressed. Modeling this distribution is the subject of the main body of this paper.</Paragraph> <Paragraph position="4"> 4) Finally, the entities involved in events must be realized as word strings within patterns. For example, &quot;RESTOR INDUSTRIES Inc.&quot; is realized as &quot;this telecommunications-product concern&quot;, and &quot;Hensley E. West&quot; is realized as &quot;Mr. West&quot;. This process can be modeled as a probability distribution over &quot;descriptor generating operations&quot;, conditioned on the entity being expressed and other features of the text. For example, given that the author intends to express &quot;Hensley E. West&quot;, and given that the full name appears earlier in the text, the model would assign a certain probability to generating the word string &quot;Mr. West&quot;. In this case, the descriptor generating operation would be [title + last name].</Paragraph> <Paragraph position="5"> Clearly, there are many details that would need to be resolved before a complete generative model of information extraction could be implemented. In this paper, we have described a model containing two of the necessary components: a prior model over templates, and a model of linguistic patterns conditioned on those templates. A complete generative model for IE would offer two potentially powerful advantages. First, the model would provide principled probability estimates for selecting the most likely set of templates given an input message: Templatesbest = argmax_Templates P(Templates | Message) = argmax_Templates P(Templates) * P(Message | Templates).</Paragraph>
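<Paragraph> The following schematic sketch shows how such a model could combine the four stages to score and select a set of templates for a message; every component model here is a placeholder stub, and the decomposition into four multiplicative factors is only an assumed reading of the stages described above (the paper implements only the template prior and the stage-3 pattern model).

def prior_prob(templates):                 # stage 1: prior over sets of templates
    return 0.01                            # placeholder value

def split_prob(templates, events):         # stage 2: template splitting operations
    return 0.10                            # placeholder value

def pattern_prob(events, patterns):        # stage 3: linguistic patterns (this paper's model)
    return 0.001                           # placeholder value

def realization_prob(patterns, message):   # stage 4: descriptor generating operations
    return 0.05                            # placeholder value

def joint_score(analysis, message):
    """Joint score of a complete analysis (templates, events, patterns) and the message."""
    templates, events, patterns = analysis
    return (prior_prob(templates) * split_prob(templates, events)
            * pattern_prob(events, patterns) * realization_prob(patterns, message))

def best_templates(candidate_analyses, message):
    """Pick the template set from the highest-scoring complete analysis."""
    best = max(candidate_analyses, key=lambda a: joint_score(a, message))
    return best[0]
</Paragraph>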
<Paragraph position="6"> The second potential advantage derives from the generative aspect of the proposed model. While there is an analogue in conventional IE systems for each of stages 2 through 4 described above, there is no conventional analogue to stage 1: the prior model. We can think of this prior model as encoding domain-specific world knowledge about the plausibility of proposed sets of relations.</Paragraph> </Section> </Section> </Paper>