<?xml version="1.0" standalone="yes"?>
<Paper uid="A97-1029">
  <Title>Nymble: a High-Performance Learning Name-finder</Title>
  <Section position="4" start_page="194" end_page="194" type="metho">
    <SectionTitle>
3. Model
</SectionTitle>
    <Paragraph position="0"> We will present the model twice, first in a conceptual and informal overview, then in a moredetailed, formal description of it as a type of HMM. The model bears resemblance to Scott Miller's novel work in the Air Traffic Information System (ATIS) task, as documented in (Miller et al., 1994).</Paragraph>
  </Section>
  <Section position="5" start_page="194" end_page="196" type="metho">
    <SectionTitle>
[Figure 3.1 labels: name-class states including PERSON and ORGANIZATION]
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="194" end_page="195" type="sub_section">
      <SectionTitle>
3.1 Conceptual Model
</SectionTitle>
      <Paragraph position="0"> Figure 3.1 is a pictorial overview of our model.</Paragraph>
      <Paragraph position="1"> Informally, we have an ergodic HMM with only eight internal states (the name classes, including the NOT-A-NAME class), with two special states, the START- and END-OF-SENTENCE states. Within each of the name-class states, we use a statistical bigram language model, with the usual one-word-per-state emission. This means that the number of states in each of the name-class states is equal to the vocabulary size, IVI.</Paragraph>
      <Paragraph position="2"> The generation of words and name-classes proceeds in three steps:  1. Select a name-class NC, conditioning on the previous name-class and the previous word.</Paragraph>
      <Paragraph position="3"> 2. Generate the first word inside that name-class, conditioning on the current and previous nameclasses. null 3. Generate all subsequent words inside the current  name-class, where each subsequent word is conditioned on its immediate predecessor.</Paragraph>
      <Paragraph position="4"> These three steps are repeated until the entire observed word sequence is generated. Using the Viterbi algorithm, we efficiently search the entire space of all possible name-class assignments, maximizing the numerator of Equation 2.2, Pr(W, NC).</Paragraph>
      <Paragraph position="5"> Informally, the construction of the model in this manner indicates that we view each type of &amp;quot;name&amp;quot; to be its own language, with separate bigram probabilities for generating its words. While the number of word-states within each name-class is equal  to IVI, this &amp;quot;interior&amp;quot; bigram language model is ergodic, i.e., there is a probability associated with every one of the IVI 2 transitions. As a parameterized, trained model, if such a transition were never observed, the model &amp;quot;backs off&amp;quot; to a less-powerful model, as described below, in SS3.3.3 on p. 4.</Paragraph>
    </Section>
    <Section position="2" start_page="195" end_page="195" type="sub_section">
      <SectionTitle>
3.2 Words and Word-Features
</SectionTitle>
      <Paragraph position="0"> Throughout most of the model, we consider words to be ordered pairs (or two-element vectors), composed of word and word-feature, denoted (w, f).</Paragraph>
      <Paragraph position="1"> The word feature is a simple, deterministic computation performed on each word as it is ~ to or feature computation is an extremely small part of the implementation, at roughly ten lines of code. Also, most of the word features are used to distinguish types of numbers, which are language-independent. 2 The rationale for having such features is clear: in Roman languages, capitalization gives good evidence of names. 3</Paragraph>
    </Section>
    <Section position="3" start_page="195" end_page="196" type="sub_section">
      <SectionTitle>
3.3 Formal Model
</SectionTitle>
      <Paragraph position="0"> This section describes the model formally, discussing the transition probabilities to the wordstates, which &amp;quot;generate&amp;quot; the words of each name-class.  Punctuation marks, all other words Table 3.1 Word features, examples and intuition behind them looked up in the vocabulary. It produces one of the fourteen values in Table 3.1.</Paragraph>
      <Paragraph position="1"> These values are computed in the order listed, so that in the case of non-disjoint feature-classes, such as containsDigitAndAlpha and containsDigitAndDash, the former will take precedence. The first eight features arise from the need to distinguish and annotate monetary amounts, percentages, times and dates. The rest of the features distinguish types of capitalization and all other words (such as punctuation marks, which are separate tokens). In particular, the firstWord feature arises from the fact that if a word is capitalized and is the first word of the sentence, we have no good information as to why it is capitalized (but note that allCaps and capPeriod are computed before fir s tWord, and therefore take precedence).</Paragraph>
      <Paragraph position="2"> The word feature is the one part of this model which is language-dependent. Fortunately, the word  have a most accurate, most powerful model, which will &amp;quot;back off&amp;quot; to a less-powerful model when there is insufficient training, and ultimately back-off to unigram probabilities.</Paragraph>
      <Paragraph position="3"> In order to generate the first word, we must make a transition from one name-class to another, as well as calculate the likelihood of that word. Our intuition was that a word preceding the start of a name-class (such as &amp;quot;Mr.&amp;quot;, &amp;quot;President&amp;quot; or other titles preceding the PERSON name-class) and the word following a name-class would be strong indicators of the subsequent and preceding name-classes, respectively.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="196" end_page="196" type="metho">
    <SectionTitle>
2 Non-English languages tend to use the comma and period in the reverse way in which English does, i.e., the comma is a decimal
</SectionTitle>
    <Paragraph position="0"> point and the period separates groups of three digits in large numbers. However, the re-ordering of the precedence of the two relevant word-features had little effect when decoding Spanish, so they were left as is.</Paragraph>
  </Section>
  <Section position="7" start_page="196" end_page="196" type="metho">
    <SectionTitle>
3 Spanish, however, has many lower-case words in organization names; see §4.1 on p. 6 for more details.
</SectionTitle>
    <Paragraph position="0"> Accordingly, the probabilitiy for generating the first word of a name-class is factored into two parts:</Paragraph>
    <Paragraph position="2"> The top level model for generating all but the first word in a name-class is Pr((w,f) I (w,f)_,, NO). (3.2) There is also a magical &amp;quot;+end+&amp;quot; word, so that the probability may be computed for any current word to be the final word of its name-class, i.e., pr((+end+,other) l(w,f):,~,NC ). (3.3) As one might imagine, it would be useless to have the first factor in Equation 3.1 be conditioned off of the +end+ word, so the probability is conditioned on the previous real word of the previous name-class, i.e., we compute</Paragraph>
    <Paragraph position="4"> Note that the above probability is not conditioned on the word-feature of w_ l, the intuition of which is that in the cases where the previous word would help the model predict the next name-class, the word feature---capitalization in particular--is not important: &amp;quot;Mr.&amp;quot; is a good indicator of the next word beginning the PERSON name-class, regardless of capitalization, especially since it is almost never seen as &amp;quot;mr.&amp;quot;.</Paragraph>
    <Section position="1" start_page="196" end_page="196" type="sub_section">
      <SectionTitle>
3.3.2 Calculation of Probabilities
</SectionTitle>
      <Paragraph position="0"> The calculation of the above probabilities is straightforward, using events/sample-size:</Paragraph>
      <Paragraph position="2"> where c0 represents the number of times the events occurred in the training data (the count).</Paragraph>
      <Paragraph position="3">  Ideally, we would have sufficient training (or at least one observation of!) every event whose conditional probability we wish to calculate. Also, ideally, we would have sufficient samples of that upon which each conditional probability is conditioned, e.g., for Pr(NC I NC_,, w_,), we would like to have seen sufficient numbers of NC_I, w_~. Unfortunately, there is rarely enough training data to compute accurate probabilities when &amp;quot;decoding&amp;quot; on new data.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="196" end_page="197" type="metho">
    <SectionTitle>
3.3.3.1 Unknown Words
</SectionTitle>
    <Paragraph position="0"> The vocabulary of the system is built as it trains.</Paragraph>
    <Paragraph position="1"> Necessarily, then, the system knows about all words for which it stores bigram counts in order to compute the probabilities in Equations 3.1 - 3.3. The question arises how the system should deal with unknown words, since there are three ways in which they can appear in a bigram: as the current word, as the previous word or as both. A good answer is to train a separate, unknown word-model off of held-out data, to gather statistics of unknown words occurring in the midst of known words.</Paragraph>
    <Paragraph position="2"> Typically, one holds out 10-20% of one's training for smoothing or unknown word-training.</Paragraph>
    <Paragraph position="3"> In order to overcome the limitations of a small amount of training data--particularly in Spanish--we hold out 50% of our data to train the unknown word-model (the vocabulary is built up on the first 50%), save these counts in training data file, then hold out the other 50% and concatentate these bigram counts with the first unknown word-training file. This way, we can gather likelihoods of an unknown word appearing in the bigram using all available training data. This approach is perfectly valid, as we am trying to estimate that which we have not legitimately seen in training. When decoding, if either word of the bigram is unknown, the model used to estimate the probabilities of Equations 3.1-3 is the unknown word model, otherwise it is the model from the normal training. The unknown word-model can be viewed as a first level of back-off, therefore, since it is used as a backup model when an unknown word is encountered, and is necessarily not as accurate as the bigram model formed from the actual training.</Paragraph>
    <Section position="1" start_page="196" end_page="197" type="sub_section">
      <SectionTitle>
3.3.3.2 Further Back-off Models and
Smoothing
</SectionTitle>
      <Paragraph position="0"> Whether a bigram contains an unknown word or not, it is possible that either model may not have seen this bigram, in which case the model backs off to a less-powerful, less-descriptive model. Table 3.2 shows a graphic illustration of the back-off scheme:  The weight for each back-off model is computed onthe-fly, using the following formula: If computing Pr(XIY), assign weight of ;~ to the direct computation (using one of the formulae of SS3.3.2) and a weight of (1 - ;t.) to the back-off model, where ( i ,</Paragraph>
      <Paragraph position="2"> where &amp;quot;old c(Y)&amp;quot; is the sample size of the model from which we are backing off. This is a rather simple method of smoothing, which tends to work well when there are only three or four levels of back-off: This method also overcomes the problem when a back-off model has roughly the same amount of training as the current model, via the first factor of Equation 3.8, which essentially ignores the back-off model and puts all the weight on the primary model, in such an equi-trained situation.</Paragraph>
      <Paragraph position="3"> As an example---disregarding the first factor--if we saw the bigram &amp;quot;come hither&amp;quot; once in training and we saw &amp;quot;come here&amp;quot; three times, and nowhere else did we see the word &amp;quot;come&amp;quot; in the NOT-A-NAME class, when computing Pr(&amp;quot;hither&amp;quot; I &amp;quot;come&amp;quot;, NOT-A-NAME), we would back off to the unigram probability Pr(&amp;quot;hither&amp;quot; I NOT-A-NAME) with a weight of 1/2, since the number of unique outcomes for the word-state for &amp;quot;come&amp;quot; would be two, and the total number of times &amp;quot;come&amp;quot; had been the preceding word in a bigram would be four (a</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="197" end_page="198" type="metho">
    <SectionTitle>
4 Any more levels of back-off might require a more sophisticated
</SectionTitle>
    <Paragraph position="0"> smoothing technique, such as deleted interpolation. No matter what smoothing technique is used, one must remember that smoothing is the art of estimating the probability of that which is unknown (i.e., not seen in training).</Paragraph>
    <Section position="1" start_page="197" end_page="197" type="sub_section">
      <SectionTitle>
3.4 Comparison with a traditional HMM
</SectionTitle>
      <Paragraph position="0"> Unlike a traditional HMM, the probability of generating a particular word is 1 for each word-state inside each of the name-class states. An alternative--and more traditional--model would have a small number of states within each name-class, each having, perhaps, some semantic signficance, e.g., three states in the PERSON name-class, representing a first, middle and last name, where each of these three states would have some probability associated with emitting any word from the vocabulary. We chose to use a bigram language model because, while less semantically appealing, such n-gram language models work remarkably well in practice. Also, as a first research attempt, an n-gram model captures the most general significance of the words in each name-class, without presupposing any specifics of the structure of names, ~i la the PERSON name-class example, above.</Paragraph>
      <Paragraph position="1"> More important, either approach is mathematically valid, as long as all transitions out of a given state sum to one.</Paragraph>
    </Section>
    <Section position="2" start_page="197" end_page="198" type="sub_section">
      <SectionTitle>
3.5 Decoding
</SectionTitle>
      <Paragraph position="0"> All of this modeling would be for naught were it not for the existence of an efficient algorithm for finding the optimal state sequence, thereby &amp;quot;decoding&amp;quot; the original sequence of name-classes. The number of possible state sequences for N states in an ergodic model for a sentence of m words is N m, but, using dynamic programming and an appropriate merging of multiple theories when they converge on a particular state--the Viterbi decoding algorithm--a sentence can be &amp;quot;decoded&amp;quot; in time linear to the number of tokens in the sentence, O(m) (Viterbi, 1967). Since we are  interested in recovering the name-class state sequence, we pursue eight theories at every given step of the algorithm.</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="198" end_page="199" type="metho">
    <SectionTitle>
4. Implementation and Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="198" end_page="198" type="sub_section">
      <SectionTitle>
4.1 Development History
</SectionTitle>
      <Paragraph position="0"> Initially, the word-feature was not in the model; instead the system relied on a third-level back-off part-of-speech tag, which in turn was computed by our stochastic part-of-speech tagger. The tags were taken at face value: there were not k-best tags; the system treated the part-of-speech tagger as a &amp;quot;black box&amp;quot;. Although the part-of-speech tagger used capitalization to help it determine proper-noun tags, this feature was only implicit in the model, and then only after two levels of back-off! Also, the capitalization of a word was submerged in the muddiness of part-of-speech tags, which can &amp;quot;smear&amp;quot; the capitalization probability mass over several tags. Because it seemed that capitalization would be a good name-predicting feature, and that it should appear earlier in the model, we eliminated the reliance on part-of-speech altogether, and opted for the more direct, word-feature model described above, in SS3. Originally, we had a very small number of features, indicating whether the word was a number, the first word of a sentence, all uppercase, inital-capitalized or lower-case. We then expanded the feature set to its current state in order to capture more subtleties related mostly to numbers; due to increased performance (although not entirely dramatic) on every test, we kept the enlarged feature set.</Paragraph>
      <Paragraph position="1"> Contrary to our expectations (which were based on our experience with English), Spanish contained many examples of lower-case words in organization and location names. For example, departamento (&amp;quot;Department&amp;quot;) could often start an organization name, and adjectival place-names, such as coreana (&amp;quot;Korean&amp;quot;) could appear in locations and by convention are not capitalized.</Paragraph>
    </Section>
    <Section position="2" start_page="198" end_page="198" type="sub_section">
      <SectionTitle>
4.2 Current Implementation
</SectionTitle>
      <Paragraph position="0"> The entire system is implemented in C++, atop a &amp;quot;home-brewed&amp;quot;, general-purpose class library, providing a rapid code-compile-train-test cycle. In fact, many NLP systems suffer from a lack of software and computer-science engineering effort: run-time efficiency is key to performing numerous experiments, which, in turn, is key to improving performance. A system may have excellent performance on a given task, but if it takes long to compile and/or run on test data, the rate of improvement of that system will be miniscule compared to that which can run very efficiently. On a Spare20 or SGI Indy with an appropritae amount of RAM, Nymble can compile in 10 minutes, train in 5 minutes and run at 6MB/hr. There were days in which we had as much as a 15% reduction in error rate, to borrow the performance measure used by the speech community, where error rate = 100% - Fmeasure. (See SS4.3 for the definition of F-measure.)</Paragraph>
    </Section>
    <Section position="3" start_page="198" end_page="199" type="sub_section">
      <SectionTitle>
4.3 Results of evaluation
</SectionTitle>
      <Paragraph position="0"> In this section we report the results of evaluating the final version of the learning software. We report the results for English and for Spanish and then the results of a set of experiments to determine the impact of the training set size on the algorithm's performance in both English and Spanish.</Paragraph>
      <Paragraph position="1"> For each language, we have a held-out development test set and a held-out, blind test set.</Paragraph>
      <Paragraph position="2"> We only report results on the blind test set for each respective language.</Paragraph>
      <Paragraph position="3">  The scoring program measures both precision and recall, terms borrowed from the information-retrieval community, where</Paragraph>
      <Paragraph position="5"> number correct in key Put informally, recall measures the number of &amp;quot;hits&amp;quot; vs. the number of possible correct answers as specified in the key file, whereas precision measures how many answers were correct ones compared to the number of answers delivered. These two measures of performance combine to form one measure of performance, the F-measure, which is computed by the weighted harmonic mean of precision and recall:</Paragraph>
      <Paragraph position="7"> where ff represents the relative weight of recall to precision (and typically has the value 1). To our knowledge, our learned name-finding system has achieved a higher F-measure than any other learned system when compared to state-of-the-art manual (rule-based) systems on similar data.</Paragraph>
      <Paragraph position="8">  Our test set of English data for reporting results is that of the MUC-6 test set, a collection of 30 WSJ documents (we used a different test set during development). Our Spanish test set is that used for MET, comprised of articles from the news agency AFP. Table 4.1 illustrates Nymble's performance as compared to the best reported scores for each category.  With any learning technique one of the important questions is how much training data is required to get acceptable performance. More generally how does performance vary as the training set size is increased or decreased? We ran a sequence of experiments in English and in Spanish to try to answer this question for the final model that was implemented.</Paragraph>
      <Paragraph position="9"> For English, there were 450,000 words of training data. By that we mean that the text of the document itself (including headlines but not including SGML tags) was 450,000 words long. Given this maximum size of training available to us, we successfully divided the training material in half until we were using only one eighth of the original training set size or a training set of 50,000 words for the smallest experiment. To give a sense of the size of 450,000 words, that is roughly half the length of one edition of the Wall Street Journal.</Paragraph>
      <Paragraph position="10"> The results are shown in a histogram in Figure 4.1 below. The positive outcome of the experiment is that half as much training data would have given almost equivalent performance. Had we used only one quarter of the data or approximately 100,000 words, performance would have degraded slightly, only about 1-2 percent. Reducing the training set size to 50,000 words would have had a more significant decrease in the performance of the system; however, the performance is still impressive even with such a small training set.</Paragraph>
      <Paragraph position="11">  Set Sizes on Performance in English. The learning algorithm performs remarkable well, nearly comparable to handcrafted systems with as little as 100,000 words of training data.</Paragraph>
      <Paragraph position="12"> On the other hand, the result also shows that merely annotating more data will not yield dramatic improvement in the performance. With increased training data it would be possible to use even more detailed models that require more data and could achieve significantly improved overall system performance with those more detailed models.</Paragraph>
      <Paragraph position="13"> For Spanish we had only 223,000 words of training data. We also measured the performance of the system with half the training data or slightly more than 100,000 words of text. Figure 4.2 shows the results. There is almost no change in performance by using as little as 100,000 words of training data.</Paragraph>
      <Paragraph position="14"> Therefore the results in both languages were comparable. As little as 100,000 words of training data produces performance nearly comparable to handcrafted systems.</Paragraph>
    </Section>
  </Section>
  <Section position="11" start_page="199" end_page="199" type="metho">
    <SectionTitle>
5. Further Work
</SectionTitle>
    <Paragraph position="0"> While our initial results have been quite favorable, there is still much that can be done potentially to improve performance and completely close the gap between learned and rule-based name-finding systems.</Paragraph>
    <Paragraph position="1"> We would like to incorporate the following into the current model: * lists of organizations, person names and locations * an aliasing algorithm, which dynamically updates the model (where e.g. IBM is an alias of</Paragraph>
    <Section position="1" start_page="199" end_page="199" type="sub_section">
      <SectionTitle>
International Business Machines)
</SectionTitle>
      <Paragraph position="0"> * longer-distance information, to find names not captured by our bigram model</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>