<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0420">
  <Title>Maximum Entropy Models for Named Entity Recognition</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Maximum Entropy Models
</SectionTitle>
    <Paragraph position="0"> For our approach, we directly factorize the posterior probability and determine the corresponding NE tag for each word of an input sequence. We assume that the decisions only depend on a limited window of words w_{n-2}, ..., w_{n+2} around the current word w_n and on the two predecessor tags c_{n-2} and c_{n-1}.</Paragraph>
    <Paragraph position="2"> The resulting conditional model has the maximum entropy form p_{lambda}(c | x) = exp(sum_i lambda_i f_i(x, c)) / sum_{c'} exp(sum_i lambda_i f_i(x, c')), (1) where the f_i are binary feature functions and the lambda_i are the model parameters.</Paragraph>
    <Paragraph position="4"> The architecture of the ME approach is summarized in Figure 1.</Paragraph>
    <Paragraph position="5"> As in the CoNLL-2003 shared task, the data sets often provide additional information such as part-of-speech (POS) tags. In order to take advantage of these knowledge sources, our system is able to process several input sequences at the same time.</Paragraph>
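The ME model sketched above can be illustrated in a few lines of code. This is a minimal sketch, assuming binary feature functions of the form f(context, tag) -&gt; 0/1; the function and context names are illustrative, not the paper's implementation.

```python
import math

def me_posterior(lambdas, features, tags, context):
    """Conditional ME model: p(tag | context) = exp(sum_i lambda_i f_i) / Z."""
    scores = {
        t: math.exp(sum(l * f(context, t) for l, f in zip(lambdas, features)))
        for t in tags
    }
    z = sum(scores.values())  # renormalization over all candidate tags
    return {t: s / z for t, s in scores.items()}
```

The normalization z is the renormalization term that the search phase later avoids by comparing unnormalized scores.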
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Feature Functions
</SectionTitle>
      <Paragraph position="0"> We have implemented a set of binary valued feature functions for our system: Lexical features: The words w_{n-2}, ..., w_{n+2}</Paragraph>
      <Paragraph position="2"> are compared to a vocabulary. Words which are seen less than twice in the training data are mapped onto an 'unknown word'. Formally, such a feature fires if the word at a given position in the window equals a specific vocabulary entry. Word features: Word characteristics are covered by the word features, which test for: - Capitalization: These features will fire if w_n is capitalized, has an internal capital letter, or is fully capitalized. - Digits and numbers: ASCII digit strings and number expressions activate these features.</Paragraph>
      <Paragraph position="3"> - Pre- and suffixes: If the prefix (suffix) of w_n equals a given prefix (suffix), these features will fire. Transition features: Transition features model the dependence on the two predecessor tags:</Paragraph>
      <Paragraph position="5"> Prior features: The single named entity priors are incorporated by prior features. They just fire for the current tag and thus model the prior distribution of the NE tags.</Paragraph>
      <Paragraph position="7"> Compound features: Using the feature functions defined so far, we can only specify features that refer to a single word or tag. To also capture word phrases and word/tag combinations, we introduce the following compound features:</Paragraph>
      <Paragraph position="9"> The dictionary features fire if an entry of a list of names or of a context list appears at or around the current word position w_n, respectively.</Paragraph>
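The binary feature functions above can be sketched as follows. The context representation (a dict holding the word window and the two predecessor tags) is an assumption made for illustration; only the feature types come from the text.

```python
# Lexical feature: fires if the word at a given window offset equals `word`.
def make_lexical_feature(offset, word):
    return lambda ctx, tag: 1 if ctx["window"].get(offset) == word else 0

# Capitalization word feature: fires if the current word w_n is capitalized.
def capitalized(ctx, tag):
    w = ctx["window"].get(0, "")
    return 1 if w[:1].isupper() else 0

# Prefix word feature: fires if w_n starts with a given prefix.
def make_prefix_feature(prefix):
    return lambda ctx, tag: 1 if ctx["window"].get(0, "").startswith(prefix) else 0

# Transition feature: fires for a specific (c_{n-2}, c_{n-1}, c_n) combination.
def make_transition_feature(prev_tags, target_tag):
    return lambda ctx, t: 1 if ctx["prev_tags"] == prev_tags and t == target_tag else 0
```

Each factory returns a 0/1-valued function, matching the binary feature functions used throughout this section.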
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Feature Selection
</SectionTitle>
      <Paragraph position="0"> Feature selection plays a crucial role in the ME framework. In our system, we use simple count-based feature reduction. Given a threshold T, we only include those features that have been observed on the training data at least T times. Although this method does not guarantee to obtain a minimal set of features, it turned out to perform well in practice.</Paragraph>
      <Paragraph position="1"> Experiments were carried out with different thresholds.</Paragraph>
      <Paragraph position="2"> It turned out that for the NER task, a threshold of a74 for the English data and a84 for the German corpus achieved the best results for all features, except for the prefix and suffix features, for which a threshold of a106 (a74a108a107 resp.) yielded best results.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Training
</SectionTitle>
      <Paragraph position="0"> For training purposes, we consider the set of manually annotated and segmented training sentences to form a single long sentence. As training criterion, we use the maximum class posterior probability criterion: the parameters lambda are chosen to maximize sum_{n=1}^{N} log p_{lambda}(c_n | c_{n-2}^{n-1}, w_{n-2}^{n+2}) over the training data. This corresponds to maximizing the likelihood of the ME model. Since the optimization criterion is convex, there is only a single optimum and no convergence problems occur. To train the model parameters lambda_i, we use the Generalized Iterative Scaling (GIS) algorithm (Darroch and Ratcliff, 1972).</Paragraph>
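A compact sketch of GIS for a conditional ME model with binary features. The data representation is an assumption made for illustration: each training event is a pair (active, gold), where active[tag] is the set of feature indices firing for that (event, tag) pair. A slack feature pads the feature counts to the constant sum C that GIS requires.

```python
import math

def gis_train(events, n_features, tags, iterations=100):
    """Generalized Iterative Scaling for a conditional ME model."""
    C = max(len(a[t]) for a, _ in events for t in tags) + 1
    lam = [0.0] * (n_features + 1)       # last slot: slack feature

    observed = [0.0] * (n_features + 1)  # feature counts on the training data
    for active, gold in events:
        for i in active[gold]:
            observed[i] += 1.0
        observed[-1] += C - len(active[gold])

    for _ in range(iterations):
        expected = [1e-12] * (n_features + 1)  # model expectations
        for active, _ in events:
            scores = {t: math.exp(sum(lam[i] for i in active[t])
                                  + lam[-1] * (C - len(active[t])))
                      for t in tags}
            z = sum(scores.values())
            for t in tags:
                p = scores[t] / z
                for i in active[t]:
                    expected[i] += p
                expected[-1] += p * (C - len(active[t]))
        # multiplicative GIS update, written in log space
        for i in range(n_features + 1):
            if observed[i] > 0.0:
                lam[i] += math.log(observed[i] / expected[i]) / C
    return lam
```

Each iteration moves the model expectations of the feature counts toward the observed counts; convexity of the criterion guarantees convergence to the single optimum.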
      <Paragraph position="1"> In practice, the training procedure tends to result in an overfitted model. To avoid overfitting, Chen and Rosenfeld (1999) have suggested a smoothing method in which a Gaussian prior on the parameters is assumed. Instead of maximizing the probability of the training data alone, we now maximize the probability of the training data times the prior probability of the model parameters; in log space, this amounts to subtracting a penalty term sum_i lambda_i^2 / (2 sigma^2) from the training criterion. This method avoids very large lambda values and prevents features that occur only once for a specific class from receiving an infinite weight. Note that there is only one parameter sigma for all model parameters lambda_i.</Paragraph>
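In log space, the smoothed criterion is just the log-likelihood minus a quadratic penalty with a single shared sigma, as a short sketch makes explicit (the function name is illustrative):

```python
def penalized_objective(log_likelihood, lam, sigma):
    """Smoothed training criterion: log-likelihood minus the Gaussian
    log-prior penalty sum_i lambda_i^2 / (2 sigma^2), up to a constant."""
    penalty = sum(l * l for l in lam) / (2.0 * sigma * sigma)
    return log_likelihood - penalty
```

A smaller sigma (tighter prior) penalizes large weights more strongly, which is exactly how the method keeps rare features from growing without bound.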
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Search
</SectionTitle>
      <Paragraph position="0"> In the test phase, the search is performed using the so-called maximum approximation, i.e. the most likely sequence of named entity tags, c_1^N = argmax over all tag sequences of the product over n of p(c_n | c_{n-2}^{n-1}, w_{n-2}^{n+2}), is chosen among all possible sequences. Therefore, the time-consuming renormalization in Eq. 1 is not needed during search. We run a Viterbi search to find the highest probability sequence (Borthwick et al., 1998).</Paragraph>
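The Viterbi search can be sketched as follows. Note one simplification: the model above conditions on the two predecessor tags, while this sketch conditions on one for brevity; `score` is an assumed stand-in for the unnormalized log model score.

```python
def viterbi(words, tags, score):
    """Maximum-approximation search: return the single best tag sequence.
    score(prev_tag, tag, words, n) is an unnormalized log score."""
    # best[t] = (score of best partial sequence ending in t, that sequence)
    best = {t: (score(None, t, words, 0), [t]) for t in tags}
    for n in range(1, len(words)):
        new = {}
        for t in tags:
            cands = [(best[p][0] + score(p, t, words, n), best[p][1] + [t])
                     for p in tags]
            new[t] = max(cands, key=lambda c: c[0])
        best = new
    return max(best.values(), key=lambda c: c[0])[1]
```

Because only unnormalized scores are compared, no per-position renormalization is computed during the search.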
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> Experiments were performed on English and German test sets. The English data was derived from the Reuters corpus, while the German test sets were extracted from the ECI Multilingual Text corpus. The data sets contain tokens (words and punctuation marks), information about the sentence boundaries, as well as the assigned NE tags.</Paragraph>
    <Paragraph position="1"> Additionally, a POS tag and a syntactic chunk tag were assigned to each token. On the tag level, we distinguish five tags (the four NE tags mentioned above and a filler tag).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Incorporating Lists of Names and
Non-annotated Data
</SectionTitle>
      <Paragraph position="0"> For the English task, extra lists of names were provided, and for both languages, additional non-annotated data was supplied. Hence, the challenge was to find ways of incorporating this information. Our system addresses this challenge via the use of dictionary features.</Paragraph>
      <Paragraph position="1"> While the provided lists could be integrated straightforwardly, the raw data was processed in three stages: 1. Given the annotated training data, we used all features except the dictionary ones to build a first baseline NE recognizer.</Paragraph>
      <Paragraph position="2"> 2. Applying this recognizer, the non-annotated data was processed and all named entities plus contexts (up to three words beside the classified NE and the two surrounding words) were extracted and stored as additional lists.</Paragraph>
      <Paragraph position="3"> 3. These lists could again be integrated straightforwardly. It turned out that a threshold of five yielded the best results both for the lists of named entities and for the context information.</Paragraph>
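Stages 2 and 3 above can be sketched as a single pass over the automatically tagged raw text. The exact context window and the tagged-sentence representation are assumptions for illustration:

```python
from collections import Counter

def build_lists(tagged_sentences, threshold=5):
    """Collect named entities and nearby context words from tagged raw text,
    keeping only entries seen at least `threshold` times."""
    entities, contexts = Counter(), Counter()
    for sent in tagged_sentences:            # sent: list of (word, NE tag)
        for i, (word, tag) in enumerate(sent):
            if tag != "O":                   # "O" stands in for the filler tag
                entities[word] += 1
                neighbours = sent[max(0, i - 2):i] + sent[i + 1:i + 3]
                for w, _ in neighbours:
                    contexts[w] += 1
    keep = lambda counts: {w for w, n in counts.items() if n >= threshold}
    return keep(entities), keep(contexts)
```

The resulting sets can then be integrated like the provided name lists, via the dictionary features.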
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> Table 1 and Table 2 present the results obtained on the development and test sets. For both languages, 1 000 GIS iterations were performed and the Gaussian prior method was applied.</Paragraph>
      <Paragraph position="1">  As can be seen from Table 1, our baseline recognizer clearly outperforms the CoNLL-2003 baseline. To investigate the contribution of the Gaussian prior method, several experiments were carried out for different standard deviation parameters sigma. Figure 2 depicts the obtained F-measures in comparison to the performance of non-smoothed ME models. The gain in performance is obvious. By incorporating the information extracted from the non-annotated data, our system is further improved. On the German data, however, the results show a performance degradation. The main reason for this is the capitalization of German nouns; refined lists of proper names are therefore necessary.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Summary
</SectionTitle>
    <Paragraph position="0"> In conclusion, we have presented a system for the task of named entity recognition that uses the maximum entropy framework. We have shown that a baseline system based on an annotated training set can be improved by incorporating additional non-annotated data.</Paragraph>
    <Paragraph position="1"> For future investigations, we have to think about a more sophisticated treatment of the additional information. One promising possibility could be to extend our system as follows: apply the baseline recognizer to annotate the raw data as before, but then use the output to train a new recognizer. The scores of the new system are incorporated as further features and the procedure is iterated until convergence.</Paragraph>
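The proposed extension can be sketched as a simple loop. The helpers train(annotated) -&gt; model and model.tag(raw) are hypothetical, as is the convergence test on the raw-data annotation:

```python
def iterate_self_training(annotated, raw, train, max_rounds=10):
    """Retrain on baseline-annotated raw data until the labels settle."""
    model = train(annotated)                 # baseline recognizer
    previous = None
    for _ in range(max_rounds):
        auto = model.tag(raw)                # annotate the raw data
        if auto == previous:                 # stop once the output converges
            break
        previous = auto
        model = train(annotated + auto)      # retrain with the new material
    return model
```

In the fuller scheme described above, the new system's scores would additionally be fed back as features rather than the labels alone.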
  </Section>
</Paper>