<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0122">
  <Title>On Using Ensemble Methods for Chinese Named Entity Recognition</Title>
  <Section position="4" start_page="0" end_page="143" type="intro">
    <SectionTitle>
2 Method
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="142" type="sub_section">
      <SectionTitle>
2.1 Machine Learning Models
</SectionTitle>
      <Paragraph position="0"> In this section, we introduce ME and CRF.</Paragraph>
      <Paragraph position="1"> Maximum Entropy ME[1] is a statistical modeling technique used for estimating the conditional probability of a target label based on given information. The technique computes the probability p(y|x), where y denotes all possible outcomes of the space, and x denotes all possible features of the space. The computation of p(y|x) depends on a set of fea- null tures in x; the features are helpful for making predictions about the outcomes, y.</Paragraph>
      <Paragraph position="2"> Given a set of features and a training set, the ME estimation process produces a model, in which every feature f</Paragraph>
      <Paragraph position="4"> . The ME model can be represented by the following formula:</Paragraph>
      <Paragraph position="6"> The probability is derived by multiplying the weights of the active features (i.e., those f</Paragraph>
      <Paragraph position="8"> A conditional random field (CRF)[5] can be seen as an undirected graph model in which the nodes corresponding to the label sequence y are conditional on the observed sequence x. The goal of CRF is to find the label sequence y that has the maximized probability, given an observation sequence x. The formula for the CRF model can be written as:</Paragraph>
      <Paragraph position="10"> where i means the relative position in the sequence, and y</Paragraph>
      <Paragraph position="12"> denote the label at position i-1 and i respectively. In this paper, we only consider linear chain and first-order Markov assumption CRFs. In NER applications, a feature function f</Paragraph>
      <Paragraph position="14"> is a label (such as Others).</Paragraph>
    </Section>
    <Section position="2" start_page="142" end_page="142" type="sub_section">
      <SectionTitle>
2.2 Chinese Named Entity Recognition
</SectionTitle>
      <Paragraph position="0"> In this section, we present the features applied in our CRF and ME models, namely, characters, words, and chuck information.</Paragraph>
      <Paragraph position="1"> Character Features The character features we apply in the CRF model and the ME model are presented in Tables 1 and 2 respectively. The numbers listed in the feature type column indicate the relative position of a character in the sliding window. For example, -1 means the previous character of the target character. Therefore, the characters in those positions are applied in the model. The numbers in parentheses mean that the feature includes a combination of the characters in those positions. The unigrams in Tables 1 and 2 indicate that the listed features only consider to their own labels, whereas the bigram model considers the combination of the current label and the previous label. Since ME does not consider multiple states in a single feature, there are only unigrams in Table 2. In addition, as ME can handle more features than CRF, we apply extra features in the ME model Table 1 Character features for CRF  Because of the limitations of the closed task, we use the NER corpus to train the segmentors based on the CRF model. To simulate noisy word information in the test corpus, we use a ten-fold method for training segmentors to tag the training corpus. The word features we apply in our NER systems are presented in Tables 3 and 4.</Paragraph>
      <Paragraph position="2"> In addition to the word itself, chuck information, i.e., the relative position of a character in a word, is also valuable information. Hence, we also add chuck information to our models. As the diversity of Chinese words is greater than that of Chinese characters, the number of features that can be used in CRF is much lower than the number that can be used in ME.</Paragraph>
      <Paragraph position="3"> Table 3 Word features for CRF</Paragraph>
    </Section>
    <Section position="3" start_page="142" end_page="143" type="sub_section">
      <SectionTitle>
2.3 Ensemble Methods
</SectionTitle>
      <Paragraph position="0"> We can not put all the features into the CRF model because of its limited resources. Therefore, we train several CRF classifiers with different feature sets so that we can use as many features  as possible. Then, we use the following simple, equally weighted linear equation, called majority vote, to combine the results of the CRF classifiers. null</Paragraph>
      <Paragraph position="2"> the decision of the result of the i th CRF model is y, otherwise it is zero. The highest score of y is chosen as the label of x. The results are incorporated into the Viterbi algorithm to search for the path with the maximum scores.</Paragraph>
      <Paragraph position="3"> In this paper, the first step in the majority vote experiment is to train three CRF classifiers with different feature sets. Then, in the second step, we use the results obtained in the first step to generate the voting scores for the Viterbi algorithm. null Memory Based learner The memory-based learning method memorizes all examples in a training corpus. If a word is unknown, the memory-based classifier uses the k-nearest neighbors to find the most similar example as the answer. Instead of using the complete algorithm of the memory-based learner, we do not handle unseen data. In our memory- based combination method, the learner remembers all named entities from the results of the various classifiers and then tags the characters that were originally tagged as &amp;quot;Other&amp;quot;. For example, if a character x is tagged by one classifier as &amp;quot;0&amp;quot; (&amp;quot;Others&amp;quot; tag) and if the memory-based classifier learns from another classifier that this character is tagged as PER, then x will be tagged as &amp;quot;B-PER&amp;quot; by the memory-based classifier. The obvious drawback of this method is that the precision rate might decrease as the recall rate increases. Therefore, we set the following three rules to filter out samples that are likely to have a high error rate.</Paragraph>
      <Paragraph position="4">  1. Named entities can not be tagged as different named entity tags by different classifiers. 2. We set an absolute frequency threshold to filter out examples that occur less than the threshold.</Paragraph>
      <Paragraph position="5"> 3. We set a relative frequency threshold to  filter out examples that occur less than the threshold. For example, if a word x appears 10 times in the corpus, then half of the instances of x have to be tagged as named entities; otherwise, x will be filtered out of the memory classifier.</Paragraph>
      <Paragraph position="6"> In our experiment, we used the memory-based learner to memorize the named entities from the tagging results of an ME classifier and a CRF classifier, and then tagged the tagging results of the CRF classifier.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML