<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0125">
  <Title>Chinese Word Segmentation and Named Entity Recognition Based on a Context-Dependent Mutual Information Independence Model</Title>
  <Section position="4" start_page="154" end_page="211" type="metho">
    <SectionTitle>
2 Mutual Information Independence Model
</SectionTitle>
    <Paragraph position="0"> In this paper, we use a discriminative Markov model, called the Mutual Information Independence Model (MIIM) as proposed by Zhou et al (2002), for Chinese word segmentation and named entity recognition. MIIM is derived from a conditional probability model. Given an observation sequence $O_1^n = o_1 o_2 \cdots o_n$, MIIM finds a stochastic optimal state sequence $S_1^n = s_1 s_2 \cdots s_n$ that maximizes: $$\log P(S_1^n \mid O_1^n) = \sum_{i=1}^{n} \log \frac{P(s_i \mid S_1^{i-1})}{P(s_i)} + \sum_{i=1}^{n} \log P(s_i \mid O_1^n)$$ We call the above model the Mutual Information Independence Model due to its Pair-wise Mutual Information (PMI) assumption (Zhou et al 2002). The model consists of two sub-models: the state transition model $\sum_{i=1}^{n} \log \frac{P(s_i \mid S_1^{i-1})}{P(s_i)}$, which can be computed by applying ngram modeling, and the output model $\sum_{i=1}^{n} \log P(s_i \mid O_1^n)$, which can be estimated by any probability-based classifier, such as a maximum entropy classifier or an SVM plus sigmoid classifier (Zhou et al 2006). In this competition, the SVM plus sigmoid classifier is used in Chinese word segmentation, while a simple backoff approach as described in Zhou et al (2002) is used in named entity recognition.</Paragraph>
    <Paragraph position="9"> Here, a variant of the Viterbi algorithm (Viterbi 1967) in decoding the standard Hidden Markov Model (HMM) (Rabiner 1989) is implemented to find the most likely state sequence by replacing the state transition model and the output model of the standard HMM with the state transition model and the output model of the MIIM, respectively. The above MIIM has been successfully applied in many applications, such as text chunking (Zhou 2004), Chinese word segmentation ( Zhou 2005), English named entity recognition in the newswire domain (Zhou et al 2002) and the biomedical domain (Zhou et al 2004; Zhou et al 2006).</Paragraph>
    <Paragraph position="10"> For Chinese word segmentation and named entity recognition by chunking, a word or a entity name is regarded as a chunk of one or more word atoms and we have:</Paragraph>
    <Paragraph position="12"> w is the thi [?] word atom in the sequence of word atoms</Paragraph>
    <Paragraph position="14"> p measures the word formation power of the word atom</Paragraph>
    <Paragraph position="16"> w occurring as a whole word (round to 10%) o The percentage of</Paragraph>
    <Paragraph position="18"> of other words (round to 10%) o The length of</Paragraph>
    <Paragraph position="20"> o Especially for named entity recognition, the percentages of a word occurring in different entity types (round to 10%).</Paragraph>
    <Paragraph position="22"> s : the states are used to bracket and differentiate various types of words and optional entity types for named entity recognition. In this way, Chinese word segmentation and named entity recognition can be regarded as a bracketing and classification process.</Paragraph>
    <Paragraph position="24"> s is structural and consists of two parts: o Boundary category (B): it includes four values: {O, B, M, E}, where O means that current word atom is a whOle word or entity name and B/M/E means that current word atom is at the Beginning/in the Middle/at the End of a word or entity name. o Unit category (W): It is used to denote the type of the word or entity name.</Paragraph>
    <Paragraph position="25"> Because of the limited number of boundary and unit categories, the current word atom formation pattern</Paragraph>
    <Paragraph position="27"> p described above is added into the state transition model in MIIM. This makes the above MIIM context dependent as follows:</Paragraph>
    <Paragraph position="29"> The third step is post processing, which tries to resolve ambiguous segmentations and false unknown word generation raised in the second step. Due to time limit, this is only done in Chinese word segmentation, i.e. no post processing is done on Chinese named entity recognition.</Paragraph>
    <Paragraph position="30">  A simple pattern-based method is employed to capture context information to correct the segmentation errors generated in the second steps. The pattern is designed as follows: &lt;Ambiguous Entry (AE)&gt;  |&lt;Left Context, Right Context&gt; =&gt; &lt;Proper Segmentation&gt; The ambiguity entry (AE) means ambiguous segmentations or forced-generated unknown words. We use the 1 st and 2 nd words before AE as the left context and the 1 st and 2 nd words after AE as the right context. To reduce sparseness, we also only use the 1 st left and right words as context. This means that there are two patterns generated for the same context. All the patterns are automatically learned from training corpus using the following algorithm.</Paragraph>
    <Paragraph position="31"> LearningPatterns()  // Input: training corpus // Output: patterns BEGIN (1) Training a MIIM model using training corpus (2) Using the MIIM model to segment training corpus (3) Aligning the training corpus with the segmented training corpus (4) Extracting error segmentations (5) Generating disambiguation patterns using the left and right context (6) Removing the conflicting entries if two  patterns have the same left hand side but different right hand side.</Paragraph>
  </Section>
  <Section position="5" start_page="211" end_page="211" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> We first develop our system using the PKU data released in the Second SIGHAN Bakeoff last year. Then, we train and evaluate it on the Third SIGHAN Bakeoff corpora without any fine-tuning. We only carry out our evaluation on the closed tracks. It means that we do not use any additional knowledge beyond the training corpus.</Paragraph>
    <Paragraph position="1"> Precision (P), Recall (R), F-measure (F), OOV Recall and IV Recall are adopted to measure the performance of word segmentation. Accuracy (A), Precision (P), Recall (R) and F-measure (F) are adopted to measure the performance of NER.</Paragraph>
    <Paragraph position="2"> Tables 1, 2 and 3 in the next page report the performance of our algorithm on different corpus in the SIGHAN Bakeoff 02 and Bakeoff 03, respectively. For the performance of other systems, please refer to http://sighan.cs.uchicago.edu/bakeoff2005/data/r esults.php.htm for the Chinese bakeoff 2005 and http://sighan.cs.uchicago.edu/bakeoff2006/longst ats.html for the Chinese bakeoff 2006.</Paragraph>
    <Paragraph position="3"> Comparison against other systems shows that our system achieves the state-of-the-art performance on all Chinese word segmentation closed tracks and shows good scalability across different corpora. The small performance gap should be able to overcome by replacing the word unigram model with the more powerful word bigram model. Due to very limited time of less than three days, although our NER system under the unified framework as Chinese word segmentation does not achieve the state-of-the-art, its performance in NER is quite promising and provides a good platform for further improvement. Error analysis reveals that OOV is still an open problem that is far from to resolve. In addition, different corpus defines different segmentation principles. This will stress OOV handling in the extreme. Therefore a system trained on one genre usually performances worse when faced with text from a different register.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML