<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1038">
  <Title>Self-Organizing Markov Models and Their Application to Part-of-Speech Tagging</Title>
  <Section position="7" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
6 Experiments
</SectionTitle>
    <Paragraph position="0"> We performed a series of experiments to compare the performance of self-organizing Markov models with that of traditional Markov models. The Wall Street Journal portion of the Penn Treebank II was used as the reference material. Since the experimental task is part-of-speech tagging, all other annotations, such as syntactic bracketing, were removed from the corpus.</Paragraph>
    <Paragraph position="1"> Every digit in the corpus was replaced with a special symbol.</Paragraph>
    <Paragraph position="2"> From the whole corpus, every 10th sentence, starting from the first, was selected into the test corpus; the remaining sentences constitute the training corpus. Table 6 shows some basic statistics of the corpora.</Paragraph>
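The 1-in-10 split described above can be sketched as follows (a minimal illustration; the function name and toy sentences are ours, not from the paper):

```python
def split_corpus(sentences):
    """Every 10th sentence, starting from the first, goes to the test
    corpus; the remaining sentences form the training corpus."""
    test = [s for i, s in enumerate(sentences) if i % 10 == 0]
    train = [s for i, s in enumerate(sentences) if i % 10 != 0]
    return train, test

# Toy example: 25 "sentences" yield a 22/3 train/test split.
sentences = [f"sent-{i}" for i in range(25)]
train, test = split_corpus(sentences)
```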
    <Paragraph position="3"> We implemented several tagging models based on equation (3). For the tag language model, we used the models given in equations (8)-(13), induced by the following procedure:

Algorithm 1: SDTL(E, t, F)
Data: E: set of examples, t: target feature, F: set of contextual features
Result: statistical decision tree predicting t
  initialize a null node;
  for each element f in the set F do
    sort the meaningful value set V for f;
    if |V| &gt; 1 then
      measure the contribution of f to t;
      if f contributes the most then
        select f as the best feature b;
  if there is a b selected then
    set the current node to an internal node;
    set b as the test feature of the current node;
    for each v in V for b do
      make SDTL(E_{b=v}, t, F \ {b}) the subtree for the branch corresponding to v;
</Paragraph>
    <Paragraph position="7"> Equations (8) and (9) represent first- and second-order Markov models, respectively. Equations (10)-(13) represent self-organizing Markov models under various settings, where the classification functions are intended to be induced from the training corpus.</Paragraph>
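As a point of reference, the plain first- and second-order tag language models of equations (8) and (9) can be estimated by maximum likelihood, as sketched below (the function name and toy tag sequences are ours; smoothing is omitted):

```python
from collections import Counter

def train_ngram_tag_model(tag_sequences, order):
    """Maximum-likelihood tag transition model
    P(t_i | t_{i-order} .. t_{i-1}); order=1 and order=2 correspond to the
    first- and second-order models of equations (8) and (9)."""
    ctx_counts, ngram_counts = Counter(), Counter()
    for tags in tag_sequences:
        padded = ["<s>"] * order + list(tags)  # sentence-start padding
        for i in range(order, len(padded)):
            ctx = tuple(padded[i - order:i])
            ngram_counts[ctx + (padded[i],)] += 1
            ctx_counts[ctx] += 1
    return {ng: c / ctx_counts[ng[:-1]] for ng, c in ngram_counts.items()}

corpus = [["DT", "NN", "VBZ"], ["DT", "NN", "NN"]]
m1 = train_ngram_tag_model(corpus, order=1)   # e.g. P(NN | DT)
m2 = train_ngram_tag_model(corpus, order=2)   # e.g. P(VBZ | DT, NN)
```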
    <Paragraph position="8"> For the estimation of the tag-to-word translation model, we used the following model:</Paragraph>
    <Paragraph position="10"> Equation (14) uses two different models to estimate the translation model. If the word wi is a known word, ki is set to 1 and the second model is ignored. ^P denotes the maximum-likelihood probability. P(ki|ti) is the probability of knownness generated from ti and is estimated using Good-Turing estimation (Gale and Sampson, 1995). If the word wi is an unknown word, ki is set to 0 and the first term is ignored. ei represents the suffix of wi; we used the last two letters of the word.</Paragraph>
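The two branches of equation (14) can be sketched as follows. The probability tables here are placeholder inputs of our own; in the paper, P(k|t) comes from Good-Turing estimation and the word and suffix models are maximum-likelihood estimates:

```python
def translation_prob(word, tag, known_words,
                     p_known_given_tag, p_word_given_tag, p_suffix_given_tag):
    """Tag-to-word translation probability P(w_i | t_i) in the spirit of
    equation (14): known words (k_i = 1) use the word model weighted by
    P(k_i = 1 | t_i); unknown words (k_i = 0) back off to a model over the
    suffix e_i (the last two letters), weighted by P(k_i = 0 | t_i)."""
    p_k = p_known_given_tag.get(tag, 0.0)
    if word in known_words:                    # k_i = 1: suffix term ignored
        return p_k * p_word_given_tag.get((word, tag), 0.0)
    suffix = word[-2:]                         # e_i = last two letters of w_i
    return (1.0 - p_k) * p_suffix_given_tag.get((suffix, tag), 0.0)

# Toy tables (invented numbers, for illustration only).
known = {"runs"}
p_known = {"VBZ": 0.9}
p_word = {("runs", "VBZ"): 0.5}
p_suffix = {("ps", "VBZ"): 0.2}

p_seen = translation_prob("runs", "VBZ", known, p_known, p_word, p_suffix)
p_oov = translation_prob("glorps", "VBZ", known, p_known, p_word, p_suffix)
```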
    <Paragraph position="11"> With the 6 tag language models and the 1 tag-to-word translation model, we constructed 6 HMM models, of which 2 are traditional first- and second-order hidden Markov models and 4 are self-organizing hidden Markov models. Additionally, we used T3, a trigram-based POS tagger from ICOPOST release 1.8.3, for comparison.</Paragraph>
    <Paragraph position="12"> The overall performances of the resulting models, estimated on the test corpus, are listed in figure 7. From left to right, it shows the model name, the contextual features, the target features, the performance, and the model size for our 6 implementations of Markov models; the performance of T3 is also shown.</Paragraph>
    <Paragraph position="13"> Our implementation of the second-order hidden Markov model (HMM-P2) achieved slightly worse performance than T3, which we attribute to the relatively simple implementation of our unknown-word guessing module.</Paragraph>
    <Paragraph position="14"> While HMM-P2 is a uniformly extended model of HMM-P1, SOHMM-P2 has been selectively extended using the same contextual feature. It is encouraging that the self-organizing model cuts the model size by more than half (2,099 Kbytes vs. 5,630 Kbytes) without loss of performance (96.5%).</Paragraph>
    <Paragraph position="15"> In a sense, the results of incorporating word features (SOHMM-P1W1, SOHMM-P2W1, and SOHMM-P2W2) are disappointing. The performance improvements are very small compared to the increase in model size. Our interpretation is that because the distribution of words is huge, no matter how many words the models incorporate into context modeling, only a few of them actually contribute during the test phase. We plan to use more general features such as word class, suffix, etc.</Paragraph>
    <Paragraph position="16"> Another positive observation is that a homogeneous context extension (SOHMM-P2) and a heterogeneous context extension (SOHMM-P1W1) each yielded significant improvements, and their combination (SOHMM-P2W1) yielded even more improvement. This is a strong point of using decision trees rather than prediction suffix trees.</Paragraph>
  </Section>
</Paper>