<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2022">
  <Title>HMM Based Chunker for Hindi</Title>
  <Section position="3" start_page="0" end_page="126" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Motivation and Problem Statement
</SectionTitle>
      <Paragraph position="0"> A robust chunker or shallow parser has emerged as an important component in a variety of NLP applications. It is employed in information extraction, named entity identi cation, search, and even in machine translation. While chunkers may be built using handcrafted linguistic rules, these tend to be fragile, need a relatively long time to develop because of many special cases, and saturate quickly. The task of chunking is ideally suited for machine learning because of robustness and relatively easy training.</Paragraph>
      <Paragraph position="1"> A chunker or shallow parser identi es simple or non-recursive noun phrases, verb groups and simple adjectival and adverbial phrases in running text. In this work, the shallow parsing task has been broken up into two subtasks: rst, identifying the chunk boundaries and second, labelling the chunks with their syntactic categories.</Paragraph>
      <Paragraph position="2"> The rst sub-problem is to build a chunker that takes a text in which words are tagged with part of speech (POS) tags as its input, and marks the chunk boundaries in its output. Moreover, the chunker is to be built by using machine learning techniques requiring only modest amount of training data. The second sub-problem is to label the chunks with their syntactic categories.</Paragraph>
      <Paragraph position="3"> The presented work aims at building a chunker for Hindi. Hindi is spoken by approximately half a billion people in India. It is a relatively free word order language with simple morphology (albeit a little more complex than that of English).</Paragraph>
      <Paragraph position="4"> At present, no POS taggers or chunkers are available for Hindi.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="126" type="sub_section">
      <SectionTitle>
1.2 Survey of Related Work
</SectionTitle>
      <Paragraph position="0"> Chunking has been studied for English and other languages, though not very extensively. The earliest work on chunking based on machine learning goes to (Church K, 1988) for English. (Ramshaw and Marcus, 1995) used transformation based learning using a large annotated corpus for English. (Skut and Brants, 1998) modi ed Church's approach, and used standard HMM based tagging methods to model the chunking process. (Zhou,et al., 2000) continued using the same methods, and achieved an accuracy of 91.99% precision and 92.25% recall using a contextual lexicon.</Paragraph>
      <Paragraph position="1"> (Kudo and Matsumoto, 2001) use support vec- null tor machines for chunking with 93.48% accuracy for English. (Veenstra and Bosch, 2000) use memory based phrase chunking with accuracy of 91.05% precision and 92.03% recall for English.</Paragraph>
      <Paragraph position="2"> (Osborne, 2000) experimented with various sets of features for the purpose of shallow parsing.</Paragraph>
      <Paragraph position="3"> In this work, we have used HMM based chunking. We report on a number of experiments showing the effect of different encoding methods on accuracy. Different encodings of the input show the effect of including either words only, POS tags only, or a combination thereof, in training. Their effect on transition probabilities is also studied. We do not use any externally supplied lexicon.</Paragraph>
      <Paragraph position="4"> Analogous to (Zhou,et al., 2000), we found that for certain POS categories, a combination of word and the POS category must be used in order to obtain good results. We report on detailed experiments which show the effect of each of these combinations on the accuracy. This experience can also be used to build chunkers for other languages. The overall accuracy reached for Hindi is 92.63% precision with 100% recall for chunk boundaries.</Paragraph>
      <Paragraph position="5"> The rest of the paper is structured as follows.</Paragraph>
      <Paragraph position="6"> Section 2 discusses the problem formulation and reports the results of some initial experiments. In Section 3, we present a different representation of chunks which signi cantly increased the accuracy of chunking. In Section 4, we present a detailed error analysis, based on which changes in chunk tags are carried out. These changes increased the accuracy. Section 5 describes experiments on labelling of chunks using rule-based and statistical methods.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="126" end_page="126" type="metho">
    <SectionTitle>
2 Initial Experiments
</SectionTitle>
    <Paragraph position="0"> Given a sequence of words W n =</Paragraph>
    <Paragraph position="2"> set and the sequence of corresponding part of speech (POS) tags T n = (t1; t2; ; tn); ti 2 T where T is the POS tag set, the aim is to create most probable chunks of the sequence W n. The chunks are marked with chunk tag sequence Cn = (c1; c2; ; cn) where ci stands for the chunk tag corresponding to each word wi, ci 2 C.</Paragraph>
    <Paragraph position="3"> C here is the chunk tag set which may consist of symbols such as STRT and CNT for each word marking it as the start or continuation of a chunk. In our experiment, we combine the corresponding words and POS tags to get a sequence of new tokens V n = (v1; v2; ; vn) where vi = (wi; ti) 2 V. Thus the problem is to nd the sequence Cn given the sequence of tokens V nwhich maximizes the probability</Paragraph>
    <Paragraph position="5"> which is equivalent to maximizing P(V njCn)P(Cn).</Paragraph>
    <Paragraph position="6"> We assume that given the chunk tags, the tokens are statistically independent of each other and that each chunk tag is probabilistically dependent on the previous k chunk tags ((k + 1)-gram model). Using chain-rule, the problem reduces to that of Hidden Markov Model (HMM) given by</Paragraph>
    <Paragraph position="8"> (2) where the probabilities in the rst term are emission probabilities and in the second term are transition probabilities. The optimal sequence of chunk tags can be found using the Viterbi algorithm. For training and testing of HMM we have used the TnT system (Brants, 2000). Since TnT is implemented up to a tri-gram model, we use a second order HMM (k = 2) in our study.</Paragraph>
    <Paragraph position="9"> Before discussing the possible chunk sets and the token sets, we consider an example below.</Paragraph>
  </Section>
  <Section position="5" start_page="126" end_page="127" type="metho">
    <SectionTitle>
STRT CNT STRT CNT
</SectionTitle>
    <Paragraph position="0"> In this example, the chunk tags considered are STRT and CNT where STRT indicates that the new chunk starts at the token which is assigned this tag and CNT indicated that the token which is assigned this tag is inside the chunk. We refer to this as 2-tag scheme. Under second-order HMM, the prediction of chunk tag at ith token is conditional on the only two previous chunk tags.</Paragraph>
    <Paragraph position="1"> Thus in the example, the fact that the chunk terminates at the word pIche (behind) with the POS tag PREP is not captured in tagging the token jangal (forest). Thus, the assumptions that the  tokens given the chunk tags are independent restricts the prediction of subsequent chunk tags. To overcome this limitation in using TnT, we experimented with additional chunk tags.</Paragraph>
    <Paragraph position="2"> We rst considered a 3-tag scheme by including an additional chunk tag STP which indicates end of chunk. It was further extended to a 4-tag scheme by including one more chunk tag STRT STP to mark the chunks which consist of a single word. A summary of the different tag  schemes and the tag description is given below.</Paragraph>
    <Paragraph position="3"> 1. 2-tag Scheme: fSTRT, CNTg 2. 3-tag Scheme: fSTRT, CNT, STPg 3. 4-tag Scheme: fSTRT, CNT, STP,</Paragraph>
  </Section>
  <Section position="6" start_page="127" end_page="128" type="metho">
    <SectionTitle>
STRT_STP}
</SectionTitle>
    <Paragraph position="0"> where tags stand for: STRT: A chunk starts at this token CNT: This token lies in the middle of a chunk STP: This token lies at the end of a chunk STRT STP: This token lies in a chunk of its own We illustrate the three tag schemes using part of the earlier example sentence.</Paragraph>
    <Paragraph position="1">  We further discuss the different types of input tokens used in the experiment. Since the tokens are obtained by combining the words and POS tags we considered 4 types of tokens given by  1. Word only 2. POS tag only: Only the part of speech tag of the word was used 3. Word POStag: A combination of the word followed by POS tag 4. POStag Word: A combination of POS tag  followed by word.</Paragraph>
    <Paragraph position="2"> Note that the order of Word and POS tag in the token might be important as the TnT module uses suf x information while carrying out smoothing of transition and emission probabilities for sparse data. An example of the Word POStag type of tokens is given below.</Paragraph>
    <Paragraph position="3"> ((sher ))((hiraN ke pIche))...</Paragraph>
    <Paragraph position="4"> lion deer of behind...</Paragraph>
    <Paragraph position="6"> The annotated data set contains Hindi texts of 200,000 words. These are annotated with POS tags, and chunks are marked and labelled (NP, VG, JJP, RBP, etc). This annotated corpus was prepared at IIIT Hyderabad from funds provided by HP Labs. The POS tags used in the corpus are based on the Penn tag set. Hewever, there are a few additional tags for compound nouns and verbs etc.</Paragraph>
    <Paragraph position="7"> Out of the total annotated data, 50,000 tokens were kept aside as unseen data. A set of 150,000 tokens was used for training the different HMM representations. This set converted into the appropriate format based on the representation being used. 20,000 tokens of the unseen data were used for development testing.</Paragraph>
    <Paragraph position="8">  The initial results using various tag sets and token sets are presented in Table 1. The rst three rows show the raw scores of different tagging schemes. To compare across the different schemes, the output were converted to the reduced chunk tag sets which are denoted by 4!3, 4!2 and 3!2 in the table. This ensures that the measurement metric is the same no matter which tagging scheme is used, thus allowing us to compare across the tagging schemes. The last three rows show the result of using It should be noted that converting from the 4 tag set to 3 or 2 tags results in no loss in information. This is because it is trivial to convert  fromt the 2-tag set to the corresponding 4-tag set and vice-versa. Even though the information content in the 3 different chunk tag representations is the same, using higher tag scheme for training and then later converting back to 2-tags results in a signi cant improvement in the precision of the tagger. For example, in the case where we took 'Word POSTag' as the token, using 4-tag set the original precision was 73.64%. When precision was measured by reducing the tag set to 3 tags, we obtained a precision of 79.56%. Four tags reduced to two gave the highest precision of 85.6%. However, these differences may be interpreted as the result of changing the measurement metric.</Paragraph>
    <Paragraph position="9"> This gure of 85.6% may be compared with a precision of 81.85% obtained when the 2-tag set was used. Recall in all the cases was 100%.</Paragraph>
  </Section>
  <Section position="7" start_page="128" end_page="128" type="metho">
    <SectionTitle>
3 Incorporating POS Context in Output
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="128" end_page="128" type="sub_section">
      <SectionTitle>
Tags
</SectionTitle>
      <Paragraph position="0"> We attempted modi cation of chunk tags using contextual information. The new output tags considered were a combination of POS tags and chunk tags using any one of the chunk tag schemes discussed in the earlier section.</Paragraph>
      <Paragraph position="1"> The new format of chunk tags considered was POS:ChunkTag, which is illustrated for 2-tag scheme in the example below.</Paragraph>
      <Paragraph position="3"> Token: sher_NN hiran_NN ke_PREP...</Paragraph>
      <Paragraph position="4"> 2-tag: NN:STRT NN:STRT PREP:CNT...</Paragraph>
      <Paragraph position="5"> The tokens (V) were left unchanged. Our intention in doing this was to bring in a ner degree of learning. By having part of speech information in the chunk tag, the information about the POS-tag of the previous word gets incorporated in the transition probabilities. In the earlier chunk schemes, this information was lost due to the assumption of independence of tokens given chunk tags. In other words, part of speech information would now in uence both the transition and emission probabilities of the model instead of just the emission probabilities.</Paragraph>
      <Paragraph position="6"> We carried out the experiment with these modi ed tags. Based on the results in Table 1 for various tokens, we restricted our choice of tokens to Word POStags only. Also, while combining POS tags with chunk tags, the 4-tag scheme was used.</Paragraph>
      <Paragraph position="7"> The accuracy with 4-tag scheme was 78.80% and for 4 ! 2 scheme, it turned out to be 88.63%.</Paragraph>
      <Paragraph position="8"> This was a signi cant improvement.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="128" end_page="129" type="metho">
    <SectionTitle>
4 Error Analysis and Further
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="128" end_page="129" type="sub_section">
      <SectionTitle>
Enhancements
</SectionTitle>
      <Paragraph position="0"> We next carried out the error analysis on the results of the last experiment. We looked at which type of words were resulting in the maximum errors, that is, we looked at the frequencies of errors corresponding to the various part of speech.</Paragraph>
      <Paragraph position="1"> These gures are given in Table 2. On doing this analysis we found that a large number of errors were associated with NN (nouns), VFM ( nite verbs) and JJ (adjectives). Most of these errors were coming in possibly because of sparsity of the data. Hence we removed the word information from these types of input tokens and left only the POS tag. This gave us an improved precision of 91.04%. Further experiments were carried out on  the other POS tags. Experiments were done to see what performed better - a combination of word and POS tag or the POS tag alone. It was found that seven groups of words - PRP, QF (quantiers), QW, RB (adverbs), VRB, VAUX (auxillary verbs) and RP (particles) performed better with a combination of word and POS tag as the token.</Paragraph>
      <Paragraph position="2"> All the other words were replaced with their POS tags.</Paragraph>
      <Paragraph position="3"> An analysis of the errors associated with punctuations was also done. It was found that the set of punctuations f ! : ? , ' g was better at marking chunks than other symbols. Therefore, these punctuations were kept in the tokens while the  other symbols were reduced to a common marker (SYM).</Paragraph>
      <Paragraph position="4"> After performing these steps, the chunker was tested on the same testing corpus of 20,000 tokens. The precision achieved was 92.03% with a recall of 100% for the development testing data. Table 3 gives the stepwise summary of results of this experiment. The rst coloumn of the table gives different token sets described above. Error  marks to f! , : ? 'g 84.03 92.03 analysis of this experiment is given in Table 4. On comparing with Table 2, it may be seen that the number of errors associated with almost all the POS types has reduced signi cantly, thereby resulting in the improved precision.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="129" end_page="130" type="metho">
    <SectionTitle>
5 Chunk Labels
</SectionTitle>
    <Paragraph position="0"> Once the chunk boundaries are marked, the next task is to classify the chunk. In our scheme there are 5 types of chunks - NP (noun phrase), VG (verb group), JJP (adjectival phrase) RBP (adverbial phrase) and BLK (others). We tried two methods for deciding chunk labels. One was based on machine learning while the other was based on rules.</Paragraph>
    <Section position="1" start_page="129" end_page="129" type="sub_section">
      <SectionTitle>
5.1 HMM Based Chunk Labelling
</SectionTitle>
      <Paragraph position="0"> In this method, the chunk boundary tags are augmented with the chunk labels while learning. For example, the tags for the last token in a chunk could have additional information in the form of the chunk label.</Paragraph>
      <Paragraph position="1">  was marked with the chunk label. (See example above. ) The best results were obtained with scheme 3, which when reduced to the common metric of 2-tags only gave a precision of 92.15% (for chunk boundaries only) which exceeded the result for chunk boundaries alone (92.03%). The accuracy for scheme 3 with the chunk boundaries and chunk labels together was 90.16%. The corresponding gures for scheme 1 were 91.70% and 90.00%, while for scheme 2 they were 92.02% and 88.05%.</Paragraph>
    </Section>
    <Section position="2" start_page="129" end_page="130" type="sub_section">
      <SectionTitle>
5.2 Rules Based Chunk Labels
</SectionTitle>
      <Paragraph position="0"> Since there are only ve types of chunks, it turns out that the application of rules to nd out the chunk-type is very effective and gives good results. An outline of the algorithm used for the purpose is given below.</Paragraph>
      <Paragraph position="1"> For each chunk, nd the last token ti whose POS does not belong to the set fSYM, RP, CC, PREP, QFg. (Such tags do not help in classifying the chunks.)  If ti is a noun/pronoun, verb, adjective or adverb, then label the chunk as NP, VG, JJP or RBP respectively.</Paragraph>
      <Paragraph position="2"> Otherwise, label the chunk as BLK.</Paragraph>
      <Paragraph position="3"> In our experiments, we found that over 99% of the chunks identi ed were given the correct chunk labels. Thus, the best method for doing chunk boundary identi cation is to train the HMM with both boundary and syntactic label information together (as given in Section 6.1). Now given a test sample, the trained HMM can identify both the chunk boundaries and labels. The chunk labels are then dropped to obain data marked with chunk boundaries only. Now rule based labelling is applied ( with an accuracy of over 99%) yielding a precision of 91.70% (test set) for the composite task.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>