File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/00/w00-1309_abstr.xml
Size: 8,711 bytes
Last Modified: 2025-10-06 13:41:53
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1309"> <Title>Error-driven HMM-based Chunk Tagger with Context-dependent Lexicon</Title> <Section position="1" start_page="0" end_page="72" type="abstr"> <SectionTitle> Abstract </SectionTitle>
<Paragraph position="0"> This paper proposes a new error-driven HMM-based text chunk tagger with a context-dependent lexicon. Compared with a standard HMM-based tagger, this tagger uses a new Hidden Markov Modelling approach which incorporates more contextual information into a lexical entry.</Paragraph>
<Paragraph position="1"> Moreover, an error-driven learning approach is adopted to decrease the memory requirement by keeping only positive lexical entries, and to make it possible to further incorporate more context-dependent lexical entries. Experiments show that this technique achieves overall precision and recall rates of 93.40% and 93.95% for all chunk types, 93.60% and 94.64% for noun phrases, and 94.64% and 94.75% for verb phrases when trained on PENN WSJ TreeBank sections 00-19 and tested on sections 20-24, while 25-fold validation experiments on the PENN WSJ TreeBank show overall precision and recall rates of 96.40% and 96.47% for all chunk types, 96.49% and 96.99% for noun phrases, and 97.13% and 97.36% for verb phrases.</Paragraph>
<Paragraph position="2"> Introduction Text chunking divides sentences into non-overlapping segments on the basis of fairly superficial analysis. Abney(1991) proposed this as a useful and relatively tractable precursor to full parsing, since it provides a foundation for further levels of analysis, while still allowing more complex attachment decisions to be postponed to a later phase.</Paragraph>
<Paragraph position="3"> Text chunking typically relies on fairly simple and efficient processing algorithms.</Paragraph>
<Paragraph position="4"> Recently, many researchers have looked at text chunking in two different ways: some have applied rule-based methods, combining lexical data with finite state or other rule constraints, while others have worked on inducing statistical models either directly from the words and/or from automatically assigned part-of-speech classes. Among the statistics-based approaches, Skut and Brants(1998) proposed an HMM-based approach to recognise syntactic structures of limited length. Buchholz, Veenstra and Daelemans(1999), and Veenstra(1999) explored a memory-based learning method to find labelled chunks. Ratnaparkhi(1998) used maximum entropy to recognise arbitrary chunks as part of a tagging task. Among the rule-based approaches, Bourigault(1992) used some heuristics and a grammar to extract &quot;terminology noun phrases&quot; from French text. Voutilainen(1993) used a similar method to detect English noun phrases. Kupiec(1993) applied a finite state transducer in his noun phrase recogniser for both English and French.</Paragraph>
<Paragraph position="6"> Ramshaw and Marcus(1995) used transformation-based learning, an error-driven learning technique introduced by Eric Brill(1993), to locate chunks in the tagged corpus. Grefenstette(1996) applied finite state transducers to find noun phrases and verb phrases.</Paragraph>
<Paragraph position="7"> In this paper, we will focus on statistics-based methods. The structure of this paper is as follows: In section 1, we briefly describe the new error-driven HMM-based chunk tagger with context-dependent lexicon in principle.
In section 2, a baseline system which only includes the current part-of-speech in the lexicon is given. In section 3, several extended systems with different context-dependent lexicons are described. In section 4, an error-driven learning method is used to decrease the memory requirement of the lexicon by keeping only positive lexical entries and to make it possible to further improve the accuracy by merging different context-dependent lexicons into one after automatic analysis of the chunking errors. Finally, the conclusion is given.</Paragraph>
<Paragraph position="8"> The data used for all our experiments is extracted from the PENN WSJ TreeBank (Marcus et al. 1993) by the program provided by Sabine Buchholz from Tilburg University.</Paragraph>
<Paragraph position="9"> We use sections 00-19 as the training data and sections 20-24 as the test data. The performance is therefore measured on a large-scale task instead of the small-scale CoNLL-2000 task, using the same evaluation program.</Paragraph>
<Paragraph position="10"> For evaluation of our results, we use the precision and recall measures. Precision is the percentage of predicted chunks that are actually correct, while recall is the percentage of correct chunks that are actually found. For convenient comparison by a single value, we also list the $F_{\beta=1}$ value (Rijsbergen 1979): $F_{\beta} = \frac{(\beta^2 + 1)\cdot \mathrm{precision}\cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}$, with $\beta = 1$.</Paragraph>
<Paragraph position="12"> The idea of using statistics for chunking goes back to Church(1988), who used corpus frequencies to determine the boundaries of simple non-recursive noun phrases. Skut and Brants(1998) modified Church's approach in a way permitting efficient and reliable recognition of structures of limited depth and encoded the structure in such a way that it can be recognised by a Viterbi tagger. This makes the process run in time linear in the length of the input string.</Paragraph>
<Paragraph position="13"> Our approach follows Skut and Brants' way by employing an HMM-based tagging method to model the chunking process.</Paragraph>
<Paragraph position="14"> Given a token sequence $G_1^n = g_1 g_2 \cdots g_n$, the goal is to find a stochastic optimal tag sequence $T_1^n = t_1 t_2 \cdots t_n$ which maximizes $\log P(T_1^n \mid G_1^n)$: $$\log P(T_1^n \mid G_1^n) = \log P(T_1^n) + \log \frac{P(T_1^n, G_1^n)}{P(T_1^n)\, P(G_1^n)}.$$ The second item in the above equation is the mutual information between the tag sequence $T_1^n$ and the given token sequence $G_1^n$. By assuming that the mutual information between $G_1^n$ and $T_1^n$ is equal to the summation of the mutual information between $G_1^n$ and the individual tags $t_i$, we have: $$\log P(T_1^n \mid G_1^n) = \log P(T_1^n) - \sum_{i=1}^{n} \log P(t_i) + \sum_{i=1}^{n} \log P(t_i \mid G_1^n).$$</Paragraph>
<Paragraph position="16"> The first item of the above equation can be solved by using the chain rule. Normally, each tag is assumed to be probabilistically dependent on the N-1 previous tags. Here, a backoff bigram (N=2) model is used. The second item is the summation of the log probabilities of all the tags.</Paragraph>
<Paragraph position="17"> Both the first item and the second item correspond to the language model component of the tagger, while the third item corresponds to the lexicon component of the tagger. Ideally, the third item can be estimated by using the forward-backward algorithm (Rabiner 1989) recursively for first-order (Rabiner 1989) or second-order HMMs (Watson and Chunk 1992). However, several approximations to it will be attempted later in this paper instead. The stochastic optimal tag sequence can be found by maximizing the above equation over all the possible tag sequences.
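To make the decomposition above concrete, the following is a minimal Viterbi sketch over its three terms (bigram language model, summed tag priors, and per-tag lexicon term). It is not the authors' implementation: the scoring callables transition_logp, prior_logp, and lexicon_logp are hypothetical stand-ins for the estimated model components, and backoff smoothing is omitted.

    def viterbi_chunk_tags(tokens, tags, transition_logp, prior_logp, lexicon_logp):
        """Find the tag sequence T maximizing
            log P(T) - sum_i log P(t_i) + sum_i log P(t_i | G),
        with a bigram language model approximating log P(T).

        transition_logp(prev_tag, tag) -> log P(tag | prev_tag); prev_tag is
                                          None at the sentence start
        prior_logp(tag)                -> log P(tag)
        lexicon_logp(tag, token)       -> log P(tag | token context)
        All three callables are hypothetical stand-ins for the paper's
        estimated model components.
        """
        best = [{} for _ in tokens]   # best[i][t]: best score ending in tag t at position i
        back = [{} for _ in tokens]   # back-pointers for recovering the sequence

        for t in tags:
            best[0][t] = (transition_logp(None, t)        # language model term
                          - prior_logp(t)                  # minus the tag prior
                          + lexicon_logp(t, tokens[0]))    # lexicon term
            back[0][t] = None

        for i in range(1, len(tokens)):
            for t in tags:
                local = -prior_logp(t) + lexicon_logp(t, tokens[i])
                score, prev = max(
                    ((best[i - 1][p] + transition_logp(p, t), p) for p in tags),
                    key=lambda x: x[0])
                best[i][t] = score + local
                back[i][t] = prev

        # Trace back the optimal sequence from the best final tag.
        last = max(best[-1], key=best[-1].get)
        seq = [last]
        for i in range(len(tokens) - 1, 0, -1):
            seq.append(back[i][seq[-1]])
        return list(reversed(seq))

In the paper's terms, lexicon_logp is the part that the context-dependent lexical entries of sections 2-4 approximate, while transition_logp and prior_logp together form the language model component.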
In the tagger itself, this maximization is implemented by the Viterbi algorithm.</Paragraph>
<Paragraph position="18"> The main difference between our tagger and other standard taggers lies in the fact that our tagger uses a context-dependent lexicon while the others use a context-independent lexicon.</Paragraph>
<Paragraph position="19"> For the chunk tagger, we have $g_i = p_i w_i$, where $W_1^n = w_1 w_2 \cdots w_n$ is the word sequence and $P_1^n = p_1 p_2 \cdots p_n$ is the part-of-speech sequence.</Paragraph>
<Paragraph position="21"> Here, we use structural tags to represent the chunking (bracketing and labelling) structure. The basic idea of representing the structural tags is similar to that of Skut and Brants(1998), and the structural tag consists of three parts: 1) Structural relation. The basic idea is simple: structures of limited depth are encoded using a finite number of flags. Given a sequence of input tokens (here, the word and part-of-speech pairs), we consider the structural relation between the previous input token and the current one. For the recognition of chunks, it is sufficient to distinguish the following four different structural relations, which uniquely identify the sub-structures of depth 1 (Skut and Brants used seven different structural relations to identify the sub-structures of depth 2).</Paragraph>
<Paragraph position="22">
00 the current input token and the previous one have the same parent
90 one ancestor of the current input token and the previous input token have the same parent
09 the current input token and one ancestor of the previous input token have the same parent
99 one ancestor of the current input token and one ancestor of the previous input token have the same parent
For example, in the following chunk-tagged sentence (NULL represents the beginning and end of the sentence):</Paragraph> </Section> </Paper>
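The example sentence announced in the last paragraph is not present in this extract. As a stand-alone illustration of the four structural-relation flags for depth-1 chunk structures, the sketch below assumes each token is labelled with the chunk it belongs to (or None when it sits directly under the sentence node); the mapping and the example sentence are illustrative, not taken from the paper.

    def structural_relation(prev_chunk, cur_chunk):
        """Return the structural-relation flag between two adjacent tokens.

        Each argument is the chunk id the token belongs to, or None when the
        token sits directly under the sentence node.  This is one plausible
        reading of the four depth-1 relations described above, not the
        authors' exact encoding code.
        """
        if prev_chunk == cur_chunk:          # same chunk, or both at top level
            return "00"
        if prev_chunk is None:               # a chunk opens at the current token
            return "90"
        if cur_chunk is None:                # the previous token's chunk closes
            return "09"
        return "99"                          # one chunk closes, another opens

    # Example: "[NP He] [VP reckons] [NP the deficit]" with NULL boundary tokens.
    chunks = [None, "NP1", "VP1", "NP2", "NP2", None]   # NULL, He, reckons, the, deficit, NULL
    flags = [structural_relation(p, c) for p, c in zip(chunks, chunks[1:])]
    print(flags)   # ['90', '99', '99', '00', '09']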