<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0737">
<Title>Hybrid Text Chunking</Title>
<Section position="3" start_page="0" end_page="163" type="metho">
<SectionTitle>
2 HMM-based Chunk Tagger with Context-dependent Lexicon
</SectionTitle>
<Paragraph position="0"> Given a token sequence $G_1^n = g_1 g_2 \cdots g_n$, the goal is to find an optimal tag sequence $T_1^n = t_1 t_2 \cdots t_n$ which maximizes
$$\log P(T_1^n \mid G_1^n) = \log P(T_1^n) + \log \frac{P(T_1^n, G_1^n)}{P(T_1^n)\,P(G_1^n)}$$
</Paragraph>
<Paragraph position="2"> The second item in the above equation is the mutual information between the tag sequence $T_1^n$ and the given token sequence $G_1^n$. By assuming that the mutual information between $G_1^n$ and $T_1^n$ is equal to the summation of the mutual information between $G_1^n$ and the individual tags $t_i$, we have
$$\log P(T_1^n \mid G_1^n) = \log P(T_1^n) - \sum_{i=1}^{n} \log P(t_i) + \sum_{i=1}^{n} \log P(t_i \mid G_1^n)$$
</Paragraph>
<Paragraph position="4"> The first item of the above equation can be solved by the chain rule. Normally, each tag is assumed to be probabilistically dependent on the N-1 previous tags; here a backoff bigram (N=2) model is used. The second item is the summation of the log probabilities of all the tags. The first and second items together constitute the language model component, while the third item constitutes the lexicon component. Ideally, the third item can be estimated recursively by the forward-backward algorithm (Rabiner 1989) for first-order or second-order HMMs.</Paragraph>
<Paragraph position="5"> However, several approximations of it will be attempted later in this paper instead. The stochastically optimal tag sequence can be found by maximizing the above equation over all possible tag sequences using the Viterbi algorithm. The main difference between our tagger and standard taggers is that ours uses a context-dependent lexicon while others use a context-independent lexicon.</Paragraph>
<Paragraph position="6"> For the chunk tagger, we have $g_i = p_i w_i$, where $p_i$ is the part-of-speech of the word $w_i$, and $T_1^n$ is the structural tag sequence.</Paragraph>
<Paragraph position="8"> Here, we use structural tags to represent the chunking (bracketing and labelling) structure. The basic idea of the structural tags is similar to Skut and Brants (1998); a structural tag consists of three parts: 1) Structural relation. The basic idea is simple: structures of limited depth are encoded using a finite number of flags. Given a sequence of input tokens (here, word and POS pairs), we consider the structural relation between the previous input token and the current one. For the recognition of chunks, it is sufficient to distinguish the following four structural relations, which uniquely identify the sub-structures of depth 1 (Skut and Brants used seven different structural relations to identify the sub-structures of depth 2).</Paragraph>
<Paragraph position="9">
* 00: the current input token and the previous one have the same parent
* 90: one ancestor of the current input token and the previous input token have the same parent
* 09: the current input token and one ancestor of the previous input token have the same parent
* 99: one ancestor of the current input token and one ancestor of the previous input token have the same parent
Compared with the B-Chunk and I-Chunk tags used in Ramshaw and Marcus (1995), structural relations 99 and 90 correspond to B-Chunk, which marks the first word of a chunk, and structural relations 00 and 09 correspond to I-Chunk, which marks every other word of a chunk; in addition, 90 also marks the beginning of the sentence and 09 marks the end of the sentence.
2) Phrase category. This is used to identify the phrase categories of input tokens.</Paragraph>
<Paragraph position="10"> 3) Part-of-speech. Because of the limited number of structural relations and phrase categories, the POS is added to the structural tag to yield more accurate models.</Paragraph>
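The correspondence between the four relation codes and chunk-initial versus chunk-internal words can be made concrete with a small conversion routine. The sketch below is illustrative and not part of the paper: the B/I/O-style input format, the function name, and the treatment of the sentence-initial token are our assumptions based on the description above.

```python
# Illustrative sketch (not from the paper): derive the structural-relation part
# of the tag for each token from B/I/O-style chunk labels.  Per Section 2,
# 90/99 behave like B-Chunk, 00/09 like I-Chunk, 90 also marks the sentence
# beginning and 09 the sentence end.

def structural_relations(chunk_labels):
    """chunk_labels: list such as ["B", "I", "O", "B", "I"] (assumed format).
    Returns one relation code per token, describing the relation between the
    previous input token and the current one."""
    relations = []
    for i, cur in enumerate(chunk_labels):
        prev = chunk_labels[i - 1] if i > 0 else None
        if prev is None:
            relations.append("90")   # sentence-initial token, by convention
        elif cur == "B" and prev == "O":
            relations.append("90")   # chunk starts after a sentence-level token
        elif cur == "B" and prev in ("B", "I"):
            relations.append("99")   # a new chunk starts right after another chunk
        elif cur == "I":
            relations.append("00")   # same parent (same chunk) as the previous token
        elif cur == "O" and prev in ("B", "I"):
            relations.append("09")   # back to the sentence level after a chunk
        else:                        # cur == "O" and prev == "O"
            relations.append("00")   # both tokens attach directly to the sentence
    return relations

print(structural_relations(["B", "I", "O", "B", "I"]))
# ['90', '00', '09', '90', '00']
```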
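With the structural tags defined, decoding follows the equation above: each candidate tag contributes a backoff-bigram term $\log P(t_i \mid t_{i-1})$, a $-\log P(t_i)$ term, and a context-dependent lexicon term, and the best sequence is found with the Viterbi algorithm. The following is a minimal sketch, not the authors' implementation; the three probability estimates and the candidate-tag lookup are assumed to be supplied from training data.

```python
import math

# Minimal Viterbi sketch of the scoring in Section 2 (illustrative only).
# `bigram`, `prior` and `lexicon` stand for P(t_i | t_{i-1}), P(t_i) and the
# context-dependent lexicon estimate of P(t_i | G); how they are estimated
# and smoothed is assumed to happen elsewhere.

def viterbi(tokens, candidate_tags, bigram, prior, lexicon, start_tag="<s>"):
    """tokens: list of (word, pos) pairs; candidate_tags(word, pos) returns the
    possible structural tags.  Returns the highest-scoring tag sequence."""
    best = {start_tag: (0.0, [])}                      # partial score, partial path
    for word, pos in tokens:
        new_best = {}
        for tag in candidate_tags(word, pos):
            # lexicon component: log P(t_i | context) - log P(t_i)
            local = math.log(lexicon(tag, word, pos)) - math.log(prior(tag))
            for prev_tag, (score, path) in best.items():
                # language model component: log P(t_i | t_{i-1})
                cand = score + math.log(bigram(tag, prev_tag)) + local
                if tag not in new_best or cand > new_best[tag][0]:
                    new_best[tag] = (cand, path + [tag])
        best = new_best
    return max(best.values())[1]
```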
<Paragraph position="11"> In principle, the current chunk depends on all the context words and their POSs. However, in order to reduce memory requirements and computational complexity, our baseline HMM-based chunk tagger only considers the previous POS, the current POS, and those of their word tokens whose POSs are of certain kinds, such as prepositions and determiners. The overall precision, recall and $F_{\beta=1}$ rates of our baseline tagger on the test data of the shared task are 89.58%, 89.56% and 89.57%.</Paragraph>
</Section>
<Section position="4" start_page="163" end_page="163" type="metho">
<SectionTitle>
3 Error-driven Learning
</SectionTitle>
<Paragraph position="0"> After analysing the chunking results, we find that many errors are caused by a limited number of words. In order to overcome such errors, we include these words in the chunk-dependence context by means of error-driven learning. First, the above HMM-based chunk tagger is used to chunk the training data. Second, the chunk tags assigned by the tagger are compared with the given chunk tags in the training data, and for each word its chunking errors are counted. Finally, those words whose chunking error counts are equal to or above a given threshold (here, 3) are kept. The HMM-based chunk tagger is then re-trained with these words included in the chunk-dependence context.</Paragraph>
<Paragraph position="1"> The overall precision, recall and $F_{\beta=1}$ rates of our error-driven HMM-based chunk tagger on the test data of the shared task are 91.53%, 92.02% and 91.77%.</Paragraph>
</Section>
<Section position="5" start_page="163" end_page="164" type="metho">
<SectionTitle>
4 Memory-based Learning
</SectionTitle>
<Paragraph position="0"> Memory-based learning has been widely used in NLP tasks in the last decade. In principle, it falls into two paradigms. The first paradigm represents examples as sets of features and carries out induction by finding the most similar cases. Such work includes Daelemans et al. (1996) for POS tagging and Cardie (1993) for syntactic and semantic tagging. The second paradigm makes use of raw sequential data and generalises by reconstructing test examples from different pieces of the training data. Such work includes Bod (1992) for parsing, Argamon et al. (1998) for shallow natural language patterns and Daelemans et al. (1999) for shallow parsing.</Paragraph>
<Paragraph position="1"> The memory-based method presented here follows the second paradigm and makes use of raw sequential data. Here, generalization is performed online at recognition time by comparing the new pattern to the ones in the training corpus.</Paragraph>
<Paragraph position="2"> Given one of the N most probable chunk sequences extracted by the error-driven HMM-based chunk tagger, we can extract a set of chunk patterns, each of the form $XP = p_0 r_0 p_1 r_1 \cdots r_n p_{n+1}$, where $r_i$ is the structural relation between $p_i$ and $p_{i+1}$. As an example, such chunk patterns can likewise be extracted from the bracketed and labelled training corpus. For every chunk pattern, we estimate its probability by memory-based learning. If the chunk pattern exists in the training corpus, its probability is computed as the relative frequency of that pattern among all the chunk patterns. Otherwise, its probability is estimated as the product of the probabilities of its overlapping sub-patterns.</Paragraph>
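A rough sketch of this memory-based pattern scoring is given below. It is illustrative only: the paper does not spell out how the overlapping sub-patterns are chosen for unseen patterns, so the half-and-half split, the probability floor, the count table, and the rescoring interface are all our assumptions.

```python
import math

# Illustrative sketch (not the authors' code) of the memory-based pattern
# probabilities of Section 4.  A chunk pattern is a tuple alternating POS tags
# and structural relations, e.g. ("IN", "90", "DT", "00", "NN", "09", ".").
# pattern_counts is assumed to map every pattern seen in training to its count.

def pattern_probability(pattern, pattern_counts, total):
    """Relative frequency if the pattern was seen in training; otherwise back
    off to the product of the probabilities of two overlapping halves.
    (The exact sub-pattern scheme is our assumption, not the paper's.)"""
    if pattern in pattern_counts:
        return pattern_counts[pattern] / total
    if len(pattern) <= 3:                                # too short to split further
        return 1.0 / total                               # crude floor, an assumption
    mid = len(pattern) // 2
    left, right = pattern[:mid + 1], pattern[mid:]       # halves overlap on one element
    return (pattern_probability(left, pattern_counts, total)
            * pattern_probability(right, pattern_counts, total))

def rescore(nbest, pattern_counts, total):
    """nbest: list of (log_prob, patterns) pairs, one per candidate chunk
    sequence, with its chunk patterns already extracted.  Each score is
    adjusted by the probabilities of its patterns, as described in the text."""
    rescored = []
    for log_prob, patterns in nbest:
        adjustment = sum(math.log(pattern_probability(p, pattern_counts, total))
                         for p in patterns)
        rescored.append((log_prob + adjustment, patterns))
    return max(rescored, key=lambda x: x[0])
```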
<Paragraph position="3"> Then the probability of each of the N most probable chunk sequences is adjusted by multiplying in the probabilities of its extracted chunk patterns. Table 1 shows the performance of the error-driven HMM-based chunk tagger with memory-based learning.</Paragraph>
</Section>
</Paper>