<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1061">
  <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. Segment-based Hidden Markov Models for Information Extraction</Title>
  <Section position="3" start_page="0" end_page="481" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> A Hidden Markov Model (HMM) is a finite state automaton with stochastic state transitions and symbol emissions (Rabiner, 1989). The automaton models a random process that produces a sequence of symbols by starting from some state and moving from one state to another, emitting a symbol at each state, until a final state is reached. Formally, an HMM is specified by a five-tuple (S, K, P, A, B), where S is the set of states; K is the alphabet of observation symbols; P is the initial state distribution; A is the probability distribution of state transitions; and B is the probability distribution of symbol emissions. Once the structure of an HMM is determined, the complete model parameters can be represented as λ = (A, B, P).</Paragraph>
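
As an illustration of the five-tuple above, the following is a minimal Python sketch, not taken from the paper. The class name HMM, the method joint_probability, and the two-state toy model with its probabilities are all assumptions made for this example. The sketch stores P, A and B as dictionaries and multiplies the initial, transition and emission probabilities along a given state path.

```python
from dataclasses import dataclass


@dataclass
class HMM:
    """Hypothetical container for the five-tuple (S, K, P, A, B)."""
    states: list      # S: the set of states
    alphabet: list    # K: the observation symbols
    initial: dict     # P: state -> initial probability
    transition: dict  # A: (state, next_state) -> probability
    emission: dict    # B: (state, symbol) -> probability

    def joint_probability(self, path, symbols):
        """Probability of following `path` while emitting `symbols`."""
        if len(path) != len(symbols) or not path:
            raise ValueError("path and symbols must be non-empty and equally long")
        # Start in path[0] and emit the first symbol there.
        prob = self.initial.get(path[0], 0.0)
        prob *= self.emission.get((path[0], symbols[0]), 0.0)
        # Each later step is one transition followed by one emission.
        for prev, cur, sym in zip(path, path[1:], symbols[1:]):
            step = self.transition.get((prev, cur), 0.0)
            step *= self.emission.get((cur, sym), 0.0)
            prob *= step
        return prob


# Toy model with made-up numbers, for illustration only.
toy = HMM(
    states=["bg", "target"],
    alphabet=["w", "x"],
    initial={"bg": 1.0, "target": 0.0},
    transition={("bg", "bg"): 0.5, ("bg", "target"): 0.5,
                ("target", "bg"): 0.5, ("target", "target"): 0.5},
    emission={("bg", "w"): 0.5, ("bg", "x"): 0.5,
              ("target", "w"): 0.25, ("target", "x"): 0.75},
)
print(toy.joint_probability(["bg", "target"], ["w", "x"]))
```

Missing dictionary entries are treated as probability zero, which keeps the toy tables short; a real system would more likely use dense arrays and smoothed estimates.
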
    <Paragraph position="1"> HMMs are particularly useful for modelling sequential data. They have been applied in several areas of natural language processing (NLP), with speech recognition among the most successful. HMMs have also been applied to information extraction (IE). An early example is (Leek, 1997), in which HMMs are trained to extract gene name-location facts from a collection of scientific abstracts. Another related work is (Bikel et al., 1997), which used HMMs as part of its modelling for the name-finding problem in information extraction.</Paragraph>
    <Paragraph position="2"> A more recent work on applying HMMs to IE is (Freitag and McCallum, 1999), in which a separate HMM is built for extracting fillers for each slot. To train an HMM for extracting fillers for a specific slot, maximum likelihood estimation is used to determine the probabilities (i.e., the initial state probabilities, the state transition probabilities, and the symbol emission probabilities) associated with each HMM from labelled texts.</Paragraph>
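
The maximum likelihood estimation described above amounts to taking relative frequencies over the labelled texts. The following Python sketch shows one way this could look; it is an assumption-laden illustration, not the procedure of (Freitag and McCallum, 1999). The function names estimate_hmm and _normalise_pairs, the (token, state) input format, and the sample documents are hypothetical, and no smoothing is applied.

```python
from collections import Counter


def _normalise_pairs(counts):
    """Turn (context, outcome) counts into conditional probabilities."""
    totals = Counter()
    for (context, _), n in counts.items():
        totals[context] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}


def estimate_hmm(labelled_docs):
    """Unsmoothed maximum likelihood estimates from labelled documents.

    `labelled_docs` is a list of documents, each a list of
    (token, state) pairs; this input format is an assumption.
    Returns the (initial, transition, emission) probability tables.
    """
    init_counts, trans_counts, emit_counts = Counter(), Counter(), Counter()
    for doc in labelled_docs:
        if not doc:
            continue
        init_counts[doc[0][1]] += 1          # state of the first token
        for token, state in doc:
            emit_counts[(state, token)] += 1
        for (_, prev), (_, cur) in zip(doc, doc[1:]):
            trans_counts[(prev, cur)] += 1
    total = sum(init_counts.values())
    initial = {state: n / total for state, n in init_counts.items()}
    return initial, _normalise_pairs(trans_counts), _normalise_pairs(emit_counts)


# Two tiny hand-labelled documents, invented for illustration.
docs = [
    [("talk", "bg"), ("by", "bg"), ("smith", "target")],
    [("smith", "target"), ("speaks", "bg")],
]
initial, transition, emission = estimate_hmm(docs)
print(initial, transition, emission)
```

Events never seen in training receive no entry at all, i.e. probability zero, which is why practical systems combine such counts with a smoothing method.
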
    <Paragraph position="3"> One characteristic of current HMM-based IE systems is that a single HMM models the entire document. Each document is viewed as one long sequence of tokens (i.e., words, punctuation marks, etc.), which is the observation generated by the given HMM. Extraction is performed by finding the best state sequence for this long observed token sequence, and the subsequences of tokens that pass through the target filler state are extracted as fillers. We call such document-level approaches to applying HMMs to IE document-based HMM IE, or document HMM IE for brevity.</Paragraph>
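
The document-level extraction just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system evaluated in this paper: the best state sequence is found with a standard log-space Viterbi search, and the function names viterbi and extract_fillers, the state labels "bg" and "target", and all probabilities in the toy model are hypothetical.

```python
import math


def _log(p):
    """Logarithm that maps probability zero to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(tokens, states, initial, transition, emission):
    """Most probable state sequence for `tokens` (log-space Viterbi)."""
    if not tokens:
        return []
    scores = [{s: _log(initial.get(s, 0.0))
               + _log(emission.get((s, tokens[0]), 0.0)) for s in states}]
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for token in tokens[1:]:
        prev_scores = scores[-1]
        cur_scores, cur_back = {}, {}
        for s in states:
            best_prev = max(
                states,
                key=lambda r: prev_scores[r] + _log(transition.get((r, s), 0.0)))
            cur_scores[s] = (prev_scores[best_prev]
                             + _log(transition.get((best_prev, s), 0.0))
                             + _log(emission.get((s, token), 0.0)))
            cur_back[s] = best_prev
        scores.append(cur_scores)
        back.append(cur_back)
    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def extract_fillers(tokens, path, target_state):
    """Collect maximal runs of tokens whose state is `target_state`."""
    fillers, current = [], []
    for token, state in zip(tokens, path):
        if state == target_state:
            current.append(token)
        elif current:
            fillers.append(current)
            current = []
    if current:
        fillers.append(current)
    return fillers


# Toy model with invented probabilities, for illustration only.
states = ["bg", "target"]
initial = {"bg": 1.0}
transition = {("bg", "bg"): 0.6, ("bg", "target"): 0.4,
              ("target", "target"): 0.5, ("target", "bg"): 0.5}
emission = {("bg", "talk"): 0.4, ("bg", "by"): 0.4, ("bg", "today"): 0.2,
            ("target", "john"): 0.5, ("target", "smith"): 0.5}
tokens = ["talk", "by", "john", "smith", "today"]
path = viterbi(tokens, states, initial, transition, emission)
print(extract_fillers(tokens, path, "target"))
```

Working in log space avoids numerical underflow when the token sequence spans a whole document, which is the setting assumed by document HMM IE.
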
    <Paragraph position="4"> In addition to HMMs, other Markovian sequence models have been applied to IE. Examples include maximum entropy Markov models (McCallum et al., 2000), the Bayesian information extraction network (Peshkin and Pfeffer, 2003), and conditional random fields (McCallum, 2003; Peng and McCallum, 2004). In IE systems using these models, extraction is performed by sequential tag labelling. As in HMM IE, each document is treated as a single stream of tokens in these models as well.</Paragraph>
    <Paragraph position="5"> In this paper, we introduce the concept of extraction redundancy, and show that current document HMM IE systems often produce undesired redundant extractions. In order to address this extraction redundancy issue, we propose a segment-based two-step extraction approach in which a segment retrieval step is imposed before the extraction step. Our experimental results show that the resulting segment-based HMM IE system not only achieves near-zero extraction redundancy but also improves the overall extraction performance.</Paragraph>
    <Paragraph position="6"> This paper is organized as follows. In Section 2, we describe our document HMM IE system, in which Simple Good-Turing (SGT) smoothing is applied for probability estimation. We also evaluate our document HMM IE system and compare it to related work. In Section 3, we point out the extraction redundancy issue in a document HMM IE system, and introduce a definition of extraction redundancy for better evaluation of IE systems that may produce redundant extractions. To address this issue, we propose our segment-based HMM IE method in Section 4, in which a segment retrieval step is applied before extraction is performed. Section 5 presents a segment retrieval algorithm that uses HMMs to model and retrieve segments. We compare the performance of the segment HMM IE system and the document HMM IE system in Section 6. Finally, Section 7 presents conclusions and future work.</Paragraph>
  </Section>
</Paper>