<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1061"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics Segment-based Hidden Markov Models for Information Extraction</Title> <Section position="7" start_page="486" end_page="487" type="evalu"> <SectionTitle> 7 Conclusions and future work </SectionTitle> <Paragraph position="0"> In current HMM-based IE systems, an HMM is used to model text at the document level, which causes some redundancy in the extraction. We propose a segment-based HMM IE modelling method to achieve near-zero-redundancy extraction. In our segment HMM IE approach, a segment retrieval step is applied first, so that the HMM extractor identifies fillers from a smaller set of extraction-relevant segments. The resulting segment HMM IE system, using the segment retrieval method, has not only achieved nearly zero extraction redundancy but also improved the overall extraction performance. The effect of segment-based HMM extraction goes beyond that of applying a post-processing step to document-based HMM extraction, since post-processing can only reduce the redundancy, not improve the F1 scores.</Paragraph> <Paragraph position="1"> For template-filling style IE problems, it is more reasonable to perform extraction by HMM state labelling on segments than on the entire document. As the observation sequence to be labelled becomes longer, finding the single best state sequence for it becomes more difficult, since changing a small part of a very long state sequence has a less obvious effect on the state path probability than changing the same subsequence in a much shorter state sequence.
In fact, this perspective applies not only to HMM IE modelling but also to any IE modelling in which extraction is performed by sequential state labelling.</Paragraph> <Paragraph position="2"> We are working on extending this segment-based framework to other Markovian sequence models used for IE.</Paragraph> <Paragraph position="3"> Segment retrieval for extraction is an important step in segment HMM IE, since it filters out irrelevant segments from the document. The HMM for extraction is meant to model extraction-relevant segments, so irrelevant segments that are fed to the second step would make the extraction more difficult by adding noise to the competition among relevant segments. We have presented and evaluated our segment retrieval method. Document-wise retrieval performance gives more insight into how well a particular segment retrieval method suits our purpose: the document-wise retrieval recall under the least correctly filtered measure provides an upper bound on the final extraction performance.</Paragraph> <Paragraph position="4"> Our current segment retrieval method requires the training documents to be segmented in advance. Although sentence segmentation is a relatively easy task in NLP, some segmentation errors are still unavoidable, especially for ungrammatical online texts. For example, an improper segmentation could place a segment boundary in the middle of a filler, which would certainly affect the final extraction performance of the segment HMM IE system. In the future, we intend to design segment retrieval methods that do not require documents to be segmented before retrieval, thereby avoiding early-stage errors introduced by the text segmentation step. A very promising idea is to adapt a naive Bayes IE to perform redundant extractions directly on an entire document to retrieve filler-containing text segments for a segment</Paragraph> </Section> </Paper>