<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2204">
<Title>Transductive Pattern Learning for Information Extraction</Title>
<Section position="6" start_page="28" end_page="30" type="concl">
<SectionTitle>5 Discussion</SectionTitle>
<Paragraph position="0">We have described TPLEX, a semi-supervised algorithm for learning information extraction patterns. The key idea is to exploit the following recursive definition: good patterns are those that extract good fragments, and good fragments are those that are extracted by good patterns. This definition allows TPLEX to perform well with very little training data in domains where other approaches that assume fragment redundancy would fail. Conclusions. From our experiments we have observed that our algorithm is particularly competitive in scenarios where very little labelled training data is available. We contend that this is a result of our algorithm's ability to use the unlabelled test data to validate the patterns learned from the training data.</Paragraph>
<Paragraph position="1">We have also observed that the number of fields being extracted in a given domain affects the performance of our algorithm. TPLEX extracts all fields simultaneously and uses the scores from each of the patterns that extract a given position to determine the most likely field for that position. With more fields in the problem domain, there is potentially more information on each of the candidate positions to constrain these decisions.</Paragraph>
<!-- Figure: the ...nar dataset, trained on only labeled data and trained on labeled and unlabeled data -->
<Paragraph position="2">Future work. We are currently extending TPLEX in several directions. First, position filtering is currently performed as a distinct post-processing step. It would be more elegant (and perhaps more effective) to incorporate the filtering heuristics directly into the position scoring mechanism. Second, so far we have focused on a BWI-like pattern language, but we speculate that richer patterns permitting (for example) optional or re-ordered tokens may well deliver substantial increases in accuracy.</Paragraph>
<Paragraph position="3">We are also exploring ideas for semi-supervised learning from the machine learning community.</Paragraph>
<Paragraph position="4">Specifically, probabilistic finite-state methods such as hidden Markov models and conditional random fields have been shown to be competitive with more traditional pattern-based approaches to information extraction (Peng and McCallum, 2004), and these methods can exploit the Expectation-Maximization algorithm to learn from a mixture of labelled and unlabelled data (Lafferty et al., 2004). It remains to be seen whether this approach would be effective for information extraction. Another possibility is to explore semi-supervised extensions to boosting (d'Alché Buc et al., 2002). Boosting is a highly effective ensemble learning technique, and BWI uses boosting to tune the weights of the learned patterns; if we generalize boosting to handle unlabelled data, the learned weights may well be more effective than those calculated by TPLEX.</Paragraph>
<Paragraph position="5">Acknowledgements. This research was supported by grants SFI/01/F.1/C015 from Science Foundation Ireland, and N00014-03-1-0274 from the US Office of Naval Research.</Paragraph>
</Section>
</Paper>
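<!--
To make the recursive definition in the opening paragraph concrete, here is a minimal
HITS-style sketch of mutually recursive pattern/fragment scoring, in Python. Everything
in it (the function name, data structures, the simple averaging update, and the toy
inputs) is an illustrative assumption, not the paper's actual scoring, filtering, or
field-disambiguation machinery.

# Sketch: good patterns extract good fragments, and good fragments are
# extracted by good patterns. Assumes every pattern extracts at least one fragment.

def score_iteratively(extractions, seed_fragments, n_iters=50):
    """extractions: dict mapping each pattern to the set of text fragments
    it extracts; seed_fragments: fragments known correct from labelled data."""
    # Invert the mapping: fragment -> set of patterns that extract it.
    extracted_by = {}
    for pattern, fragments in extractions.items():
        for frag in fragments:
            extracted_by.setdefault(frag, set()).add(pattern)

    # Initialise fragment scores from the labelled seeds.
    frag_score = {f: (1.0 if f in seed_fragments else 0.0) for f in extracted_by}
    pat_score = {}

    for _ in range(n_iters):
        # A pattern is good to the extent that it extracts good fragments.
        for p, frags in extractions.items():
            pat_score[p] = sum(frag_score[f] for f in frags) / len(frags)
        # A fragment is good to the extent that good patterns extract it,
        # with labelled seeds pinned at 1.0.
        for f, pats in extracted_by.items():
            avg = sum(pat_score[p] for p in pats) / len(pats)
            frag_score[f] = 1.0 if f in seed_fragments else avg
    return pat_score, frag_score

# Tiny usage example with made-up patterns and fragments.
if __name__ == "__main__":
    extractions = {
        "p1": {"9am", "10am"},      # extracts mostly good fragments
        "p2": {"9am", "room 123"},  # mixed
    }
    pat, frag = score_iteratively(extractions, seed_fragments={"9am"})
    print(pat, frag)

This sketch illustrates the transductive aspect: unlabelled fragments acquire scores
purely through the patterns that extract them, which is how the unlabelled test data
can validate patterns learned from very little training data.
-->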