File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/06/w06-1671_concl.xml
Size: 2,744 bytes
Last Modified: 2025-10-06 13:55:41
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1671"> <Title>Learning Field Compatibilities to Extract Database Records from Unstructured Text</Title> <Section position="7" start_page="609" end_page="609" type="concl"> <SectionTitle> 5 Conclusions and Future Work </SectionTitle> <Paragraph position="0"> We have investigated graph partitioning methods for discovering database records from fields annotated in text. We have proposed a cluster compatibility function that measures how likely it is that two sets of fields belong to the same cluster. We argue that this enhancement to existing techniques provides more representational power.</Paragraph> <Paragraph position="1"> We have evaluated these methods on a set of hand-annotated data and concluded that (1) graph partitioning techniques are more accurate than performing transitive closure, and (2) cluster compatibility methods can avoid common mistakes made by pairwise compatibility methods.</Paragraph> <Paragraph position="2"> As information extraction systems become more reliable, it will become increasingly important to develop accurate ways of associating disparate fields into cohesive records. This will enable more complex reasoning over text.</Paragraph> <Paragraph position="3"> One shortcoming of this approach is that fields are not allowed to belong to multiple records, because the partitioning algorithm returns non-overlapping clusters. Exploring overlapping clustering techniques is an area of future work.</Paragraph> <Paragraph position="4"> Another avenue of future research is to consider syntactic information in the compatibility function. While performance on contact record extraction is highly influenced by formatting features, many fields occur within sentences, and syntactic information (such as dependency trees or phrase-structure trees) may improve performance.</Paragraph> <Paragraph position="5"> Overall performance can also be improved by increasing the sophistication of the partitioning method. 
For example, we can examine &quot;block moves&quot; to swap multiple fields between clusters in unison, possibly avoiding local minima of the greedy method (Kanani et al., 2006). This can be especially helpful because many mistakes may be made at the start of clustering, before clusters are large enough to reflect true records.</Paragraph> <Paragraph position="6"> Additionally, many personal web pages contain a timeline of information that describes a person's educational and professional history. Learning to associate time information with each contact record enables career path modeling, which presents interesting opportunities for knowledge discovery techniques, a subject of ongoing work.</Paragraph> </Section> </Paper>