File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/w04-1011_intro.xml
Size: 2,215 bytes
Last Modified: 2025-10-06 14:02:34
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1011"> <Title>Handling Figures in Document Summarization</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 The Current State of the Diagrams Field </SectionTitle> <Paragraph position="0"> Automated text summarization has at its disposal electronic documents that allow the use of all the techniques of computational linguistics. Diagrams in documents are in a more primitive state.</Paragraph> <Paragraph position="1"> The overwhelming majority of diagrams available in the electronic forms of the research literature today are in raster format. What is needed are diagrams in vector format, in which an object such as a line is represented not by pixels but as a line object defined by its endpoints and width.</Paragraph> <Paragraph position="2"> We found only 52 pages containing vector-based diagrams in a collection of 20,000 recent Biology papers in PDF format (Futrelle, Shao, Cieslik, & Grimes, 2003). Vectorization converts raster diagrams to vector format, much as OCR converts rasters to characters. But the resulting vectorized diagram is an unordered collection of objects in a two-dimensional space. An additional analysis step, parsing, is required. Our system for parsing diagrams (Futrelle & Nikolakis, 1995) produces descriptions for a data graph, for example, by discovering structures such as scale lines and sets of data points.</Paragraph> <Paragraph position="3"> There appear to be no non-proprietary vectorization systems that are up to the task of vectorizing the diagrams from the scientific literature, so our group is currently focused on developing a system for this in Java. We are also redeveloping our parsing system in Java. Until this work is completed, there will be few diagrams available for the application of diagram summarization techniques. 
This notwithstanding, diagram summarization is an interesting and ultimately important task, which is why we are discussing it here. This work is part of our laboratory's long-term effort to characterize the conceptual content of the Biology literature, including the text and figural content.</Paragraph> </Section> </Paper>