<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1084">
  <Title>Integrated Text and Image Understanding for Document Understanding</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2. GENERAL IDUS SYSTEM DESCRIPTION
</SectionTitle>
    <Paragraph position="0">IDUS employs four general technologies - image understanding, OCR, document layout analysis and text understanding - in a knowledge-based cooperative fashion [1, 2]. The current implementation is on a SPARCstation™ II with the UNIX™ operating system using the 'C' and Prolog programming languages. OCR is performed with the Xerox Imaging Systems ScanWorX™ Application Programmer's Interface toolkit. All features are accessible via an X-Windows™/Motif™ user interface.</Paragraph>
    <Paragraph position="1">After scanning the document page(s), IDUS works page-by-page, performing image-based segmentation to initially locate the primitive regions of text and nontext which are manipulated during the logical and functional analysis of the document. Each text unit's content is internally homogeneous in physical attributes such as font size and spacing style.</Paragraph>
    <Paragraph position="2">The ASCII text associated with each block is found through OCR, and a set of features based both on text attributes (e.g., number of text lines, font size and type) and geometric attributes (e.g., location on page, size of block) is used to refine the segmentation and organize the blocks into proper logical groupings, i.e., &quot;articles&quot;. The ASCII text for each &quot;article&quot; is assembled in a proper reading order. During this process the column structure of the document is determined, and noise and nontext blocks are eliminated. A text processing component performs a linguistic analysis to extract key ideas from each article and then represent them by a semantic component, the case frame. Each &quot;article&quot; text is saved as part of the document corpus and may be retrieved through a query interface.</Paragraph>
  </Section>
</Paper>