<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1049">
  <Title>Layout and Language: Integrating Spatial and Linguistic Knowledge for Layout Understanding Tasks</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> There is currently a significant amount of work being carried out on applications which aim to deduce layout information from a spatial description of a document. The tasks vary in detail; however, they generally take as input a document description which presents areas of text (including titles, headings, paragraphs, lists and tables) marked implicitly by position. A simple example is a flat text document which uses white space to demonstrate alignment at the edges of textual blocks and blank lines to indicate vertical spatial cohesion and separation between blocks (see footnote 1). Rus and Summers (Rus and Summers, 1994) state that &quot;the non-textual content of documents [complement] the textual content and should play an equal role&quot;. This is clearly desirable: textual and spatial properties, as described in this paper, are inter-related and it is in fact highly beneficial to exploit the relationships which exist between them. Footnote 1: [...] lexical cohesion by Morris and Hirst (Morris and Hirst, 1991). Text which is cohesive is text which has a quality of unity (p. 21). Objects which have spatial cohesion have a quality of unity indicated by spatial features; in the words of Morris and Hirst, they &quot;stick together&quot;.</Paragraph>
    <Paragraph position="1"> In algorithmic terms, this implies implementing solutions which use both spatial and linguistic features to detect coherent textual objects in the raw text.</Paragraph>
    <Paragraph position="2"> Approaches to the problem are limited to those exploiting spatial cohesion. There are two techniques for achieving this. The first looks for features of space, identifying rivers of space which run around text blocks in some meaningful manner. The second looks at non-linguistic qualities of the text, including alignment of tokens between lines as well as certain types of global interactions (e.g. (Kieninger and Dengel, 1998)). Although this second type focuses on the characters rather than the spaces in the text, the features that it detects are implications of the spatial arrangement of the text: judging two words to be overlapping in the horizontal axis is not a feature of the words in terms of their content, but of their position. Elements of the above basic methods may be combined and, as with any feature vector type of analysis, machine learning algorithms may be applied (e.g. (Ng et al., 1999)).</Paragraph>
  </Section>
</Paper>