<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1049">
  <Title>Layout and Language: Integrating Spatial and Linguistic Knowledge for Layout Understanding Tasks</Title>
  <Section position="3" start_page="0" end_page="335" type="metho">
    <SectionTitle>
2 A New Method
</SectionTitle>
    <Paragraph position="0"> The methods based on spatial cohesion outlined above make assumptions about the application of layout to the textual content of the document in order to derive features indicating higher order structure. These assumptions rely on the realisation of layout as space and do not always hold (see, e.g., [...] ambiguities). However, there is another source of information which can be exploited to recover layout. Though layout imposes spatial effects, it has little or no effect on low level language phenomena within the distinct layout document objects: we do not expect the layout of text to render it ungrammatical (see footnote 2). Conversely, we do not expect grammaticality to persist in an incorrect interpretation of layout.</Paragraph>
    <Paragraph position="1"> For example, applying this observation to the segmentation of a double column of text will indicate the line breaks; see Figure 4: Double Columns (and footnote 3). Footnote 2: It is clear that layout does have very definite consequences for the content of textual document elements; however, those features we are concerned with here are below even this rudimentary level of language analysis.</Paragraph>
    <Paragraph position="2"> A low level language model can be applied to the interpretation of spatially distinct textual areas in many cases where a purely spatial algorithm may fail. The following is an incomplete list of possible cases of application (concrete examples may be found in Figure 4): Multi Column Text: When the columns are separated by only one space, a language model may be applied to determine if and where the blocks exist. These may be confused with False Space Positives, where, by chance, the text formatting introduces streams of white space within contiguous text.</Paragraph>
    <Paragraph position="3"> Apposed/Marginal Material: Text which is offset from the main body of text, similarly to multi column text, will contain its own line breaks.</Paragraph>
    <Paragraph position="4"> Unmarked Headers: Headers may be unmarked and appear similar to single line paragraphs.</Paragraph>
    <Paragraph position="5"> Double Spacing: The introduction of more than one line of spacing within contiguous text causes ambiguities with paragraph separation, headers and so on.</Paragraph>
    <Paragraph position="6"> Elliptical Lists: When text continues through a layout device, a language model may be used to detect it (see footnote 4). Short Paragraphs: When a paragraph is particularly short, the insertion of a line break may cause problems.</Paragraph>
    <Paragraph position="7"> Another example, and a useful application, is that to the problem of table segmentation. Once a table has been located using this method or other methods, the cells must be located.</Paragraph>
    <Section position="1" start_page="334" end_page="334" type="sub_section">
      <SectionTitle>
Multi-Column Cells A cell spans multiple
</SectionTitle>
      <Paragraph position="0"> columns. This may easily be confused with Multi-Row Cells, where a cell contains more than one line and must be grouped according to the line breaks.</Paragraph>
    </Section>
    <Section position="2" start_page="334" end_page="335" type="sub_section">
      <SectionTitle>
Elliptical Cell Contents Cells which form a
</SectionTitle>
      <Paragraph position="0"> disjunction of possible continuations to the content of another cell can be identified using a language model.</Paragraph>
      <Paragraph position="1"> Grid Quantization: When a plain text table contains cells which are not wholly aligned with other cells in the same grid row or column, it is difficult to associate the cells correctly.</Paragraph>
      <Paragraph position="2"> Footnote 3: In Figure 4: Double Columns, we know, through the application of a language model, that there is a line break after "paragraph", as "a paragraph of text" is more likely than "a paragraph Applying", and "Applying this of text" is ungrammatical.</Paragraph>
      <Paragraph position="3"> Footnote 4: This bears similarities with a simple list, but the language is that of the textual list, which uses functional words and punctuation to indicate disjunction.</Paragraph>
      <Paragraph position="4"> Languages which permit vertical and horizontal orthography (such as Japanese) pose additional problems when extracting layout features from plain text data.</Paragraph>
      <Paragraph position="5"> Orientation Detection: With mixed orientation, a language model may be used to distinguish vertical and horizontal text blocks (see footnote 5). We can hypothesise that spatially cohesive areas of the document are renderings of some underlying textual representation. If, at some level, the text is separated from the layout (the text is linearised by removing line breaks), then we may observe certain linguistic phenomena which are characteristic of the language. Reversing this allows us to identify the spatially cohesive objects in the document by discovering the transformation to the text (the application of layout, i.e. the insertion of spacing and line breaks) which preserves our observations about the language. One such observation is the ordering of words. Consequently, we can apply a language model to a line of text in a document to determine where line breaks have been inserted into the text for layout purposes by observing where the language model breaks down and where our simple notion of layout based on spatial features permits text block segmentation. This is an ideal. In fact, knowledge of layout and language is required to overcome the shortcomings of each.</Paragraph>
      <Paragraph position="6"> There are many types of language model which may be applied to the problem being considered, ranging from the analytical (which provide an indication of linguistic structure) to the classifying (which indicate if, and to what extent, the input fits the model). The analytical, such as a context free grammar, are not appropriate for this problem as they require a broad input and are not suited to the fragments of input envisioned for this application. The prime purpose of the language model we wish to use is to provide some ranking of candidate continuations of a particular set of one or more tokens. A simple example is the bigram model. This uses frequency counts of pairs of words derived from a corpus. Although there are advantages and disadvantages to this model, it will serve as an example, though other more sophisticated and reliable models may easily be applied.</Paragraph>
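      As an illustration of the ranking role described above, the following minimal Python sketch builds bigram frequency counts from a toy corpus and orders candidate continuations of a token. The corpus, the function names train_bigrams and rank_continuations, and the whitespace tokenisation are assumptions made for this example; they are not taken from the system described in this paper.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count adjacent word pairs over whitespace-tokenised sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def rank_continuations(counts, word, candidates):
    """Order candidate next words by bigram frequency, most frequent first."""
    scored = [(counts[(word.lower(), c.lower())], c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy corpus (hypothetical); a real model would be trained on far more text.
corpus = [
    "a paragraph of text is set in two columns",
    "each paragraph of text has line breaks",
    "applying this observation to a paragraph of text",
]
counts = train_bigrams(corpus)

# Which token more plausibly follows "paragraph": the word to the right in
# the neighbouring column ("Applying") or the word on the next line ("of")?
ranking = rank_continuations(counts, "paragraph", ["Applying", "of"])
```

      Raw counts are used here only for brevity; smoothed probabilities, or any other model that can rank continuations, could be substituted without changing the surrounding logic.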
      <Paragraph position="7"> Footnote 5: In Figure 4: Orientation Variation, the column of text on the left of the table is a vertically orientated label, whereas the remainder of the table is horizontally orientated. The apparent column on the right of the table is an artifact of the spacing and has no linguistic cohesion.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="335" end_page="336" type="metho">
    <SectionTitle>
3 Basic Algorithm
</SectionTitle>
    <Paragraph position="0"> The problem can be generally described in the following manner: given a set of objects distributed in a two dimensional grid, for each pair of objects, determine if they belong to the same cohesive set.</Paragraph>
    <Paragraph position="1"> The objects are tokens, or words, and the measure of cohesion is that one word follows from the other in accordance with the nature of the language, the content of the document, and the idiom of the particular document element within which they may be contained, and that the spatial model of the layout of the document permits cohesion. In summary, the cohesion is spatial and linguistic.</Paragraph>
    <Paragraph position="2"> However, such a general description is not computationally sensible, and the search space can be reduced if we consider only the cases where we expect ambiguities to occur. This can be approached by recognising that when there is the potential for ambiguity there is often present some artifact which may well help identify the domain of the ambiguity: these are generally the markers of spatial cohesion; e.g., where there are double columns, we may also identify left justification. Consequently, for a given word in the double column area, the ambiguity may be resolved by inspecting the word to the right, or the set of words which may be left justified with the line currently under inspection on the line below.</Paragraph>
    <Paragraph position="3"> Therefore, the application of the language model to the disambiguation problems mentioned above takes place between a small set of candidate continuation positions.</Paragraph>
    <Paragraph position="4"> These continuation points are located as prescribed by the markers of the spatial layout of text. Consequently, any algorithm using linguistic knowledge must exploit layout knowledge in order both to arrive at an economic solution and to be robust to weaknesses in the language model. The general method described here relies on and determines both spatial and linguistic information in a tightly integrated manner. The algorithm falls into the following broad steps:  1. detect potential for ambiguity.</Paragraph>
    <Paragraph position="5"> 2. compute the set of possible continuation points by using knowledge of spatial layout.</Paragraph>
    <Paragraph position="6"> 3. disambiguate using a combination of language and layout knowledge.</Paragraph>
    <Paragraph position="7">  For example, the words marked with a clear box in Figure 2, upper, are those which, according to a naive spatial algorithm, are possibly in close proximity to the right edge of a text block. Having detected them, the possible continuation points, shaded boxes, are computed (here for a single word for illustration). A language model may then be applied to determine the most likely continuation.</Paragraph>
    <Paragraph position="8"> Care must be taken when discovering equally likely continuations as opposed to a single most likely one. Figure 2, lower, contains two examples.</Paragraph>
    <Paragraph position="9"> The first illustrates the case when there is no appropriate continuation (there are three equally likely continuations; as none is the most likely, no continuation should be proposed). In the second example, a unique continuation is preferred. The general algorithm above provides annotation to the tokens in the document which may then be used to drive a text-block recognition algorithm.</Paragraph>
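    The distinction between a unique most likely continuation and several equally likely ones can be sketched as follows. This is a minimal Python illustration, not the implementation used in this work: the function name unique_best, the score dictionaries and the tolerance parameter are all assumptions, and what counts as "equally likely" depends on the language model in use.

```python
def unique_best(scores, tolerance=0.0):
    """Return the strictly best candidate, or None if the top scores tie.

    `scores` maps candidate tokens to language-model scores (hypothetical
    values). Two candidates are treated as equally likely when their scores
    differ by no more than `tolerance`, which is an assumed definition.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and tolerance >= ranked[0][1] - ranked[1][1]:
        return None  # equally likely continuations: propose nothing
    return ranked[0][0]


# Three equally likely continuations: no continuation is proposed.
tied = unique_best({"Revenue": 2, "Costs": 2, "Total": 2})

# One continuation clearly preferred: it is selected.
clear = unique_best({"of": 5, "Applying": 1})
```

    Returning None rather than an arbitrary winner keeps the tie visible to later stages, which may treat it as evidence of a block boundary.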
    <Paragraph position="10"> Detecting the Potential for Ambiguity: The potential for ambiguity occurs when a feature of the document is discovered which may indicate the immediate boundary of a text block. As we are dealing with the basic element of a token (or word), the potential for ambiguity may occur at the end of a word, or between any two words in a sequence on the line.</Paragraph>
    <Paragraph position="11"> However, we only need to consider those cases where a spatial algorithm may determine a block boundary (correctly or incorrectly). In order to do this we need a characterisation of a spatial algorithm in terms of the features it uses to determine text block boundaries. These are naturally related to space in the text, and so our algorithm will be concerned with the following three types of space: 1) Between words where there is a vertical river of white space which continues above and below according to some threshold; 2) Between words separated by more than a minimum amount of space; 3) At the right hand side of the document when no more tokens are found. These describe potential points for line break insertion into text and constitute a partial functional model of layout.</Paragraph>
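    The three types of space listed above can be sketched for plain-text input as follows. This is a simplified Python illustration under stated assumptions: the names tokens_with_spans, is_river and ambiguity_points, the parameters min_gap and reach, and the sample lines are hypothetical, and a river is approximated as a blank character at the same column on the neighbouring lines.

```python
import re


def tokens_with_spans(line):
    """Return (start, end, text) for each whitespace-delimited token."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", line)]


def is_river(lines, row, col, reach=1):
    """True if column `col` is blank on up to `reach` lines above and below."""
    for r in range(row - reach, row + reach + 1):
        if r == row or r not in range(len(lines)):
            continue  # skip the current line and lines outside the document
        if lines[r][col:col + 1].strip():
            return False
    return True


def ambiguity_points(lines, min_gap=2, reach=1):
    """List (row, token_index, kind) for tokens that may end a text block."""
    points = []
    for row, line in enumerate(lines):
        toks = tokens_with_spans(line)
        for i, (_, end, _) in enumerate(toks):
            if i == len(toks) - 1:
                points.append((row, i, "line_end"))   # type 3: no more tokens
                continue
            gap = toks[i + 1][0] - end
            if gap >= min_gap:
                points.append((row, i, "wide_gap"))   # type 2: large space
            elif is_river(lines, row, end, reach):
                points.append((row, i, "river"))      # type 1: vertical river
    return points


lines = [
    "alpha beta  gamma",
    "delta x eta theta",
]
points = ambiguity_points(lines)
```

    The detector deliberately over-generates: a chance alignment of single spaces is reported as a river, which is the kind of false positive the language model is later expected to reject.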
    <Paragraph position="12"> Computing the Set of Continuation Points: The set of continuation points is computed according to the assumptions used to determine if there is the potential for ambiguity. The continuation points from a point of potential ambiguity are: 1) The next word to the right; 2) The first word on the next line; 3) All the continuation points on the next line which are to the left of the current word. These represent the complement to the above functional model of layout. Thus we have a model of layout which is intentionally over general as it uses local features which are ambiguous.</Paragraph>
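    The three kinds of continuation point can be sketched in Python as below. The data structures are assumptions made for illustration: tokens holds, per line, the start column and text of each token, and block_starts records which tokens on each line directly follow a point of potential ambiguity. Reading "to the left of the current word" as a strict comparison of start columns is likewise an interpretation.

```python
def continuation_points(tokens, row, idx, block_starts):
    """Return candidate continuations, as (row, index), for one token.

    tokens[r]       list of (start_column, text) pairs for line r
    block_starts[r] set of token indices on line r that directly follow a
                    point of potential ambiguity (hypothetical input)
    """
    col = tokens[row][idx][0]
    found = []
    if len(tokens[row]) > idx + 1:
        found.append((row, idx + 1))        # 1) next word to the right
    nxt = row + 1
    if len(tokens) > nxt and tokens[nxt]:
        found.append((nxt, 0))              # 2) first word on the next line
        for j in sorted(block_starts.get(nxt, ())):
            # 3) possible block starts on the next line, left of this word
            if j != 0 and col > tokens[nxt][j][0]:
                found.append((nxt, j))
    return found


tokens = [
    [(0, "alpha"), (6, "beta"), (12, "gamma")],
    [(0, "delta"), (6, "x"), (8, "eta"), (12, "theta")],
]
block_starts = {0: {1, 2}, 1: {1, 3}}

after_beta = continuation_points(tokens, 0, 1, block_starts)
after_gamma = continuation_points(tokens, 0, 2, block_starts)
```

    Each candidate list is small, so the language model is consulted only a handful of times per point of potential ambiguity.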
    <Paragraph position="13"> Disambiguation: Disambiguation may be carried out in a number of ways depending on the extent required by the language model being employed. However, regardless of what range of history or lookahead is required by the language model, the process of disambiguation is not a simple matter of selecting the best possible continuation as proposed by the statistical or other elements of the language model.</Paragraph>
    <Paragraph position="14"> The interactions between layout and language require that a number of constraints be considered.</Paragraph>
    <Paragraph position="15"> These constraints model the ambiguities caused by the layout and the language.</Paragraph>
    <Paragraph position="16"> For any potential point of ambiguity, a single (or null) point of continuation must be found. And for any point of continuation, a single source of its history is required. If token A has potential continuation points X and Y, and token B has potential continuation points Y and Z, and the best continuation as predicted by the model for A is Y and that for B is also Y, then both A and B cannot be succeeded by their respective best continuations. The selection of continuation points must be based on the set of possible continuation points for the connected graph in which a potential point of ambiguity occurs (see Figure 3). An additional constraint imposed by the layout of the text is that links representing continuation cannot cross. This constraint is a feature of the interaction between the spatial layout and the linguistic model.</Paragraph>
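    The constraints above (at most one continuation per source, at most one source per continuation, and no crossing links) can be illustrated with a small exhaustive search over one connected graph. This Python sketch is only one possible reading: the function select_links, the scores, and the one-dimensional order used to test for crossing are assumptions, and the paper does not state how its selection step is implemented.

```python
from itertools import product


def select_links(candidates, score, order):
    """Choose at most one continuation per source, maximising total score.

    candidates: dict mapping each source to its possible targets
    score:      dict mapping (source, target) to a language-model score
    order:      dict mapping each token to a position used to test crossing
                (a one-dimensional simplification, assumed for this sketch)
    """
    sources = sorted(candidates, key=order.get)
    best, best_total = {}, 0.0
    options = [[None] + list(candidates[s]) for s in sources]
    for choice in product(*options):
        links = [(s, t) for s, t in zip(sources, choice) if t is not None]
        targets = [t for _, t in links]
        if len(set(targets)) != len(targets):
            continue  # a continuation point needs a single source
        positions = [order[t] for t in targets]
        if positions != sorted(positions):
            continue  # links representing continuation must not cross
        total = sum(score[link] for link in links)
        if total > best_total:
            best, best_total = dict(links), total
    return best


# A and B both prefer Y, so they cannot both take their best continuation.
candidates = {"A": ["X", "Y"], "B": ["Y", "Z"]}
score = {("A", "X"): 1, ("A", "Y"): 5, ("B", "Y"): 6, ("B", "Z"): 3}
order = {"A": 0, "B": 1, "X": 0, "Y": 1, "Z": 2}
chosen = select_links(candidates, score, order)
```

    Exhaustive enumeration is exponential in the number of sources, so it is only reasonable because each connected graph of candidate links is expected to be small.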
    <Section position="1" start_page="336" end_page="336" type="sub_section">
      <SectionTitle>
3.1 Extensions
</SectionTitle>
      <Paragraph position="0"> The above algorithm is not capable of capturing all types of continuation observed in the basic text blocks of certain document elements. Specifically, there is an implicit restriction on a unique continuation of the language through certain layout features. This may be called the one to one model of the interaction between layout and language. However, the less frequent, though equally important, cases of one to many and many to one interactions must also be considered. In Figure 4: Many to One, examples of both are given. Significantly, these cases exist at the boundaries between basic textual components of large document objects (here tables). It is suggested, then, that the detection of equally likely continuation points may be used to detect boundaries where there is little or no spatial separation (see footnote 6).</Paragraph>
    </Section>
    <Section position="2" start_page="336" end_page="336" type="sub_section">
      <SectionTitle>
3.2 Experimentation
</SectionTitle>
      <Paragraph position="0"> In order to test the basic ideas described in this paper, a simple system was implemented. A corpus of documents was collected from the SEC archive (www.sec.gov). These documents are rich in various document elements including tables, lists and headers. The documents are essentially flat, though there is some amount of header information encoded in XML as well as a minimal amount of markup in the document body.</Paragraph>
      <Paragraph position="1"> A simple bigram model of the language used was created. This was constructed partly from general texts (a corpus of English literature) of which it was assumed there was no complex content, and partly from the SEC documents (see footnote 7). A system was implemented which marked the potential points of ambiguity and the continuation points and then applied the cluster and selection algorithm to determine the presence of spatio-linguistically cohesive text blocks (see example output in Figure 1).</Paragraph>
      <Paragraph position="2"> Footnote 6: This begs a definition of equally likely, which would be dependent on the language model and implementation. Footnote 7: An important process in the creation of a language model for layout problems is the identification of usable language in the corpus. To these ends, the SEC documents were marked up by hand to identify paragraph text. These text blocks [...]</Paragraph>
      <Paragraph position="3"> As yet, no formal evaluation of the implementation is available. It can be asserted, however, that the results obtained from this preliminary implementation indicate that the general method produces significant results, and that the basic notion of combining spatial and linguistic information for the determination of cohesive elements in a complex document is a powerful one.</Paragraph>
      <Paragraph position="4"> Another experiment investigated the utility of the methods described in this paper. We wanted to determine how often ambiguities occurred and how important correct resolution was. Looking at the ambiguity in table stub cells (the ambiguity between multi-row cells and multiple cells below a header) yielded some significant results. For a sample of 28 tables (1704 cells), in the 131 stub cells we found 68 examples of multi-row cells, and 35 of headers to multiple cells (note that these are not disjoint sets). Using the SEC bigram model, the cases were disambiguated by hand, resulting in a 74% success rate. This simple investigation demonstrates that the disambiguation is required and that linguistic information can be applied successfully.</Paragraph>
    </Section>
  </Section>
</Paper>