<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1025">
  <Title>Mining Tables from Large Scale HTML Texts</Title>
  <Section position="2" start_page="0" end_page="167" type="metho">
    <SectionTitle>
1 Tables in HTML
</SectionTitle>
    <Paragraph position="0"> An HTML table begins with an optional caption followed by one or more rows. Each row is formed by one or more cells, which are classified into header and data cells. Cells can be merged across rows and columns. The following tags are used:  (1) &lt;table ...&gt; &lt;/table&gt; (2) &lt;tr ...&gt; &lt;/tr&gt; (3) &lt;td ...&gt; &lt;/td&gt; (4) &lt;th ...&gt; &lt;/th&gt; (5) &lt;caption ...&gt; &lt;/caption&gt;  They denote the main wrapper, table row, table data, table header, and caption of a table, respectively. Table 1 shows an example that lists the prices for a tour.  [Table 1: tour price example. Legible entries: Tour Code DP9LAX01AB; the period 1999.04.01-2000.03.31; price columns for Economic Class and Extension; rows for Adult (Single Room, Double Room, Extra Bed) and Child (Occupation, Extra Bed, No Occupation).]  The interpretation of this table in terms of attribute-value relationships is shown as follows:  A cell may play the role of attribute and/or value. Several cells may be concatenated to denote an attribute. For example, &quot;Adult-Price-Single Room-Economic Class&quot; means the adult price for economic class and single room. The relationships may be read column-wise or row-wise depending on the interpretation. For example, the relationship for &quot;Tour Code: DP9LAX01AB&quot; is row-wise, whereas the prices for &quot;Economic Class&quot; are column-wise. The table wrapper (&lt;table&gt; ... &lt;/table&gt;) is a useful cue for table recognition. The HTML text for the above example is shown as follows. The table tags are enclosed by a table wrapper.  
However, a table does not always exist when a table wrapper appears in HTML text. This is because writers often employ table tags to represent forms or menus, which allow users to input queries or make selections.</Paragraph>
    <Paragraph position="1"> Another point that should be mentioned is that table designers usually employ COLSPAN (ROWSPAN) to specify how many columns (rows) a table cell should span. In this example, the COLSPAN of the cell &quot;Tour Code&quot; is 3, which means that &quot;Tour Code&quot; spans 3 columns.</Paragraph>
    <Paragraph position="2">  Similarly, the ROWSPAN of the cell &quot;Adult&quot; is 3, so this cell spans 3 rows. COLSPAN and ROWSPAN give users the flexibility to design many kinds of tables, but they make automatic table interpretation more challenging.</Paragraph>
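    <Paragraph>A minimal sketch, in Python, of how the tags and span attributes described in this section could be collected. It assumes the standard-library html.parser module; the class name TableTagCollector and the dictionary layout used for cells are illustrative choices and not part of the original system, and nested tables are not handled.
<![CDATA[
```python
from html.parser import HTMLParser


class TableTagCollector(HTMLParser):
    """Collect the rows and cells of every table wrapper in an HTML text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table wrapper
        self._cell = None  # cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = {
                "header": tag == "th",
                "colspan": int(attrs.get("colspan") or 1),
                "rowspan": int(attrs.get("rowspan") or 1),
                "text": "",
            }
            self.tables[-1][-1].append(self._cell)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell["text"] = self._cell["text"].strip()
            self._cell = None

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell["text"] += data


collector = TableTagCollector()
collector.feed("<table><tr><th colspan=3>Tour Code</th></tr></table>")
print(collector.tables[0][0][0]["colspan"])  # 3
```
]]>
Each collected cell keeps its COLSPAN and ROWSPAN values, so later stages can decide how to treat spanning cells.</Paragraph>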
  </Section>
  <Section position="3" start_page="167" end_page="167" type="metho">
    <SectionTitle>
2 Flow of Table Mining
</SectionTitle>
    <Paragraph position="0"> The flow of table mining is shown in Figure 1. It is composed of five modules.</Paragraph>
    <Paragraph position="1"> The hypertext processing module analyses HTML text and extracts the table tags. The table filtering module filters out impossible cases by heuristic rules. The remaining candidates are sent to the table recognition module for further analysis.</Paragraph>
    <Paragraph position="2"> The table interpretation module differentiates the roles of cells in the tables. The final module tackles how to present and employ the mining results. The first two modules are discussed in the following paragraph, and the last three modules are dealt with in detail in the following sections.</Paragraph>
    <Paragraph position="3">  As specified above, table wrappers do not always introduce tables. Two filtering rules are employed to disambiguate their functions:  (1) A table must contain at least two cells to represent an attribute and a value. In other words, a structure with only one cell is filtered out. (2) If the content enclosed by table wrappers contains too many hyperlinks, forms, and figures, then it is not regarded as a table. To evaluate the performance of table mining, we prepared test data selected from the airline information in the travelling category of the Chinese Yahoo web site (http://www.yahoo.com.tw). Table 2 shows the statistics of our test data.</Paragraph>
    <Paragraph position="4">  The rows of Table 2 list the number of web pages, the total number of table wrappers, and the total number of tables, respectively. On average, there are 2.35 table wrappers and 0.67 tables per web page. The statistics show that table tags are used quite often in HTML text, but only 28.53% of them are actual tables. Table 3 shows the results after we apply the filtering rules to the test data. The 5th row shows how many non-table candidates are filtered out by the proposed rules, and the 6th row shows the number of wrong filters. On average, the correct rate is 98.93%. In total, 423 of the 2,300 non-tables remain.</Paragraph>
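    <Paragraph>The two filtering rules of this section can be sketched as follows, assuming Python. The source only says "too many" hyperlinks, forms, and figures, so the per-cell ratio and its default value of 0.5 are assumptions made for illustration, as are the names Candidate and passes_filter.
<![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Counts gathered for the content enclosed by one table wrapper."""
    cells: int        # <td> and <th> cells
    links: int = 0    # hyperlinks
    forms: int = 0    # form controls
    figures: int = 0  # images


def passes_filter(c: Candidate, max_noise_ratio: float = 0.5) -> bool:
    """Return True when the candidate survives both filtering rules."""
    # Rule 1: a table needs at least two cells (attribute and value).
    if c.cells < 2:
        return False
    # Rule 2: reject candidates dominated by links, forms, and figures.
    # The ratio test and its threshold are assumed, not taken from the source.
    noise = c.links + c.forms + c.figures
    return noise / c.cells <= max_noise_ratio


print(passes_filter(Candidate(cells=1)))           # False (rule 1)
print(passes_filter(Candidate(cells=6, links=1)))  # True
print(passes_filter(Candidate(cells=4, links=4)))  # False (rule 2)
```
]]>
Candidates that pass this filter would still need the content-based recognition step, since the rules only remove the obvious non-tables.</Paragraph>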
  </Section>
  <Section position="4" start_page="167" end_page="169" type="metho">
    <SectionTitle>
3 Table Recognition
</SectionTitle>
    <Paragraph position="0"> After the simple analyses specified in the previous section, there are still 423 non-tables that pass the filtering criteria. Now we consider the content of the cells. A cell is much shorter than a sentence in plain text. In our study, the length of 43,591 cells (of 61,770 cells) is smaller than 10 characters (see footnote 2). Because of the space limitation in a table, writers often use shorthand notations to describe their intention.</Paragraph>
    <Paragraph position="1"> Footnote 2: A Chinese character is represented by two bytes. That is, a cell contains 5 Chinese characters on average.</Paragraph>
    <Paragraph position="2">  For example, they may use a Chinese character (&quot;到&quot;, dao4) to represent the two-character word &quot;到達&quot; (dao4da2, arrive), and a character (&quot;離&quot;, li2) to denote the Chinese word &quot;離開&quot; (li2kai1, leave). They even employ special symbols to represent &quot;increase&quot; and &quot;decrease&quot;. Thus it is hard to determine whether a fragment of HTML text is a table from a single cell alone.</Paragraph>
    <Paragraph position="3"> The context among cells is important.</Paragraph>
    <Paragraph position="4"> Value cells under the same attribute names demonstrate similar concepts. We employ the following metrics to measure the cell similarity.</Paragraph>
    <Paragraph position="5">  (1) String similarity  We measure how many characters are common in neighboring cells. If the number is above a threshold, we say the two cells are similar.</Paragraph>
    <Paragraph position="6"> (2) Named entity similarity  This metric considers the semantics of cells.</Paragraph>
    <Paragraph position="7"> We adopt some named entity expressions defined in MUC (1998), such as date/time expressions and monetary and percentage expressions.</Paragraph>
    <Paragraph position="8"> A rule-based method similar to that of Chen, Ding, and Tsai (1998) is employed to tell whether a cell is a specific named entity. Neighboring cells belonging to the same named entity category are similar.</Paragraph>
    <Paragraph position="9"> (3) Number category similarity  Number characters (0-9) appear very often. If the total number of number characters in a cell exceeds a threshold, we say the cell belongs to the number category.</Paragraph>
    <Paragraph position="10"> Neighboring cells in the number category are similar.</Paragraph>
    <Paragraph position="11"> We count how many neighboring cells are similar. If the percentage is above a threshold, the table tags are interpreted as a table. The data after table filtering (Section 2) is used to evaluate the strategies in table recognition. Tables 4-6 show the experimental results when the three metrics are applied incrementally. Precision rate (P), recall rate (R), and F-measure (F), defined below, are adopted to measure the performance.</Paragraph>
    <Paragraph position="13"> Table 4 shows that string similarity cannot capture the similar concept between neighboring cells very well. The F-measure is 55.50%.</Paragraph>
    <Paragraph position="14"> Table 5 tries to incorporate more semantic features, i.e., categories of named entity.</Paragraph>
    <Paragraph position="15"> Unfortunately, the result does not meet our expectation. The performance only increases a little. The major reason is that the keywords (pm/am, $, %, etc.) for date/time expressions and monetary and percentage expressions are usually omitted in table descriptions. Table 6 shows that the F-measure reaches 86.50% when the number category is used. Compared with Tables 4 and 5, the performance is improved drastically.</Paragraph>
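    <Paragraph>A minimal sketch of the recognition step of this section, assuming Python. It implements only the string and number category metrics (the named entity metric is omitted). The thresholds, the choice of right-hand and lower cells as neighbors, and all function names are assumptions for illustration, since the source does not give these details.
<![CDATA[
```python
def string_similar(a: str, b: str, threshold: int = 1) -> bool:
    """Metric 1: the two cells share more than `threshold` characters."""
    common = sum(min(a.count(ch), b.count(ch)) for ch in set(a))
    return common > threshold


def in_number_category(cell: str, threshold: int = 1) -> bool:
    """Metric 3: the cell holds more than `threshold` digits (0-9)."""
    return sum(ch in "0123456789" for ch in cell) > threshold


def cells_similar(a: str, b: str) -> bool:
    """Two cells are similar if either implemented metric says so."""
    if in_number_category(a) and in_number_category(b):
        return True
    return string_similar(a, b)


def looks_like_table(grid, threshold: float = 0.5) -> bool:
    """Interpret the tags as a table when enough neighbor pairs are similar."""
    pairs = similar = 0
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            # Right-hand neighbor in the same row.
            if c + 1 < len(row):
                pairs += 1
                similar += cells_similar(cell, row[c + 1])
            # Neighbor directly below, when that row is long enough.
            if r + 1 < len(grid) and c < len(grid[r + 1]):
                pairs += 1
                similar += cells_similar(cell, grid[r + 1][c])
    return pairs > 0 and similar / pairs > threshold


print(looks_like_table([["35,450", "2510"], ["22,900", "360"]]))   # True
print(looks_like_table([["Home", "About"], ["Login", "Search"]]))  # False
```
]]>
The second call shows a menu-like layout whose cells share few characters and no digits, so it is rejected under these assumed thresholds.</Paragraph>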
  </Section>
  <Section position="5" start_page="169" end_page="170" type="metho">
    <SectionTitle>
4 Table Interpretation
</SectionTitle>
    <Paragraph position="0"> As specified in Section 1, the attribute-value relationship may be interpreted column-wise or row-wise. If the table tags in question do not contain COLSPAN (ROWSPAN), the problem is easier. The first row and/or the first column consist of the attribute cells, and the others are value cells. Cell similarity guides us in how to read a table.</Paragraph>
    <Paragraph position="1"> We define row (or column) similarity in terms of cell similarity as follows. Two rows (or columns) are similar if most of the corresponding cells between these two rows (or columns) are similar.</Paragraph>
    <Paragraph position="2"> A basic table interpretation algorithm is shown below. Assume there are n rows and m columns. Let c_ij denote the cell in the i-th row and j-th column.</Paragraph>
    <Paragraph position="3">  (1) If there is only one row or column, then the problem is trivial. We just read it row-wise or column-wise.</Paragraph>
    <Paragraph position="4"> (2) Otherwise, we start the similarity checking from the right-bottom position, i.e., c_nm. That is, the n-th row and the m-th column are regarded as the bases for comparison.</Paragraph>
    <Paragraph position="5"> (3) For each row i (1 &lt;= i &lt; n), compute the similarity of the two rows i and n.</Paragraph>
    <Paragraph position="6"> (4) Count how many pairs of rows are similar.</Paragraph>
    <Paragraph position="7"> (5) If the count is larger than (n-2)/2, and the similarity of row 1 and row n is smaller than the similarity of the other row pairs, then we say this table can be read column-wise. In other words, the first row contains attribute cells.</Paragraph>
    <Paragraph position="8"> (6) The row-wise interpretation is done in a similar way. We start checking from the m-th column, compare it with each column j (1 &lt;= j &lt; m), and count how many pairs of columns are similar.</Paragraph>
    <Paragraph position="9"> (7) If neither &quot;row-wise&quot; nor &quot;column-wise&quot; can be assigned, then the default is set to &quot;row-wise&quot;. Table 6 is an example. The first column contains attribute cells. The other cells are statistics of an experimental result. We read it row-wise. If COLSPAN (ROWSPAN) is used, the table interpretation is more difficult. Table 1 is a typical example. Five COLSPANs and two ROWSPANs are used to create a better layout. The attributes are formed hierarchically. The following is an example of the hierarchy.</Paragraph>
    <Paragraph position="10"> Adult - Price - (Double Room, Single Room, Extra Bed)  Here, we extend the above algorithm to deal with table interpretation with COLSPAN (ROWSPAN). First, we drop COLSPAN and ROWSPAN by duplicating copies of the spanning cells in their proper positions. For example, COLSPAN=3 for &quot;Tour Code&quot; in Table 1, so we duplicate &quot;Tour Code&quot; at columns 2 and 3. Table 7 shows the final reformulation of the example in Table 1. Then we employ the above algorithm with a slight modification to find the reading direction.</Paragraph>
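    <Paragraph>The duplication step described above can be sketched as follows, assuming Python. Each cell is given as a (text, colspan, rowspan) tuple; this input format, the function name expand_spans, and the placeholder values v1 to v6 in the example are assumptions for illustration rather than the data of Table 1.
<![CDATA[
```python
def expand_spans(rows):
    """Return a grid where each spanning cell is copied into every slot it covers."""
    grid = {}  # (row, column) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip slots already filled by a ROWSPAN cell from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Slots that no cell covers are left as empty strings.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


rows = [
    [("Tour Code", 3, 1), ("DP9LAX01AB", 2, 1)],
    [("Adult", 1, 3), ("Price", 1, 3), ("Single Room", 1, 1), ("v1", 1, 1), ("v2", 1, 1)],
    [("Double Room", 1, 1), ("v3", 1, 1), ("v4", 1, 1)],
    [("Extra Bed", 1, 1), ("v5", 1, 1), ("v6", 1, 1)],
]
grid = expand_spans(rows)
print(grid[0])  # ['Tour Code', 'Tour Code', 'Tour Code', 'DP9LAX01AB', 'DP9LAX01AB']
print(grid[3])  # ['Adult', 'Price', 'Extra Bed', 'v5', 'v6']
```
]]>
After expansion every row has the same number of columns, so rows and columns can be compared position by position.</Paragraph>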
    <Paragraph position="11"> The modification to the basic algorithm is that spanning cells act as boundaries for similarity checking. Take Table 7 as an example. We start the similarity checking from the right-bottom cell, i.e., 360, and consider each row and column within the boundaries. The cell &quot;1999.04.01-2000.03.31&quot; is a spanning cell, so the 2nd row is a boundary. &quot;Price&quot; is a spanning cell, thus the 2nd column is a boundary. In this case, we can interpret the table tags both row-wise and column-wise.  After that, a second cycle begins. The starting points are moved to new right-bottom positions, i.e., (3, 5) and (9, 3). In this cycle, the boundaries are reset. The cells &quot;DP9LAX01AB&quot; and &quot;Adult&quot; (&quot;Child&quot;) are spanning cells, so the 1st row and 1st column are the new boundaries. This time, &quot;row-wise&quot; is selected.</Paragraph>
    <Paragraph position="12"> In the final cycle, the starting positions are (2, 5) and (9, 2). The boundaries are the 0th row and 0th column. These two sub-tables are read row-wise.</Paragraph>
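    <Paragraph>The basic reading-direction procedure of this section (steps 1 to 7, without the boundary handling for spanning cells) can be sketched as follows, assuming Python. Several points are assumptions for illustration: row similarity is taken as the fraction of similar corresponding cells, "most" is read as more than half, a single row is read row-wise and a single column column-wise, the column-wise test takes precedence when both tests pass, and the default cell similarity only checks for digits.
<![CDATA[
```python
def has_digit(cell):
    return any(ch.isdigit() for ch in cell)


def both_numeric(a, b):
    """Assumed stand-in for cell similarity: both cells contain a digit."""
    return has_digit(a) and has_digit(b)


def line_similarity(xs, ys, similar):
    """Fraction of corresponding cells that are similar."""
    return sum(similar(x, y) for x, y in zip(xs, ys)) / len(xs)


def attribute_line_first(lines, similar):
    """Steps 2 to 5 for a list of rows (or columns); the last line is the base."""
    n = len(lines)
    sims = [line_similarity(line, lines[-1], similar) for line in lines[:-1]]
    count = sum(s > 0.5 for s in sims)  # "most" cells similar
    # The first line must be less similar to the base than every other line
    # (vacuously true when there are no other lines).
    return count > (n - 2) / 2 and all(sims[0] < s for s in sims[1:])


def reading_direction(grid, similar=both_numeric):
    n, m = len(grid), len(grid[0])
    if n == 1:                       # step 1, single row
        return "row-wise"
    if m == 1:                       # step 1, single column
        return "column-wise"
    if attribute_line_first(grid, similar):     # steps 2 to 5
        return "column-wise"
    columns = [list(col) for col in zip(*grid)]
    if attribute_line_first(columns, similar):  # step 6
        return "row-wise"
    return "row-wise"                # step 7, default


by_row = [["attr A", "10", "20"], ["attr B", "30", "40"], ["attr C", "50", "60"]]
by_col = [["Year", "Price"], ["1999", "100"], ["2000", "200"], ["2001", "300"]]
print(reading_direction(by_row))  # row-wise
print(reading_direction(by_col))  # column-wise
```
]]>
In the first example the first column holds the attribute cells, so the table is read row-wise; in the second the first row holds them, so it is read column-wise. Both example grids are hypothetical.</Paragraph>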
  </Section>
</Paper>