File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/97/w97-0308_evalu.xml
Size: 3,103 bytes
Last Modified: 2025-10-06 14:00:29
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0308"> <Title>On aligning trees</Title> <Section position="7" start_page="78" end_page="78" type="evalu"> <SectionTitle> 5 Results of processing on two corpora </SectionTitle> <Paragraph position="0"> We have processed the entire Susanne corpus and the corresponding parts of the Penn treebank, and produced tables of alignments for each pair of marked-up texts. The inputs to this process were a Susanne file and the corresponding &quot;combined&quot; file from the treebank (i.e. the version that includes part-of-speech information). Recall that the treebank marks up the relationship between pre-terminal and terminal as a unary tree, and that Susanne does not; as a result, the treebank regularly contains more trees than Susanne.</Paragraph> <Paragraph position="1"> First, a definition: a tree is maximal if it is not part of another tree within a corpus. We ignore maximal trees of depth one in both corpora, as these correspond to indications of textual units rather than sentence-internal structural markup. Each maximal tree in the treebank that contains a tree of depth greater than one may also contain sentence punctuation, which is treated within the structural markup. Since such punctuation is typically treated as external to the structural annotation in Susanne, trees that contain a sentence together with its sentence punctuation cannot be targets for alignment across the two corpora. We can therefore take the number of maximal trees of depth greater than one within Susanne as an indication of the number of trees within the treebank which are unalignable as a consequence of decisions about markup. This figure comes to 2431.</Paragraph> <Paragraph position="2"> With those considerations, we report the following findings: * There are 156584 terminal elements in Susanne, of which 145583 (93%) have a corresponding element identified in the treebank. 
The corresponding figure for the treebank is 86% (of 169782 terminal elements in the treebank).</Paragraph> <Paragraph position="3"> * There are 110484 trees in Susanne (including 1952 maximal trees of depth one), leaving a total of 108532 potentially alignable trees. Of these, 76011 (70%) are aligned with trees in the treebank. * There are 301086 trees in the treebank, of which we can eliminate 169782 as trees indicating preterminals (including 122174 that contain just a textual delimiter) and an estimated further 2431 as trees that include sentence punctuation. This leaves a total of 128873 trees in the treebank that could possibly be aligned with those in Susanne, of which 59% are in fact aligned.</Paragraph> <Paragraph position="4"> The figures above bear out the impression that trees in the Penn treebank are more highly articulated than those in Susanne, even leaving aside the additional structure induced by the treatment of punctuation and preterminals in the treebank.</Paragraph> <Paragraph position="5"> Computing the output above takes approximately fifty minutes on an unloaded Sun SparcStation 20.</Paragraph> </Section> </Paper>
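<!-- Editorial sketch, not part of the original paper. The arithmetic behind the findings reported in Section 5 can be rechecked with a few lines of Python. All constant and function names below are hypothetical and chosen only for illustration. Two assumptions are made that the text does not state outright: that the 86% treebank figure uses the same 145583 matched terminals as its numerator, and that the 59% treebank figure uses the same 76011 aligned trees as its numerator.

```python
# Counts copied from the findings reported in Section 5.
SUSANNE_TERMINALS = 156584
MATCHED_TERMINALS = 145583
TREEBANK_TERMINALS = 169782

SUSANNE_TREES = 110484
SUSANNE_MAXIMAL_DEPTH_ONE = 1952
ALIGNED_TREES = 76011

TREEBANK_TREES = 301086
TREEBANK_PRETERMINAL_TREES = 169782
TREEBANK_PUNCTUATION_TREES = 2431  # the paper's estimate


def percent(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Trees that remain as alignment candidates after the exclusions
# described in the text.
susanne_candidates = SUSANNE_TREES - SUSANNE_MAXIMAL_DEPTH_ONE
treebank_candidates = (
    TREEBANK_TREES - TREEBANK_PRETERMINAL_TREES - TREEBANK_PUNCTUATION_TREES
)

# Coverage figures. The two treebank-side numerators are assumptions:
# the text gives only the percentages for that side.
susanne_terminal_pct = percent(MATCHED_TERMINALS, SUSANNE_TERMINALS)
treebank_terminal_pct = percent(MATCHED_TERMINALS, TREEBANK_TERMINALS)
susanne_tree_pct = percent(ALIGNED_TREES, susanne_candidates)
treebank_tree_pct = percent(ALIGNED_TREES, treebank_candidates)

if __name__ == "__main__":
    print("Susanne candidate trees:", susanne_candidates)
    print("Treebank candidate trees:", treebank_candidates)
    print("Terminal coverage (Susanne, treebank):",
          susanne_terminal_pct, treebank_terminal_pct)
    print("Tree coverage (Susanne, treebank):",
          susanne_tree_pct, treebank_tree_pct)
```

Under these assumptions the recomputed totals and percentages agree with the figures reported in the section, which suggests the two assumed numerators are at least consistent with the text. -->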