<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0308"> <Title>On aligning trees</Title> <Section position="4" start_page="75" end_page="76" type="metho"> <SectionTitle> 2 The Task </SectionTitle> <Paragraph position="0"> In this section, we provide a general characterization of agreement in analysis between two corpora.</Paragraph> <Paragraph position="1"> We assume the existence of two corpora, C^l and C^r (for left and right). The content of each corpus is a sequence of elements drawn from a collection of terminal elements, markers for the left and right structural delimiters (LSD and RSD, respectively) and possibly other markup irrelevant to the content of the text or its structural analysis. Occurrences of structural delimiters are taken to be properly nested. We assume only that the terminal elements of some corpus can</Paragraph> <Paragraph position="2"> be determined, and not that the definition of terminal element corresponds to some notion of, say, word. A consequence of this is that markers in a corpus for empty elements may be retained, and operated on, even if such markers are additional to the original text, and represent part of a hypothesis as to the text's linguistic organization.</Paragraph> <Paragraph position="3"> The following sequences can then be computed from each corpus: W^{l,r}, the terminal elements; and S^{l,r}, the terminal elements and structural delimiters. So S is the corpus retaining structural annotation, and W is a &quot;text only&quot; version of the corpus. As each of these is a sequence, we can pick out elements of each by an index; that is, W^l_n will pick out the nth terminal element of the left corpus.</Paragraph> <Paragraph position="4"> The following definitions allow us to refer to structural units (subtrees) within the two corpora. (We omit the superscript indicating which corpus we are dealing with.) Numbering subtrees We number the subtrees in each corpus as follows. 
If Si is the ith occurrence of LSD in S and Sj is the matching RSD of Si, then the extent of subtree (i) of S is the sequence Si ... Sj. The terminal yield of a subtree is then its extent less any occurrences of LSD and RSD. This can be conveniently represented as the stretch of terminal elements included within a pair of structural delimiters, i.e.</Paragraph> <Paragraph position="6"> where Wk is the first element in the extent of t and Wt the last. We'll refer to a subtree's number as its index. Let Subtrees(C) be the set of yields in C.</Paragraph> <Paragraph position="7"> Two corollaries The following result will be useful later on: for two subtrees from a corpus, if t < t' then either t' is a subtree of t or there is no dominance relation between t and t'.</Paragraph> <Paragraph position="8"> Likewise, we claim that, if a subtree is greater than unary branching, then it is uniquely identified by its yield. To see this, suppose that there are two distinct subtrees, t and t', such that yield(t) = yield(t') = (i,j). Then, no terminal element intervenes between Wi and t's LSD, or between Wj and t's RSD, and the same condition holds of t'. It must therefore follow that t is a subtree of t' or vice versa and that they are connected by a series of only unary branching trees.</Paragraph> <Paragraph position="9"> Alignment of terminal elements We want to compute the minimal set of differences between W^l and W^r, i.e. 
a monotone, bijective partial function defined as follows (footnote 2): Let δ be the largest subset of i × j for 0 < i < length(W^l) and 0 < j < length(W^r) such that δ is monotone and bijective, and</Paragraph> <Paragraph position="11"> In other words, δ records exact matches between the left and right corpora, or mismatches involving only a single element, with exact matches to either side.</Paragraph> <Paragraph position="12"> This allows minor editorial differences and choice of markup for terminal elements to have no effect on the overall alignment.</Paragraph> <Paragraph position="13"> Aligned subtrees We now offer the following definition. Two trees in C^l and C^r are aligned if they share the same yield (under the image of δ), i.e.:</Paragraph> <Paragraph position="15"> Two subtrees are strictly aligned if the above conditions hold and neither tree is a unary branch.</Paragraph> <Paragraph position="16"> (This definition will be extended shortly.) We saw above that, if a tree is not unary branching then its yield is unique.</Paragraph> <Paragraph position="17"> Unary branching In the case of unary branching, the inverse of yield will not be a function. In other words, two subtrees have the same yield. The situation is straightforward if both corpora share the same number of unary trees for some yield: we can pair off subtrees in increasing order of index. (Recall that, under dominance, a higher subtree index indicates domination by a lower index.) 
In this case we will say that the unary trees in question are also strictly aligned.</Paragraph> <Paragraph position="18"> If the two corpora differ on the number of unary branches relating two nodes, there is no principled way of pairing off nodes without exploiting more detailed, and probably corpus- or markup-specific, information about the contents of the corpora.</Paragraph> <Paragraph position="19"> Linking to original corpus For each of the corpora we assume we can define two functions. One, terminal location, will give the location in the original corpus of a terminal element (e.g. a function</Paragraph> <Paragraph position="20"> from terminal indices to, say, byte offsets in a file), and the other, tree location, will give the location in the original corpus of a subtree (in terms, say, of byte offsets of the left and right delimiters). Tree locations will therefore include any additional information within the corpus stored between the left and right delimiters. [Footnote 2, attached to the definition of the alignment function: the function will be unique (although perhaps empty).]</Paragraph> <Paragraph position="21"> Output of the procedure The following information may be output from this procedure in the form of tables * of subtree indices indicating strict alignment of two trees * of pairs of sequences of subtree indices indicating potential alignment * of pairs of terminal element indices (i.e. 
the function δ) and * of single terminal element mismatches, for later processing to detect consistent differences in markup.</Paragraph> <Paragraph position="22"> * of the results of applying the functions terminal location and tree location to the relevant information above.</Paragraph> <Paragraph position="23"> This output can be thought of as a form of &quot;stand off&quot; annotation, from which other forms of information about the corpora can be derived.</Paragraph> </Section> <Section position="5" start_page="76" end_page="77" type="metho"> <SectionTitle> 3 A portable implementation </SectionTitle> <Paragraph position="0"> In this section we describe an implementation of the above procedure that abstracts away from details of the markup used in any particular corpus. The overall shape of the implementation is shown in Figure 1. The program described here is implemented in Perl.</Paragraph> <Paragraph position="1"> Normalization We can abstract away from details of the markup used in a particular corpus by providing the following externally defined functions. annotation removal and transformation As our procedure works only in terms of terminal elements and structural annotation, all other information may be removed from a corpus before processing. We also take this opportunity to transform the LSD and RSD used in the corpus into tokens used by the core processor (that is, { and } respectively). 
We may also choose at this point to normalize other aspects of markup known to consistently differ between the two corpora.</Paragraph> <Paragraph position="2"> terminal and tree locations Similarly, separate programs may be invoked to provide tables of byte offsets of terminals and start- and end-points of trees.</Paragraph> <Paragraph position="3"> With these functions in place, we proceed to the description of the core algorithm.</Paragraph> <Paragraph position="4"> Computing minimal differences We use the program diff and interpret its output to compute the function δ. Specifically, we use the Free Software Foundation's gdiff with the options --minimal, --ignore-case and --ignore-all-space, to guarantee optimal matches of terminals and to allow for editorial decisions that result in differences in capitalization. Subtree indexing and alignment detection We use the following for representation of subtrees and the time-efficient detection of aligned trees. Trees in the right corpus (which we can think of as the target) are represented as elements in a hash table, whose keys are computed from the terminal indices of the start and end of each tree's yield. Each element in the hash table is a set of numbers, to allow for the hashing of multiple unary trees to the same cell in the table.</Paragraph> <Paragraph position="5"> In processing the subtrees for the left corpus, we can simply check whether there is an element in the hash table for the terminal indices of the yield of the tree in the left corpus under the image of the function δ.</Paragraph> </Section> <Section position="6" start_page="77" end_page="78" type="metho"> <SectionTitle> 4 An example </SectionTitle> <Paragraph position="0"> In this section we give a brief example to illustrate the operation of the algorithm. 
The start of the Susanne corpus is shown in the table here: the [O[S[Nns:s.</Paragraph> <Paragraph position="1"> while the corresponding part of the Penn treebank looks as follows.</Paragraph> <Paragraph position="3"> The process of numbering the terminal elements and computing the set of minimal differences will give rise to a normalized form of the two corpora something like the following, where the two leftmost columns come from Susanne, the others from Penn.</Paragraph> <Paragraph position="4"> (The numbers here have been altered slightly for the purposes of exposition.) Susanne word | position | Penn word | position; the | 2 | the | 1; Fulton | 3 | Fulton | 2; County | 4 | County | 3; Grand | 5 | Grand | 4; Jury | 6 | Jury | 5. Note that the function δ will in this case map 2 to 1, 3 to 2 and so on. Note that the whole of this sequence of words is bracketed off in both corpora. Accordingly, we will record the existence of a tree spanning 1 to 5 in the treebank. The alignment of the corresponding tree from Susanne will be detected by noting that δ(2) = 1 and δ(6) = 5.</Paragraph> </Section> </Paper>