File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/97/w97-0308_intro.xml
Size: 4,701 bytes
Last Modified: 2025-10-06 14:06:21
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0308"> <Title>On aligning trees</Title> <Section position="3" start_page="0" end_page="75" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> We present here a general design for, and modular implementation of, an algorithm for computing areas of agreement between structurally annotated corpora. Roughly speaking, if two corpora bracket off the same stretches of words in their structural analysis of a text, the corpora agree that that stretch of text should be considered a single unit at some level of structure. We will (borrowing a usage from (Church and Gale, 1993)) term this agreement (sub)tree alignment.</Paragraph> <Paragraph position="1"> We make the following assumptions, which appear reasonable for markup schemes with which we are familiar: * the &quot;content&quot; of each text consists of a sequence of &quot;terminal&quot; elements. That is, the content is a collection of elements generally corresponding to words and punctuation and this will be roughly constant across the two corpora. It may also contain additional elements to represent, for example, the positing of orthographically null categories.</Paragraph> <Paragraph position="2"> * the two corpora whose trees are to be aligned contain identifiable structural markup. That is, structural &quot;delimiters&quot; are distinct from other forms of markup and content.</Paragraph> <Paragraph position="3"> * two corpora agree on an analysis when they bracket off the same content.</Paragraph> <Paragraph position="4"> * The corpora may contain additional markup provided this is distinct from content and structural markup.</Paragraph> <Paragraph position="5"> Our goal, then, is to determine those stretches of a text's content which two corpora agree on. Why might we want to do this? 
There are several reasons: * increase confidence in markup and determine areas of disagreement: If two or more corpora agree on parts of an analysis, one may &quot;trust&quot; that choice of grouping more than those groupings on which the corpora differ. Alignment can be used to detect disagreements between manual annotators.</Paragraph> <Paragraph position="6"> * verify preservation of analyses across multiple versions of a corpus: If all the subtrees of a corpus are aligned with those of another, then the second is consistent with the first, and represents analyses at least as detailed as those in the first. Such automatic checking will be useful both in the case of manual edits to a corpus, and also in the case where automatic analysis is performed.</Paragraph> <Paragraph position="7"> * import markup from one corpus to another: If one corpus contains &quot;richer&quot; information than another, for example in terms of annotation of syntactic function or of lexical category, the markup from the first may be interpreted with respect to analyses in the second.</Paragraph> <Paragraph position="8"> * determine constant markup transformations: Having identified aligned subtrees, the labels of a pair of trees may be recorded, and the results for the pair of corpora analysed to determine consistent differences in markup.</Paragraph> <Paragraph position="9"> * determine constant tree transformations: A set of pairings between aligned subtrees can be used as a bootstrap for semi-automatic markup of corpora.</Paragraph> <Paragraph position="10"> We can also identify some specific motivations and applications. First, in the automatic determination of subcategorization information, confidence in the choice of subcategorization may be improved by analyses from other corpora which confirm that subcategorization. Second, the algorithm we have developed is robust in the face of minor editorial differences, choice of markup for punctuation, and overall presentation of the corpora. 
We have processed the Susanne corpus (Sampson, 1995) and Penn treebank (Marcus et al., 1993) to provide tables of word and subtree alignments. Third, on the basis of the computed alignments between the two corpora, and the tree transformations they imply, it is now possible to produce, semi-automatically, versions of those parts of the Brown corpus covered by the Penn treebank but not by Susanne, in a Susanne-like format. Finally, in the development of phrasal parsers, our results can be used to obtain a measure of how contentious the analysis of different phrase types is.</Paragraph> <Paragraph position="11"> Obviously, the utility of algorithms such as the one we present here is dependent on the quality and reliability of markup in the corpora we process.</Paragraph> </Section> </Paper>
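<!-- Editorial sketch, not part of the original paper. The notion of agreement stated in the assumptions above (two corpora agree when they bracket off the same content) can be illustrated with a few lines of Python. Everything here is an assumption made for illustration: the tuple-based tree encoding, the names bracketed_spans and agreed_spans, and the simplification that both trees carry exactly the same terminal sequence. The algorithm described in the paper is stated to tolerate minor differences in the terminals, which this sketch does not attempt.

```python
from typing import List, Set, Tuple, Union

# Hypothetical tree encoding (an assumption of this sketch, not the paper's format):
# a leaf is a terminal string; an internal node is (label, [children]).
Tree = Union[str, Tuple[str, list]]
Span = Tuple[int, int]  # half-open [start, end) over terminal positions


def bracketed_spans(tree: Tree) -> Tuple[List[str], Set[Span]]:
    """Return the terminal sequence and the set of spans the tree brackets off."""
    terminals: List[str] = []
    spans: Set[Span] = set()

    def walk(node: Tree) -> None:
        if isinstance(node, str):
            terminals.append(node)  # a terminal occupies one position
            return
        _label, children = node
        start = len(terminals)
        for child in children:
            walk(child)
        spans.add((start, len(terminals)))  # labels are ignored, only extent counts

    walk(tree)
    return terminals, spans


def agreed_spans(tree_a: Tree, tree_b: Tree) -> List[Span]:
    """Spans bracketed off by both trees, under the identical-terminals assumption."""
    words_a, spans_a = bracketed_spans(tree_a)
    words_b, spans_b = bracketed_spans(tree_b)
    if words_a != words_b:
        raise ValueError("this sketch assumes identical terminal sequences")
    return sorted(spans_a & spans_b)


# Two invented analyses of the same six words, differing in PP attachment.
EXAMPLE_A: Tree = ("S", [("NP", ["the", "cat"]),
                         ("VP", ["sat", ("PP", ["on", ("NP", ["the", "mat"])])])])
EXAMPLE_B: Tree = ("S", [("NP", ["the", "cat"]),
                         ("VP", ["sat"]),
                         ("PP", ["on", ("NP", ["the", "mat"])])])
```

Calling agreed_spans(EXAMPLE_A, EXAMPLE_B) reports the extents on which the two invented analyses coincide; the two VP brackets cover different stretches of words and so are not reported. Comparing extents only, and ignoring labels, mirrors the assumption that agreement is defined by bracketing the same content. -->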