<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0308">
  <Title>On aligning trees</Title>
  <Section position="8" start_page="78" end_page="79" type="concl">
    <SectionTitle>
6 Conclusions and Limitations
</SectionTitle>
    <Paragraph position="0"> We have seen above a formal characterization and implementation of an algorithm for determining the extent of agreement between two corpora. The core algorithm itself and its output formats are completely independent of the markup used for the different corpora. The alignments computed for the Susanne corpus and the corresponding portion of the Penn treebank have been presented and discussed.</Paragraph>
    <Paragraph position="1"> Having computed the alignment of trees across corpora, one option is to compute (either explicitly or in some form of stand-off annotation) a corpus combining the information from both sources, thereby allowing the use of the distinctions made by each corpus at once.</Paragraph>
    <Paragraph position="2"> There are many future experiments of obvious interest, particularly those to do with examining potential factors in cases of agreement or disagreement: * Analysis of consistency of annotation by markup label. Certain phrase types may be more consistently annotated than others, so that we can be more confident in our analyses of such phrases.</Paragraph>
    <Paragraph position="3"> * Analysis of consistency of annotation by depth in tree. From the above discussion we can see that alignment of maximal trees approximates 100%, while that for terminals approximates 90%.</Paragraph>
    <Paragraph position="4"> Therefore (and unsurprisingly) the bulk of disagreement lies somewhere in between. Is that disagreement evenly distributed, or are there factors to do with the complexity of analysis at play? These proposals have to do essentially with formal aspects of markup. Other, perhaps more interesting, questions touch on the linguistic content of analyses, and on whether, for example, particular linguistic phenomena are associated with divergence between the corpora.</Paragraph>
    <Paragraph position="5"> The assumption that trees within corpora are strictly nested represents an obvious limitation on the scope of the algorithm. In cases where markup is more complex, other strategies will have to be developed for detecting agreement between corpora. That said, the class of markup to which the algorithm presented here is applicable is very large, including, perhaps most importantly, normalized forms of SGML (Goldfarb, 1990), for example that proposed by Thompson and McKelvie (1996).</Paragraph>
  </Section>
</Paper>