<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1201"> <Title>Classification of semantic relations by humans and machines. Erwin Marsi and Emiel Krahmer, Communication and Cognition</Title> <Section position="5" start_page="1" end_page="3" type="metho"> <SectionTitle> 3 Alignment </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1" end_page="3" type="sub_section"> <SectionTitle> 3.1 Interannotator agreement </SectionTitle>
<Paragraph position="0"> Interannotator agreement was calculated in terms of precision, recall and F-score (with β = 1) on aligned node pairs as follows:

\[ \mathit{precision} = \frac{|A_{pred} \cap A_{real}|}{|A_{pred}|}, \qquad \mathit{recall} = \frac{|A_{pred} \cap A_{real}|}{|A_{real}|}, \qquad F = \frac{2 \cdot \mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}} \]

where A_real is the set of all real alignments (the reference or gold standard), A_pred is the set of all predicted alignments, and A_pred ∩ A_real is the set of all correctly predicted alignments. For the purpose of calculating interannotator agreement, one of the annotations (A1) was considered the 'real' alignment, the other (A2) the 'predicted'. The results are summarized in Table 1 in column (A1, A2). (Footnote 1: Note that since there are no classes, we cannot calculate chance agreement like the Kappa statistic.) As explained in section 2.3, both annotators revised their initial annotations. This improved their agreement, as shown in column (A1′, A2′). In addition, they agreed on a single consensus annotation (Ac). The last two columns of Table 1 show the results of evaluating each of the revised annotations against this consensus annotation. The F-score of .98 can therefore be regarded as the upper bound on the alignment task.</Paragraph>
<Paragraph position="1"> [Table 1: Interannotator agreement in terms of precision, recall and F-score on alignment between annotators 1 and 2 before (A1, A2) and after (A1′, A2′) revision, and between the consensus and annotator 1 (Ac, A1′) and annotator 2 (Ac, A2′) respectively.]</Paragraph> </Section>
<Section position="2" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.2 Automatic alignment </SectionTitle>
<Paragraph position="0"> Our tree alignment algorithm is based on the dynamic programming algorithm in (Meyers et al., 1996), and similar to that used in (Barzilay, 2003). It calculates the match between each node in dependency tree D and each node in dependency tree D′. The score for each pair of nodes depends only on the similarity of the words associated with the nodes and, recursively, on the scores of the best matching pairs of their descendants. The node similarity function relies either on identity of the lemmas or on synonym, hyperonym, and hyponym relations between them, as retrieved from EuroWordNet.</Paragraph>
<Paragraph position="1"> Automatic alignment was evaluated with the consensus alignment of the first chapter as the gold standard. A baseline was constructed by aligning those nodes which stand in an equals relation to each other, i.e., a node v in D is aligned to a node v′ in D′ iff STR(v) = STR(v′). This baseline already achieves a relatively high score (an F-score of .56), which may be attributed to the nature of our material: the translated sentence pairs are relatively close to each other and may show a sizeable amount of literal string overlap. In order to test the contribution of synonym and hyperonym information for node matching, performance was measured with and without the use of EuroWordNet. The results for automatic alignment are shown in Table 2.</Paragraph>
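To make the procedure above concrete, here is a minimal Python sketch of the two components just described: a recursive node-matching score in the spirit of the dynamic programming algorithm of (Meyers et al., 1996), and the precision/recall/F-score evaluation over sets of aligned node pairs from section 3.1. All names (Node, node_similarity, tree_match) and the similarity weights are illustrative assumptions, not the authors' implementation, and the child pairing below is greedy where the original algorithm solves that subproblem optimally.

    # Illustrative sketch only: names and weights are assumptions, not the
    # authors' implementation.

    class Node:
        def __init__(self, lemma, children=()):
            self.lemma = lemma
            self.children = list(children)

    def node_similarity(v, w, wordnet_pairs=frozenset()):
        # Word-level similarity: identity of the lemmas, or a synonym/
        # hyperonym/hyponym relation looked up in a lexical resource such
        # as EuroWordNet (stubbed here as a set of lemma pairs).
        if v.lemma == w.lemma:
            return 1.0
        if (v.lemma, w.lemma) in wordnet_pairs:
            return 0.5  # illustrative weight for a WordNet-related pair
        return 0.0

    def tree_match(v, w, wordnet_pairs=frozenset(), memo=None):
        # Score for matching node v in D against node w in D': similarity
        # of the words plus, recursively, the scores of the best matching
        # pairs of their children. The pairing here is greedy; the original
        # algorithm solves this subproblem optimally by dynamic programming.
        memo = {} if memo is None else memo
        key = (id(v), id(w))
        if key not in memo:
            score = node_similarity(v, w, wordnet_pairs)
            pairs = sorted(((tree_match(c, d, wordnet_pairs, memo), c, d)
                            for c in v.children for d in w.children),
                           key=lambda t: t[0], reverse=True)
            used_c, used_d = set(), set()
            for s, c, d in pairs:
                if s > 0 and id(c) not in used_c and id(d) not in used_d:
                    score += s
                    used_c.add(id(c))
                    used_d.add(id(d))
            memo[key] = score
        return memo[key]

    def evaluate(pred, real):
        # Precision, recall and F-score (beta = 1) over sets of aligned
        # node pairs, following the formulas in section 3.1.
        correct = len(pred & real)
        p = correct / len(pred) if pred else 0.0
        r = correct / len(real) if real else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Tiny usage example with invented data:
    d = Node("boy", [Node("little")])
    dp = Node("lad", [Node("small")])
    related = frozenset({("boy", "lad"), ("little", "small")})
    print(tree_match(d, dp, related))                        # 1.0 (0.5 + 0.5)
    print(evaluate({("boy", "lad")},
                   {("boy", "lad"), ("little", "small")}))   # precision 1.0, recall 0.5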
In comparison with the baseline, the alignment algorithm without the use of EuroWordNet loses a few points on precision, but improves substantially on recall (a 200% increase), which in turn leads to a substantial improvement in the overall F-score. The use of EuroWordNet leads to a small increase (two points) in both precision and recall, and thus to a small increase in F-score. However, in comparison with the gold standard human score for this task (.95), there is clearly room for further improvement.</Paragraph> </Section> </Section>
<Section position="6" start_page="3" end_page="5" type="metho"> <SectionTitle> 4 Classification of semantic relations </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 4.1 Interannotator agreement </SectionTitle>
<Paragraph position="0"> In addition to alignment, the annotation procedure for the first chapter of The little prince by two annotators (cf. section 2.3) also involved labeling the semantic relation between aligned nodes. Interannotator agreement on this task is shown in Table 3, before and after revision. The measures are weighted precision, recall and F-score; for instance, the precision is the weighted sum of the separate precision scores for each of the five relations. The table also shows the κ score. The F-score of .97 can be regarded as the upper bound on the relation labeling task. We think these numbers indicate that the classification of semantic relations is a well-defined task which can be accomplished with a high level of interannotator agreement.</Paragraph>
<Paragraph position="1"> [Table 3: Interannotator agreement on semantic relation labeling between annotators 1 and 2 before (A1, A2) and after (A1′, A2′) revision, and between the consensus and annotator 1 (Ac, A1′) and annotator 2 (Ac, A2′) respectively.]</Paragraph> </Section>
<Section position="2" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 4.2 Automatic classification </SectionTitle>
<Paragraph position="0"> For the purpose of automatic semantic relation labeling, we approach the task as a classification problem to be solved by machine learning. Alignments between node pairs are classified on the basis of the lexical-semantic relation between the nodes, their corresponding strings, and - recursively - on previous decisions about the semantic relations of daughter nodes. The input features used are the following (illustrated in the sketch below):

* a boolean feature representing string identity between the strings corresponding to the nodes;
* a boolean feature for each of the five semantic relations, indicating whether the relation holds for at least one of the daughter nodes;
* a boolean feature indicating whether at least one of the daughter nodes is not aligned;
* a categorical feature representing the lexical-semantic relation between the nodes (i.e. the lemmas and their part-of-speech) as found in EuroWordNet, which can be synonym, hyperonym, or hyponym.²

To allow for the use of previous decisions, the nodes of the dependency analyses are traversed in a bottom-up fashion. Whenever a node is aligned, the classifier assigns a semantic label to the alignment. Taking previous decisions into account may cause a proliferation of errors: wrong classification of daughter nodes may in turn cause wrong classification of the mother node. To investigate this risk, classification experiments were run both with previous decisions and without them (i.e. using the gold annotation instead).</Paragraph>
<Paragraph position="1"> [Table 4: scores (with SD) over all 5 folds on automatic classification of semantic relations.]</Paragraph>
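As an illustration of the feature set and the bottom-up traversal just described, the following Python sketch builds the input features for one aligned node pair and walks a tree bottom-up. All names (string_of, daughter_labels, ewn_relation, the minimal Node class) are hypothetical stand-ins, not the authors' code.

    RELATIONS = ["equals", "restates", "specifies", "generalizes", "intersects"]

    class Node:  # same minimal shape as in the alignment sketch above
        def __init__(self, lemma, children=()):
            self.lemma = lemma
            self.children = list(children)

    def string_of(node):
        # The string corresponding to a node: here simply its lemma plus
        # the lemmas of its descendants (a simplification).
        return " ".join([node.lemma] + [string_of(c) for c in node.children])

    def features(v, w, alignment, daughter_labels, ewn_relation):
        # `alignment` maps nodes of D to nodes of D'; `daughter_labels`
        # holds relations already assigned to daughter alignments (available
        # thanks to the bottom-up traversal); `ewn_relation` stands in for a
        # EuroWordNet lookup returning 'synonym', 'hyperonym', 'hyponym' or None.
        feats = {"string_identity": string_of(v) == string_of(w)}
        assigned = {daughter_labels.get((c, alignment[c]))
                    for c in v.children if c in alignment}
        for rel in RELATIONS:  # one boolean per semantic relation
            feats["daughter_" + rel] = rel in assigned
        feats["unaligned_daughter"] = any(c not in alignment for c in v.children)
        feats["ewn_relation"] = ewn_relation(v.lemma, w.lemma)
        return feats

    def classify_bottom_up(node, alignment, daughter_labels, classify, ewn_relation):
        # Traverse D bottom-up; whenever a node is aligned, the classifier
        # assigns a label, so daughter decisions exist before the mother's turn.
        for c in node.children:
            classify_bottom_up(c, alignment, daughter_labels, classify, ewn_relation)
        if node in alignment:
            pair = (node, alignment[node])
            daughter_labels[pair] = classify(
                features(node, alignment[node], alignment, daughter_labels,
                         ewn_relation))

In this setup, running the experiments "without previous decisions" amounts to pre-filling daughter_labels from the gold annotation instead of from the classifier's own output.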
<Paragraph position="0"> Since our amount of data is limited, we used a memory-based classifier, which - in contrast to most other machine learning algorithms - performs no abstraction, allowing it to deal with productive but low-frequency exceptions typically occurring in NLP tasks (Daelemans et al., 1999). All memory-based learning was performed with TiMBL, version 5.1 (Daelemans et al., 2004), with its default settings (overlap distance function, gain-ratio feature weighting, k = 1); see the sketch below.</Paragraph>
<Paragraph position="1"> The first five chapters of The little prince were used to run a 5-fold cross-validated classification experiment. For the first chapter, the consensus alignment and relation labeling were used, while each of the other four chapters was annotated by one of the two annotators. The alignments to be classified are those from the human alignment. The baseline of always guessing equals - the majority class - gives a precision of 0.26, a recall of 0.51, and an F-score of 0.36. Table 4 presents the results broken down by relation type. The combined F-score of 0.64 is almost twice the baseline score. As expected, the highest score goes to equals, followed by a reasonable score on restates. Performance on the other relation types is rather poor, with no predictions at all for specifies and intersects.</Paragraph>
<Paragraph position="2"> Faking perfect previous decisions by using the annotation gives a considerable improvement, as shown in Table 5, especially on specifies, generalizes and intersects. This reveals that the proliferation of classification errors is indeed a problem that should be addressed.</Paragraph>
<Paragraph position="3"> [Table 5: scores (with SD) over all 5 folds on automatic classification of semantic relations without using previous decisions.] In sum, these results show that automatic classification of semantic relations is feasible and promising - especially when the proliferation of classification errors can be prevented - but still not nearly as good as human performance.</Paragraph> </Section> </Section> </Paper>
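For concreteness, here is a toy memory-based classifier in Python using the overlap distance and k = 1. Unlike TiMBL's defaults it omits gain-ratio feature weighting, and the stored instances are invented, so this is a sketch of the idea rather than the actual experimental setup.

    def overlap_distance(a, b):
        # Overlap distance: the number of features on which two instances disagree.
        return sum(x != y for x, y in zip(a, b))

    def classify_1nn(memory, instance):
        # k = 1 nearest neighbour: no abstraction, every training instance
        # is kept in memory and the label of the closest one is returned.
        _, label = min(memory, key=lambda m: overlap_distance(m[0], instance))
        return label

    # Invented instances with the feature layout of section 4.2 (string
    # identity, a daughter-relation flag, an unaligned-daughter flag, and
    # the EuroWordNet relation):
    memory = [
        ((True,  False, False, "synonym"),   "equals"),
        ((False, True,  False, "synonym"),   "restates"),
        ((False, False, True,  "hyperonym"), "generalizes"),
    ]
    print(classify_1nn(memory, (False, True, False, None)))  # -> restates

TiMBL itself is run from the command line (in its documentation, training and test files are passed with -f and -t); the sketch above merely mimics its k = 1 overlap behaviour in miniature.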