
<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0906">
  <Title>Learning Argument/Adjunct Distinction for Basque</Title>
  <Section position="5" start_page="1" end_page="3" type="evalu">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"> We found in the literature two main approaches to evaluate a system like the one proposed in this paper (T. Briscoe &amp; J. Carroll 1997, A. Sarkar &amp; D. Zeman 2000, A. Korhonen 2001):  There are two ways of interpreting Fisher's test, as one or two sided test. In the one sided fashion there is still another interpretation, as a right or left sided test. * Comparing the obtained information with a gold standard.</Paragraph>
    <Paragraph position="1"> * Calculating the coverage of the obtained information on a corpus. This can give an estimate of how well the information obtained could help a parser on that corpus.</Paragraph>
    <Paragraph position="2"> Under the former approach a further distinction emerges: using a dictionary as a gold standard, or performing manual evaluation, where some linguists extract the subcategorization frames appearing in a corpus and comparing them with the set of subcategorization frames obtained automatically.</Paragraph>
    <Paragraph position="3"> We decided to evaluate the system both ways, that is to say, using a gold standard and calculating the coverage over a corpus. The intention was to determine, all things being equal, the impact of doing it one way or the other.</Paragraph>
    <Section position="1" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Evaluation 1: comparison of the results with a gold standard
</SectionTitle>
      <Paragraph position="0"> gold standard From the 640 analyzed verbs, we selected 10 for evaluation. For each of these verbs we extracted from the corpus the list of all their dependents. The list was a set of bare verb-case pairs, that is, no context was involved and, therefore, as the sense of the given verb could not be derived, different senses of the verb were taken into account. We provided 4 human annotators/taggers with this list and they marked each dependent as either argument or adjunct. The taggers accomplished the task three times. Once, with the simple guideline of the implicational test and obligatoriness test, but with no further consensus. The inter-tagger agreement was low (57%). The taggers gathered and realized that the problem came mostly from semantics. While some taggers tagged the verb-case pairs assuming a concrete semantic domain the others took into account a wider rage of senses (moreover, in some cases the senses did not even match). So the tagging was repeated when all of them considered the same semantics to the different verbs. The inter-tagger agreement raised up to a 80%. The taggers gathered again to discuss, deciding over the non clear pairs.</Paragraph>
      <Paragraph position="1"> The list obtained from merging  the 4 lists in one is taken to be our gold standard. Notice that  Merging was possible once the annotators agreed on the marking of each element.</Paragraph>
      <Paragraph position="2"> when the annotators decided whether a possible argument was really an argument or not, no context was involved. In other words, they were deciding over bare pairs of verbs and cases.</Paragraph>
      <Paragraph position="3"> Therefore different senses of the verb were considered because there was no way to disambiguate the specific meaning of the verb. So the evaluation is an approximation of how well would the system perform over any corpus. Table 4 shows the results in terms of Precision and Recall.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
3.2 Evaluation 2: calculation of the coverage on a corpus
</SectionTitle>
      <Paragraph position="0"> The initial corpus was divided in two parts, one for training the system and another one for evaluating it. From the fraction reserved for evaluation we extracted 200 sentences corresponding to the same 10 verbs used in the &amp;quot;gold standard&amp;quot; based evaluation. In this case, the task carried out by the annotators consisted in extracting, for each of the 200 sentences, the elements (arguments/adjuncts) linked to the corresponding verb. Each element was marked as argument or adjunct. Note that in this case the annotation takes place inside the context of the sentence. In other words, the verb shows precise semantics.</Paragraph>
      <Paragraph position="1"> We performed a simple evaluation on the sentences (see table 5), calculating precision and recall over each argument marked by the annotators  . For example, if a verb appeared in a sentence with two arguments and the statistical filters were recognizing them as arguments, both precision and recall would be 100%. If, on the contrary, only one was found, then precision would be 100%, and recall 50%.</Paragraph>
      <Paragraph position="2">  The inter-tagger agreement in this case was of 97%.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.3 Discussion
</SectionTitle>
      <Paragraph position="0"> It is obvious that the results attained in the first evaluation are different than those in the second one. The origin of this difference comes mostly, on one hand, from semantics and, on the other hand, from the nature of statistics: * Semantic source. The former evaluation was not contextualized, while the latter used the sentence context. Our experience showed us that broader semantics (non-contextualized evaluation) leads to a situation where the number of arguments increases with respect to narrower (contextualized evaluation) semantics. This happens because in many cases different senses of the same verb require different arguments. So when the meaning of the verb is not specified, different meanings have to be taken into account and, therefore, the task becomes more difficult.</Paragraph>
      <Paragraph position="1"> * Statistical reason. The disagreement in the results comes from the nature of the statistics themselves. Any statistical measure performs better on the most frequent cases than on the less frequent ones. In the first experiment all possible arguments are evaluated, including the less frequent ones, whereas in the second experiment only the possible arguments found in the piece of corpus used were evaluated. In most of the cases, the possible arguments found were the most frequent ones.</Paragraph>
      <Paragraph position="2"> At this point it is important to note that the system deals with non-structural cases. In Basque there are three structural cases (ergative, absolutive and dative) which are special because, when they appear, they are always arguments. They correspond to the subject, direct object and indirect object functions. These cases are not very conflictive about argumenthood, mainly because in Basque the auxiliary bears information about their appearance in the sentence. So they are easily recognized and linked to the corresponding verb.</Paragraph>
      <Paragraph position="3"> That is the reason for not including them in this work. Precision and recall would improve considerably if they were included because they are the most frequent cases (as statistics perform well over frequent data), and also because the shallow parser links them correctly using the information carried by the auxiliary. Notice that we did not incorporate them because in the future we would like to use the subcategorization information obtained for helping our parser, and the non-structural cases are the most problematic ones.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>