<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0806">
  <Title>Senseval-3: The Spanish Lexical Sample Task</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Spanish Lexicon: MiniDir-2.1
</SectionTitle>
    <Paragraph position="0"> Due to the enormous effort needed for rigorously developing lexical resource and manually annotated corpora, we limited our work to the treatment of 46 words of three syntactic categories: 21 nouns, 7 adjectives, and 18 verbs. The selection was made trying to maintain the core words of the Senseval-2 Spanish task and sharing around 10 of the target words with Basque, Catalan, English, Italian, and Rumanian lexical tasks. Table 1 shows the set of selected words.</Paragraph>
    <Paragraph position="1"> We used the MiniDir-2.1 dictionary as the lexical resource for corpus tagging, which is a subset of the broader MiniDir1. MiniDir-2.1 was designed as a resource oriented to WSD tasks, i.e., with a granularity level low enough to avoid the overlapping of senses that commonly characterizes lexical sources.</Paragraph>
    <Paragraph position="2"> Regarding the words selected, the average number of senses per word is 5.33, corresponding to 4.52 senses for the nouns subgroup, 6.78 for verbs and 4 for adjectives (see table 1, right numbers in column '#senses').</Paragraph>
    <Paragraph position="3"> The content of MiniDir-2.1 has been checked and refined in order to guarantee not only its consis- null tency and coverage but also the quality of the gold standard. Each sense in Minidir-2.1 is linked to the corresponding synset numbers in EuroWordNet (Vossen, 1999) and contains syntagmatic information as collocates and examples extracted from corpora2. Regarding the dictionary entries, every sense is organized in nine lexical fields. See figure 1 for an example of one sense of the lexical entry conducir ('to drive').</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Spanish Corpus: MiniCors
</SectionTitle>
    <Paragraph position="0"> MiniCors is a semantically tagged corpus according to the Senseval lexical sample setting, labeled with the MiniDir-2.1 sense repository. The MiniCors corpus is formed by 12,625 tagged examples, covering 35,875 sentences and 1,506,233 words. The context considered for each example includes the target sentence, plus the previous and the following ones. All the examples have been extracted from the year-2000 corpus of the Spanish EFE News Agency, which includes 289,066 news (2,814,291 sentences and 95,344,946 words) spanning from January to December of 2000.</Paragraph>
    <Paragraph position="1"> For every word, a minimum of 200 examples have been manually tagged by three independent expert human annotators and disagreement cases have been resolved by another lexicographer (assigning a unique sense to each example). The annotation process has been assisted by a graphical Perl-Tk interface specifically designed for this task, and a  word.POS #senses #train / test / unlab %MFS actuar.v 3 / 4 133 / 67 / 1,500 73.13 apoyar.v 3 / 4 259 / 128 / 1,500 92.97 apuntar.v 4 / 9 213 / 106 / 1,500 50.94 arte.n 3 / 4 251 / 121 / 1,500 95.87 autoridad.n 2 / 4 268 / 132 / 1,500 96.97 bajar.v 3 / 5 235 / 115 / 1,500 84.35 banda.n 4 / 7 230 / 114 / 1,500 70.18 brillante.a 2 / 2 126 / 63 / 1,369 88.89 canal.n 4 / 6 262 / 131 / 1,500 65.65 canalizar.v 2 / 3 253 / 126 / 700 96.83 ciego.a 3 / 5 102 / 52 / 390 59.62 circuito.n 4 / 5 261 / 132 / 1,500 56.82 columna.n 7 / 8 129 / 64 / 1,258 20.31 conducir.v 4 / 5 134 / 66 / 1,094 45.45 coraz'on.n 3 / 6 123 / 62 / 1,500 45.16 corona.n 3 / 4 124 / 64 / 916 68.75 duplicar.v 2 / 2 254 / 126 / 1,500 96.03 explotar.v 5 / 5 212 / 103 / 1,500 45.63 ganar.v 3 / 8 237 / 118 / 1,500 90.68 gracia.n 3 / 5 72 / 38 / 1,209 50.00 grano.n 3 / 4 117 / 61 / 524 60.66 hermano.n 2 / 3 128 / 66 / 1,500 90.91 jugar.v 3 / 5 236 / 117 / 1,500 90.60 letra.n 5 / 5 226 / 114 / 1,251 34.21 masa.n 3 / 4 172 / 85 / 1,151 43.53 mina.n 2 / 4 134 / 66 / 1,458 51.52 natural.a 5 / 6 215 / 107 / 1,500 46.73 naturaleza.n 3 / 4 258 / 128 / 1,500 67.19 operaci'on.n 3 / 4 134 / 66 / 1,500 54.55 'organo.n 2 / 3 263 / 131 / 1,500 85.50 partido.n 2 / 2 133 / 66 / 1,500 56.06 pasaje.n 4 / 4 220 / 111 / 375 45.95 perder.v 4 / 11 218 / 106 / 1,500 60.38 popular.a 3 / 3 133 / 67 / 1,500 44.78 programa.n 3 / 3 267 / 133 / 1,500 75.19 saltar.v 8 / 15 200 / 101 / 1,117 29.70 simple.a 3 / 4 117 / 61 / 1,500 70.49 subir.v 3 / 5 231 / 114 / 1,500 74.56 tabla.n 3 / 6 130 / 64 / 1,500 76.56 tocar.v 6 / 13 158 / 78 / 1,500 28.21 tratar.v 3 / 12 143 / 72 / 1,235 43.06 usar.v 2 / 3 263 / 130 / 1,500 97.69 vencer.v 3 / 7 134 / 65 / 1,500 80.00 verde.a 2 / 5 69 / 33 / 1,500 60.61 vital.a 2 / 3 131 / 65 / 1,500 75.38 volar.v 3 / 6 122 / 60 / 705 53.33 avg/total 3.30 / 5.33 8,430 / 4,195 / 61,252 67.72  tagging handbook for the annotators. The inter-annotator complete agreement achieved was 90% for nouns, 83% for adjectives, and 83% for verbs. These are the best results obtained in a comparative study (Taul'e et al., 2004) with other dictionaries used for tagging the same corpus. The senses corresponding to multi-word expressions were eliminated since they are not considered in MiniDir-2.1. The initial goal was to obtain for each word at least 75 examples plus 15 examples per sense. For the words below these figures we performed a second round by labeling up to 200 examples more. After that, senses with less than 15 occurrences ( a0 3.5% of the examples) have been simply discarded from the datasets. See table 1, left numbers in column '#senses', for the final ambiguity rates. We know that this is a quite controversial decision that leads to a simplified setting. But we preferred to maintain the proportions of the senses naturally appearing in the EFE corpus rather than trying to artificially find examples of low frequency senses by mixing examples from many sources or by getting them with specific predefined patterns. Thus, systems trained on the MiniCors corpus are intended to discriminate between the typical word senses appearing in a news corpus.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Resources Provided to Participants
</SectionTitle>
    <Paragraph position="0"> Participants were provided with the complete Minidir-2.1 dictionary, a training set with 2/3 of the labeled examples, a test set with 1/3 of the examples and a complementary big set of unlabeled examples, limited to 1,500 for each word (when available). Each example is provided with a non null list of category-labels marked according to two annotation schemes: ANPA and IPTC3.</Paragraph>
    <Paragraph position="1"> Aiming at helping teams with few resources on the Spanish language, sentences in all datasets were tokenized, lemmatized and POS tagged, using the Spanish linguistic processors developed at TALP-CLiC4, and provided as complementary files. Table 1 contains information about the sizes of the datasets and the proportion of the most-frequent sense for each word (MFC). The baseline MFC classifier obtains a high accuracy of 67.72% due to the moderate number of senses considered.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 The Participant Systems
</SectionTitle>
    <Paragraph position="0"> Seven teams took part on the Spanish Lexical Sample task, presenting a total of nine systems. We will refer to them as: IRST, UA-NSM, UA-NP, UA-SRT, UMD, UNED, SWAT, Duluth-SLSS, and CSUSMCS. From them, seven are supervised and two unsupervised (UA-NSM, UA-NP). Only one of the participant systems uses a mixed learning strategy that allows to incorporate the knowledge from the unlabeled examples, namely UA-SRT. It is a Maximum Entropy-based system, which makes use of a re-training algorithm (inspired by Mitchell's cotraining) for iteratively relabeling unannotated examples with high precision and adding them to the training of the MaxEnt algorithm.</Paragraph>
    <Paragraph position="1">  method based on co-occurrences (UNED). Some systems used a voted combination of these basic learning algorithms to produce the final WSD system (SWAT, Duluth-SLSS). The two unsupervised algorithms apply only to nouns and target at obtaining high precision results (the annotations on adjectives and verbs come from a supervised MaxEnt system). UA-NSM method is called Specification Marks and uses the words that co-occur with the target word and their relation in the noun WordNet hierarchy. UA-NP bases the disambiguation on syntactic patterns and unsupervised corpus, relying on the &amp;quot;one sense per pattern&amp;quot; assumption.</Paragraph>
    <Paragraph position="2"> All supervised teams used the POS and lemmatization provided by the organization, except Duluth-SLSS, which only used raw lexical information. A few systems used also the category labels provided with the examples. Apparently, none of them used the extra information in MiniDir (examples, collocations, synonyms, WordNet links, etc.), nor syntactic information. Thus, we think that there is room for substantial improvement in the feature set design. It is worth mentioning that the IRST system makes use of a kernel including semantic information within the SVM framework.</Paragraph>
  </Section>
class="xml-element"></Paper>