<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0836">
  <Title>Senseval-3: The Catalan Lexical Sample Task</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Catalan Lexicon: MiniDir-Cat
</SectionTitle>
    <Paragraph position="0"> The Catalan language participates in the Senseval evaluation exercise for the first time. Due to time constraints, we had to scale back our initial plan of providing annotated corpora for up to 45 words to the final set of 27 words. We preferred to reduce the number of words while maintaining quality in the dictionary development, the corpus annotation process, and the number of examples per word.</Paragraph>
    <Paragraph position="1"> These words belong to three syntactic categories: 10 nouns, 5 adjectives, and 12 verbs. They were selected by choosing a subset of the Spanish lexical sample task and trying to share around 10 of the target words with the Basque, English, Italian, and Romanian lexical sample tasks. See table 1 for a complete list of the words.</Paragraph>
    <Paragraph position="2"> As the lexical resource for corpus tagging we used the MiniDir-Cat dictionary, which is being developed by the CLiC research group. MiniDir-Cat was conceived specifically as a resource oriented to WSD tasks: we have emphasized low granularity in order to avoid the overlapping of senses usually present in many lexical sources.</Paragraph>
    <Paragraph position="4"> #DEFINITION: Grup de persones que s'uneixen amb fins comuns, especialment delictius
#EXAMPLE: una banda que prostituïa dones i robava cotxes de luxe; la banda ultra de l'Atlètic de Madrid
#SYNONYMS: grup; colla
#COLLOCATIONS: banda armada; banda juvenil; banda de delinqüents; banda mafiosa; banda militar; banda organitzada; banda paramilitar; banda terrorista; banda ultra

Regarding the polysemy of the selected words, the average number of senses per word is 5.37, corresponding to 4.30 senses for the nouns subset, 6.83 for verbs and 4 for adjectives (see table 1, right numbers in column '#senses').</Paragraph>
    <Paragraph position="5"> The content of MiniDir-2.1 has been checked and refined in order to guarantee not only its consistency and coverage but also the quality of the gold standard. Each sense in MiniDir-2.1 is linked to the corresponding synset numbers in the EuroWordNet semantic net (Vossen, 1999) (zero, one, or more synsets per sense) and contains syntagmatic information in the form of collocates and examples extracted from corpora. Every sense is organized into the following nine lexical fields: LEMMA, POS, SENSE, DEFINITION, EXAMPLES, SYNONYMS, ANTONYMS (only in the case of adjectives), COLLOCATIONS, and SYNSETS. See figure 1 for an example of one sense of the lexical entry banda (noun 'gang').</Paragraph>
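    <Paragraph position="6"> As a rough illustration of the field layout described above, the following Python sketch parses an entry written in the '#FIELD: value' style of the banda example into a dictionary. The function name parse_entry, the regular expression, and the decision to split every field on semicolons are assumptions made for this sketch; they are not part of the MiniDir-Cat distribution, whose actual file format may differ.

```python
import re

# Assumption: each field starts with "#NAME:" where NAME is upper-case letters,
# as in the banda example. This is not an official MiniDir-Cat format spec.
FIELD_MARKER = re.compile(r"#([A-Z]+):")


def parse_entry(text):
    """Split a flattened entry into a dict mapping field name to a list of values."""
    # re.split with one capturing group returns
    # [prefix, name1, value1, name2, value2, ...].
    parts = FIELD_MARKER.split(text)
    entry = {}
    for name, value in zip(parts[1::2], parts[2::2]):
        # Values are treated as semicolon-separated lists (a simplification:
        # a definition that itself contains ';' would be split too).
        entry[name] = [item.strip() for item in value.split(";") if item.strip()]
    return entry


if __name__ == "__main__":
    sample = (
        "#DEFINITION: Grup de persones que s'uneixen amb fins comuns, "
        "especialment delictius "
        "#SYNONYMS: grup; colla "
        "#COLLOCATIONS: banda armada; banda juvenil"
    )
    for field, values in parse_entry(sample).items():
        print(field, values)
```

Keeping every field as a list makes single-valued fields such as DEFINITION slightly awkward to read, but it avoids special-casing the multi-valued fields (EXAMPLES, SYNONYMS, COLLOCATIONS, SYNSETS).</Paragraph>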
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Catalan Corpus: MiniCors-Cat
</SectionTitle>
    <Paragraph position="0"> MiniCors-Cat is a corpus semantically tagged according to the Senseval lexical sample setting, so a single target word per example is labeled with a sense from the MiniDir-Cat sense repository. The MiniCors-Cat corpus consists of 6,722 tagged examples, covering 45,509 sentences and [...] per sentence). The context considered for each example includes the paragraph in which the target word occurs, plus the previous and the following paragraphs. All the examples have been extracted from the corpus of the ACN Catalan news agency, which includes about 110,588 news items (January 2000-December 2003). This corpus has been POS-tagged. Following MiniDir-2.1, those examples containing the target word in a multiword expression have been discarded.

Table 1
word.POS       #senses        #train / test / unlab     %MFS
actuar.v       2 / 3          197 / 99 / 2,442          80.81
apuntar.v      5 / 11         184 / 93 / 1,881          50.54
autoritat.n    2 / 2          188 / 93 / 102            87.10
baixar.v       3 / 4          189 / 92 / 1,572          59.78
banda.n        3 / 5          149 / 75 / 180            60.00
canal.n        3 / 6          188 / 95 / 551            56.84
canalitzar.v   2 / 2          196 / 99 / 0              79.80
circuit.n      4 / 4          165 / 83 / 55             46.99
conduir.v      5 / 7          198 / 101 / 764           63.37
cor.n          4 / 7          144 / 72 / 634            50.00
explotar.v     3 / 4          193 / 98 / 69             72.45
guanyar.v      2 / 6          184 / 92 / 2,106          76.09
jugar.v        4 / 4          115 / 61 / 0              57.38
lletra.n       5 / 6          166 / 86 / 538            30.23
massa.n        2 / 3          145 / 74 / 33             59.46
mina.n         2 / 4          185 / 92 / 121            90.22
natural.a      3 / 6          170 / 88 / 2,320          80.68
partit.n       2 / 2          180 / 89 / 2,233          95.51
passatge.n     2 / 4          140 / 70 / 0              55.71
perdre.v       2 / 8          157 / 78 / 2,364          91.03
popular.a      3 / 3          137 / 70 / 2,472          51.43
pujar.v        2 / 4          191 / 95 / 730            71.58
saltar.v       6 / 17         111 / 60 / 134            38.33
simple.a       2 / 3          148 / 75 / 310            85.33
tocar.v        6 / 12         161 / 78 / 789            37.18
verd.a         2 / 5          128 / 64 / 1,315          79.69
vital.a        3 / 3          160 / 81 / 220            60.49
avg/total      3.11 / 5.37    4,469 / 2,253 / 23,935    66.36
</Paragraph>
    <Paragraph position="1"> For every word, a total of 300 examples were manually tagged by two independent expert human annotators, though some of them had to be discarded due to errors in the automatic POS tagging and multiword filtering. In cases of disagreement, a third lexicographer decided the definitive sense tags. The whole annotation process was assisted by a graphical Perl-Tk interface specifically designed for this task (in the framework of the Meaning European research project) and by a tagging handbook for the annotators (Artigas et al., 2003). The inter-annotator agreement achieved was very high: 96.5% for nouns, 88.7% for adjectives, 92.1% for verbs, 93.16% overall.</Paragraph>
    <Paragraph position="2"> The initial goal was to obtain, for each word, at least 75 examples plus 15 examples per sense. However, after the labeling of the 300 examples, senses with fewer than 15 occurrences were simply discarded from the Catalan datasets. See table 1, left numbers in column '#senses', for the final ambiguity rates. We are aware that this is a rather controversial decision that leads to a simplified setting. However, we preferred to maintain the proportions of the senses as they naturally appear in the ACN corpus rather than artificially finding examples of low-frequency senses by mixing examples from many sources or by retrieving them with specific predefined patterns. Thus, systems trained on the MiniCors-Cat corpus are only intended to discriminate between the most important word senses appearing in a general news corpus.</Paragraph>
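    <Paragraph position="3"> The frequency cut-off described above can be sketched as follows. This is a minimal Python illustration, not the organizers' actual script: the function name drop_rare_senses, the (example id, sense label) input format, and the assumption that the examples of a discarded sense are removed together with the sense are all hypothetical. Only the threshold of 15 occurrences comes from the text.

```python
from collections import Counter

MIN_OCCURRENCES = 15  # threshold stated in the text


def drop_rare_senses(examples, min_occurrences=MIN_OCCURRENCES):
    """Keep only the examples whose sense is frequent enough.

    examples: list of (example_id, sense_label) pairs for ONE target word.
    Assumption: examples tagged with a discarded sense are dropped as well.
    """
    counts = Counter(sense for _, sense in examples)
    kept_senses = {s for s, n in counts.items() if n >= min_occurrences}
    return [(ex_id, sense) for ex_id, sense in examples if sense in kept_senses]


if __name__ == "__main__":
    # Toy data with hypothetical sense labels A, B and C.
    toy = (
        [(i, "A") for i in range(20)]
        + [(100 + i, "B") for i in range(15)]
        + [(200 + i, "C") for i in range(14)]
    )
    kept = drop_rare_senses(toy)
    print(len(kept), sorted({sense for _, sense in kept}))
```

Because the filter works on counts from the annotated sample itself, the surviving senses keep their natural relative proportions, which matches the motivation given in the paragraph above.</Paragraph>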
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Resources Provided to Participants
</SectionTitle>
    <Paragraph position="0"> Participants were provided with the complete MiniDir-Cat dictionary, a training set with 2/3 of the labeled examples, a test set with 1/3 of the examples, and a complementary large set of all the available unlabeled examples in the ACN corpus (with a maximum of 2,472 extra examples for the adjective popular). Each example comes with a non-empty list of category labels taken from the newspaper section labels (politics, sports, international, etc.). To help teams with few resources for the Catalan language, all corpora were tokenized, lemmatized and POS tagged using the Catalan linguistic processors developed at TALP-CLiC, and provided to participants in that form.</Paragraph>
    <Paragraph position="1"> Table 1 contains information about the sizes of the datasets and the proportion of the most frequent sense for each word (column '%MFS'). A baseline classifier that always assigns the most frequent sense obtains a high accuracy of 66.36%, due to the small number of senses considered.</Paragraph>
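    <Paragraph position="2"> The overall baseline figure can be checked against the per-word rows of Table 1. The Python sketch below assumes that the 66.36% value is micro-averaged, i.e. weighted by the number of test examples per word; the source does not say so explicitly, and the function name micro_mfs_accuracy is introduced here only for illustration. The test sizes and %MFS values are copied from Table 1.

```python
# (word.POS, number of test examples, %MFS) copied from Table 1.
TABLE_1 = [
    ("actuar.v", 99, 80.81), ("apuntar.v", 93, 50.54), ("autoritat.n", 93, 87.10),
    ("baixar.v", 92, 59.78), ("banda.n", 75, 60.00), ("canal.n", 95, 56.84),
    ("canalitzar.v", 99, 79.80), ("circuit.n", 83, 46.99), ("conduir.v", 101, 63.37),
    ("cor.n", 72, 50.00), ("explotar.v", 98, 72.45), ("guanyar.v", 92, 76.09),
    ("jugar.v", 61, 57.38), ("lletra.n", 86, 30.23), ("massa.n", 74, 59.46),
    ("mina.n", 92, 90.22), ("natural.a", 88, 80.68), ("partit.n", 89, 95.51),
    ("passatge.n", 70, 55.71), ("perdre.v", 78, 91.03), ("popular.a", 70, 51.43),
    ("pujar.v", 95, 71.58), ("saltar.v", 60, 38.33), ("simple.a", 75, 85.33),
    ("tocar.v", 78, 37.18), ("verd.a", 64, 79.69), ("vital.a", 81, 60.49),
]


def micro_mfs_accuracy(rows):
    """Most-frequent-sense accuracy weighted by test-set size, in percent."""
    total = sum(test for _, test, _ in rows)
    # Recover, per word, how many test examples carry the most frequent sense.
    correct = sum(round(test * mfs / 100) for _, test, mfs in rows)
    return 100 * correct / total


if __name__ == "__main__":
    print(sum(test for _, test, _ in TABLE_1))      # total test examples
    print(round(micro_mfs_accuracy(TABLE_1), 2))    # overall MFS baseline
```

Under the micro-averaging assumption, the per-word rows sum to 2,253 test examples and reproduce the 66.36% reported in the last row of the table.</Paragraph>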
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 The Participant Systems
</SectionTitle>
    <Paragraph position="0"> Five teams took part in the Catalan Lexical Sample task, presenting a total of seven systems. We will refer to them as: IRST, SWAT-AB, SWAT-CP, SWAT-CA, UNED, UMD, and Duluth-CLSS. All of them are purely supervised machine learning approaches, so, unfortunately, none of them incorporates the knowledge from the unlabeled examples. Most of these systems also participated in the Spanish lexical sample task, with almost identical configurations.

Regarding the supervised learning approaches applied, we find AdaBoost, Naive Bayes, vector-based cosine similarity, and Decision Lists (SWAT systems), Decision Trees (Duluth-CLSS), Support Vector Machines [...] based on co-occurrences (UNED). Some systems used a combination of these basic learning algorithms to produce the final WSD system. For instance, Duluth-CLSS applies a bagging-based ensemble of Decision Trees. SWAT-CP performs majority voting over Decision Lists, the cosine-based vector model and the Bayesian classifier. SWAT-CA combines, again by majority voting, the previous three classifiers with the AdaBoost-based SWAT-AB system. The Duluth-CLSS system is a replica of the one presented at the Senseval-2 English lexical sample task.</Paragraph>
    <Paragraph position="1"> All teams used the POS and lemmatization provided by the organization, except Duluth-CLSS, which only used raw lexical information. A few systems also used the category labels provided with the examples. Apparently, none of them used the extra information in MiniDir (examples, collocations, synonyms, WordNet links, etc.), nor syntactic information. Thus, we think that there is room for substantial improvement in the feature set design. It is worth mentioning that the IRST system makes use of a kernel within the SVM framework, including semantic information. See the IRST system description paper for more information.</Paragraph>
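    <Paragraph position="2"> The majority voting used by SWAT-CP and SWAT-CA to combine their base classifiers can be sketched generically as below. This Python fragment is an illustration of plain majority voting only: it is not taken from the SWAT systems, and the tie-breaking rule (prefer the label proposed by the earliest classifier in the list) is an assumption, since the text does not say how ties are resolved. The sense labels in the usage example are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the sense label proposed by most classifiers.

    predictions: list of sense labels, one per classifier, in priority order.
    Assumed tie-break: among the most voted labels, the one proposed by the
    earliest classifier in the list wins.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


if __name__ == "__main__":
    # Three base classifiers, e.g. decision list, cosine model, Naive Bayes.
    print(majority_vote(["banda.1", "banda.2", "banda.2"]))
    # Complete disagreement: falls back to the first classifier's label.
    print(majority_vote(["banda.1", "banda.2", "banda.3"]))
```

With three voters a tie can only arise when all classifiers disagree; with four voters, as in SWAT-CA, two-against-two ties are also possible, so the choice of tie-breaking rule matters more there.</Paragraph>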
  </Section>
</Paper>