<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0805"> <Title>The Italian Lexical Sample Task at SENSEVAL-3</Title> <Section position="3" start_page="0" end_page="2" type="metho"> <SectionTitle> 2 Manual Annotation </SectionTitle> <Paragraph position="0"> A collection of manually labeled instances was built for three main reasons: 1. automatic evaluation (using the Scorer2 program) required a Gold Standard list of senses provided by human annotators; 2. supervised WSD systems need a labeled set of training data, that in our case was twice larger than the test set; 3. manual semantic annotation is a time null consuming activity, but SENSEVAL represents the framework to build reusable benchmark resources. Besides, manual sense tagging entails the revision of the sense inventory, whose granularity does not always satisfy annotators. null</Paragraph> <Section position="1" start_page="0" end_page="1" type="sub_section"> <SectionTitle> 2.1 Corpus and Words Choice </SectionTitle> <Paragraph position="0"> The document collection from which the annotators selected the text snippets containing the lemmata to disambiguate was the macro-balanced section of the Meaning Italian Corpus (Bentivogli et al., 2003). This corpus is an open domain collection of newspaper articles that contains about 90 million tokens covering a time-spam of 4 years (1998-2001). The corpus was indexed in order to browse it with the Toolbox for Lexicographers (Giuliano, 2002), a concordancer that enables taggers to highlight the occurrences of a token within a context.</Paragraph> <Paragraph position="1"> Two taggers chose 45 lexical entries (25 nouns, 10 adjectives and 10 verbs) according to their polysemy in the sense inventory, their polysemy in the corpus and their frequency (Edmonds, 2000).</Paragraph> <Paragraph position="2"> The words that had already been used at SENSEVAL-2 were avoided. Ten words were shared with the Spanish, Catalan and Basque lexical sample tasks.</Paragraph> <Paragraph position="3"> Annotators were provided with a formula that indicated the number of labeled instances for each No. of labeled instances for each lemma = 75 + (15*no. of attested senses) + (7* no. of attested multiwords), where 75 is a fixed number of examples distributed over all the attested senses.</Paragraph> </Section> <Section position="2" start_page="1" end_page="1" type="sub_section"> <SectionTitle> Association for Computational Linguistics </SectionTitle> <Paragraph position="0"> for the Semantic Analysis of Text, Barcelona, Spain, July 2004 SENSEVAL-3: Third International Workshop on the Evaluation of Systems siderably frequent and polysemous before starting to tag and save the instances.</Paragraph> <Paragraph position="1"> As a result, average polysemy attested in the labeled data turned out to be quite high: six senses for the nouns, six for the adjectives and seven for the verbs.</Paragraph> </Section> <Section position="3" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 2.2 Sense Inventory and Manual Tagging </SectionTitle> <Paragraph position="0"> Differently from the Italian lexical sample task at SENSEVAL-2, where the instances were tagged according to ItalWordNet (Calzolari et al., 2002), this year annotators used the Italian MultiWord-Net, (hereafter MWN) developed at ITC-Irst (Pianta, 2002). This lexical-semantic database includes about 42,000 lemmata and 60,000 word senses, corresponding to 34,000 synsets. 
Instead of distributing to participants only the senses of each lemma and a limited hierarchical structure of their semantic relations (as was done at SENSEVAL-2), the entire resource was made available. Nevertheless, none of the six participating systems, being supervised, actually needed MWN.</Paragraph> <Paragraph position="1"> The annotators' task was to tag one occurrence of each selected word in all the saved instances, assigning only one sense drawn from the Italian MWN. The Toolbox for Lexicographers (a tool designed and developed by Christian Girardi at ITC-Irst, Trento, Italy) enabled annotators to browse the document collection and to save the relevant text snippets, while a graphical interface was used to annotate the occurrences, storing them in a database. Generally, instances consisted of the sentence containing the ambiguous lemma, together with a preceding and a following sentence. Nevertheless, annotators tended to save the minimal piece of information that a human would need to disambiguate the lemma, which was often shorter than three sentences.</Paragraph> <Paragraph position="2"> The two annotators worked in parallel: first, each of them saved part of the instances and tagged the occurrences; then, each tagged the examples that had been chosen by the other.</Paragraph> <Paragraph position="3"> More importantly, they interacted with a lexicographer, who reviewed the sense inventory whenever they encountered difficulties. Sometimes there was an overlap between two or more word senses, while in other cases MWN needed to be enriched by adding new synsets, relations or definitions.</Paragraph> <Paragraph position="4"> All the 45 lexical entries we considered were thoroughly reviewed, so that word senses were as clear as possible to the annotators. The revision of MWN made manual tagging easier and led to a high Inter-Tagger Agreement (ranging between 73 and 99 per cent), which was also reflected in the kappa statistics (ranging between 0.68 and 0.99).</Paragraph> <Paragraph position="5"> Table 1 below summarizes the results of the manual tagging.</Paragraph> <Paragraph position="6"> Once the instances had been collected and tagged by both annotators, we asked them to discuss the examples on which they disagreed and to settle on a definitive meaning for them.</Paragraph> <Paragraph position="7"> Since the annotators built the corpus while tagging, they tended to choose occurrences whose meaning was immediately straightforward, avoiding problematic cases. As a consequence, the ITA turned out to be very high, and the distribution of the senses in the labeled data set did not reflect their actual frequency in the Italian language, which may have affected the systems' performance.</Paragraph> <Paragraph position="8"> The annotators assigned different senses to 674 instances out of a total of 7,584 labeled examples. Generally, disagreement was due to trivial mistakes, and in most cases one of the two assigned meanings was chosen as the final one. Nevertheless, in 46 cases the third and final annotation was different from the previous two, which suggests that a few word senses were not completely straightforward even after the revision of the sense inventory.</Paragraph> <Paragraph position="9"> For example, the following instance for the lemma &quot;vertice&quot; (vertex, acme, peak) was annotated in three different ways: La struttura lavorativa - spiega Grandi - ha un carattere paramilitare.
Al vertice della piramide c'è il direttore, poi i manager, quelli con la cravatta e la camicia a mezze maniche. (&quot;The work structure - Grandi explains - has a paramilitary character. At the top of the pyramid there is the director, then the managers, the ones with ties and short-sleeved shirts.&quot;)</Paragraph> <Paragraph position="10"> Annotator 1 tagged it with sense 2 (Factotum, &quot;the highest point of something&quot;), while Annotator 2 chose a different sense (&quot;the point of intersection of lines or the point opposite the base of a figure&quot;) because the text refers to the vertex of a pyramid. Actually, the snippet used this abstract image to describe the structure of a company, so in the end the two taggers opted for sense 5 (Administration, &quot;the group of the executives of a corporation&quot;). Therefore, subjectivity in manual tagging was considerably reduced by adjusting the sense repository and by manually selecting each single instance, but it could not be eliminated.</Paragraph> </Section> </Section> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Automatic Annotation </SectionTitle> <Paragraph position="0"> We provided participants with three data sets: labeled training data (twice as large as the test set), unlabeled training data (about 10 times as many instances as the labeled data) and test data. In order to facilitate participation, we PoS-tagged the labeled data sets using an Italian version of the TnT PoS-tagger (Brants, 2000), trained on the Elsnet corpus.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 Participants' results </SectionTitle> <Paragraph position="0"> Three groups participated in the Italian lexical sample task, testing six systems: two developed by ITC-Irst, and the others by Swarthmore College (together with the Hong Kong Polytechnic University) and by UNED. The baseline results were obtained by running a simple algorithm that assigned to each instance of the test set the most frequent sense of its lemma in the training set. All the systems outperformed the baseline and obtained similar results. Compared to the baselines of the other Lexical Sample tasks, ours is much lower because of the way we applied the formula described in Section 2.1: we tagged the same number of instances for all the senses of each lemma, disregarding their frequency in the document collection. As a result, the distribution of the examples over the attested senses did not reflect the one found in natural language, which may have affected the systems' performance.</Paragraph> <Paragraph position="1"> While at SENSEVAL-2 the test set senses were clustered in order to compute mixed- and coarse-grained scores, this year we decided to return just the fine-grained measure, in which an automatically tagged instance is correct only if its sense corresponds to the one assigned by the human annotators, and wrong otherwise (i.e. a one-to-one mapping).</Paragraph> <Paragraph position="2"> There are different sense clustering methods, but grouping meanings according to some sort of similarity is always an arbitrary decision. We intended to calculate a domain-based coarse-grained score, in which word senses would be clustered according to the domain information provided in WordNet Domains (Magnini and Cavaglia, 2000). Unfortunately, this approach would have been meaningful for nouns, but not for adjectives and verbs, which belong mostly to the generic Factotum domain, so we discarded the idea.</Paragraph> <Paragraph position="3"> All six participating systems were supervised, which means they all used the labeled training data set, and none of them used either the unlabeled instances or the lexical database.
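Before describing the individual systems, the most-frequent-sense baseline and the fine-grained scoring just described can be made concrete with a short sketch. This is a minimal illustration in Python under our own assumptions: the data format (lists of (lemma, sense) pairs) and all function and variable names are hypothetical, and the code is not the actual Scorer2 program nor any participating system.

from collections import Counter, defaultdict

def train_mfs_baseline(training_instances):
    # training_instances: iterable of (lemma, sense) pairs from the labeled training data.
    counts = defaultdict(Counter)
    for lemma, sense in training_instances:
        counts[lemma][sense] += 1
    # For each lemma, keep the sense that occurs most often in the training set.
    return {lemma: senses.most_common(1)[0][0] for lemma, senses in counts.items()}

def tag_with_mfs(test_lemmas, most_frequent_sense):
    # Assign to every test instance the most frequent training sense of its lemma.
    return [most_frequent_sense.get(lemma) for lemma in test_lemmas]

def fine_grained_score(predicted, gold):
    # Fine-grained (one-to-one) scoring: a prediction counts as correct only if it
    # matches the gold-standard sense exactly, and as wrong otherwise.
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

Because the labeled data contain roughly the same number of instances for every sense of a lemma, such a baseline gains little from sense frequencies, which is consistent with the comparatively low baseline score discussed above.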
UNED also used SemCor as an additional source of training examples.</Paragraph> <Paragraph position="4"> The IRST-Kernels system exploited kernel methods for pattern abstraction and for the combination of different knowledge sources, in particular paradigmatic and syntagmatic information, and achieved the best F-measure score.</Paragraph> <Paragraph position="5"> IRST-Ties, a generalized pattern abstraction system originally developed for Information Extraction tasks and mainly based on the boosted wrapper induction algorithm, used only lemma and POS as features. Proposed as a &quot;baseline&quot; system to discover syntagmatic patterns, it obtained a rather low recall (about 55 per cent), which affected its F-measure, but it proved to be the most precise system.</Paragraph> <Paragraph position="6"> Swarthmore College wrote three supervised classifiers: a clustering system based on cosine similarity, a decision list system and a naive Bayes classifier. In addition, the Swarthmore group took advantage of two systems developed at the Hong Kong Polytechnic University: a maximum entropy classifier and a system based on boosting (Italianswat_hk-bo). The run swat-hk-italian combined all five classifiers according to a simple majority-vote scheme, while a second run did the same using only the three classifiers developed at Swarthmore.</Paragraph> <Paragraph position="7"> The system presented by the UNED group employed similarity as a learning paradigm, considering the co-occurrence of different nouns and adjectives.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 General Remarks on Task Complexity </SectionTitle> <Paragraph position="0"> As mentioned above, the 45 words for the Italian lexical sample task were chosen according to their polysemy and frequency. We addressed difficult words, each of which had at least 5 senses in MWN.</Paragraph> <Paragraph position="1"> Polysemy, however, does not seem to be directly related to the systems' results (Calzolari, 2002): the average F-measure of our six runs for nouns (0.512) was higher than for adjectives (0.472) and verbs (0.448), although the former had more attested senses in the labeled data.</Paragraph> <Paragraph position="2"> The difficulty of returning the correct sense seems to depend on the blurred distinction between similar meanings rather than on the number of senses themselves. If we consider the nouns &quot;attacco&quot; (attack) and &quot;esecuzione&quot; (performance, execution), for which the systems obtained the worst and one of the best average results respectively, we notice that the 4 attested senses of &quot;esecuzione&quot; were clearly distinguished and referred to different domains (Factotum, Art, Law and Politics), while the 6 attested senses of &quot;attacco&quot; were more subtly defined. Senses 2, 7 and 11 were very difficult to discriminate and often appeared in metaphorical contexts. Senses 5 and 6, for their part, belong to the Sport domain and are not always easy to distinguish.</Paragraph> </Section> </Section> </Paper>