File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/c04-1053_intro.xml

Size: 4,103 bytes

Last Modified: 2025-10-06 14:02:05

<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1053">
  <Title>Evaluating Cross-Language Annotation Transfer in the MultiSemCor Corpus</Title>
  <Section position="3" start_page="0" end_page="1" type="intro">
    <SectionTitle>
2 The Annotation Transfer Methodology
</SectionTitle>
    <Paragraph position="0"> The MultiSemCor project (Bentivogli and Pianta, 2002) aims to build an English/Italian parallel corpus, aligned at the word level and annotated with PoS, lemma and word sense. The parallel corpus is created by exploiting the SemCor corpus (Landes et al., 1998), which is a subset of the English Brown corpus containing about 700,000 running words. In SemCor all the words are tagged with PoS, and more than 200,000 content words are also lemmatized and sense-tagged with reference to the WordNet lexical database (Fellbaum, 1998).</Paragraph>
    <Paragraph position="1"> The main hypothesis underlying this methodology is that, given a text and its translation into another language, the semantic information is mostly preserved during the translation process. Therefore, if the texts in one language have been semantically annotated and their translations have not, annotations can be transferred from the source language to the target using word alignment as a bridge.</Paragraph>
    <Paragraph position="2"> The first problem to be solved in the creation of MultiSemCor was that Italian translations of the SemCor texts did not exist. Our solution was to have the translations made by professional translators. Given the high cost of building semantically annotated corpora, which requires specific skills and very specialized training, we think that manually translating the annotated corpus and automatically transferring the annotations may be preferable to hand-labelling a corpus from scratch. Not only are translators more readily available than linguistic annotators, but translations may be a more flexible and durable kind of annotation. Moreover, the annotation transfer methodology has the further advantage of producing a parallel corpus.</Paragraph>
    <Paragraph position="3"> Compared with a situation in which a translation of the corpus is already available, a corpus translated on purpose has the advantage that translations can be &quot;controlled&quot;, i.e. carried out following criteria that aim at maximizing alignment and annotation transfer. Our professional translators are asked to use, preferably, the same dictionaries used by the word aligner, and to maximize, whenever possible, the lexical correspondences between source and target texts. The translators are also told that the controlled translation criteria should never be followed to the detriment of good Italian prose. Controlled translations cost the same as free translations, while having the advantage of enhancing the performance of the annotation transfer procedure.</Paragraph>
    <Paragraph position="4"> WordNet is an English lexical database, developed at Princeton University, in which nouns, verbs, adjectives, and adverbs are organized into sets of synonyms (synsets) and linked to each other by means of various lexical and semantic relationships. In recent years, within the NLP community, WordNet has become the reference lexicon for almost all tasks involving word sense disambiguation (see, for instance, the Senseval competition).</Paragraph>
    <Paragraph position="5"> Once the SemCor texts have been translated, the strategy for creating MultiSemCor consists of (i) automatically aligning Italian and English texts at the word level, and (ii) automatically transferring the word sense annotations from English to the aligned Italian words. The final result of the MultiSemCor project is not only an Italian corpus annotated with PoS, lemma and word sense, but also an aligned parallel corpus lexically annotated with a shared inventory of word senses. More specifically, the sense inventory used is MultiWordNet (Pianta et al., 2002), a multilingual lexical database in which the Italian component is strictly aligned with the English WordNet.</Paragraph>
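    <Paragraph position="6"> As an illustration only, step (ii) of the strategy above could be sketched as follows. This is a minimal sketch and not the MultiSemCor implementation: the Token layout, the function name transfer_senses, the representation of the alignment as index pairs, and the sense identifiers are all hypothetical, and the word alignment of step (i) is assumed to have been computed beforehand.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass
class Token:
    """One corpus token; all field names are illustrative assumptions."""
    form: str
    pos: Optional[str] = None
    lemma: Optional[str] = None
    sense: Optional[str] = None  # identifier in a shared sense inventory


def transfer_senses(
    source: List[Token],
    target: List[Token],
    alignment: List[Tuple[int, int]],
) -> List[Token]:
    """Copy sense tags from source tokens to their aligned target tokens.

    `alignment` holds (source_index, target_index) pairs, assumed to be
    produced beforehand by a word aligner (step i). Target tokens that are
    unaligned, or aligned to an untagged source token, keep sense=None.
    """
    annotated = [replace(tok) for tok in target]  # copies; input is not mutated
    for src_i, tgt_i in alignment:
        sense = source[src_i].sense
        if sense is not None:
            annotated[tgt_i].sense = sense
    return annotated


# Toy example; the sense identifiers are placeholders, not real synset IDs.
english = [
    Token("the", pos="DT"),
    Token("dog", pos="NN", lemma="dog", sense="SENSE-DOG-1"),
    Token("sleeps", pos="VBZ", lemma="sleep", sense="SENSE-SLEEP-1"),
]
italian = [Token("il"), Token("cane"), Token("dorme")]
links = [(0, 0), (1, 1), (2, 2)]

italian_tagged = transfer_senses(english, italian, links)
print([(tok.form, tok.sense) for tok in italian_tagged])
```

In this sketch a target word aligned to several tagged source words simply takes the sense of the last pair processed; how such cases should be resolved is a design choice that the sketch does not settle.</Paragraph>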
  </Section>
</Paper>