<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0802">
  <Title>The SENSEVAL-3 Multilingual English-Hindi Lexical Sample Task</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Open Mind Word Expert
</SectionTitle>
    <Paragraph position="0"> The annotated corpus required for this task was built using the Open Mind Word Expert system (Chklovski and Mihalcea, 2002), adapted for multilingual annotations.</Paragraph>
    <Paragraph position="1"> To overcome the current lack of tagged data and the limitations imposed by creating such data with trained lexicographers, the Open Mind Word Expert system enables the collection of semantically annotated corpora over the Web. Tagged examples are collected using a Web-based application that allows contributors to annotate words with their meanings. The tagging exercise proceeds as follows. For each target word, the system extracts a set of sentences from a large textual corpus. These examples are presented to the contributors, together with all possible translations of the given target word. Users are asked to select the most appropriate translation for the target word in each sentence. The selection is made using check-boxes that list all possible translations, plus two additional choices, &quot;unclear&quot; and &quot;none of the above.&quot; Although users are encouraged to select only one translation per word, selecting two or more translations is also possible. The classifications submitted by other users are not shown, to avoid artificial biases.</Paragraph>
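The per-instance choice set described above can be sketched as a simple data structure. This is an illustrative sketch only, not the actual OMWE implementation; the function and field names are assumptions.

```python
# Illustrative sketch of an OMWE-style tagging item (not the actual system).
# Each instance pairs a sentence with the full set of candidate Hindi
# translations, plus the two extra choices offered to contributors.

EXTRA_CHOICES = ["unclear", "none of the above"]

def make_item(sentence, target, translations):
    """Build one annotation item: the sentence, the target word, and
    all check-box options shown to a contributor."""
    return {
        "sentence": sentence,
        "target": target,
        "options": list(translations) + EXTRA_CHOICES,
    }

def record_answer(item, selected):
    """Validate a contributor's selection; one tag per word is encouraged,
    but selecting two or more is allowed, as in the task description."""
    if not selected:
        raise ValueError("at least one option must be selected")
    unknown = [s for s in selected if s not in item["options"]]
    if unknown:
        raise ValueError(f"not offered for this item: {unknown}")
    return {"sentence": item["sentence"], "tags": list(selected)}
```

Keeping other users' answers out of the item structure mirrors the design choice above: contributors never see prior classifications, so no artificial bias is introduced.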
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Sense Inventory Representation
</SectionTitle>
    <Paragraph position="0"> The sense inventory used in this task is the set of Hindi translations associated with the English words in our lexical sample. Selecting an appropriate English-Hindi dictionary was a major decision early in the task, and it raised a number of interesting issues. We were unable to locate any machine-readable or electronic versions of English-Hindi dictionaries, so it became apparent that we would need to enter the Hindi translations manually from printed materials. We briefly considered the use of Optical Character Recognition (OCR), but found that our available tools did not support Hindi. Even after deciding to enter the Hindi translations manually, it was not clear how those words should be encoded. Hindi is usually represented in Devanagari script, which has a large number of possible encodings, and no clear standard has yet emerged.</Paragraph>
    <Paragraph position="1"> SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Association for Computational Linguistics, Barcelona, Spain, July 2004. We decided that Romanized or transliterated Hindi text would be the most portable encoding, since it can be represented in standard ASCII text.</Paragraph>
    <Paragraph position="2"> However, it turned out that there are far fewer English-Hindi bilingual dictionaries than Hindi-English ones, and the number that use transliterated text is smaller still.</Paragraph>
    <Paragraph position="3"> Still, we located one promising candidate, the English-Hindi Hippocrene Dictionary (Raker and Shukla, 1996), which represents Hindi in a transliterated form. However, we found that many English words had only two or three translations, making it too coarse-grained for our purposes.</Paragraph>
    <Paragraph position="4"> In the end we selected the Chambers English-Hindi dictionary (Awasthi, 1997), a high-quality bilingual dictionary that uses Devanagari script. We identified 41 English words from the Chambers dictionary to make up our lexical sample. One of the task organizers, who is fluent in English and Hindi, then manually transliterated the approximately 500 Hindi translations of these 41 English words from the Chambers dictionary into the ITRANS format (http://www.aczone.com/itrans/). ITRANS software was used to generate Unicode for display in the OMWE interfaces, although the sense tags used in the task data are the Hindi translations in transliterated form.</Paragraph>
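One subtlety of ITRANS-style transliteration is that a single Devanagari character often maps to a multi-letter ASCII sequence (e.g. kh, aa), so a converter must prefer the longest match at each position. The toy scanner below illustrates only that tokenization step, with a deliberately reduced unit list; real ITRANS also handles inherent vowels, conjuncts, and many more sequences.

```python
# Toy longest-match scanner for ITRANS-style sequences (illustrative only;
# the unit list is a small subset, and real ITRANS conversion additionally
# handles inherent vowels, conjunct consonants, etc.).
ITRANS_UNITS = ["kh", "gh", "ch", "jh", "Th", "Dh", "th", "dh", "ph", "bh",
                "sh", "aa", "ii", "uu", "ai", "au",
                "k", "g", "c", "j", "T", "D", "t", "d", "p", "b",
                "n", "m", "y", "r", "l", "v", "s", "h",
                "a", "i", "u", "e", "o"]

def scan_itrans(text):
    """Split a transliterated word into ITRANS units, preferring the
    longest match at each position (so 'kh' is one unit, not 'k'+'h')."""
    units, i = [], 0
    while i < len(text):
        for u in sorted(ITRANS_UNITS, key=len, reverse=True):
            if text.startswith(u, i):
                units.append(u)
                i += len(u)
                break
        else:
            raise ValueError(f"unknown sequence at {text[i:]!r}")
    return units
```

For example, `scan_itrans("khaanaa")` yields four units rather than seven letters, which is the property a Unicode generator needs before it can emit one Devanagari character per unit.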
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Training and Test Data
</SectionTitle>
    <Paragraph position="0"> The multilingual lexical sample is made up of 41 words: 18 nouns, 15 verbs, and 8 adjectives. The sample includes English words with varying degrees of polysemy, as reflected in the number of possible Hindi translations, which ranges from a low of 3 to a high of 39.</Paragraph>
    <Paragraph position="1"> Text samples of several hundred instances for each of 31 of the 41 words were drawn from the British National Corpus (BNC), while samples for the other 10 words came from the SENSEVAL-2 English lexical sample data. The BNC data is in &quot;raw&quot; text form, with the part-of-speech tags removed. The SENSEVAL-2 data, however, includes the English sense tags as determined by human taggers.</Paragraph>
    <Paragraph position="2"> After gathering the instances for each word in the lexical sample, we tokenized each instance and removed those that contain collocations of the target word. For example, the training/test instances for arm.n do not include examples for contact arm, pickup arm, etc., but only examples that refer to arm as a single lexical unit (not part of a collocation). (We have made available transcriptions of the entries for approximately 70 Hippocrene nouns, verbs, and adjectives at http://www.d.umn.edu/~pura0010/hindi.html, although these were not used in this task.)</Paragraph>
    <Paragraph position="3"> In our experience, disambiguation accuracy on collocations of this sort is close to perfect, and we aimed to concentrate the annotation effort on the more difficult cases.</Paragraph>
    <Paragraph position="4"> The data was then annotated with Hindi translations by web volunteers using the Open Mind Word Expert (bilingual edition). At various points in time we offered gift certificates as a prize for the most productive tagger in a given day, in order to spur participation. A total of 40 volunteers contributed to this task.</Paragraph>
    <Paragraph position="5"> To create the test data we collected two independent tags per instance, and then discarded any instances where the taggers disagreed. Thus, each instance that remains in the test data has complete agreement between two taggers. For the training data, we only collected one tag per instance, and therefore this data may be noisy. Participating systems could choose to apply their own filtering methods to identify and remove the less reliably annotated examples.</Paragraph>
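The test-set agreement filter just described amounts to a one-line selection over paired tags. A minimal sketch, with hypothetical names:

```python
# Sketch of the test-set filtering step: keep only instances where the
# two independent annotators supplied the same tag.
def filter_by_agreement(tag_pairs):
    """tag_pairs maps instance id -> (tag from annotator 1, tag from
    annotator 2). Returns {id: agreed tag} for the instances retained
    in the test data; disagreements are discarded."""
    return {iid: a for iid, (a, b) in tag_pairs.items() if a == b}
```

The training data, with only one tag per instance, cannot be filtered this way, which is why it may be noisy and why systems were free to apply their own reliability filtering.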
    <Paragraph position="6"> After tagging by the Web volunteers, two data sets were provided to task participants: one where the English sense of the target word is unknown, and another where it is known in both the training and test data. These are referred to as the translation-only (t) data and the translation-and-sense (ts) data, respectively. The t data is made up of instances drawn from the BNC as described above, while the ts data is made up of the instances from SENSEVAL-2. Evaluations were run separately for each of these two data sets, which we refer to as the t and ts subtasks.</Paragraph>
    <Paragraph position="7"> The t data contains 31 ambiguous words: 15 nouns, 10 verbs, and 6 adjectives. The ts data contains 10 ambiguous words: 3 nouns, 5 verbs, and 2 adjectives, all of which were used in the English lexical sample task of SENSEVAL-2. These words, the number of possible translations, and the number of training and test instances are shown in the accompanying table. The total number of training instances across the two sub-tasks is 10,449, and the total number of test instances is 1,535.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Participating Systems
</SectionTitle>
    <Paragraph position="0"> Five teams participated in the t subtask, submitting a total of eight systems. Three teams (a subset of those five) participated in the ts subtask, submitting a total of five systems. All submitted systems employed supervised learning, using the training examples provided. Some teams used additional resources, as noted in the more detailed descriptions below.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 NUS
</SectionTitle>
      <Paragraph position="0"> The NUS team from the National University of Singapore participated in both the t and ts subtasks. The t system (nusmlst) uses a combination of knowledge sources as features, and the Support Vector Machine (SVM) learning algorithm. The knowledge sources used include part of speech of neighboring words, single words in the surrounding context, local collocations, and syntactic relations. The ts system (nusmlsts) does the same, but adds the English sense of the target word as a knowledge source.</Paragraph>
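The knowledge sources listed above can be pictured as a feature-extraction step that precedes the SVM learner. The sketch below is an assumption-laden illustration: the feature names, window sizes, and the omission of syntactic relations are choices made here for brevity, not details of the NUS systems.

```python
# Sketch of the kinds of features the NUS systems draw on (feature names
# and window sizes here are illustrative assumptions; syntactic-relation
# features and the SVM learner itself are omitted).
def extract_features(tokens, pos_tags, idx, english_sense=None):
    """Turn one instance into a feature dict: POS of neighbouring words,
    a bag of surrounding words, and a local collocation around position
    idx (the target word)."""
    feats = {}
    for off in (-2, -1, 1, 2):                  # POS of neighbouring words
        j = idx + off
        if 0 <= j < len(pos_tags):
            feats[f"pos{off:+d}"] = pos_tags[j]
    for w in tokens:                            # single words in context
        if w != tokens[idx]:
            feats[f"bag:{w.lower()}"] = 1
    if idx + 1 < len(tokens):                   # one local collocation
        feats["coll+1"] = tokens[idx + 1].lower()
    if english_sense is not None:               # ts variant (nusmlsts)
        feats["en_sense"] = english_sense
    return feats
```

The only difference between the t and ts variants in this sketch is the optional `english_sense` argument, mirroring how nusmlsts adds the English sense of the target word as an extra knowledge source.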
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 LIA-LIDILEM
</SectionTitle>
      <Paragraph position="0"> The LIA-LIDILEM team from the Université d'Avignon and the Université Stendhal Grenoble had two systems that participated in both the t and ts subtasks. In the ts subtask, only the English sense tags were used, not the Hindi translations.</Paragraph>
      <Paragraph position="1"> The FL-MIX system uses a combination of three probabilistic models, which compute the most probable sense given a six-word window of context. The three models are a Poisson model, a Semantic Classification Tree model, and a K-nearest-neighbors search model. This system also used a part-of-speech tagger and a lemmatizer.</Paragraph>
      <Paragraph position="2"> The FC-MIX system is the same as the FL-MIX system, but replaces context words with more general synonym-like classes computed from a word-aligned English-French corpus of approximately 850,000 words in each language.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 HKUST
</SectionTitle>
      <Paragraph position="0"> The HKUST team from the Hong Kong University of Science and Technology had three systems that participated in both the t and ts subtasks. The HKUST me t and HKUST me ts systems are maximum entropy classifiers. The HKUST comb t and HKUST comb ts systems are voted classifiers that combine a new Kernel PCA model with a maximum entropy model and a boosting-based model. The HKUST comb2 t and HKUST comb2 ts systems are voted classifiers that combine a new Kernel PCA model with a maximum entropy model, a boosting-based model, and a Naive Bayesian model.</Paragraph>
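A voted combination of classifiers, as used by the HKUST comb systems, reduces to a majority vote over the component models' predictions. A minimal sketch, with the component models as stand-in functions rather than actual Kernel PCA, maximum entropy, or boosting models:

```python
# Sketch of a simple voted combination of classifiers, in the spirit of
# the HKUST comb systems (the component models here are stand-in
# functions, not the actual Kernel PCA / maxent / boosting models).
from collections import Counter

def vote(instance, classifiers):
    """Each classifier maps an instance to a sense tag; return the tag
    chosen by the most classifiers (ties broken by first occurrence)."""
    tags = [clf(instance) for clf in classifiers]
    return Counter(tags).most_common(1)[0][0]
```

With three or four component models, the voter simply picks whichever tag the majority of them agree on for each test instance.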
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 UMD
</SectionTitle>
      <Paragraph position="0"> The UMD team from the University of Maryland entered one system (UMD-SST) in the t subtask. UMD-SST is a supervised sense tagger based on the Support Vector Machine learning algorithm, and is described more fully in (Cabezas et al., 2001).</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.5 Duluth
</SectionTitle>
      <Paragraph position="0"> The Duluth team from the University of Minnesota, Duluth had one system (Duluth-ELSS) that participated in the t subtask. This system is an ensemble of three bagged decision trees, each based on a different type of lexical feature. It was known as Duluth3 in SENSEVAL-2 and is described more fully in (Pedersen, 2001).</Paragraph>
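Bagging, as used by Duluth-ELSS, trains each ensemble member on a bootstrap sample of the training data and combines members by majority vote. The sketch below illustrates only that scheme: the actual system learns decision trees over lexical features, whereas the stand-in learner here simply memorises the most frequent tag in its sample.

```python
# Toy bagging sketch in the spirit of Duluth-ELSS. The actual system
# learns decision trees over lexical features; the "most frequent tag"
# learner below is only a stand-in to show the bootstrap-and-vote scheme.
import random
from collections import Counter

def train_bagged(examples, n_bags=3, seed=0):
    """examples: list of (features, tag) pairs. Each bag is a bootstrap
    sample (drawn with replacement, same size as the training set); the
    stand-in learner memorises the sample's most frequent tag."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_bags):
        sample = [rng.choice(examples) for _ in examples]
        models.append(Counter(tag for _, tag in sample).most_common(1)[0][0])
    return models

def predict(models, features):
    """Majority vote over the bagged models (each model here is a
    constant predictor, so the features go unused in this toy version)."""
    return Counter(models).most_common(1)[0][0]
```

In the real ensemble, each of the three bagged trees sees a different type of lexical feature, so the vote combines genuinely different views of the instance rather than three copies of the same learner.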
    </Section>
  </Section>
</Paper>