<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0826">
  <Title>UBBNBC WSD System Description</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Stemming
</SectionTitle>
    <Paragraph position="0"> The preprocessing of the corpora is one of the steps that most influences the results. It consists of the removal of suffixes and the elimination of irrelevant data. The removal of suffixes is performed through a simple dictionary-based method.</Paragraph>
    <Paragraph position="1"> For every word w to be stemmed, a set of candidate stems is selected from the dictionary containing the word stems. Then a similarity score is calculated between the word to be stemmed and each of the candidates; the score is set to 0 otherwise.</Paragraph>
    <Paragraph position="6"> The result is the candidate with the highest score, provided that this score is above a certain threshold; otherwise the word is left untouched.</Paragraph>
    <Paragraph position="7"> In the preprocessing phase we also remove the pronouns and prepositions from the examined context. This exclusion is based on a list of stop words.</Paragraph>
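The dictionary-based suffix removal described above can be sketched as follows. The paper's actual similarity formula did not survive extraction, so the prefix-overlap score, the threshold value, and the sample stems below are all assumptions made for illustration, not the authors' method.

```python
def stem(word, stem_dict, threshold=0.6):
    """Dictionary-based suffix stripping (sketch only).

    Assumed scoring: the fraction of the word covered by a candidate
    stem when the stem is a prefix of the word, and 0 otherwise.
    """
    best, best_score = None, 0.0
    for cand in stem_dict:
        # assumed similarity: shared-prefix coverage, 0 if not a prefix
        score = len(cand) / len(word) if word.startswith(cand) else 0.0
        if score > best_score:
            best, best_score = cand, score
    # the word is left untouched when no candidate clears the threshold
    return best if best_score >= threshold else word
```

For example, with the hypothetical stem dictionary `{"merg", "cas"}`, the Romanian form "mergem" would be reduced to "merg", while an out-of-dictionary word passes through unchanged.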
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Learning
</SectionTitle>
    <Paragraph position="0"> The training is conducted according to the NBC algorithm. First a database is built, with the following tables: words - contains all the words found in the corpora; its role is to assign an id to every word.</Paragraph>
    <Paragraph position="1"> wordsenses - contains all the tagged words in the corpora, linked with their possible senses; one entry per word-sense pair.</Paragraph>
    <Paragraph position="2"> nosenses - number of tagged contexts with a given sense.
nocontexts - number of tagged contexts of a given word.
occurrences - number of co-occurrences of a given word with a given sense.
Figure 1: The tables of the database.
The training of the system is nothing but filling up the tables of the database.
(SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Association for Computational Linguistics, Barcelona, Spain, July 2004.)</Paragraph>
    <Paragraph position="3"> scan corpora entries
  scan context words vi
    if (exists entry in occurrences where wordid=vi and senseid=s)
      increment its counter
    else
      insert a new entry
    step to next word
  endscan
  step to next entry
endscan</Paragraph>
    <Paragraph position="5"> As this shows, the database is filled up (so the system is trained) only from the training corpus provided for the SENSEVAL-3 Romanian Lexical Sample task.</Paragraph>
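The table-filling loop can be sketched with plain dictionaries standing in for the database tables. The toy corpus, target word, and sense ids below are invented for illustration; only the four counter tables mirror the paper's schema.

```python
from collections import defaultdict

# Hypothetical tagged training data: (target word, sense id, context words).
corpus = [
    ("canal", "canal.1", ["apa", "merge", "oras"]),
    ("canal", "canal.1", ["apa", "vapor"]),
    ("canal", "canal.2", ["televizor", "program"]),
]

# Dictionary counterparts of the paper's database tables.
nosenses = defaultdict(int)    # sense id -> number of tagged contexts
nocontexts = defaultdict(int)  # word -> number of tagged contexts
occurrences = defaultdict(int) # (context word, sense id) -> co-occurrence count
wordsenses = defaultdict(set)  # word -> set of observed sense ids

for word, sense, context in corpus:
    nosenses[sense] += 1
    nocontexts[word] += 1
    wordsenses[word].add(sense)
    for v in context:
        # the "if exists ... else insert" of the pseudocode collapses
        # to a single defaultdict increment
        occurrences[(v, sense)] += 1
```

With a real database, each increment would be an UPDATE-or-INSERT on the corresponding table; the counting logic is identical.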
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Disambiguation
</SectionTitle>
    <Paragraph position="0"> The basic assumption of the Naive Bayes method is that the contextual features are not dependent on each other. In this particular case, we assume that the probability of co-occurrence of a word vi with the ambiguous word w of sense s does not depend on the other co-occurrences.</Paragraph>
    <Paragraph position="3"> The goal is to find the correct sense s' of the word w for a given context c; s' is the sense that maximizes P(s | c), which is proportional to P(c | s)P(s).</Paragraph>
    <Paragraph position="5"> At this point we make the simplifying "naive" assumption that P(c | s) factorizes into the product of the individual probabilities P(vi | s) over the context words vi. The prior P(s) is estimated from the counts in nosenses.</Paragraph>
    <Paragraph position="6"> The wordsenses table is used to determine the possible senses of a given word.</Paragraph>
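The sense selection can be sketched as below, scoring each candidate sense in log space. The toy counts mirror the training tables, and the add-one smoothing and vocabulary size are our assumptions; the extracted text does not say how (or whether) the system smooths unseen co-occurrences.

```python
import math

# Toy counts standing in for the database tables (illustrative values only).
nosenses = {"canal.1": 2, "canal.2": 1}           # tagged contexts per sense
nocontexts = {"canal": 3}                          # tagged contexts per word
occurrences = {("apa", "canal.1"): 2, ("vapor", "canal.1"): 1,
               ("televizor", "canal.2"): 1, ("program", "canal.2"): 1}
wordsenses = {"canal": ["canal.1", "canal.2"]}
VOCAB = 5  # assumed vocabulary size for add-one smoothing

def disambiguate(word, context):
    """Pick the sense s' maximizing log P(s) + sum_i log P(vi | s)."""
    best, best_score = None, float("-inf")
    for s in wordsenses[word]:
        # prior P(s) estimated from nosenses and nocontexts
        score = math.log(nosenses[s] / nocontexts[word])
        for v in context:
            # add-one smoothing is our addition, not the paper's
            score += math.log((occurrences.get((v, s), 0) + 1)
                              / (nosenses[s] + VOCAB))
        if score > best_score:
            best, best_score = s, score
    return best
```

Under these toy counts, a water-related context selects the first sense and a television-related context selects the second.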
  </Section>
</Paper>