<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0804"> <Title>Experiments in Word Domain Disambiguation for Parallel Texts</Title> <Section position="5" start_page="27" end_page="29" type="metho"> <SectionTitle> 3 Word Domain Disambiguation </SectionTitle> <Paragraph position="0"> In this section we present two baseline algorithms for word domain disambiguation and propose some variants of them to deal with WDD in the context of parallel texts.</Paragraph> <Section position="1" start_page="27" end_page="28" type="sub_section"> <SectionTitle> 3.1 Baseline algorithms </SectionTitle> <Paragraph position="0"> To decide on a proper baseline for Word Domain Disambiguation we wanted to be sure that it was applicable to both languages (i.e. English and Italian) used in the experiment. This ruled out a selection based on domain frequency computed as a function of the frequency of the WORDNET senses, because we did not have frequency estimates for Italian senses.</Paragraph> <Paragraph position="1"> We adopted two alternative frequency measures, based respectively on the intra-text frequency and the intra-word frequency of a domain label. Both are computed with a two-stage disambiguation process, structurally similar to the algorithm used in [Voorhees, 1998].</Paragraph> <Paragraph position="2"> Baseline 1: Intra-text domain frequency.</Paragraph> <Paragraph position="3"> The baseline algorithm follows two steps. First, all the words in the text are considered and, for each domain label allowed by a word, that label's score is incremented by one. In the second step each word is reconsidered, and the domain label (or labels, depending on how many best solutions are requested) with the highest score is selected as the result of the disambiguation.</Paragraph> <Paragraph position="4"> Baseline 2: Intra-word domain frequency.</Paragraph> <Paragraph position="5"> In this version of the baseline algorithm, step 1 is modified in that the score of each domain label allowed by a word is incremented by the frequency of that label among the senses of the word. For instance, if &quot;book&quot; is the word (see Figure 1), PUBLISHING will receive .42 (i.e. three senses out of seven belong to PUBLISHING), while the other domain labels will receive .14 each.</Paragraph> <Paragraph position="6"> 3.1.1 The &quot;factotum&quot; effect As we mentioned in Section 2, the FACTOTUM label is used to mark WORDNET senses that do not belong to a specific domain, but rather are highly widespread across texts of different domains. A consequence is that very often, at the end of step 1 of the disambiguation algorithm, FACTOTUM outperforms the other domains, thereby affecting the selection carried out at step 2 (i.e. in case of ambiguity FACTOTUM is often preferred).</Paragraph> <Paragraph position="7"> For the purposes of the experiment described in the next sections, the FACTOTUM problem has been addressed with a slight modification at step 2 of the baseline algorithm: when FACTOTUM is the best selection for a word, the second-best choice is also considered a result of the disambiguation process.</Paragraph> </Section>
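To make the two-step procedure concrete, the following is a minimal Python sketch of both baselines together with the FACTOTUM adjustment. It is only an illustration of the description above: the lookup domains_of (lemma -> one domain label per WORDNET sense) and all other names are assumptions for this sketch, not code from the paper.

```python
from collections import Counter

FACTOTUM = "FACTOTUM"


def wdd_baseline(lemmas, domains_of, intra_word=False):
    """Two-stage word domain disambiguation (sketch).

    lemmas     -- noun lemmas of the text
    domains_of -- hypothetical lookup: lemma -> list of domain labels,
                  one label per WORDNET sense of that lemma
    intra_word -- False: Baseline 1 (each label allowed by a word scores +1)
                  True:  Baseline 2 (each label scores its frequency among the
                         senses of that word, e.g. 3/7 for PUBLISHING with "book")
    """
    # Step 1: accumulate domain-label scores over the whole text.
    scores = Counter()
    for lemma in lemmas:
        labels = domains_of(lemma)
        for label in set(labels):
            scores[label] += labels.count(label) / len(labels) if intra_word else 1

    # Step 2: reconsider each word and select its best-scoring admissible label(s).
    choice = {}
    for lemma in lemmas:
        ranked = sorted(set(domains_of(lemma)), key=lambda l: scores[l], reverse=True)
        if not ranked:
            continue
        selected = [ranked[0]]
        # FACTOTUM adjustment: if FACTOTUM wins, also keep the runner-up.
        if ranked[0] == FACTOTUM and len(ranked) > 1:
            selected.append(ranked[1])
        choice[lemma] = selected
    return choice
```

Calling wdd_baseline with intra_word=False reproduces the intra-text frequency baseline, while intra_word=True reproduces the intra-word variant; only the step 1 increments differ.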
<Section position="2" start_page="28" end_page="29" type="sub_section"> <SectionTitle> 3.2 Extensions for parallel texts </SectionTitle> <Paragraph position="0"> We started with the following working hypothesis. Using aligned wordnets to disambiguate parallel texts allows us to compute the intersection between the synsets accessible from an English text through the English WORDNET and the synsets accessible from the parallel Italian text through the Italian WORDNET. It seems reasonable that the synset intersection maximizes the number of significant synsets for the two texts and, at the same time, tends to exclude synsets whose meaning is not pertinent to the content of the texts.</Paragraph> <Paragraph position="1"> Let us make the point clearer with an example. Suppose we find in an English text the word &quot;bank&quot; and in the parallel Italian text the word &quot;banca&quot;, which we do not know to be the translation of &quot;bank&quot; because we have no word alignments. For &quot;bank&quot; we get ten senses from WORDNET 1.6 (reported in Figure 2), while for &quot;banca&quot; we get two senses from MULTIWORDNET (also reported in Figure 2). As the two wordnets are aligned (i.e. they share synset offsets), the intersection can be straightforwardly determined.</Paragraph>
Figure 2: The WORDNET 1.6 senses of &quot;bank&quot;.
1. {06227059} depository financial institution, bank, banking concern, banking company -- (a financial institution that accepts deposits and channels the money into lending activities)
2. {06800223} bank -- (sloping land (especially the slope beside a body of water))
3. {09626760} bank -- (a supply or stock held in reserve especially for future use (especially in emergencies))
4. {02247680} bank, bank building -- (a building in which commercial banking is transacted)
5. {06250735} bank -- (an arrangement of similar objects in a row or in tiers)
6. {03277560} savings bank, coin bank, money box, bank -- (a container (usually with a slot in the top) for keeping money at home)
7. {06739355} bank -- (a long ridge or pile; &quot;a huge bank of earth&quot;)
8. {09616845} bank -- (the funds held by a gambling house or the dealer in some gambling games)
9. {06800468} bank, cant, camber -- (a slope in the turn of a road or track)
10. {00109955} bank -- (a flight maneuver; aircraft tips laterally about its longitudinal axis)
<Paragraph position="2"> In this case the intersection includes 06227059, corresponding to bank#1 and banca#1, and 02247680, corresponding to bank#4 and banca#2, both of which pertain to the BANKING domain. It excludes, among others, bank#2, which happens to be a homonymous sense in English but not in Italian.</Paragraph> <Paragraph position="3"> Incidentally, if &quot;istituto di credito&quot; were not in the synset 06227059 (e.g. because of the incompleteness of the Italian WORDNET) and it were the only word in the Italian news item denoting the bank#1 sense, the synset intersection would have been empty.</Paragraph> <Paragraph position="4"> As far as disambiguation is concerned, it seems a reasonable hypothesis that the synset intersection constrains the sense selection for a word (i.e. it is highly probable that the correct choice belongs to the intersection). Following this line we have elaborated a mutual help disambiguation strategy in which the synset intersection is used to support the disambiguation of both the English and the Italian texts.</Paragraph>
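As a rough illustration of this strategy, here is a minimal Python sketch of how the synset intersection of two parallel texts could be computed from aligned wordnets. The lookups synsets_of_en and synsets_of_it (lemma -> set of synset offsets in WORDNET 1.6 and MULTIWORDNET, respectively) are hypothetical; the paper does not describe an implementation.

```python
def reachable_synsets(lemmas, synsets_of):
    """All synset offsets reachable from the lemmas of one text."""
    offsets = set()
    for lemma in lemmas:
        offsets |= synsets_of(lemma)
    return offsets


def synset_intersection(en_lemmas, it_lemmas, synsets_of_en, synsets_of_it):
    """Shared synsets of two parallel texts.

    Because the aligned wordnets share synset offsets, a plain set
    intersection is enough; in the "bank"/"banca" example it would
    contain the offsets 06227059 and 02247680, both pertaining to BANKING.
    """
    return (reachable_synsets(en_lemmas, synsets_of_en)
            & reachable_synsets(it_lemmas, synsets_of_it))
```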
<Paragraph position="5"> In addition to the synset intersection, we also wanted to consider the intersection of domain labels, that is, the domains that are shared among the senses of the parallel texts. In the example above the domain intersection would include just one label (i.e. BANKING), in place of the two synsets of the synset intersection. The hypothesis is that the domain intersection could reduce problems due to possible misalignments between the synsets of the two wordnets.</Paragraph> <Paragraph position="6"> Two mutual help algorithms have been implemented, weak mutual help and strong mutual help, which are described in the following.</Paragraph> <Paragraph position="7"> Weak mutual help. In this version of the mutual help algorithm, step 1 of the baseline is modified in that, if a domain label is found in the synset or domain intersection, a bonus is assigned to that label, doubling its score. In case of an empty intersection (i.e. no synset or domain is shared by the two texts), this algorithm guarantees the same performance as the baseline.</Paragraph> <Paragraph position="8"> Strong mutual help. In the strong version of the mutual help strategy, step 1 of the baseline is modified in that a domain label is scored if and only if it is found in the synset or domain intersection. While this algorithm does not guarantee the baseline performance (because the intersection may not contain all the correct synsets or domains), its precision will give us an indication of the quality of the synset intersection.</Paragraph> </Section> </Section>
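The following is a minimal sketch of how step 1 of the baseline might be modified under the two mutual help variants, shown with the unit increments of Baseline 1 (for Baseline 2 the increment would instead be the intra-word frequency of the label). Here shared_domains stands for the domain labels derived from the synset or domain intersection of the two parallel texts; the name and the interface are illustrative assumptions. Step 2, the selection with the FACTOTUM adjustment, is unchanged from the baseline.

```python
from collections import Counter


def mutual_help_step1(lemmas, domains_of, shared_domains, strong=False):
    """Step 1 scoring under the mutual help variants (sketch).

    shared_domains -- domain labels found in the synset or domain
                      intersection of the two parallel texts
    strong=False   -- weak mutual help: labels in the intersection get a
                      bonus that doubles their score; other labels score
                      as in the baseline, so an empty intersection
                      reduces to the baseline behaviour
    strong=True    -- strong mutual help: a label is scored if and only
                      if it belongs to the intersection
    """
    scores = Counter()
    for lemma in lemmas:
        for label in set(domains_of(lemma)):
            if label in shared_domains:
                scores[label] += 1 if strong else 2  # weak: doubled score
            elif not strong:
                scores[label] += 1  # weak: baseline score outside the intersection
            # strong: labels outside the intersection receive no score
    return scores
```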
<Section position="6" start_page="29" end_page="30" type="metho"> <SectionTitle> 4 Experimental Setting </SectionTitle> <Paragraph position="0"> The goal of the experiment is to establish some reference figures for Word Domain Disambiguation. Only nouns have been considered, mostly because the coverage of both MULTIWORDNET and the domain mapping for verbs is far from complete.</Paragraph> <Section position="1" start_page="29" end_page="30" type="sub_section"> <SectionTitle> 4.1 Lexical resources </SectionTitle> <Paragraph position="0"> Besides the English WORDNET 1.6 we used MULTIWORDNET [Artale et al., 1997; Magnini and Strapparava, 1997], an Italian version of the English WORDNET. It is based on the assumption that a large part of the conceptual relations defined for English can be shared with Italian. From an architectural point of view, MULTIWORDNET implements an extension of the WORDNET lexical matrix to a &quot;multilingual lexical matrix&quot; through the addition of a third dimension for the language. MULTIWORDNET currently includes about 30,000 lemmas.</Paragraph> <Paragraph position="1"> For comparison, in particular to estimate the lack of coverage of MULTIWORDNET, we consider some data from the Italian dictionary &quot;DISC&quot; [Sabatini and Coletti, 1997], a large monolingual dictionary available both in print and on CD-ROM.</Paragraph> <Paragraph position="2"> Table 1 shows some general figures (nouns only) for the number of lemmas, the number of senses and the average polysemy of the three lexical resources considered.</Paragraph> </Section> <Section position="2" start_page="30" end_page="30" type="sub_section"> <SectionTitle> 4.2 Parallel Texts </SectionTitle> <Paragraph position="0"> Experiments have been carried out on a news corpus kindly placed at our disposal by AdnKronos, an important Italian news provider. The corpus consists of 168 parallel news items (i.e. each item has both an Italian and an English version) concerning various topics (e.g. politics, economy, medicine, food, motors, fashion, culture, holidays). The average length of a news item is about 265 words.</Paragraph> <Paragraph position="1"> Table 2 reports the average lexical coverage (i.e. the percentage of lemmas found in the news corpus) for WORDNET 1.6, MULTIWORDNET and the DISC dictionary. The variance across the individual news items is practically zero. We observe full coverage for the DISC dictionary; moreover, the incompleteness of MULTIWORDNET is limited to 5% with respect to WORDNET 1.6. The table also reports the average number of unique synsets per news item. In this case the incompleteness of the Italian WORDNET with respect to WORDNET 1.6 rises to 30%, showing that a significant number of word senses are missing.</Paragraph> <Paragraph position="3"> Table 3 shows the average polysemy of the news corpus considering both word senses and word domain labels. The figures reveal a polysemy reduction of 17-18% when we move from sense polysemy to domain polysemy.</Paragraph> <Paragraph position="4"> Manual Annotation. A subset of forty news pairs (about half of the initial corpus) has been manually annotated with the correct domain label. Annotators were instructed about the domain hierarchy and then asked to select, for each lemma, one domain label among those allowed by that lemma.</Paragraph> <Paragraph position="5"> Uncertain cases have been reviewed by a second annotator and, in case of persistent conflict, a third annotator was consulted to make a decision.</Paragraph> <Paragraph position="6"> Lemmatization errors as well as cases of incomplete coverage of domain labels have been detected and excluded. The whole manual set consists of about 2500 annotated nouns.</Paragraph> <Paragraph position="7"> Although we do not have empirical evidence, our practical experience confirms the intuition that annotating texts with domain labels is an easier task than sense annotation.</Paragraph> <Paragraph position="8"> Forty-two domain labels, representing the more informative level of the domain hierarchy mentioned in Section 1, have been used for the experiment. Table 4 reports the complete list.</Paragraph> </Section> </Section> </Paper>