<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2111">
  <Title>Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity</Title>
  <Section position="5" start_page="866" end_page="866" type="metho">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> Lin et al. (2003) try to tackle the problem of identifying synonyms among distributionally related words in two ways: Firstly, by looking at the overlap in translations of semantically similar words in multiple bilingual dictionaries. Secondly, by looking at patterns speci cally designed to lter out antonyms. They evaluate on a set of 80 synonyms and 80 antonyms from a thesaurus.</Paragraph>
    <Paragraph position="1"> Wu and Zhou's (2003) paper is most closely related to our study. They report an experiment on synonym extraction using bilingual resources (an English-Chinese dictionary and corpus) as well as monolingual resources (an English dictionary and corpus). Their monolingual corpus-based approach is very similar to our monolingual corpus-based approach. The bilingual approach is different from ours in several aspects. Firstly, they do not take the corpus as the starting point to retrieve word alignments, they use the bilingual dictionary to retrieve multiple translations for each target word. The corpus is only employed to assign probabilities to the translations found in the dictionary. Secondly, the authors use a parallel corpus that is bilingual whereas we use a multi-lingual corpus containing 11 languages in total.</Paragraph>
    <Paragraph position="2"> The authors show that the bilingual method out-performs the monolingual methods. However a combination of different methods leads to the best performance.</Paragraph>
  </Section>
  <Section position="6" start_page="866" end_page="867" type="metho">
    <SectionTitle>
3 Methodology
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="866" end_page="867" type="sub_section">
      <SectionTitle>
3.1 Measuring Distributional Similarity
</SectionTitle>
      <Paragraph position="0"> An increasingly popular method for acquiring semantically similar words is to extract distributionally similar words from large corpora. The underlying assumption of this approach is that semantically similar words are used in similar contexts.</Paragraph>
      <Paragraph position="1"> The contexts a given word is found in, be it a syntactic context or an alignment context, are used as the features in the vector for the given word, the so-called context vector. The vector contains frequency counts for each feature, i.e., the multiple contexts the word is found in.</Paragraph>
      <Paragraph position="2"> Context vectors are compared with each other in order to calculate the distributional similarity between words. Several measures have been proposed. Curran and Moens (2002) report on a large-scale evaluation experiment, where they evaluated the performance of various commonly used methods. Van der Plas and Bouma (2005) present a similar experiment for Dutch, in which they tested most of the best performing measures according to Curran and Moens (2002). Pointwise Mutual Information (I) and Dicey performed best in their experiments. Dice is a well-known combinatorial measure that computes the ratio between the size of the intersection of two feature sets and the sum of the sizes of the individual feature sets. Dicey is a measure that incorporates weighted frequency counts.</Paragraph>
      <Paragraph position="4"> ,where f is the feature W1 and W2 are the two words that are being compared, and I is a weight assigned to the frequency counts.</Paragraph>
    </Section>
    <Section position="2" start_page="867" end_page="867" type="sub_section">
      <SectionTitle>
3.2 Weighting
</SectionTitle>
      <Paragraph position="0"> We will now explain why we use weighted frequencies and which formula we use for weighting.</Paragraph>
      <Paragraph position="1"> The information value of a cell in a word vector (which lists how often a word occurred in a speci c context) is not equal for all cells. We will explain this using an example from mono-lingual syntax-based distributional similarity. A large number of nouns can occur as the subject of the verb have, for instance, whereas only a few nouns may occur as the object of squeeze. Intuitively, the fact that two nouns both occur as sub-ject of have tells us less about their semantic similarity than the fact that two nouns both occur as object of squeeze. To account for this intuition, the frequency of occurrence in a vector can be replaced by a weighted score. The weighted score is an indication of the amount of information carried by that particular combination of a noun and its feature.</Paragraph>
      <Paragraph position="2"> We believe that this type of weighting is bene cial for calculating similarity between word alignment vectors as well. Word alignments that are shared by many different words are most probably mismatches.</Paragraph>
      <Paragraph position="3"> For this experiment we used Pointwise Mutual Information (I) (Church and Hanks, 1989).</Paragraph>
      <Paragraph position="4"> I(W, f) = log P(W, f)P(W)P(f) ,where W is the target word P(W) is the probability of seeing the word P(f) is the probability of seeing the feature P(W,f) is the probability of seeing the word and the feature together.</Paragraph>
    </Section>
    <Section position="3" start_page="867" end_page="867" type="sub_section">
      <SectionTitle>
3.3 Word Alignment
</SectionTitle>
      <Paragraph position="0"> The multilingual approach we are proposing relies on automatic word alignment of parallel corpora from Dutch to one or more target languages. This alignment is the basic input for the extraction of the alignment context as described in section 5.2.2.</Paragraph>
      <Paragraph position="1"> The alignment context is then used for measuring distributional similarity as introduced above.</Paragraph>
      <Paragraph position="2"> For the word alignment, we apply standard techniques derived from statistical machine translation using the well-known IBM alignment models (Brown et al., 1993) implemented in the open-source tool GIZA++ (Och, 2003). These models can be used to nd links between words in a source language and a target language given sentence aligned parallel corpora. We applied standard settings of the GIZA++ system without any optimisation for our particular input. We also used plain text only, i.e. we did not apply further pre-processing except tokenisation and sentence splitting. Additional linguistic processing such as lemmatisation and multi-word unit detection might help to improve the alignment but this is not part of the present study.</Paragraph>
      <Paragraph position="3"> The alignment models produced are asymmetric and several heuristics exist to combine directional word alignments to improve alignment accuracy. We believe, that precision is more crucial than recall in our approach and, therefore, we apply a very strict heuristics namely we compute the intersection of word-to-word links retrieved by GIZA++. As a result we obtain partially word-aligned parallel corpora from which translational context vectors are built (see section 5.2.2). Note, that the intersection heuristics allows one-to-one word links only. This is reasonable for the Dutch part as we are only interested in single words and their synonyms. However, the distributional context of these words de ned by their alignments is strongly in uenced by this heuristics. Problems caused by this procedure will be discussed in detail in section 7 of our experiments.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="867" end_page="868" type="metho">
    <SectionTitle>
4 Evaluation Framework
</SectionTitle>
    <Paragraph position="0"> In the following, we describe the data used and measures applied.</Paragraph>
    <Paragraph position="1"> The evaluation method that is most suitable for testing with multiple settings is one that uses an available resource for synonyms as a gold standard. In our experiments we apply automatic evaluation using an existing hand-crafted synonym database, Dutch EuroWordnet (EWN, Vossen (1998)).</Paragraph>
    <Paragraph position="2"> In EWN, one synset consists of several synonyms which represent a single sense. Polysemous words occur in several synsets. We have combined for each target word the EWN synsets in which it occurs. Hence, our gold standard consists of a list of all nouns found in EWN and their corresponding synonyms extracted by taking the union of all synsets for each word. Precision is then calculated as the percentage of candidate synonyms that are truly synonyms according to our gold standard. Recall is the percentage of the synonyms according to EWN that are indeed found by the system. We have extracted randomly from all synsets in EWN 1000 words with a frequency  above 4 for which the systems under comparison produce output.</Paragraph>
    <Paragraph position="3"> The drawback of using such a resource is that coverage is often a problem. Not all words that our system proposes as synonyms can be found in Dutch EWN. Words that are not found in EWN are discarded.2. Moreover, EWN's synsets are not exhaustive. After looking at the output of our best performing system we were under the impression that many correct synonyms selected by our system were classi ed as incorrect by EWN. For this reason we decided to run a human evaluation over a sample of 100 candidate synonyms classi ed as incorrect by EWN.</Paragraph>
  </Section>
  <Section position="8" start_page="868" end_page="869" type="metho">
    <SectionTitle>
5 Experimental Setup
</SectionTitle>
    <Paragraph position="0"> In this section we will describe results from the two synonym extraction approaches based on distributional similarity: one using syntactic context and one using translational context based on word alignment and the combination of both. For both approaches, we used a cutoff n for each row in our word-by-context matrix. A word is discarded if the row marginal is less than n. This means that each word should be found in any context at least n times else it will be discarded. We refer to this by the term minimum row frequency. The cutoff is used to make the feature space manageable and to reduce noise in the data. 3</Paragraph>
    <Section position="1" start_page="868" end_page="868" type="sub_section">
      <SectionTitle>
5.1 Distributional Similarity Based on
Syntactic Relations
</SectionTitle>
      <Paragraph position="0"> This section contains the description of the synonym extraction approach based on distributional similarity and syntactic relations. Feature vectors for this approach are constructed from syntactically parsed monolingual corpora. Below we describe the data and resources used, the nature of the context applied and the results of the synonym extraction task.</Paragraph>
      <Paragraph position="1">  As our data we used the Dutch CLEF QA corpus, which consists of 78 million words of Dutch  alignment-based method, the syntax-based method and the combination independently by using a development set of 1000 words that has no overlap with the test set used in evaluation. The minimum row frequency was set to 2 for all alignment-based methods. It was set to 46 for the syntax-based method and the combination of the two methods. subject-verb cat eat verb-object feed cat adjective-noun black cat coordination cat dog apposition cat Garfield prep. complement go+to work  (types) per dependency relation with frequency &gt; 1.</Paragraph>
      <Paragraph position="2"> newspaper text (Algemeen Dagblad and NRC  Handelsblad 1994/1995). The corpus was parsed automatically using the Alpino parser (van der Beek et al., 2002; Malouf and van Noord, 2004). The result of parsing a sentence is a dependency graph according to the guidelines of the Corpus of Spoken Dutch (Moortgat et al., 2000).</Paragraph>
      <Paragraph position="3">  We have used several grammatical relations: subect, object, adjective, coordination, apposition and prepositional complement. Examples are given in table 1. Details on the extraction can be found in van der Plas and Bouma (2005). The number of pairs (types) consisting of a word and a syntactic relation found are given in table 2. We have discarded pairs that occur less than 2 times.</Paragraph>
    </Section>
    <Section position="2" start_page="868" end_page="869" type="sub_section">
      <SectionTitle>
5.2 Distributional Similarity Based on Word
Alignment
</SectionTitle>
      <Paragraph position="0"> The alignment approach to synonym extraction is based on automatic word alignment. Context vectors are built from the alignments found in a parallel corpus. Each aligned word type is a feature in the vector of the target word under consideration.</Paragraph>
      <Paragraph position="1"> The alignment frequencies are used for weighting the features and for applying the frequency cutoff.</Paragraph>
      <Paragraph position="2"> In the following section we describe the data and resources used in our experiments and nally the results of this approach.</Paragraph>
      <Paragraph position="3">  Measures of distributional similarity usually require large amounts of data. For the alignment method we need a parallel corpus of reasonable size with Dutch either as source or as target language. Furthermore, we would like to experiment with various languages aligned to Dutch. The freely available Europarl corpus (Koehn, 2003) includes 11 languages in parallel, it is sentence aligned, and it is of reasonable size. Thus, for acquiring Dutch synonyms we have 10 language pairs with Dutch as the source language. The Dutch part includes about 29 million tokens in about 1.2 million sentences. The entire corpus is sentence aligned (Tiedemann and Nygaard, 2004) which is a requirement for the automatic word alignment described below.</Paragraph>
      <Paragraph position="4">  Context vectors are populated with the links to words in other languages extracted from automatic word alignment. We applied GIZA++ and the intersection heuristics as explained in section . From the word aligned corpora we extracted word type links, pairs of source and target words with their alignment frequency attached. Each aligned target word type is a feature in the (translational) context of the source word under consideration.</Paragraph>
      <Paragraph position="5"> Note that we rely entirely on automatic processing of our data. Thus, results from the automatic word alignments include errors and their precision and recall is very different for the various language pairs. However, we did not assess the quality of the alignment itself which would be beyond the scope of this paper.</Paragraph>
      <Paragraph position="6"> As mentioned earlier, we did not include any linguistic pre-processing prior to the word alignment. However, we post-processed the alignment results in various ways. We applied a simple lemmatizer to the list of bilingual word type links in order to 1) reduce data sparseness, and 2) to facilitate our evaluation based on comparing our results to existing synonym databases. For this we used two resources: CELEX a linguistically annotated dictionary of English, Dutch and German (Baayen et al., 1993), and the Dutch snowball stemmer implementing a suf x stripping algorithm based on the Porter stemmer. Note that lemmatization is only done for Dutch. Furthermore, we removed word type links that include non-alphabetic characters to focus our investigations on 'real words'. In order to reduce alignment noise, we also applied a frequency threshold to remove alignments that occur only once. Finally, we restricted our study to Dutch nouns. Hence, we extracted word type links for all words tagged as noun in CELEX. We also included words which are not found at all in CELEX assuming that most of them will be productive noun constructions.</Paragraph>
      <Paragraph position="7"> From the remaining word type links we populated the context vectors as described earlier. Table 3 shows the number of context elements extracted in this manner for each language pair considered from the Europarl corpus4 #word-transl. pairs #word-transl. pairs</Paragraph>
      <Paragraph position="9"/>
    </Section>
  </Section>
  <Section position="9" start_page="869" end_page="870" type="metho">
    <SectionTitle>
6 Results and Discussion
</SectionTitle>
    <Paragraph position="0"> Table 4 shows the precision recall en F-score for the different methods. The rst 10 rows refer to the results for all language pairs individually.</Paragraph>
    <Paragraph position="1"> The 11th row corresponds to the setting in which all alignments for all languages are combined.</Paragraph>
    <Paragraph position="2"> The penultimate row shows results for the syntax-based method and the last row the combination of the syntax-based and alignment-based method.</Paragraph>
    <Paragraph position="3"> Judging from the precision, recall and F-score in table 4 Swedish is the best performing language for Dutch synonym extraction from parallel corpora. It seems that languages that are similar to the target language, for example in word order, are good candidates for nding synonyms at high precision rates. Also the fact that Dutch and Swedish both have one-word compounds avoids mistakes that are often found with the other languages. However, judging from recall (and Fscore) French is not a bad candidate either. It is possible that languages that are lexically different from the target language provide more synonyms.</Paragraph>
    <Paragraph position="4"> The fact that Finnish and Greek do not gain high scores might be due to the fact that there are only a limited amount of translational contexts (with a frequency &gt; 1) available for these language (as is shown in table 3). The reasons are twofold.</Paragraph>
    <Paragraph position="5">  Firstly, for Greek and Finnish the Europarl corpus contains less data. Secondly, the fact that Finnish is a language that has a lot of cases for nouns, might lead to data sparseness and worse accuracy in word alignment.</Paragraph>
    <Paragraph position="6"> The results in table 4 also show the difference in performance between the multilingual alignmentmethod and the syntax-based method. The mono-lingual alignment-based method outperforms the syntax-based method by far. The syntax-based method that does not rely on scarce multilingual resources is more portable and also in this experiment it makes use of more data. However, the low precision scores of this method are not convincing. Combining both methods does not result in better performance for nding synonyms. This is in contrast with the results reported by Wu and Zhou (2003). This might well be due to the more sophisticated method they use for combining different methods, which is a weighted combination.</Paragraph>
    <Paragraph position="7"> The precision scores are in line with the scores reported by Wu and Zhou (2003) in a similar experiment discussed under related work. The recall we attain however is more than three times higher. These differences can be due to differences between their approach such as starting from a bilingual dictionary for acquiring the translational context versus using automatic word alignments from a large multilingual corpus directly. Furthermore, the different evaluation methods used make comparison between the two approaches dif cult.</Paragraph>
    <Paragraph position="8"> They use a combination of the English Word-Net (Fellbaum, 1998) and Roget thesaurus (Roget, 1911) as a gold standard in their evaluations. It is obvious that a combination of these resources leads to larger sets of synonyms. This could explain the relatively low recall scores. It does however not explain the similar precision scores.</Paragraph>
    <Paragraph position="9"> We conducted a human evaluation on a sample of 100 candidate synonyms proposed by our best performing system that were classi ed as incorrect by EWN. Ten evaluators (authors excluded) were asked to classify the pairs of words as synonyms or non-synonyms using a web form of the format yes/no/don't know. For 10 out of the 100 pairs all ten evaluators agreed that these were synonyms. For 37 of the 100 pairs more than half of the evaluators agreed that these were synonyms.</Paragraph>
    <Paragraph position="10"> We can conclude from this that the scores provided in our evaluations based on EWN (table 4) are too pessimistic. We believe that the actual precision scores lie 10 to 37 % higher than the 22.5 % reported in table 4. Over and above, this indicates that we are able to extract automatically synonyms that are not yet covered by available resources.</Paragraph>
  </Section>
  <Section position="10" start_page="870" end_page="871" type="metho">
    <SectionTitle>
7 Error Analysis
</SectionTitle>
    <Paragraph position="0"> In table 5 some example output is given for the method combining word alignments of all 10 foreign languages as opposed to the monolingual syntax-based method. These examples illustrate the general patterns that we discovered by looking into the results for the different methods.</Paragraph>
    <Paragraph position="1"> The rst two examples show that the syntax-</Paragraph>
  </Section>
  <Section position="11" start_page="871" end_page="871" type="metho">
    <SectionTitle>
ALIGN(ALL) SYNTAX
</SectionTitle>
    <Paragraph position="0"> consensus eensgezindheid evenwicht consensus consensus equilibrium herfst najaar winter autumn autumn winter eind einde begin end end beginning armoede armoedebestrijding werkloosheid poverty poverty reduction unemployment alcohol alcoholgebruik drank alcohol alcohol consumption liquor bes charme perzik berry charm peach de nitie de nie criterium de nition de ne+incor.stemm. criterion verlamming lam verstoring paralysis paralysed disturbance  and their translations in italics based method often nds semantically related words whereas the alignment-based method nds synonyms. The reasons for this are quite obvious. Synonyms are likely to receive identical translations, words that are only semantically related are not. A translator would not often translate auto (car) with vrachtwagen (truck). However, the two words are likely to show up in identical syntactic relations, such as being the object of drive or appearing in coordination with motorcycle. Another observation that we made is that the syntax-based method often nds antonyms such as begin (beginning) for the word einde (end). Explanations for this are in line with what we said about the semantically related words: Synonyms are likely to receive identical translations, antonyms are not but they do appear in similar syntactic contexts. null Compounds pose a problem for the alignmentmethod. We have chosen intersection as alignment method. It is well-known that this method cannot cope very well with the alignment of compounds because it only allows one-to-one word links. Dutch uses many one-word compounds that should be linked to multi-word counterparts in other languages. However, using intersection we obtain only partially correct alignments and this causes many mistakes in the distributional similarity algorithm. We have given some examples in rows 4 and 5 of table 5.</Paragraph>
    <Paragraph position="1"> We have used the distributional similarity score only for ranking the candidate synonyms. In some cases it seems that we should have used it to set a threshold such as in the case of berry and charm. These two words share one translational context : the article el in Spanish. The distributional similarity score in such cases is often very low. We could have ltered some of these mistakes by setting a threshold.</Paragraph>
    <Paragraph position="2"> One last observation is that the alignment-based method suffers from incorrect stemming and the lack of suf cient part-of-speech information. We have removed all context vectors that were built for a word that was registered in CELEX with a PoS-tag different from 'noun'. But some words are not found in CELEX and although they are not of the word type 'noun' their context vectors remain in our data. They are stemmed using the snowball stemmer. The candidate synonym de nie is a corrupted verbform that is not found in CELEX. Lam is ambiguous between the noun reading that can be translated in English with lamb and the adjective lam which can be translated with paralysed. This adjective is related to the word verlamming (paralysis), but would have been removed if the word was correctly PoS-tagged.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML