<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1016"> <Title>Synonymous Collocation Extraction Using Translation Information</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Our Approach </SectionTitle> <Paragraph position="0"> Our method for synonymous collocation extraction comprises of three steps: (1) extract collocations from large monolingual corpora; (2) generate candidates of synonymous collocation pairs with a word thesaurus WordNet; (3) select synonymous collocation candidates using their translations.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Collocation Extraction </SectionTitle> <Paragraph position="0"> This section describes how to extract English collocations. Since Chinese collocations will be used to train the language model in Section 2.3, they are also extracted in the same way.</Paragraph> <Paragraph position="1"> Collocations in this paper take some syntactical relations (dependency relations), such as <verb, OBJ, noun>, <noun, ATTR, adj>, and <verb, MOD, adv>. These dependency triples, which embody the syntactic relationship between words in a sentence, are generated with a parser--we use NLPWIN in this paper1. For example, the sentence &quot;She owned this red coat&quot; is transformed to the following four triples after parsing: <own, SUBJ, she>, <own, OBJ, coat>, <coat, DET, this>, and <coat, ATTR, red>.</Paragraph> <Paragraph position="2"> These triples are generally represented in the form of <Head, Relation Type, Modifier>.</Paragraph> <Paragraph position="3"> The measure we use to extract collocations from the parsed triples is weighted mutual information (WMI) (Fung and Mckeown, 1997), as described as</Paragraph> <Paragraph position="5"/> <Paragraph position="7"> Those triples whose WMI values are larger than a given threshold are taken as collocations. We do not use the point-wise mutual information because it tends to overestimate the association between two words with low frequencies. Weighted mutual information meliorates this effect by adding ),,( 21 wrwp .</Paragraph> <Paragraph position="8"> For expository purposes, we will only look into three kinds of collocations for synonymous collocation extraction: <verb, OBJ, noun>, <noun, ATTR, adj> and <verb, MOD, adv>.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Class #Type #Token </SectionTitle> <Paragraph position="0"> verb, OBJ, noun 1,579,783 19,168,229 noun, ATTR, adj 311,560 5,383,200 verb, Mod, adv 546,054 9,467,103 The English collocations are extracted from Wall Street Journal (1987-1992) and Association Press (1988-1990), and the Chinese collocations are 1 The NLPWIN parser is developed at Microsoft Research, which parses several languages including Chinese and English. Its output can be a phrase structure parse tree or a logical form which is represented with dependency triples.</Paragraph> <Paragraph position="1"> extracted from People's Daily (1980-1998). The statistics of the extracted collocations are shown in Table 1 and 2. The thresholds are set as 5 for both English and Chinese. 
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Candidate Generation </SectionTitle>

<Paragraph position="0"> Candidate generation is based on the following assumption: for a collocation <Head, Relation Type, Modifier>, its synonymous expressions also take the form <Head, Relation Type, Modifier>, although they may sometimes be a single word or a sentence pattern instead.</Paragraph>

<Paragraph position="1"> The synonymous candidates of a collocation are obtained by expanding the collocation <Head, Relation Type, Modifier> with the synonyms of Head and Modifier. The synonyms of a word are obtained from WordNet 1.6. In WordNet, a synset consists of several synonyms that represent a single sense, so polysemous words occur in more than one synset. The synonyms of a given word are collected from all the synsets that include it. For example, "turn on" is polysemous and is included in several synsets. For the sense "cause to operate by flipping a switch", "switch on" is one of its synonyms; for the sense "be contingent on", "depend on" is one of its synonyms. We take both as synonyms of "turn on" regardless of meaning, since we do not have sense tags for the words in collocations.</Paragraph>

<Paragraph position="2"> Let C_w denote the synonym set of a word w and U the English collocation set generated in Section 2.1. The detailed algorithm for generating candidates of synonymous collocation pairs is described in Figure 1.

Figure 1. Candidate generation algorithm:
(1) For each collocation Col_i = <Head, R, Modifier> ∈ U, do the following:
  a. Use the synonyms in WordNet 1.6 to expand Head and Modifier, obtaining their synonym sets C_Head and C_Modifier.
  b. Generate the candidate set of its synonymous collocations S_i = {<w_1, R, w_2> | w_1 ∈ {Head} ∪ C_Head, w_2 ∈ {Modifier} ∪ C_Modifier, <w_1, R, w_2> ∈ U, <w_1, R, w_2> ≠ Col_i}.
(2) Generate the candidate set of synonymous collocation pairs SC = {(Col_i, Col_j) | Col_j ∈ S_i}.

For example, given the collocation <turn on, OBJ, light>, we expand "turn on" to "switch on" and "depend on", and expand "light" to "lamp" and "illumination". With these synonyms and the relation type OBJ, we generate the synonymous collocation candidates of <turn on, OBJ, light>: <switch on, OBJ, light>, <turn on, OBJ, lamp>, <depend on, OBJ, illumination>, <depend on, OBJ, light>, etc. Both these candidates and the original collocation <turn on, OBJ, light> are used to generate the synonymous collocation pairs.</Paragraph>

<Paragraph position="3"> With the above method, we obtain candidates of synonymous collocation pairs. For example, <switch on, OBJ, light> and <turn on, OBJ, light> are a synonymous collocation pair. However, the method also produces wrong candidates: for example, <depend on, OBJ, illumination> and <turn on, OBJ, light> are not a synonymous pair. It is therefore important to filter out such inappropriate candidates.</Paragraph> </Section>
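A minimal sketch of this generation step, assuming NLTK's WordNet interface (NLTK ships a newer WordNet than the 1.6 release used in the paper, so the synonym sets will differ); all names are illustrative.

```python
from itertools import product
from nltk.corpus import wordnet as wn

def synonyms(word):
    """All synonyms of `word` across all of its synsets (no sense tagging)."""
    syns = set()
    for synset in wn.synsets(word.replace(" ", "_")):
        for lemma in synset.lemmas():
            syns.add(lemma.name().replace("_", " "))
    syns.discard(word)
    return syns

def candidate_pairs(U):
    """Figure 1: pair each collocation in the set U of <head, rel, modifier>
    triples with every expanded variant that is itself in U."""
    U = set(U)
    pairs = set()
    for head, rel, mod in U:
        for w1, w2 in product({head} | synonyms(head), {mod} | synonyms(mod)):
            cand = (w1, rel, w2)
            if cand in U and cand != (head, rel, mod):
                pairs.add(((head, rel, mod), cand))
    return pairs
```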
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Candidate Selection </SectionTitle>

<Paragraph position="0"> In synonymous word extraction, the similarity of two words can be estimated from the similarity of their contexts. This method, however, cannot be effectively extended to collocation similarity estimation. For example, in the sentences "They turned on the lights" and "They depend on the illumination", the meanings of the collocations <turn on, OBJ, light> and <depend on, OBJ, illumination> are different although their contexts are the same.</Paragraph>

<Paragraph position="1"> Therefore, monolingual information is not enough to estimate the similarity of two collocations.</Paragraph>

<Paragraph position="2"> However, the meanings of the above two collocations can be distinguished if they are translated into a second language (e.g., Chinese). For example, <turn on, OBJ, light> is translated into <开, OBJ, 灯> (<kai1, OBJ, deng1>) and <打开, OBJ, 灯> (<da3 kai1, OBJ, deng1>) in Chinese, while <depend on, OBJ, illumination> is translated into <取决于, OBJ, 光照度> (<qu3 jue2 yu2, OBJ, guang1 zhao4 du4>). Thus, they are not a synonymous pair, because their translations are completely different.</Paragraph>

<Paragraph position="3"> In this paper, we select synonymous collocation pairs from the candidates in the following way. First, given a candidate pair generated in Section 2.2, we translate the two collocations into Chinese with a simple statistical translation model. Second, we calculate the similarity of the two collocations using feature vectors constructed from their translations. A candidate is selected as a synonymous collocation pair if its similarity exceeds a certain threshold.</Paragraph>

<Paragraph position="4"> For an English collocation e_col = <e_1, r_e, e_2>, we translate it into Chinese collocations² using an English-Chinese dictionary. If the translation sets of e_1 and e_2 are represented as CS_1 and CS_2 respectively, the Chinese translations can be represented as S = {<c_1, r_c, c_2> | c_1 ∈ CS_1, c_2 ∈ CS_2, r_c ∈ R}, with R denoting the relation-type set.</Paragraph>

<Paragraph position="5"> ² Some English collocations can be translated into Chinese words, phrases, or patterns. Here we only consider the case of translation into collocations.</Paragraph>

<Paragraph position="6"> Given an English collocation e_col = <e_1, r_e, e_2> and one of its Chinese translation candidates c_col = <c_1, r_c, c_2> ∈ S, the probability that e_col is translated into c_col is calculated with Bayes' rule as in Equation (1):

$p(c_{col}|e_{col}) = \frac{p(e_{col}|c_{col})\,p(c_{col})}{p(e_{col})}$  (1)

According to Equation (1), we need to calculate the translation probability $p(e_{col}|c_{col})$ and the target-language probability $p(c_{col})$; the denominator $p(e_{col})$ is constant across the candidate translations. Calculating the translation probability requires a bilingual corpus, and if the equation is used directly we run into a data sparseness problem. Model simplification is therefore necessary.</Paragraph>

<Paragraph position="7"> Our simplification is made according to the following three assumptions.</Paragraph>

<Paragraph position="8"> Assumption 1: Given a Chinese collocation c_col, we assume that e_1, e_2, and r_e are conditionally independent. The translation model is rewritten as:

$p(e_{col}|c_{col}) = p(e_1|c_{col})\,p(e_2|c_{col})\,p(r_e|c_{col})$  (2)</Paragraph>

<Paragraph position="9"> Assumption 2: Given a Chinese collocation <c_1, r_c, c_2>, we assume that the translation probability $p(e_i|c_{col})$ depends only on e_i and c_i (i = 1, 2), and that $p(r_e|c_{col})$ depends only on r_e and r_c.</Paragraph>
<Paragraph position="10"> Equation (2) is rewritten as:

$p(e_{col}|c_{col}) = p(e_1|c_1)\,p(e_2|c_2)\,p(r_e|r_c)$  (3)

This is equivalent to a word translation model in which the relation type of a collocation is treated as one more word, similar to Model 1 in (Brown et al., 1993).</Paragraph>

<Paragraph position="11"> Assumption 3: We assume that one type of English collocation can only be translated into the same type of Chinese collocation.³ Thus $p(r_e|r_c) = 1$ in our case, and Equation (3) is rewritten as:

$p(e_{col}|c_{col}) = p(e_1|c_1)\,p(e_2|c_2)$  (4)</Paragraph>

<Paragraph position="12"> ³ Zhou et al. (2001) found that about 70% of the Chinese translations have the same relation type as the source English collocations.</Paragraph>

<Paragraph position="13"> The language model $p(c_{col})$ is calculated with the Chinese collocation database extracted in Section 2.1. To tackle the data sparseness problem, we smooth the language model with an interpolation method.</Paragraph>

<Paragraph position="14"> When a given Chinese collocation occurs in the corpus, we calculate its probability as in (5):

$p(c_{col}) = \frac{count(c_{col})}{N}$  (5)

where $count(c_{col})$ is the count of the Chinese collocation c_col and N is the total count of all Chinese collocations in the training corpus.</Paragraph>

<Paragraph position="15"> For a collocation <c_1, r_c, c_2>, if we assume that the two words c_1 and c_2 are conditionally independent given the relation r_c, Equation (5) can be rewritten as in (6):

$p(c_{col}) = p(c_1|r_c)\,p(c_2|r_c)\,p(r_c) = \frac{count(c_1, r_c, *)}{count(*, r_c, *)} \cdot \frac{count(*, r_c, c_2)}{count(*, r_c, *)} \cdot \frac{count(*, r_c, *)}{N}$  (6)

where $count(c_1, r_c, *)$ is the frequency of collocations with c_1 as the head and r_c as the relation type, $count(*, r_c, c_2)$ is the frequency of collocations with c_2 as the modifier and r_c as the relation type, and $count(*, r_c, *)$ is the frequency of collocations with r_c as the relation type.</Paragraph>

<Paragraph position="16"> With Equations (5) and (6), we get the interpolated language model shown in (7):

$p(c_{col}) = \lambda\,\frac{count(c_{col})}{N} + (1-\lambda)\,p(c_1|r_c)\,p(c_2|r_c)\,p(r_c)$  (7)

where $0 < \lambda < 1$ is a constant set so that the probabilities sum to 1.</Paragraph>

<Paragraph position="17"> Many methods can be used to estimate word translation probabilities from parallel or non-parallel bilingual corpora (Brown et al., 1993; Koehn and Knight, 2000). In this paper, we use a parallel bilingual corpus to train the word translation probabilities, based on the result of word alignment with a bilingual Chinese-English dictionary; the alignment method is described in (Wang et al., 2001). To deal with data sparseness, we apply a simple smoothing that adds 0.5 to the count of each translation pair, as in (8):

$p(e|c) = \frac{count(e, c) + 0.5}{count(c) + 0.5 \times |trans_e|}$  (8)

where $|trans_e|$ is the number of English translations for the given Chinese word c.</Paragraph>

<Paragraph position="18"> For each synonymous collocation candidate pair, we obtain the Chinese translations of both collocations and calculate the translation probabilities as described above. The Chinese translations with their corresponding translation probabilities are taken as the feature vector of an English collocation:

$Fe_{col} = \langle (c_{col,1}, p_{col,1}), (c_{col,2}, p_{col,2}), \ldots, (c_{col,m}, p_{col,m}) \rangle$</Paragraph>
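To show how Equations (4), (7), and (8) fit together, here is a minimal sketch; the factory functions, count structures, and the interpolation weight are our own illustrative choices, not the authors' implementation.

```python
def make_word_tm(pair_count, c_count, n_translations):
    """Add-0.5 smoothed word translation model, Equation (8)."""
    def p_word(e, c):
        return (pair_count.get((e, c), 0) + 0.5) / \
               (c_count.get(c, 0) + 0.5 * n_translations.get(c, 1))
    return p_word

def make_lm(colloc_count, head_count, mod_count, rel_count, n, lam=0.5):
    """Interpolated collocation language model, Equation (7)."""
    def p_lm(c_col):
        c1, rc, c2 = c_col
        direct = colloc_count.get(c_col, 0) / n          # Equation (5)
        r = rel_count.get(rc, 0)
        backoff = 0.0 if r == 0 else (                   # Equation (6)
            head_count.get((c1, rc), 0) / r
            * mod_count.get((rc, c2), 0) / r
            * r / n)
        return lam * direct + (1 - lam) * backoff
    return p_lm

def score_translation(e_col, c_col, p_word, p_lm):
    """p(c_col|e_col) up to the constant factor 1/p(e_col):
    p(e1|c1) * p(e2|c2) * p(c_col), per Equations (1) and (4)."""
    (e1, _, e2), (c1, _, c2) = e_col, c_col
    return p_word(e1, c1) * p_word(e2, c2) * p_lm(c_col)
```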
<Paragraph position="19"> The similarity of two collocations is then defined as the cosine of their feature vectors, as in (9); candidate pairs whose similarity scores exceed a given threshold are selected:

$sim(e_{col,1}, e_{col,2}) = \cos(Fe_{col,1}, Fe_{col,2}) = \frac{\sum_{c_{1i} = c_{2j}} p_{1i}\,p_{2j}}{\sqrt{\sum_i p_{1i}^2}\,\sqrt{\sum_j p_{2j}^2}}$  (9)</Paragraph>

<Paragraph position="20"> For example, given the synonymous collocation pair <turn on, OBJ, light> and <switch on, OBJ, light>, we first obtain their feature vectors, whose values are the translation probabilities of their Chinese translations. With these two vectors, the similarity of <turn on, OBJ, light> and <switch on, OBJ, light> is 0.2348.</Paragraph>

<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Implementation of our Approach </SectionTitle>

<Paragraph position="0"> We use an English-Chinese dictionary to obtain the Chinese translations of collocations. The dictionary includes 219,404 English words, each with 3 translations on average. The word translation probabilities are estimated from a bilingual corpus containing 170,025 pairs of Chinese-English sentences, comprising about 2.1 million English words and about 2.5 million Chinese words.</Paragraph>

<Paragraph position="1"> With these data and the collocations of Section 2.1, setting the similarity threshold to 0.01, our translation method produced 93,523 synonymous collocation pairs and filtered out 1,060,788 candidate pairs.</Paragraph> </Section> </Section>

<Section position="5" start_page="0" end_page="21" type="metho"> <SectionTitle> 3 Evaluation </SectionTitle>

<Paragraph position="0"> To evaluate the effectiveness of our method, we conducted two experiments. The first compares our method with two methods that use monolingual corpora; the second compares it with a method that uses a bilingual corpus.</Paragraph>

<Section position="1" start_page="0" end_page="21" type="sub_section"> <SectionTitle> 3.1 Comparison with Methods using Monolingual Corpora </SectionTitle>

<Paragraph position="0"> We compared our approach with two methods that use monolingual corpora. These methods also employ the candidate generation step described in Section 2.2; they differ in the strategies used to select appropriate candidates. The training corpus for both methods is the same English corpus as in Section 2.1.</Paragraph>

<Paragraph position="1"> Method 1: This method uses the contexts of collocations to select synonymous candidates. The purpose of this experiment is to see whether the context-based method for synonymous word extraction can be effectively extended to synonymous collocation extraction.</Paragraph>

<Paragraph position="2"> The similarity of two collocations is calculated with their feature vectors. The feature vector of a collocation is constructed from all the words in sentences that surround the given collocation. The context vector for collocation i is represented as in (10):

$Fe_{col,i} = \langle (w_{i1}, p_{i1}), (w_{i2}, p_{i2}), \ldots, (w_{im}, p_{im}) \rangle$  (10)

where $p_{ij} = \frac{count(w_{ij},\, e_{col,i})}{N}$, $count(w_{ij}, e_{col,i})$ is the count of word $w_{ij}$ co-occurring with the collocation $e_{col,i}$, and N is the total count of words in the training corpus. With these feature vectors, the similarity of two collocations is calculated with the same cosine measure as in Equation (9), shown in (11); those candidates whose similarities exceed a given threshold are selected as synonymous collocations:

$sim(e_{col,1}, e_{col,2}) = \cos(Fe_{col,1}, Fe_{col,2}) = \frac{\sum_{w_{1i} = w_{2j}} p_{1i}\,p_{2j}}{\sqrt{\sum_i p_{1i}^2}\,\sqrt{\sum_j p_{2j}^2}}$  (11)</Paragraph>
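Equations (9) and (11) use the same cosine measure over sparse feature vectors; a minimal sketch follows, with illustrative names and dictionaries mapping features to probabilities.

```python
from math import sqrt

def cosine_sim(fv1, fv2):
    """Cosine similarity of two sparse feature vectors, Equations (9)/(11).
    Keys are features (Chinese translations in Section 2.3, context words
    in Method 1); values are the corresponding probabilities."""
    num = sum(p * fv2[f] for f, p in fv1.items() if f in fv2)
    den = sqrt(sum(p * p for p in fv1.values())) * \
          sqrt(sum(p * p for p in fv2.values()))
    return num / den if den else 0.0
```

Only the features shared by the two vectors contribute to the numerator, which is why two collocations with disjoint translation sets, such as <turn on, OBJ, light> and <depend on, OBJ, illumination>, receive a similarity near zero.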
<Paragraph position="3"> Method 2: Using the similarity of two words, this method calculates the similarity of collocations from the similarity of their components. The formula is described in Equation (12):

$sim(e_{col,1}, e_{col,2}) = sim(e_{11}, e_{21}) \times sim(rel_1, rel_2) \times sim(e_{12}, e_{22})$  (12)

where $e_{col,i} = \langle e_{i1}, rel_i, e_{i2} \rangle$. We assume that the relation type stays the same, so $sim(rel_1, rel_2) = 1$.</Paragraph>

<Paragraph position="4"> The similarity of two words is calculated with the method described in (Lin, 1998), rewritten in Equation (13); it measures word similarity through the surrounding context words that have dependency relationships with the words under investigation:

$sim(w_1, w_2) = \frac{\sum_{(r,w) \in T(w_1) \cap T(w_2)} \big(I(w_1, r, w) + I(w_2, r, w)\big)}{\sum_{(r,w) \in T(w_1)} I(w_1, r, w) + \sum_{(r,w) \in T(w_2)} I(w_2, r, w)}$  (13)

where T(w) is the set of dependency pairs (r, w') co-occurring with w, and I(w, r, w') is the mutual information of the triple.</Paragraph>

<Paragraph position="5"> With the candidate generation method described in Section 2.2, we generated 1,154,311 candidate synonymous collocation pairs for 880,600 collocations, from which we randomly selected 1,300 pairs to construct a test set. Each pair was evaluated independently by two judges, and only the pairs judged synonymous by both are considered correct. The statistics of the test set are shown in Table 3. We evaluated three types of synonymous collocations: <verb, OBJ, noun>, <noun, ATTR, adj>, and <verb, MOD, adv>. For <verb, OBJ, noun>, 197 of the 630 candidate pairs are correct; for <noun, ATTR, adj>, 163 of 324; and for <verb, MOD, adv>, 124 of 346.

Table 3. Test set statistics
Class | #Candidate pairs | #Correct pairs
<verb, OBJ, noun> | 630 | 197
<noun, ATTR, adj> | 324 | 163
<verb, MOD, adv> | 346 | 124
</Paragraph>

<Paragraph position="6"> With this test set, we evaluate the performance of each method. The evaluation metrics are precision, recall, and f-measure.</Paragraph>

<Paragraph position="7"> A development set of 500 synonymous pairs is used to determine the threshold of each method: for each method, the threshold that yields the highest f-measure on the development set is selected. The resulting thresholds for Method 1, Method 2, and our approach are 0.02, 0.02, and 0.01, respectively. With these thresholds, the experimental results on the test set of Table 3 are shown in Table 4. Our approach achieves the highest precision (74% on average) for all three types of synonymous collocations. Although the recall of our approach (64% on average) is below that of the other methods, its f-measure scores, which combine precision and recall, are the highest. To compare the methods at the same recall, we conducted a further experiment on the type <verb, OBJ, noun>.⁴ We set the recall of the two baseline methods to that of our method, 0.6396 in Table 4; the precisions are then 0.3190, 0.4922, and 0.6811 for Method 1, Method 2, and our method, respectively. Thus, the precision of our approach is higher than that of the other two methods even at the same recall, which shows that using translation information to select candidates is effective for synonymous collocation extraction.</Paragraph>

<Paragraph position="8"> ⁴ The results for the other two types of collocations are the same as for <verb, OBJ, noun>; we omit them because of the space limit.</Paragraph>
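The tuning-and-scoring protocol used above (pick the threshold that maximizes f-measure on a development set, then report precision, recall, and f-measure on the test set) can be sketched as follows; the function names and data layout are illustrative.

```python
def evaluate(scored, gold, threshold):
    """Precision, recall, and f-measure of the pairs whose similarity
    exceeds `threshold`. `scored` maps candidate pairs to similarity
    scores; `gold` is the set of pairs both judges marked synonymous."""
    selected = {pair for pair, s in scored.items() if s > threshold}
    correct = selected & gold
    p = len(correct) / len(selected) if selected else 0.0
    r = len(correct) / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def tune_threshold(scored, dev_gold, grid):
    """Pick the threshold with the highest f-measure on the dev set."""
    return max(grid, key=lambda t: evaluate(scored, dev_gold, t)[2])
```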
<Paragraph position="9"> The results of Method 1 show that it is difficult to extract synonymous collocations with monolingual contexts. Although Method 1 achieves higher recall than the other methods, it admits a large number of wrong candidates, which results in lower precision. If we set a higher threshold to obtain comparable precision, the recall drops far below that of our approach. This indicates that the contexts of collocations are not discriminative enough for extracting synonymous collocations.</Paragraph>

<Paragraph position="10"> The results also show that Method 2 is not suitable for the task. The main reason is that high scores for both $sim(e_{11}, e_{21})$ and $sim(e_{12}, e_{22})$ do not imply a high similarity between the two collocations.</Paragraph>

<Paragraph position="11"> The reason our method outperforms the other two is that when a collocation is translated into another language, its translations indirectly disambiguate the senses of the words in the collocation. For example, the probability of <turn on, OBJ, light> being translated into <打开, OBJ, 灯> (<da3 kai1, OBJ, deng1>) is much higher than that of it being translated into <取决于, OBJ, 光照度> (<qu3 jue2 yu2, OBJ, guang1 zhao4 du4>), while the situation is reversed for <depend on, OBJ, illumination>. Thus, the similarity between <turn on, OBJ, light> and <depend on, OBJ, illumination> is low, and this candidate is filtered out.</Paragraph> </Section> </Section>

<Section position="6" start_page="21" end_page="21" type="metho"> <Section position="1" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 3.2 Comparison with Methods using Bilingual Corpora </SectionTitle>

<Paragraph position="0"> Barzilay and McKeown (2001) and Shimohata and Sumita (2002) used a bilingual corpus to extract synonymous expressions: if the same source expression has more than one translation in the second language, the different translations are extracted as synonymous expressions. To compare our method with these methods, which use only a bilingual corpus, we implemented a similar method, described below as Method 3.</Paragraph>

<Paragraph position="1"> Method 3: (1) Parse all the source and target sentences (here Chinese and English, respectively). (2) Extract the Chinese and English collocations in the bilingual corpus. (3) Align a Chinese collocation c_col = <c_1, r_c, c_2> with an English collocation e_col = <e_1, r_e, e_2> if c_1 is aligned with e_1 and c_2 is aligned with e_2. (4) Take two different English collocations as synonymous if they are aligned with the same Chinese collocation and occur more than once in the corpus.</Paragraph>

<Paragraph position="2"> The training bilingual corpus is the same one described in Section 2. With Method 3, we obtain 9,368 synonymous collocation pairs in total, only 10% of the 93,523 pairs that our approach extracts from the same bilingual corpus. To evaluate Method 3 and our approach on the same test set, we randomly selected 100 collocations that have synonymous collocations in the bilingual corpus. For these 100 collocations, Method 3 extracts 121 synonymous collocation pairs, of which 83% (100 of 121) are correct.⁵ Our method described in Section 2 generates 556 synonymous collocation pairs with the threshold set in the previous section, of which 75% (417 of 556) are correct.</Paragraph>

<Paragraph position="3"> ⁵ These synonymous collocation pairs were evaluated by two judges, and only those agreed on by both are counted as correct.</Paragraph>

<Paragraph position="4"> If we set a higher threshold (0.08) for our method, we obtain 360 pairs, of which 295 (82%) are correct. Let A, B, and C denote the sets of correct pairs extracted by Method 3, by our method, and by both methods, respectively; then |A| = 100, |B| = 295, and |C| = |A ∩ B| = 78. Thus, the synonymous collocation pairs extracted by our method cover 78% (|C|/|A|) of those extracted by Method 3, while those extracted by Method 3 cover only 26% (|C|/|B|) of those extracted by our method.</Paragraph>
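A quick check of the coverage arithmetic; the pair identities below are placeholders, only the set sizes from the paper matter.

```python
# Placeholder pair sets with the reported sizes:
# |A| = 100 (Method 3), |B| = 295 (our method), |A & B| = 78.
A = {f"pair{i}" for i in range(100)}
B = {f"pair{i}" for i in range(22, 317)}   # overlaps A on pair22..pair99
C = A & B
print(len(C), len(C) / len(A), round(len(C) / len(B), 2))  # 78 0.78 0.26
```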
<Paragraph position="5"> The coverage of Method 3 is thus much lower than that of our method, even when their precisions are set to roughly the same value. This is mainly because Method 3 can only extract synonymous collocations that actually occur in the bilingual corpus. In contrast, our method uses the bilingual corpus only to train the translation probabilities, so the translations themselves need not occur in the corpus. The advantage of our method is that it can extract synonymous collocations that do not occur in the bilingual corpus.</Paragraph> </Section> </Section> </Paper>