<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1101"> <Title>Adapting a synonym database to specific domains</Title> <Section position="4" start_page="0" end_page="4" type="intro"> <SectionTitle> 2 Methodology 2.1 Outline </SectionTitle> <Paragraph position="0"> The synonym pruning task aims at improving both the accuracy and the speed of a synonym database. In order to set the terms of the problem, we find it useful to partition the set of synonymy relations defined in WordNet into three classes: . Relations irrelevant to the specific domain (e.g. relations involving words that seldom or never appear in the specific domain) null . Relations that are relevant but incorrect in the specific domain (e.g. the synonymy of two words that do appear in the specific domain, but are only synonyms in a sense irrelevant to the specific domain); null 3. Relations that are relevant and correct in the specific domain.</Paragraph> <Paragraph position="1"> The creation of a domain specific database aims at removing relations in the first two classes (to improve speed and accuracy, respectively) and including only relations in the third class.</Paragraph> <Paragraph position="2"> The overall goal of the described method is to inspect all synonymy relations in Word-Net and classify each of them into one of the three aforementioned classes. We define a synonymy relation as a binary relation between two synonym terms (with respect to * a particular sense). Therefore, a WordNet synset containing n terms defines ~11 k synonym relations. The assignment of a synonymy relation to a class is based on evidence drawn from a domain specific corpus. We use a tagged and lemmatized corpus for this purpose. Accordingly, all frequencies used in the rest of the paper are to be intended as frequencies of (lemma, tag) pairs.</Paragraph> <Paragraph position="3"> The pruning process is carried out in three steps: (i) manual pruning; (ii) automatic pruning; (iii) optimization. The first two steps focus on incrementally eliminating incorrect synonyms, while the third step focuses on removing irrelevant synonyms. The three steps are described in the following sections.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Manual pruning </SectionTitle> <Paragraph position="0"> Different synonymy relations have a different impact on the behavior of the application in which they are used, depending on how frequently each synonymy relation is used. Relations involving words frequently appearing in either queries or corpora have a much higher impact (either positive or negative) than relations involving rarely occurring words. E.g.</Paragraph> <Paragraph position="1"> the synonymy between snow and C has a higher impact on the weather report domain (or the aviation domain, discussed in this paper) than the synonymy relation between cocaine and coke. Consequently, the precision of a synonym database obviously depends much more on frequently used relations than on rarely used ones. Another important consideration is that judging the correctness of a given synonymy relation in a given domain is often an elusive issue: besides clearcut cases, there is a large gray area where judgments may not be trivial even for humans evaluatots. E.g. given the following three senses of the noun approach (2) a. {approach, approach path, glide path, glide slope} (the final path followed by an aircraft as it is landing) b. 
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Manual pruning </SectionTitle> <Paragraph position="0"> Different synonymy relations have a different impact on the behavior of the application in which they are used, depending on how frequently each synonymy relation is used. Relations involving words frequently appearing in either queries or corpora have a much higher impact (either positive or negative) than relations involving rarely occurring words.</Paragraph> <Paragraph position="1"> E.g. the synonymy between snow and C has a higher impact on the weather report domain (or the aviation domain, discussed in this paper) than the synonymy relation between cocaine and coke. Consequently, the precision of a synonym database depends much more on frequently used relations than on rarely used ones. Another important consideration is that judging the correctness of a given synonymy relation in a given domain is often an elusive issue: besides clear-cut cases, there is a large gray area where judgments may not be trivial even for human evaluators. E.g., given the following three senses of the noun approach: (2) a. {approach, approach path, glide path, glide slope} (the final path followed by an aircraft as it is landing) b. {approach, approach shot} (a relatively short golf shot intended to put the ball onto the putting green) c. {access, approach} (a way of entering or leaving) it would be easy to judge the first and second senses respectively relevant and irrelevant to the aviation domain, but the evaluation of the third sense would be fuzzier.</Paragraph> <Paragraph position="2"> The combination of the two remarks above induced us to consider a manual pruning phase for the terms of highest 'weight' as a good investment of human effort, in terms of the ratio between the achieved increase in precision and the amount of work involved. A second reason for performing an initial manual pruning is that its outcome can be used as a reliable test set against which automatic pruning algorithms can be evaluated.</Paragraph> <Paragraph position="3"> Based on such considerations, we included a manual phase in the pruning process, consisting of two steps: (i) the ranking of synonymy relations in terms of their weight in the specific domain; (ii) the actual evaluation of the correctness of the top ranking synonymy relations, by human evaluators.</Paragraph> <Paragraph position="4"> The goal of ranking synonymy relations is to associate them with a score that estimates how often a synonymy relation is likely to be used in the specific domain. The input database is sorted by the assigned scores, and the top ranking words are checked for manual pruning. Only terms appearing in the domain specific corpus are considered at this stage. In this way the benefit of manual pruning is maximized. Ranking is based on three sorting criteria, listed below in order of priority.</Paragraph> <Paragraph position="5"> Criterion 1. Since a term that does appear in the domain corpus must have at least one valid sense in the specific domain, words with only one sense are not good candidates for pruning (under the assumption of completeness of the synonym database). Therefore polysemous terms are prioritized over monosemous terms.</Paragraph> <Paragraph position="6"> Criterion 2. The second and third sorting criteria are similar, the only difference being that the second criterion assumes the existence of some inventory of relevant queries (a term list, a collection of previous queries, etc.). If such an inventory is not available, the second sorting criterion can be omitted. If the inventory is available, it is used to check which synonymy relations are actually to be used in queries to the domain corpus. Given a pair (t_i, t_j) of synonym terms, a score (which we name scoreCQ) is assigned to their synonymy relation, according to the following formula:</Paragraph> <Paragraph position="7"> scoreCQ_{i,j} = fcorpus_i * fquery_j + fcorpus_j * fquery_i</Paragraph> <Paragraph position="8"> where fcorpus_i and fquery_i are, respectively, the frequencies of term t_i in the domain corpus and in the inventory of query terms.</Paragraph> <Paragraph position="9"> The above formula aims at estimating how often a given synonymy relation is likely to be actually used. In particular, each half of the formula estimates how often a given term in the corpus is likely to be matched as a synonym of a given term in a query. Consider, e.g., the following situation, taken from the aviation domain discussed in section 3.1: the term C occurs 9168 times in the domain corpus and never in the query inventory, while snow occurs twice in the query inventory.</Paragraph> <Paragraph position="11"> It is estimated that C would be matched 18336 times as a synonym for snow (i.e. 9168 * 2), while snow would never be matched as a synonym for C, because C never occurs as a query term. Therefore scoreCQ_{snow,C} is 18336 (i.e. 18336 + 0).</Paragraph>
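As a concrete illustration of the scoreCQ formula, the following sketch (our own code with illustrative names, not the paper's implementation) reproduces the snow/C figure quoted above; the corpus frequency of snow is not given in the text, and it does not affect this particular score because C never occurs as a query term:

```python
def score_cq(t1, t2, fcorpus, fquery):
    """scoreCQ of a synonymy relation: how often each term is expected to be
    matched in the corpus as a synonym of the other term occurring in a query."""
    return (fcorpus.get(t1, 0) * fquery.get(t2, 0)
            + fcorpus.get(t2, 0) * fquery.get(t1, 0))

# Figures quoted above for the aviation domain.
fcorpus = {"C": 9168}         # corpus frequency of snow is not quoted; it is
                              # irrelevant here since fquery["C"] is 0
fquery = {"snow": 2, "C": 0}

print(score_cq("snow", "C", fcorpus, fquery))  # 18336 = 9168 * 2 + 0
```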
<Paragraph position="12"> Then, for each polysemous term i and synset s such that i ∈ s, the following score is computed: (5) scorePolyCQ_{i,s} = Σ { scoreCQ_{i,j} | j ∈ s ∧ i ≠ j } E.g., if S is the synset in (1), then scorePolyCQ_{snow,S} is the sum of scoreCQ_{snow,cocaine}, scoreCQ_{snow,cocain}, scoreCQ_{snow,coke} and scoreCQ_{snow,C}. Given the data in Table 1 (taken again from our aviation domain):

Table 1:
j         fcorpus_j   fquery_j
cocaine   1           0
cocain    0           0
coke      8           0
C         9168        0

the following scoreCQ values would result: (6) scoreCQ_{snow,cocaine} = 2, scoreCQ_{snow,cocain} = 0, scoreCQ_{snow,coke} = 16, scoreCQ_{snow,C} = 18336.</Paragraph> <Paragraph position="14"> Therefore, scorePolyCQ_{snow,S} would equal 18354.</Paragraph> <Paragraph position="15"> The final score assigned to each polysemous term t_i is the highest scorePolyCQ_{i,s}. For snow, which has the following three senses: (7) a. {cocaine, cocain, coke, C, snow} (a narcotic (alkaloid) extracted from coca leaves) b. {snow} (a layer of snowflakes (white crystals of frozen water) covering the ground) c. {snow, snowfall} (precipitation falling from clouds in the form of ice crystals) the highest score would be the one computed above.</Paragraph> <Paragraph position="16"> Criterion 3. The third criterion assigns a score in terms of domain corpus frequency alone. It is used to further rank terms that do not occur in the query term inventory (or when no query term inventory is available). It is computed in the same way as the previous score, with the only difference that a value of 1 is conventionally assumed for fquery (the frequency of a term in the inventory of query terms).</Paragraph> <Paragraph position="17"> All the synsets containing the top ranking terms, according to the hierarchy of criteria described above, are manually checked for pruning. For each term, all the synsets containing the term are clustered together and presented to a human operator, who examines each (term, synset) pair and answers the question: does the term belong to the synset in the specific domain? Evidence about the answer is drawn from relevant examples automatically extracted from the domain specific corpus. E.g., following up on our example in the previous section, the operator would be presented with the word snow associated with each of the synsets in (7) and would have to provide a yes/no answer for each of them. In the specific case, the answer would be likely to be 'no' for (7a) and 'yes' for (7b) and (7c). The evaluator is presented with all the synsets involving a relevant term (even those that did not rank high in terms of scorePolyCQ) in order to apply a contrastive approach. It might well be the case that the correct sense for a given term is one for which the term has no synonyms at all (e.g. (7b) in the example), therefore all synsets for a given term need to be presented to the evaluator in order to make an informed choice. The evaluator provides a yes/no answer for all the (term, synset) pairs he/she is presented with (with some exceptions, as explained in section 3.1).</Paragraph>
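To make the ranking score used in this manual phase concrete, the following sketch (again our own code, not the authors') recomputes scorePolyCQ for snow from the Table 1 frequencies; score_cq is repeated from the previous sketch so the snippet stays self-contained:

```python
def score_cq(t1, t2, fcorpus, fquery):
    """scoreCQ of a synonymy relation (Criterion 2)."""
    return (fcorpus.get(t1, 0) * fquery.get(t2, 0)
            + fcorpus.get(t2, 0) * fquery.get(t1, 0))

def score_poly_cq(term, synset, fcorpus, fquery):
    """scorePolyCQ: sum of scoreCQ between a term and the other synset members."""
    return sum(score_cq(term, other, fcorpus, fquery)
               for other in synset if other != term)

# Table 1 corpus/query frequencies, plus the query frequency of snow (2).
fcorpus = {"cocaine": 1, "cocain": 0, "coke": 8, "C": 9168}
fquery = {"snow": 2, "cocaine": 0, "cocain": 0, "coke": 0, "C": 0}

narcotic_synset = {"cocaine", "cocain", "coke", "C", "snow"}
print(score_poly_cq("snow", narcotic_synset, fcorpus, fquery))
# 18354 = 2 + 0 + 16 + 18336
```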
</Section> <Section position="2" start_page="0" end_page="4" type="sub_section"> <SectionTitle> 2.3 Automatic pruning </SectionTitle> <Paragraph position="0"> The automatic pruning task is analogous to manual pruning in two respects: (i) its input is the set of synonymy relations involving WordNet polysemous words appearing in the domain specific corpus; (ii) it is performed by examining all (term, synset) input pairs and answering the question: does the term belong to the synset in the specific domain? However, while the manual pruning task was regarded as a filtering task, where a human evaluator assigns a boolean value to each pruning candidate, the automatic pruning task can be more conveniently regarded as a ranking task, where all the pruning candidates are assigned a score, measuring how appropriate a given sense is for a given word in the domain at hand. The actual pruning is left as a subsequent step. Different pruning thresholds can be applied to the ranked list, based on different considerations (e.g. depending on whether a stronger emphasis is put on the precision or the recall of the resulting database). The score is based on the frequencies of both words in the synset (except the word under consideration) and words in the sense gloss.</Paragraph> <Paragraph position="1"> We also remove from the gloss all words belonging to a stoplist (a stoplist provided with WordNet was used for this purpose). The score combines the average corpus frequency of the words in the synset (excluding the word under consideration) and the average corpus frequency of the words in the gloss, each adjusted as described below.</Paragraph> <Paragraph position="3"> Note that the synset cardinality does not include the word under consideration, reflecting the fact that the word's frequency is not used in calculating the score. Therefore a synset only containing the word under consideration and no synonyms is assigned cardinality 0.</Paragraph> <Paragraph position="4"> The goal is to identify (term, sense) pairs not pertaining to the domain. For this reason we tend to assign high scores to candidates for which we do not have enough evidence about their inappropriateness. This is why average frequencies are divided by some factor which is a function of the number of averaged frequencies, in order to increase the scores based on little evidence (i.e. fewer averaged numbers). In the sample application described in section 3 the value of k was set to 2. For analogous reasons, we conventionally assign a very high score to candidates for which we have no evidence (i.e. no words in either the synset or the gloss). If either the synset or the gloss is empty, we conventionally double the score for the gloss or the synset, respectively. We note at this point that our final ranking list is sorted in reverse order with respect to the assigned scores, since we are focusing on removing incorrect items. At the top of the list are the items that receive the lowest score, i.e. those most likely to be incorrect (term, sense) associations for our domain (thus being the best candidates to be pruned out).</Paragraph> <Paragraph position="5"> Table 2 shows the ranking of the senses for the word C in the aviation domain. In the table, each term is followed by its corpus frequency, separated by a slash. From each synset the word C itself has been removed, as well as the gloss words found in the stoplist. Therefore, the table only contains the words that contribute to the calculation of the sense's score. E.g. the score for the first sense in the list is obtained by applying this scoring scheme to the corpus frequencies of its synset and gloss words.</Paragraph> <Paragraph position="7"> The third sense in the list exemplifies the case of an empty synset (i.e. a synset originally containing only the word under consideration). In this case the score obtained from the gloss is doubled. Note that the obviously incorrect sense of C as a narcotic is in the middle of the list. This is due to a tagging problem, as the word leaves in the gloss was tagged as a verb instead of a noun. Therefore it was assigned a very high frequency, as the verb leave, unlike the noun leaf, is very common in the aviation domain. The last sense in the list also requires a brief explanation.</Paragraph> <Paragraph position="8"> The original word in the gloss was 10S. However, the pre-processor that was used before tagging the glosses recognized S as an abbreviation for South and expanded the term accordingly. It so happens that both words 10 and South are very frequent in the aviation corpus we used, therefore the sense was assigned a high score.</Paragraph>
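Since the scoring expression itself is not reproduced in this extract, the sketch below only mirrors the behaviour described above: it averages the corpus frequencies of the remaining synset words and of the gloss words, divides each average by an evidence factor (assumed here to be the k-th root of the number of averaged values, with k = 2), doubles one part when the other is empty, and returns a very high score when there is no evidence at all. The names, the evidence factor, and the final combination by summation are our assumptions, not the paper's formula:

```python
def sense_score(synset_freqs, gloss_freqs, k=2):
    """Score a (term, sense) pruning candidate; low scores mark likely
    incorrect associations for the domain (the best candidates to prune).

    synset_freqs: corpus frequencies of the synset words, excluding the term itself.
    gloss_freqs:  corpus frequencies of the gloss words, after stoplist removal.
    """
    def part(freqs):
        if not freqs:                      # no evidence from this half
            return None
        avg = sum(freqs) / len(freqs)
        # Assumed evidence factor: k-th root of the number of averaged values,
        # so that scores computed from little evidence come out higher.
        return avg / (len(freqs) ** (1.0 / k))

    syn, glo = part(synset_freqs), part(gloss_freqs)
    if syn is None and glo is None:
        return float("inf")                # stand-in for the conventional "very high" score
    if syn is None:
        return 2 * glo                     # empty synset: double the gloss score
    if glo is None:
        return 2 * syn                     # empty gloss: double the synset score
    return syn + glo                       # combining by summation is an assumption

# Candidates are then sorted in ascending order of score before thresholding.
```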
</Section> <Section position="3" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 2.4 Optimization </SectionTitle> <Paragraph position="0"> The aim of this phase is to improve the access speed to the synonym database, by removing all information that is not likely to be used.</Paragraph> <Paragraph position="1"> The main idea is to minimize the size of the database in such a way that the database behavior remains unchanged. Two operations are performed at this stage: (i) a simple relevance test to remove irrelevant terms (i.e. terms not pertaining to the domain at hand); (ii) a redundancy check, to remove information that, although perhaps relevant, does not affect the database behavior.</Paragraph> <Paragraph position="5"> Terms not appearing in the domain corpus are considered not relevant to the specific domain and removed from the synonym database. The rationale underlying this step is to remove from the synonym database synonymy relations that are never going to be used in the specific domain. In this way the efficiency of the module can be increased, by reducing the size of the database and the number of searches performed (synonyms that are known to never appear are not searched for), without affecting the system's matching accuracy. E.g., the synset in (10a) would be reduced to the synset in (10b).</Paragraph> <Paragraph position="6"> The final step is the removal of redundant synsets, possibly as a consequence of the previous pruning steps. Specifically, the following synsets are removed: * Synsets containing a single term (although the associated sense might be a valid one for that term, in the specific domain).</Paragraph> <Paragraph position="7"> * Duplicate synsets, i.e. identical (in terms of synset elements) to some other synset not being removed (the choice of the only synset to be preserved is arbitrary).</Paragraph> <Paragraph position="8"> E.g., the synset in (10b) would be finally removed at this stage.</Paragraph>
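The two optimization operations can be pictured with a short sketch (ours, using a hypothetical representation of the synonym database as a list of synsets, each a set of terms): the relevance test drops terms absent from the domain corpus, and the redundancy check then drops singleton and duplicate synsets.

```python
def optimize(synsets, corpus_terms):
    """Shrink a synonym database without changing its matching behavior."""
    seen, optimized = set(), []
    for synset in synsets:
        # Relevance test: keep only terms that appear in the domain corpus.
        kept = frozenset(t for t in synset if t in corpus_terms)
        # Redundancy check: drop singletons and duplicates of synsets already kept.
        if len(kept) < 2 or kept in seen:
            continue
        seen.add(kept)
        optimized.append(set(kept))
    return optimized
```

</Section> </Section> </Paper>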