<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0812"> <Title>Using Semantic Similarity to Acquire Cooccurrence Restrictions from Corpora</Title> <Section position="3" start_page="82" end_page="83" type="metho"> <SectionTitle> 2 Extraction of Syntactic Word Collo- </SectionTitle> <Paragraph position="0"> cates from Corpora First, all instances of the syntactic dependency pairs under consideration (e.g. verb-object, verb-subject, adjective-noun) are extracted from a collection of text corpora using a parser. In performing this task, only the most important words (e.g. heads of immediate constituents) are chosen. The chosen words are also lemmatized. For example, the extraction of verb-object collocates from a text fragment such as have certamly htred the best financial analysts tn the area would yield the pair < hire, analyst >.</Paragraph> <Paragraph position="1"> The extracted pairs are sorted according to the syntactic dependency involved (e.g. verb-object). All pairs which involve the same dependency and share one word collocate are then merged. Each new pair consists of a unique associating word and a set of associated words containing all &quot;statistically relevant&quot; words (see below) which are related to the associating word by the same syntactic dependency, e.g.</Paragraph> <Paragraph position="3"> OUT: < {fire_v,dmmiss_v,hire v,recruit v}, employee_n * The statistical relevance of associated words is defined with reference to their conditional probability. For example, consider the equations in (6) where the numeric values express the (conditional) probability of occurrence in some corpus for each verb in (5) given the noun employee.</Paragraph> <Paragraph position="4"> (6) fi'eq(fire v \[ employee_n)= .3 freq(dlsmiss v \[ employee_n)= .28 freq(hlre_v \[ employee_n)= .33 freq(recrmt v \[ employee_n)= .22 fi-eq(attract v \[ employee n) = .02 freq(be v \[ employee_n) = .002 freq(make v \] employee_n)= .005 freq(affect_v \[ employee_n*)= .01 These conditional probabilities are obtained by dividing the number of occurrences of the verb with employee by the total number of occurrences of the verb with reference to the text corpus under consideration, as indicated in (7).</Paragraph> <Paragraph position="6"> Inclusion in the set of statistically relevant associated words is established with reference to a threshold TI which can be either selected manually or determined automatically as the most ubiquitous probability value for each choice of associating word. For example, the threshold T1 for the selection of verbs taking the noun employee as direct object with reference to the conditional probabilities in (6) cart be calculated as follows. First, all probabilities in (6) are distributed over a ten-bin template, where each bin is to receive progressively larger values starting from a fixed lowest point greater than 0, e.g.:</Paragraph> <Paragraph position="8"> Then one of the values from the bin containing most elements (e.g. the lowest) is chosen as the threshold.</Paragraph> <Paragraph position="9"> The exclusion of collocates which are not statistically relevant in the sense specified above makes it possible to avoid interference from collocations which do not provide sufficiently specific exemplifications of word usage. 
</Section> <Section position="4" start_page="83" end_page="84" type="metho"> <SectionTitle> 3 Word Clustering and Sense Expansion </SectionTitle> <Paragraph position="0"> Each pair of syntactic collocates at this stage consists of either
* an associating head word (AING) and a set of dependent associated words (AED), e.g.
< AING: fire_v, AED: {gun_n, rocket_n, employee_n, clerk_n} >
* or an associating dependent word (AING) and a set of associated head words (AED), e.g.
< AED: {fire_v, dismiss_v, hire_v, recruit_v}, AING: employee_n ></Paragraph> <Paragraph position="1"> The next step consists in partitioning the set of associated words into clusters of semantically congruent word senses. This is done in three stages.</Paragraph> <Paragraph position="2"> 1. Form all possible unique word pairs with non-identical members out of each associated word set, e.g.
IN: {fire, dismiss, hire, recruit}
OUT: {fire-dismiss, fire-hire, fire-recruit, dismiss-hire, dismiss-recruit, hire-recruit}
IN: {gun, rocket, employee, clerk}
OUT: {gun-rocket, gun-employee, gun-clerk, rocket-employee, rocket-clerk, employee-clerk}
2. Find the semantic similarity (expressed as a numeric value) for each such pair, specifying the senses with respect to which the similarity holds (if any), e.g.
IN: {fire-dismiss, fire-hire, fire-recruit, dismiss-hire, dismiss-recruit, hire-recruit}
The assessment of semantic similarity and the ensuing word sense specification are carried out using Resnik's approach (see section 1).
3. Fix the threshold for membership in clusters of semantically congruent word senses (either manually or by calculation of the most ubiquitous semantic similarity value) and generate such clusters, assuming, for example, a threshold value of 3.</Paragraph> <Paragraph position="3"> Once associated words have been partitioned into semantically congruent clusters, new sets of collocations are generated as shown in (8) by
* pairing each cluster of semantically congruent associated words with its associating word, and
* expanding the associating word into all of its possible senses.</Paragraph> <Paragraph position="4"> At this stage, all word senses which are syntactically incompatible with the original input words are removed. For example, the intransitive verb senses fire_v_1 and fire_v_5 (see table 1) are eliminated since the occurrence of fire in the input collocation which we are seeking to disambiguate relates to the transitive use of the verb. Note that the noun employee has only one sense in WordNet (see table 1); therefore, employee has a single expansion when used as an associating word.</Paragraph> <Paragraph position="5"> The disambiguation of the associating word is performed by intersecting corresponding subsets across pairs of the newly generated collocations. In the case of verb-object pairs, for example, the subsets of these new sets containing verbs are intersected, and likewise the subsets containing objects are intersected. The output comprises a new set which is non-empty if the two sets have one or more common members in both the verb and object subsets.</Paragraph>
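The clustering and intersection steps just described can be sketched as follows, assuming a Resnik-style similarity function sim over words is available and adopting a single-link grouping policy (the text above does not fix a particular clustering algorithm, and the sense sets in the usage example are illustrative):

```python
def cluster(words, sim, threshold):
    # Stages 1-3: group associated words whose pairwise semantic
    # similarity meets the threshold (single-link agglomeration).
    clusters = []
    for w in words:
        for c in clusters:
            if any(sim(w, m) >= threshold for m in c):
                c.add(w)
                break
        else:
            clusters.append({w})
    return clusters

def intersect(pair_a, pair_b):
    # Intersect the verb subsets and the object subsets of two expanded
    # collocations; the result counts only if both sides are non-empty.
    verbs = pair_a[0] & pair_b[0]
    objects = pair_a[1] & pair_b[1]
    return (verbs, objects) if verbs and objects else None

a = ({"fire_v_2", "fire_v_4"}, {"employee_n_1", "clerk_n_1"})
b = ({"fire_v_4", "dismiss_v_4"}, {"employee_n_1"})
print(intersect(a, b))  # ({'fire_v_4'}, {'employee_n_1'})
```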
<Paragraph position="6"> For the specific example of newly expanded collocations given in (8), there is only one pairwise intersection producing a non-empty result, as shown in (9).

(9) IN: < {fire_v_2/3/4/6/7/8}, ...
    OUT: < {fire_v_4}, {employee_n_1} >

All other pairwise intersections are empty, as there are no verbs and objects common to both sets of each pairwise combination.</Paragraph> <Paragraph position="7"> The results of distinct disambiguation events can be merged into pairs of semantically compatible word clusters using the notion of semantic similarity. For example, the verbs and nouns of all the input pairs in (10) are closely related in meaning and can therefore be merged into a single pair.</Paragraph> </Section> <Section position="5" start_page="84" end_page="85" type="metho"> <SectionTitle> 5 Storing Results </SectionTitle> <Paragraph position="0"> Pairs of semantically congruent word sense clusters such as the one shown in the output of (10) are stored as cooccurrence restrictions so that future disambiguation events involving any head-dependent word sense pair in them can be reduced to simple table lookups.</Paragraph> <Paragraph position="1"> The storage procedure is structured in three phases. First, each cluster of word senses in each pair is assigned a unique code consisting of an id number and the syntactic dependency involved. Then, the cluster codes in each pair are stored in a cooccurrence restriction table.</Paragraph> <Paragraph position="2"> The disambiguation of a pair of syntactically related words such as <fire_v, employee_n> can then be carried out by
* retrieving all the cluster codes for each word in the pair and creating all possible pairwise combinations, e.g.
IN: < fire_v, employee_n >
OUT: < 102_VO, 102_OV >
* eliminating code pairs which are not in the table of cooccurrence restrictions for cluster codes, e.g.
INPUT: < 102_VO, 102_OV >
OUTPUT: < 102_VO, 102_OV >
* using the resolved cluster code pairs to retrieve the appropriate senses of the input words from previously stored pairs of word senses and cluster codes, e.g.
INPUT: < [fire_v, 102_VO], [employee_n, 102_OV] >
OUTPUT: < fire_v_4, employee_n_1 ></Paragraph> <Paragraph position="3"> By repeating the acquisition process described in sections 2-4 for collections of appropriately selected source corpora, the acquired cooccurrence restrictions can be parameterized for sublanguage-specific domains. This augmentation can be made by storing each word sense and associated cluster code with a sublanguage specification and a percentage descriptor indicating the relative frequency of the word sense with reference to the cluster code in the specified sublanguage.</Paragraph>
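A minimal sketch of this table-lookup disambiguation, assuming the stored tables can be represented as plain dictionaries (the entries below are shaped like the fire/employee example; the function name and data layout are illustrative):

```python
def lookup_disambiguate(w1, w2, codes, restrictions, sense_table):
    # Pair every cluster code of w1 with every cluster code of w2, keep
    # the pairs licensed by the cooccurrence restriction table, and map
    # the surviving codes back to word senses.
    return [(sense_table[w1, c1], sense_table[w2, c2])
            for c1 in codes.get(w1, ())
            for c2 in codes.get(w2, ())
            if (c1, c2) in restrictions]

codes = {"fire_v": ["102_VO"], "employee_n": ["102_OV"]}
restrictions = {("102_VO", "102_OV")}
sense_table = {("fire_v", "102_VO"): "fire_v_4",
               ("employee_n", "102_OV"): "employee_n_1"}

print(lookup_disambiguate("fire_v", "employee_n",
                          codes, restrictions, sense_table))
# [('fire_v_4', 'employee_n_1')]
```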
<Paragraph position="4"> Because only statistically relevant collocations are chosen to drive the disambiguation process (see section 2), it follows that no cooccurrence restrictions will be acquired for a variety of word pairs. This, for example, might be the case with verb-object pairs such as < fire_v, hand_n >, where the noun is a somewhat atypical object. This problem can be addressed by using the cooccurrence restrictions already acquired to classify statistically inconspicuous collocates, as shown below with reference to the verb-object pair < fire_v, hand_n >.</Paragraph> <Paragraph position="5"> First, find all verb-object cooccurrence restrictions containing the verb fire, which are those shown in the previous section. Then cluster the statistically inconspicuous collocate with all members of the direct object collocate class. This will provide one or more sense classifications for the statistically inconspicuous collocate. In the present case, the WordNet senses 2 and 9 (glossed as "farm labourer" and "crew member" respectively) are given when hand_n clusters with clerk_n_1/2 and employee_n_1, e.g.

IN: {hand_n, clerk_n_1/2, employee_n_1, gun_n_1, rocket_n_1}
OUT: {hand_n_2/9, clerk_n_1/2, employee_n_1}, {gun_n_1, rocket_n_1}

Finally, associate the disambiguated statistically inconspicuous collocate with the same code as the word senses with which it has been clustered, e.g.

hand_n_2  102_OV
hand_n_9  102_OV

This will make it possible to choose senses 2 and 9 for hand in contexts where hand occurs as the direct object of verbs such as fire, as explained in the previous section.</Paragraph> </Section>
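Reusing the cluster function sketched earlier, the classification of an inconspicuous collocate might look as follows. This is a sketch under the assumption that the stored direct object classes are available as sense sets keyed by cluster code; the function name, data shapes, and example codes are illustrative, not from the paper.

```python
def classify_inconspicuous(word, object_classes, sim, threshold):
    # Re-cluster the new collocate with the members of each stored object
    # class; every class whose members it joins contributes its cluster
    # code (and hence a sense classification) to the new collocate.
    inherited = []
    for code, members in object_classes.items():
        groups = cluster([word] + sorted(members), sim, threshold)
        if any(word in g and len(g) > 1 for g in groups):
            inherited.append(code)
    return inherited

# e.g. classes acquired for the direct objects of fire_v (illustrative):
# object_classes = {"102_OV": {"employee_n_1", "clerk_n_1"},
#                   "103_OV": {"gun_n_1", "rocket_n_1"}}
# classify_inconspicuous("hand_n", object_classes, sim, 3) -> ["102_OV"]
```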
<Section position="6" start_page="85" end_page="86" type="metho"> <SectionTitle> 7 Preliminary Results and Future Work </SectionTitle> <Paragraph position="0"> A prototype of the system described was partially implemented to test the effectiveness of the disambiguation method. The prototype comprises: a component performing semantic similarity judgements for word pairs using WordNet (an implementation of Resnik's approach); a component which turns sets of word pairs rated for semantic similarity into clusters of semantically congruent word senses; and a component which performs the disambiguation of syntactic collocates in the manner described in section 4.</Paragraph> <Paragraph position="1"> The current functionality provides the means to disambiguate a pair of words <W1, W2> standing in a given syntactic relation Dep, given a list of words related to W1 by Dep, a list of words related to W2 by Dep, and a semantic similarity threshold for word clustering. To provide an indication of how well the system performs, a few examples are presented in (12). As can be confirmed with reference to the WordNet entries in table 1, these preliminary results are encouraging as they show a reasonable resolution of ambiguities. A more thorough evaluation is currently being carried out.

(12) IN: < fire_v-[employee_n, clerk_n, gun_n, pistol_n], [fire, dismiss, hire, recruit]-employee_n, 3 >
     OUT: < fire_v_4 employee_n_1 >
     IN: < fire_v-[employee_n, clerk_n, gun_n, pistol_n], [fire_v, shoot_v, pop_v, discharge_v]-gun_n, 3 >
     OUT: < fire_v_1 gun_n_1 >
     IN: < wear_v-[suit_n, garment_n, clothes_n, uniform_n], [wear_v, have_on_v, record_v, file_v]-suit_n, 3 >
     OUT: < wear_v_1/9 suit_n_1 >
     IN: < file_v-[suit_n, proceedings_n, lawsuit_n, litigation_n], [wear, have_on, record_v, file_v]-suit_n, 3 >
     OUT: < file_v_1/5 suit_n_2 >

Note that disambiguation can yield multiple senses, as shown by the resolution of the verbs wear and file in the third and fourth examples in (12).</Paragraph> <Paragraph position="2"> Multiple disambiguation results typically occur when some of the senses given for a word in the source dictionary database are close in meaning. For example, both senses 1 and 9 of wear relate to an eventuality of "clothing oneself". Multiple word sense resolutions can be ranked with reference to the semantic similarity scores used in clustering word senses during disambiguation. The basic idea is that the word sense resolution contained in the word cluster with the highest semantic similarity scores provides the best disambiguation hypothesis. For example, specific word senses for the verb-object pair < wear suit > in the third example of (12) above are given by the disambiguated word tuples in (13), which arise from intersecting pairs consisting of all senses of an associating word and a semantically congruent cluster of its associated words, as described in section 4. Taking into account the scores shown in (14), the best word sense candidate for the verb wear in the context wear suit would be wear_v_1. In this case, the semantic similarity scores for the second cluster (i.e. the nouns) do not matter, as there is only one such cluster.

(14) sim(have_on_v_1, wear_v_1) = 6.291
     sim(file_v_2, wear_v_9) = 3.309</Paragraph>
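Where multiple senses survive, the ranking step can be sketched as follows (the scores are those in (14); the tuple layout is an assumption):

```python
def rank_senses(resolutions):
    # Order alternative sense resolutions by the semantic similarity
    # score of the word cluster that licensed each of them.
    return sorted(resolutions, key=lambda r: r[1], reverse=True)

resolutions = [("wear_v_1", 6.291),  # via sim(have_on_v_1, wear_v_1)
               ("wear_v_9", 3.309)]  # via sim(file_v_2, wear_v_9)
best_sense, _ = rank_senses(resolutions)[0]
print(best_sense)  # wear_v_1
```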
<Paragraph position="3"> Preliminary results suggest that the present treatment of disambiguation can achieve good results with small quantities of input data. For example, as few as four input collocations may suffice to provide acceptable results, e.g.

(15) IN: < fire_v-[employee_n, clerk_n], [fire, dismiss]-employee_n, 3 >
     OUT: < fire_v_4 employee_n_1 >
     IN: < wear_v-[suit_n, clothes_n], [wear_v, have_on_v]-suit_n, 3 >
     OUT: < wear_v_1 suit_n_1 >

This is because word clustering, the decisive step in disambiguation, is carried out using a measure of semantic similarity which is essentially induced from the hyponymic links of a semantic word net. As long as the collocations chosen as input data generate some word clusters, there is a good chance of disambiguation. The reduction of input data requirements offers a significant advantage over methods such as those presented in Brown et al. (1991), Gale et al. (1992), Yarowsky (1995), and Karov & Edelman (1996), whose strong reliance on statistical techniques for the calculation of word and context similarity demands large source corpora. This advantage is particularly valuable in the acquisition of cooccurrence restrictions for sublanguage domains where large corpora are not available.</Paragraph> <Paragraph position="4"> Ironically, the major advantage of the approach proposed, namely its reliance on structured semantic word nets as the main knowledge source for assessing semantic similarity, is also its major drawback. Semantically structured lexical databases, especially those tuned to specific sublanguage domains, are currently not readily available and are expensive to build manually. However, advances in the area of automatic thesaurus discovery (Grefenstette, 1994), as well as progress in the automatic merging of machine-readable dictionaries (Sanfilippo & Poznanski, 1992; Chang & Chen, 1997), indicate that the availability of the lexical resources needed may gradually improve in the future. In addition, ongoing research on rating conceptual distance from unstructured synonym sets (Sanfilippo, 1997) may soon provide an effective way of adapting any commercially available thesaurus to the task of word clustering, thus considerably increasing the range of lexical databases usable as knowledge sources in the assessment of semantic similarity.</Paragraph> </Section> </Paper>