<?xml version="1.0" standalone="yes"?> <Paper uid="W98-0704"> <Title>The Use of WordNet in Information Retrieval</Title> <Section position="4" start_page="31" end_page="32" type="metho"> <SectionTitle> 2 What's wrong with WordNet? </SectionTitle> <Paragraph position="0"> In this section we analyze why WordNet has failed to improve information retrieval performance. We run exact-match retrieval against 9 small standard test collections in order to observe this phenomenon. An information retrieval test collection consists of a collection of documents along with a set of test queries. The set of relevant documents for each test query is also given, so that the performance of the information retrieval system can be measured.</Paragraph> <Paragraph position="1"> We expand queries using a combination of synonyms, hypernyms, and hyponyms in WordNet.</Paragraph> <Paragraph position="2"> The results are shown in Table 1.</Paragraph> <Paragraph position="3"> In Table 1 we show the name of the test collection (Collection), the total number of documents (#Doc) and queries (#Query), and all relevant documents for all queries (#Rel) in that collection. For each document collection, we indicate the total number of relevant documents retrieved (Rel-ret), the recall (~), the total number of documents retrieved (Retdocs), and the precision t Rel-ret ~ for each of Ret-docs j no expansion (Base), expansion with synonyms (Exp. I), expansion with synonyms and hypernyms (Exp. II), expansion with synonyms and hyponyms (Exp. III), and expansion with synonyms, hypernyms, and hyponyms (Exp. IV).</Paragraph> <Paragraph position="4"> From the results in Table 1, we can conclude that query expansion can increase recall performance but unfortunately degrades precision performance. We thus turned to investigation of why all the relevant documents could not be retrieved with the query expansion method above.</Paragraph> <Paragraph position="5"> Some of the reasons are stated below : * Two terms that seem to be interrelated have different parts of speech in WordNet.</Paragraph> <Paragraph position="6"> This is the case between stochastic (adjective) and statistic (noun). Since words in WordNet are grouped on the basis of part of speech in WordNet, it is not possible to find a relationship between terms with different parts of speech.</Paragraph> <Paragraph position="7"> * Most of relationships between two terms are not found in WordNet. For example how do we know that Sumitomo Bank is a Japanese company ? * Some terms are not included in WordNet (proper name, etc).</Paragraph> <Paragraph position="8"> To overcome all the above problems, we propose a method to enrich WordNet with an automatically constructed thesaurus. The idea underlying this method is that an automatically constructed thesaurus could complement the drawbacks of WordNet. For example, as we stated earlier, proper names and their inter-relations among them are not found in Word-Net, but if proper names and other terms have some strong relationship, they often cooccur in the document, so that their relationship may be modelled by an automatically constructed thesaurus. null Polysemous words degrade the precision of information retrieval since all senses of the original query term are considered for expansion. 
<Paragraph position="9"> Polysemous words degrade the precision of information retrieval, since all senses of the original query term are considered for expansion. To overcome this problem, we apply the restriction that queries are expanded by adding those terms that are most similar to the entirety of the query, rather than terms that are similar to a single query term.</Paragraph> <Paragraph position="10"> In the next section we describe the details of our method.</Paragraph> </Section> <Section position="5" start_page="32" end_page="324" type="metho"> <SectionTitle> 3 Method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="32" end_page="324" type="sub_section"> <SectionTitle> 3.1 Co-occurrence-based Thesaurus </SectionTitle> <Paragraph position="0"> The general idea underlying the use of term co-occurrence data for thesaurus construction is that words that tend to occur together in documents are likely to have similar, or related, meanings. Co-occurrence data thus provides a statistical method for automatically identifying semantic relationships that are normally contained in a hand-made thesaurus. Suppose two words A and B occur f_a and f_b times, respectively, and co-occur f_c times; then the similarity between A and B can be calculated using a similarity coefficient such as the Dice coefficient: sim(A, B) = 2 f_c / (f_a + f_b).</Paragraph> </Section>
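To make Section 3.1 concrete, here is a minimal sketch of a Dice-coefficient thesaurus, under the assumption that co-occurrence is counted at the document level; the input format and all names are ours, not the paper's.

```python
# Sketch of a co-occurrence-based thesaurus using the Dice coefficient.
# `docs` is assumed to be a list of tokenized documents.
from collections import Counter
from itertools import combinations

def dice_thesaurus(docs):
    term_freq = Counter()   # f_a: number of documents containing term a
    pair_freq = Counter()   # f_c: number of documents containing both a and b
    for doc in docs:
        terms = set(doc)                 # count each term once per document
        term_freq.update(terms)
        pair_freq.update(combinations(sorted(terms), 2))
    # sim(A, B) = 2 f_c / (f_a + f_b)
    return {(a, b): 2.0 * fc / (term_freq[a] + term_freq[b])
            for (a, b), fc in pair_freq.items()}

docs = [["bank", "loan", "credit"], ["bank", "money"], ["loan", "credit"]]
print(dice_thesaurus(docs))
```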
<Section position="2" start_page="324" end_page="324" type="sub_section"> <SectionTitle> 3.2 Predicate-Argument-based Thesaurus </SectionTitle> <Paragraph position="0"> In contrast with the previous section, this method attempts to construct a thesaurus from predicate-argument structures. The use of this method for thesaurus construction is based on the idea that there are restrictions on what words can appear in certain environments, and in particular, what words can be arguments of a certain predicate. For example, a cat may walk or bite, but cannot fly. Each noun may therefore be characterized according to the verbs or adjectives that it occurs with. Nouns may then be grouped according to the extent to which they appear in similar constructions.</Paragraph> <Paragraph position="1"> First, all the documents are parsed using the Apple Pie Parser, a probabilistic chart parser developed by Satoshi Sekine (Sekine and Grishman, 1995). Then the following syntactic structures are extracted: subject-verb, verb-object, and adjective-noun relations.</Paragraph> <Paragraph position="2"> Each noun has a set of verbs and adjectives that it occurs with, and for each such relationship a Dice coefficient value is calculated: C_sub(v_i, n_j) = 2 f_sub(v_i, n_j) / (f(v_i) + f_sub(n_j)), where f_sub(v_i, n_j) is the frequency of noun n_j occurring as the subject of verb v_i, f_sub(n_j) is the frequency of noun n_j occurring as the subject of any verb, and f(v_i) is the frequency of verb v_i; C_obj(v_i, n_j) = 2 f_obj(v_i, n_j) / (f(v_i) + f_obj(n_j)), where f_obj(v_i, n_j) is the frequency of noun n_j occurring as the object of verb v_i, f_obj(n_j) is the frequency of noun n_j occurring as the object of any verb, and f(v_i) is the frequency of verb v_i; C_adj(a_i, n_j) = 2 f_adj(a_i, n_j) / (f(a_i) + f_adj(n_j)), where f_adj(a_i, n_j) is the frequency of noun n_j occurring as an argument of adjective a_i, f_adj(n_j) is the frequency of noun n_j occurring as an argument of any adjective, and f(a_i) is the frequency of adjective a_i.</Paragraph> <Paragraph position="3"> We define the similarity of two nouns with respect to one predicate as the minimum of the two Dice coefficients with respect to that predicate, i.e. SIM_sub(v_i, n_j, n_k) = min{C_sub(v_i, n_j), C_sub(v_i, n_k)}, SIM_obj(v_i, n_j, n_k) = min{C_obj(v_i, n_j), C_obj(v_i, n_k)}, and SIM_adj(a_i, n_j, n_k) = min{C_adj(a_i, n_j), C_adj(a_i, n_k)}.</Paragraph> <Paragraph position="4"> Finally, the overall similarity between two nouns is defined as the average of all the similarities between those two nouns over all predicate-argument structures.</Paragraph> </Section>
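A minimal sketch of the Section 3.2 computation, assuming extracted (relation, predicate, noun) triples as input. The paper leaves open exactly which predicate-argument structures enter the final average; here we average over environments shared by both nouns, which is one plausible reading. All names and the input format are ours.

```python
# Sketch of predicate-argument-based noun similarity (Section 3.2).
from collections import Counter
from statistics import mean

def pa_similarity(triples, n_j, n_k):
    """Average, over (relation, predicate) environments shared by n_j and
    n_k, of min(C(pred, n_j), C(pred, n_k)), with C the Dice coefficient."""
    pred_freq = Counter()    # f(v_i) per (relation, predicate)
    noun_freq = Counter()    # f_rel(n) per (relation, noun)
    joint_freq = Counter()   # f_rel(v_i, n)
    for rel, pred, noun in triples:
        pred_freq[(rel, pred)] += 1
        noun_freq[(rel, noun)] += 1
        joint_freq[(rel, pred, noun)] += 1

    def dice(rel, pred, noun):
        return (2.0 * joint_freq[(rel, pred, noun)]
                / (pred_freq[(rel, pred)] + noun_freq[(rel, noun)]))

    sims = []
    for rel, pred in pred_freq:
        c_j, c_k = dice(rel, pred, n_j), dice(rel, pred, n_k)
        if c_j > 0 and c_k > 0:   # environment shared by both nouns
            sims.append(min(c_j, c_k))
    return mean(sims) if sims else 0.0

triples = [("sub", "walk", "cat"), ("sub", "walk", "dog"),
           ("obj", "feed", "cat"), ("obj", "feed", "dog"),
           ("adj", "furry", "cat"), ("adj", "furry", "dog")]
print(pa_similarity(triples, "cat", "dog"))
```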
<Section position="3" start_page="324" end_page="324" type="sub_section"> <SectionTitle> 3.3 Expansion Term Weighting Method </SectionTitle> <Paragraph position="0"> A query q is represented by a vector q = (q_1, q_2, ..., q_n), where the q_i are the weights of the search terms t_i contained in query q.</Paragraph> <Paragraph position="1"> The similarity between a query q and a term t_j can be defined as follows: sim_qt(q, t_j) = sum over all t_i in q of q_i * sim(t_i, t_j), where the value of sim(t_i, t_j) is defined as the average of the similarity values in the three types of thesaurus. Since in WordNet there are no similarity weights, when there is a relation between two terms in WordNet their similarity is taken as the average of the similarities between those two terms in the co-occurrence-based and predicate-argument-based thesauri.</Paragraph> <Paragraph position="2"> With respect to the query q, all the terms in the collection can now be ranked according to their sim_qt. Expansion terms are terms t_j with high sim_qt(q, t_j).</Paragraph> <Paragraph position="3"> The weight(q, t_j) of an expansion term t_j is defined as a function of sim_qt(q, t_j): weight(q, t_j) = sim_qt(q, t_j) / (sum over all t_i in q of q_i), where 0 <= weight(q, t_j) <= 1.</Paragraph> <Paragraph position="4"> An expansion term gets a weight of 1 if its similarity to all the terms in the query is 1. Expansion terms with similarity 0 to all the terms in the query get a weight of 0. The weight of an expansion term thus depends both on the entire retrieval query and on the similarity between the terms. Mathematically, the weight of an expansion term can be interpreted as the weighted mean of the similarities between the term t_j and all the query terms, with the weights of the original query terms as the weighting factors.</Paragraph> <Paragraph position="5"> Therefore the query q is expanded by adding the following query: q_e = (a_1, a_2, ..., a_r), where a_j is equal to weight(q, t_j) if t_j belongs to the top r ranked terms, and 0 otherwise.</Paragraph> <Paragraph position="6"> The resulting expanded query is q_expanded = q o q_e, where o is the concatenation operator.</Paragraph> <Paragraph position="7"> The method above can accommodate the polysemous word problem, because an expansion term which is taken from a different sense to the original query term is given a very low weight.</Paragraph> </Section> </Section>
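A minimal sketch of the Section 3.3 weighting, assuming a combined term-term similarity function `sim` (the average over the three thesauri); function and variable names are ours.

```python
# Sketch of the expansion-term weighting method (Section 3.3).
def weight(query, term, sim):
    """query: dict mapping query terms t_i to weights q_i.
    Returns the weighted mean of sim(t_i, term) over all query terms:
    sim_qt(q, term) / sum of q_i."""
    simqt = sum(q_i * sim(t_i, term) for t_i, q_i in query.items())
    return simqt / sum(query.values())

def expand(query, candidates, sim, r=10):
    """Add the top-r candidate terms, each weighted by weight(q, t_j)."""
    scored = {t: weight(query, t, sim) for t in candidates if t not in query}
    top = sorted(scored, key=scored.get, reverse=True)[:r]
    expanded = dict(query)               # q_expanded = q o q_e
    expanded.update({t: scored[t] for t in top})
    return expanded

# Toy usage: "loan" is similar to both query terms, "river" to neither,
# so only "loan" receives a non-zero expansion weight.
query = {"bank": 1.0, "finance": 0.5}
sim = lambda a, b: 1.0 if (a, b) in {("bank", "loan"), ("finance", "loan")} else 0.0
print(expand(query, ["loan", "river"], sim, r=1))
```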
<Section position="6" start_page="324" end_page="324" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> In order to evaluate the effectiveness of the method proposed in the previous section, we conducted experiments using the WSJ, CACM, INSPEC, CISI, Cranfield, NPL, and LISA test collections. The WSJ collection comprises part of the TREC collection (Voorhees and Harman, 1997). As a baseline we used SMART (Salton, 1971) without expansion. SMART is an information retrieval engine based on the vector space model in which term weights are calculated from term frequency, inverse document frequency, and document length normalization. The results are shown in Table 2. This table shows the average of 11-point uninterpolated recall-precision for the baseline, expansion using only WordNet, expansion using only the predicate-argument-based thesaurus, expansion using only the co-occurrence-based thesaurus, and expansion using all of them. For each method we give the percentage of improvement over the baseline. The performance using the combined thesauri for query expansion is better than both the SMART baseline and expansion using any single type of thesaurus.</Paragraph> </Section> <Section position="7" start_page="324" end_page="324" type="metho"> <SectionTitle> 5 Discussions </SectionTitle> <Paragraph position="0"> In this section we discuss why our method of using WordNet is able to improve the performance of information retrieval. The important points of our method are: * the coverage of WordNet is broadened, and * the expansion terms are weighted appropriately. The three types of thesaurus we used have different characteristics. Automatically constructed thesauri add not only new terms but also new relationships not found in WordNet. If two terms often co-occur in a document, then those two terms are likely to bear some relationship. Why not use only the automatically constructed thesauri? The answer is that some relationships may be missing in the automatically constructed thesauri. For example, consider the words tumor and tumour.</Paragraph> <Paragraph position="1"> These words certainly share the same context, but would never appear in the same document, at least not with a frequency recognized by a co-occurrence-based method. In general, different words used to describe similar concepts may never be used in the same document, and are thus missed by the co-occurrence methods. However, their relationship may be found in the WordNet thesaurus.</Paragraph> <Paragraph position="2"> The second point is our weighting method.</Paragraph> <Paragraph position="3"> As mentioned before, most attempts at automatically expanding queries by means of WordNet have failed to improve retrieval effectiveness. The opposite has often been true: expanded queries were less effective than the original queries. Besides the &quot;incomplete&quot; nature of WordNet, we believe that a further problem, the weighting of expansion terms, has not been solved. All weighting methods described in past research on query expansion using WordNet have been based on &quot;trial and error&quot; or ad-hoc methods; that is, they have no underlying justification.</Paragraph> <Paragraph position="4"> The advantages of our weighting method are: * the weight of each expansion term considers the similarity of that term to all terms in the original query, rather than to just one or a few query terms; * the weight of the expansion term accommodates the polysemous word problem.</Paragraph> <Paragraph position="5"> This method can accommodate the polysemous word problem because an expansion term taken from a different sense to the original query term is given a very low weight. The reason for this is that the weighting method depends on all query terms and all of the thesauri. For example, the word bank has many senses in WordNet. Two such senses are the financial institution and the river edge senses. In a document collection relating to financial banks, the river sense of bank will generally not be found in the co-occurrence-based thesaurus because of a lack of articles talking about rivers. Even if, improbably, there are some documents in the collection talking about rivers, when the query contains the finance sense of bank the other terms in the query will also be concerned with finance and not rivers. Thus river terms would bear a relationship only to the bank term and to no other terms in the original query, resulting in a low weight. Since our weighting method depends on both the query in its entirety and the similarity in the three thesauri, wrong-sense expansion terms are given very low weight.</Paragraph> </Section> <Section position="8" start_page="324" end_page="324" type="metho"> <SectionTitle> 6 Related Research </SectionTitle> <Paragraph position="0"> Smeaton (Smeaton and Berrut, 1995) and Voorhees (Voorhees, 1994) have proposed expansion methods using WordNet. Our method differs from theirs in that we enrich the coverage of WordNet using two methods of automatic thesaurus construction, and we weight the expansion terms appropriately so that the method can accommodate the polysemous word problem.</Paragraph> <Paragraph position="1"> Although Stairmand (Stairmand, 1997) and Richardson (Richardson and Smeaton, 1995) have proposed the use of WordNet in information retrieval, they did not use WordNet in a query expansion framework.</Paragraph> <Paragraph position="2"> Our predicate-argument-structure-based thesaurus is based on the method proposed by Hindle (Hindle, 1990), although Hindle did not apply it to information retrieval. He used mutual information statistics as a similarity coefficient, whereas we use the Dice coefficient for normalization purposes. Hindle extracted only the subject-verb and object-verb predicate-arguments, while we also extract adjective-noun predicate-arguments.</Paragraph> <Paragraph position="3"> Our weighting method follows the Qiu method (Qiu and Frei, 1993), except that Qiu used it to expand terms from only a single automatically constructed thesaurus and did not consider the use of more than one thesaurus.</Paragraph> </Section> </Paper>