<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0624">
  <Title>Lexical ambiguity and Information Retrieval revisited</Title>
  <Section position="3" start_page="195" end_page="195" type="metho">
    <SectionTitle>
2 The IR-SEMCOR test collection
</SectionTitle>
    <Paragraph position="0"> The best-known publicly available corpus hand-tagged with WordNet senses is SEMCOR (Miller et al., 1993), a subset of the Brown Corpus of about 100 documents that occupies about 2.4 Mb. of text (22Mb. including annotations). The collection is rather heterogeneous, covering politics, sports, music, cinema, philosophy, excerpts from fiction novels, scientific texts...</Paragraph>
    <Paragraph position="1"> We adapted SEMCOR in order to build a test collection -that we call IR-SEMCOR- in four manual steps: * We have split the documents in Semcor 1.5 to get coherent chunks of text for retrieval. We have obtained 171 fragments with an average length of 1331 words per fragment. The new documents in Semcor 1.6 have been added without modification (apart from mapping Wordnet 1.6 to WordNet 1.5 senses), up to a total of 254 documents.</Paragraph>
    <Paragraph position="2"> * We have extended the original TOPIC tags of the Brown Corpus with a hierarchy of sub-tags, assigning a set of tags to each text in our collection. This is not used in the experiments reported here.</Paragraph>
    <Paragraph position="3"> * We have written a summary for each of the first 171 fragments, with lengths varying between 4 and 50 words and an average of 22 words per summary. Each summary is a human explanation of the text contents, not a mere bag of related keywords.</Paragraph>
    <Paragraph position="4"> * Finally, we have hand-tagged each of the summaries with WordNet 1.5 senses. When a word or term was not present in the database, it was left unchanged. In general, such terms correspond to proper nouns; in particular, groups (vg. Fulton_County_Grand_Jury), persons ( Cervantes) or locations (Fulton).</Paragraph>
    <Paragraph position="5"> We also generated a list of &amp;quot;stop-senses&amp;quot; and a list of &amp;quot;stop-synsets&amp;quot;, automatically translating a standard list of stop words for English.</Paragraph>
    <Paragraph position="6"> In our first experiments (Gonzalo et al., 1998; Gonzalo et al., 1999), the summaries were used as queries, and every query was expected to retrieve exactly one document (the one summarized by the query). In order to have a more standard set of relevance judgments, we have used the following assumption here: if an original Semcor document was split into n chunks in our test collection, the summary of each of the chunks should retrieve all the chunks of the original document. This gave us 82 queries with an average of 6.8 relevant documents per query. In order to test the plausibility of this artificial set of relevance judgments, we produced an alternative set of random relevance judgments. This is used as a baseline and included for comparison in all the results presented in this paper.</Paragraph>
    <Paragraph position="7"> The retrieval engine used in the experiments reported here is the INQUERY system (Callan et al., 1992).</Paragraph>
  </Section>
  <Section position="4" start_page="195" end_page="196" type="metho">
    <SectionTitle>
3 Lexical Ambiguity and IR
</SectionTitle>
    <Paragraph position="0"> Sanderson used a technique previously introduced in (Yarowski, 1993) to evaluate Word Sense Disambiguators. Given a text collection, a (size 2) pseudo-word collection is obtained by substituting all occurrences of two randomly chosen words (say, bank and spring) by a new ambiguous word (bank/spring).</Paragraph>
    <Paragraph position="1"> Disambiguating each occurrence of this pseudo-word consists on finding whether the original term was either bank or spring. Note that we are not strictly discriminating senses, but also conflating synonym senses of different words. We previously showed (Gonzalo et al., 1998) that WordNet synsets seem better indexing terms than senses.</Paragraph>
    <Paragraph position="2"> Sanderson used an adapted version of the Reuters text categorization collection for his experiment, and produced versions with pseudo-words of size 2 to 10 words per pseudo-word. Then he evaluated the decrease of IR performance as the ambiguity of the indexing terms is increased. He found that the results were quite insensitive to ambiguity, except for the shortest queries.</Paragraph>
    <Paragraph position="3"> We have reproduce Sanderson's experiment for pseudo-words ranging from size 1 (unmodified) to size 5. But when the pseudo-word bank/spring is disambiguated as spring, this term remains ambiguous: it can be used as springtime, or hook, or to jump, etc. We have, therefore, produced another collection of &amp;quot;ambiguity 0&amp;quot;, substituting each word by its Word-Net 1.5 semantic tag. For instance, spring could be  substituted for n07062238, which is a unique identifier for the synset {spring, springtime: the season o/ growth}.</Paragraph>
    <Paragraph position="4"> The results of the experiment can be seen in Figure 1. We provide 10-point average precision measures 1 for ambiguity 0 (synsets), 1 (words), and 2 to 5 (pseudo-words of size 2,3,4,5). Three curves are plotted: all queries, shortest queries, and longer queries. It can be: seen that: * The decrease of IR performance from synset indexing to word indexing (the slope of the left-most part of: the figure) is more accused than the effects of adding pseudoword ambiguity (the rest of the figure). Thus, reducing real ambiguity seems more useful than reducing pseudo-word ambiguity.</Paragraph>
    <Paragraph position="5"> * The curve for shorter queries have a higher slope, confirming that resolving ambiguity is more benefitial when the relative contribution of each query term is higher. This is true both for real ambiguity and pseudo-word ambiguity.</Paragraph>
    <Paragraph position="6"> Note, however , that the role of real ambiguity is more important for longer queries than pseudo-word ambiguity: the curve for longer queries has a high slope from synsets to words, but it is very smooth from size 1 to size 5 pseudo-words.</Paragraph>
    <Paragraph position="7"> * In our experiments, shorter queries behave better than longer queries for synset indexing (the leftmost points of the curves). This unexpected  documents are fragments from original Semcor texts, and we hypothesize that fragments of one text are relevant to each other. The shorter summaries are correlated with text chunks that have more cohesion (for instance, a Semcor text is split into several IRSemcor documents that comment on different baseball matches).</Paragraph>
    <Paragraph position="8"> Longer summaries behave the other way round: IRSemcor documents correspond to less cohesive text chunks. As introducing ambiguity is more harming for shorter queries, this effect is quickly shadowed by the effects of ambiguity.</Paragraph>
  </Section>
  <Section position="5" start_page="196" end_page="198" type="metho">
    <SectionTitle>
4 WSD and IR
</SectionTitle>
    <Paragraph position="0"> The second experiment carried out by Sanderson was to disambiguate the size 5 collection introducing fixed error rates (thus, the original pseudo-word collection would correspond to 100% correct disambiguation). In his collection, disambiguating below 90% accuracy produced worse results than not disambiguating at all. He concluded that WSD needs to be extremely accurate to improve retrieval results rather than decreasing them.</Paragraph>
    <Paragraph position="1"> We have reproduce his experiment with our size 5 pseudo-words collection, ranging from 0% to 50% error rates (100% to 50% accuracy). In this case, we have done a parallel experiment performing real Word Sense Disambiguation on the original text collection, introducing the fixed error rates with respect to the manual semantic tags. The error rate is understood as the percentage of polysemous words in- null i I i i I t i l Synset indexing with WSD errors -e---Text (no disambiguation thresold for real words) ..... Size 5 pseudowords with WSD errors -+--Size 5 pseudowords (no desambiguation thresold for size 5 pseudowords) ........... Random retdeval (baseline) .....</Paragraph>
    <Paragraph position="2"> .t~xL- ....... ..... g:.... .................................................................... __._____~_ . . ...... + .......... -4- ........... -I~..</Paragraph>
    <Paragraph position="3"> size 5 pseudowords '-,+.. .+...</Paragraph>
    <Paragraph position="5"> correctly disambiguated.</Paragraph>
    <Paragraph position="6"> The results of both experiments can be seen in Figure 2. We have plotted 10-point average precision in the Y-axis against increasing percentage of errors in the X-axis. The curve representing real WSD has as a threshold the 10-pt average precision for plain text, and the curve representing pseudo-disambiguation on the size-5 pseudo-word collection has as threshold the results for the size-5 collection without disambiguation. From the figure it can be seen that: * In the experiment with size 5 pseudo-word disambiguation, our collections seems to be more resistant to WSD errors than the Reuters collection. The 90% accuracy threshold is now 75%. * The experiment with real disambiguation is more tolerant to WSD errors. Above 60% accuracy (40% error rate) it is possible to improve the results of retrieval with plain text.</Paragraph>
    <Paragraph position="7"> The discrepancy between the behavior of pseudo-words and real ambiguous terms may reside in the nature of real polysemy: * Unlike the components of a pseudo-word, the different meanings of a real, polysemous word are often related. In (Buitelaar, 1998) it is estimated that only 5% of the word stems in WordNet can be viewed as true homonyms (unrelated senses), while the remaining 95% polysemy can be seen as predictable extensions of a core sense (regular polysemy). Therefore, a disambiguation error might be less harmful if a strongly related term is chosen. This fact Mso suggests that Information Retrieval does not necessarily demand full disambiguation. Rather than picking one sense and discarding the rest, WSD in IR should probably weight senses according to their plausibility, discarding only the less likely ones. This is used in (Schiitze and Pedersen, 1995) to get a 14% improvement of the retrieval performance disambiguating with a co-occurrence-based induced thesaurus. This is an issue that arises naturally when translating queries for Cross-Language Text Retrieval, in contrast to Machine Translation. A Machine Translation system has to choose one single translation for every term in the source document. However, a translation of a query in Cross-Language retrieval has to pick up all likely translations for each word in the query.</Paragraph>
    <Paragraph position="8"> In (Gonzalo et al., 1999) we argue that mapping a word into word senses (or WordNet synsets) is strongly related to that problem.</Paragraph>
    <Paragraph position="9"> Although the average polysemy of the terms in the Semcor collection is around five (as in Sanderson's experiment), the average polysemy of WordNet 1.5 terms is between 2 and 3. The reason is that polysemy is correlated with frequency of usage. That means that the best discriminators for aquery will be (in general) the less polysemous terms. The more polysemous terms are more frequent and thus worse discriminators, and disambiguation errors are not</Paragraph>
    <Paragraph position="11"> as harmful as for the pseudo-words experiment.</Paragraph>
  </Section>
  <Section position="6" start_page="198" end_page="199" type="metho">
    <SectionTitle>
5 POS tagging and IR
</SectionTitle>
    <Paragraph position="0"> Among many other issues, Krovetz tested to what extent Part-Of-Speech information was a good source of evidence for sense discrimination. He annotated words in the TIME collection with the Church Part-Of-Speech tagger, and found that performance decreased. Krovetz was unable to determine whether the results were due to the tagging strategy or to the errors made by the tagger. He observed that, in many cases, words were related in meaning despite a difference in Part-Of-Speech (for instance, in &amp;quot;summer shoes design&amp;quot; versus &amp;quot;they design sandals&amp;quot;). But he also found that not all errors made by the tagger cause a decrease in retrieval performance. null We have reproduced the experiment by Krovetz in our test collection, using the Brill POS tagger, on one hand, and the manual POS annotations, on the other. The precision/recall curves are plotted in Figure 3 against plain text retrieval. That curves does not show any significant difference between the three approaches. A more detailed examination of some representative queries is more informative:</Paragraph>
    <Section position="1" start_page="198" end_page="198" type="sub_section">
      <SectionTitle>
5.1 Manual POS tagging vs. plain text
</SectionTitle>
      <Paragraph position="0"> Annotating Part-Of-Speech misses relevant information for some queries. For instance, a query containing "talented baseball player" can be matched against a relevant document containing "is one of the top talents of the time", because stemming conflates talented and talent. However, POS tagging gives ADJ/talent versus N/talent, which do not match. Another example is "skilled diplomat of an Asian country" versus "diplomatic policy", where N/diplomat and ADJ/diplomat are not matched.</Paragraph>
      <Paragraph position="1"> However, the documents where the matching terms agree in category are ranked much higher with POS tagging, because there are less competing documents. The two effects seem to compensate, producing a similar recall/precision curve on overall.</Paragraph>
      <Paragraph position="2"> Therefore, annotating Part-Of-Speech does not seem worthy as a standalone indexing strategy, even if tagging is performed manually. Perhaps giving partial credit to word occurrences with different POS would be an interesting alternative.</Paragraph>
      <Paragraph position="3"> Annotating POS, however, can be a useful intermediate task for IR. It is, for instance, a first step towards semantic annotation, which produced much better results in our experiments.</Paragraph>
    </Section>
    <Section position="2" start_page="198" end_page="199" type="sub_section">
      <SectionTitle>
5.2 Brill vs. manual tagging
</SectionTitle>
      <Paragraph position="0"> Although the Brill tagger makes more mistakes than the manual annotations (which are not error free anyway), the mistakes are not completely correlated to retrieval decrease. For instance, a query about &amp;quot;summer shoe design&amp;quot; is manually annotated as &amp;quot;summer/N shoe/N design/N&amp;quot;, while the Brill tagger produces &amp;quot;summer/N shoe/N design/if'. But an appropriate document contains &amp;quot;Italian designed sandals&amp;quot;, which is manually annotated as &amp;quot;Italian/ADJ designed/ADg sandals/N&amp;quot; (no match), but as &amp;quot;Italian/ADJ designed/V sandals/IV&amp;quot; by the Brill tagger (matches design and designed after stemming). null  In general, comparing with no tagging, the automatic and the manual tagging behave in a very similar way.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="199" end_page="199" type="metho">
    <SectionTitle>
6 Phrase indexing
</SectionTitle>
    <Paragraph position="0"> WordNet is rich in multiword entries (more than 55000 variants in WordNet 1.5). Therefore, such collocations are annotated as single entries in the Semcor and IR-Semcor collections. The manual annotation also includes name expressions for persons, groups, locations, institutions, etc., such as Drew Centennial Church or Mayor-nominate Ivan Allen Yr.. In (Krovetz, 1997), it is shown that the detection of phrases can be useful for retrieval, although it is crucial to assign partial credit also to the components of the collocation.</Paragraph>
    <Paragraph position="1"> We have performed an experiment to compare three different indexing strategies: 1. Use plain text both for documents and queries, without using phrase information.</Paragraph>
    <Paragraph position="2"> 2. Use manually annotated phrases as single indexing units in documents and queries. This means that New_York is a term unrelated to new or York (which seems clearly beneficial both for weighting and retrieval), but also that Drew_Centennial_Church would be a single indexing term unrelated to church, which can lead to precise matchings, but also to lose correct query/document correlations.</Paragraph>
    <Paragraph position="3"> 3. Use plain text for documents, but exploit the INQUERY #phrase query operator for the collocations in the query. For instance, meeting of the National_Football_League is expressed as #sum(meeting #phrase(National Football League)) in the query language.</Paragraph>
    <Paragraph position="4"> The #phrase operator assigns credit to the partial components of the phrase, while priming its co-occurrence.</Paragraph>
    <Paragraph position="5"> The results of the experiments can be seen in Figure 4. Overall, indexing with multiwords behaves slightly worse than standard word indexing. Using the INQUERY #phrase operator behaves similarly to word indexing.</Paragraph>
    <Paragraph position="6"> A closer look at some case studies, however, gives more information: * In some cases, simply indexing with phrases is obviously the wrong choice. For instance, a query containing &amp;quot;candidate in governor's_race&amp;quot; does not match &amp;quot;opened his race for governor'. This supports the idea that it is crucial to assign credit to the partial components of a phrase, and also. that it may be useful to look for co-occurrence beyond one word windows.</Paragraph>
    <Paragraph position="7"> * Phrase indexing works much better when the query is longer and there are relevant terms apart from one or more multiwords. In such cases, a relevant document containing just one query term is ranked much higher with phrase indexing, because false partial matches with a phrase are not considered. Just using the #phrase operator behaves mostly like no phrase indexing for these queries, because this filtering is not achieved.</Paragraph>
    <Paragraph position="8"> Phrase indexing seems more adequate when the query is intended to be precise, which is not the case of our collection (we assume that the summary of a fragment has all the fragments in the original text as relevant documents). For instance, &amp;quot;story of a famous strip cartoonist&amp;quot; is not related -with phrase indexing- to a document containing &amp;quot;detective_story&amp;quot;. This is correct if the query is intended to be strict, although in our collection these are fragments of the same text and thus we are assuming they are related. The same happens with the query &amp;quot;The board_of_regents of Paris_Junior_College has named the school's new president&amp;quot;, which is not related to &amp;quot;Junior or Senior High School Teaching Certificate&amp;quot;. This could be the right decision in a different relevance judgment setup, but it is wrong for our test collection.</Paragraph>
  </Section>
class="xml-element"></Paper>