<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1104">
  <Title>Automatically creating datasets for measures of semantic relatedness</Title>
  <Section position="5" start_page="16" end_page="17" type="metho">
    <SectionTitle>
3 Related work
</SectionTitle>
    <Paragraph position="0"> In the seminal work by Rubenstein and Goodenough (1965), similarity judgments were obtained from 51 test subjects on 65 noun pairs written on paper cards. Test subjects were instructed to order the cards according to the &amp;quot;similarity of meaning&amp;quot;  andthenassignacontinuoussimilarityvalue(0.04.0) to each card. Miller and Charles (1991) replicated the experiment with 38 test subjects judging on a subset of 30 pairs taken from the original 65 pairs. This experiment was again replicated by Resnik (1995) with 10 subjects. Table 1 summarizes previous experiments.</Paragraph>
    <Paragraph position="1"> A comprehensive evaluation of SR measures requires a higher number of word pairs. However, the original experimental setup is not scalable as ordering several hundred paper cards is a cumbersome task. Furthermore, semantic relatedness is an intuitive concept and being forced to assign fine-grained continuous values is felt to overstrain the test subjects. Gurevych (2005) replicated the experiment of Rubenstein and Goodenough with the original 65 word pairs translated into German.</Paragraph>
    <Paragraph position="2"> She used an adapted experimental setup where test subjects had to assign discrete values {0,1,2,3,4} and word pairs were presented in isolation. This setup is also scalable to a higher number of word pairs (350) as was shown in Gurevych (2006).</Paragraph>
    <Paragraph position="3"> Finkelstein et al. (2002) annotated a larger set of word pairs (353), too. They used a 0-10 range of relatedness scores, but did not give further details about their experimental setup. In psycholinguistics, relatedness of words can also be determined through association tests (Schulte im Walde and Melinger, 2005). Results of such experiments are hard to quantify and cannot easily serve as the basis for evaluating SR measures.</Paragraph>
    <Paragraph position="4"> Rubenstein and Goodenough selected word pairs analytically to cover the whole spectrum of</Paragraph>
  </Section>
  <Section position="6" start_page="17" end_page="17" type="metho">
    <SectionTitle>
CORRELATION
PAPER LANGUAGE PAIRS POS REL-TYPE SCORES # SUBJECTS INTER INTRA
</SectionTitle>
    <Paragraph position="0"> similarity from &amp;quot;not similar&amp;quot; to &amp;quot;synonymous&amp;quot;.</Paragraph>
    <Paragraph position="1"> This elaborate process is not feasible for a larger dataset or if domain-specific test sets should be compiled quickly. Therefore, we automatically create word pairs using a corpus-based approach.</Paragraph>
    <Paragraph position="2"> We assume that due to lexical-semantic cohesion, texts contain a sufficient number of words related by means of different lexical and semantic relations. Resulting from our corpus-based approach, test sets will also contain domain-specific terms. Previous studies only included general terms as opposed to domain-specific vocabularies and therefore failed to produce datasets that can be used to evaluate the ability of a measure to cope with domain-specific or technical terms. This is an important property if semantic relatedness is used in information retrieval where users tend to use specific search terms (Porsche) rather than general ones (car).</Paragraph>
    <Paragraph position="3"> Furthermore, manually selected word pairs are often biased towards highly related pairs (Gurevych, 2006), because human annotators tend to select only highly related pairs connected by relations they are aware of. Automatic corpus-based selection of word pairs is more objective, leading to a balanced dataset with pairs connected by all kinds of lexical-semantic relations. Morris and Hirst (2004) pointed out that many relations between words in a text are non-classical (i.e. other than typical taxonomic relations like synonymy or hypernymy) and therefore not covered by semantic similarity.</Paragraph>
    <Paragraph position="4"> Previous studies only considered semantic relatedness (or similarity) of words rather than concepts. However, polysemous or homonymous words should be annotated on the level of concepts. If we assume that bank has two meanings (&amp;quot;financial institution&amp;quot; vs. &amp;quot;river bank&amp;quot;)5 and it is paired with money, the result is two sense quali5WordNet lists 10 meanings.</Paragraph>
    <Paragraph position="5"> fied pairs (bankfinancial - money) and (bankriver - money). It is obvious that the judgments on the twoconceptpairsshoulddifferconsiderably. Concept annotated datasets can be used to test the ability of a measure to differentiate between senses when determining the relatedness of polysemous words. To our knowledge, this study is the first to include concept pairs and to automatically generate the test dataset.</Paragraph>
    <Paragraph position="6"> In our experiment, we annotated a high number of pairs similar in size to the test sets by Finkelstein(2002)andGurevych(2006). Weusedtherevised experimental setup (Gurevych, 2005), based on discrete relatedness scores and presentation of word pairs in isolation, that is scalable to the higher number of pairs. We annotated semantic relatedness instead of similarity and included also non noun-noun pairs. Additionally, our corpus-based approach includes domain-specific technical terms and enables evaluation of the robustness of a measure.</Paragraph>
  </Section>
  <Section position="7" start_page="17" end_page="19" type="metho">
    <SectionTitle>
4 Experiment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="17" end_page="18" type="sub_section">
      <SectionTitle>
4.1 System architecture
</SectionTitle>
      <Paragraph position="0"> Figure 1 gives an overview of our automatic corpus-based system for creating test datasets for evaluating SR measures.</Paragraph>
      <Paragraph position="1"> In the first step, a source corpus is preprocessed using tokenization, POS-tagging and lemmatization resulting in a list of POS-tagged lemmas. Randomly generating word pairs from this list would result in too many unrelated pairs, yielding an unbalanced dataset. Thus, we assign weights to each word (e.g. using tf.idf-weighting). The most important document-specific words get the highest weights and due to lexical cohesion of the documents many related words can be found among the top rated. Therefore, we randomly generate a user-defined number of word pairs from the r wordswiththehighestweightsforeachdocument.</Paragraph>
      <Paragraph position="2">  In the next step, user defined filters are applied to the initial list of word pairs. For example, a filter can remove all pairs containing only uppercase letters (mostly acronyms). Another filter can enforce a certain fraction of POS combinations to be present in the result set.</Paragraph>
      <Paragraph position="3"> As we want to obtain judgment scores for semantic relatedness of concepts instead of words, we have to include all word sense combinations of a pair in the list. An external dictionary of word senses is necessary for this step. It is also used to add a gloss for each word sense that enables test subjects to distinguish between senses.</Paragraph>
      <Paragraph position="4"> If differences in meaning between senses are very fine-grained, distinguishing between them is hard even for humans (Mihalcea and Moldovan, 2001).6 Pairs containing such words are not suitable for evaluation. To limit their impact on the experiment, a threshold for the maximal number of senses can be defined. Words with a number of senses above the threshold are removed from the list.</Paragraph>
      <Paragraph position="5"> The result of the extraction process is a list of sense disambiguated, POS-tagged pairs of concepts. null 6E.g. the German verb &amp;quot;halten&amp;quot; that can be translated as hold, maintain, present, sustain, etc. has 26 senses in GermaNet. null</Paragraph>
    </Section>
    <Section position="2" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
4.2 Experimental setup
4.2.1 Extraction of concept pairs
</SectionTitle>
      <Paragraph position="0"> We extracted word pairs from three different domain-specific corpora (see Table 2). This is motivated by the aim to enable research in information retrieval incorporating SR measures. In particular, the &amp;quot;Semantic Information Retrieval&amp;quot; project (SIR Project, 2006) systematically investigates the use of lexical-semantic relations between words or concepts for improving the performance of information retrieval systems.</Paragraph>
      <Paragraph position="1"> The BERUFEnet (BN) corpus7 consists of descriptions of 5,800 professions in Germany and therefore contains many terms specific to professional training. Evaluating semantic relatedness on a test set based on this corpus may reveal the ability of a measure to adapt to a very special domain. The GIRT (German Indexing and Retrieval Testdatabase) corpus (Kluck, 2004) is a collection of abstracts of social science papers. It is a standard corpus for evaluating German information retrieval systems. The third corpus is compiled from 106 arbitrarily selected scientific PowerPoint presentations (SPP). They cover a wide range of topics from bio genetics to computer science and contain many technical terms. Due to the special structure of presentations, this corpus will be particularly demanding with respect to the  requiredpreprocessingcomponentsofaninformation retrieval system. The three preprocessing steps (tokenization, POS-tagging, lemmatization) are performed using TreeTagger (Schmid, 1995). The resulting list of POS-tagged lemmas is weighted using the SMART 'ltc'8 tf.idf-weighting scheme (Salton, 1989).</Paragraph>
      <Paragraph position="2"> We implemented a set of filters for word pairs.</Paragraph>
      <Paragraph position="3"> One group of filters removed unwanted word pairs. Word pairs are filtered if they contain at least one word that a) has less than three letters b) contains only uppercase letters (mostly acronyms) or c) can be found in a stoplist. Another filter enforced a specified fraction of combinations of nouns (N), verbs (V) and adjectives (A) to be present in the result set. We used the following parameters: NN = 0.5, NV = 0.15, NA = 0.15, VV = 0.1, VA = 0.05, AA = 0.05. That means 50% of the resulting word pairs for each corpus  were noun-noun pairs, 15% noun-verb pairs and so on.</Paragraph>
      <Paragraph position="4"> Word pairs containing polysemous words are expanded to concept pairs using GermaNet (Kunze, 2004), the German equivalent to WordNet, as a sense inventory for each word. It is the most complete resource of this type for German. null GermaNet contains only a few conceptual glosses. As they are required to enable test subjects to distinguish between senses, we use artificial glosses composed from synonyms and hypernyms as a surrogate, e.g. for brother: &amp;quot;brother, male sibling&amp;quot; vs. &amp;quot;brother, comrade, friend&amp;quot; (Gurevych, 2005). We removed words which had more than three senses.</Paragraph>
      <Paragraph position="5"> Marginal manual post-processing was necessary, since the lemmatization process introduced some errors. Foreign words were translated into German, unless they are common technical terminology. We initially selected 100 word pairs from each corpus. 11 word pairs were removed because they comprised non-words. Expanding the word list to a concept list increased the size of the list. Thus, the final dataset contained 328 automatically created concept pairs.</Paragraph>
      <Paragraph position="6">  We developed a web-based interface to obtain human judgments of semantic relatedness for each automatically generated concept pair. Test subjects were invited via email to participate in the experiment. Thus, they were not supervised during the experiment.</Paragraph>
      <Paragraph position="7"> Gurevych (2006) observed that some annotators were not familiar with the exact definition of semantic relatedness. Their results differed particularly in cases of antonymy or distributionally related pairs. We created a manual with a detailed introduction to SR stressing the crucial points.</Paragraph>
      <Paragraph position="8"> The manual was presented to the subjects before the experiment and could be re-accessed at any  words are defined by means of synonyms and related words.</Paragraph>
      <Paragraph position="9"> During the experiment, one concept pair at a time was presented to the test subjects in random ordering. Subjects had to assign a discrete relatednessvalue{0,1,2,3,4} toeachpair. Figure 2shows the system's GUI.</Paragraph>
      <Paragraph position="10"> In case of a polysemous word, synonyms or related words were presented to enable test subjects to understand the sense of a presented concept. Because this additional information can lead to undesirable priming effects, test subjects were instructed to deliberately decide only about the relatedness of a concept pair and use the gloss solely to understand the sense of the presented concept. Since our corpus-based approach includes domain-specific vocabulary, we could not assume that the subjects were familiar with all words.</Paragraph>
      <Paragraph position="11"> Thus, they were instructed to look up unknown words in the German Wikipedia.9 Several test subjects were asked to repeat the experiment with a minimum break of one day. Results from the repetition can be used to measure intra-subject correlation. They can also be used to obtain some hints on varying difficulty of judgment for special concept pairs or parts-of-speech.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="19" end_page="20" type="metho">
    <SectionTitle>
5 Results and discussion
</SectionTitle>
    <Paragraph position="0"> 21 test subjects (13 males, 8 females) participated in the experiment, two of them repeated it. The average age of the subjects was 26 years. Most subjects had an IT background. The experiment took 39 minutes on average, leaving about 7 seconds for rating each concept pair.</Paragraph>
    <Paragraph position="1"> The summarized inter-subject correlation between 21 subjects was r=.478 (cf. Table 3), which</Paragraph>
  </Section>
  <Section position="9" start_page="20" end_page="21" type="metho">
    <SectionTitle>
CONCEPTS WORDS
INTER INTRA INTER INTRA
</SectionTitle>
    <Paragraph position="0"> all pairs, grouped by corpus and grouped by POS combinations.</Paragraph>
    <Paragraph position="1"> is statistically significant at p &lt; .05. This correlation coefficient is an upper bound of performance for automatic SR measures applied on the same dataset.</Paragraph>
    <Paragraph position="2"> Resnik (1995) reported a correlation of r=.9026.10 The results are not directly comparable, because he only used noun-noun pairs, words instead of concepts, a much smaller dataset, and measured semantic similarity instead of semantic relatedness. Finkelstein et al. (2002) did not report inter-subject correlation for their larger dataset. Gurevych (2006) reported a correlation of r=.69. Test subjects were trained students of computational linguistics, and word pairs were selected analytically.</Paragraph>
    <Paragraph position="3"> Evaluating the influence of using concept pairs instead of word pairs is complicated because word level judgments are not directly available. Therefore, we computed a lower and an upper bound for correlation coefficients. For the lower bound, we always selected the concept pair with highest standard deviation from each set of corresponding concept pairs. The upper bound is computed by selecting the concept pair with the lowest standard deviation. The differences between correlation co-efficient for concepts and words are not significant. Table 3 shows only the lower bounds.</Paragraph>
    <Paragraph position="4"> Correlation coefficients for experiments measuring semantic relatedness are expected to be lowerthanresultsforsemanticsimilarity, sincethe former also includes additional relations (like co-occurrence of words) and is thus a more complicated task. Judgments for such relations strongly depend on experience and cultural background of the test subjects. While most people may agree 10Note that Resnik used the averaged correlation coefficient. We computed the summarized correlation coefficient  that (car - vehicle) are highly related, a strong connectionbetween(parts-speech)mayonlybe established by a certain group. Due to the corpus-based approach, many domain-specific concept pairs are introduced into the test set. Therefore, inter-subject correlation is lower than the results obtained by Gurevych (2006).</Paragraph>
    <Paragraph position="5"> In our experiment, intra-subject correlation was r=.670 for the first and r=.623 for the second individual who repeated the experiment, yielding a summarized intra-subject correlation of r=.647.</Paragraph>
    <Paragraph position="6"> Rubenstein and Goodenough (1965) reported an intra-subject correlation of r=.85 for 15 subjects judging the similarity of a subset (36) of the original 65 word pairs. The values may again not be compared directly. Furthermore, we cannot generalize from these results, because the number of participants which repeated our experiment was too low.</Paragraph>
    <Paragraph position="7"> The distribution of averaged human judgments on the whole test set (see Figure 3) is almost balanced with a slight underrepresentation of highly related concepts. To create more highly related concept pairs, more sophisticated weighting schemes or selection on the basis of lexical chain- null observed for low or high judgments.</Paragraph>
    <Paragraph position="8"> ing could be used. However, even with the present setup, automatic extraction of concept pairs performs remarkably well and can be used to quickly create balanced test datasets.</Paragraph>
    <Paragraph position="9"> Budanitsky and Hirst (2006) pointed out that distribution plots of judgments for the word pairs used by Rubenstein and Goodenough display an empty horizontal band that could be used to separate related and unrelated pairs. This empty band is not observed here. However, Figure 4 shows the distribution of averaged judgments with the highest agreement between annotators (standard deviation &lt; 0.8). The plot clearly shows an empty horizontal band with no judgments. The connection between averaged judgments and standard deviation is plotted in Figure 5.</Paragraph>
    <Paragraph position="10"> When analyzing the concept pairs with lowest deviation there is a clear tendency for particularly highly related pairs, e.g. hypernymy: Universitat - Bildungseinrichtung (university - educational institution); functional relation: Tatigkeit - ausfuhren (task - perform); or pairs that are obviously not connected, e.g. logisch - Juni (logical - June). Table 4 lists some example concept pairs along with averaged judgments and standard deviation.</Paragraph>
    <Paragraph position="11"> Concept pairs with high deviations between judgments often contain polysemous words. For example, Quelle (source) was disambiguated to Wasserquelle (spring) and paired with Text (text). The data shows a clear distinction between one group that rated the pair low (0) and another group that rated the pair high (3 or 4). The latter group obviously missed the point that textual source was not an option here. High deviations were also common among special technical termslike(Mips-Core),propernames(Georg August - two common first names in German) or functionally related pairs (agieren - mobil). Human experience and cultural background clearly influence the judgment of such pairs.</Paragraph>
    <Paragraph position="12"> The effect observed here and the effect noted by Budanitsky and Hirst is probably caused by the same underlying principle. Human agreement on semantic relatedness is only reliable if two words or concepts are highly related or almost unrelated. Intuitively, this means that classifying word pairs as related or unrelated is much easier than numerically rating semantic relatedness. For an information retrieval task, such a classification might be sufficient.</Paragraph>
    <Paragraph position="13"> Differences in correlation coefficients for the three corpora are not significant indicating that the phenomenon is not domain-specific. Differences in correlation coefficients for different parts-of-speech are significant (see Table 3). Verb-verb and verb-adjective pairs have the lowest correlation.</Paragraph>
    <Paragraph position="14"> A high fraction of these pairs is in the problematic medium relatedness area. Adjective-adjective pairs have the highest correlation. Most of these pairs are either highly related or not related at all.</Paragraph>
  </Section>
class="xml-element"></Paper>