<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1203">
  <Title>Measuring the Semantic Similarity of Texts</Title>
  <Section position="3" start_page="13" end_page="15" type="metho">
    <SectionTitle>
2 Measuring Text Semantic Similarity
</SectionTitle>
    <Paragraph position="0"> Given two input text segments, we want to automatically derive a score that indicates their similarity at semantic level, thus going beyond the simple lexical matching methods traditionally used for this task. Although we acknowledge the fact that a comprehensive metric of text semantic similarity should take into account the relations between words, as well as the role played by the various entities involved in the interactions described by each of the two texts, we take a first rough cut at this problem and attempt to model the semantic similarity of texts as a function of the semantic similarity of the component words. We do this by combining metrics of word-to-word similarity and language models into a formula that is a potentially good indicator of the semantic similarity of the two input texts.</Paragraph>
    <Section position="1" start_page="13" end_page="14" type="sub_section">
      <SectionTitle>
2.1 Semantic Similarity of Words
</SectionTitle>
      <Paragraph position="0"> There is a relatively large number of word-to-word similarity metrics that were previously proposed in the literature, ranging from distance-oriented measures computed on semantic networks, to metrics based on models of distributional similarity learned from large text collections. From these, we chose to focus our attention on six different metrics, selected mainly for their observed performance in natural language processing applications, e.g. malapropism detection (Budanitsky and Hirst, 2001) and word sense disambiguation (Patwardhan et al., 2003), and for their relatively high computational efficiency.</Paragraph>
      <Paragraph position="1"> We conduct our evaluation using the following word similarity metrics: Leacock &amp; Chodorow, Lesk, Wu &amp; Palmer, Resnik, Lin, and Jiang &amp; Conrath. Note that all these metrics are defined between concepts, rather than words, but they can be easily turned into a word-to-word similarity metric by selecting for any given pair of words those two meanings that lead to the highest concept-to-concept similarity. We use the WordNet-based implementation of these metrics, as available in the Word-Net::Similarity package (Patwardhan et al., 2003).</Paragraph>
      <Paragraph position="2"> We provide below a short description for each of these six metrics.</Paragraph>
      <Paragraph position="3"> The Leacock &amp; Chodorow (Leacock and Chodorow, 1998) similarity is determined as: Simlch = [?]log length2 [?] D (1) where length is the length of the shortest path between two concepts using node-counting, and D is the maximum depth of the taxonomy.</Paragraph>
      <Paragraph position="4"> The Lesk similarity of two concepts is defined as a function of the overlap between the corresponding definitions, as provided by a dictionary. It is based on an algorithm proposed in (Lesk, 1986) as a solution for word sense disambiguation.</Paragraph>
      <Paragraph position="5"> The Wu and Palmer (Wu and Palmer, 1994) similarity metric measures the depth of the two concepts in the WordNet taxonomy, and the depth of the least common subsumer (LCS), and combines these figures into a similarity score:</Paragraph>
      <Paragraph position="7"> (2) The measure introduced by Resnik (Resnik, 1995) returns the information content (IC) of the LCS of two concepts:</Paragraph>
      <Paragraph position="9"> and P(c) is the probability of encountering an instance of concept c in a large corpus.</Paragraph>
      <Paragraph position="10"> The next measure we use in our experiments is the metric introduced by Lin (Lin, 1998), which builds on Resnik's measure of similarity, and adds a normalization factor consisting of the information content of the two input concepts:</Paragraph>
      <Paragraph position="12"> Finally, the last similarity metric we consider is Jiang &amp; Conrath (Jiang and Conrath, 1997), which returns a score determined by:</Paragraph>
      <Paragraph position="14"/>
    </Section>
    <Section position="2" start_page="14" end_page="14" type="sub_section">
      <SectionTitle>
2.2 Language Models
</SectionTitle>
      <Paragraph position="0"> In addition to the semantic similarity of words, we also want to take into account the specificity of words, so that we can give a higher weight to a semantic matching identified between two very specific words (e.g. collie and sheepdog), and give less importance to the similarity score measured between generic concepts (e.g. go and be). While the specificity of words is already measured to some extent by their depth in the semantic hierarchy, we are reinforcing this factor with a corpus-based measure of word specificity, based on distributional information learned from large corpora.</Paragraph>
      <Paragraph position="1"> Language models are frequently used in natural language processing applications to account for the distribution of words in language. While word frequency does not always constitute a good measure of word importance, the distribution of words across an entire collection can be a good indicator of the specificity of the words. Terms that occur in a few documents with high frequency contain a greater amount of discriminatory ability, while terms that occur in numerous documents across a collection with a high frequency have inherently less meaning to a document. We determine the specificity of a word using the inverse document frequency introduced in (Sparck-Jones, 1972), which is defined as the total number of documents in the corpus, divided by the total number of documents that include that word.</Paragraph>
      <Paragraph position="2"> In the experiments reported in this paper, we use the British National Corpus to derive the document frequency counts, but other corpora could be used to the same effect.</Paragraph>
    </Section>
    <Section position="3" start_page="14" end_page="15" type="sub_section">
      <SectionTitle>
2.3 Semantic Similarity of Texts
</SectionTitle>
      <Paragraph position="0"> Provided a measure of semantic similarity between words, and an indication of the word specificity, we combine them into a measure of text semantic similarity, by pairing up those words that are found to be most similar to each other, and weighting their similarity with the corresponding specificity score.</Paragraph>
      <Paragraph position="1"> We define a directional measure of similarity, which indicates the semantic similarity of a text segment Ti with respect to a text segment Tj. This definition provides us with the flexibility we need to handle applications where the directional knowledge is useful (e.g. entailment), and at the same time it gives us the means to handle bidirectional similarity through a simple combination of two unidirectional metrics.</Paragraph>
      <Paragraph position="2"> For a given pair of text segments, we start by creating sets of open-class words, with a separate set created for nouns, verbs, adjectives, and adverbs.</Paragraph>
      <Paragraph position="3"> In addition, we also create a set for cardinals, since numbers can also play an important role in the understanding of a text. Next, we try to determine pairs of similar words across the sets corresponding to the same open-class in the two text segments. For nouns and verbs, we use a measure of semantic similarity based on WordNet, while for the other word classes we apply lexical matching1.</Paragraph>
      <Paragraph position="4"> For each noun (verb) in the set of nouns (verbs) belonging to one of the text segments, we try to identify the noun (verb) in the other text segment that has the highest semantic similarity (maxSim), according to one of the six measures of similarity described in Section 2.1. If this similarity measure results in a score greater than 0, then the word is added to the set of similar words for the corresponding word class WSpos2. The remaining word classes: adjectives, adverbs, and cardinals, are checked for lexical similarity with their counter-parts and included in the corresponding word class set if a match is found.</Paragraph>
      <Paragraph position="5"> The similarity between the input text segments Ti and Tj is then determined using a scoring function that combines the word-to-word similarities and the</Paragraph>
      <Paragraph position="7"> This score, which has a value between 0 and 1, is a measure of the directional similarity, in this case computed with respect to Ti. The scores from both directions can be combined into a bidirectional similarity using a simple average function: sim(Ti, Tj) = sim(Ti, Tj)Ti + sim(Ti, Tj)Tj2 (8) 1The reason behind this decision is the fact that most of the semantic similarity measures apply only to nouns and verbs, and there are only one or two relatedness metrics that can be applied to adjectives and adverbs.</Paragraph>
      <Paragraph position="8"> 2All similarity scores have a value between 0 and 1. The similarity threshold can be also set to a value larger than 0, which would result in tighter measures of similarity.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="15" end_page="15" type="metho">
    <SectionTitle>
3 A Walk-Through Example
</SectionTitle>
    <Paragraph position="0"> We illustrate the application of the text similarity measure with an example. Given two text segments, as shown in Figure 1, we want to determine a score that reflects their semantic similarity. For illustration purposes, we restrict our attention to one measure of word-to-word similarity, the Wu &amp; Palmer metric.</Paragraph>
    <Paragraph position="1"> First, the text segments are tokenized, part-of-speech tagged, and the words are inserted into their corresponding word class sets. The sets obtained for the given text segments are illustrated in Figure 1.</Paragraph>
    <Paragraph position="2"> Starting with each of the two text segments, and for each word in its word class sets, we determine the most similar word from the corresponding set in the other text segment. As mentioned earlier, we seek a WordNet-based semantic similarity for nouns and verbs, and only lexical matching for adjectives, adverbs, and cardinals. The word semantic similarity scores computed starting with the first text segment are shown in Table 3.</Paragraph>
    <Paragraph position="3">  computing text similarity with respect to text 1 Next, we use equation 7 and determine the semantic similarity of the two text segments with respect to text 1 as 0.6702, and with respect to text 2 as 0.7202. Finally, the two figures are combined into a bidirectional measure of similarity, calculated as 0.6952 based on equation 8.</Paragraph>
    <Paragraph position="4"> Although there are a few words that occur in both text segments (e.g. juror, questionnaire), there are also words that are not identical, but closely related, e.g. courtroom found similar to juror, or fill which is related to complete. Unlike traditional similarity measures based on lexical matching, our metric takes into account the semantic similarity of these words, resulting in a more precise measure of text similarity.</Paragraph>
  </Section>
class="xml-element"></Paper>