<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1611">
  <Title>Paraphrasing Japanese noun phrases using character-based indexing</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Retrieving paraphrase candidates
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Indexing and term expansion
</SectionTitle>
      <Paragraph position="0"> In conventional information retrieval, a query is given to the system to retrieve a list of documents which are arranged in descending order of relevance. Our aim is to obtain paraphrases given a noun phrase as a query, where retrieved objects should be smaller than documents. We divide a document into a set of passages at punctuation symbols. These passages are retrieved by a given query, a noun phrase.</Paragraph>
      <Paragraph position="1"> The input noun phrase and the passages are segmented into words and they are assigned part of speech tags by a morphological analyzer. Among these tagged words, content words (nouns, verbs, adjectives, adverbs) and unknown words are selected. Kanzi characters contained in these words are extracted as index terms. In addition to Kanzi characters, words written in Katakana (most of them are imported words) and numbers are also used as index terms. Precisely speaking, different numbers should be considered to denote different meaning, but to avoid data sparseness problem, we abstract numbers into a special symbol &lt;num&gt; .</Paragraph>
      <Paragraph position="2"> As mentioned in section 1, the query expansion technique is often used in information retrieval to solve the surface notational difference between queries and documents. We also introduce query expansion for retrieving passage. Since we use Kanzi characters as index terms, we need linguistic knowledge defining groups of similar characters for query expansion. However this kind of knowledge is not available at hand. We obtain similarity of Kanzi characters from an ordinary thesaurus which defines similarity of words.</Paragraph>
      <Paragraph position="3"> If a word t is not a Katakana word, we expand it to a set of Kanzi characters E(t) which is defined by (1), where Ct is a semantic class including the word t, KC is a set of Kanzi characters used in words of semantic class C, fr(k,C) is a frequency of a Kanzi character k used in words of semantic class C, and Kt is a set of Kanzi characters in word t.</Paragraph>
      <Paragraph position="5"> E(t) consists of Kanzi characters which is used in words of semantic class Ct more frequently, than the most frequent Kanzi character in the word t. If the word t is a Katakana word, it is not expanded.</Paragraph>
      <Paragraph position="6"> Let us see an expansion example of word &amp;quot;9 (hot spring)&amp;quot;. Here we have t = &amp;quot;9 &amp;quot; to expand, and we have two characters that make the word, i.e. Kt = {  }. Suppose &amp;quot;9 &amp;quot; belongs to a semantic class Ct in which we find a set of words {9 (hot sprint place),ul(lukewarm water),9 +(warm water), (spa),(oasis), . . .}. From this word set, we extract characters and count their occurence to obtain</Paragraph>
      <Paragraph position="8"> (22),(20),9(8),. . .}, where a number in parentheses denotes the frequency of characters in the semantic class Ct. Since the most frequent character of Kt in KCt is &amp;quot; &amp;quot; in this case, more frequently used character &amp;quot;l&amp;quot; is added to E(t). In addition, Katakana words &amp;quot;&amp;quot; and &amp;quot;&amp;quot; are added to E(t) as well.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Term weighting
</SectionTitle>
      <Paragraph position="0"> An index term is usually assigned a certain weight according to its importance in user's query and documents.</Paragraph>
      <Paragraph position="1"> There are many proposals of term weighting most of which are based on term frequency (Baeza-Yates and Riberto-Neto, 1999) in a query and documents. Term frequency-based weighting resides on Luhn's assumption (Luhn, 1957) that a repeatedly mentioned expression denotes an important concepts. However it is obvious that this assumption does not hold when retrieving paraphrase candidates from a set of documents. For term weighting, we use character frequency in a semantic class rather than that in a query and documents, assuming that a character frequently used in words of a semantic class represents the concept of that semantic class very well.</Paragraph>
      <Paragraph position="2"> A weight of a term k in a word t is calculated by (2).</Paragraph>
      <Paragraph position="4"> if k is a Kanzi (2) Katakana words and numbers are assigned a constant value, 100, and a Kanzi character is assigned a weight according to its frequency in the semantic class Ct, where k is used in the word t.</Paragraph>
      <Paragraph position="5"> In the previous example of &amp;quot;9 &amp;quot;, we have obtained an expanded term set {l,9, ,,}.</Paragraph>
      <Paragraph position="6"> Among this set, &amp;quot;&amp;quot; and &amp;quot;&amp;quot; are assigned weight 100 because these are Katakana words, and the rest three characters are assigned weight according to its frequency in the class. For example, &amp;quot;l&amp;quot; is assigned weight 100x log35log35+log22+log8 = 40.7.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Similarity
</SectionTitle>
      <Paragraph position="0"> Similarity between an input noun phrase (I) and a passage (D) is calculated by summing up the weights of terms which are shared by I and D, as defined in (3). In the equation, k takes values over the index terms shared by I and D, w(k) is its weight calculated as described in the previous section.</Paragraph>
      <Paragraph position="2"> Note that since we do not use term frequency in passages, we do not introduce normalization of passage length.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Syntactic and semantic filtering
</SectionTitle>
    <Paragraph position="0"> The proposed method utilizes Kanzi characters as index terms. In general, making index terms smaller units increases exhaustivity to gain recall, but, at the same time, it decreases specificity to degrade precision (Sparck Jones, 1972). We aim to gain recall by using smaller units as index terms at the cost of precision. Even though Kanzi are ideograms and have more specificity than phonograms, they are still less specific than words. Therefore there would be many irrelevant passages retrieved due to coincidentally shared characters. In this section, we describe a process to filter out irrelevant passages based on the following two viewpoints.</Paragraph>
    <Paragraph position="1"> Semantic constraints : Retrieved passages should contain all concepts mentioned in the input noun phrase.</Paragraph>
    <Paragraph position="2"> Syntactic constraints : Retrieved passages should have a syntactically proper structure corresponding to the input noun phrase.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Semantic constraints
</SectionTitle>
      <Paragraph position="0"> In the indexing phase, we have decomposed an input noun phrase and passages into a set of Kanzi characters for retrieval. In the filtering phase, from these characters, we reconstruct words denoting a concept and verify if concepts mentioned in the input noun phrase are also included in the retrieved passages.</Paragraph>
      <Paragraph position="1"> To achieve this, a retrieved passage is syntactically analyzed and dependencies between bunsetu (word phrase) are identified. Then, the correspondence between words of the input noun phrase and bunsetu of the passage is verified. This matching is done on the basis of sharing the same Kanzi characters or the same Katakana words.</Paragraph>
      <Paragraph position="2"> Passages missing any of the concepts mentioned in the input noun phrase are discarded in this phase.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Syntactic constraints
</SectionTitle>
      <Paragraph position="0"> Since passages are generated on the basis of punctuation symbols, each passage is not guaranteed to have a syntactically proper structure. In addition, a part of the passage tends to be a paraphrase of the input noun phrase rather than the whole passage. In such cases, it is necessary to extract a corresponding part from the retrieved passage and transform it into a proper syntactic structure.</Paragraph>
      <Paragraph position="1"> By applying semantic constraints above, we have identified a set of bunsetu covering the concepts mentioned in the input noun phrase. We extract a minimum dependency structure which covers all the identified bunsetu.</Paragraph>
      <Paragraph position="2"> Finally the extracted structure is transformed into a proper phrase or clause by changing the ending of the head (the right most element) and deleting unnecessary elements such as punctuation symbols, particles and so on.</Paragraph>
      <Paragraph position="3"> Figure 1 illustrates the matching and transforming process described in this section. The input noun phrase is &amp;quot;?w1w2ww3V&lt;[w4 (reduction of telephone rate)&amp;quot; which consists of four words w1 ...w4. Suppose a passage &amp;quot;U&lt;[`h\qp (the company's telephone rate reduction caused. . . &amp;quot; is retrieved. This passage is syntactically analyzed to give the dependency structure of four bunsetu b1 ...b4 as shown in Figure 1.</Paragraph>
      <Paragraph position="4">  Correspondence between word w1 and bunsetu b2 is made bacause they share a common character &amp;quot;&amp;quot;. Word w2 corresponds to bunsetu b2 as well due to characters &amp;quot; &amp;quot; and &amp;quot;&amp;quot;. And word w4 corresponds to bunsetu b3. Although there is no counterpart of word w3, this passage is not discarded because word 3 is a function word (postposition). After making correspondences, a minimum dependency structure, the shaded part in Figure 1, is extracted. Then the ending auxiliary verb is deleted and the verb is restored to the base form.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Ranking
</SectionTitle>
    <Paragraph position="0"> Retrieved passages are ranked according to the similarity with an input noun phrase as described in section 2. However this ranking is not always suitable from the view-point of paraphrasing. Some of the retrieved passages are discarded and others are transformed through processes described in the previous section. In this section, we describe a process to rerank remaining passages according to their appropriateness as paraphrases of the input noun phrase. We take into account the following three factors for reranking.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Similarity score of retrieval
</SectionTitle>
      <Paragraph position="0"> The similarity score used in passage retrieval is not sufficient for evaluating the quality of the paraphrases. However, it reflects relatedness between the input noun phrase and retrieved passages. Therefore, the similarity score calculated by (3) is taken into account when ranking paraphrase candidates.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Distance between words
</SectionTitle>
      <Paragraph position="0"> In general, distance between words which have a dependency relation reflects the strength of their semantic closeness. We take into account the distance between two bunsetu which have a dependency relation and contain adjacent two words in the input noun phrase respectively.</Paragraph>
      <Paragraph position="1"> This factor is formalized as in (4), where ti is the ith word in the input noun phrase, and dist(s,t) is the distance between two bunsetu each of which contains s and t. A distance between two bunsetu is defined as the number of bunsetu between them. When two words are contained in the same bunsetu, the distance between them is defined as 0.</Paragraph>
      <Paragraph position="3"/>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Contextual information
</SectionTitle>
      <Paragraph position="0"> We assume that phrases sharing the same Kanzi characters likely represent the same meaning. Therefore they could be paraphrases of each other. However, even though a Kanzi denotes a certain meaning, its meaning is often ambiguous. This problem is similar to word sense ambiguities, which have been studied for many years. To solve this problem, we adopt an idea one sense per collocation which was introduced in word sense disambiguation research (Yarowsky, 1995). Considering a newspaper article in which the retrieved passage and the input noun phrase is included as the context, the context similarity is taken into account for ranking paraphrase candidates. More concretely, context similarity is calculated by following procedure.</Paragraph>
      <Paragraph position="1"> 1. For each paraphrase candidate, a context vector is constructed from the newspaper article containing the passage from which the candidate is derived.</Paragraph>
      <Paragraph position="2"> The article is morphologically analyzed and content words are extracted to make the context vector. The tf *idf metric is used for term weighting.</Paragraph>
      <Paragraph position="3"> 2. Since the input is given in terms of a noun phrase, there is no corresponding newspaper article for the input. However there is a case where the retrieved passages include the input noun phrase. Such passages are not useful for finding paraphrases, but useful for constructing a context vector of the input noun phrase. The context vector of the input noun phrase is constructed in the same manner as that of paraphrase candidates, except that all newspaper articles including the noun phrase are used.</Paragraph>
      <Paragraph position="4">  3. Context similarity Mcontext is calculated by cosine measure of two context vectors as in (5), where wi(k) and wd(k) are the weight of the k-th term of the input context vector and the candidate context vector, respectively.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Ranking paraphrase candidates
</SectionTitle>
      <Paragraph position="0"> Paraphrase candidates are ranked in descending order of the product of three measures, sim(I,D) (equation (3)), Mdistance (equation (4)) and Mcontext (equation (5)).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>