<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-5007">
  <Title>Automated Generalization of Phrasal Paraphrases from the Web*</Title>
  <Section position="4" start_page="49" end_page="50" type="metho">
    <SectionTitle>
2 Overview of Generalization Method
</SectionTitle>
    <Paragraph position="0"> The input to our system is a seed phrasal paraphrase example, and the output is a set of paraphrase templates generalized from that example. The overall architecture of our paraphrase generalization method is shown in Figure 1.</Paragraph>
    <Paragraph position="1"> We use example (1) to illustrate the procedure, together with a semantic dictionary called &quot;TongYiCiCiLin&quot; (Extended Version). (Footnote 1: TongYiCiCiLin (Extended Version) can be downloaded from the website of HIT-IRLab (Http://ir.hit.edu.cn). Hereafter we abbreviate TongYiCiCiLin (Extended Version) as Cilin (EV).) The pair of phrases in the example is a phrasal paraphrase. First, after preprocessing, which includes word segmentation, POS tagging and word sense disambiguation, we identify the slot word in the paraphrase. In this example, the slot word is the word meaning &quot;I&quot;. We then search the web using the context of the slot word, so that each phrase in the pair yields a set of sentences containing the original phrase context. A dependency parser is run on these sentences to extract the words that correspond to the slot word, which gives one word set per sentence set. Next, we map the two word sets to their semantic code sets according to Cilin (EV) and intersect the two code sets. Finally, we replace the slot word with the intersection set to generate the paraphrase template.</Paragraph>
    <Paragraph position="2"> To validate the generalized paraphrase templates, we also design an automatic procedure that uses an existing search engine to check whether a template is reasonable.</Paragraph>
  </Section>
  <Section position="5" start_page="50" end_page="50" type="metho">
    <SectionTitle>
3 Representation of Template
</SectionTitle>
    <Paragraph position="0"> The introduction reviewed several methods for representing paraphrase templates. We propose a new method that uses word semantic codes to represent the variable in a template. Before describing this representation, we briefly introduce the semantic dictionary Cilin (EV).</Paragraph>
    <Section position="1" start_page="50" end_page="50" type="sub_section">
      <SectionTitle>
3.1 TongYiCiCiLin (Extended Version)
</SectionTitle>
      <Paragraph position="0"> Cilin (EV) is derived from the original TongYiCiCiLin, in which word senses are organized into 12 large categories, 94 middle categories and 1,428 small categories. Cilin (EV) removes some outdated words and adds many new ones. Finer-grained categories are added on top of the original classification system to meet the needs of more complex natural language applications. The encoding scheme is shown in Table 1.
Table 1 Encoding table of dictionary
Encoding bit   1     2        3 4     5        6 7           8
Example        D     a        1 5     B        0 2           =
Attribute      Big   Middle   Small   groups   Atom groups
Layer          1     2        3       4        5
The encoding bits are arranged from left to right. The first three layers are the same as in the original Cilin. The fourth layer is represented by a capital letter and the fifth layer by a two-digit decimal number. The last bit carries more detailed information about the atom group.</Paragraph>
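      <Paragraph> The encoding in Table 1 can be made concrete with a short sketch. The function name split_cilin_code and the field names are illustrative assumptions, not part of Cilin (EV) itself; only the bit positions follow the table.

```python
from typing import Dict


def split_cilin_code(code: str) -> Dict[str, str]:
    """Split an 8-character Cilin (EV) code into the fields of Table 1.

    The field names are hypothetical labels chosen for this sketch.
    """
    if len(code) != 8:
        raise ValueError("expected an 8-character code, got %r" % code)
    return {
        "big": code[0],           # layer 1, bit 1
        "middle": code[1],        # layer 2, bit 2
        "small": code[2:4],       # layer 3, bits 3-4
        "group": code[4],         # layer 4, bit 5 (capital letter)
        "atom_group": code[5:7],  # layer 5, bits 6-7 (two digits)
        "flag": code[7],          # bit 8, extra information about the atom group
    }


if __name__ == "__main__":
    # "Da15B02=" is the example code given in Table 1.
    print(split_cilin_code("Da15B02="))
```

Keeping the fields separate makes it easy to compare two codes at any chosen layer.</Paragraph>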
    </Section>
    <Section position="2" start_page="50" end_page="50" type="sub_section">
      <SectionTitle>
3.2 An Example of a Paraphrase Template
</SectionTitle>
      <Paragraph position="0"> For simplicity, we select only one slot word in each paraphrase, and we stipulate that only a content word can be a slot word. We again use paraphrase example (1).</Paragraph>
      <Paragraph position="1"> Example (1) pairs the phrases glossed as &quot;in my view/mind&quot; and &quot;I feel&quot;. Here, the slot word is the word meaning &quot;I&quot;. Through word sense disambiguation, we obtain its semantic code &quot;Aa02A01=&quot; at the fifth layer of Cilin (EV). If we use only the semantic code of the slot word, we get a simple paraphrase template in which the slot word is replaced by this code. Such a template is obviously very limited, and so is its representational power. How to extend the generalization ability of a paraphrase template is therefore a challenging problem.</Paragraph>
    </Section>
    <Section position="3" start_page="50" end_page="50" type="sub_section">
      <SectionTitle>
3.3 Extending the Template Abstract Ability
</SectionTitle>
      <Paragraph position="0"> Given the hierarchical structure of Cilin (EV), a natural way to generalize a paraphrase template is to replace the slot word with its semantic code at a higher layer. This is a very simple way to extend the template, but it also makes the template more redundant, as the experiments in Section 5 show.</Paragraph>
      <Paragraph position="1"> We therefore represent the slot with multiple semantic codes, taken at different layers of Cilin (EV), rather than with the single semantic code of the slot word. The experimental results in Section 5 show that this representation performs well in terms of precision and coverage.</Paragraph>
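      <Paragraph> A minimal sketch of how a slot represented by several semantic codes might be checked at a chosen layer. The prefix lengths follow Table 1; the function names and the code "Aa02B03=" are hypothetical and serve only as illustration, while "Aa02A01=" is the code quoted in Section 3.2.

```python
from typing import Iterable

# Number of leading characters that identify a code at each layer (Table 1).
LAYER_PREFIX_LENGTH = {1: 1, 2: 2, 3: 4, 4: 5, 5: 7}


def code_at_layer(code: str, layer: int) -> str:
    """Truncate a full Cilin (EV) code to the given layer."""
    return code[:LAYER_PREFIX_LENGTH[layer]]


def slot_accepts(word_codes: Iterable[str],
                 slot_codes: Iterable[str],
                 layer: int) -> bool:
    """Return True if any code of the word equals any slot code at this layer."""
    slot_prefixes = {code_at_layer(c, layer) for c in slot_codes}
    return any(code_at_layer(c, layer) in slot_prefixes for c in word_codes)


if __name__ == "__main__":
    slot = ["Aa02A01="]        # code of the slot word from Section 3.2
    candidate = ["Aa02B03="]   # hypothetical code of some other word
    print(slot_accepts(candidate, slot, layer=3))
    print(slot_accepts(candidate, slot, layer=5))
```

Matching at a higher layer (a shorter prefix) accepts more words, which is the source of both the wider coverage and the extra redundancy discussed above.</Paragraph>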
    </Section>
  </Section>
  <Section position="6" start_page="50" end_page="52" type="metho">
    <SectionTitle>
4 Generalizing to Templates
</SectionTitle>
    <Paragraph position="0"> As mentioned above, we can use multiple semantic codes to generalize paraphrase examples.</Paragraph>
    <Paragraph position="1"> The problem of generalizing paraphrase examples is thus transformed into the problem of obtaining the set of semantic codes. We propose a new method that uses an existing search engine to obtain this set.</Paragraph>
    <Section position="1" start_page="50" end_page="51" type="sub_section">
      <SectionTitle>
4.1 Getting the Candidate Sentences
</SectionTitle>
      <Paragraph position="0"> After removing the slot word from the paraphrase example, we obtain the two phrasal contexts of the original phrases. Each phrase without its slot word is used as a query to an existing search engine, which returns many sentences containing the query. For this example, the two queries are the phrases glossed as &quot;in ... view&quot; and &quot;feel&quot;. Each query yields one sentence set. Parts of the two resulting sentence sets are shown in Figure 2 and Figure 3. These sentence sets contain some noise. To find the words that correspond to the slot word, it is not enough to use only the position or the POS tag of the slot word. Even if we extracted words this way, many of them could not be found in the dictionary because they are not simple words.</Paragraph>
      <Paragraph position="1"> Following the idea of Lin and Pantel (2001), we use a dependency parser to determine the corresponding extended words.</Paragraph>
    </Section>
    <Section position="2" start_page="51" end_page="51" type="sub_section">
      <SectionTitle>
4.2 Dependency Parser
</SectionTitle>
      <Paragraph position="0"> In this paper, we use a dependency parser (Ma et al., 2004) to extract the candidate slot words. For example, the dependency parse of the phrase glossed as &quot;in my view&quot; is shown in Figure 4.</Paragraph>
      <Paragraph position="1">  The arcs in the figure represent dependency relations. Each arc points from the head to the modifier, and the label on an arc gives the type of the dependency relation. Table 2 lists a subset of the dependency relations in the HIT-IRLab dependency parser.</Paragraph>
    </Section>
    <Section position="3" start_page="51" end_page="51" type="sub_section">
      <SectionTitle>
4.3 Extracting the extended words
</SectionTitle>
      <Paragraph position="0"> We use a very simple method to obtain the extended words from the parsed sentences. First, we record the dependency relations in the parse of the original phrase. We then use these relations to match the similar part of each parsed candidate sentence, ignoring the slot word. Relations and content words that do not appear in the parse of the original phrase are discarded. The word that fills the slot position in the matched part is taken as an extended word.</Paragraph>
      <Paragraph position="2"> Figure 5 shows the dependency parse of the phrase glossed as &quot;in foreign capital fund manager view&quot;. Here the extended word corresponding to the slot word &quot;I&quot; is the word meaning &quot;manager&quot;. One set of extended words is extracted from each of the two sentence sets. We then map each word to its semantic codes to obtain two semantic code sets and intersect them. Finally, we replace the slot word with the resulting intersection set to generate the paraphrase template.</Paragraph>
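      <Paragraph> The mapping and intersection steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the dictionary is modelled as a plain mapping from a word to its codes, English glosses stand in for the Chinese words, and every code in TOY_LEXICON except &quot;Aa02A01=&quot; is invented for the example.

```python
from typing import Dict, Iterable, List, Set

# Toy stand-in for Cilin (EV); only "Aa02A01=" comes from the text.
TOY_LEXICON: Dict[str, List[str]] = {
    "I": ["Aa02A01="],
    "manager": ["Af10A01="],   # hypothetical code
    "table": ["Bp01A01="],     # hypothetical code
}


def codes_for_words(words: Iterable[str],
                    lexicon: Dict[str, List[str]],
                    prefix_length: int = 7) -> Set[str]:
    """Map extended words to semantic codes truncated to prefix_length.

    Words that are missing from the dictionary are skipped.
    """
    codes: Set[str] = set()
    for word in words:
        for code in lexicon.get(word, []):
            codes.add(code[:prefix_length])
    return codes


def slot_code_set(words_a: Iterable[str],
                  words_b: Iterable[str],
                  lexicon: Dict[str, List[str]],
                  prefix_length: int = 7) -> Set[str]:
    """Intersect the code sets obtained from the two extended word sets."""
    codes_a = codes_for_words(words_a, lexicon, prefix_length)
    codes_b = codes_for_words(words_b, lexicon, prefix_length)
    return codes_a.intersection(codes_b)


if __name__ == "__main__":
    from_first_phrase = ["I", "manager", "table"]
    from_second_phrase = ["manager", "I", "unlisted-word"]
    print(sorted(slot_code_set(from_first_phrase, from_second_phrase, TOY_LEXICON)))
```

The default prefix length of 7 corresponds to the fifth layer; passing 4 or 5 would give third-layer or fourth-layer codes instead.</Paragraph>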
    </Section>
    <Section position="4" start_page="51" end_page="52" type="sub_section">
      <SectionTitle>
4.4 Some tricks
</SectionTitle>
      <Paragraph position="0"> Because the precision of the current dependency parser on Chinese is not very high, we parse only a part of each candidate sentence.</Paragraph>
      <Paragraph position="1"> Long candidate sentences are segmented according to one of three patterns, chosen by the position of the slot word in the paraphrase example. The patterns are called FRONT, MIDDLE and BACK, and Table 3 illustrates them with an example. Table 3 Examples of sentence segmentation</Paragraph>
      <Paragraph position="3"> The bold section of each sentence is extracted and parsed. The pattern type is determined by the position of the slot word relative to the context words. These patterns reduce the relative error rate of the dependency parser: if the original phrase is parsed incorrectly, the extracted segments are likely to be parsed incorrectly in a similar way. Because our method matches relations between the two parses, this kind of parser error has little influence on the final extraction result.</Paragraph>
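      <Paragraph> One possible way to decide the pattern type is sketched below. The exact rule is not spelled out in the text, so the sketch assumes that FRONT means the slot word precedes all context words of the seed phrase, BACK means it follows all of them, and MIDDLE covers the remaining case. The function name and the English gloss tokens are hypothetical.

```python
from typing import Sequence


def segmentation_pattern(phrase_tokens: Sequence[str], slot_index: int) -> str:
    """Classify a seed phrase by where its slot word sits (assumed rule)."""
    if slot_index not in range(len(phrase_tokens)):
        raise IndexError("slot_index is outside the phrase")
    if len(phrase_tokens) == 1:
        raise ValueError("a phrase needs at least one context word")
    if slot_index == 0:
        return "FRONT"     # slot word before every context word
    if slot_index == len(phrase_tokens) - 1:
        return "BACK"      # slot word after every context word
    return "MIDDLE"        # context words on both sides


if __name__ == "__main__":
    print(segmentation_pattern(["in", "I", "view"], 1))
    print(segmentation_pattern(["I", "feel"], 0))
```

Which part of a candidate sentence is then cut out for parsing depends on Table 3 and is not modelled here.</Paragraph>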
    </Section>
  </Section>
  <Section position="7" start_page="52" end_page="52" type="metho">
    <SectionTitle>
5 Experiments and Discussions
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="52" end_page="52" type="sub_section">
      <SectionTitle>
5.1 Setting
</SectionTitle>
      <Paragraph position="0"> We extract about 510 valid paraphrase examples from a Chinese paraphrase corpus (Li et al., 2004). For simplicity, we select only those phrasal paraphrase examples whose two phrases share a word, and we stipulate that only a content word can serve as the slot word. In this paper we use four seed phrasal paraphrases as the original paraphrases. The generalized paraphrase templates, represented by semantic codes of the fifth layer of Cilin (EV), are shown in Table 4.</Paragraph>
    </Section>
    <Section position="2" start_page="52" end_page="52" type="sub_section">
      <SectionTitle>
5.2 Evaluation on Templates
</SectionTitle>
      <Paragraph position="0"> The goal of the evaluation is to determine how reasonable this representation of paraphrase templates is and how good the resulting templates are. We evaluate the generalized paraphrase templates in three ways: 1) Reasonability; 2) Precision; 3) Coverage.</Paragraph>
      <Paragraph position="1"> 1) Reasonability  The reasonability of a paraphrase template measures how reasonable it is to represent the template with multiple semantic codes. For example, if we use POS tags to generalize a paraphrase template, its reasonability is very low; that is, POS tags are to some extent unsuitable for representing paraphrase templates.</Paragraph>
      <Paragraph position="2"> We use an existing search engine to compute the reasonability of each paraphrase template. First, we instantiate all paraphrase examples from a template and submit each of them to the search engine as a query. If both phrases of a paraphrase are matched exactly by the search engine, that is, if one or more occurrences of each are found on the Web, we consider the paraphrase reasonable. This method gives an approximate evaluation of all the examples. We define two metrics:</Paragraph>
      <Paragraph position="4"> where N is the total number of instantiated examples, S is the number of paraphrase examples in which both phrases are matched, and L is the number of paraphrase examples in which only one phrase is matched.</Paragraph>
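      <Paragraph> The counts N, S and L can be computed directly from per-example match results, as in the sketch below. The two formulas themselves are not reproduced in this text, so the ratios returned by reasonability are an assumption: a strict score S/N and a loose score (S+L)/N, which is only one plausible reading of the definitions. All function names are hypothetical.

```python
from typing import List, Tuple

# One entry per instantiated example: (first phrase matched, second phrase matched).
MatchResult = Tuple[bool, bool]


def match_counts(results: List[MatchResult]) -> Tuple[int, int, int]:
    """Return (N, S, L) as defined in the text."""
    n = len(results)
    both_matched = sum(1 for first, second in results if first and second)
    one_matched = sum(1 for first, second in results if first != second)
    return n, both_matched, one_matched


def reasonability(results: List[MatchResult]) -> Tuple[float, float]:
    """Assumed metrics: strict = S/N, loose = (S+L)/N."""
    n, s, l = match_counts(results)
    if n == 0:
        raise ValueError("no instantiated examples")
    return s / n, (s + l) / n


if __name__ == "__main__":
    toy = [(True, True), (True, False), (False, False), (True, True)]
    print(match_counts(toy))
    print(reasonability(toy))
```

In practice each boolean would come from an exact-phrase query to the search engine.</Paragraph>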
    </Section>
  </Section>
  <Section position="8" start_page="52" end_page="53" type="metho">
    <SectionTitle>
2) Precision
</SectionTitle>
    <Paragraph position="0"> Each template instantiates a different number of examples depending on the layer of Cilin (EV) from which the semantic codes are taken, as shown in Table 5.</Paragraph>
    <Paragraph position="1"> Table 5 Templates and their corresponding examples  Table 5 shows that every template can instantiate many examples. Judging all of these examples manually would take a great deal of time, so we sample only a part of the instantiated examples: 200 paraphrase examples for each template in this paper. Each phrase in a sampled paraphrase example is used as a search query to retrieve the first two matching sentences. Evaluators are asked whether it is semantically acceptable to replace the query in the sentence with the corresponding phrase of the paraphrase. They are given only two options: Yes or No. If a query has no matching results, we consider that the phrase cannot be replaced with its corresponding paraphrase. Under these rules, every paraphrase example corresponds to 4 sentences. If we sample n examples from a template, the precision of the paraphrase template is calculated by:</Paragraph>
    <Paragraph position="3"> where R is the number of sentences judged correct by the evaluators.</Paragraph>
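    <Paragraph> Since each sampled example corresponds to 4 sentences (two phrases, with two retrieved sentences each), a sample of n examples yields 4n judged sentences. Under that reading, which is an inference from the description and not a formula quoted from the paper, the precision would be the accepted fraction of those sentences:

```latex
\[
  \mathrm{Precision} = \frac{R}{4n}
\]
```

With the sample size of n = 200 used here, the denominator would be 800 sentences per template.</Paragraph>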
  </Section>
  <Section position="9" start_page="53" end_page="54" type="metho">
    <SectionTitle>
3) Coverage
</SectionTitle>
    <Paragraph position="0"> Evaluating the coverage of a paraphrase template directly is difficult because humans cannot enumerate all the words that fit the template. We therefore use an approximate method. First, we use a different search engine to retrieve candidate sentences, in the same way as when generalizing the paraphrase template. These retrieved sentences yield many words that differ from the known generalized words, because more than 85% of the search results from different search engines are different. Evaluators select every sentence in which the phrase can be replaced by the corresponding phrase of the paraphrase while the new sentence retains the original meaning. Each such sentence corresponds to one word.</Paragraph>
    <Paragraph position="1"> Then we define two metrics:</Paragraph>
    <Paragraph position="3"> where NS is the number of words manually tagged as correct, M is the number of words that can be instantiated from a paraphrase template, and K is the number of words from which the template was generalized earlier. Map(X) is the total number of words in the word clusters derived from the X words in the semantic dictionary Cilin (EV).</Paragraph>
    <Section position="1" start_page="53" end_page="54" type="sub_section">
      <SectionTitle>
5.3 Result
</SectionTitle>
      <Paragraph position="0"> To demonstrate the merit of our method, we conduct four groups of experiments: POS-Tag, Cilin3, Cilin4 and Cilin5.</Paragraph>
      <Paragraph position="1"> For the POS-Tag group, we randomly select only 400 words that satisfy the POS constraint.</Paragraph>
      <Paragraph position="2">  Every value in Table 6 is the average of four values, one for each of the four templates. The table shows that the reasonability of the Cilin-based representations changes little across layers, while that of the POS-based representation is very low. We also find that the longer the original phrases are, the lower the coverage of the generalized template is. Although the average coverage of the generalized templates is relatively low, we conclude that using multiple semantic codes to generalize phrasal paraphrase examples is reasonable.</Paragraph>
      <Paragraph position="3"> The coverage column shows that the coverage rates of the Cilin-based templates are all no more than 50%, whereas the POS-based template has a very high coverage rate. This indicates that the extended information obtained from a single search engine is not sufficient.</Paragraph>
      <Paragraph position="4"> In future work, we will combine several different search engines to address this problem.</Paragraph>
      <Paragraph position="5">  In Figure 6, the numbers one to four on the X-axis correspond to POS, Cilin3, Cilin4 and Cilin5. The figure shows clearly how the different template representations behave. The Cilin5-based template has the highest precision but lower coverage, and the Cilin3-based template shows the opposite behavior. This is because one semantic code at the third layer covers more words than one at the fifth layer, and more words bring more redundant information.</Paragraph>
      <Paragraph position="6"> The Cilin4-based template offers a good trade-off between coverage and precision. We therefore conclude that semantic codes of the fourth layer of Cilin (EV) are more suitable for representing paraphrase templates.</Paragraph>
      <Paragraph position="7"> Some additional information can be extracted from the generalized templates, such as collocation information between the slot word and the context words. For example, from the fourth template we can learn which words can collocate with the word transliterated as &quot;Jin&quot;.</Paragraph>
      <Paragraph position="8"> Although this representation of paraphrase templates performs well, it is weak for words or structures that do not exist in the dictionary. The method is also unsuitable for representing named entities.</Paragraph>
    </Section>
  </Section>
</Paper>