<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-5009">
  <Title>Evaluating Contextual Dependency of Paraphrases using a Latent Variable Model</Title>
  <Section position="3" start_page="66" end_page="66" type="metho">
    <SectionTitle>
3 Evaluating Paraphrases with Latent Variable Models
</SectionTitle>
      <Paragraph position="0"> To evaluate a paraphrasing pair of sentences, we must prepare a learning corpus for constructing latent variable models. It must be organized so that it consist of documents, and each document must be implicated in a specific context.</Paragraph>
      <Paragraph position="1"> Both latent variable models pLSI and LDA require vector format data for their learning. In this paper, we follow the bag-of-words approach and prepare vector data that consist of words and their frequency for each document in the learning corpus. null After constructing the pLSI and LDA models, wecan inferatopicby usingthe models with vector data that correspond to a target sentence.The vector data for the target sentence are constructed by using thetarget sentence and the sentences that surround it. From these sentences, the vector data that correspond to the target sentence are constructed. We call the number of sentences used to construct vector data &amp;quot;window size.&amp;quot; Evaluating a paraphrasing pair (P1:P2) is simple. Construct vector data (vec(P1) and vec(P2)) and infer contexts (T(P1) and T(P2)) by using a latent variable model. Using pLSI, the topic that indicates the highest probability is used as the inferred result, and using LDA, the largest parameter that corresponds to the topic is used as the inferred result. If topics T(P1) and T(P2) are different, the sentences might be used in different contexts, and the paraphrasing pair would be contextually dependent; otherwise, the paraphrasing pair would be contextually independent.</Paragraph>
  </Section>
  <Section position="4" start_page="66" end_page="69" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We carried out several experiments that automatically evaluated extracted paraphrases with pLSI and LDA. To carry out these experiments, we used plsi-0.031 by Kudo for pLSI and lda-c2 toolkit by Blei (Blei et al., 2003) for LDA.</Paragraph>
    <Section position="1" start_page="66" end_page="67" type="sub_section">
      <SectionTitle>
4.1 Data set
</SectionTitle>
      <Paragraph position="0"> We used a bilingual corpus of travel conversation containing Japanese sentences and corresponding English translations (Takezawa et al., 2002). Since the translations weremade sentence by sentence, this corpus was sentence-aligned from its origin and consisted of 162,000 sentence pairs.</Paragraph>
      <Paragraph position="1"> The corpus was manually and roughly annotated with topics. Each topic had a two-level hierarchical structure whose first level consisted of 19 topics. Each first-level topic had several subtopics. The second level consisted of 218 topics, after expanding all subtopics of each topic in the first level. A rough annotation example is shown in Table 1; the hierarchical structure of this topic seems unorganized. For example, in the first-level topic, there are topics labeled basic and communication, which seem to overlap.</Paragraph>
      <Paragraph position="2">  sentence 1st topic 2nd topic Where is the nearest department store? shopping buying something That's too flashy for me. shopping choosing something There seems to be a mistake on my bill. staying checkout There seems to be a mistake on my bill. staying complaining In the corpus, however, there is an obvious textual cohesion such that sentences of the same topic are locally gathered. Each series of sentencescan beused asadocumentforatextmodel.</Paragraph>
      <Paragraph position="3"> Under the assumption that each series of sentences is a document, the average number of sentences included in a document is 18.7, and the average number of words included in a document is 44.9.</Paragraph>
    </Section>
    <Section position="2" start_page="67" end_page="67" type="sub_section">
      <SectionTitle>
4.2 Extracting paraphrases
</SectionTitle>
      <Paragraph position="0"> A large collection of parallel texts contains many sentences in one language that correspond to the same expression in the other language for translation. For example, if Japanese sentences Ji1,...,Jim correspond to English sentence Ei, then these Japanese sentences would be paraphrases. null We utilized a very simple method to extract Japanese paraphrases from the corpus. First, we extracted duplicate English sentences by exact matching. From the learning set, 18,505 sentences were extracted. Second, we collected Japanese sentences that correspond to each extracted English sentence. Next, we obtained sets of Japanese sentences collected by using English sentences as pivots. In the corpus, one English sentence averaged almost 4.5 Japanese sentences, but this number included duplicate sentences.</Paragraph>
      <Paragraph position="1"> If duplicate sentences are excluded, the average number of Japanese sentences corresponding to an English sentence becomes 2.4. Finally, we obtained 944,547 Japanese paraphrasing pairs by combining sentences in each group of Japanese sentences.</Paragraph>
    </Section>
    <Section position="3" start_page="67" end_page="68" type="sub_section">
      <SectionTitle>
4.3 Comparing human judgement and inference
by latent variable models
</SectionTitle>
      <Paragraph position="0"> inference by latent variable models In this section, we determine the difference between manually annotated topics and inference results using pLSI and LDA. We originally considered evaluating each paraphrase as a binary classification problem that determines whether both sentences of the paraphrase are used in the same context. We evaluated the inferred results by comparison with the manually annotated topics, and thus accuracy could be calculated when themanuallyannotated topicswerecorrect. However, accuracy is inappropriate for evaluating results inferred by a latent variable model, since the topicswereroughlyannotated byhumans asmentioned in Section 4.1. Accordingly, we employed Kappa statistics as a rough guide for the correctness of the inferred resultsby latentvariablemodels. null Tables 2 and 3 show the comparison results, where the window size is 11 (the target sentence + the previous five and the following five sentences). When constructing pLSI models, the parameter for tempered EM (TEM) is set to 0.9 (we use this value in all of the experiments in this paper), because it showed the best performance in preliminary experiments. We performed the experiments on several topics.</Paragraph>
      <Paragraph position="1">  treat inference results as vector data. Thus, we can use a metric to classify the two vectors that correspond totheinferredresultsofanytwogiven sentences. We use cosine as a metric and con- null ducted comparison experiments for the first- and second-level topics, as shown in Tables 4 and 5.</Paragraph>
      <Paragraph position="2"> The threshold values used to judge whether topics are the same are indicated in the parentheses.</Paragraph>
      <Paragraph position="3">  We also performed an experiment to confirm the relationship between Kappa statistics and window-size context. Experiments were done underthefollowingconditions: thenumberoftopics was 20 for both pLSI and LDA, Kappa statistics were calculated for the first-level topic, and window sizes were 5, 11, 15, 21, 25, and 31. Table 6  shows the experimental results.</Paragraph>
      <Paragraph position="4"> The actual computing time needed to evaluate 944,547 paraphrases with a Pentium M 1.4-GHz, 1-GB memory computer is shown in Table 7. It is important to note that the inference program for pLSI was written in Perl, but for LDA it was written in C.</Paragraph>
    </Section>
    <Section position="4" start_page="68" end_page="69" type="sub_section">
      <SectionTitle>
4.4 Experiments from paraphrasing
perspectives
</SectionTitle>
      <Paragraph position="0"> To investigate the upper bound of our method, we carried out several experiments. So far in this paper, we have discussed topic information as an approximation of contextual information by comparing topics annotated by humans and automatically inferred by pLSI and LDA. However, since our goal is to evaluate paraphrases, we need to determine whether latent variable models detect a difference of topics for sentences of paraphrases.</Paragraph>
      <Paragraph position="1"> First, we randomly selected 1% of the English seed sentences. Each sentence corresponds to several Japanese sentences, so we could produce Japanese paraphrasing pairs. The number of selected English sentences was 185.</Paragraph>
      <Paragraph position="2"> Second, we generated 9,091 Japanese paraphrasing pairs from the English seed sentences.</Paragraph>
      <Paragraph position="3"> However, identicalsentencesexisted insomegenerated paraphrasing pairs. In other words, these sentences were simply collected from different  places in the corpus. From a paraphrasing perspective, suchpairs are useless. Thus weremoved them and randomly selected one pair from one English seed sentence.</Paragraph>
      <Paragraph position="4"> Finally, we sampled 117 paraphrasing pairs and evaluated them based on a paraphrasing perspective: whether a paraphrase is contextually independent. There were 71 contextually independent paraphrases and 37 contextually dependent paraphrases. Nine paraphrases had problems, all of which were caused by translation errors. The phrase &amp;quot;contextually independent paraphrases&amp;quot; means that the paraphrases can be used in any context and can be applied as two-way paraphrases. On the other hand, &amp;quot;contextually dependent paraphrases&amp;quot; means that the paraphrases are one-way, and so wehaveto give consideration to the direction of each paraphrase.</Paragraph>
      <Paragraph position="5">  We removed the nine problematic paraphrasing pairs and evaluated the remaining samples with manually annotated topic labels, as shown in Table 8. According to the basic idea of this method, a contextually independent paraphrasing pair should be judged as having the same topic, and a contextually dependent pair should be judged as having a different topic. Thus, we introduced a criterion to evaluate labeling results in terms of an error rate, defined as follows: Error rate = |Dindep|+|Sdep|# of judged pairs, (7) where Dindep denotes a set that consists of paraphrasing pairs that are judged as having different topics but are contextually independent. On the other hand, Sdep denotes a set that consists of paraphrasing pairs that are judged as having the same topic, but are contextually dependent.</Paragraph>
      <Paragraph position="6"> For example, from the results in Table 8, the error rate of the results for the first-level topic is 0.398 ((25 + 18)/108), and that for the second-level topic is 0.528 ((46 + 11)/108).</Paragraph>
      <Paragraph position="7"> Toestimate the upperbound ofthis method, we also investigated potentially unavoidable errors.</Paragraph>
      <Paragraph position="8"> Several paraphrasing pairs are used for the exact same topic, but they seem contextually dependent because several words are different. On the other hand, some paraphrasing pairs seem to be used in obviously different topics but are contextually independent. Table 9 shows the investigation results; at least ten paraphrasing pairs seem contextually independent but are actually used in different topics. In addition, there are at least 15 paraphrasing pairs whose topic is obviously the same, but several differences of words make them contextually dependent. Moreover, in this case, the error rate is 0.231 ((15+10)/108), meaning that it is difficult to judge all of the paraphrasing pairs correctly by using only topic (contextual) information. Thus, this method's upper bound of accuracy when using only topic information is estimated to be around 77%.</Paragraph>
      <Paragraph position="9"> Table 9: Potential upper bound of this method human judgement human judgement from paraphrasing based on topic perspective same different independent 61 10 dependent 15 22 We prepared several latent variable models to investigate the performance of the proposed method and applied it to the sampled paraphrasing sentences mentioned above. Table 10 shows the evaluation results.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="69" end_page="71" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> First, there is no major performance difference between pLSI and LDA in paraphrasing evaluation. On average, LDA is slightly better than pLSI. Blei et al. showed that LDA outperforms pLSI in (Blei et al., 2003); however, in some of the cases shown in Tables 2 and 3, pLSI outperforms LDA. On the contrary, using a cosine metric, LDAhasasignificantproblem: itlosesitsdistinguishing ability when the number of topics (latent variables) becomes large. With such a large number of topics, LDA always infers a point near the gravity point of the topic simplex. In addition, using a cosine metric also requires a threshold to  judge a pair of paraphrasing sentences.</Paragraph>
    <Paragraph position="1"> From Table 6, LDA seems robust against the inclusion of noisy sentences with a large window, but it is easily affected by a small window. On the other hand, pLSI seems robust against information shortages due to a small window, but it is not effective with a large window. The best performanceswereshownatwindowsize 15 for both pLSI and LDA, since the average number of sentences in a document (segment) is 18.7, as shown in Section 4.1.</Paragraph>
    <Paragraph position="2"> Table 7 shows that in spite of the difference in programing language, pLSI is faster than LDA in practice. In addition, Table 8 reveals that judging the contextual dependency of paraphrasing pairs does not require fine-grained topics.</Paragraph>
    <Paragraph position="3"> From the results shown in Table 10, we can conclude that topic inference by latent variable models resembles context judgement by humans as recorded in error rate. However, we note that the error rate was not weighted for contextually independent or dependent results. Error rate is simply a relative index. For example, if there is a result in which all of the inferences reflect the same topic, then the error rate becomes 0.3426.</Paragraph>
    <Paragraph position="4"> Thus it is important to detect a contextually dependent paraphrase. Considering these points, pLSI20 with window size 11 shows very good results in Table 10.</Paragraph>
    <Paragraph position="5"> In Section 4.4, we showed the potential upper bound of this method. The smallest error rate is 0.231, and we can estimate a corrected error by the following formula: |Dindep|+|Sdep|[?]C # of judged pairs[?]C, (8) where C denotes the correction value that corresponds to the number of paraphrasing pairs judged incorrectly with only contextual information. In our experiments, from the results shown in Table 9, C is set to 25. From the results shown inTable10, wecan concludethattheperformance of our method is almost the same as that by the manually annotated topics, and the accuracy of our method is almost 80% for paraphrasing pairs that can be judged by contextual information.</Paragraph>
    <Paragraph position="6"> There are several possibilities for improving accuracy. One is using a fixed window to obtain contextual information. Irrelevant sentences are sometimes included in fixed windows, and latent variable models fail on inference. If we could infer a boundary of topics with high accuracy,  we would be able to dynamically detect a precise window using some other reliable text models specialized to text segmentation.</Paragraph>
    <Paragraph position="7"> So far, we have mainly discussed the contextual dependency of paraphrasing pairs. However, when a paraphrasing pair is contextually dependent, it is also important to infer its specific paraphrasing direction. Unfortunately, we conclude that inferring the paraphrasing direction with contextual information is difficult. In the experimental results, however, there were several examples whose direction could be inferred from their contextualinformation. Thus, contextual information may benefit the inference of paraphrasing direction. Actually, in the experiments, 11 of 37 contextual dependent pairs had obvious paraphrasing directions. In most of the paraphrasing pairs, differentwordswereused orinserted, orsomewords were deleted. Thus, to infer a paraphrasing direction, weneedmore specific information for words or sentences; for example what words carry specific or generic meaning and so on.</Paragraph>
    <Paragraph position="8"> One might consider a supervised learning method, such as Support Vector Machine, to infer topics (e.g., (Lane et al., 2004)). However, we cannot know the best number of topics for an application in advance. Thus, a supervised learning method is promising only if we already know the best number of topics for which we can prepare an appropriate learning set.</Paragraph>
  </Section>
class="xml-element"></Paper>