<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1607">
  <Title>Criterion for Judging Request Intention in Response texts of Open-ended Questionnaires</Title>
  <Section position="3" start_page="6" end_page="7" type="metho">
    <SectionTitle>
SVM
</SectionTitle>
    <Paragraph position="0"> For each iteration of the cross-validation, 8/10 of the data was used for training, 1/10 for parameter adjustment, and 1/10 for testing. The</Paragraph>
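    <Paragraph> The 8/1/1 partition described above can be sketched as follows. This is only an illustrative sketch: the function name ten_fold_splits, the shuffling step, and the choice of the fold after the test fold as the parameter-adjustment fold are assumptions, since the text does not say how the folds were rotated.</Paragraph>

```python
import random


def ten_fold_splits(items, seed=0):
    """Yield (train, dev, test) for each of the 10 cross-validation iterations."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # shuffling is an assumption
    folds = [shuffled[k::10] for k in range(10)]  # ten near-equal folds
    for i in range(10):
        dev_index = (i + 1) % 10  # assumed: the fold after the test fold
        test = folds[i]  # 1/10 of the data for testing
        dev = folds[dev_index]  # 1/10 for parameter adjustment
        # The remaining eight folds (8/10 of the data) form the training set.
        train = [x for k, fold in enumerate(folds)
                 if k not in (i, dev_index) for x in fold]
        yield train, dev, test
```

    <Paragraph> Under this rotation every item appears in exactly one test fold, so each response is tested once over the ten iterations.</Paragraph>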
    <Paragraph position="2"> We define P as the mean of the precisions over the ten cross-validation iterations, $P = \frac{1}{10}\sum_{i=1}^{10} P_i$, where $P_i$ denotes the precision in the $i$-th iteration.</Paragraph>
    <Paragraph position="4"> We henceforth call P precision. The precisions of ME and SVM are given in Table 3, together with a baseline precision of 0.648 (=1944/3001), which was obtained by tagging all the responses as &amp;quot;possible&amp;quot;. In the table, the figures in columns &amp;quot;ME&amp;quot; and &amp;quot;SVM&amp;quot; are the precisions of ME and SVM. Line $F_i$ ($i=1,2,3$) indicates that the precisions in that line were obtained by using $F_i$ as a feature set.</Paragraph>
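    <Paragraph> As a rough illustration of the precision figures discussed here, the sketch below computes a per-iteration precision, averages the per-iteration values, and reproduces the baseline figure. The function names are hypothetical, and treating each per-iteration precision as the fraction of correctly tagged answers in the test data is an assumption.</Paragraph>

```python
def precision(predicted, gold):
    """Fraction of answers whose predicted tag equals the gold tag."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def mean_precision(per_iteration):
    """Mean of the per-iteration precisions (P in the text)."""
    return sum(per_iteration) / len(per_iteration)


# Baseline from the text: 1944 of the 3001 responses carry the majority tag.
baseline = 1944 / 3001  # about 0.648
```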
    <Paragraph position="6"> an answer into a word sequence.</Paragraph>
    <Paragraph position="7">  This data was different from the response text analyzed in Section 3.1.</Paragraph>
    <Paragraph position="8">  We used the polynomial kernel for SVM and tried degrees d=1 and d=2. Since d=1 outperformed d=2, the results for d=1 are given in Table 3. We use one-sided Welch tests to assess the differences between precisions, and we call a difference &amp;quot;statistically significant&amp;quot;, or simply &amp;quot;significant&amp;quot;, when it is statistically significant at the 1% level.</Paragraph>
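    <Paragraph> For readers unfamiliar with the Welch test, the following sketch computes its test statistic and the Welch-Satterthwaite degrees of freedom for two samples. Applying it to the ten per-iteration precisions of two methods is an assumption, as the text does not state which samples were compared; the function name welch_t is hypothetical.</Paragraph>

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance test of mean(a) > mean(b)."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

    <Paragraph> For a one-sided test at the 1% level, t would be compared with the upper 1% critical value of the t distribution with df degrees of freedom, taken from a table or a statistics library.</Paragraph>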
    <Paragraph position="9"> Table 3 indicates that both ME and SVM outperform the baseline by a large margin. The differences were, of course, statistically significant. Therefore, we can conclude that these methods are quite effective in this task.</Paragraph>
    <Paragraph position="10">  This table also indicates that ME and SVM are comparable in precision; the differences in precision were not statistically significant. We next compared the highest precisions in lines F  as a feature set.</Paragraph>
    <Paragraph position="11"> Table 3 demonstrates that we can expect about 91% precision in deciding the paraphrasability by using either ME or SVM. This is a reasonably high precision. Therefore, we can conclude that the criterion proposed in Section 2.2 is sufficiently objective and stable.</Paragraph>
    <Paragraph position="12"> 4 Evaluation by different judges. In Section 3, we described the manual analytical evaluation by a single judge and the objective evaluation by machine learning, which used a corpus prepared on the basis of that analytical evaluation. Section 4 describes experiments carried out by multiple judges.</Paragraph>
    <Section position="1" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
4.1 Evaluation of reproducibility: judgment of paraphrasing by multiple judges
</SectionTitle>
      <Paragraph position="0"> The subjects of this experiment were three male native speakers of Japanese in their twenties who were engineering majors. The experiment was carried out on a total of 24,000 randomly selected sentences from the OEQ corpus described in Section 3.1, applying the criterion proposed in Section 2.2. If a response text included multiple sentences, they were separated into single sentences as mentioned in Section 3.1.</Paragraph>
      <Paragraph position="2"> $\frac{\text{number of correctly tagged answers}}{\text{total number of answers in the test data}}$</Paragraph>
      <Paragraph position="4"> Of the 24,000 sentences, the three subjects A, B and C were each given 8,000. The pairs A and B, B and C, and A and C were each given 4,000 sentences in common, so the number of distinct sentences totaled 12,000.</Paragraph>
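      <Paragraph> The allocation arithmetic above (8,000 sentences per subject, 4,000 shared by each pair, 12,000 distinct sentences) can be checked with a small sketch. The function name allocate and the order in which blocks are assigned to pairs are assumptions made only for illustration.</Paragraph>

```python
from itertools import combinations


def allocate(sentences, judges=("A", "B", "C"), shared=4000):
    """Give each pair of judges one block of `shared` common sentences."""
    assignment = {judge: [] for judge in judges}
    # Pairs in order: (A, B), (A, C), (B, C); the ordering is an assumption.
    for k, pair in enumerate(combinations(judges, 2)):
        block = sentences[k * shared:(k + 1) * shared]
        for judge in pair:
            assignment[judge].extend(block)
    return assignment
```

      <Paragraph> With 12,000 distinct sentences, each of the three subjects receives two blocks, that is 8,000 sentences, for 24,000 judgments in total.</Paragraph>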
      <Paragraph position="5"> As shown in Table 1 in Section 3.1, direct request expressions can be paraphrased with te-hoshii; therefore, we deal only with the judgment at the second level in Fig. 2, namely the paraphrasing into te-hoshii. For the evaluation, we prepared a set of work instructions for the subjects, part of which is shown below.</Paragraph>
      <Paragraph position="6"> Work instructions 1) Not only the end expression but also case particles, case particle equivalents and those containing such expressions or expressions of connection are to be paraphrased.</Paragraph>
      <Paragraph position="7"> 2) If te-hoshii is to be changed to a negative request of shite-hoshiku-nai (do not want), place the word negative at the end.</Paragraph>
      <Paragraph position="8"> 3) Not only function words but also content words, and even the word order, may be changed in paraphrasing. #1 S(ource): Chuushajou ga sukunasugiru to omou (We think that there are not enough car parks.) - T(arget): Chuushajou wo fuyashite hoshii (We want car parks to be increased.) The experimental results are given in Table 4, where P means possible to paraphrase and NP means not possible. KC is the kappa coefficient. Generally, the closer the kappa coefficient is to 1, the higher the degree of agreement; agreement is complete when it is 1. In general, the ranges [0.81-1.00], [0.61-0.80], [0.41-0.60], [0.21-0.40] and [0.00-0.20] correspond to full, practical, medium, low, and no agreement, respectively.</Paragraph>
      <Paragraph position="9"> Therefore, as Table 4 indicates, the judging and paraphrasing by the three subjects using the criterion showed practical agreement between subjects A and C, and medium agreement between A and B and between B and C.</Paragraph>
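      <Paragraph> To make the agreement measure concrete, the sketch below computes a kappa coefficient for two judges and maps it to the agreement bands listed above. It assumes Cohen's kappa for two raters, which the text does not name explicitly; the function names and the handling of values that fall between the listed ranges (for example 0.805) are also assumptions.</Paragraph>

```python
def cohen_kappa(labels_1, labels_2):
    """Cohen's kappa for two equally long lists of category labels."""
    n = len(labels_1)
    observed = sum(1 for x, y in zip(labels_1, labels_2) if x == y) / n
    categories = set(labels_1) | set(labels_2)
    # Chance agreement from each judge's own label distribution.
    expected = sum((labels_1.count(c) / n) * (labels_2.count(c) / n)
                   for c in categories)
    if expected == 1:  # both judges used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def agreement_band(kappa):
    """Map a kappa value to the bands in the text (boundaries assumed)."""
    if kappa > 0.80:
        return "full"
    if kappa > 0.60:
        return "practical"
    if kappa > 0.40:
        return "medium"
    if kappa > 0.20:
        return "low"
    return "no"
```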
      <Paragraph position="10"> These results indicate that the method based on the criterion, whether used by a single judge or by different judges (= subjects) for analysis and experiment, enables requests and non-requests to be distinguished. Therefore, we can conclude that using the criterion enables even untrained people to reproduce the extraction of requests.</Paragraph>
      <Paragraph position="11"> Sentences such as #2 and #3 below are examples of sentences that the subjects agreed were non-paraphrasable. These include expressions of intention in which the current situation is accepted passively, such as &amp;quot;shiyou ga nai (I think that it cannot be helped)&amp;quot; in #2, or in which the current situation is actively accepted, such as &amp;quot;subarashii (are wonderful)&amp;quot; in #3. Furthermore, #4 is a sentence that begins with a clear statement of reason, &amp;quot;riyuu wa (the reason is).&amp;quot; This indicates that a motive for requests exists, and that a response formed by multiple sentences often exhibits request-motive adjacency in its discourse structure.</Paragraph>
      <Paragraph position="12"> Examples of sentences that could not be paraphrased: #2 Hitsuyou ga areba ryoukin no neage mo shiyou ga nai to omou. (I think that it cannot be helped if a rise in charges is necessary.) #3 Kurumaisu no hito demo, raku ni achikochi hitori de kaimono ya sanpo ga dekiru machi, douro de subarashii. (The town and roads are wonderful, as even people in wheelchairs can easily go shopping and stroll here and there by themselves.) #4 Riyuu wa zentaiteki na hatten ga nozomenai node. (The reason is that overall development cannot be hoped for.) This analysis shows that paraphrasable sentences indicate requests, and non-paraphrasable sentences indicate acceptance of the current situation or motives for requests.</Paragraph>
    </Section>
    <Section position="2" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
4.2 Evaluation of effectiveness: judging intention without using the criterion
</SectionTitle>
      <Paragraph position="0"> To evaluate whether the criterion proposed in Section 2.2 is effective, we carried out an experiment in which subjects judged whether a response expresses a request without using the criterion. The two subjects, D and E, who took part in this experiment were both native speakers of Japanese. Subject D was a male student in his twenties from the education department of a university, and subject E was a female student, also in her twenties, from the literature department of a university. They used the same 4,000 sentences that subjects B and C used in Section 4.1. Subjects D and E did not consult each other and carried out the work separately. We gave them the following instructions before they started the work.</Paragraph>
      <Paragraph position="1"> * Each response sentence is context-free.</Paragraph>
      <Paragraph position="2"> * Judge intuitively, and mark 1 if you think the sentence shows a request, and mark 0 if you do not.</Paragraph>
      <Paragraph position="3"> * Make sure to mark either 1 or 0.</Paragraph>
      <Paragraph position="4"> The results of the experiment are given in Table 5, where 1 and 0 in the right table correspond to P and NP in Table 4. We show the data for subjects B and C again because they used the same data as subjects D and E. In Table 5, the kappa coefficient (KC) between D and E is lower than that between B and C. Moreover, it is the lowest among all those given in Tables 4 and 5. The KC of 0.17 means there is no agreement between D and E.</Paragraph>
      <Paragraph position="5"> The results indicate that the rate of agreement is higher for judgments made using the criterion than for subjective judgments. That is to say, they demonstrate the effectiveness of the criterion.</Paragraph>
    </Section>
    <Section position="3" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
4.3 Examination of evaluation results
</SectionTitle>
      <Paragraph position="0"> We examine here mainly the cases in which no agreement was obtained with respect to paraphrasing in the experiment described in Section 4.1. Table 4 shows the cases where disagreement was considerable. The results for these cases, shown in Table 6, indicate that disagreement arises when sentences are paraphrased into forms that include a clause of cause or reason marked by &amp;quot;node&amp;quot; (because), as in #5. The clause is underlined in the target sentence in #5.</Paragraph>
      <Paragraph position="1"> #5 S: Semai douro wo masumasu semaku shite iru. (A narrow road is made even narrower.) T: Semai douro wo masumasu semaku shite iru node, dounika shite hoshii. (Because the narrow road is made even narrower, I would like to see something done about it.) The source sentence #5 is a statement describing the condition of the road being narrow. This statement can be seen as a motive for the request in the target sentence of #5. That is to say, the source sentence #5 itself shows not the content of a request but the &amp;quot;motive for request.&amp;quot; The three subjects disagreed in their judgments on whether or not the &amp;quot;motive for request&amp;quot; sentence was paraphrasable, as shown in the bottom line of Table 6. As the table indicates, disagreement rates of 64.4%, 51.5%, and 9.0% were obtained between A and B, A and C, and B and C, respectively. The reason for the high disagreement rates was that we did not give clear directions in the work instructions. Sentences whose paraphrase includes &amp;quot;node&amp;quot; are not requests and should not be extracted. This means these sentences should have been considered non-paraphrasable. On the other hand, with regard to &amp;quot;motive for request&amp;quot; sentences, example #1 in Section 4.1 was a case in which the work instructions asked the subjects to paraphrase such a sentence. That is, the work instructions suggested that the source sentence #1 &amp;quot;I think that we do not have enough car parks&amp;quot; is a motive for the request &amp;quot;I want car parks to be increased.&amp;quot; This kind of inadequate instruction led to instability in the work done and might have increased the disagreement rates obtained in the judgment.</Paragraph>
      <Paragraph position="2"> However, according to the data prepared by the expert referred to in Section 3.2, &amp;quot;motive for request&amp;quot; sentences cannot be paraphrased into te-hoshii, and machine learning has confirmed that the data are objective. Therefore, it can be considered that the work of removing &amp;quot;motive for request&amp;quot; sentences can be done stably. This means that if the work instructions give clear directions such as &amp;quot;if you are able to add node at the end of a sentence, that sentence should be regarded not as the content of a request but as a motive for a request,&amp;quot; then the rate of agreement may be improved.</Paragraph>
    </Section>
  </Section>
</Paper>