<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-5004">
  <Title>A Class-oriented Approach to Building a Paraphrase Corpus</Title>
  <Section position="4" start_page="26" end_page="26" type="metho">
    <SectionTitle>
3 Related work
</SectionTitle>
    <Paragraph position="0"> Previous work on building paraphrase corpora (collecting paraphrase examples) can be classified into two directions: manual production of paraphrases and automatic paraphrase acquisition.</Paragraph>
    <Section position="1" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
3.1 Manual production of paraphrases
</SectionTitle>
      <Paragraph position="0"> Manual production of paraphrase examples has been carried out in MT studies.</Paragraph>
      <Paragraph position="1"> For example, Shirai et al. (2001) and Kinjo et al. (2003) use collections of Japanese-English translation sentence pairs. Given translation pairs, annotators are asked to produce new translations for each language side of a pair.</Paragraph>
      <Paragraph position="2"> Sentences that have an identical translation are collected as equivalents, i.e., paraphrases.</Paragraph>
      <Paragraph position="3"> Shimohata (2004), on the other hand, takes a simpler approach in which he asks annotators to produce paraphrases of a given set of English sentences.</Paragraph>
      <Paragraph position="4"> Obviously, if we simply asked human annotators to produce paraphrases of a given set of sentences, the labor cost would be high and coverage would not be guaranteed. Previous work, however, has not evaluated the cost-efficiency of the method or the coverage of the collected paraphrases, supposedly because its primary concern was to enhance MT systems.</Paragraph>
    </Section>
    <Section position="2" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
3.2 Automatic paraphrase acquisition
</SectionTitle>
      <Paragraph position="0"> Recently, paraphrase examples have been automatically collected as a source for acquiring paraphrase knowledge, such as pairs of synonymous phrases and syntactic transformation templates.</Paragraph>
      <Paragraph position="1"> Some studies exploit topically related articles derived from multiple news sources (Barzilay and Lee, 2003; Shinyama and Sekine, 2003; Quirk et al., 2004; Dolan et al., 2004). Sentence pairs that are likely to be paraphrases are automatically collected from the parallel or comparable corpora, using such clues as overlaps of content words and named entities, syntactic similarity, and reference descriptions such as the date of the article and the positions of sentences within the articles.</Paragraph>
      <Paragraph position="2"> Automatic acquisition from parallel or comparable corpora, possibly in combination with manual correction, could be more cost-efficient than manual production. However, it would not ensure coverage and quality, because sentence pairing algorithms virtually limit the range of obtainable paraphrases and the output tends to be noisy.</Paragraph>
      <Paragraph position="3"> Nevertheless, automatic methods are useful for discovering a variety of paraphrases that need further exploration. We hope that our approach to corpus construction, which we present below, will complement those directions of research.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="26" end_page="26" type="metho">
    <SectionTitle>
4 Proposed method
</SectionTitle>
    <Paragraph position="0"> Recall that we require a corpus that reflects the distribution of the occurrences of potential paraphrases of each class because we aim to use it for linguistic analysis and quantitative evaluation of paraphrase generation models.</Paragraph>
    <Paragraph position="1"> Since the issues we address here are highly empirical, we need to examine a range of possible methods empirically to gain useful methodological insights. As an initial attempt, we have so far examined a simple method that falls in the middle of the two aforementioned approaches. The method makes use of an existing paraphrase generation system to reduce human labor cost as well as to ensure coverage and quality: Step 1. For a given paraphrase class, develop a set of morpho-syntactic paraphrasing patterns and lexical resources.</Paragraph>
    <Paragraph position="2"> Step 2. Apply the patterns to a given text collection using the paraphrasing system to generate a set of candidate paraphrases.</Paragraph>
    <Paragraph position="3"> Step 3. Annotate each candidate paraphrase with a judgement of its appropriateness according to a set of judgement criteria.</Paragraph>
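    <Paragraph> Steps 1 to 3 can be read as a small pipeline. The following is a minimal sketch in Python under stated assumptions, not the system used in this work: the names `Candidate`, `generate_candidates`, `annotate`, and `toy_lvc_pattern` are hypothetical, and the single string-rewriting rule stands in for patterns that the actual system matches over dependency structures.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# A pattern maps a source sentence to a candidate paraphrase,
# or returns None when it does not apply (hypothetical interface).
Pattern = Callable[[str], Optional[str]]


@dataclass
class Candidate:
    source: str
    paraphrase: str
    appropriate: Optional[bool] = None  # filled in during Step 3


def toy_lvc_pattern(sentence: str) -> Optional[str]:
    """Step 1 (toy stand-in): one hand-written rule with a one-entry lexicon."""
    if "hanbai-o hajime-ta" in sentence:
        return sentence.replace("hanbai-o hajime-ta", "hanbai-shi-ta")
    return None


def generate_candidates(sentences: List[str], patterns: List[Pattern]) -> List[Candidate]:
    """Step 2: apply every pattern to every sentence and keep the outputs."""
    candidates = []
    for sentence in sentences:
        for pattern in patterns:
            result = pattern(sentence)
            if result is not None and result != sentence:
                candidates.append(Candidate(sentence, result))
    return candidates


def annotate(candidates: List[Candidate], judge: Callable[[Candidate], bool]) -> List[Candidate]:
    """Step 3: record an appropriateness judgement for each candidate."""
    for candidate in candidates:
        candidate.appropriate = judge(candidate)
    return candidates


if __name__ == "__main__":
    sentences = ["concert-no ticket-no hanbai-o hajime-ta."]
    for c in annotate(generate_candidates(sentences, [toy_lvc_pattern]), lambda c: False):
        print(c)
```

In this toy run the raw candidate is judged inappropriate, which mirrors the situation in which annotators must revise case markers or append verbal suffixes before a candidate becomes an acceptable paraphrase.</Paragraph>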
    <Paragraph position="4"> We use morpho-syntactic paraphrasing patterns derived from paraphrase samples in a way analogous to previous methods such as (Dras, 1999).</Paragraph>
    <Paragraph position="5"> For instance, from example (1), we derive a paraphrasing pattern for paraphrasing of light-verb constructions. [Pattern (6) and the beginning of its explanation are missing from the extracted text.] ... N, and the subscripted arrow in (6s) indicates that N-o depends on V.</Paragraph>
    <Paragraph position="6"> To exhaustively collect paraphrase examples from a given text collection, we should not excessively constrain paraphrasing patterns. To avoid overgenerating anomalous candidates, on the other hand, we make use of several lexical resources. For instance, pairs of a deverbal noun and its transitive form are used to constrain N and V(N) in pattern (6). This way, we combine syntactic transformation patterns with lexical constraints to specify a paraphrase class. This approach is practical given the recent advances in shallow parsers.</Paragraph>
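    <Paragraph> The combination of a loose syntactic condition with a lexical constraint can be sketched as follows. This is an illustrative sketch only: the `Token` representation, the function `match_lvc`, and the two-entry `DEVERBAL` table are assumptions made for this example and are far simpler than real dependency-parser output or a full dictionary.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    lemma: str            # base form, e.g. "shigeki"
    pos: str              # coarse part of speech: "noun" or "verb"
    case: Optional[str]   # case marker on a noun, e.g. "o"; None otherwise
    head: Optional[int]   # index of the token this one depends on; None for the root


# Toy lexical resource: deverbal noun -> verbalized form (illustrative entries only).
DEVERBAL = {"shigeki": "shigeki-suru", "hanbai": "hanbai-suru"}


def match_lvc(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Return (noun index, verb index, V(N)) for each N-o that depends on a verb V."""
    matches = []
    for i, tok in enumerate(tokens):
        # Syntactic condition: a noun marked with "o" that has a head.
        if tok.pos != "noun" or tok.case != "o" or tok.head is None:
            continue
        # Syntactic condition: the head must be a verb; the verb itself is unconstrained.
        if tokens[tok.head].pos != "verb":
            continue
        # Lexical constraint: N must be a known deverbal noun.
        if tok.lemma not in DEVERBAL:
            continue
        matches.append((i, tok.head, DEVERBAL[tok.lemma]))
    return matches


if __name__ == "__main__":
    sentence = [Token("shigeki", "noun", "o", 1), Token("ukeru", "verb", None, None)]
    print(match_lvc(sentence))
```

Leaving the verb slot unconstrained keeps recall high, while the noun table filters out matches that could never be verbalized.</Paragraph>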
    <Paragraph position="7"> For the judgement on appropriateness in Step 3, we create a set of criteria separately for each paraphrase class. When the paraphrase class in focus is specified, the range of potential errors in candidate generation tends to be predictable. We therefore specify judgement criteria in terms of a typology of potential errors (Fujita and Inui, 2003); namely, we provide annotators with a set of conditions for ruling out inappropriate paraphrases. Annotators judge each candidate paraphrase through a view of an RDB-based annotation tool (Figure 1). Given (a) a source sentence and (b) an automatically generated candidate paraphrase, human annotators are asked to (c) judge its appropriateness and, if it is inappropriate, to (d) classify the underlying errors into a predefined taxonomy, (e) make appropriate revisions (if possible), and (f) add format-free comments.</Paragraph>
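    <Paragraph> The items (a) to (f) suggest a simple record layout. The sketch below is one possible representation, assuming hypothetical field names and a hypothetical error label; the schema of the actual annotation tool is not described here, although a relational table with the same columns would be a natural counterpart.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    source: str                         # (a) source sentence
    candidate: str                      # (b) automatically generated candidate
    appropriate: bool                   # (c) appropriateness judgement
    error_class: Optional[str] = None   # (d) error category, only if inappropriate
    revision: Optional[str] = None      # (e) revised paraphrase, if one is possible
    comment: str = ""                   # (f) format-free comment

    def __post_init__(self) -> None:
        # Error classes and revisions only make sense for rejected candidates.
        if self.appropriate and (self.error_class or self.revision):
            raise ValueError("error_class/revision given for an appropriate candidate")
        # A rejected candidate must be classified; a revision stays optional.
        if not self.appropriate and self.error_class is None:
            raise ValueError("an inappropriate candidate needs an error class")


if __name__ == "__main__":
    record = Annotation(
        source="concert-no ticket-no hanbai-o hajime-ta.",
        candidate="concert-no ticket-no hanbai-shi-ta.",
        appropriate=False,
        error_class="case_marker",  # hypothetical label, not from the taxonomy
        revision="concert-no ticket-o hanbai-shi-dashi-ta.",
    )
    print(record)
```

Validating the record at construction time keeps the collected judgements consistent with the rule that error classes and revisions apply only to rejected candidates.</Paragraph>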
  </Section>
  <Section position="6" start_page="26" end_page="29" type="metho">
    <SectionTitle>
5 Preliminary trials
</SectionTitle>
    <Paragraph position="0"> To examine how the proposed method actually works with respect to these issues, we conducted preliminary trials on two classes of Japanese paraphrases: paraphrasing of light-verb constructions and transitivity alternation. This section describes the settings for each paraphrase class.</Paragraph>
    <Paragraph position="1"> We sampled a collection of source sentences from one year's worth of newspaper articles, Nihon Keizai Shinbun, 2000, where the average sentence length was 25.3 words. We selected newspaper articles as the sample source because most of the publicly available shallow parsers for Japanese were trained on a treebank sampled from newspaper articles, and because a newspaper corpus was available on a considerably large scale. For candidate generation, we used the morphological analyzer ChaSen, the dependency structure analyzer CaboCha, and the paraphrase generation system KURA.</Paragraph>
    <Paragraph position="2"> Two native speakers of Japanese, both adult university graduates, were employed as annotators. The process of judging each candidate paraphrase is illustrated in Figure 2. The first annotator was asked to make judgements on each candidate paraphrase. The second annotator inspected all the candidates judged correct by the first annotator. A subset of the candidates that the first annotator judged incorrect was checked by the second annotator, leaving the rest labeled incorrect. Once every several days, the annotators discussed cases on which they disagreed and, if possible, revised the annotation criteria. When the discussion did not reach a consensus, the judgement was deferred.</Paragraph>
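    <Paragraph> The two-annotator procedure can be sketched as follows. This is a simplified illustration rather than the procedure as actually run: the function name `review`, the label strings, and the use of a random sampling rate to choose which rejected candidates the second annotator sees are all assumptions, since the text states only that a subset was checked.

```python
import random
from typing import Callable, Dict


def review(
    first_labels: Dict[str, bool],
    second_judge: Callable[[str], bool],
    sample_rate: float,
    seed: int = 0,
) -> Dict[str, str]:
    """Combine two annotators' judgements.

    first_labels maps a candidate id to annotator 1's judgement (True = correct).
    second_judge returns annotator 2's judgement for a candidate id.
    The result maps each id to "correct", "incorrect", or "disagreed".
    """
    rng = random.Random(seed)
    final = {}
    for cid, label in first_labels.items():
        # Annotator 2 sees every accepted candidate but only a sample of rejected ones.
        checked = label or sample_rate > rng.random()
        if not checked:
            final[cid] = "incorrect"      # left as labeled by annotator 1
        elif second_judge(cid) == label:
            final[cid] = "correct" if label else "incorrect"
        else:
            final[cid] = "disagreed"      # to be discussed; deferred if unresolved
    return final


if __name__ == "__main__":
    first = {"c1": True, "c2": False, "c3": True}
    print(review(first, lambda cid: cid != "c3", sample_rate=0.0))
```

Checking all accepted candidates twice protects the precision of the collected examples, while sampling the rejected ones limits the second annotator's workload.</Paragraph>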
    <Section position="1" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
5.1 Paraphrasing of light-verb constructions (LVC)
</SectionTitle>
      <Paragraph position="0"> An example of this class is given in (1). A light-verb construction consists of a deverbal noun ("shigeki (inspiration)" in example (1)) governed by a light-verb ("ukeru (to receive)"). A paraphrase of this class is a pair of a light-verb construction and its unmarked form, which consists of the verbalized form of the deverbal noun, with the light-verb removed.</Paragraph>
      <Paragraph position="1"> Let N and V be a deverbal noun and a verb, respectively, and let V(N) be the verbalized form of N. Paraphrases of this class can be represented by the following paraphrasing pattern:</Paragraph>
      <Paragraph position="3"> In the experiment, we used three more patterns to increase coverage.</Paragraph>
      <Paragraph position="4"> We then extracted 20,155 pairs of a deverbal noun and its verbalized form (e.g. "shigeki (inspiration)" and "shigeki-suru (to inspire)") from the Japanese word dictionary IPADIC (version 2.6.3). This set was used as a restriction on nouns that can match N in a paraphrasing pattern. On the other hand, we placed no restriction on V, because we had no exhaustive list of light-verbs. The patterns were automatically compiled into pairs of dependency trees with uninstantiated components, and were applied to source sentences with the paraphrase generation system, which carried out dependency structure-based pattern matching. As a result, 2,566 candidate paraphrases were generated from 10,000 source sentences.
      In the judgement phase, the annotators were also asked to revise erroneous candidates if possible. The following revision operations were allowed for LVC: (i) change of conjugations; (ii) change of case markers; (iii) insertion of adverbs; and (iv) appending of verbal suffixes, such as voice, aspect, or mood devices. When pattern (7) is applied to sentence (1s), for instance, we need to add a voice device, "are (passive)," to correctly produce (1t). In example (8), on the other hand, an aspectual device, "dasu (inchoative)," is appended, and a case marker, "no (GEN)," is replaced with "o (ACC)."
      (8) s. concert-no ticket-no hanbai-o hajime-ta.
      concert-GEN ticket-GEN sale-ACC to start-PAST
      We started to sell tickets for concerts.</Paragraph>
      <Paragraph position="5"> t. concert-no ticket-o hanbai-shi-dashi-ta.</Paragraph>
      <Paragraph position="6"> concert-GEN ticket-ACC to sell-INCHOATIVE-PAST
      We started selling tickets for concerts.</Paragraph>
      <Paragraph position="7"> So far, 1,114 candidates have been judged (footnote 7), with agreement on 1,067 candidates, and 591 paraphrase examples have been collected.</Paragraph>
    </Section>
    <Section position="2" start_page="28" end_page="29" type="sub_section">
      <SectionTitle>
5.2 Transitivity alternation (TransAlt)
</SectionTitle>
      <Paragraph position="0"> This class of paraphrases requires a collection of pairs of intransitive and transitive verbs, such as "yureru (to sway)" and "yurasu (to sway)" in example (2). Since there was no available resource of such knowledge, we newly created a minimal set of intransitive-transitive pairs that were required to cover all the verbs appearing in the source sentence set (25,000 sentences). We first retrieved all the verbs from the source sentences using a set of extraction patterns implemented in the same manner as paraphrasing patterns. Example (9) is one of the patterns used, where Nx matches a noun and V matches a verb.</Paragraph>
      <Paragraph position="1"> Footnote 7: 983 candidates for the first 4,500 sentences were fully judged, and 131 candidates were randomly sampled from the remaining portion.</Paragraph>
      <Paragraph position="3"> t. no change.</Paragraph>
      <Paragraph position="4"> We then manually examined the transitivity of each of the 800 verbs that matched V, and collected 212 pairs of an intransitive verb vi and its transitive form vt. Using them as constraints, we implemented eight paraphrasing patterns as in (10).</Paragraph>
      <Paragraph position="6"> where Vi and Vt(Vi) are variables that match vi and vt, respectively. By applying the patterns to the same set of source sentences, we obtained 985 candidate paraphrases.</Paragraph>
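      <Paragraph> The role of the verb pairs as a lexical constraint can be illustrated with a small lookup table. This is a sketch under assumptions: the table contains only the single pair mentioned in the text, the function name `alternate` is hypothetical, and looking pairs up in both directions is an illustrative choice rather than a documented property of the eight patterns.

```python
from typing import Optional

# Toy pair table: intransitive verb -> transitive counterpart (one pair from the text).
INTRANS_TO_TRANS = {"yureru": "yurasu"}
# Reverse table, derived so that both directions stay consistent.
TRANS_TO_INTRANS = {vt: vi for vi, vt in INTRANS_TO_TRANS.items()}


def alternate(verb: str) -> Optional[str]:
    """Return the transitivity counterpart of a verb, or None if it is not in the table."""
    if verb in INTRANS_TO_TRANS:
        return INTRANS_TO_TRANS[verb]
    return TRANS_TO_INTRANS.get(verb)


if __name__ == "__main__":
    for verb in ["yureru", "yurasu", "ukeru"]:
        print(verb, alternate(verb))
```

A verb that is absent from the table yields no candidate, which is how a pair list restricts a pattern that would otherwise match any verb.</Paragraph>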
      <Paragraph position="7"> We created a set of criteria for judging appropriateness (an example will be given in Section 6.4) and revision examples for the following operations allowed for this trial: (i) change of conjugations; (ii) change of case markers; and (iii) change of voices. Agreement has been reached on 964 candidates, and 484 paraphrase examples have been collected.</Paragraph>
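      <Paragraph> The counts reported for the two trials can be turned into simple ratios. The snippet below only restates the reported numbers; the helper `ratio` and the choice of ratios are assumptions made for illustration, and no figure is computed for which a denominator is not stated in the text.

```python
def ratio(part: int, whole: int) -> float:
    """Return part/whole rounded to three decimals."""
    return round(part / whole, 3)


# Counts as reported for the LVC and TransAlt trials.
lvc = {"sentences": 10000, "generated": 2566, "judged": 1114, "agreed": 1067, "collected": 591}
transalt = {"generated": 985, "agreed": 964, "collected": 484}

if __name__ == "__main__":
    print("LVC candidates per source sentence:", ratio(lvc["generated"], lvc["sentences"]))
    print("LVC agreement among judged candidates:", ratio(lvc["agreed"], lvc["judged"]))
    print("LVC collected per agreed candidate:", ratio(lvc["collected"], lvc["agreed"]))
    print("TransAlt collected per agreed candidate:", ratio(transalt["collected"], transalt["agreed"]))
```

Read this way, roughly half of the candidates on which the annotators agreed ended up as collected paraphrase examples in both classes.</Paragraph>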
    </Section>
  </Section>
</Paper>