<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1114">
  <Title>Large Scale Collocation Data and Their Application</Title>
  <Section position="3" start_page="0" end_page="695" type="metho">
    <SectionTitle>
2. Collocation Data
</SectionTitle>
    <Paragraph position="0"> Unlike the recent works on the automatic extraction of collocations from corpus \[Church, K. W, et al, 1990, Ikehara, S. et al, 1996, etc.\], our data have been collected manually through the intensive investigation of various texts, spending years on it.</Paragraph>
    <Paragraph position="1"> This is because no stochastic framework assures the  accuracy of the extraction, namely the necessity and sufficiency of the data set. The collocations which are used in our Kana-to-Kanji conversion system consist of two kinds: (1) idiomatic expressions, whose meanings seem to be difficult to compose from the typical meaning of the individual component words \[Shudo, K. et al., 1988\]. (2) stereotypical expressions in which the concurrence of component words is seen in the texts with high frequency. The collocations are also classified into two classes by a grammatical criterion: one is a class of functional collocations, which work as functional words such as particles (postpositionals) or auxiliary verbs, the other is a class of conceptual collocations which work as nouns, verbs, adjectives, adverbs, etc. The latter is further classified into two kinds: uninterruptible collocations, whose concurrence relationship of words are so strong that they can be dealt with as single words, and interruptible collocations, which are occasionally used separately.</Paragraph>
    <Paragraph position="2"> In the following, the parenthesized number is the number of expressions adopted in the system.</Paragraph>
    <Section position="1" start_page="694" end_page="694" type="sub_section">
      <SectionTitle>
2.1 Functional Collocations (2,174)
</SectionTitle>
      <Paragraph position="0"> We call expressions which work like a particle relational collocation and expressions which work like an auxiliary verb at the end of the predicate auxiliary predicative collocation \[Shudo, K. et al., 1980\].</Paragraph>
      <Paragraph position="1"> relational collocations (760) ex. \[ 7./') t, x-C ni/tuae (about) auxiliary predicative collocations (1,414) naKereoa/naranai (must)</Paragraph>
    </Section>
    <Section position="2" start_page="694" end_page="694" type="sub_section">
      <SectionTitle>
2.2 Uninterruptible Conceptual Collocations (54,290)
</SectionTitle>
      <Paragraph position="0"> four-Kanji-compound (2,231) ex. gaden'insui (我田引水, every miller draws water to his own mill)
adverb + particle type (3,089) ex. atafutato (あたふたと, disconcertedly)
adverb + suru type (1,043) ex. akusekusuru (あくせくする, toil and moil)
noun type (21,128) ex. akano/tanin (赤の他人, perfect stranger)
verb type (13,225) ex. otsuriga/kuru (お釣りが来る, be enough to make the change)
adjective type (2,394) ex. uraganashii (うら悲しい, mournful)
adjective verb type (397) ex. gokigen/naname (ご機嫌斜め, in a bad mood)
adverb and other type (8,185) ex. meni/miete (目に見えて, remarkably)
proverb type (2,598) ex. oiteha/koni/shitagae (老いては子に従え, when old, obey your children)</Paragraph>
    </Section>
    <Section position="3" start_page="694" end_page="695" type="sub_section">
      <SectionTitle>
2.3 Interruptible Conceptual Collocations (78,251)
</SectionTitle>
      <Paragraph position="0"> noun type (7,627) ex. akugyouno/mukui (悪行の報い, fruit of an evil deed)
verb type (64,087) ex. ushirogamiwo/hikareru (後ろ髪を引かれる, feel as if one's heart were left behind)
adjective type (3,617) ex. taidoga/ookii (態度が大きい, act in a lordly manner)
adjective verb type (2,018) ex. yakushaga/ue (役者が上, be more able)
others (902) ex. atoni/hikenu (後に引けぬ, cannot give up)

3. Kana-to-Kanji Conversion Systems

We developed four different Kana-to-Kanji conversion systems, phasing in the collocation data described in Section 2. The technological framework of the systems is based on the extended bunsetsu (e-bunsetsu) model [Shudo, K. et al., 1980] for the unit of segmentation of the input Kana string, and on the minimum cost method [Yoshimura, K. et al., 1987] combined with Viterbi's algorithm [Viterbi, A. J., 1967] for the reduction of segmentation ambiguity.</Paragraph>
      <Paragraph position="1"> A bnn.~etsu is the basic postpositional or predicative  phrase which composes Japanese sentences, and an e-bunsetsu, which is a natural extension of the bunsetsu, is defined roughly as follows: &lt;e-bunsetsu&gt;::= &lt;prefix&gt;* &lt;conceptual word l uninterruptible conceptual collocation&gt; &lt;suffix&gt;* &lt;functional word l functional collocation&gt;* The e-bunsetsu which includes no collocation is the bunsetsu. More refmed rules are used in the actual segmentation process. The interruptible conceptual collocation is not treated as a single unit but as a string ofbunsetsus in the segmentation process. Each collocation in the dictionary which is composed of multiple number of bunsetsus is marked with the boundary between bunsetsus. The system first tries to segment the input Kana string into ebunsetsus. Every possible segmentation is evaluated by its cost. A segmentation which is assigned the least cost is chosen as the solution.</Paragraph>
      <Paragraph position="2"> The boundary between e-bunsetsus in examples in this paper is denoted by &amp;quot;/&amp;quot;.</Paragraph>
      <Paragraph position="3"> ex. two results of e-bunsetsu-segmentation: , hitoh.a/kigqkikunikositagotol, taarimasen (there is nothing like being watchful) hitohdv'Mga/Idkimi/ko3itcv;kotoha/arimasen In the above examples, JKT~/~I\] &lt; kiga/kiku: is uninterruptible conceptual collocation and IS-/il~ I.,</Paragraph>
      <Paragraph position="5"> a functional collocation. In the first example, these collocations are dealt with a single words. The second example shows the conventional bunsetsusegmentation. null The cost for the segmentation candidate is the sum of three partial costs: b-cost, c-cost and d-cost shown below.</Paragraph>
      <Paragraph position="6"> (1)a segment cost is assigned to each segment. Sum of segment costs of all segments is the basic cost (b-cost) of a segmentation candidate. By this, the collocation tends to have priority over the ordinary word. The standard and initial value of each segment cost is 2, and it is increased by 1 for each occurrence of the prefix, su_Wnx, etc. in the segment. null (2)a concatenation cost (c-cost) is assigned to specific e-bunsetsu boundaries to revise the b-cost. The concatenation, such as adnominal-noun, adverb-verb, noun-noun, etc. is paid a bonus , namely a negative cost, -1.</Paragraph>
      <Paragraph position="7"> (3)a dependency cost (d-cost), which has a negative value, is assigned to the strong dependency relationship between conceptual words in the candidate, representing the consistency of concurrence of conceptual words. By this, the segmentation containing the interrupted conceptual collocation tends to have priority. The value of a d-cost varies from -3 to -1, depending on the strength of the concurrence. The interruptible conceptual collocation is given the biggest bonus i.e.-3.</Paragraph>
      <Paragraph position="8"> The reduction of the homophonic ambiguity, which limits Kanji candidates, is carried out in the course of the segmentation and its evaluation by the cost.</Paragraph>
    </Section>
    <Section position="4" start_page="695" end_page="695" type="sub_section">
      <SectionTitle>
3.1 Prototype System A
</SectionTitle>
      <Paragraph position="0"> We first developed a prototype Kana-to-Kanji conversion system which we call System A, revising Kana-to-Kanji conversion software on the market, WXG Ver2.05 for PC.</Paragraph>
      <Paragraph position="1"> System A has no collocation data but conventional lexical resources, namely functional words (1,010) and conceptual words (131,66 I).</Paragraph>
    </Section>
    <Section position="5" start_page="695" end_page="695" type="sub_section">
      <SectionTitle>
3.2 Systems B, C and D
</SectionTitle>
      <Paragraph position="0"> We reinforced System A to obtain System B, C and D by phasing in the following collocational resources. System B is System A equipped additionally with functional collocations (2,174) and uninterruptible conceptual collocations except for four-Kanji-compound and proverb type collocations (49,461). System C is System B equipped additionally with four-Kanji-compound (2,231) and proverb type collocations (2,598). Further, System D is System C equipped additionally with interruptible conceptual collocations (78,251).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="695" end_page="697" type="metho">
    <SectionTitle>
4. Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="695" end_page="696" type="sub_section">
      <SectionTitle>
4.1 Text Data for Evaluation
</SectionTitle>
      <Paragraph position="0"> Prior to the experiments of Kana-to-Kanji conversion, we prepared a large volume of text data by hand which is formally a set of triples whose first component a is a Kana string (a sentence) with no space, The second component b is the correct segmentation result of a, indicating each boundary between bunsetsus with &amp;quot;/&amp;quot; or &amp;quot;.&amp;quot;. '7&amp;quot; and .... means obligatory and optional boundary, respectively. The third component c is the correct conversion result of a, which is a Kana-Kanji mixed string. ex. { a: {S-;\[9\[s-\[~7b~l,~-Ct,~To niwanibaragasaiteiru  (roses are in bloom in a garden) b: IZab)\[7-/\[~?~/~ \[,~.(,~70 niwani/baraga/saite, iru c: I~I~.I#~#J~II~I,~T..I,x,'~ } The introduction of the optional boundary assures the flexible evaluation. For example, each ofl~lA &amp;quot;C/t,~ saite/iru (be in bloom) and I~I,~'CIA~ saiteiru is accepted as a correct result. The data fde is divided into two sub-files, fl and 12, depending on the number of bunsetsus in the Kana string a. fl has 10,733 triples, whose a has less than five bunsetsus and t2 has 12,192 triples, whose a has more than four bunsetsus.</Paragraph>
    </Section>
    <Section position="2" start_page="696" end_page="696" type="sub_section">
      <SectionTitle>
4.2 Method of Evaluation
</SectionTitle>
      <Paragraph position="0"> Each a in the text data is fed to the conversion system. The system outputs two forms of the least cost result: b', Kana string segmented to bunsetsus by &amp;quot;/&amp;quot;, and c', Kana-Kanji mixed string corresponding to b and c of the correct data, respectively. Each of the following three cases is counted for the evaluation. null SS (Segmentation Success): b TM b CS (Complete Success): b TM b and C/'= C/ TS (Tolerative Success): b'= b and C/'~ C/ There are many kinds of notational fluctuation in Japanese. For example, the conjugational suffix of some kind of Japanese verb is not always necessitated, therefore,~l,,I I'{'f,~fi I'I'Y and ~.1: are all acceptable results for input ~ L)~ I~ uriage (sales). Besides, a single word has sometimes more than one Kanji notations, e.g. &amp;quot;~g hama (beach) and ;~ hama (beach) are both acceptable, and so on. c'- C/ in the case of TS means that e' coincides with C/ completely or excepting the part which is heteromorphic in the above sense. For this, each of our conversion system has a dictionary which contains approximately 35,000 fluctuated notations of conceptual words.</Paragraph>
    </Section>
    <Section position="3" start_page="696" end_page="697" type="sub_section">
      <SectionTitle>
4.3 Results of Experiments
</SectionTitle>
      <Paragraph position="0"> Results of the experiments are given in Table 1 and Table 2 for input file fl and 12, respectively.</Paragraph>
      <Paragraph position="1"> Comparing the statistics of system A with D, we can conclude that the introduction of approximately 135,000 collocation data causes 8.1% and 10.5 % raise of CS and TS rate, respectively, in case of relatively short input strings (fl). The raise of SS rate for t&amp;quot;1 is 2.7%. In case of the longer input strings (t2) whose average number of bunsetsus is approximately 12.6, the raise ofCS, TS and SS rate is 2.4 %,</Paragraph>
    </Section>
    <Section position="4" start_page="697" end_page="697" type="sub_section">
      <SectionTitle>
4.4 Comparison with Software on the Market
</SectionTitle>
      <Paragraph position="0"> We compared System D with a Kana-to-Kanji conversion soRware for PC on the market, WXG Ver2.05 under the same condition except for the anaount of installed collocation dam For this, system D was reinforced and renmned D', by equipping with WXG's 10,000 items of word dependency description. Both systems were disabled for the learning functiom WXG has approximately 60,000 collocations (3,000 unintcrmptible and 57,000 interruptible collocations), whereas Syst~nn D' has approximately 135,000 collocations. The statistical results are givm in Table 3 and Table 4 for the corpus fl and t2, respectively.</Paragraph>
      <Paragraph position="1"> The tables show that the raise of CS, TS and SS rme, which was oblained by System D' is 2.5 %, 4.5 % and 3.9 % on the average, respectively. No fialher comparison with the conanercial products has been done, since we judge the perfommnce ofWXG Ver.2.05 to be average among them.</Paragraph>
    </Section>
    <Section position="5" start_page="697" end_page="697" type="sub_section">
      <SectionTitle>
4.5 Discussions
</SectionTitle>
      <Paragraph position="0"> Table 1 '~ 4 show that the longer input the system is given, the more difficult for the system to make the correct solution and the difference between accuracy rate of WXG and system D' is less for f2 than for fl. Further investigation clarified that the error of System D is mainly caused by missing words or expressions in the machine dictionmy. Specifically, it was clmified that the dictionary does not have the sufficient number of Kata-Kzna words and people's names. In Mdition, the number of fluctualional variants installed in the dictionary menfioned in 4.2 turned out to be inst~cient. These problems should be rmaedied in future.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="697" end_page="697" type="metho">
    <SectionTitle>
5. Concluding Remarks
</SectionTitle>
    <Paragraph position="0"> In this p,%~r, the effectiveness of the large scale collocation data for the improvement of the conversion accuracy of Kana-to-Kanji conversion process used in Japmese word processors was chrified, by relatively large scale experiments.</Paragraph>
    <Paragraph position="1"> The extensive collection of the collocations has been c,m'fied out manually these ten years by the authors in order to realize not only high precision word processor but also more general Japanese language ~ in future. A lot of resources, school texttx3oks, newspapers, novels, journals, dictionaries, etc. have been investigated by workers for the collection. The candidates for the collocation have been judged one after another by them.</Paragraph>
    <Paragraph position="2"> Among collocations described in this paper, the idiomatic expressions are quite burdensome in the developmera of NLP, since thW do not follow the principle of composilionality of the memaing Generally speaking the more extensive collocational d__~___ it deals with, the less the &amp;quot;rule syst~n&amp;quot; of the rule based NLP system is burdened. This means the great importance of the enrichment of collocalional data Whereas it is inevitable that the ~oiawiness lies in the human judgment and selection of collocations, we believe that our collocation rl~ is far more refined than the automalicany extracted one from corpora which has been recently reported \[Church, K. W.</Paragraph>
    <Paragraph position="3"> etal, 1990, Ikeham, S. etal, 1996, etc.\].</Paragraph>
    <Paragraph position="4"> We believe that the approach descrlqxxi here is important for the evolution of NLP product in general as well.</Paragraph>
  </Section>
class="xml-element"></Paper>