File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/88/c88-2157_intro.xml

Size: 8,726 bytes

Last Modified: 2025-10-06 14:04:43

<?xml version="1.0" standalone="yes"?>
<Paper uid="C88-2157">
  <Title>Collocational Analysis in Japanese Text Input</Title>
  <Section position="2" start_page="0" end_page="770" type="intro">
    <SectionTitle>
2. Concept of Collocational Analysis in Translation
</SectionTitle>
    <Paragraph position="0"> Collocational analysis evaluates the correctness of a translated sentence by measuring the WCP within the sentence.</Paragraph>
    <Paragraph position="1"> The WCP data is accumulated in a 2-dimensional matrix, by information units indicating more restricted concepts than the words can indicate by themselves.</Paragraph>
    <Paragraph position="2"> As previously mentioned, there are two kinds of ambiguities in Kana-to-Kanji translation. Fig. 1 illustrates the disambiguation process for homonyms. '国歌 (a national anthem) and 演奏する (to play)' and '国家 (a state) and 建設する (to build)' etc. are examples of WCP. If the simple Kana sequence 'こっかをえんそうする [kokkaoensousuru]' is input, the usual translation system will develop two possible candidate words, '国歌 (a national anthem)' and '国家 (a state)', for the partial Kana sequence 'こっか [kokka]'. The system will also develop uniquely the candidate word '演奏する (to play)' for 'えんそうする [ensousuru]'. These candidate words are obtained by table searching and morphological analysis. However, morphological analysis alone cannot identify which one is correct for 'こっか [kokka]'.</Paragraph>
    <Paragraph position="3"> Using collocational analysis, the WCP of '国家 (a state)' and '演奏する (to play)' is found to be nil, while that of '国歌 (a national anthem)' and '演奏する (to play)' is found to be probable. Using WCP, '国歌を演奏する (to play a national anthem)' is selected as the final candidate sentence. If the Kana sequence 'こっかをけんせつする [kokkaokensetsusuru]' is input, '国家を建設する (to build a state)' is obtained in the same manner.</Paragraph>
    <Paragraph position="5"> Fig. 1 Concept of collocational analysis. The figure contrasts two candidates for the Kana sentence: '国歌を演奏する (to play a national anthem)', marked ○, and '国家を演奏する (to play a state)', marked ×.</Paragraph>
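    <Paragraph> The homonym selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table WCP and the function select_candidates are hypothetical names, and the table holds just the two word pairs quoted in the text rather than real dictionary data.

```python
# Hypothetical WCP table: (dependent word, governor word) pairs that are
# treated as co-occurring. Only the two examples from the text are listed.
WCP = {
    ("国歌", "演奏する"),  # a national anthem + to play
    ("国家", "建設する"),  # a state + to build
}


def select_candidates(noun_candidates, verb, wcp=WCP):
    """Return the noun candidates whose pair with the verb is in the WCP table."""
    return [noun for noun in noun_candidates if (noun, verb) in wcp]


homonyms = ["国歌", "国家"]  # both candidates are read 'kokka'
print(select_candidates(homonyms, "演奏する"))
print(select_candidates(homonyms, "建設する"))
```

    A set of pairs is the simplest possible stand-in for the WCP data; the paper itself stores the data in a matrix over narrower information units, not over individual words.</Paragraph>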
    <Paragraph position="7"> The new compilation method extracts from a sentence two-word combinations which have a dependency relationship.</Paragraph>
    <Paragraph position="8"> This is illustrated with the sample sentence '私は空を飛ぶ鳥を撃った (I shot a bird flying in the sky)'.</Paragraph>
    <Paragraph position="9"> At first, this method analyzes a sentence morphologically. In this example, the sentence is segmented into five Bunsetsu (Japanese grammatical units, like phrases) and the parts of speech of each word are obtained. '私 (I)', '鳥 (a bird)' and '空 (sky)' are nouns. '飛ぶ (to fly)' and '撃った (to shoot)' are verbs. 'は (ha)' in the first Bunsetsu, and 'を (o)' in the second one and in the fourth one, are postpositional words. They determine the dependent attributes of nouns in a dependency relationship.</Paragraph>
    <Paragraph position="10"> ex.</Paragraph>
    <Paragraph position="12"> The dependency relationship between words is analyzed using Japanese syntactic rules. In the extraction step, DLA is used. This process first finds unique dependency relationships.</Paragraph>
    <Paragraph position="13"> "Unique relationship" means that a dependent has only one possible governor within the sentence.</Paragraph>
    <Paragraph position="14"> In this example, the relationships between '鳥 (a bird) and 撃った (to shoot)' and '飛ぶ (to fly) and 鳥 (a bird)' are identified as unique. Next, "ambiguous relationships" are processed.</Paragraph>
    <Paragraph position="15"> An ambiguous relationship means that a dependent has several possible governors. In this case, the governor which can be identified as most likely by heuristic rules is located. This rule will only accept relationships where dependent and governor are adjacent.</Paragraph>
    <Paragraph position="16"> This is because such a relationship has the highest possibility.</Paragraph>
    <Paragraph position="17"> In this example, '空 (sky)' has two possible candidate governors, '飛ぶ (to fly)' and '撃った (to shoot)'. In this case, because '空 (sky)' and '飛ぶ (to fly)' are adjacent, '空 (sky)' is identified as the dependent and '飛ぶ (to fly)' as the governor.</Paragraph>
    <Paragraph position="18"> Next, '私 (I)' also has two possible candidate governors, '飛ぶ (to fly)' and '撃った (to shoot)'. In this case, because these two governors are not adjacent to the dependent, the dependency relationship between '私 (I)' and the two candidate governors is not identified for extraction.</Paragraph>
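    <Paragraph> The two-stage extraction (unique relationships first, then ambiguous ones accepted only when adjacent) can be sketched as follows. The data layout is an assumption made for illustration: BUNSETSU, CANDIDATES and extract_pairs are hypothetical names, the candidate governors are copied by hand from the example rather than produced by syntactic rules, and "adjacent" is taken to mean that the governor is the very next Bunsetsu.

```python
# Bunsetsu of the sample sentence, in order.
BUNSETSU = ["私は", "空を", "飛ぶ", "鳥を", "撃った"]  # I / sky / fly / bird / shot

# Hypothetical output of the syntactic rules: for each dependent position,
# the positions of its possible governors (values follow the text's example).
CANDIDATES = {
    0: [2, 4],  # 私は: 飛ぶ or 撃った (ambiguous)
    1: [2, 4],  # 空を: 飛ぶ or 撃った (ambiguous)
    2: [3],     # 飛ぶ: 鳥を (unique)
    3: [4],     # 鳥を: 撃った (unique)
}


def extract_pairs(candidates):
    """Keep unique relationships, and ambiguous ones only when adjacent."""
    pairs = []
    for dep, governors in sorted(candidates.items()):
        if len(governors) == 1:
            pairs.append((dep, governors[0]))  # unique relationship
        elif dep + 1 in governors:
            pairs.append((dep, dep + 1))       # adjacent governor is accepted
        # otherwise: ambiguous and not adjacent, so nothing is extracted
    return pairs


for dep, gov in extract_pairs(CANDIDATES):
    print(BUNSETSU[dep], BUNSETSU[gov])
```

    With this input the pair for '私は' is dropped, which matches the behaviour described in the text: the method prefers to lose a pair rather than record a doubtful one.</Paragraph>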
    <Paragraph position="19"> Furthermore, some specific part-of-speech sequences which have many ambiguous dependency relationships are rejected for extraction. The following is an example of a confusing part-of-speech sequence. In spite of similar syntactic style, '赤い (red)' in '赤い車の窓 (a red car's window)' modifies the adjacent word '車 (a car)', while '赤い (red)' in '赤い秋の花 (a red flower in fall)' modifies the word at the end of the sentence, '花 (a flower)'. Thus, in the case of this sequence, even if a dependent and a governor are adjacent, the relationship between the modifying adjective and the modified noun is not identified.</Paragraph>
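    <Paragraph> A rejection rule of this kind could be expressed as a simple pattern check over part-of-speech tags. The sketch below is an assumption, not the authors' implementation: the tag names ("adj", "noun", "no", and so on) and the function has_confusing_sequence are invented for illustration.

```python
# Hypothetical part-of-speech tags; the tag names are illustrative only.
# Pattern: modifying adj. + noun + postposition 'no' + noun.
CONFUSING = ("adj", "noun", "no", "noun")


def has_confusing_sequence(tags, pattern=CONFUSING):
    """Return True if the tag list contains the pattern as a contiguous run."""
    n = len(pattern)
    return any(
        tuple(tags[i:i + n]) == pattern
        for i in range(len(tags) - n + 1)
    )


red_car_window = ["adj", "noun", "no", "noun"]          # 赤い 車 の 窓
sample_sentence = ["noun", "ha", "noun", "o", "verb",
                   "noun", "o", "verb"]                 # 私 は 空 を 飛ぶ 鳥 を 撃った

print(has_confusing_sequence(red_car_window))
print(has_confusing_sequence(sample_sentence))
```

    When the check fires, the adjective and noun pair inside the matched run would simply be skipped during extraction.</Paragraph>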
    <Paragraph position="20"> Example sequence: modifying adj. etc. + noun + 'の' (postposition) + noun, as in '赤い車の窓 (a red car's window)' and '赤い秋の花 (a red flower in fall)'.</Paragraph>
    <Paragraph position="21"> To provide a large volume of syntactically correct sentences, example sentences written in dictionaries [Ohno82] [Masuda83] were employed. This is because these example sentences are a rich source of data indicating the typical usage of each common word in short sentences.</Paragraph>
    <Paragraph position="22"> They are also assumed to represent common usages within an extremely large amount of source data.</Paragraph>
    <Paragraph position="23"> Five hundred example sentences were used to examine the accuracy of this automatic extraction method. 82% of the sentences could be analyzed morphologically. As a result, 7\]~; sets of dependency relationships were extracted from these morphologically analyzed sentences with an accuracy of 98.7%. The causes of erroneous extraction are mainly misidentification of part-of-speech and of compound words. Misidentification of the dependency relationship was much less frequent.</Paragraph>
    <Paragraph position="24"> Using this method, 305,000 sets of WCP were collected from 300,000 example sentences. Of these WCP, about 45% are relationships between a noun and a verb or adjective with a postpositional word, 21% are relationships between a noun and a noun with 'の (postpositional word)', and 26% are noun pairs constructing compound words.</Paragraph>
    <Section position="1" start_page="770" end_page="770" type="sub_section">
      <SectionTitle>
3.3 Co-occurrence Pattern Matrix
</SectionTitle>
      <Paragraph position="0"> With the aim of constructing a reliable WCP dictionary, the use of individual words is impractical, because the dictionary becomes too large. Semantic categories are useful because, if words A and B are synonyms, they will have similar co-occurrence patterns with other words. This allows the WCP dictionary, described in semantic categories, to be greatly reduced in size. Scores of semantic categories were developed; however, it was found that the number of these categories was too small to accurately describe word relationships. Fortunately, there is a Japanese thesaurus [Ohno82] with 1,000 semantic categories. Based on the 305,000 sets of WCP data, a 2-dimensional matrix was developed which indicates co-occurrence patterns in semantic categories [Ohno82].</Paragraph>
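      <Paragraph> The move from word pairs to a category matrix can be illustrated with a small sketch. Everything here is assumed for the sake of the example: the category labels, the extra word '曲 (a tune)', and the functions build_matrix and co_occurs are hypothetical and do not come from the thesaurus or the paper.

```python
from collections import Counter

# Hypothetical word-to-category mapping; the thesaurus in the text has
# about 1,000 categories, and these labels are invented for illustration.
CATEGORY = {
    "国歌": "music",
    "曲": "music",                 # a tune (not from the paper)
    "国家": "organization",
    "演奏する": "performance",
    "建設する": "construction",
}


def build_matrix(wcp_pairs, category=CATEGORY):
    """Count pairs per (dependent category, governor category) cell."""
    matrix = Counter()
    for dependent, governor in wcp_pairs:
        matrix[(category[dependent], category[governor])] += 1
    return matrix


def co_occurs(matrix, dependent, governor, category=CATEGORY):
    """Treat a word pair as plausible if its category cell is non-empty."""
    return matrix[(category[dependent], category[governor])] != 0


pairs = [("国歌", "演奏する"), ("国家", "建設する")]
matrix = build_matrix(pairs)
print(co_occurs(matrix, "曲", "演奏する"))
print(co_occurs(matrix, "国家", "演奏する"))
```

      The first lookup succeeds although the pair itself was never observed, because '曲' shares a category with '国歌'. That is the size reduction the text describes: one cell per category pair stands in for many word pairs.</Paragraph>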
      <Paragraph position="1"> Fig. 2 shows an image of this matrix. In this matrix, word pairs which have the same semantic categories have a high co-occurrence possibility.</Paragraph>
      <Paragraph position="2"> The words included in the categories indicating 'action' and 'movement' etc. are the governor in a co-occurrence relationship with various words as their dependent</Paragraph>
    </Section>
  </Section>
</Paper>