<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1089">
  <Title>Learning Bilingual Collocations by Word-Level Sorting</Title>
  <Section position="4" start_page="525" end_page="525" type="metho">
    <SectionTitle>
2 Two Types of Japanese-English
Collocations
</SectionTitle>
    <Paragraph position="0"> In this section, we briefly classify the types of Japanese-English collocations by using the material in Table 1 as an example. These texts were derived from a stock market bulletin written in Japanese and its abstract written in English, which were distributed electrically via a computer network.</Paragraph>
    <Paragraph position="1"> In Table 1, (~g-~,~'l-~/Tokyo Forex), (H~I~!IYJ ~\[~n~\]{~ /auto talks between Japan and the U.S.) and (~k,.'~/ahead of) are Japanese-English collocations whose elements constitute uninterrupted word sequences. We call hereafter this type of collocation fixed eolloeatlon. Although fixed collocation seems trivial, more than half of all useful collocations belong to this class. Thus, it is important to extract fixed collocations with high precision. In contrast, ( b')t-t~'~ ~ ~1~ C/ki~?,_ ~ / The U.S. currency was quoted at -~ ) and ( b&amp;quot; )t.~'~ ~ ~l~ ~k_2~ /The dollar stood ..~)1 are constructed from interrupted word sequences.</Paragraph>
    <Paragraph position="2"> We will call this type of collocation flexible collocation. From the viewpoint of machine learning, flexible collocations are much more difficult to learn because they involve the combination of elements. The points when extracting flexible collocations is how the number of combination (candidates) can be reduced.</Paragraph>
    <Paragraph position="3"> Our learning method is twofold according to the collocation types. First, useful uninterrupted 1 ~. represents any sequence of words.</Paragraph>
    <Paragraph position="4"> word chunks are extracted by the word-level sorting method. To find out fixed collocations, we evaluate stochastic similarity of the chunks. Next, we iteratively combin the chunks to extract flexible collocations.</Paragraph>
  </Section>
  <Section position="5" start_page="525" end_page="527" type="metho">
    <SectionTitle>
3 Extracting Useful Chunks by Word-Level Sorting
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="525" end_page="526" type="sub_section">
      <SectionTitle>
3.1 Previous Research
</SectionTitle>
      <Paragraph position="0"> With the availability of large corpora and memory devices, there is once again growing interest in extracting n-grams with large values of n. (Nagao and Mori, 1994) introduced an efficient method for calculating an arbitrary number of n-grams from large corpora. When the length of a text is I bytes, it occupies l consecutive bytes in memory as depicted in Figure 1. First, another table of size l is prepared, each field of which represents a pointer to a substring. A substring pointed to by the (i - 1)th entry of the table constitutes a string existing from the ith character to the end of the text string. Next, to extract common substrings, the pointer table is sorted in alphabetic order. Two adjacent words in the pointer table are compared and the lengths of coincident prefix parts are counted(Gonnet et al., 1992).</Paragraph>
      <Paragraph position="1"> For example, when 'auto talks between Japan and the U.S.' and 'auto talks between Japan and China' are two adjacent words, the nmnber of coincidences is 29 as in 'auto talks between Japan and '. The n-gram frequency table is constructed by counting the number of pointers which represent the same prefix parts. Although the method is efficient for large corpora, it involves large volume of fractional and unnecessary expressions. The reason for this is that the method does not consider the inter-relationships between the extracted strings. That is, the method generates redundant substrings which are subsumed by longer strings.</Paragraph>
      <Paragraph position="2"> text ntr|hg (I oharaoter~: I bytes)  To settle this problem, (Ikehara et al., 1996) proposed a method to extract only useful strings.</Paragraph>
      <Paragraph position="3"> Basically, his methods is based on the longest-match principle. When the method extracts a longest n-gram as a chunk, strings subsumed by the chunk are derived only if the shorter string of_ tell appears independently to the longest chunk.</Paragraph>
      <Paragraph position="4"> If 'auto talks between Japan and the U.5'.' is extracted as a chunk, 'Japan and the U.S.'is also  Tokyo Forex 5 PM: Dollar at 84.21-84.24 yen The dollar stood 0.26 yen lower at 84.21-84.24 at 5 p.m. Forex market trading was extremely quiet ahead of fnrther auto talks between Japan and the U.S., slated for early dawn Tuesday.</Paragraph>
      <Paragraph position="5"> The U.S. currency was quoted at 1.361-1.3863 German marks at 5:15 p.m.  extracted because 'Japan and the U.S.' is used so often independently as in 'Japan and the U.S.</Paragraph>
      <Paragraph position="6"> agreed ...'. However, 'Japan and the' is not extracted because it always appears in the context of 'Japan and the U.S.'. The method strongly suppresses fractional and unnecessary expressions.</Paragraph>
      <Paragraph position="7"> More than 75 % of the strings extracted by Nagao's method are removed with the new method.</Paragraph>
    </Section>
    <Section position="2" start_page="526" end_page="527" type="sub_section">
      <SectionTitle>
3.2 Word-Level Sorting Method
</SectionTitle>
      <Paragraph position="0"> The research described in the previous section deals with character-based n-grams, which generate excessive numbers of expressions and requires large memory for the pointer table. Thus, from a practical point of view, word-based n-grams are preferable in order to further suppress fractional expressions and pointer table use. In this paper, we extend Ikehara's method to handle word-based n-grams. First, both Japanese and English texts are part-of-speech (POS) tagged 2 and stored in memory as in Figure 2. POS tagging is required for two main reasons: (1) There are no explicit word delimiters in Japanese and (2) By using POS information, useless expressions can be removed.</Paragraph>
      <Paragraph position="1"> In Figure 2, '@' and '\0' represent the explicit word delimiter and the explicit sentence delimiter, respectively. Compared to previous research, this data structure has the following advantages.</Paragraph>
      <Paragraph position="2"> 2We use in this phase the JUMAN morphological analyzing system (Kurohashi et al., 11994) for tagging Japanese texts and Brill's transformation-based tagget (Brill, 1994) for tagging English texts. We would like to thank all people concerned for providing us with the tools.</Paragraph>
      <Paragraph position="3">  1. Only heads of each word are recorded in the pointer table. As depicted in Figure 2, this remarkably reduces memory use because the pointer table also contains other string characteristics as Figure 3.</Paragraph>
      <Paragraph position="4"> 2. As depicted in Figure 2, only expressions within a sentence are considered by introducing the explicit sentence delimiter '\0'. 3. Only word-level coincidences are extracted by introducing the explicit word delimiter '@'. This removes strings arising from a  partial match of different words. For example, the coincident string between 'Japan and China' and 'Japan and Costa Rica' is 'Japan and'in our method, while it is 'Japan and C' in previous methods.</Paragraph>
      <Paragraph position="5"> colnol ~degont adopt dance  J~p. n~-v andc~a C/ &amp;quot;h m,,ov J,..,.,,&lt;.o...tc.~,c'o.,. ~1o.</Paragraph>
      <Paragraph position="6"> a ap. n~-q an dC/,~ t |~,*~ 1 Js Ju pa t t(~ an,tC,~ t I~ U S J it pit t~ ttnC/U~, t I~&lt;~ ~ I S Ja p a i ~C/U ~n (ICa) II~C~O_ * /  Next, the pointer table is sorted in alphabetic order as shown in Figure 3. In this table, sentno, and coincidence represent which senfence the string appeared in and how many characters are shared by the two adjacent strings, respectively. That is, eoineidenee delineates candidates for usefifl expressions. Note here that the coincidence between Japan@and@China... and Japan@and@Costa Rica... is l0 as mentioned above.</Paragraph>
      <Paragraph position="7"> Next, in order to remove useless subsumed strings, the pointer table is sorted according to sentno.. In this stage, adopt is filled with '1' or '0' , each of which represents if or not if a string is subsumed by longer word chnnks, respectively. Sorting by sentno, makes it much easier to check the subsumption of word chunks. When  both 'Japan and the U.S.' and 'Japan and the' arise from a sentence, the latter is removed because the former subsumes the latter.</Paragraph>
      <Paragraph position="8"> Finally, to determine which word-chunks to extract, the pointer table is sorted once again in alphabetic order. In this stage, we count how many times a string whose adopt is 1 appears in the corpus. By thresholding the frequency, only usetiff word chunks are extracted.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="527" end_page="527" type="metho">
    <SectionTitle>
4 Extracting Bilingual Collocations
</SectionTitle>
    <Paragraph position="0"> In this section, we will explain how Japanese-English collocations are constructed from word chnnks extracted in the previous stage. First, fixed collocations are induced in the following way.</Paragraph>
    <Paragraph position="1"> We use the contingency matrix to evaluate the similarity of word-chunk occurrences in both languages. Consider the contingency matrix, shown Table 2, for Japanese word chunk cjp,~ and English word chunk c~,g. The contingency matrix shows: (a) the number of Japanese-English corresponding sentence pairs in which both Cjp n and ce,~g were found, (b) the number of Japanese-English corresponding sentence pairs in which just c~, v was found, (c) the number of Japanese-English corresponding sentence pairs in which just ejp,~ was fonnd, (d) the mnnber of Japanese-English col responding sentence pairs in which neither chunk was found.</Paragraph>
    <Paragraph position="2">  If ejpn and Cen.q are good translations of one another, a should be large, and b and c should bc small. In contrast, if the two are not good translations of each other, a should be small, mid baud c should be large. To make this argument more precise, we introduce mutual information ~s follows. Thresholding the mutual information extracts fixed collocations. Note that mutual information is reliable in this case because the frequency of each word chunk is thresholded at the word chunk extraction stage.</Paragraph>
    <Paragraph position="4"> Next, we sumnmrize how flexible collocations are extracted. The following is a series of procedures to extract flexible collocations.</Paragraph>
    <Paragraph position="5">  1. For any pair of chunks in a Japanese sentence, compute mutual information. Con&gt; bine the two chunks of highest mutual information. Iteratively repeat this procedure and construct a tree level by level.</Paragraph>
    <Paragraph position="6"> 2. For any pair of chunks in an English sentence, repeat the operations done in the the Japanese sentence.</Paragraph>
    <Paragraph position="7"> 3. Perform node matching between trees of both langnages by using mutual information of Japanese and English word chunks.</Paragraph>
    <Paragraph position="8"> tin ,~l~ore R  The first two steps construct monolingual similarity trees of word chnnks in sentences. The third step iteratively evalnates the bilingual similarity of word chunk combinations by using the above trees. Consider the example below, in which the underlined word chunks construct a flexible collocation (~ Yif/~deg~.~t~,f~t~_~,:x ~g ~, I-iti~'~3: ~C/_k~-L/~:/~ rose ~ on the oil products spot market in Singapore). First, two similarity trees are constructed as shown in Figure 4. Graph matching is then iteratively attempted by compnting mutual inforlnation fbr groups of word chunks. In the present implementation, the system combines three word chunks at most. The technique we use is similar to the parsing-b~sed methods for extracting bilingual collocation(Matsumoto et al., 1993). Our method replaces the parse trees with the similarity trees and thus avoids the combinatorial explosion inherent to the parsing-ba~sed methods.</Paragraph>
    <Paragraph position="9"> lia:ample: , ,, Naphtha and gas oil rose on the oil products spot market in Singapore</Paragraph>
  </Section>
  <Section position="7" start_page="527" end_page="529" type="metho">
    <SectionTitle>
5 Preliminary Evaluation and Discussion
</SectionTitle>
    <Paragraph position="0"> We performed a preliminary ewduation of tile proposed method by using 10-days Japanese stock market bulletins and their Fnglish abstracts, each containing 2000 sentences. The text was first au-tomatically aligned and then hand-checked by a hum~m supervisor. A sample passage is displayed in TM~Ie 1.</Paragraph>
    <Paragraph position="1"> In this experiment, we considered only the word chunks thai; appeared more than 4 times for fixed collocations and more than 6 times for flexible collocations. Table 4 illustrates the fixed collocations acquired by our method. Almost all collocat.ions in Table 4 involw~ domain specilic jargon, which  Ta, ble 4: Siunples of Fixcd Collocation,&lt;~ cannot, be const.rueted composit, ionally. For examphi, No 9 nieans 'Tokyo (~ohl FuLure, m~rkel; ended trading R)r the (lay', but was never written as such. As well as No. 9 , a nuuflml: ofseut;ence-level collocations were also extracl, ed. No. 9, No. 18, No. 23, No. 2&lt; No. 35, No. 56 and No. 67 a.re t,ypica,l heads of Llle stock markel; report. These exi)rcssioiis a.pllear eweryda.y in st.ock markel, reports. null IlL is inl, eresl, iil E I4) not, ic(~ lhe variel,y o\[ fixed colh)ca.tions. They dill'~'r in their consl.rucl.ions; noun phrases, verll phrases, I)rel)osit.iolml phrase&lt;; and sentrnce--level. All, hough coltventionaJ nleLllotis focus on houri llhrases or |,ry t;o en(:onll/ass all kinds of (-olloca.tions at the sanie time, we beliew&amp;quot; l, ha, t, fixed colloca, tion is au ilnporl,anl, class o\[' colh)cation. It is useful to iltl,ensively sl,udy fixed collocations because 1,he (:ollocatioll of lilore com-plex structures is (lillic.lt to h'i,', regardle'~,~ of the mf~l,hod used.</Paragraph>
    <Paragraph position="2"> 'I'MAe 3 exemplifies the flexible colloca.tions we acquired fronl the saint cOrllUS. No. 1 to No. 4 are typical exprossions in stock nlarkc'l, reports. These collocation are eXl;l'enlc.ly useful for l,ellll)lal, e--based nlachine /.ra.nsla.tiol~ sysl.enls. No. 5 is a.n examph~ o1' a useless ('ol\[ocalriOIt. BOt\]l Olnron a, nd ~unii|,omo Forcst;ry arc cotupap, y names 1,lid, l; co-ocem- I'requenl, ly i. sl,ock uia,l'kel, i'el)ort;s , bul, t.he.qc two conlpanics ha,ve uo direct relal;iou. In fact, nlore I.han half of a.II lh!xibh~ collocations acquired were like No. 5. To remove useh&gt;ss coJJ()(';tlions, co,stra.inl.s &lt;)n l;ll&lt;&amp;quot; &lt;'haracl.er tyl&gt;eS would I)e useful. Most useful ,lapa/ICSe /lcxiblt' (:ollocai.iOllS coul;;lin al, least one ilira.gamt 3 ch~u-acter. Thus, 3 ,I a i)~nese has (,}n'c(~ t,y pe,~ of ch ara~ctcrs ( II ira.ga.na, I(atak;~na., and t&lt;anjO, each of which has dilt't!rcnt a.n.)uttts of i.lbrntalio.. In ( OllLl,t,qt, Enl-lish ha.s ouly  many useless collocations can be removed by imposing this constraint on extracted strings.</Paragraph>
    <Paragraph position="3"> It is also interesting to compare our results with a Japanese-English dictionary for economics (Iwatsu, 1990). About half of Table 4 and all of Table 3 are not listed in the dictionary. In particular, no verb-phrase or sentence-level collocations are not covered. These collocations are more useful for translators than noun phrase collocations, but greatly differ from domain to domain. Thus, it is difficult in general to hand-compile a dictionary that contains these kinds of collocations. Because our method automatically extracts these collocations, it will be of significant use in compiling domain specific dictionaries.</Paragraph>
    <Paragraph position="4"> Finally, we briefly describe the coverage of the proposed method. For the corpus examined, 70 % of the fixed collocations and 35 % of the flexible collocations output by the method were correct.</Paragraph>
    <Paragraph position="5"> This level of performance was achieved in the face of two problems.</Paragraph>
    <Paragraph position="6"> * The English text was not a literal translation. Parts of Japanese sentence were often omitted and sometimes appeared in a different English sentence.</Paragraph>
    <Paragraph position="7"> * The data set was too small.</Paragraph>
    <Paragraph position="8"> We are now constructing a larger volume of corpus to address the second problem.</Paragraph>
  </Section>
class="xml-element"></Paper>