<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2139">
  <Title>Analysis of Japanese Compound Nouns using Collocational Information</Title>
  <Section position="4" start_page="0" end_page="865" type="metho">
    <SectionTitle>
1 Here &amp;quot;/&amp;quot; denotes a boundary of words,
</SectionTitle>
    <Paragraph position="0"> Segmentation of Japanese is difficult when using only syntactic knowledge. Therefore, we cannot always expect a sequence of correctly segmented words as an input to structure analysis.</Paragraph>
    <Paragraph position="1"> The information of structures is also expected to improve segmentation accuracy.</Paragraph>
    <Paragraph position="2"> There are several studies attacking this problem. Fuzisaki et al. applied the HMM model to segmentation and probabilistic CFG to analyzing the structure of compound nouns \[3\].</Paragraph>
    <Paragraph position="3"> The accuracy of their method is 73% in identifying correct structures of kanzi character sequences with an average length of 4.2 characters. In their approach, word boundaries are identified through purely statistical information (the HMM model) without regard to linguistic knowledge such as dictionaries. Therefore, the HMM model may suggest an improper character sequence as a word.</Paragraph>
    <Paragraph position="4"> Furthermore, since nonterminal symbols of CFG are derived from a statistical analysis of word collocations, their number tends to be large and so the number of CFG rules is also large. They assumed compound nouns consist of only one-character and two-character words. It is questionable whether this method can be extended to handle cases that include words longer than two characters without lowering accuracy.</Paragraph>
    <Paragraph position="5"> In this paper, we propose a method to analyze structures of Japanese compound nouns by using word collocational information and a thesaurus.</Paragraph>
    <Paragraph position="6"> The collocational information is acquired from a corpus of four kanzi character words. The outline of the procedures to acquire the collocational information is as follows:
* extract collocations of nouns from a corpus of four kanzi character words
* replace each noun in the collocations with thesaurus categories, to obtain the collocations of thesaurus categories
* count occurrence frequencies for each collocational pattern of thesaurus categories
For each possible structure of a compound noun, the preference is calculated based on this collocational information and the structure with the highest score wins.</Paragraph>
    <Paragraph position="7">  Hindle and Rooth also used collocational information to solve ambiguities of pp-attachment in English \[5\]. Ambiguities are resolved by comparing the strength of associativity between a preposition and a verb and between the preposition and a nominal head. The strength of associativity is calculated on the basis of occurrence frequencies of word collocations in a corpus. Besides the word collocation information, we also use semantic knowledge, namely, a thesaurus.</Paragraph>
    <Paragraph position="8"> The structure of this paper is as follows: Section 2 explains the knowledge for structure analysis of compound nouns and the procedures to acquire it from a corpus, Section 3 describes the analysis algorithm, Section 4 describes the experiments that are conducted to evaluate the performance of our method, and Section 5 summarizes the paper and discusses future research directions.</Paragraph>
  </Section>
  <Section position="5" start_page="865" end_page="866" type="metho">
    <SectionTitle>
2 Collocational Information
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="865" end_page="865" type="sub_section">
      <SectionTitle>
for Analyzing Compound Nouns
</SectionTitle>
      <Paragraph position="0"> This section describes procedures to acquire collocational information for analyzing compound nouns from a corpus of four kanzi character words.</Paragraph>
      <Paragraph position="1"> What we need is occurrence frequencies of all word collocations. It is not realistic, however, to collect all word collocations. We therefore use collocations of thesaurus categories, which are abstractions of words.</Paragraph>
      <Paragraph position="2"> The procedures consist of the following four steps:
1. collect four kanzi character words (section 2.1)
2. divide the above words in the middle to produce pairs of two kanzi character words; if either half is not in the thesaurus, the four kanzi character word is discarded (section 2.1)
3. assign thesaurus categories to both two kanzi character words (section 2.2)
4. count occurrence frequencies of category collocations (section 2.3)</Paragraph>
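Steps 1-3 above can be sketched in a few lines of Python; here `thesaurus` is a hypothetical dict from two-character words to BGH category codes, and ASCII strings stand in for kanzi:

```python
def split_and_categorize(corpus, thesaurus):
    """Steps 1-3: split each four-character word in the middle and map both
    halves to thesaurus categories, discarding words with unknown halves."""
    pairs = []
    for word in corpus:                       # step 1: four kanzi character words
        left, right = word[:2], word[2:]      # step 2: divide in the middle
        if left in thesaurus and right in thesaurus:
            pairs.append((thesaurus[left], thesaurus[right]))  # step 3
        # otherwise the four-character word is discarded
    return pairs
```

Counting the resulting category collocations (step 4) is described in section 2.3.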
    </Section>
    <Section position="2" start_page="865" end_page="865" type="sub_section">
      <SectionTitle>
2.1 Collecting Word Collocations
</SectionTitle>
      <Paragraph position="0"> We use a corpus of four kanzi character words as the knowledge source of collocational information.</Paragraph>
      <Paragraph position="1"> The reasons are as follows: * In Japanese, kanzi character sequences longer than three characters are usually compound nouns. This tendency is confirmed by comparing the occurrence frequencies of kanzi character words in texts and those of headwords in dictionaries.</Paragraph>
      <Paragraph position="2"> We investigated the tendency by using sample texts from newspaper articles and encyclopedias, and Bunrui Goi Hyou (BGH for short), which is a standard Japanese thesaurus. The sample texts include about 220,000 sentences.</Paragraph>
      <Paragraph position="3"> We found that three character words and longer represent 4% in the thesaurus, but 71% in the sample texts. Therefore a collection of four kanzi character words can be used as a corpus of compound nouns.</Paragraph>
      <Paragraph position="4"> Four kanzi character sequences are useful to extract binary relations of nouns, because dividing a four kanzi character sequence in the middle often gives a correct segmentation.</Paragraph>
      <Paragraph position="5"> Our preliminary investigation shows that the accuracy of the above heuristic is 96% (961/1000).</Paragraph>
      <Paragraph position="6"> There is a fairly large corpus of four kanzi character words created by Prof. Tanaka Yasuhito at Aiti Syukutoku college \[8\]. The corpus was manually created from newspaper articles and includes about 160,000 words.</Paragraph>
    </Section>
    <Section position="3" start_page="865" end_page="865" type="sub_section">
      <SectionTitle>
2.2 Assigning Thesaurus Categories
</SectionTitle>
      <Paragraph position="0"> After collecting word collocations, we must assign a thesaurus category to each word. This is a difficult task because some words are assigned multiple categories. In such cases, we obtain several category collocations from a single word collocation, some of which are incorrect. The choices are as follows:  (1) use only word collocations in which every word is assigned a single category.</Paragraph>
      <Paragraph position="1"> (2) equally distribute the frequency of a word collocation to all possible category collocations \[4\]
(3) calculate the probability of each category collocation and distribute frequency based on these probabilities; the probabilities of collocations are calculated by using method (2) \[4\]
(4) determine the correct category collocation by using statistical methods other than word collocations \[2, 10, 9, 6\]
Fortunately, there are few words that are assigned multiple categories in BGH. Therefore, we use method (1). Word collocations containing words with multiple categories represent about 1/3 of the corpus. If we used other thesauruses, which assign multiple categories to more words, we would need to use method (2), (3), or (4).</Paragraph>
    </Section>
    <Section position="4" start_page="865" end_page="866" type="sub_section">
      <SectionTitle>
2.3 Counting Occurrences of Category Collocations
</SectionTitle>
      <Paragraph position="0"> After assigning the thesaurus categories to words, we count occurrence frequencies of category collocations as follows:
1. collect word collocations; at this stage we collect only the patterns of word collocations and do not care about the occurrence frequencies of the patterns
2. replace words with thesaurus categories to produce category collocation patterns
3. count the number of category collocation patterns
Note: we do not care about frequencies of word collocations prior to replacing words with thesaurus categories.</Paragraph>
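The note above can be made concrete: duplicate word collocations are collapsed before categories are substituted and counted. A minimal sketch, with a hypothetical `thesaurus` dict and ASCII strings in place of kanzi:

```python
from collections import Counter

def count_category_collocations(word_pairs, thesaurus):
    """Count category collocation patterns; word-collocation frequencies
    are deliberately ignored (each distinct word pair counts once)."""
    unique_pairs = set(word_pairs)                           # step 1: patterns only
    counts = Counter()
    for left, right in unique_pairs:
        category_pair = (thesaurus[left], thesaurus[right])  # step 2: substitute
        counts[category_pair] += 1                           # step 3: count
    return counts
```

Note that a category pair can still receive a count above one when several distinct word pairs map to it.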
      <Paragraph position="1"> 3 Algorithm
The analysis consists of three steps:
1. enumerate possible segmentations of an input compound noun by consulting headwords of the thesaurus (BGH)
2. assign thesaurus categories to all words
3. calculate the preferences of every structure of the compound noun according to the frequencies of category collocations
We assume that a structure of a compound noun can be expressed by a binary tree. We also assume that the category of the right branch of a (sub)tree represents the category of the (sub)tree itself. This assumption holds because Japanese is a head-final language; a modifier is on the left of its modifiee. With these assumptions, a preference value of a structure is calculated by the recursive function p as follows:</Paragraph>
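The structures compared in step 3 are the binary trees over a segmented word sequence; they can be enumerated by recursing over every split point. A minimal sketch using nested 2-tuples as trees:

```python
def enumerate_trees(words):
    """All binary trees over a word sequence, as nested 2-tuples."""
    if len(words) == 1:
        return [words[0]]                    # a single word is a leaf
    trees = []
    for i in range(1, len(words)):           # every split into left/right parts
        for left in enumerate_trees(words[:i]):
            for right in enumerate_trees(words[i:]):
                trees.append((left, right))
    return trees
```

A three-word sequence yields 2 trees and a four-word sequence 5, following the Catalan numbers.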
      <Paragraph position="3"> where functions l and r return the left and right subtrees of the tree respectively, and cat returns the thesaurus category of the argument. If the argument of cat is a tree, cat returns the category of the rightmost leaf of the tree. Function cv returns an associativity measure of two categories, which is calculated from the frequencies of category collocations described in the previous section. We use two measures for cv: P(cat1, cat2) returns the relative frequency of the collocation in which cat1 appears on the left side and cat2 appears on the right.</Paragraph>
      <Paragraph position="5"> Modified mutual information statistics (MIS):</Paragraph>
      <Paragraph position="7"> MIS is similar to the mutual information used by Church to calculate semantic dependencies between words \[1\]. MIS is different from mutual information in that MIS takes account of the position of the word (left/right).</Paragraph>
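Under the definitions above, p can be sketched as follows; trees are nested 2-tuples of category codes, and `cooc` is a hypothetical Counter of category collocations from which the relative-frequency measure P is computed (the MIS variant would change only the cv line):

```python
from collections import Counter

def cat(tree):
    """Category of a (sub)tree: its rightmost leaf, since Japanese is head-final."""
    return cat(tree[1]) if isinstance(tree, tuple) else tree

def preference(tree, cooc):
    """p(t) = 1 for a leaf, else p(l(t)) * p(r(t)) * cv(cat(l(t)), cat(r(t)))."""
    if not isinstance(tree, tuple):
        return 1.0
    left, right = tree
    cv = cooc[(cat(left), cat(right))] / sum(cooc.values())  # relative frequency P
    return preference(left, cooc) * preference(right, cooc) * cv
```

The structure whose tree receives the highest preference value wins.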
      <Paragraph position="8"> Let us consider an example &amp;quot;新型間接税&amp;quot;.</Paragraph>
      <Paragraph position="9"> Segmentation: two possibilities, (1) &amp;quot;新型 (new)/間接 (indirect)/税 (tax)&amp;quot; and (2) &amp;quot;新 (new)/型 (type)/間接 (indirect)/税 (tax)&amp;quot;, remain as mentioned in section 1.</Paragraph>
      <Paragraph position="10"> Category assignment: assigning thesaurus categories provides</Paragraph>
      <Paragraph position="12"> A three-digit number stands for a thesaurus category. A colon &amp;quot;:&amp;quot; separates multiple categories assigned to a word.</Paragraph>
      <Paragraph position="13"> Preference calculation: For case (1), the possible structures are \[\[118, 311\], 137\] and \[118, \[311, 137\]\].</Paragraph>
      <Paragraph position="14"> We represent a tree with a list notation. For case (2), there is an ambiguity with the category \[118:141:111\]. We expand the ambiguity to 15 possible structures. Preferences are calculated for 17 cases. For example, the preference of structure \[\[118, 311\], 137\] is calculated as follows:</Paragraph>
      <Paragraph position="16"/>
    </Section>
  </Section>
  <Section position="6" start_page="866" end_page="867" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="866" end_page="867" type="sub_section">
      <SectionTitle>
4.1 Data and Analysis
</SectionTitle>
      <Paragraph position="0"> We extract kanzi character sequences from newspaper editorials and columns and encyclopedia text, which has no overlap with the training corpus: 954 compound nouns consisting of four kanzi characters, 710 compound nouns consisting of five kanzi characters, and 786 compound nouns consisting of six kanzi characters are manually extracted from the set of the above kanzi character sequences. These three collections of compound nouns are used for test data.</Paragraph>
      <Paragraph position="1"> We use the thesaurus BGH, which is a standard machine readable Japanese thesaurus. BGH is structured as a tree with six hierarchical levels. Table 1 shows the number of categories at all levels. In this experiment, we use the categories at level 3. If we had more compound nouns as knowledge, we could use a finer hierarchy level.</Paragraph>
      <Paragraph position="2"> Table 1: The number of categories  As mentioned in Section 2, we create a set of collocations of thesaurus categories from a corpus of four kanzi character sequences and BGH.  We analyze the test data according to the procedures described in Section 3. In segmentation, we use the heuristic of &amp;quot;minimizing the number of content words&amp;quot; in order to prune the search space. This heuristic is commonly used in Japanese morphological analysis. The correct structures of the test data were manually created in advance.</Paragraph>
    </Section>
    <Section position="2" start_page="867" end_page="867" type="sub_section">
      <SectionTitle>
4.2 Results and Discussions
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the results of the analysis for four, five, and six kanzi character sequences. &amp;quot;oc&amp;quot; means that the correct answer was not obtained because the segmentation heuristic filtered out the correct segmentation. The first row shows the percentage of cases where the correct answer is uniquely identified, with no tie. The rows denoted &amp;quot;~ n&amp;quot; show the percentage of correct answers in the n-th rank. The row &amp;quot;4 ~&amp;quot; shows the percentage of correct answers ranked fourth or lower.</Paragraph>
      <Paragraph position="1"> Regardless, more than 90% of the correct answers are within the second rank. The probabilistic measure cv1 provides better accuracy than the mutual information measure cv2 for five kanzi character compound nouns, but the result is reversed for six kanzi character compound nouns. The results for four kanzi character words are almost equal. In order to judge which measure is better, we need further experiments with longer words.</Paragraph>
      <Paragraph position="2"> We could not obtain correct segmentation for 11 out of 954 cases for four kanzi character words, 39 out of 710 cases for five kanzi character words, and 15 out of 787 cases for six kanzi character words. Therefore, the accuracies of the segmentation candidates are 99% (943/954), 94.5% (671/710), and 98.1% (772/787) respectively. Segmentation failure is due to words missing from the dictionary and the heuristic we adopted. As mentioned in Section 1, it is difficult to obtain correct segmentation by using only syntactic knowledge. We used the heuristic to reduce ambiguities in segmentation, but ambiguities may remain. In these experiments, there are 75 cases where ambiguities cannot be resolved by the heuristic: 11 such cases for four kanzi character words, 35 such cases for five kanzi character words, and 29 cases for six kanzi character words. For such cases, the correct segmentation can be uniquely identified by applying the structure analysis for 7, 19, and 17 cases, and the correct structure can be uniquely identified for 7, 10, and 8 cases for all collections of test data by using cv1. On the other hand, 4, 18, and 21 cases are correctly segmented and 7, 11, and 8 cases have their structures correctly analyzed for all collections by using cv2.</Paragraph>
      <Paragraph position="3"> For a sequence of segmented words, there are several possible structures. Table 3 shows the possible structures for four-word sequences and their occurrences in all data collections. Since a compound noun of our test data consists of four, five, or six characters, there could be cases with a compound noun consisting of four, five, or six words.</Paragraph>
      <Paragraph position="4"> In the current data collections, however, there are no such cases.</Paragraph>
      <Paragraph position="5"> In Table 3, we find a significant deviation over occurrences of structures. This deviation has a strong correlation with the distance between modifiers and modifiees. The rightmost column (labeled Σd) shows the sum of distances between modifiers and modifiees contained in the structure. The distance is measured based on the number of words between a modifier and a modifiee. For instance, the distance is one if a modifier and a modifiee are immediately adjacent.</Paragraph>
      <Paragraph position="6"> The correlation between the distance and the occurrence of structures tells us that a modifier tends to modify a closer modifiee. This tendency has been experimentally proven by Maruyama \[7\].</Paragraph>
      <Paragraph position="7"> The tendency is expressed by the following formula: q(d) = 0.54 * d^(-1.896) where d is the distance between two words and q(d) is the probability that two words at distance d have a modification relation.</Paragraph>
      <Paragraph position="8"> We redefined cv by taking this tendency into account as the following formula: cv' = cv * q(d) where cv' is the redefined cv. Table 2 shows the results using the new cvs. We obtained significant improvement in the 5 kanzi and 6 kanzi collections.</Paragraph>
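A sketch of threading the redefined measure through the recursive preference function of Section 3, assuming q(d) = 0.54 * d^(-1.896) and measuring the distance d as the number of words from a modifier to its modifiee (1 when immediately adjacent, which here equals the number of leaves of the right subtree); `cooc` is a hypothetical Counter of category collocations:

```python
from collections import Counter

def cat(tree):
    """Category of a (sub)tree: its rightmost leaf."""
    return cat(tree[1]) if isinstance(tree, tuple) else tree

def leaves(tree):
    """Number of words spanned by a (sub)tree."""
    return leaves(tree[0]) + leaves(tree[1]) if isinstance(tree, tuple) else 1

def q(d):
    """Assumed distance weight from Maruyama's result: 0.54 * d**-1.896."""
    return 0.54 * d ** -1.896

def preference2(tree, cooc):
    """Preference using the redefined measure cv' = cv * q(d)."""
    if not isinstance(tree, tuple):
        return 1.0
    left, right = tree
    d = leaves(right)  # distance from modifier head to modifiee head (1 if adjacent)
    cv = cooc[(cat(left), cat(right))] / sum(cooc.values())
    return preference2(left, cooc) * preference2(right, cooc) * cv * q(d)
```

With this weighting, structures that make a modifier reach across more words are penalized, matching the deviation observed in Table 3.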
      <Paragraph position="9"> Table 3: Table of possible structures</Paragraph>
    </Section>
  </Section>
</Paper>