<?xml version="1.0" standalone="yes"?>
<Paper uid="C80-1078">
  <Title>JAPANESE SENTENCE ANALYSIS FOR AUTOMATIC INDEXING</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
JAPANESE SENTENCE ANALYSIS FOR AUTOMATIC INDEXING
</SectionTitle>
    <Paragraph position="0"> A new method for automatic keyword extraction and &amp;quot;role&amp;quot; setting is proposed, based on Japanese sentence structure analysis.</Paragraph>
    <Paragraph position="1"> The analysis takes into account two features of Japanese sentences: the structure of a sentence is determined by the noun-predicate verb dependency, and the case-indicating words (kaku-joshi) play an important role in deep case structure. By utilizing the meaning of a noun as it depends on each predicate verb, restricted semantic processing becomes possible. An automatic indexing system, equipped with a man-machine interactive error-correcting function, has been developed.</Paragraph>
    <Paragraph position="2"> The system is evaluated by applying it to news information retrieval.</Paragraph>
    <Paragraph position="3"> The results of this evaluation show that the system can be put to practical use.</Paragraph>
    <Paragraph position="4"> 1. Introduction The main problems arising in the development of an information retrieval system for Japanese text are the need to save manpower, to standardize information storage, and to realize efficient retrieval.</Paragraph>
    <Paragraph position="5"> For English text, the stop-word removal method for automatic keyword extraction has been put to practical use.</Paragraph>
    <Paragraph position="6"> However, for Japanese text, which consists of Kanji and Kana characters, a keyword extraction method utilizing statistical word frequency data has been reported by a Kyoto University group [3]. This paper proposes a new method of automatic keyword extraction and &amp;quot;role&amp;quot; setting for Japanese news information retrieval. The &amp;quot;role&amp;quot; gives the semantic identification of each keyword in a sentence and is classified into six categories: human subject, human object, time, place, action, and miscellaneous important information.</Paragraph>
    <Paragraph position="7"> The main features of Japanese sentences can be characterized as follows: (1) The structure of a sentence is determined by the noun-predicate verb dependency.</Paragraph>
    <Paragraph position="8"> (2) The case-indicating words (kaku-joshi) play an important role in deep case structure.</Paragraph>
    <Paragraph position="9"> Taking these features into account, D.G. Hays's dependency grammar [1] and C.J. Fillmore's case grammar [2] are utilized in the sentence structure analysis. The sentence pattern table, which contains noun-predicate verb dependency relationships, plays an important role in the analysis. By utilizing the meaning of a noun as it depends on each predicate verb, restricted semantic processing becomes possible. An automatic indexing system [5], equipped with a man-machine interactive error-correcting function, has been developed based on the method described. The system has been evaluated by applying it to news information retrieval.</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="514" type="metho">
    <SectionTitle>
2. Role Setting Criteria
</SectionTitle>
    <Paragraph position="0"> The criteria employed for setting the role of each keyword in a news sentence are as follows: (1) &amp;quot;Action&amp;quot; (A for short) is assigned to verbs which express movement and are elements of the &amp;quot;predicate&amp;quot; set.</Paragraph>
    <Paragraph position="1"> (2) &amp;quot;Time&amp;quot; (T for short) can be assigned without ambiguity.</Paragraph>
    <Paragraph position="2"> (3) &amp;quot;Human subject&amp;quot; (HS for short), &amp;quot;human object&amp;quot; (HO for short), &amp;quot;place&amp;quot; (P for short) and &amp;quot;miscellaneous important information&amp;quot; (MI for short) are assigned to noun words according to the following criteria: (a) Words which express humans or organizations have either role &amp;quot;HS&amp;quot; or &amp;quot;HO&amp;quot;. The distinction can be made by examining the subsequent kaku-joshi.</Paragraph>
    <Paragraph position="3"> (b) Words which express things without consciousness have role &amp;quot;MI&amp;quot;. (c) A country name has role &amp;quot;HS&amp;quot; if it is presumed to have consciousness as an organization. It has role &amp;quot;P&amp;quot; if it means territory.</Paragraph>
    <Paragraph position="4"> (d) An airplane or a ship has role &amp;quot;HS&amp;quot; when it is personified together with its driver, role &amp;quot;P&amp;quot; when it expresses a place, and role &amp;quot;MI&amp;quot; when it means a thing.</Paragraph>
    <Paragraph position="5"> (e) Ambiguities in items (c) and (d) are resolved by identifying the predicate verb on which the word depends; this determines whether the word expresses a human, an organization, a place or a miscellaneous matter.</Paragraph>
    <Paragraph position="6"> To clarify the description, some examples are given below: (4) As mentioned above, the &amp;quot;role&amp;quot; of a noun word is determined by considering the following three elements: (a) the predicate verb on which the noun word depends, (b) the meaning of the noun word, and (c) the kaku-joshi which is concatenated to the noun word. 3. Japanese Sentence Structure Analysis The basic Japanese sentence pattern is expressed as &amp;quot;NF1 NF2 ... NFn PV&amp;quot;, where NFi, which is called &amp;quot;meishi-bunsetsu&amp;quot;, is composed of a noun word and case-indicating words, and where PV is a predicate verb. The Japanese sentence structure is characterized by the following points: (1) The predicate verb is put at the end of the sentence. (2) The position of a &amp;quot;meishi-bunsetsu&amp;quot; in a sentence is not fixed.</Paragraph>
    <Paragraph position="7"> (3) A &amp;quot;meishi-bunsetsu&amp;quot; can be omitted in a discourse which consists of several sentences.</Paragraph>
    <Paragraph position="8"> Utilizing D.G. Hays's dependency grammar, noun-predicate verb dependency relationships are formulated. In this formulation the relationships between nouns are irrelevant. Therefore, the Japanese sentence structure becomes independent of noun-word order, and a word omission is expressed in terms of the presence of a dependency relationship in the sentence. Since the &amp;quot;role&amp;quot; is the semantic identification of a word, it can be assigned to each keyword by applying C.J. Fillmore's case grammar [2] and clarifying the case structure of the predicate verb (Figure 1). In Japanese sentence structure analysis, the predicate verb is identified first and then the dependent noun words are determined in order of nearness to the predicate verb. The sentence is parsed by using top-down analysis. The bottom-up method is not adopted because it causes much ambiguity in the parsing of words which do not directly depend on the predicate verb. The need for classification of noun words in terms of their meaning is mentioned in Section 2. Noun words are classified into seven semantic classes in order to analyze noun-predicate verb dependency relationships efficiently and to set &amp;quot;roles&amp;quot; for them: (i) Organization (ii) Person (iii) Literature (iv) Place (v) Action (vi) Name of matter, Abstract idea, etc. (vii) Time. Predicate verbs are classified by taking into account the meaning of the dominated words and their cases (Figure 2). The sentence pattern table is constructed based on this predicate verb classification (Figure 3). In the news retrieval system, about 5600 predicate verbs are classified into 586 classes; this classification is called case-information (A4-code). The sentence pattern table contains 1686 patterns. A sentence pattern in the table is composed of at most four triplets.
Elements of the triplet are the semantic class identification code of the noun word, the kaku-joshi, and the &amp;quot;role&amp;quot;, which is determined by the values of the first two elements.</Paragraph>
    <Paragraph position="9"> For example, &amp;quot;shihai-suru&amp;quot; (control) and &amp;quot;kogeki-suru&amp;quot; (attack) belong to category No. 46. The predicate verbs of this category have six sentence patterns, and each sentence pattern has two triplets. The first sentence pattern has triplets (ga, A, 1) and (wo, I, 2). The first code of the triplet is the &amp;quot;kaku-joshi&amp;quot;, the second code is the semantic classification code of the noun word, and the third code is the &amp;quot;role&amp;quot;. Semantic classification code &amp;quot;A&amp;quot; expresses organization or person.</Paragraph>
    <Paragraph position="10"> Sentence analysis and &amp;quot;role&amp;quot; setting are performed referring to this sentence pattern table.</Paragraph>
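    The table lookup described above can be illustrated with a minimal Python sketch. It is not the system's implementation: the names SENTENCE_PATTERNS and assign_roles, the placeholder nouns, and the matching rule are assumptions made for illustration. Only the first pattern of category No. 46 is taken from the text, and the role codes 1 and 2 are kept as numbers because the text does not say which roles they denote.

```python
from typing import Dict, List, Optional, Tuple

# A triplet is (kaku-joshi, semantic class code, role code), in the order
# given in the text for category No. 46.
Triplet = Tuple[str, str, int]

# Hypothetical table keyed by predicate-verb category (A4-code). Only the
# first pattern of category No. 46 is stated in the text; the other five
# patterns are not given there and are therefore omitted.
SENTENCE_PATTERNS: Dict[int, List[List[Triplet]]] = {
    46: [[("ga", "A", 1), ("wo", "I", 2)]],
}


def assign_roles(
    category: int, bunsetsu: List[Tuple[str, str, str]]
) -> Optional[Dict[str, int]]:
    """Return a role code for every noun, or None if no pattern fits.

    bunsetsu holds (noun, kaku_joshi, semantic_class) items. Matching
    ignores word order, and a pattern triplet may stay unmatched, which
    models an omitted meishi-bunsetsu (an assumption of this sketch).
    """
    for pattern in SENTENCE_PATTERNS.get(category, []):
        lookup = {(joshi, sem): role for joshi, sem, role in pattern}
        roles: Dict[str, int] = {}
        for noun, joshi, sem in bunsetsu:
            role = lookup.get((joshi, sem))
            if role is None:
                break  # this pattern does not fit; try the next one
            roles[noun] = role
        else:
            return roles
    return None


# Placeholder nouns; the two word orders yield the same role codes.
print(assign_roles(46, [("noun_a", "ga", "A"), ("noun_b", "wo", "I")]))
print(assign_roles(46, [("noun_b", "wo", "I"), ("noun_a", "ga", "A")]))
```

    Keying each pattern on the pair (kaku-joshi, semantic class) keeps the match independent of noun-word order, which is the property the dependency formulation is meant to provide.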
    <Paragraph position="11"> 4. Automatic Indexing System An automatic indexing system has been developed based on the method described. The processing procedure of the system consists of the following three steps (Figure 4): (1) Word recognition. (2) Automatic &amp;quot;role&amp;quot; setting resulting from the sentence structure analysis. (3) Man-machine interactive error-correction. The hardware configuration is given in Table 1. Size and performance of the programs are given in Table 2.</Paragraph>
    <Section position="1" start_page="514" end_page="514" type="sub_section">
      <SectionTitle>
4.1 Word Recognition
</SectionTitle>
      <Paragraph position="0"> Word recognition is executed in the following two steps (Figure 5): automatic segmentation of the Kanji and Kana character string, and the matching of each segment with entries in the content word dictionary (&amp;quot;Jiritsu-go&amp;quot; dictionary, which contains nouns, verbs, etc.) and the function-word table (&amp;quot;Fuzoku-go&amp;quot; table) to obtain syntactic and semantic information concerning the word. The first step utilizes statistical features of Japanese sentences. The second step is a morphological word analysis [4]. The following information codes are given to the words contained in the &amp;quot;Jiritsu-go&amp;quot; dictionary: The morphological analysis procedure gives the following information by referring to the &amp;quot;Fuzoku-go&amp;quot; table: (6) C1-code: the kaku-joshi classification code. (7) C2-code: the code distinguishes active voice, passive voice and causative expression. (8) C3-code: the code given to a meishi-bunsetsu distinguishes whether the meishi-bunsetsu is a direct dependent of the predicate verb or a modifier of another meishi-bunsetsu.</Paragraph>
      <Paragraph position="1"> The code given to the predicate verb expresses the type of inflection of the verb and the kind of subsequent conjunctive function word (setsuzoku-joshi).</Paragraph>
    </Section>
    <Section position="2" start_page="514" end_page="514" type="sub_section">
      <SectionTitle>
4.2 Automatic &amp;quot;Role&amp;quot; Setting
</SectionTitle>
      <Paragraph position="0"> Automatic &amp;quot;role&amp;quot; setting is executed in the following four steps (Figure 6): (1) First, predicate verbs in a sentence are recognized by referring to the At-code. Then the complex sentence structure is analyzed and divided into simple sentences.</Paragraph>
      <Paragraph position="1"> (2) Sentence patterns for each simple sentence are obtained by utilizing the A4-code.</Paragraph>
      <Paragraph position="2"> Then, noun-predicate verb dependency is analyzed by comparing the B-code and the C1-code of noun words with the sentence pattern. Prior to this analysis the following procedures are executed.</Paragraph>
      <Paragraph position="3"> (3) Words in the noun phrase modify the last noun word of the phrase in the analysis.</Paragraph>
      <Paragraph position="4"> (4) The &amp;quot;role&amp;quot; is automatically given to each keyword using the results of the above three procedures.</Paragraph>
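      As a rough illustration of step (1) and of the nearest-first ordering of dependent nouns described in Section 3, the following Python sketch may help. It is a deliberate simplification and not the system's algorithm: the function names, the boolean predicate-verb flag, and the rule that every predicate verb closes a simple sentence are assumptions of this sketch, and embedded clauses and noun modifiers are not handled.

```python
from typing import List, Tuple

# A token is (surface form, is_predicate_verb). In the described system the
# flag would come from dictionary codes; here it is supplied by hand.
Token = Tuple[str, bool]


def split_simple_sentences(tokens: List[Token]) -> List[List[str]]:
    """Cut a token sequence after every predicate verb.

    This relies on the stated property that the predicate verb is put at
    the end of its sentence.
    """
    sentences: List[List[str]] = []
    current: List[str] = []
    for surface, is_predicate_verb in tokens:
        current.append(surface)
        if is_predicate_verb:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)  # trailing words without a predicate verb
    return sentences


def nouns_by_nearness(simple_sentence: List[str]) -> List[str]:
    """List the words before the final predicate verb, nearest first."""
    return list(reversed(simple_sentence[:-1]))


# Placeholder tokens: two simple sentences, ending in v1 and v2.
tokens = [("n1", False), ("v1", True), ("n2", False), ("n3", False), ("v2", True)]
for sentence in split_simple_sentences(tokens):
    print(sentence[-1], nouns_by_nearness(sentence))
```

      Each resulting simple sentence could then be compared with the sentence patterns of its predicate verb, as step (2) describes.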
    </Section>
    <Section position="3" start_page="514" end_page="514" type="sub_section">
      <SectionTitle>
4.3 Man-Machine Interactive Error-Correcting Function
</SectionTitle>
      <Paragraph position="0"> The man-machine interactive error-correction unit consists of a Kanji video terminal and a Kanji line printer.</Paragraph>
      <Paragraph position="1"> 5. Evaluation of the System The system has been evaluated by applying it to news information retrieval. The results show that, on the assumption that the content word dictionary and the sentence pattern table cover 90% of the processed words and processed sentence patterns, 85 to 90% of the extracted keywords and 80 to 85% of the assigned roles are estimated to be correct. Also, the time required for indexing is only one third of that required for conventional manual indexing, and the retrieval precision-ratio is improved by 20 to 30% without affecting the recall-ratio. With this method the turnaround time for information storage is reduced to half of that of the conventional manual method. Examples of output are given in Figure 7.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="514" end_page="514" type="metho">
    <SectionTitle>
6. Conclusion
</SectionTitle>
    <Paragraph position="0"> A new method of automatic keyword extraction and &amp;quot;role&amp;quot; setting has been proposed and evaluated. An experimental automatic indexing system has been developed utilizing the above-mentioned Japanese sentence structure analysis. The analysis is characterized as follows: (1) It is based on the noun-predicate verb dependency.</Paragraph>
    <Paragraph position="1"> (2) Restricted semantic processing becomes possible by utilizing the meaning of a noun as it depends on each predicate verb.</Paragraph>
    <Paragraph position="2"> An automatic indexing system has been developed based on the proposed method. By utilizing the system, the following problems, which arose in the development of an information retrieval system, have been solved: manpower savings, information storage standardization and the realization of efficient retrieval.</Paragraph>
  </Section>
</Paper>