<?xml version="1.0" standalone="yes"?>
<Paper uid="C80-1001">
  <Title>MORPHOLOGICAL ASPECT OF JAPANESE LANGUAGE PROCESSING</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
MORPHOLOGICAL ASPECT OF JAPANESE LANGUAGE PROCESSING
</SectionTitle>
    <Paragraph position="0"> A comprehensive grammatical model produced for analyzing the agglutinated structure which characterizes the Japanese language is presented. This model, which includes extensively idiomatic postpositional expressions as terminals, is quite effective for the development of the Japanese language processor receptive to a reasonable variety of sentential forms and applicable to relatively wide fields.</Paragraph>
    <Paragraph position="1"> Introduction. The following fundamental problems remain latent in most present natural language processing systems: (i) how to give the system higher-quality processing that makes its output more usable; (ii) how to broaden the field to which the system applies; and (iii) how to let the system accept more &quot;natural&quot; input sentences, including miscellaneous linguistic constructions. To remedy these problems, we need not only far more advanced A.I. research on knowledge representation and deduction, but also more elaborate studies of the surface structures of natural sentences from the engineering viewpoint.</Paragraph>
    <Paragraph position="2"> The need for such a linguistic approach on the engineering side is especially urgent for Japanese language processing, since no existing Japanese grammar is extensive and definite enough to solve these problems, problem (iii) in particular.</Paragraph>
    <Paragraph position="3"> The authors have been developing a Japanese language parser for a Japanese-English translation system from the following standpoints.</Paragraph>
    <Paragraph position="4"> (1) Wide coverage of the input forms; We aim at a system powerful enough to accept, with few exceptions, the sentential forms that appear in actual colloquial and written texts (e.g. non-pre-edited sentences in technical papers).</Paragraph>
    <Paragraph position="5"> (2) Two-phase parsing; The system first analyzes each local expression, the syntactic and semantic unit that immediately constitutes the input sentence, and then analyzes the whole sentence by detecting the relationships between these units. The first phase, which corresponds to the morphological phase in an ordinary parser for a European language, is designed to analyze not only the inflection of words but also the &quot;agglutinated&quot; structure that characterizes the Japanese language.</Paragraph>
    <Paragraph position="6"> We attach much importance to the first phase, which has a great influence on the overall performance of the system.</Paragraph>
    <Paragraph position="7"> (3) Elaborate preparation for the first phase; In the first phase, we adopt an elaborate grammatical model that prescribes the internal structure of the above-mentioned units in detail. In particular, the extensive enumeration of postpositional expressions carried out in the model is quite effective for solving problem (iii), since these expressions determine the syntactic and semantic &quot;framework&quot; of the Japanese sentence. The inflection of words can also be handled in this model in a relatively simple way, almost without exceptions.</Paragraph>
    <Paragraph position="8"> (4) Matching of the first phase and higher phases; Most of the atomic postpositional expressions enumerated in the model are idiomatic ones that should be treated without being decomposed into words, because their meanings are definite and unitary. This yields a good match between the first phase and the higher semantic phases.</Paragraph>
    <Paragraph position="9"> (5) Disambiguation in the first phase; Part of the polysemy of a postpositional expression can be resolved by co-occurrence restrictions on the neighboring positions in the sentence. Our grammar for the first phase is designed to carry out disambiguation of this type. This is based on the idea that, for the total efficiency of the system, the syntactic and semantic structure ought to be disambiguated as early and as much as possible.</Paragraph>
    <Paragraph position="10"> In this paper, we present the above-mentioned grammatical model for the first phase of parsing, which may be called the &quot;pseudo-morphological&quot; phase, and outline the experimental system developed to verify its validity. After showing some operational examples and the results of the experiment, we conclude that our model is quite effective for Japanese language processing from the standpoints mentioned above.</Paragraph>
    <Paragraph position="11"> Japanese sentence, E-bunsetsu. The information to be extracted from the input sentence by the parser can generally be classified into the following three types: (a) information on the concept, which is ordinarily provided by a conceptual word (e.g. noun, verb, adjective); (b) information on the relationship between concepts; (c) supplementary information such as &quot;tense&quot;, &quot;aspect&quot;, &quot;mood&quot;, etc. Japanese is an agglutinative language and is structurally very far from European languages: information of type (b) or (c) is ordinarily given by an annex-expression agglutinated postpositionally to the conceptual expression that gives the information of type (a). We call the compound consisting of the conceptual expression and the annex-expression an E-bunsetsu†.</Paragraph>
    <Paragraph position="12"> Information of type (b) is given as a dependency relation, called the kakariuke-relation, between E-bunsetsus. A sentence consists immediately of E-bunsetsus positioned in a relatively free order except for a few constraints††. Because of this structural feature, we adopt a two-phase approach to parsing the Japanese sentence: the first phase analyzes each E-bunsetsu; the second detects the kakariuke-relational structure of the sentence.</Paragraph>
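    <Paragraph><![CDATA[As an illustration only (it is not part of the system described in this paper), the constraints on the kakariuke-relation can be written as a small check: every E-bunsetsu except the last depends on exactly one later E-bunsetsu, a dependency may not pass an E-bunsetsu that governs an earlier one, and the last E-bunsetsu must be predicative. The function name and the list-based encoding of a sentence are assumptions made for this sketch.

```python
def is_valid_kakariuke(heads, predicative):
    """Check the ordering constraints on a kakariuke (dependency) structure.

    heads[i] is the index of the E-bunsetsu that E-bunsetsu i depends on
    (None for the final one); predicative[i] tells whether E-bunsetsu i
    is predicative. Both encodings are hypothetical.
    """
    n = len(heads)
    if n == 0:
        return False
    # The final E-bunsetsu depends on nothing and must be predicative.
    if heads[-1] is not None or not predicative[-1]:
        return False
    for i in range(n - 1):
        h = heads[i]
        # Each non-final E-bunsetsu depends on exactly one later E-bunsetsu.
        if h is None or not (i < h < n):
            return False
        # The dependency i -> h must not pass any j (i < j < h)
        # that governs an E-bunsetsu located before i.
        for j in range(i + 1, h):
            if any(heads[k] == j for k in range(i)):
                return False
    return True


# 'Mary ni tsui te' depends on the predicative E-bunsetsu 'hanasu'.
print(is_valid_kakariuke([1, None], [False, True]))
```

A structure such as heads = [2, 3, 3, None] is rejected, because the dependency of the second E-bunsetsu passes the third, which already governs the first.]]></Paragraph>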
    <Paragraph position="13"> It is apparent that an extensive characterization of the E-bunsetsu gives the system wide coverage of input sentential forms. Specifically, an extensive enumeration of the annex-expressions will drastically broaden the range of acceptable input forms, since they make up the syntactic and semantic &quot;framework&quot; of the sentence. However, an annex- or conceptual expression may itself be a compound of atomic expressions, and such compounds are too varied in form to be enumerated extensively.</Paragraph>
    <Paragraph position="14"> From these points of view, we have constructed a grammatical model for analyzing the E-bunsetsu by, first, extensively enumerating atomic expressions, except for most of the conceptual ones, which are quite numerous; second, classifying them by their syntactic and, partially, semantic functions; and third, prescribing the connectability rules of atomic expressions within the E-bunsetsu.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Atomic Expressions
</SectionTitle>
      <Paragraph position="0"> The notion of &quot;atomic expression&quot; extends that of &quot;word&quot; so as to include the idiomatic word-string that has a unitary, self-supported meaning and a definite syntactic function. Though we often encounter such idiomatic strings in sentences of everyday use, it has not been clarified exhaustively how many are needed for building up natural sentences and how they can be used.</Paragraph>
      <Paragraph position="1"> † The notion of &quot;bunsetsu&quot; in conventional school grammar is well known as the unit of sentence construction. However, the unitary local structure in real sentences used in everyday life is often too varied in form to be analyzed with it. The notion of &quot;E-bunsetsu&quot;, which is a fully extended version of &quot;bunsetsu&quot;, was devised from the standpoints mentioned in the previous chapter.</Paragraph>
      <Paragraph position="2"> †† When we let a string $EB_1\, EB_2 \cdots EB_n$ be a sentence, each E-bunsetsu $EB_i$ ($1 \le i &lt; n$) must depend on only one of $EB_{i+1}, \ldots, EB_n$ without passing any $EB_j$ ($i &lt; j$) that governs at least one of $EB_1, \ldots, EB_{i-1}$. Moreover, $EB_n$ must be predicative.</Paragraph>
      <Paragraph position="3"> We have extensively singled out atomic expressions, except for most of the conceptual ones, from approximately 12,000 sentences in technical papers and senior high school textbooks.</Paragraph>
      <Paragraph position="4"> Their rough categorization is shown in the following. (The number of expressions is given in parentheses.) Annex-expressions. Atomic annex-expressions are classified into two kinds: relational expressions, which provide the information of type (b); and co-predicative expressions, which provide the information of type (c).</Paragraph>
      <Paragraph position="5"> Relational Expressions (575). While the typical example of the relational expression is the particle of conventional grammar, eighty percent of the relational expressions are idiomatic word-strings. For example, the word-string 'ni tsui te' is atomic and relational because it has a proper, indivisible and self-supported meaning equivalent to that of the English preposition 'about' in such a context as 'Mary ni tsui te hanasu' ('talk about Mary').</Paragraph>
      <Paragraph position="6"> (The original meaning of the verb 'tsuku' is almost lost in this context.) These relational expressions can be divided roughly into ten categories according to their ability to indicate the kakariuke-relation.</Paragraph>
      <Paragraph position="7"> We denote these categories by RNP1, RNP2, RNP3, RPP1, RPP2, RPP3, RNN1, RNN2, RNN3 and RPN. RNP1, RNP2 or RNP3, for example, is a category of expressions which indicate the dependency of the nominal E-bunsetsu, N, on the predicative E-bunsetsu, P. The expression 'ni tsui te' mentioned above is included in RNP1.</Paragraph>
      <Paragraph position="8"> Co-predicative Expressions (348). The auxiliary verb of conventional school grammar is typically co-predicative, but ninety percent of the co-predicative expressions singled out are idiomatic. For example, the word-string 'ta hou ga yoi', which is equivalent to 'had better' in English, provides information on modality. These expressions can be divided into seven categories, i.e. Anp1, Anp2, Anp3, App1, App2, App3 and App4, according to their connective functions and whether or not they can inflect.</Paragraph>
      <Paragraph position="9"> App1, for example, represents a category of inflectable expressions each of which yields a predicative expression, p, by connecting (agglutinating) to a predicative expression, p. The atomic expression 'ta hou ga yoi' mentioned above is in App1. Conjunctive Expressions (122). Besides the traditional conjunctions, many idiomatic conjunctive expressions have been singled out as atomic ones. For example, the string 'sikasi nagara', which is equivalent to 'however' in English, is conjunctive and atomic.</Paragraph>
      <Paragraph position="10"> The conjunctive expression is not annexational, but it offers information of type (b). Two categories are observed: one, denoted by C1, of expressions which can indicate both the relation between two sentences and the relation between two E-bunsetsus; the other, denoted by C2, of expressions which indicate exclusively the relation between two sentences.</Paragraph>
      <Paragraph position="11"> Suffixal Expressions (403). The conceptual expressions are too numerous to be enumerated exhaustively. In addition, it is difficult at present to establish extensive rules for constructing conceptual compounds.</Paragraph>
      <Paragraph position="12"> We have singled out only those suffixal constituents of conceptual compounds that are used very frequently and have definite syntactic and semantic functions. These are classified roughly into seven categories, i.e. Snp1, Snp2, Spp, Snn1, Snn2, Snn3 and Spn, by their functions. For example, Snp1, which includes such a frequently used string as 'de aru', is a category of expressions each of which constitutes a predicative conceptual expression, p, when suffixed to a nominal conceptual expression, n.</Paragraph>
      <Paragraph position="13"> A conceptual compound of quantitative, temporal or locational meaning, e.g. '3 zi 15 hun' ('a quarter past three'), is exceptional in that it is sometimes easy to decompose into constituents. A good many suffixal constituents of these compounds are included in Snn1.</Paragraph>
      <Paragraph position="14"> Adverbial Expressions (262). The adverbial expressions fall into two categories: D2, for expressions which are always used together with some other specific expression, and D1, for the rest. For example, 'kanarazusimo ... (nai)' ('not necessarily') is in D2.</Paragraph>
      <Paragraph position="15"> Adnominal Expressions (165). The adnominal expression, such as 'subete no' ('all'), is similar to the adjective except that it does not inflect and is always used attributively, placed ahead of the nominal E-bunsetsu to be modified. The category of these expressions is denoted by T.</Paragraph>
      <Paragraph position="16"> Structure of E-bunsetsu. The structure of the E-bunsetsu can be characterized in the form of a &quot;transition net&quot;, since it has no complex embedded structures. Our structural characterization is based on prescribing the connection rules of the atomic expressions within an E-bunsetsu. It is shown in two stages in this chapter.</Paragraph>
      <Paragraph position="17"> General Structure of E-bunsetsu. The general structure of the E-bunsetsu is shown in Fig. 1 using the above-mentioned categories and three traditional ones, M1, M2 and Y, representing the noun, the verbal noun (i.e. the noun called &quot;sahen-meishi&quot;) and yougen (i.e. verb, adjective, adjective verb), respectively.</Paragraph>
      <Paragraph position="19"> In Fig. 1, nodes represent the categories, and arrows denote that expressions in starting nodes can be immediately followed (agglutinated) by those in ending nodes.</Paragraph>
      <Paragraph position="20"> The E-bunsetsu can be analyzed, though roughly, by starting at the initial node and tracing a path in the figure. Every node is an accepting node for the E-bunsetsu. The conceptual expression corresponds to a path terminating at a node located above the dotted line. The syntactic and semantic function of the E-bunsetsu can be estimated from the terminating node of the path.</Paragraph>
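      <Paragraph><![CDATA[The path-tracing idea can be sketched as follows. This is only a toy under stated assumptions: the arcs in ARCS are hypothetical and do not reproduce Fig. 1, the category labels are abbreviated (for example Snp and App without their numeric indices), and the function name is invented for the sketch.

```python
# Hypothetical arcs: each category maps to the categories that may
# immediately follow (agglutinate to) it. These are NOT the arcs of Fig. 1.
ARCS = {
    "INIT": {"M1", "Y"},
    "M1": {"Snp", "Rnp"},
    "Snp": {"App", "Rpp"},
    "Y": {"App", "Rpp"},
    "App": {"App", "Rpp"},
    "Rnp": set(),
    "Rpp": set(),
}


def trace(categories):
    """Trace a category sequence from the initial node.

    Returns the terminating node when the sequence is a path in the net,
    otherwise None. Every node reached is treated as accepting.
    """
    node = "INIT"
    for category in categories:
        if category not in ARCS.get(node, set()):
            return None
        node = category
    # An empty sequence never leaves the initial node and is not an E-bunsetsu.
    return None if node == "INIT" else node


print(trace(["M1", "Snp", "App"]))
print(trace(["M1", "App"]))
```

The terminating node returned by trace is what the function of the E-bunsetsu would be estimated from; a sequence that leaves the net yields None.]]></Paragraph>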
      <Paragraph position="21"> Generality of Characterization. In order to verify the generality of the characterization shown in Fig. 1, we inspected approximately 1,500 actual sentences in technical papers by segmenting each sentence into E-bunsetsus according to the above rules. Table 1 shows the results of the inspection. They indicate that our enumeration of annex-expressions is almost sufficient, and that all of the sentences inspected can be segmented into E-bunsetsus if the expressions missing from the enumeration are newly registered and classified into existing categories. In addition, it turned out that the idea of the E-bunsetsu, which elucidates a larger structure than a &quot;bunsetsu&quot;, is quite effective for reducing the load of the second phase of the parser, because it decreases the number of immediate constituents of the sentence by fourteen percent. Moreover, the rate of appearance of atomic relational expressions that are originally compound was found to be sixteen percent. These facts assure the generality of the characterization to a reasonable extent. [Table 1. Results of Inspection. Number of atomic expressions missing in the enumeration: annex- 6]</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Detailed Structure of E-bunsetsu
</SectionTitle>
      <Paragraph position="0"> In developing a natural language system, a fundamental and crucial problem is how elaborate the grammatical rules should be, that is, how much of the syntactic and semantic structure of the sentence should be disambiguated within the grammatical phase of processing. We think it profitable for the total efficiency of the system to disambiguate as much and as early as possible. From this point of view, we try to do so in the phase of analyzing the E-bunsetsu by refining the characterization of Fig. 1 without destroying its grammatical features and generality.</Paragraph>
      <Paragraph position="1"> Inflectional Endings. The word-inflection of Japanese language is closely related to the agglutination of words. The connection represented in Fig.1 by the asterisked arrow should be restricted by the inflectional type and inflectional form of the preceding expression, which is inflectable.</Paragraph>
      <Paragraph position="2"> While subcategorizing the inflectable expressions by their inflectional types, we gave each expression in the ending nodes of the asterisked arrows a dictionary entry stating which inflectional types and forms it can be connected to. The inflectional form is known by detecting the ending. Table 2 shows the paradigm. The asterisked letter in the table is a euphonic one by which the final letter of the stem may be replaced. '~' represents an empty ending.</Paragraph>
      <Paragraph position="3"> This paradigm (and the experimental system described in the next chapter) is based on a way of writing Japanese characters in English letters that was devised with mechanical processing in mind.</Paragraph>
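      <Paragraph><![CDATA[A minimal sketch of the inflectional check, assuming a one-row toy paradigm rather than the paradigm of Table 2: the ending that follows a stem determines the inflectional form, and the following expression lists the (type, form) pairs it accepts. The type label "verb-s", the English form names, the "*" wildcard and the function names are all assumptions of this sketch; the endings are the ordinary ones of a consonant-stem verb such as 'hanas-' ('hanasa', 'hanasi', 'hanasu', 'hanase').

```python
# Toy paradigm: inflectional type -> ending -> set of inflectional forms.
PARADIGM = {
    "verb-s": {
        "a": {"irrealis"},
        "i": {"continuative"},
        "u": {"conclusive", "adnominal"},
        "e": {"conditional"},
    },
}


def forms_of(infl_type, ending):
    """Return the inflectional forms signalled by an ending, if any."""
    return PARADIGM.get(infl_type, {}).get(ending, set())


def can_connect(infl_type, ending, allowed):
    """Decide whether a following expression may attach.

    allowed is the set of (type, form) pairs taken from the dictionary
    entry of the following expression; type "*" stands for any type.
    """
    forms = forms_of(infl_type, ending)
    return any(t in (infl_type, "*") and f in forms for t, f in allowed)


TA = {("verb-s", "continuative")}  # 'hanas(i) ta'
TAME = {("*", "adnominal")}        # 'hanas(u) tame'

print(can_connect("verb-s", "i", TA))
print(can_connect("verb-s", "u", TA))
print(can_connect("verb-s", "u", TAME))
```

Euphonic replacement of the final stem letter, which the real paradigm marks with asterisks, is deliberately left out of this sketch.]]></Paragraph>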
      <Paragraph position="4"> Subcategorization. We subcategorized some of the annex-expressions by their detailed agglutinative functions using a formal algorithm†. It should be noted that a homonymous expression whose meanings have individual agglutinative functions was entered in several categories, one for each function. The meanings of these expressions can therefore be disambiguated by checking the agglutinative structure of the E-bunsetsu.</Paragraph>
      <Paragraph position="5"> † i.e. to partition the set $E = R_{NP1} \cup R_{NP2} \cup R_{PP1} \cup R_{PP2} \cup R_{PP3}$ by the following relation $R$ into equivalence classes: for $\forall x, y \in E$ ($x R y \Leftrightarrow$ for $\forall w_1, w_2 \in E$ (($x * w_1 \leftrightarrow \ldots$</Paragraph>
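      <Paragraph><![CDATA[The subcategorization by equivalence classes can be illustrated with a small sketch. Because the defining relation is only partly legible in the source, the sketch assumes that two expressions are equivalent when they can be followed by exactly the same expressions and can follow exactly the same expressions; this reading, the function name and the toy data are assumptions, not the algorithm of the paper.

```python
from collections import defaultdict


def partition(expressions, connects):
    """Group expressions with identical connectability on both sides.

    connects(x, w) must return True when x may be immediately followed by w.
    """
    classes = defaultdict(list)
    for x in expressions:
        signature = (
            frozenset(w for w in expressions if connects(x, w)),  # what may follow x
            frozenset(w for w in expressions if connects(w, x)),  # what x may follow
        )
        classes[signature].append(x)
    return sorted(sorted(members) for members in classes.values())


# Hypothetical connectability table over four placeholder expressions.
ALLOWED = {("a", "c"), ("b", "c"), ("c", "d")}
print(partition(["a", "b", "c", "d"], lambda x, w: (x, w) in ALLOWED))
```

In this toy table 'a' and 'b' behave identically and fall into one class, while 'c' and 'd' each form a class of their own.]]></Paragraph>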
      <Paragraph position="6"> Suffixal expressions were also subcategorized mainly by their semantical functions in order to decompose limited types of the conceptual compounds in the experimental system.</Paragraph>
      <Paragraph position="7"> The numerical outline of these refinements of the categories is given in Table 3. The asterisk in the table implies the subcategorization based on the inflectional type.</Paragraph>
      <Paragraph position="8"> Refined Connection Rules. The connection rules were refined by using the finally obtained categories that amount to 142. The number of these rules is approximately 3,600.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Experiment
Overview
</SectionTitle>
      <Paragraph position="0"> The Japanese sentence is ordinarily written in kana(phonetic) letters and Chinese(ideographic) characters without leaving a space between words.</Paragraph>
      <Paragraph position="1"> From the viewpoint of machine-processing, however, it is preferable to express clearly the units composing the sentence in such a way as to leave a space between every word as in English.</Paragraph>
      <Paragraph position="2"> There is no standard way of spacing the units, though one has long been called for. Supposing tentatively that a sentence is written in English letters with a space between E-bunsetsus, we have developed an experimental system which decomposes the input E-bunsetsu into atomic expressions using the refined rules and decides its function.</Paragraph>
      <Paragraph position="3"> The system is outlined as follows: (1) The system consists of five components: a program; a dictionary of atomic expressions; a table of the connection rules; a paradigm; and a table of euphonic rules (not described in this paper). (2) Each entry expression is given one or more triples of information in the dictionary.</Paragraph>
      <Paragraph position="4"> A triple consists of a code of the (refined) category, such as A48 or R56, a code of the inflectional condition of the connection, and a code of the meaning. (3) For an inflectable expression, the dictionary includes only its stem. (4) The E-bunsetsu is decomposed from left to right by the &quot;longest-match method&quot;, and all possible analyses are tried in a &quot;depth-first&quot; manner. (5) The category code of a noun or yougen, such as M13 or Y05, is used in the input and in the dictionary in place of the actual expression.</Paragraph>
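      <Paragraph><![CDATA[Points (2), (4) and (5) can be sketched as a small recursive decomposer. This is a simplification under stated assumptions, not the experimental system itself: inflected surface forms are stored whole instead of stems plus endings, only the category-level connection rules are checked, and the entries NAKU (A30) and NARU (Y05) with their category codes are hypothetical distractors. The remaining entries follow the segmentation of M20DEKINAKUNARUTO reported for Example 4.

```python
# expression -> list of category codes (the other two members of each
# triple, inflectional condition and meaning, are omitted in this sketch).
DICTIONARY = {
    "M20": ["M20"],
    "DEKI": ["A24"],
    "NAKUNARU": ["A41"],
    "NAKU": ["A30"],   # hypothetical entry
    "NARU": ["Y05"],   # hypothetical entry
    "TO": ["R92", "R94"],
}

# Category-level connection rules: (preceding category, following category).
RULES = {
    ("M20", "A24"),
    ("A24", "A41"),
    ("A24", "A30"),
    ("A41", "R92"),
    ("A41", "R94"),
}


def decompose(text, dictionary, rules, prev=None):
    """Yield every analysis of text as a list of (expression, category) pairs.

    Longer dictionary matches are tried before shorter ones, and each
    choice is pursued to the end before the next one (depth-first).
    """
    if not text:
        yield []
        return
    for end in range(len(text), 0, -1):
        head = text[:end]
        for category in dictionary.get(head, []):
            # Reject a connection that the category-level rules do not allow.
            if prev is not None and (prev, category) not in rules:
                continue
            for rest in decompose(text[end:], dictionary, rules, category):
                yield [(head, category)] + rest


for analysis in decompose("M20DEKINAKUNARUTO", DICTIONARY, RULES):
    print(analysis)
```

With these toy tables the decomposer yields two analyses that share one segmentation and differ only in the category of TO; the shorter match NAKU is tried but abandoned because no rule lets NARU follow it.]]></Paragraph>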
      <Paragraph position="5"> Operational Examples. Operational examples follow. The string of letters parenthesized in the output description is the inflectional ending, and '/' denotes the boundary between the conceptual expression and the annex-expression detected by the system.</Paragraph>
      <Paragraph position="6"> The arrows in the following illustration show the string of categories which corresponds to a leftmost substring of the input and is assured to be successful by both of the connection rules of the category level and the inflectional conditions given in the dictionary. On the other hand, the dotted arrow shows that the connection is allowed by the rule of the category level but not by the rule of the inflectional level.</Paragraph>
      <Paragraph position="8"> Without checking the refined rules (of two levels: the category level and the inflectional level), the following two decompositions would have been obtained.</Paragraph>
      <Paragraph position="10"> While the decomposition 1-1 is successful, 1-2 was rejected because the inflectional rule prohibits the auxiliary verb 'ta' from being connected to the preceding auxiliary verb 'ta'.</Paragraph>
      <Paragraph position="11"> The triples given in the dictionary to 'tame' are as follows: {R91; &quot;connectable to adnominal forms of all types&quot;; CAUSE.REASON}; {R91; &quot;connectable to adnominal forms of verb types&quot;; PURPOSE}; {A4A; &quot;connectable to adnominal forms of all types&quot;; CAUSE.REASON}; {A4A; &quot;connectable to adnominal forms of verb types&quot;; PURPOSE}.</Paragraph>
      <Paragraph position="12"> In 1-1, since the inflectional type of 'ta' is not verbal, the second and fourth triples are not acceptable. In addition, the third one is unavailable, since the input E-bunsetsu would then end in a stem, which is inadequate. Finally, only the first one was accepted, and at the same time the meaning of 'tame' was disambiguated.</Paragraph>
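      <Paragraph><![CDATA[The selection among the four triples of 'tame' can be replayed in a short sketch. The triples are those listed above; everything else is an assumption made for illustration: the inflectional condition is reduced to the labels "all" and "verb", the adnominal form of the preceding expression is taken for granted, and category A4A is assumed to be the one whose entry is a bare stem and therefore cannot close an E-bunsetsu.

```python
# (category, inflectional types accepted, meaning), as listed for 'tame'.
TAME = [
    ("R91", "all", "CAUSE.REASON"),
    ("R91", "verb", "PURPOSE"),
    ("A4A", "all", "CAUSE.REASON"),
    ("A4A", "verb", "PURPOSE"),
]

# Assumption: entries of these categories are stems of inflectable expressions.
STEM_ONLY = {"A4A"}


def select(triples, prev_type, is_final):
    """Keep the triples compatible with the surroundings in the E-bunsetsu."""
    chosen = []
    for category, types, meaning in triples:
        # A verb-only condition fails when the preceding expression is not verbal.
        if types == "verb" and prev_type != "verb":
            continue
        # An E-bunsetsu must not end in a bare stem.
        if is_final and category in STEM_ONLY:
            continue
        chosen.append((category, meaning))
    return chosen


# 'tame' closes the E-bunsetsu and follows the auxiliary verb 'ta'.
print(select(TAME, prev_type="auxiliary", is_final=True))
```

Only the first triple survives in this setting, which corresponds to the reading CAUSE.REASON described for decomposition 1-1.]]></Paragraph>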
      <Paragraph position="14"> Without using the rules, the following three kinds of decompositions would have been possible.</Paragraph>
      <Paragraph position="15">  2-3, which are understood as a conjunctive verb, and a suffixal expression, respectively, can not be connected to 'nimotoduite'</Paragraph>
      <Paragraph position="17"> categories 1 = M14 S29 A48
function 1 = P IN THE SENTENCE-FINAL POSITION
segmentation 2 = M14 TEKI(NA) NO/DEHANA(I)
categories 2 = M14 S29 S47 A18
function 2 = P IN THE SENTENCE-FINAL POSITION
The result was twofold according to two interpretations of 'no': in the first, it has no special meaning; in the second, it is a suffixal variant of the noun 'mono' ('thing'). Eight different decompositions are latently possible, but only 3-1 and 3-6 were accepted by the rules.</Paragraph>
      <Paragraph position="19"> As for 3-6, it was understood that the atomic expression 'no' was not a particle (R70) which indicates a kakariuke-relation between two nominal E-bunsetsus, nor a particle (R01) of the meaning AGENT, but a suffixal expression (S47) which nominalizes the predicative expression.</Paragraph>
      <Paragraph position="20">  Example 4.</Paragraph>
      <Paragraph position="21"> input = M20DEKINAKUNARUTO
output:
segmentation 1 = M20/DEKI() NAKUNAR(U) TO
categories 1 = M20 A24 A41 R92
function 1 = P MODIFYING P
segmentation 2 = M20/DEKI() NAKUNAR(U) TO
categories 2 = M20 A24 A41 R94
function 2 = P MODIFYING P
The decomposition was unique, but the interpretation of 'to' was twofold, as follows. In the first interpretation, 'to' is a conjunctive particle of the meaning ASSUMPTION, and in the second, it is a particle of the meaning QUOTATION. This ambiguity is, therefore, quite reasonable.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Results of Experiments
</SectionTitle>
      <Paragraph position="0"> We show the results of experiments made for 162 E-bunsetsus in Tables 5 and 6. The average number of atomic expressions composing an E-bunsetsu fed to the system was 4.8. The ambiguities of both the decomposition and the category sequence were reduced sufficiently. Most of the ambiguities left by the system were quite reasonable, in the sense that reducing them further would require more detailed information from outside the E-bunsetsu. In addition, the ambiguities that should be left to higher phases of parsing were not reduced by the system.
Table 5. Ambiguity of Decomposition (number of decompositions: number of E-bunsetsus)
zero (not decomposable): 1
one: 158
two: 3
three or more: 0
Table 6. Ambiguity of Category Sequence (number of category sequences per single decomposition: number of decompositions)</Paragraph>
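      <Paragraph><![CDATA[For orientation, the counts of Table 5 can be turned into percentages of the 162 E-bunsetsus. The snippet below only redoes this arithmetic; the function name and the dictionary keys are choices made for the sketch.

```python
# Number of decompositions -> number of E-bunsetsus, as given in Table 5.
TABLE_5 = {"zero": 1, "one": 158, "two": 3, "three or more": 0}


def decomposition_shares(counts):
    """Return the total and the percentage share of each row, to one decimal."""
    total = sum(counts.values())
    shares = {row: round(100 * n / total, 1) for row, n in counts.items()}
    return total, shares


total, shares = decomposition_shares(TABLE_5)
print(total, shares)
```

The rows add up to 162, and the uniquely decomposed E-bunsetsus account for about 97.5 percent of them.]]></Paragraph>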
      <Paragraph position="1"> As exemplified in Example 1, the meaning of an atomic expression is disambiguated by selecting among the triples of functional information given in the dictionary. Nine percent of the entry expressions are given multiple triples, and their meanings can then be narrowed down by our rules on the basis of their structural surroundings in the E-bunsetsu.</Paragraph>
      <Paragraph position="2"> Conclusions. Extending the domain of input sentential forms of a natural language processing system enables the system, in principle, to handle more precise or delicate meanings and to communicate with people more naturally. The grammatical model presented in this paper is so comprehensive that the local structures of colloquial and written sentences actually used in everyday life can almost always be analyzed with it. It is also elaborate enough to reduce the syntactic and semantic ambiguities of the local structure. It should be noted that the local structure analyzed by our grammar plays a quite important role in Japanese language processing, because it is not only a larger structure than a bunsetsu, one that can include idiomatic strings of words, but also a syntactic and semantic unit of sentence construction.</Paragraph>
      <Paragraph position="3"> Every atomic expression, which is the smallest component of the sentence, has been chosen so as to have an indivisible and self-supported meaning. Though we have not described it in detail in this paper, we have already settled the meanings of annex-expressions extensively by classifying them.</Paragraph>
    </Section>
  </Section>
</Paper>