<?xml version="1.0" standalone="yes"?> <Paper uid="C86-1067"> <Title>A Kana-Kanji Translation System for Non-Segmented Input Sentences Based on Syntactic and Semantic Analysis</Title> <Section position="3" start_page="280" end_page="280" type="metho"> <SectionTitle> 2. EXTRACTION OF AMBIGUITY FROM INPUT KANA SENTENCES </SectionTitle> <Paragraph position="0"> This section first describes the method for extracting highly possible ambiguities by morphological analysis, and then describes an efficient data structure for storing those ambiguities in memory.</Paragraph> <Section position="1" start_page="280" end_page="280" type="sub_section"> <SectionTitle> 2.1 Morphological Analysis 2.1.1 Morphological characteristics of the Japanese language </SectionTitle> <Paragraph position="0"> A Japanese sentence is composed of a string of Bunsetsu, and each Bunsetsu is a string of morphemes. In a Bunsetsu the relationship between the preceding morpheme and the succeeding morpheme is strongly regulated by grammar. The grammatical connectability between morphemes can be easily determined by using a grammatical table in morphological analysis [2]. On the other hand, on the morphological level there is little if any grammatical restriction between the last morpheme in a Bunsetsu and the first morpheme in the following Bunsetsu. In this sense a compound word is also a series of Bunsetsu, each of which contains an independent word.</Paragraph> <Paragraph position="1"> There is no limit to the length of a compound word, and there are no restrictions on the way words can be combined. Therefore, since there are a tremendous number of compound words, it is almost impossible to compile a dictionary of these words.</Paragraph> <Paragraph position="2"> 2.1.2 Morpheme chain model The lack of restrictions on the relationship between consecutive Bunsetsu increases the ambiguity of segmentation in the morphological analysis. 
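The table-driven connectability check described above might be sketched as follows. This is a minimal illustration only: the grammatical classes and the table entries are invented for the example and are not taken from the paper.

```python
# Hypothetical sketch of a table-driven connectability check between
# adjacent morphemes inside a Bunsetsu. The class names and the table
# contents are invented for illustration.

# Keys: (class of the preceding morpheme, class of the succeeding morpheme).
CONNECT_TABLE = {
    ("noun", "case_particle"): True,
    ("noun", "suffix"): True,
    ("verb_stem", "inflection"): True,
    ("case_particle", "noun"): False,   # crosses a Bunsetsu boundary
}

def connectable(prev_class, next_class):
    """Return True if the grammar table allows prev to precede next."""
    return CONNECT_TABLE.get((prev_class, next_class), False)

print(connectable("noun", "case_particle"))  # True
print(connectable("case_particle", "noun"))  # False
```

A pair absent from the table is treated as not connectable, mirroring the idea that the table enumerates only the grammatically permitted transitions.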
This is especially true if the formation of compound words is not restricted in some way. Under these circumstances the result is often meaningless because compound words are generated mechanically.</Paragraph> <Paragraph position="3"> This problem can be solved by introducing the concept of a statistical model of a morpheme chain.</Paragraph> <Paragraph position="4"> Statistical research in this area [3] indicates that compound words have some distinct morphological characteristics: (1) Part of speech: about 90% of morphemes in compound words are nouns or their prefixes or suffixes.</Paragraph> <Paragraph position="5"> (2) Word category: about 77% of all morphemes are words of foreign origin (Chinese).</Paragraph> <Paragraph position="6"> (3) Length: about 93% of compound words are 3 to 5 Kanji in length.</Paragraph> <Paragraph position="7"> These properties can be used to distinguish very likely candidates for compound words from unlikely ones. Morpheme M can be represented by the property set (P, C, L), where P, C and L denote the part of speech, the word category and the length in Kana, respectively. A compound word can then be modeled as a morpheme chain, and is represented by pairs of property sets. The pairs can be classified into three levels according to their probability of occurrence. To generalize the representation, a null property set (-,-,-) is introduced for the edges of a compound word. Table 1 shows part of the model representation.</Paragraph> <Paragraph position="8"> Figure 1 shows the algorithm for the morphological analysis. All candidates for dependent words are first picked up from the input Kana sentence by using a string-matching operation and by examining the grammatical connectability between a preceding word and its successor. 
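The morpheme chain model above can be sketched in code. The pair-to-level table below is invented for illustration; the real classification is the one given in Table 1 of the paper.

```python
# Sketch of the morpheme chain model: each morpheme is a property set
# (P, C, L) = (part of speech, word category, length in Kana), and a
# candidate compound word is rated by classifying consecutive pairs of
# property sets into levels. Entries here are invented for illustration.

NULL = ("-", "-", "-")  # null property set marking the edges of a compound

LEVEL_OF_PAIR = {
    (("noun", "chinese", 2), ("noun", "chinese", 2)): 1,   # very common
    (("noun", "chinese", 2), ("suffix", "chinese", 1)): 1,
    (NULL, ("noun", "chinese", 2)): 1,
    (("noun", "chinese", 2), NULL): 1,
    (("noun", "native", 4), ("noun", "chinese", 2)): 2,    # rarer
}

def chain_level(morphemes):
    """Worst (highest) level over all adjacent pairs, edges included."""
    chain = [NULL] + list(morphemes) + [NULL]
    levels = []
    for a, b in zip(chain, chain[1:]):
        levels.append(LEVEL_OF_PAIR.get((a, b), 3))  # unknown pair: level 3
    return max(levels)

print(chain_level([("noun", "chinese", 2), ("noun", "chinese", 2)]))  # 1
```

A chain containing any pair outside the modeled patterns falls to the worst level, which is how rare compound words get pruned.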
This process is executed from right to left, resulting in the generation of subtrees of dependent words.</Paragraph> <Paragraph position="9"> Next, candidates for independent words are picked up by string-matching against a word dictionary, starting from the leftmost character of the input sentence. Those elements which correspond to the level 1 chain are then selected. If the selected independent word adjoins a dependent word which has already been extracted in the previous process, the grammatical connectability between them is also checked. In this way all level 1 independent words that begin at the first column are extracted. The control then shifts to the next column of the input sentence. If the current position is the ending edge of an already extracted independent word or of its successive dependent word, the same action is taken. If not, the control moves to the next column without extracting any independent words. The control stops when it reaches the end of the sentence after having successfully extracted all level 1 independent words and related successive dependent words. If the system fails to extract any words on level 1, the control backtracks to the beginning and tries again using level 2 extraction. On this pass, level 2 independent words are picked up and tested in the same manner as in level 1 extraction. If the level 2 extraction fails, then an unknown word process, level 3, is invoked, which assumes that an unknown word exists in the input sentence; the control then skips to the nearest dependent word. The skipped part is assumed to be the unknown word. 
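The staged control of the extraction level might be sketched as follows. The toy dictionaries and the simple greedy cover are invented stand-ins: the real system consults a word dictionary together with the morpheme chain model, and only the fallback from level 1 to level 2 to the unknown-word process (level 3) is the point being illustrated.

```python
# Sketch of level-controlled extraction: try to cover the input with
# level 1 words first, back off to level 2, and finally treat an
# uncoverable input as containing an unknown word (level 3).

LEVEL1 = {"tokyo", "daigaku"}    # highly probable words (invented)
LEVEL2 = {"to", "kyo", "dai"}    # rarer segmentations (invented)

def segment(text, words):
    """Greedy longest-match cover of text with the given word set."""
    out, i = [], 0
    while i != len(text):
        for j in range(len(text), i, -1):   # longest match first
            if text[i:j] in words:
                out.append(text[i:j])
                i = j
                break
        else:
            return None                      # extraction failed
    return out

def extract(text):
    for level, words in ((1, LEVEL1), (2, LEVEL1 | LEVEL2)):
        seg = segment(text, words)
        if seg is not None:
            return level, seg
    return 3, [text]   # level 3: assume an unknown word

print(extract("tokyodaigaku"))   # (1, ['tokyo', 'daigaku'])
print(extract("tokyodai"))       # (2, ['tokyo', 'dai'])
```

Level 2 is attempted only after level 1 fails, so the common case never pays for the rarer segmentations.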
In this way, controlling the extraction level for independent words on the basis of the statistical model of morpheme chains prunes rare compound words, so that only highly possible ambiguities are extracted from input Kana sentences.</Paragraph> </Section> <Section position="2" start_page="280" end_page="280" type="sub_section"> <SectionTitle> 2.2 Network Representation of Ambiguity </SectionTitle> <Paragraph position="0"> The ambiguous morphemes extracted in the morphological analysis are stored as common data in a network form to reduce both storage and processing overhead. Figure 2 shows an example of a morpheme network. Each morpheme is represented by an arc. Both ends of each morpheme are indicated by circles. A double circle corresponds to the end of a Bunsetsu, whereas a single circle corresponds to the boundary of a morpheme within a Bunsetsu.</Paragraph> <Paragraph position="1"> The information for a group of ambiguous morphemes is represented by the data structures VTX (Vertex), EDG (Edge) and ABL (AmBiguity List). The VTX represents the position of a morpheme in a sentence. The EDG represents the common attributes of the ambiguous morphemes: the part of speech, the type of inflection and the Kana string. The ABL represents the individual attributes of the morphemes: the Kanji representation, the meaning code and the word frequency. An ABL list is referenced by an EDG. VTX and EDG refer to each other. A VTX is considered to be shared if the grammatical relationship between the preceding EDG and its succeeding EDG is the same. A double-circled VTX can usually be shared.</Paragraph> </Section> </Section> <Section position="4" start_page="280" end_page="284" type="metho"> <SectionTitle> 3. 
SELECTION OF THE MOST SUITABLE MORPHEME STRING </SectionTitle> <Paragraph position="0"> The second step in the Kana-Kanji translation process is divided into two substeps: (1) Extraction of morpheme strings from the morpheme network.</Paragraph> <Paragraph position="1"> (2) Selection of the best morpheme string by syntactic and semantic analysis.</Paragraph> <Section position="1" start_page="280" end_page="280" type="sub_section"> <SectionTitle> 3.1 Extraction of Preferential Paths </SectionTitle> <Paragraph position="0"> Each path, or morpheme string, can be derived by tracing morphemes on the network from the beginning of the sentence to the end. In order to avoid a combinatorial explosion of possible paths, it is necessary to introduce heuristics that derive only highly possible paths. This is accomplished in the following way. First, a quasi-best path is chosen based on the least weighted-morpheme number using the best-first-search technique [4]. Next, a restricted range of morphemes near the quasi-best path is selected from the morpheme network in light of the probability of ambiguity.</Paragraph> <Paragraph position="1"> The least Bunsetsu number [5] is known as an effective heuristic for determining the segmentation of non-segmented input Kana sentences. In this approach, the segmentation which contains the least number of Bunsetsu is most likely to be correct. The authors have modified this method to improve the correctness of segmentation by changing the counting unit from the number of Bunsetsu to the sum of the weighted morphemes. The weights of morphemes are basically defined as 1 for each independent word and 0 for each dependent word. Since a Bunsetsu is usually composed of an independent word and some dependent words, the sum of the weights of a sentence is roughly equal to the number of Bunsetsu in the sentence. 
While the least Bunsetsu number ignores the contents of the Bunsetsu, the new method evaluates the components of the Bunsetsu to achieve more realistic segmentation. The weights of morphemes were modified based on empirical statistical data. Consequently, the weight of some independent words, such as Keishikimeisi (a kind of noun), Hojodoshi (a kind of verb) and Rentaishi (a kind of particle), is defined as 0.5. Table 2 shows the weights for morphemes.</Paragraph> <Paragraph position="2"> In Figure 3, VTX(0) and VTX(n) correspond to the beginning and the end of a sentence, respectively. Each VTX and each EDG contains a cumulative number of weighted morphemes counted from the end of the sentence. These are represented as W(i) for VTX(i) and W(ij) for EDG(ij). X(ij) is the weight of the EDG(ij).</Paragraph> <Paragraph position="3"> For the VTX(n), W(n) = 0 (1). For each EDG(ij), W(ij) = X(ij) + W(j) (2), and for each VTX(i), W(i) = min over j of W(ij) (3).</Paragraph> <Paragraph position="5"> This means that the minimum W(ij) is selected among the EDGs which share VTX(i) on their left side.</Paragraph> <Paragraph position="6"> By repeating (2) and (3), the minimum sum of the weighted-morpheme numbers is obtained as W(0). A quasi-best path which has the least weighted-morpheme number can then be easily obtained by tracing the minimum W(ij) starting from VTX(0). Since the complexity of the above process is on the order of n, the quasi-best path can be obtained very efficiently.</Paragraph> <Paragraph position="8"> Figure 3 Best-first-search on Morpheme Network</Paragraph> </Section> <Section position="2" start_page="280" end_page="280" type="sub_section"> <SectionTitle> 3.1.3 Selection of alternative paths </SectionTitle> <Paragraph position="0"> Since the selected quasi-best path is not always the most suitable one, alternative paths are created near the quasi-best path by combining the restricted range of ambiguous morphemes. The range is decided by a preferential ambiguity relationship table (see Table 3) which contains typical patterns of segmentation ambiguity. 
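The backward computation of W and the forward trace of the quasi-best path can be sketched on a toy morpheme network. The vertices, morpheme labels and weights below are invented; only the recurrence (W(n) = 0, W(ij) = X(ij) + W(j), W(i) = minimum over the outgoing EDGs) follows the description above.

```python
# Sketch of the least weighted-morpheme search on a morpheme network.
# Vertices 0..n are morpheme boundaries; each edge (i, j) is a morpheme
# with weight X(ij): 1 for an independent word, 0 for a dependent word.
# The example network below is invented: "kanji" + particle "wo" vs. the
# spurious split "kan" + "ji" + "wo".

n = 3  # end-of-sentence vertex
edges = {  # (i, j): (label, weight X(ij))
    (0, 2): ("kanji", 1.0),
    (0, 1): ("kan", 1.0),
    (1, 2): ("ji", 1.0),
    (2, 3): ("wo", 0.0),   # dependent word, weight 0
}

W = {n: 0.0}                       # W(n) = 0
for i in range(n - 1, -1, -1):     # backward from the sentence end
    cands = [x + W[j] for (a, j), (_, x) in edges.items() if a == i]
    W[i] = min(cands)              # W(i) = min over outgoing edges

def trace():
    """Follow minimum-weight edges from vertex 0 to recover the path."""
    i, path = 0, []
    while i != n:
        (_, j), (label, _) = min(
            ((k, v) for k, v in edges.items() if k[0] == i),
            key=lambda kv: kv[1][1] + W[kv[0][1]],
        )
        path.append(label)
        i = j
    return path

print(W[0], trace())   # 1.0 ['kanji', 'wo']
```

The single-word segmentation wins (total weight 1.0 against 2.0 for the split), matching the intuition behind the least Bunsetsu number, and the backward pass visits each vertex once, in line with the stated order-n complexity.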
By looking up this table, highly possible ambiguities for the morphemes of the quasi-best path can be selected from the morpheme network.</Paragraph> </Section> <Section position="3" start_page="280" end_page="284" type="sub_section"> <SectionTitle> 3.2 Syntactic and Semantic Analysis 3.2.1 A meaning system </SectionTitle> <Paragraph position="0"> A detailed general-purpose meaning system is necessary for Kana-Kanji translation. The meaning system adopted was basically a thesaurus of 32,600 words classified into 798 categories of 5 levels [6]. Table 3 (Preferential Ambiguity Relation) gives, for each class of morpheme on the quasi-best path, the range of alternative ambiguous morphemes (n.: nouns, v.: verbs, a.: adjectives, a.v.: adjective verbs, adv.: adverbs, posp.: postpositions).</Paragraph> <Paragraph position="2"> The system was enhanced by adding 11 higher level meaning codes called macro codes, such as &quot;human&quot;, &quot;thing&quot; and &quot;place&quot;. Each macro code was made by gathering related meaning codes in the system. In the original system, these codes appeared in different categories. The word dictionary developed for the new system contains 160,000 words.</Paragraph> <Paragraph position="3"> Each word is given a meaning code according to the new meaning system.</Paragraph> <Paragraph position="4"> 3.2.2 Case grammar and case frames Case grammar [7] is widely used in natural language processing systems. It is also useful in Kana-Kanji translation because it can be applied to homonym selection as well as to syntactic analysis. When used for this purpose, the case frame must have a high resolving power so that it can distinguish the correct sentence from among many ambiguous sentences. The way in which the new approach achieves high resolving power in case frames can be summarized as follows: (1) Detailed meaning description in case frames.</Paragraph> <Paragraph position="5"> Each slot in a case frame has a list of meaning codes that fit each case. 
The meaning codes are written at the lowest level of the meaning system except when higher meaning codes are preferable. In special cases, such as when an idiomatic expression is required for a case slot, the word itself is written instead of a meaning code.</Paragraph> <Paragraph position="6"> (2) Rank specification of cases.</Paragraph> <Paragraph position="7"> Cases are classified as either obligatory or optional.</Paragraph> <Paragraph position="8"> (3) Multiple case frames for each verb.</Paragraph> <Paragraph position="9"> A case frame is provided for each usage of a verb.</Paragraph> <Paragraph position="10"> A case frame dictionary of 4,600 verbs was developed for this system.</Paragraph> <Paragraph position="11"> Table 4 shows an example of a case frame description. Each case frame consists of case slots and information about transformations such as voice. Each case slot contains the case name, the typical postposition associated with the surface structure, the case rank indicator and the meaning codes.</Paragraph> <Paragraph position="12"> Syntactic and semantic analysis is performed concurrently. Moreover, homonym selection is made simultaneously. The process is basically a pattern-matching of paths against the case frame dictionary and is performed as follows. A path is scanned from left to right. Every noun Bunsetsu which depends on a verb in the path is pushed onto a stack. Whenever a verb is encountered during scanning, case frame matching is carried out. Every combination of noun Bunsetsu and case slots of the verb is tried and evaluated. 
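The case slot layout and the matching of noun Bunsetsu against slots might be sketched as follows. The frame, postpositions, meaning codes and homonym readings below are invented for illustration, and the sketch checks only postposition and meaning-code coincidence; it is not the paper's full evaluation procedure.

```python
# Sketch of case frame matching for one verb. A noun Bunsetsu fills a
# case slot when its postposition and meaning code both match; a homonym
# whose meaning code fits the slot is selected at the same time.
# All data below is invented for illustration.

FRAME_WRITE = [  # case slots: (case name, postposition, rank, meaning codes)
    ("agent",  "ga", "obligatory", {"human"}),
    ("object", "wo", "obligatory", {"document"}),
]

def match(frame, bunsetsu_list):
    """Fill case slots; return (filled slots, chosen homonyms) or None."""
    filled, chosen = {}, {}
    for surface, postp, homonyms in bunsetsu_list:
        for case, slot_postp, _rank, codes in frame:
            if case in filled or postp != slot_postp:
                continue
            # Homonym selection: keep only readings whose meaning code
            # fits the slot; the first fitting reading is chosen here.
            fits = [kanji for kanji, code in homonyms if code in codes]
            if fits:
                filled[case] = surface
                chosen[surface] = fits[0]
                break
    for case, _postp, rank, _codes in frame:
        if rank == "obligatory" and case not in filled:
            return None   # an obligatory slot stayed empty
    return filled, chosen

bunsetsu = [
    ("kare", "ga", [("彼", "human")]),
    ("kiji", "wo", [("生地", "cloth"), ("記事", "document")]),
]
print(match(FRAME_WRITE, bunsetsu))  # 記事 is selected for "kiji"
```

Note how filling the object slot simultaneously resolves the homonym: of the two readings of "kiji", only the one whose meaning code fits the slot survives.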
The best combination is determined using the following conditions: (1) Coincidence of postpositions.</Paragraph> <Paragraph position="13"> The postposition of the noun Bunsetsu must be equal to the one for the case slot.</Paragraph> <Paragraph position="14"> (2) Coincidence of meaning codes.</Paragraph> <Paragraph position="15"> The meaning code of the noun must be equal to the one for the case slot. If the noun has homonyms in the ABL, a coincident homonym is selected.</Paragraph> <Paragraph position="16"> (4) Total occupation of case slots.</Paragraph> <Paragraph position="17"> In addition to condition (3), a higher total occupation of case slots is preferable.</Paragraph> <Paragraph position="18"> If it is not possible to choose a single combination using the above conditions, then word frequency information is used. Throughout this process, unmatched noun Bunsetsu are left in the stack and are assumed to depend on verbs which occur later in the path. This case frame matching is repeated every time a verb is encountered in the path. The parsing result of the path is obtained when the scanning reaches the end of the path. The same parsing is tried for the other paths constructed in the previous step. Then the most suitable path is selected from among the successfully parsed paths by measuring the degree of fit for conditions (3) and (4) above. The result is the text of the Kana-Kanji translation.</Paragraph> </Section> </Section> </Paper>