<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2104"> <Title>A Portable &amp; Quick Japanese Parser : QJP</Title> <Section position="3" start_page="0" end_page="619" type="metho"> <SectionTitle> 2 Analysis Method </SectionTitle> <Paragraph position="0"> QJP performs two types of analysis: 1) morphological analysis, which segments a sentence into part-of-speech tagged morphemes and words, and 2) syntactic analysis, which places the words into a bunsetsu-dependency structure (see footnote 1).</Paragraph> <Paragraph position="1"> The analysis strategies are as follows: * The morphological analysis is achieved by extending earlier methods [1][2] for bunsetsu or word segmentation using character types, thus allowing the use of a very small dictionary.</Paragraph> <Paragraph position="2"> * The syntactic analysis uses no semantic information, only part-of-speech and other syntactic information. In addition, rather than creating all possible parses, or some preferable ones, we construct the single best syntactic structure while preserving local ambiguities.</Paragraph> <Paragraph position="3"> Footnote 1: Bunsetsu (文節) is a kind of phrasal unit in Japanese, consisting of one content word [such as a noun (名詞), verb-noun (サ変名詞), verb (動詞), adjective (形容詞), verb-adjective (形容動詞) or adverb (副詞)] and successive adjunctive words [such as auxiliary verbs (助動詞) and post-positional particles (助詞)], and carrying one concept.</Paragraph> <Section position="1" start_page="616" end_page="616" type="sub_section"> <SectionTitle> 2.1 Morphological Analysis </SectionTitle> <Paragraph position="0"> Characteristics of written Japanese. A Japanese sentence has no spaces between words [Figure 1], so it is difficult to segment a sentence into words. However, written Japanese uses at least four distinct sets of characters [kanji, hiragana, katakana and other characters (alphabetic letters, numbers, symbols, etc.)], and this fact can be used for segmenting words. Most words written in kanji or katakana are content words, such as nouns, verb-nouns and stems (語幹) of verbs or adjectives. Most words in hiragana are functional words, such as postpositional particles, auxiliary verbs and inflective suffixes (活用語尾) of verbs and others [Table 1]. The vocabulary of content words is much larger than that of functional words.</Paragraph> <Paragraph position="1"> and their word examples</Paragraph> <Paragraph position="3"> Sharing of Morphemes by Dictionary and Rules. Our strategy is to store all functional words, which are few in number, in the dictionary, and to extract most content words or their stems written in kanji or katakana and assign them part-of-speech candidates based on character types.</Paragraph> <Paragraph position="4"> A standard morphological analyser uses a dictionary to obtain morpheme or word candidates. In our approach, however, morpheme candidates (see footnote 2) are extracted either from the morpheme dictionary or by allocation rules based on character type. For example, if the dictionary look-up fails, the allocation rules extract each sequence of characters in which all of the characters belong to the same character set.</Paragraph> <Paragraph position="5"> Then, using the allocation rules, part-of-speech candidates are assigned based on the sequence's character set and length. The candidates are disambiguated by checking their connection with the following morphemes, using the connection table between morpheme parts-of-speech. The following morphemes are, in most cases, functional words or inflective suffixes. 
The dictionary contains functional words [such as postpositional particles, auxiliary verbs, formal nouns (形式名詞), adverbial nouns, conjunctions (接続詞), adverbs and so on], inflective suffixes, and exceptional content words which cannot be or are not covered by the allocation rules.</Paragraph> <Paragraph position="6"> Here are some examples of the allocation rules for 1) 1-kanji character sequences, 2) 2-kanji character sequences and 3) katakana character sequences.</Paragraph> <Paragraph position="7"> Footnote 2: In this analysis, an inflected word is treated as two or more morphemes: a stem part and one or more inflection parts. [Table 1 residue: the recoverable row labels are auxiliary functional verb, inflective suffix, derivative suffix, prefix and suffix; the example words did not survive extraction. Legend: *1: independent word; *2: adjunctive word; *3: affix; *4: content word (conceptual word); *5: functional word; #: inflective; -: point between stem and inflection part.] 1) noun / stem of 5-dan verb / stem of shimo-1-dan verb 2) noun / (stem of 5-dan verb) / (stem of shimo-1-dan verb) / verb-noun (sahen-meishi; サ変名詞) / verb-adjective (形容動詞) 3) noun / verb-noun / verb-adjective. The 1-kanji nouns and verb stems are largely words of old-Japanese origin, wago (和語), and the 2-kanji nouns, verb-nouns and verb-adjectives are mainly words of Chinese origin, kango (漢語). In addition, there are several 1-kanji stems of kami-1-dan verbs, sahen verbs and adjectives (形容詞), which are stored in the dictionary because they are so few in number. The number of words that can be treated using rules like those given above is so great that the dictionary size is substantially reduced. 
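The rule-based extraction described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not QJP's implementation: the names ALLOCATION_RULES, char_set, same_set_runs and allocate are hypothetical, the character sets are approximated by Unicode character names, and the candidate lists simply copy the three example rules given above. Dictionary look-up and disambiguation by the connection table are not modelled.

```python
import unicodedata

# Hypothetical allocation rules keyed by (character set, sequence length).
# The candidate lists copy the three example rules in the text; a length of
# None means "any length". The 5-dan and shimo-1-dan stems of the 2-kanji
# rule are parenthesised in the source and are kept here as plain candidates.
ALLOCATION_RULES = {
    ("kanji", 1): ["noun", "stem of 5-dan verb", "stem of shimo-1-dan verb"],
    ("kanji", 2): ["noun", "stem of 5-dan verb", "stem of shimo-1-dan verb",
                   "verb-noun", "verb-adjective"],
    ("katakana", None): ["noun", "verb-noun", "verb-adjective"],
}


def char_set(ch):
    """Classify one character by its Unicode name (an approximation)."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    return "other"


def same_set_runs(text):
    """Split text into maximal runs of characters from one character set."""
    runs = []
    for ch in text:
        kind = char_set(ch)
        if runs and runs[-1][1] == kind:
            runs[-1] = (runs[-1][0] + ch, kind)
        else:
            runs.append((ch, kind))
    return runs


def allocate(run, kind):
    """Return part-of-speech candidates for a run, or [] if no rule applies."""
    by_length = ALLOCATION_RULES.get((kind, len(run)))
    if by_length is not None:
        return by_length
    return ALLOCATION_RULES.get((kind, None), [])
```

For instance, same_set_runs splits a kanji stem from a following hiragana suffix, and allocate then proposes candidates for the kanji run only; in the approach described above, the hiragana run would instead be found in the dictionary and used to disambiguate the candidates.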
Treatment of Wago Compound Words. Another characteristic of verbs of old-Japanese origin (wago verbs) is that they often combine with other words or morphemes to form new verbs or nouns. For example, the two verbs meaning 'to write' and 'to become crowded' combine into a compound verb meaning 'to write into'; the verb meaning 'to read' becomes a verb meaning 'to cause to read' when a causative suffix is attached; and the verb meaning 'to step' becomes a noun meaning 'a step' when a derivative suffix is attached. There are a great many compound words such as these.</Paragraph> <Paragraph position="9"> A word-compounding part determines a word from morphemes using word-constituent rules based not only on inflections but also on compounds or derivations such as those shown above. Such rules also greatly reduce the dictionary size.</Paragraph> <Paragraph position="10"> Morphological Output from QJP. An example of segmented morphemes with morpheme tags is shown in Figure 2, where 8 nouns and 2 word stems written in kanji (marked in the figure) are recognized using the allocation rules and the connection table. The words with part-of-speech tags and morpheme divisions ('-', '+') are shown in Figure 3, where a compound noun (the 7th word) is composed of morphemes 8-10 [the stem of a shimo-1-dan verb, the renyou-kei (連用形) inflective suffix of a shimo-1-dan verb, and a noun] by a word-constituent rule. In Figure 3, the root forms (shuushi-kei; 終止形) of inflected words have been derived and are shown in angle brackets. 
These morphemes and words are not in the dictionary.</Paragraph> </Section> <Section position="2" start_page="616" end_page="619" type="sub_section"> <SectionTitle> 2.2 Syntactic Analysis </SectionTitle> <Paragraph position="0"> Kakari-uke Analysis. Many Japanese syntactic analyses are based on orthodox bunsetsu-dependency analysis, called kakari-uke analysis (係り受け解析; see footnote 3), between bunsetsu phrases, where a bunsetsu-dependency structure corresponds to a set of kakari-uke bunsetsu pairs. We also take this approach because it is intuitive, understandable and easily implemented.</Paragraph> <Paragraph position="1"> Footnote 3: The relation between kakari and uke is that between modifier and modifiee.</Paragraph> <Paragraph position="2"> Simple Treatment of Structural Ambiguities. Structural ambiguities are usually dealt with either by generating all possible structures or by selecting the more preferable ones based on some scoring scheme. Such methods usually lead to combinatorial explosion, which consumes a lot of memory and processing time.</Paragraph> <Paragraph position="3"> For this problem we have already proposed a lightweight substitute method [5] for kakari-uke analysis. This method extracts all possible kakari-uke pairs and then, rather than generating all or some of the possible sets of pairs, generates only the one best set of pairs while still retaining all other possible pairs. Thus, instead of multiple sets being generated, the most likely set is selected, and the application/user is presented with the alternative kakari-uke pairs at the same time as the selected pairs. If the application/user corrects a selected pair by choosing one of the alternatives, the most likely set is re-calculated from the retained possible kakari-uke pairs. 
This means of dealing with structural ambiguities avoids combinatorial explosion and requires far fewer machine resources.</Paragraph> <Paragraph position="4"> Not Using Semantic Information. Most methods for analyzing Japanese use case patterns with semantic features for preference selection. However, analysis techniques using semantic information are not yet adequate and sometimes actually lead to adverse results [6].</Paragraph> <Paragraph position="5"> In addition, semantic information must be stored in the dictionary. This reduces the merit of the very small dictionary achieved in the morphological analysis. We limit the information to the morphological/word and syntactic levels [such as the presence of a comma (読点), adverbial nouns, and surface or syntactic similarity [7]] and use no semantic information for structural analysis.</Paragraph> <Paragraph position="6"> Flow of QJP's Syntactic Analysis. Under these approaches, QJP's syntactic analyser processes a word sequence in three steps [Figure 4], each following its own set of rules. First, it determines bunsetsu features [A] for each bunsetsu according to its word constituents. Second, it extracts all possible kakari-uke bunsetsu pairs [marked by 'O' in B] based on specific combinations of bunsetsu features for each bunsetsu pair.</Paragraph> <Paragraph position="7"> Last, it selects the best uke-bunsetsu (modifiee) [marked in C] from the possible ones for each kakari-bunsetsu (modifier). The last bunsetsu is excluded: every bunsetsu modifies one of the bunsetsus that follow it, so the last one has no uke-bunsetsu. The default selection is the nearest possible uke-bunsetsu and, if necessary, QJP replaces the selection based on rules comparing two pairs: the currently selected uke-bunsetsu and a more distant possible uke-bunsetsu for the kakari-bunsetsu in question. In Figure 7, some pairs are not the nearest ones. 
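The default selection step can be sketched as below. This is a minimal illustration under stated assumptions rather than QJP's actual procedure: the function name select_uke and its data layout are hypothetical, every kakari-bunsetsu is assumed to have at least one possible uke-bunsetsu, and the rules that replace the nearest uke-bunsetsu with a more distant one are not given in the text and are therefore omitted.

```python
def select_uke(possible, fixed=None):
    """Pick one uke (modifiee) index for every bunsetsu except the last.

    possible: dict mapping each kakari index to the set of its possible
              uke indices (all of them follow the kakari in the sentence).
    fixed:    optional dict of kakari index to a uke index chosen by the
              application/user; these pairs are taken as given.
    """
    fixed = fixed or {}
    selected = {}
    for kakari in sorted(possible):
        if kakari in fixed:
            # A corrected pair overrides the default selection.
            selected[kakari] = fixed[kakari]
        else:
            # Default rule from the text: the nearest possible uke-bunsetsu,
            # which is the smallest index because uke always follows kakari.
            selected[kakari] = min(possible[kakari])
    return selected
```

Because the table of possible pairs is kept unchanged, a selection with corrections can be recomputed from the same table without re-parsing, which is the point of retaining all possible pairs.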
Corrections of kakari-uke pairs by the application/user restart the selection: QJP first selects the corrected kakari-uke pair(s) [marked in Figure 7] and then re-selects the remaining kakari-uke pairs.</Paragraph> <Paragraph position="8"> Figure 4-C and Figure 7 are kakari-uke matrices showing the possible pairs and the selected pairs. Figure 5 is the output of kakari-uke pairs tagged with parts-of-speech and bunsetsu features.</Paragraph> <Paragraph position="9"> [Figure 4 residue: Segmentation of Words by Morphological Analyser, followed by 1. Setting of Bunsetsu Features]</Paragraph> <Paragraph position="11"/> </Section> </Section> </Paper>