<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1033"> <Title>Backtracking-Free Dictionary Access Method for Japanese Morphological Analysis</Title> <Section position="2" start_page="0" end_page="211" type="intro"> <SectionTitle> 2 Japanese Morphological Analysis </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="208" type="sub_section"> <SectionTitle> 2.1 Grammar </SectionTitle> <Paragraph position="0"> A Japanese morphological analyzer (hereafter called the JMA) takes an input sentence and segments it into words and phrases, attaching a part-of-speech code to each word at the same time. Figure 1 shows a sample input and the output of our JMA.</Paragraph> <Paragraph position="1"> The grammaticality of a sequence of Japanese words is mainly determined by looking at two consecutive words at a time (that is, by looking at two-word windows). Therefore, Japanese morphological analysis is normally done by using a Regular Grammar (e.g., Maruyama and Ogino 1994). Our JMA grammar rules have the following general form: state1 → &quot;word&quot; [linguistic-features] state2 cost=cost.</Paragraph> <Paragraph position="2"> Each grammar rule has a heuristic cost, and the parse with the minimum cost will be selected as the most plausible morphological reading of the input sentence. A part of our actual grammar is shown in Figure 2. Currently our grammar has about 4,300 rules and 400 nonterminal symbols.</Paragraph> </Section> <Section position="2" start_page="208" end_page="208" type="sub_section"> <SectionTitle> 2.2 Dictionary Lookup </SectionTitle> <Paragraph position="0"> While the function words (particles, auxiliary verbs, and so on, totaling several hundred) are encoded in the grammar rules, the content words (nouns, verbs, and so on) are stored in a separate dictionary.
Since content words may appear at any position in the input sentence, dictionary access is tried from all the positions n.</Paragraph> <Paragraph position="1"> For example, in the sentence fragment in Figure 3, &quot;大型 (large)&quot; and &quot;大型計算機 (mainframe)&quot; are the results of dictionary access at position 1. For simplicity, we assume that the dictionary contains only the following words: &quot;大型 (large),&quot; &quot;大型計算機 (mainframe),&quot; &quot;計算機 (computer),&quot; &quot;計算機設備 (computing facility),&quot; and &quot;設備 (facility).&quot;</Paragraph> </Section> <Section position="3" start_page="208" end_page="208" type="sub_section"> <SectionTitle> 2.3 Using TRIE </SectionTitle> <Paragraph position="0"> The most common method for dictionary lookup is to use an index structure called TRIE (see Figure 4). The dictionary lookup begins with the root node. The hatched nodes represent the terminal nodes that correspond to dictionary entries.</Paragraph> <Paragraph position="1"> At position 1 in the sentence above, two words, &quot;大型 (large)&quot; and &quot;大型計算機 (mainframe),&quot; are found.</Paragraph> <Paragraph position="2"> Then, the starting position is advanced to the second character in the text, and the dictionary lookup is tried again. In this case, no word is found, because there are no words that begin with &quot;型&quot; in the dictionary. [Footnote 1: Actual dictionaries also contain &quot;大 (big),&quot; &quot;型 (type),&quot; &quot;計 (measure),&quot; &quot;計算 (compute),&quot; &quot;算 (count),&quot; &quot;機 (machine),&quot; &quot;設 (establish),&quot; and &quot;備 (prepare).&quot;] The starting position is then set to 3 and the lookup is tried again, and this time the words &quot;計算機 (computer)&quot; and &quot;計算機設備 (computing facility)&quot; are obtained.
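
The position-by-position lookup just described can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the JMA's actual implementation: the names TrieNode, build_trie, and lookup_all_positions are hypothetical, and the Japanese spellings of the five dictionary words are an assumed reading that is consistent with the glosses and word lengths given in the text.

```python
from typing import Dict, List, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False  # True for the "hatched" terminal nodes


def build_trie(words: List[str]) -> TrieNode:
    """Insert every dictionary word into a character-level TRIE."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def lookup_all_positions(root: TrieNode, text: str) -> Tuple[List[Tuple[int, str]], int]:
    """Restart from the root at every character position.

    Returns the (1-based start position, word) pairs found and the
    number of TRIE nodes visited below the root.
    """
    found: List[Tuple[int, str]] = []
    node_accesses = 0
    for start in range(len(text)):
        node = root  # backtrack to the root for each new start position
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break
            node_accesses += 1
            if node.is_word:
                found.append((start + 1, text[start:end + 1]))
    return found, node_accesses


# Assumed spellings of the simplified example dictionary and input fragment.
WORDS = ["大型", "大型計算機", "計算機", "計算機設備", "設備"]
TEXT = "大型計算機設備"

found, node_accesses = lookup_all_positions(build_trie(WORDS), TEXT)
```

On this fragment the sketch visits 12 nodes, and the walk from position 3 re-reads the characters of "計算機" that were already matched during the walk from position 1.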
The problem here is that, even though we know that there is &quot;大型計算機 (mainframe)&quot; at position 1, we look up &quot;計算機 (computer)&quot; again. Since &quot;計算機 (computer)&quot; is a substring of &quot;大型計算機 (mainframe),&quot; we know that the word &quot;計算機 (computer)&quot; exists at position 3 as soon as we find &quot;大型計算機 (mainframe)&quot; at position 1. Therefore, going back to the root node at position 3 and trying matching all over again means duplicating our efforts unnecessarily.</Paragraph> </Section> <Section position="4" start_page="208" end_page="211" type="sub_section"> <SectionTitle> 2.4 Eliminating Backtracking </SectionTitle> <Paragraph position="0"> Our idea is to use the index structure developed by Aho and Corasick to find multiple strings in a text. Figure 5 shows the TRIE with a pointer called the fail pointer associated with the node corresponding to the word &quot;大型計算機 (mainframe)&quot; (the rightmost word in the first row). When a match starting at position n reaches this node, it is guaranteed that the string &quot;計算機&quot; exists starting at position n + 2. Therefore, if the next character in the input sentence does not match any of the child nodes, we do not go back to the root but go back to the node corresponding to this substring by following the fail pointer, and resume matching from this node. For the input sentence in Figure 3, the dictionary access proceeds as indicated by the dotted line in Figure 5, finding the words &quot;大型 (large),&quot; &quot;大型計算機 (mainframe),&quot; &quot;計算機 (computer),&quot; and so on.
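
A minimal Python sketch of lookup with fail pointers is given here for illustration. It assumes the standard Aho-Corasick construction (fail pointers computed breadth-first, with each node inheriting the word lengths recorded at its fail target), which may differ in detail from the JMA's implementation. The names Node, build, and search are hypothetical, and the Japanese spellings are an assumed reading consistent with the glosses and word lengths given in the text.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.fail: Optional["Node"] = None
        self.lengths: List[int] = []  # lengths of all words ending at this node


def build(words: List[str]) -> Node:
    """Build the TRIE, then set fail pointers level by level."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.lengths.append(len(word))
    root.fail = root
    queue = deque()
    for child in root.children.values():
        child.fail = root  # depth-1 nodes fail to the root
        queue.append(child)
    while queue:
        n = queue.popleft()
        for ch, m in n.children.items():
            f = n.fail
            # Follow fail pointers until a node with a child for ch is found.
            while ch not in f.children and f is not root:
                f = f.fail
            m.fail = f.children.get(ch, root)
            # Words ending at the fail target also end at m.
            m.lengths.extend(m.fail.lengths)
            queue.append(m)
    return root


def search(root: Node, text: str) -> List[Tuple[int, str]]:
    """Scan the text once, never restarting from the next start position."""
    found: List[Tuple[int, str]] = []
    node = root
    for i, ch in enumerate(text):
        while ch not in node.children and node is not root:
            node = node.fail
        node = node.children.get(ch, root)
        for length in node.lengths:
            start = i - length + 2  # 1-based start position of the word
            found.append((start, text[start - 1:i + 1]))
    return found


# Assumed spellings of the simplified example dictionary and input fragment.
WORDS = ["大型", "大型計算機", "計算機", "計算機設備", "設備"]
TEXT = "大型計算機設備"

root = build(WORDS)
matches = search(root, TEXT)
```

In this sketch each input character is consumed exactly once. When the character after "大型計算機" has no matching child, the scan continues from the node for "計算機" instead of returning to the root.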
Thus, the number of dictionary node accesses is greatly reduced.</Paragraph> <Paragraph position="1"> In many Japanese morphological analysis systems, the dictionary is held in the secondary storage, and therefore the number of dictionary [Footnote 2: The fact that &quot;大型計算機 (mainframe)&quot; was found before does not necessarily mean that there is no need to look up &quot;計算機... (computer ...),&quot; because at this point two interpretations, &quot;mainframe facility&quot; and &quot;large computing facility,&quot; are possible.]</Paragraph> <Paragraph position="3"> Theoretically there is a possibility of pruning dictionary lookup by using the state set at position n. For example, if no noun can follow any of the states in the current state set, there is no need to look up nouns. One way to do this pruning is to associate with each node a bit vector representing the set of all parts of speech of some word beyond this node. If the intersection of the expected set of parts of speech and the possibilities beyond this node is empty, the expansion of this node can be pruned.</Paragraph> <Paragraph position="4"> In general, however, almost every character position predicts most of the parts of speech. Thus, it is common practice in Japanese morphological analysis to look up every possible prefix at every character position.</Paragraph> <Paragraph position="5"> Hidaka et al. (1984) used a modified B-tree instead of a simple TRIE. Although a B-tree has far fewer nodes than a TRIE and thus the number of secondary storage accesses can be significantly reduced, it still backtracks to the next character position and duplicate matching is inevitable. Since Step 1 is well known, we will describe only Step 2 here.</Paragraph> <Paragraph position="6"> For each node n, Step 2 gives the value fail(n). In the following algorithm, forward(n, c) denotes the child node of the node n whose associated character is c.
If there is no such node, we define forward(n, c) = nil. Root is the root node of the TRIE.</Paragraph> <Paragraph position="7">
2-1 fail(Root) ← Root
2-2 for each node n of depth 1, fail(n) ← Root
2-3 for each depth d = 1, 2, ...,
2-3-1 for each node n with depth d,
2-3-1-1 for each child node m of n (where m = forward(n, c)), fail(m) ← f(fail(n), c).</Paragraph> <Paragraph position="8"> Here, f(n, c) is defined as follows:</Paragraph> <Paragraph position="10"> If the node corresponds to the end of some word, we record the length l of the word in the node. For example, at the node that corresponds to the end of the word &quot;大型計算機 (mainframe)&quot;, l = 5 and l = 3 are recorded because it is the end of both of the words &quot;大型計算機 (mainframe, l = 5)&quot; and &quot;計算機 (computer, l = 3).&quot; Figure 6 shows the complete TRIE with the fail pointers.</Paragraph> <Paragraph position="11"> traditional TRIE and was 27% faster in CPU time. The CPU time was measured with all the nodes in the main memory.</Paragraph> <Paragraph position="12"> For the computer manuals, the reduction rate was a little larger. This is attributable to the fact that computer manuals tend to contain longer, more technical terms than newspaper articles.</Paragraph> <Paragraph position="13"> Our method is more effective if there are a large number of long words in a text.</Paragraph> </Section> </Section> </Paper>