<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2110">
  <Title>Japanese Dependency Analysis using a Deterministic Finite State Transducer</Title>
  <Section position="3" start_page="0" end_page="761" type="metho">
    <SectionTitle>
2 Backward beam search algorithm
</SectionTitle>
    <Paragraph position="0"> l?irst, we wouhl lil:e to describe the backwm'd beam search algoril:lm: tbr ,\]at)m:ese dependency analysis proposed by Sekinc et sd. (Sekine et al., 2000). Their experin:ents suggested the method proposed in this paper.</Paragraph>
    <Paragraph position="1"> The following characteristics are known ibr Japanese dependency. Sekine et al. assumed these characteristics in order to design the algorithm 1  (1) l)ependencies are directed from left to right (2) Dependencies don't cross.</Paragraph>
    <Paragraph position="2"> (3) Each bunsetsu except the rightmost one has only one head (4) Left context is not necessary to deternfine a dependency.</Paragraph>
    <Paragraph position="4"> Head 6 3 4 6 6 -Translation: She received the pie made by him with pleasure. Figure h Example of a Japanese sentence, bunsetsus and dependencies Sekine et al. proposed a backward beam search algorithm (analyze a sentence froln the tail to the head, or right to left). Backward search has two merits. Let's assume we have analyzed up to the/14 + l-st bunsetsu in a sentence of length N (0 &lt; M &lt; N). Now we are deciding on the head of the M-th bunsetsu. The first merit is that the head of the dependency of the /1J-th bunsetsu is one of the bunsetsus between t14 + 1 and N, which are already analyzed. Becmlse of this, we don't have to keel) a huge number of possible analyses, i.e. we can avoid something like active edges in a chart parser, or making parallel stacks in GLR parsing, as we can make a decision at this time. Also, wc can use the bemn search mechanism, t)y keeping a certain nmnber of candidates of analyses at each t)unsetsu. The width of the beam search can be e~Lsily tuned and the memory size of the process is proportional to the product of the input sentence length and the beam search width. The other merit is that the 1)ossible heads of the dependency can be narrowed down because of the assmnption of non-crossing dependencies. For example, if the K-th bunsetsu depends on the L-th bunsetsn (/1J &lt; K &lt; L), then the 21//th bunsetsu can't depend on any bunsetsus between I( and L. According to our experiment, this reduced the nmnber of heads to consider to less than 50%.</Paragraph>
    <Paragraph position="5"> Uchimoto et al. implemented a Japanese dependency analyzer based on this algorithm in combination with the Maxinmm Entropy learning method (Uchimoto et al., 2000). The analyzer demonstrated a high accuracy. Table 1 shows the relationship between the beam width and the accuracy of the system. Froln the table, we can see that the accuracy is not sensitive to the beam width, and even when the width is  Table h Bemn width and accuracy 1, which Ineans that at each stage, the dependency is deterministieally decided, the accuracy is almost the same as the best accuracy. This means that the left, context of a dependency is not necessary to decide the dependency, which is closely related to characteristic (4). This result gives a strong motivation tbr st~rting the research reported in this paper.</Paragraph>
  </Section>
  <Section position="4" start_page="761" end_page="3126" type="metho">
    <SectionTitle>
3 Idea
</SectionTitle>
    <Paragraph position="0"> Untbrtunately, the fact that the beam width can be 1 by itself can not lead to the analysis in time proportional to the input length. Theoretically, it takes time quadratic in the length and this is observed experimentally as well (see Figure 3 in (Sekine et al., 2000)). This is due to the process of finding the head among the candidate bunsetsns on its right. The time to find the head is proportional to the number of candidates, which is roughly proportional to the number of bunsetsus on its right. The key idea of getting linear speed is to make this time a constant.</Paragraph>
    <Paragraph position="1"> Table 2 shows the number of candidate bunsetsus and the relative location of the head among the candidates (the number of candidate bunsetsus from the bmlsetsu being analyzed). The data is derived fl'om 1st to 8th of  Nmn.of Location of head eand. 1 2 3 4 5 6 7 8 9 10 11 12 13 1.4  January part of the Kyoto COl'pus, wlfich is a hand created cort)llS tagged.with POS, tmnsetsu and dependency relationships (l{urohashi, Nagao, 1997). It is the same 1)ortion used as tile training data of the system described late, r. 'J~he nmnbc, r of cmMidates is shown vertically and the location of the head is shown horizontally. l;br exalnple, 2244 in the fom'th row and the se(:Olid (-ohmn~ means that there are 2244 bunsetsu which have 4 head Calldidates and the 2nd fl'om the left is the head of the bunsetsu.</Paragraph>
    <Paragraph position="2"> We can observe that the. data. is very biased.</Paragraph>
    <Paragraph position="3"> The head location is very lilnited. 98.7% of instances are covered 1)y the first 4 candidates and the last candidate, combined. In the table, the data which are not covered by the criteria are shown in italics. From this obserw~tion, we come to the key idea. We restrict the head candidate locations to consider, mid we remember all patterns of categories at the limited locations. For each remembered lmttern, we also remember where the. head (answer) is. This strategy will give us a constant time select;ion of the head. For example, assmne, at ~ certain lmnsetsu in the training data, there are five Calldidates. Then we. will remember the categories of the current lmnsetsu, st\y &amp;quot;A&amp;quot;, and five candidates, &amp;quot;B C 1) E F&amp;quot;, as wall as the head location, for example &amp;quot;2nd&amp;quot;. At the analysis phase, if we encounter the same sittmtion, i.e. the same category bmlsetsu &amp;quot;A&amp;quot; to be mmlyzed, and the same categories of five candidates in the same order &amp;quot;13 C D E F&amp;quot;, then we can just return tile remenfl)ered head location, &amp;quot;2nd&amp;quot;. This process can be done in constant tilne and eventually, the analysis of a sentence can 1)e done in time 1)roportional to the sentence length.</Paragraph>
    <Paragraph position="4"> For the iml)lementation, we used a DFST in which each state represents the patterll of the candidates, the input of an edge is the category of the. l)llnsetsll 1)eing analyzed and the output of a.n edge. is the location of hea(t.</Paragraph>
  </Section>
  <Section position="5" start_page="3126" end_page="3126" type="metho">
    <SectionTitle>
4 Implementation
</SectionTitle>
    <Paragraph position="0"> We still have several 1)rol)lelns whi('h have to be solved in order to iint)lement the idea. As-SUlning the number of candidates to be 5, the t)roblelns are to  (l) define the c~tegories of head bunsetsu candidates, null (2) limit the nunlber of patterns (the lmmber of states in DFST) to a mallageable range, because the Colnbilmtion of five categories could t)e huge (3) define the categories of intmt bunsetsus, (4) deal with unseen events (sparseness l)roblem). null hi this section, we l)resent how we implemented the, system, wlfich at the stone tinle  shows the solution to the problems. At the end, we implemented the systeln in a very small size; 1200 lines of C progrmn, 188KB data file and less than 1MB processing size.</Paragraph>
    <Paragraph position="1"> Structure of DFST For the categories of head bunsetsu candidates, we used JUMAN's POS categories as the basis and optimized them using held-out data. JUMAN has 42 parts-of-speech (POS) including the minor level POSs, and we used the POS of the head word of a candidate bunsetsu. We also used the information of whether the bunsetsu has a colnma or not. The nulnber of categories 1)ecomes 18 after tuning it using the held-out data.</Paragraph>
    <Paragraph position="2"> The input of all edge is the information about the 1)unsetsu currently being analyzed. We used the inforination of the tail of the bunsetsu (mostly function words), and it becomes 40 categories according to the stone tuning process. The output of an edge is the information of which candidate is the head. It is simply a nun&gt; ber from 1 to 5. The training data contains examples which represent the same state and input, but different output. This is due to the rough definition of the categories or inherent impossibility of to disambiguating the dependency relationship fi'om the given infbrmation only. In such cases, we pick the one which is the most frequent as the answer in that situation. We will discuss the problems caused by this strategy. Data size Based on the design described above, the number of states (or number of patterns) is 1,889,568 (18 ~) and the number of edges is 75,582,720 as each state has 40 outgoing edges. If we implement it as is, we may need several hundred megabytes of data. In order to keep it fast, all the data should be in memory, and the current memory requirement is too large to implement.</Paragraph>
    <Paragraph position="3"> To solve the problem, two ideas are employed. First, the states are represented by the combination of the candidate categories, i.e. states can be located by the 5 candidate categories. So, once it transfers from one state to another, the new state can be identified from the previous state, input bunsetsu and the output of the edge. Using this operation, the new state does not need to be remembered for each edge and this leads to a large data reduction. Second, we introduced the default dependency relationship. In the Japanese dependency relationship, 64% of bunsetsu debend on the next bunsetsu. So if this is the output, we don't record the edge as it is the default. In other words, if there is no edge in~brmation for a particular input at a particular state, the outtmt is the next bunsetsu (or 1). This gave unexpectedly a large benefit. For unseen events, it is reasonable to guess that the bunsetsu depends the next bunsetsu. Because of the default, we don't have to keep such information. Actually the majority (more than 99%) of states and edges are unseen events, so the default strategy helps a lot to reduce the data size.</Paragraph>
    <Paragraph position="4"> By the combination of the two ideas, there could be a state whidl has absolutely no information. If this kind of state is reached in the DFST, the output for any input is the next bunsetsu and the next state can be calculated froln the information you have. In fact, we have a lot of states with absolutely no information. Before implementing the supplementation, explained in the next section, the number of recorded states is only 1,006 and there are 1,554 edges (among 1,889,568 possible states and 75,582,720 possible edges). After implementing the supplementation, we still have only 10,457 states and 31.,316 edges. The data sizes (file sizes) are about 15KB and 188KB, respectively. null Sparseness problem Tile amount of training data is about 8,000 sentences. As the average number of bunsetsu in a sentence is 10, there are about 72,000 data points in the training data. This number is very much smaller than the nmnber of possible states (1,889,568), and it seems obvious that we will have a sparseness problem 2.</Paragraph>
    <Paragraph position="5"> In order to supplement the unseen events, we use the system developed by Udfimoto et.al (Uchimoto et al., 2000). A large corpus is parsed by the analyzer, and the results 2However, we can make the system by using the default strategy mM surprisingly the accuracy of the system is not so bad. This will be reported in the Experiment section  are added to the training cori)us. In practice, we parsed two years of lmwspaper articles (2,536,229 sentences of Mainichi Shinbun 94 and 95, excluding Jalmary 1-10, 95, as these are used in the Kyoto corpus).</Paragraph>
  </Section>
  <Section position="6" start_page="3126" end_page="3126" type="metho">
    <SectionTitle>
5 Experiment
</SectionTitle>
    <Paragraph position="0"> In this section, we will report the experilnent.</Paragraph>
    <Paragraph position="1"> ~C/\r(, used the Kyoto corl)us (ve.rsion 2). The training is done using January 1-8, 95 (7,960 sentences), the test is done using Jmmary 9, 95 (1,246 sentences) mid the parameter tuning is done using Jmmary 10, 95 (1,519 sentences; held-out data). The input sentences to the sysrein are morl)hologically analyzed and bunsetsu nre detected correctly.</Paragraph>
    <Section position="1" start_page="3126" end_page="3126" type="sub_section">
      <SectionTitle>
Dependency Accuracy
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the accuracy of the systems.</Paragraph>
      <Paragraph position="1"> The 'dependency accuracy' metals the percentage of the tmnsetsus whose head is correctly mmlyzed. The bmlsetsus are all but the last bunsetsu of the sentence, as the last 1)unsetsu has no head. The 'default nlethod' is the sysrein in which the all bunsetsus are supt)osed to dei)end on the next bunsetsu. ~t'he 'baseline reel;hod' is a quite siml)le rule-based systoni which was created by hand an(t has a\])out 30 rules. 'l'he details of the system are rel)orted in (Uchimoto et al., 1999). The 'Uchimoto's system' is tit(; system rel)orted in (Uchimoto ct al., 2000). They used the same training data m~d test (lata. The del)endency accuracies of  our systems are 81% with supl)lenlentation and 78% without supt)lementation. The result is about 17% and 9% better than tile default mid the baseline nlethods respectively. Compared with Uchimoto's system, we are about 7% behind. But as Uchimoto's system used about 40,()00 features including lexical features, and they also introduced combined features (up to 5), it is natural that our system, which uses only 18 categories and only combinations of two catcgories, has less accuracy a.</Paragraph>
    </Section>
    <Section position="2" start_page="3126" end_page="3126" type="sub_section">
      <SectionTitle>
Analysis speed
</SectionTitle>
      <Paragraph position="0"> The main objective of the system is speed.</Paragraph>
      <Paragraph position="1"> Table 4 shows the analysis speed on three dif fcrent platforms. On the fastest machine, it analyzes a sentence in 0.17 millisecond. Tat)le 5 shows a comlmrisou of the analysis speed of three difl'erent systems on SPARC-I. Our system runs about 100 times faster than Uchimoto's  tile slowest nlachine (Ultra SPARC-I, 170MHz) in order to minimize obserw~tional errors. We cml clearly observe that the anMysis tinle is proportional to the sentence length, as was predicted by the algorithm.</Paragraph>
      <Paragraph position="2"> The speed of tile training is slightly slower l:han that of tile analysis. The training on the smaller training data (about 8000 sentences) t;akes M)out 10 seconds on Ultra SPAR.C-I.</Paragraph>
      <Paragraph position="3"> aHowever, our system uses context information of fly( ~, bunsctsus, which is not used in Uchimoto's system.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="3126" end_page="3126" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="3126" end_page="3126" type="sub_section">
      <SectionTitle>
6.1 The restriction
</SectionTitle>
      <Paragraph position="0"> People may strongly argue that the main I)rol&gt; lenl of the system is the restriction of the head location. We eonlpletely agree. We restrict the candidate to five bunsetsus, as we described earlie:', and we thcoretically ignored 1.3% of accuracy. Obviously, this restriction can be loosened by widelfing the range of the candidates. For example, 1.3% will be reduced to 0.5% if we take 6 instead of 5 candidates, hnplenlentation of this change is easy. However, the. problen: lies with the sparseness. As shown in Table 2, there are fewer examples with a large number of bmlsetsu candidates. For examI)le, there are only 4 instances which haw~ 14 candidates. It may be impossible to accumulate enough training exmnples of these kinds, even if we use a lot of untagged text fbr the supplementation. In such cases, we believe, we should have other kinds of remedies. One of them is a generalization of a large number candidates to a smaller number candidates by deleting nniniportant bunsetsus.</Paragraph>
      <Paragraph position="1"> This remains for fl:ture research.</Paragraph>
      <Paragraph position="2"> We can easily imagine that other systems may not analyze accurately the cases where the correct head is far fl'om the bunsetsu. For example, Uchimoto's system achieves an accuracy of 41% in the cases shown in italics in Table 2, which is much lower than 88%, the system's ow~rall accuracy. So, relative to the Uchimoto's system, the restriction caused a loss of accuracy of only 0.5% (1.3% x 41%) instead of 1.3%.</Paragraph>
    </Section>
    <Section position="2" start_page="3126" end_page="3126" type="sub_section">
      <SectionTitle>
6.2 Accuracy
</SectionTitle>
      <Paragraph position="0"> There are several things to do in order to achieve better accuracy. One of the major things is to use the information which has not been used, but is known to be useful to decide dependency relationships. Because the accuracy of the system against the training data is only 81.653% (without supplementation), it, is clear that we miss some important information, We believe the lexical relationships in verb frmne element preference (VFEP) is one of the most important types of information. Analyzing the data, we can find evidence that such information is crucial. For exmnplc, there are 236 exmnples in the training corpus where there are 4 head candidates and they are bunsetsus whose heads are noun, verb, noun and verb, and the current bunsetsu ends with a kaku-j oshi, a major particle. Out of the 236 exami)les, the nulnber of cases where the first, second, third and last candidate is the head are 60, 142, 3 and 31, respectively. The answer is widely spread, and as the current system takes the most fl'equent head as the answer, 94 eases out of 236 (40%) are ignored. This is due to the level of categorization which uses only POS information. Looking at the example sentences in this ease, we can observe that the VFEP could solve the problenl.</Paragraph>
      <Paragraph position="1"> It is not straightfbrward to add such lexieal information to the current franlework. If such information is incorporated into the state inibrmarion, the nmnber of states will become enormous. We can alternatively take an approach which was taken by augmented CFG. In this approach, the lexical infbrmation will be referred to only when needed. Such a process m~\y slow down the analyzer, but since the nuinber of invocation of the process needed in a :~entence may be proportional to the sentence length, we believe the entire process may still operate in tiine proportional to the sentence length.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>