<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1082"> <Title>Bunsetsu Identification Using Category-Exclusive Rules</Title> <Section position="3" start_page="565" end_page="568" type="metho"> <SectionTitle> a.2 Maximum-entropy method </SectionTitle> <Paragraph position="0"> The maximum-entropy method is useful with sparse data conditions and has been used by many researchers (Berger et al., 1996; Ratnaparkhi, 1996; Ratnaparkhi, 1997; Borthwick el; al., 1998; Uchimoto et al., 1999). In our maximuln-entropy experiment we used Ristad's system (Ristad, 1998). The analysis is performed by calculating the probability of inserting or not inserting a partition mark, from the output of the system. Whichever probability is higher is selected as the desired answer.</Paragraph> <Paragraph position="1"> In the maximum-entropy method, we use the same four types of morI)hological information, (i) major POS, (ii) minor POS, (iii) semantic information, and (iv) word, as in the decision-tree method. However, it, does not consider a combination of features. Unlike the decision-tree method, as a result, we had to combine features mmmally.</Paragraph> <Paragraph position="2"> First we considered a combination of the bits of each morphological information. Because there were four types of information, the total number of combinations was 2 ~- 1. Since this number is large and intractable, we considered that (i) major POS, (ii) minor POS, (iii) semantic information, aim (iv) word information gradually becolne inore specific in this order, and we coml)ined the four types of information in the following way: Information A: (i) major POS Intbrmation B: (i) major POS and (ii) minor POS hfformat, ion C: (i) major POS, (ii) minor POS and (iii) semantic information Information D: (i) major POS, (ii) minor POS, (iii) semantic informa~aion and (iv) word (~) We used only Information A and B for the two outside morphemes because we (lid not use semantic and word information in the same way it is used in the decision-tree inethod.</Paragraph> <Paragraph position="3"> Next, we considered the combinations of each type of information. As shown in Figure 4, the number of combinations was 64 (2 x 4 x 4 x 2).</Paragraph> <Paragraph position="4"> For data sparseness, in addition to the above combinations, we considered the cases in which frst, one of the two outside morphemes was not used, secondly, neither of the two outside ones were used, m~d thirdly, only one of the two middle ones is used. The nmnber of features used in the maximum-entropy method is 152, which is obtained as follows: a 3When we extr~,cted features from all of the articles on No. of t>atures= 2 x 4 x 4 x 2</Paragraph> <Paragraph position="6"> In Figure 2, (;lie feature that uses Infornultion B in the far left morl)heme, Infbrnmtion D in the left mort)heine, Information C in the right morpheme, and Information A in the fa.r right mop l/heme is &quot;Noun: Nornml Noun; Particle: Case-Particle: none: wo; Verl): Nornml Form: 217; Symbol&quot;. In tim maximmn-entrol)y method we used for each space 152 ligatures such as this ()tie.</Paragraph> <Section position="1" start_page="566" end_page="567" type="sub_section"> <SectionTitle> 3.3 Example-based method (use of </SectionTitle> <Paragraph position="0"> similarity) An example-based method was t)rollosed t) 3, Nagao (Nagao, 1984) in an attempt to solve I)roblenls in machine translation. To resolve a. l)rol)h'm, it; uses the most similar (;xami)le. 
<Section position="1" start_page="566" end_page="567" type="sub_section"> <SectionTitle> 3.3 Example-based method (use of similarity) </SectionTitle> <Paragraph position="0"> An example-based method was proposed by Nagao (Nagao, 1984) in an attempt to solve problems in machine translation. To resolve a problem, it uses the most similar example. In the present work, the example-based method impartially used the same four types of information (see Eq. (1)) as in the maximum-entropy method. To use this method, we must define the similarity of an input to an example. We use the 152 patterns from the maximum-entropy method to establish the level of similarity. We define the similarity S between an input and an example according to which one of these 152 levels is the matching level, as follows. (The equation reflects the importance of the two middle morphemes.)</Paragraph> <Paragraph position="1">

S = s(m-1) x s(m+1) x 10,000 + s(m-2) x s(m+2)    (2)

</Paragraph> <Paragraph position="2"> Here m-1, m+1, m-2, and m+2 refer respectively to the left, right, far-left, and far-right morphemes, and s(x) is the morphological similarity of a morpheme x, which is defined as follows: s(x) = 1 when no information of x is matched, s(x) = 2 when Information A of x is matched, s(x) = 3 when Information B is matched, s(x) = 4 when Information C is matched, and s(x) = 5 when Information D is matched. Figure 5 shows an example of the levels of similarity. When a pattern matches Information A of all four morphemes, such as "Noun; Particle; Verb; Symbol", its similarity is 40,004 (2 x 2 x 10,000 + 2 x 2). When an input matches a pattern such as " ; Particle: Case-Particle: none: wo; ; ", its similarity is 50,001 (5 x 1 x 10,000 + 1 x 1).</Paragraph> <Paragraph position="3"> The example-based method extracts the example with the highest level of similarity and checks whether or not that example is marked. A partition mark is inserted in the input data only when the example is marked. When multiple examples have the same highest level of similarity, the selection of the best example is ambiguous. In this case, we count the number of marked and unmarked spaces in all of the examples and choose the larger.</Paragraph> <Paragraph position="4"> 3.4 Decision-list method (use of probability and frequency) The decision-list method was proposed by Rivest (Rivest, 1987), in which the rules are not expressed as a tree structure as in the decision-tree method, but are expanded by combining all the features and are stored in a one-dimensional list. A priority order is defined in a certain way and all of the rules are arranged in this order. The decision-list method searches for rules from the top of the list and analyzes a particular problem by using only the first applicable rule.</Paragraph> <Paragraph position="5"> In this study we used in the decision-list method the same 152 types of patterns that were used in the maximum-entropy method.</Paragraph> <Paragraph position="6"> To determine the priority order of the rules, we referred to Yarowsky's method (Yarowsky, 1994) and Nishiokayama's method (Nishiokayama et al., 1998) and used the probability and frequency of each rule as measures of this priority order. When multiple rules had the same probability, the rules were arranged in order of their frequency.</Paragraph>
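As a rough illustration of this priority ordering, the following sketch (our own rendering, not the authors' implementation; representing a rule as a (pattern, category, probability, frequency) tuple is an assumption) sorts the rules by probability and then by frequency and answers with the first rule whose pattern applies to the space.

```python
# Minimal sketch of the decision-list procedure described above.
# Each rule is assumed to be a (pattern, category, probability, frequency) tuple,
# where category is "partition" or "non-partition".

def build_decision_list(rule_stats):
    # Higher probability first; among equal probabilities, higher frequency first.
    return sorted(rule_stats, key=lambda r: (r[2], r[3]), reverse=True)

def decide(space_patterns, decision_list):
    # space_patterns: the set of (up to 152) patterns that match the current space.
    for pattern, category, probability, frequency in decision_list:
        if pattern in space_patterns:    # the first applicable rule is used
            return category
    return None                          # no rule applies
```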
<Paragraph position="7"> Suppose, for example, that Pattern A "Noun: Normal Noun; Particle: Case-Particle: none: wo; Verb: Normal Form: 217; Symbol: Punctuation" occurs 13 times in a learning set and that ten of the occurrences include the inserted partition mark. Suppose also that Pattern B "Noun; Particle; Verb; Symbol" occurs 123 times in a learning set and that 90 of the occurrences include the mark.</Paragraph> <Paragraph position="8"> This example is recognized by the following rules, which are arranged in order of their probabilities and, for any one probability, in order of their frequencies. This list was searched from the top and the answer was obtained by using the first applicable rule.</Paragraph> </Section> <Section position="2" start_page="567" end_page="568" type="sub_section"> <SectionTitle> 3.5 Method 1 (use of category-exclusive rules) </SectionTitle> <Paragraph position="0"> So far, we have described the four existing machine learning methods. In the next two sections we describe our methods.</Paragraph> <Paragraph position="1"> It is reasonable to consider the 152 patterns used in three of the previous methods. Now, let us suppose that the 152 patterns from the learning set yield the statistics of Figure 6.</Paragraph> <Paragraph position="2"> "Partition" means that the rule determines that a partition mark should be inserted in the input data, and "non-partition" means that the rule determines that a partition mark should not be inserted.</Paragraph> <Paragraph position="3"> Suppose that when we solve a hypothetical problem, Patterns A to G are applicable. If we use the decision-list method, only Rule A, which is applied first, is used, and this determines that a partition mark should not be inserted. For Rules B, C, and D, although the frequency of each rule is lower than that of Rule A, the sum of their frequencies is higher, so we think that it is better to use Rules B, C, and D than Rule A. Method 1 follows this idea, but we do not simply sum up the frequencies. Instead, we count the number of examples used in Rules B, C, and D and judge the category having the largest number of examples that satisfy the pattern with the highest probability to be the desired answer.</Paragraph> <Paragraph position="4"> For example, suppose that in the above example the number of examples satisfying Rules B, C, and D is 65. (Because some examples overlap in multiple rules, the total number of examples is actually smaller than the sum of the frequencies of the three rules.) In this case, among the examples used by the rules having 100% probability, the number of examples of partition is 65, and the number of examples of non-partition is 34. So, we determine that the desired answer is to partition.</Paragraph> <Paragraph position="5"> A rule having 100% probability is called a category-exclusive rule because all the data satisfying it belong to one category, which is either partition or non-partition. Because for any given space the number of rules used can be as large as 152, category-exclusive rules are applied often. [Footnote 4: The ratio of the spaces analyzed by using category-exclusive rules is 99.30% (16,864/16,983) in Experiment 1 of Section 4. This indicates that almost all of the spaces are analyzed by category-exclusive rules.] Method 1 uses all of these category-exclusive rules, so we call it the method using category-exclusive rules.</Paragraph>
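The following sketch summarizes Method 1 as described above (our own rendering, not the authors' code; representing an applicable rule as a (category, probability, example_ids) tuple is an assumption): among the applicable rules with the highest probability, the distinct learning examples covered by each category are counted, and the category covering more examples is returned.

```python
# Minimal sketch of Method 1 (use of category-exclusive rules).
from collections import defaultdict

def method1(applicable_rules):
    # applicable_rules: list of (category, probability, example_ids) tuples, where
    # example_ids identifies the learning examples that satisfy the rule's pattern.
    if not applicable_rules:
        return None
    best_prob = max(prob for _, prob, _ in applicable_rules)
    covered = defaultdict(set)
    for category, prob, example_ids in applicable_rules:
        if prob == best_prob:
            # Union, not sum: an example satisfying several rules is counted once,
            # which is why the 65 examples can be fewer than the summed frequencies.
            covered[category] |= set(example_ids)
    # Choose the category whose highest-probability rules cover more examples.
    return max(covered, key=lambda c: len(covered[c]))
```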
<Paragraph position="6"> Solving problems by using rules whose probabilities are not 100% may result in wrong solutions, and almost all of the traditional machine learning methods solve problems by using such rules. By using such methods, we cannot hope to improve accuracy. If we want to improve accuracy, we must use category-exclusive rules. There are some cases, however, for which, even if we take this approach, category-exclusive rules are rarely applied. In such cases, we must add new features to the analysis to create a situation in which many category-exclusive rules can be applied.</Paragraph> <Paragraph position="9"> However, it is not sufficient simply to use category-exclusive rules. There are many meaningless rules which happen to be category-exclusive only in a learning set. We must consider how to eliminate such meaningless rules.</Paragraph> </Section> <Section position="3" start_page="568" end_page="568" type="sub_section"> <SectionTitle> 3.6 Method 2 (using category-exclusive rules and similarity) </SectionTitle> <Paragraph position="0"> In Method 2, the answer is determined by using the rule having the highest probability. When multiple rules have the same probability, Method 2 uses the value of the similarity described in the section on the example-based method and analyzes the problem with the rule having the highest similarity. When multiple rules have the same probability and similarity, the method takes the examples used by the rules having the highest probability and the highest similarity, and chooses the category with the larger number of examples as the desired answer, in the same way as in Method 1.</Paragraph> <Paragraph position="1"> However, when category-exclusive rules having a frequency of more than one exist, the above procedure is performed after eliminating all of the category-exclusive rules having a frequency of one. In other words, category-exclusive rules having a frequency of more than one are given a higher priority than category-exclusive rules having a frequency of only one but a higher similarity. This is because category-exclusive rules having a frequency of only one are not so reliable.</Paragraph> </Section> </Section> </Paper>