<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1082">
<Title>Bunsetsu Identification Using Category-Exclusive Rules</Title>
<Section position="2" start_page="0" end_page="565" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> This paper is about machine-learning methods for identifying bunsetsus, which correspond to English phrasal units such as noun phrases and prepositional phrases. Since Japanese syntactic analysis is usually done after bunsetsu identification (Uchimoto et al., 1999), identifying bunsetsu is important for analyzing Japanese sentences. The conventional studies on bunsetsu identification have used hand-made rules (Kameda, 1995; Kurohashi, 1998), but bunsetsu identification is not an easy task. Conventional studies used many hand-made rules developed at the cost of many man-hours. Kurohashi, for example, made 146 rules for bunsetsu identification (Kurohashi, 1998).</Paragraph>
<Paragraph position="1"> In an attempt to reduce the number of man-hours, we used machine-learning methods for bunsetsu identification. Because it was not clear which machine-learning method would be the most appropriate for bunsetsu identification, we tried a variety of them. In this paper we report experiments comparing four machine-learning methods (decision-tree, maximum-entropy, example-based, and decision-list methods) and our new methods using category-exclusive rules.</Paragraph>
<Paragraph position="2"> Bunsetsu identification is a problem similar to chunking (Ramshaw and Marcus, 1995; Sang and Veenstra, 1999) in other languages.</Paragraph>
<Paragraph position="3"> 2 Bunsetsu identification problem
Japanese syntactic structures are usually represented by the relations between bunsetsus, which correspond to phrasal units such as a noun phrase or a prepositional phrase in English. So, bunsetsu identification is important in Japanese sentence analysis. We conducted experiments on the following supervised learning methods for identifying bunsetsu.</Paragraph>
<Paragraph position="4"> In this paper, we identify a bunsetsu by using information from a morphological analysis. Bunsetsu identification is treated as the task of deciding whether to insert a &quot;|&quot; mark to indicate the partition between two bunsetsus, as in Figure 1. Therefore, bunsetsu identification is done by judging whether or not a partition mark should be inserted between two adjacent morphemes. (We do not use the inserted partition mark in the following analysis in this paper, for the sake of simplicity.) Our bunsetsu identification method uses the morphological information of the two preceding and two succeeding morphemes of an analyzed space between two adjacent morphemes. We use the following morphological information: (i) major part-of-speech (POS) category (POS categories follow those of JUMAN (Kurohashi and Nagao, 1998)), (ii) minor POS category or inflection type, (iii) semantic information (the first three-digit number of a category number as used in &quot;BGH&quot; (NLRI, 1964)), and (iv) word.</Paragraph>
<Paragraph position="5"> For simplicity, we do not use the &quot;semantic information&quot; and &quot;word&quot; features in either of the two outside morphemes.</Paragraph>
<Paragraph position="6"> Figure 2 shows the information used to judge whether or not to insert a partition mark in the space between two adjacent morphemes, &quot;wo (obj)&quot; and &quot;kugiru (divide),&quot; in the sentence &quot;bun wo kugiru.&quot;</Paragraph>
<Paragraph position="7"> In this work we used the program C4.5 (Quinlan, 1995) for the decision-tree learning method. The four types of information, (i) major POS, (ii) minor POS, (iii) semantic information, and (iv) word, mentioned in the previous section were also used as features with the decision-tree learning method. As shown in Figure 3, the number of features is 12 (2 + 4 + 4 + 2) because we do not use (iii) the semantic information or (iv) the word information of the two outside morphemes.</Paragraph>
<Paragraph position="8"> In Figure 2, for example, the value of the feature 'the major POS of the far left morpheme' is 'Noun.'</Paragraph>
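<Paragraph position="9"> [Editor's sketch, not from the paper.] As a minimal Python illustration of the feature layout just described, the code below builds the 12-value feature vector (2 + 4 + 4 + 2) for one analyzed space between two adjacent morphemes; the attribute names (major_pos, minor_pos, semantic, word) and the placeholder values are assumptions, not the authors' data format.

# Illustrative sketch: the 12 features used to decide whether a bunsetsu
# partition mark goes into the space between morphemes i and i+1.
def boundary_features(morphemes, i):
    """Features for the analyzed space between morphemes[i] and morphemes[i + 1].

    Window: two preceding and two succeeding morphemes of the analyzed space.
    Outside morphemes (positions i - 1 and i + 2) contribute only major and
    minor POS (2 features each); inside morphemes (i and i + 1) also contribute
    semantic information and the word itself (4 features each): 2 + 4 + 4 + 2 = 12.
    """
    def get(j, keys):
        # Out-of-range positions are padded with a dummy value.
        if j in range(len(morphemes)):
            m = morphemes[j]
        else:
            m = {}
        return [m.get(k, "NONE") for k in keys]

    outer = ["major_pos", "minor_pos"]
    inner = ["major_pos", "minor_pos", "semantic", "word"]
    return (get(i - 1, outer) + get(i, inner) +
            get(i + 1, inner) + get(i + 2, outer))

# Example for "bun wo kugiru": the space between "wo" and "kugiru".
# Minor POS and semantic codes are simplified placeholders.
sentence = [
    {"major_pos": "Noun", "minor_pos": "Common", "semantic": "131", "word": "bun"},
    {"major_pos": "Particle", "minor_pos": "Case", "semantic": "NONE", "word": "wo"},
    {"major_pos": "Verb", "minor_pos": "Inflected", "semantic": "238", "word": "kugiru"},
]
print(boundary_features(sentence, 1))  # 12 feature values; far left major POS is "Noun"
</Paragraph>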
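<Paragraph position="10"> [Editor's sketch, not from the paper.] The decision-tree experiment described above uses C4.5, which handles categorical features directly; as an illustrative stand-in only, the sketch below trains scikit-learn's DecisionTreeClassifier on one-hot encoded categorical features. The feature names and the two training examples are toy placeholders, not the authors' corpus.

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Each training example describes one analyzed space by its categorical
# features (abbreviated here to a few of the 12 slots for readability);
# the label says whether a bunsetsu partition mark is inserted at that space.
train_X = [
    {"L1_major": "Noun", "R1_major": "Particle", "R1_word": "wo"},    # "bun | wo"? no
    {"L1_major": "Particle", "L1_word": "wo", "R1_major": "Verb"},    # "wo | kugiru"? yes
]
train_y = [0, 1]  # toy labels: 1 = insert partition mark

vec = DictVectorizer(sparse=False)      # one-hot encodes string-valued features
clf = DecisionTreeClassifier()          # stand-in for C4.5
clf.fit(vec.fit_transform(train_X), train_y)

test = {"L1_major": "Particle", "L1_word": "wo", "R1_major": "Verb"}
print(clf.predict(vec.transform([test])))  # expected: [1]
</Paragraph>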
</Section>
</Paper>