<?xml version="1.0" standalone="yes"?> <Paper uid="M98-1019"> <Title>NYU: DESCRIPTION OF THE JAPANESE NE SYSTEM USED FOR MET-2</Title> <Section position="3" start_page="0" end_page="1" type="metho"> <SectionTitle> ALGORITHM </SectionTitle> <Paragraph position="0"> In this section, the algorithm of the system will be presented. There are two phases, one for creating the decision tree from training data (training phase) and the other for generating the tagged output based on the decision tree (running phase). We use a Japanese morphological analyzer, JUMAN [6], and a program package for decision trees, C4.5 [7]. We use three kinds of feature sets in the decision tree: the part-of-speech, the character type, and the special dictionary information. Organization name has two types of dictionary: one for proper names and the other for general nouns which should be tagged when they co-occur with proper names. Also, we have a special dictionary which contains words written in the Roman alphabet but which are most likely not an organization (e.g. TEL, FAX). We made a list of 93 such words.</Paragraph> <Paragraph position="1"> Creating the special dictionaries is not very easy, but it is not very laborious work. The initial dictionary was built in about a week. In the course of the system development, in particular during the creation of the training corpus, we added some entities to the dictionaries.</Paragraph> <Paragraph position="2"> The decision tree gives an output for each token. It is one of the 4 possible combinations of opening, continuation and closing a named entity, or having no named entity, shown in Table 2.

Table 2:
Output | beginning         | ending
OP-CL  | opening of NE     | closing of NE
OP-CN  | opening of NE     | cont. of NE
CN-CN  | cont. of NE       | cont. of NE
CN-CL  | cont. of NE       | closing of NE
none   | none              | none

</Paragraph> </Section> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> </SectionTitle> <Paragraph position="0"> In this paper, we will use two different sets of terms in order to avoid the confusion between positions relative to a token and regions of named entities. 
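The four tag combinations in Table 2 can be made concrete with a small sketch. This is hypothetical illustrative code, not from the paper: a helper that converts entity spans into the per-token tags described above.

```python
# Hypothetical sketch of the token-level tagging scheme in Table 2.
# Each token gets one tag: OP-CL, OP-CN, CN-CN, CN-CL, or "none",
# describing whether a named entity opens and/or closes at that token.

def encode_tags(num_tokens, spans):
    """spans: list of (start, end) token indexes (inclusive) of entities."""
    tags = ["none"] * num_tokens
    for start, end in spans:
        if start == end:
            tags[start] = "OP-CL"          # entity opens and closes on one token
        else:
            tags[start] = "OP-CN"          # opens, then continues
            for i in range(start + 1, end):
                tags[i] = "CN-CN"          # continues on both sides
            tags[end] = "CN-CL"            # continues, then closes
    return tags
```

For example, a two-token entity covering tokens 1 and 2 of a four-token sentence yields `["none", "OP-CN", "CN-CL", "none"]`, reflecting the no-overlap, no-embedding property stated in the text.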
The terms beginning and ending are used to indicate positions, whereas opening and closing are used to indicate the start and end of named entities. Note that there is no overlapping or embedding of named entities. An example of real data is shown in Figure 1. Training Phase: First, the training sentences are segmented and part-of-speech tagged by JUMAN. Then each token is analyzed by its character type and is matched against entries in the special dictionaries. One token can match entries in several dictionaries. For example, &quot;Matsushita&quot; could match the organization, person and location dictionaries.</Paragraph> <Paragraph position="1"> Using the training data, a decision tree is built. It learns about the opening and closing of named entities based on the three kinds of information of the previous, current and following tokens. The three types of information are the part-of-speech, character type and special dictionary information described above. If we just used the deterministic decision created by the tree, it could cause a problem in the running phase. Because the decisions are made locally, the system could make an inconsistent sequence of decisions overall. For example, one token could be tagged as the opening of an organization, while the next token might be tagged as the closing of a person name. We can think of several strategies to solve this problem (for example, the method of [2] will be described in a later section), but we used a probabilistic method. The instances in the training corpus corresponding to a leaf of the decision tree may not all have the same tag. At a leaf we don't just record the most probable tag; rather, we keep the probabilities of all possible tags for that leaf. 
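The idea of keeping a full tag distribution at each leaf, rather than only the most frequent tag, can be sketched as follows. This is an illustrative structure assuming simple frequency counts; the `Leaf` class and its method names are hypothetical, not the authors' implementation.

```python
from collections import Counter

# Sketch: a decision-tree leaf that records the probabilities of all
# tags seen among the training instances reaching it, instead of
# committing to the single most frequent tag.

class Leaf:
    def __init__(self):
        self.counts = Counter()

    def add(self, tag):
        """Record the tag of one training instance that reached this leaf."""
        self.counts[tag] += 1

    def tag_probabilities(self):
        """Relative frequencies of all tags observed at this leaf."""
        total = sum(self.counts.values())
        return {tag: n / total for tag, n in self.counts.items()}
```

A leaf that saw "none" twice and "org-OP-CL" once would return probabilities of about 0.67 and 0.33, which is exactly the kind of distribution the running phase combines across tokens.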
In this way we can salvage cases where a tag is part of the most probable globally-consistent tagging of the text, even though it is not the most probable tag for this token, and so would be discarded if we made a deterministic decision at each token.</Paragraph> <Paragraph position="2"> Running Phase: In the running phase, the first three steps, token segmentation and part-of-speech tagging by JUMAN, analysis of character type, and special dictionary look-up, are identical to those in the training phase. Then, in order to find the probabilities of opening and closing a named entity for each token, the properties of the previous, current and following tokens are examined against the decision tree. Figure 2 shows two example paths in the decision tree. For each token, the probabilities of 'none' and the four combinations of answer pairs for each named entity type are assigned. For instance, if we have 7 named entity types, then 29 probabilities are generated.</Paragraph> <Paragraph position="3"> Once the probabilities for all the tokens in a sentence are assigned, the remaining task is to discover the most probable consistent path through the sentence. Here, a consistent path means, for example, that a path can't have org-OP-CN and date-OP-CL in a row, but can have loc-OP-CN and loc-CN-CL. The output is generated from the consistent sequence with the highest probability for each sentence. The Viterbi algorithm is used to find this sequence.</Paragraph> <Paragraph position="5"> Figure 1 shows an example sentence along with three types of information (part-of-speech, character type and special dictionary information) and the given information of opening and closing of named entities. Figure 2 shows two example paths in the decision tree. For the purpose of demonstration, we used the first and second tokens of the example sentence in Figure 1. Each line corresponds to a question asked by the tree nodes along the path. The last line shows the probabilities of named entity information which have more than 0.0 probability. 
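The consistent-path search described above can be sketched as a small Viterbi-style dynamic program. This is illustrative code, not the authors' implementation; the tag representation and the `consistent` rule are assumptions derived from the constraints stated in the text (an opened entity must be continued and closed by the same type).

```python
# Sketch of the consistent-path search over per-token tag probabilities.
# A tag is "none" or an (entity_type, part) pair, with part in
# {"OP-CL", "OP-CN", "CN-CN", "CN-CL"}.

def consistent(prev, cur):
    """May tag `cur` follow tag `prev`? E.g. org-OP-CN then date-OP-CL is
    inconsistent, but loc-OP-CN then loc-CN-CL is fine."""
    prev_open = prev != "none" and prev[1] in ("OP-CN", "CN-CN")
    cur_continues = cur != "none" and cur[1] in ("CN-CN", "CN-CL")
    if prev_open:
        # an open entity must be continued or closed with the same type
        return cur_continues and cur[0] == prev[0]
    # otherwise the next token may only be "none" or open a new entity
    return not cur_continues

def best_path(token_probs):
    """token_probs: one {tag: probability} dict per token in the sentence.
    Returns (probability, tag sequence) of the best consistent path."""
    paths = {t: (p, [t]) for t, p in token_probs[0].items()
             if consistent("none", t)}
    for probs in token_probs[1:]:
        new = {}
        for cur, p in probs.items():
            for prev, (pp, path) in paths.items():
                if consistent(prev, cur) and (cur not in new or pp * p > new[cur][0]):
                    new[cur] = (pp * p, path + [cur])
        paths = new
    # the path may not leave an entity open at the end of the sentence
    closed = {t: v for t, v in paths.items()
              if t == "none" or t[1] in ("OP-CL", "CN-CL")}
    return max(closed.values(), key=lambda v: v[0])
```

With 7 named entity types, each token indeed carries 7 x 4 + 1 = 29 probabilities, matching the count given in the text.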
This instance demonstrates how the probability method works. As we can see, the probability of none for the first token (Isuraeru = Israel) is higher than that for the opening of an organization (0.67 to 0.33), but in the second token (Keisatsu = Police), the probability of closing an organization is much higher than none (0.86 to 0.14). The combined probabilities of the two consistent paths are calculated. One of these paths makes the two tokens an organization entity, while along the other path, neither token is part of a named entity. The probability is higher in the first case (0.28) than in the latter case (0.09), so the two tokens are tagged as an organization entity.</Paragraph> </Section> </Paper>
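The combined figures quoted above follow directly from the per-token probabilities, as a quick arithmetic check confirms:

```python
# Combined path probabilities from the per-token figures in the text:
p_org  = 0.33 * 0.86   # Isuraeru opens ORG, Keisatsu closes ORG
p_none = 0.67 * 0.14   # neither token is part of a named entity

print(round(p_org, 2), round(p_none, 2))   # prints: 0.28 0.09
```

Since 0.28 > 0.09, the organization reading wins even though "none" was the locally most probable tag for the first token.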