<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1036"> <Title>SEGMENTING A SENTENCE INTO MORPHEMES USING STATISTIC INFORMATION BETWEEN WORDS</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 INTRODUCTION AND MOTIVATION </SectionTitle> <Paragraph position="0"> The words of an English sentence are separated by spaces, so it is easy to divide an English sentence into words. A Japanese sentence, however, must be parsed if one wants to pick out the words in it. This paper is about dividing sentences of non-separated languages into words (morphemes) without using any grammatical information. Instead, the system uses statistic information between morphemes to select the best ways of segmenting sentences in non-separated languages.</Paragraph> <Paragraph position="1"> When segmenting a sentence into pieces, it is not very hard to divide the sentence using a suitable dictionary. The problem is how to decide which segmentation is the best answer. For example, there must be several ways of segmenting a Japanese sentence written in Hiragana (the Japanese alphabet), and often many more than 'several'. So, to make the segmenting system useful, we have to consider how to pick the right segmented sentences out of all the seemingly segmented candidates. This system uses statistical information between morphemes to see how 'sentence-like' (how likely to occur as a sentence) a segmented string is. To capture the statistical association between words, mutual information (MI) is one of the most interesting methods. In this paper MI is used to calculate the relationship between words found in the given sentence. A corpus of sentences is used to obtain the MI.</Paragraph> <Paragraph position="2"> To implement this method, we built a system MSS (Morphological Segmentation using Statistical information). 
What MSS does is to find the best way of segmenting a sentence of a non-separated language into morphemes without depending on grammatical information. We can apply this system to many languages.</Paragraph> </Section> <Section position="4" start_page="0" end_page="227" type="metho"> <SectionTitle> 2 MORPHOLOGICAL ANALYSIS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 What a Morphological Analysis Is </SectionTitle> <Paragraph position="0"> A morpheme is the smallest unit of a string of characters which has a certain linguistic meaning in itself. It includes both content words and function words. In this paper a morpheme is defined as a string of characters which can be looked up in the dictionary.</Paragraph> <Paragraph position="1"> Morphological analysis is to: 1) recognize the smallest units making up the given sentence (if the sentence is of a non-separated language, divide the sentence into morphemes: automatic segmentation), and 2) check whether the morphemes are the right units to make up the sentence.</Paragraph> </Section> <Section position="2" start_page="0" end_page="227" type="sub_section"> <SectionTitle> 2.2 Segmenting Methods </SectionTitle> <Paragraph position="0"> There are several ways to segment a non-separated sentence into meaningful morphemes. The three methods explained below are the most popular ones for segmenting Japanese sentences.</Paragraph> <Paragraph position="1"> * The longest-segment method: Read the given sentence from left to right and cut off the longest possible segment. For example, given 'isheold', we first look for segments which use the first few letters of it, 'i' and 'is'. Since 'is' is longer than 'i', the system takes 'is' as the segment. 
Then it applies the same method to find the segments in 'heold' and finds 'he' and 'old'.</Paragraph> <Paragraph position="2"> * The least-bunsetsu segmenting method: Get all the possible segmentations of the input sentence and choose the segmentation(s) with the fewest bunsetsu. This method suits Japanese sentences, in which a content word and its function words usually form one bunsetsu. It helps avoid cutting a sentence into too many small, meaningless pieces.</Paragraph> <Paragraph position="3"> * The letter-type segmenting method: Japanese has three kinds of letters, called Hiragana, Katakana and Kanji. This method divides a Japanese sentence into meaningful segments by checking the type of the letters.</Paragraph> </Section> <Section position="3" start_page="227" end_page="227" type="sub_section"> <SectionTitle> 2.3 The Necessity of Morphological Analysis </SectionTitle> <Paragraph position="0"> When we translate an English sentence into another language, the easiest way is to replace the words in the sentence with the corresponding words of the target language. That is not a very hard job: all we have to do is look the words up in the dictionary. However, when it comes to a non-separated language, it is not as simple. A non-separated language does not show the segments included in a sentence; for example, a Japanese sentence has no spaces between words. A Japanese-speaking person can divide a Japanese sentence into words very easily, but without any knowledge of Japanese it is impossible. When we want a machine to translate a non-separated language into another language, we first need to segment the given sentence into words.</Paragraph> <Paragraph position="1"> Japanese is not the only language which needs morphological segmentation. For example, Chinese and Korean are non-separated too. We can apply this MSS system to those languages as well, with very simple preparation. 
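The longest-segment method described in section 2.2 can be sketched as follows (a minimal illustration with a toy English lexicon; the function name and vocabulary are ours, not the paper's):

```python
# Greedy longest-match segmentation, as in the 'isheold' example:
# at each position, take the longest dictionary word that matches.
DICTIONARY = {"i", "is", "he", "old"}  # toy lexicon for illustration

def longest_match(sentence: str) -> list[str]:
    segments = []
    pos = 0
    while pos < len(sentence):
        # try the longest candidate first, shrinking until a match is found
        for end in range(len(sentence), pos, -1):
            if sentence[pos:end] in DICTIONARY:
                segments.append(sentence[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no dictionary word starts at position {pos}")
    return segments

print(longest_match("isheold"))  # ['is', 'he', 'old']
```

As the example shows, this greedy strategy commits to 'is' before ever considering 'i', which is exactly why it can fail on ambiguous sentences.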
We do not have to change the system, just prepare a corpus for the purpose.</Paragraph> </Section> <Section position="4" start_page="227" end_page="227" type="sub_section"> <SectionTitle> 2.4 Problems of Morphological Analysis </SectionTitle> <Paragraph position="0"> The biggest problems in segmenting a sentence of a non-separated language are ambiguity and unknown words.</Paragraph> <Paragraph position="1"> For example, niwanihaniwatorigairu.</Paragraph> <Paragraph position="2"> niwa niha niwatori ga iru A cock is in the yard.</Paragraph> <Paragraph position="3"> niwa niha niwa tori ga iru Two birds are in the yard.</Paragraph> <Paragraph position="5"> niwa ni haniwa tori ga iru A clay-figure robber is in the yard. Those sentences are all made of the same string but the morphemes included are different. With different segmentations a sentence can have several meanings. Japanese has three types of letters: Hiragana, Katakana and Kanji. Hiragana and Katakana are both phonetic symbols, and each Kanji letter has its own meaning. Several Kanji spellings can correspond to one Hiragana word. This makes morphological analysis of Japanese sentences very difficult. A Japanese sentence can have more than one morphological segmentation and it is not easy to figure out which one makes sense. Even two or more segmentations can be 'correct' for one sentence. To get the right segmentation of a sentence one may need not only morphological analysis but also semantic analysis or grammatical parsing. In this paper no grammatical information is used, and MI between morphemes becomes the key to solving this problem.</Paragraph> <Paragraph position="6"> Dealing with unknown words is a big problem in natural language processing (NLP) too. To recognize unknown segments in the sentences, we would have to discuss the likelihood of an unknown segment being a linguistic word. In this paper unknown words are not accepted as morphemes. 
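The ambiguity of niwanihaniwatorigairu can be made concrete with a small sketch that enumerates every dictionary segmentation of a string (the romanized lexicon below is assumed for illustration; the real system works on Hiragana):

```python
# Toy romanized lexicon covering the three readings in the example.
LEXICON = {"niwa", "ni", "niha", "ha", "niwatori", "haniwa", "tori", "ga", "iru"}

def segmentations(s: str) -> list[list[str]]:
    """Every way to split s into lexicon morphemes (exhaustive search)."""
    if not s:
        return [[]]  # one way to segment the empty string: no morphemes
    results = []
    for end in range(1, len(s) + 1):
        head = s[:end]
        if head in LEXICON:
            results += [[head] + rest for rest in segmentations(s[end:])]
    return results

for seg in segmentations("niwanihaniwatorigairu"):
    print(" ".join(seg))
```

All three readings above ('niwa niha niwatori ga iru', 'niwa niha niwa tori ga iru', 'niwa ni haniwa tori ga iru') appear among the output, together with other meaningless splits; choosing among such candidates is the problem the scores of section 3 address.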
We define a 'morpheme' as a string of characters which is registered in the dictionary.</Paragraph> </Section> </Section> <Section position="5" start_page="227" end_page="228" type="metho"> <SectionTitle> 3 CALCULATING THE SCORES OF SENTENCES </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="227" end_page="227" type="sub_section"> <SectionTitle> 3.1 Scores of Sentences </SectionTitle> <Paragraph position="0"> When the system searches for ways to divide a sentence into morphemes, more than one segmentation comes out most of the time. What we want is the one (or more) 'correct' segmentation, and we do not need any other possibilities. If there are many ways of segmenting, we need to select the best one of them. For that purpose the system introduces the 'scores of sentences'.</Paragraph> </Section> <Section position="2" start_page="227" end_page="228" type="sub_section"> <SectionTitle> 3.2 Mutual Information </SectionTitle> <Paragraph position="0"> Mutual information (MI)[1][2][3] is the information of the association of several things. In NLP, MI is used to see the relationship between two (or more) certain words.</Paragraph> <Paragraph position="1"> The expression below shows the definition of the MI of two words w1 and w2: MI(w1, w2) = log( P(w1, w2) / ( P(w1) P(w2) ) ) (1) where P(w) is the probability that the word w appears in the corpus and P(w1, w2) is the probability that w1 and w2 appear together in a corpus.</Paragraph> <Paragraph position="3"> This expression means that when w1 and w2 have a strong association between them, P(w1)P(w2) << P(w1, w2).</Paragraph> <Paragraph position="5"/> </Section> <Section position="3" start_page="228" end_page="228" type="sub_section"> <SectionTitle> 3.3 Calculating the Score of a Sentence </SectionTitle> <Paragraph position="0"> Using the words in the given dictionary, it is easy to make up a 'sentence'. However, it is hard to judge whether that 'sentence' is a correct one or not. A 'correct sentence' here means a sentence which makes sense. For example, 'I am Tom.' can make sense, whereas 'Green the adzabak arc the a ran four.' 
is hardly taken as a meaningful sentence. The score shows how 'sentence-like' the given string of morphemes is. Segmenting a non-separated language sentence, we often get a lot of meaningless strings of morphemes. To pick seemingly meaningful strings out of the segmentations, we use MI.</Paragraph> <Paragraph position="2"> Actually, what we use in the calculation is not the plain MI described in section 3.2. The MI expression in section 3.2 is built on bigrams: a bigram is the probability of two certain words appearing together in a corpus, as seen in expression (1). Instead of the bigram we use a new notion named d-bigram in this paper[3].</Paragraph> <Paragraph position="3"> The ideas of bigrams and trigrams are often used in studies on NLP. A bigram is the information of the association between two certain words and a trigram is the information among three. We use the new idea of the d-bigram in this paper[3]. A d-bigram is the probability that two words w1 and w2 appear together at a distance of d words in a corpus. For example, if we get 'he is Tom' as the input sentence, we have three d-bigram data: ('he' 'is' 1), ('is' 'Tom' 1) and ('he' 'Tom' 2).</Paragraph> <Paragraph position="5"> ('he' 'is' 1) means the information of the association of the two words 'he' and 'is' appearing at the distance of 1 word in the corpus.</Paragraph> </Section> <Section position="4" start_page="228" end_page="228" type="sub_section"> <SectionTitle> 3.4 Calculation </SectionTitle> <Paragraph position="0"> The expression to calculate the score between two words is[3]: MId(w1, w2, d) = log( P(w1, w2, d) / ( P(w1) P(w2) ) ) (2) where</Paragraph> <Paragraph position="2"> d : the distance of the two words w1 and w2, P(wi) : the probability that the word wi appears in the corpus, P(w1, w2, d) : the probability that w1 and w2 come out d words away from each other in the corpus. As the value of MId gets bigger, the stronger the association between those words is. The score of a sentence is calculated from these MId data (expression (2)). 
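The d-bigram data and the MId score of expression (2) can be sketched as follows (the natural logarithm is assumed here, since this extract does not fix the base; function names are ours, not the paper's):

```python
import math
from itertools import combinations

def d_bigrams(words):
    """All (w1, w2, d) triples of a segmented sentence, as for 'he is Tom'."""
    return [(words[i], words[j], j - i)
            for i, j in combinations(range(len(words)), 2)]

def mi_d(p_w1, p_w2, p_pair_d):
    """MId(w1, w2, d) = log( P(w1, w2, d) / (P(w1) P(w2)) ) -- expression (2)."""
    return math.log(p_pair_d / (p_w1 * p_w2))

print(d_bigrams(["he", "is", "Tom"]))
# [('he', 'is', 1), ('he', 'Tom', 2), ('is', 'Tom', 1)]
```

In practice the probabilities would be relative frequencies counted over the segmented corpus of section 4.2.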
The definition of the sentence score is[1]: Score = the sum of MId(wi, wj, d) / d^2 over all pairs of words (wi, wj) in the sentence, where d = j - i (3)</Paragraph> <Paragraph position="4"> This expression (3) calculates the score with the algorithm below: 1) Calculate MId for every pair of words included in the given sentence.</Paragraph> <Paragraph position="5"> 2) Give each MId a weight according to the distance d.</Paragraph> <Paragraph position="6"> 3) Sum up those weighted MId. The sum is the score of the sentence.</Paragraph> <Paragraph position="7"> Church and Hanks said in their paper[1] that the information between two remote words has less meaning in a sentence when it comes to semantic analysis. Following this idea we put d^2 in the expression so that nearer pairs are more effective in calculating the score of the sentence.</Paragraph> </Section> </Section> <Section position="6" start_page="228" end_page="229" type="metho"> <SectionTitle> 4 THE SYSTEM MSS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="228" end_page="229" type="sub_section"> <SectionTitle> 4.1 Overview </SectionTitle> <Paragraph position="0"> MSS takes a Hiragana sentence as its input. First, MSS picks up the morphemes found in the given sentence by checking the dictionary. The system reads the sentence from left to right, cutting out every possibility. Each segment of the sentence is looked up in the dictionary, and if it is found there the system recognizes the segment as a morpheme. Those morphemes are replaced by their corresponding Kanji (or Hiragana, Katakana or mixed) morpheme(s). As told in section 2.4, a Hiragana morpheme can have several corresponding Kanji (or other-lettered) morphemes. In that case all the segments corresponding to the found Hiragana morpheme are memorized as morphemes found in the sentence. 
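The three scoring steps above, with the 1/d^2 weighting, can be sketched like this (the probability tables are toy inputs; the real system estimates them from the corpus, and the natural logarithm is assumed):

```python
import math
from itertools import combinations

def sentence_score(words, prob, pair_prob):
    """Expression (3): sum MId(wi, wj, d) / d**2 over all word pairs.

    prob maps a word w to P(w); pair_prob maps (w1, w2, d) to P(w1, w2, d).
    """
    score = 0.0
    for i, j in combinations(range(len(words)), 2):
        d = j - i
        mi_d = math.log(pair_prob[(words[i], words[j], d)]
                        / (prob[words[i]] * prob[words[j]]))
        score += mi_d / d ** 2  # nearer pairs weigh more
    return score
```

With this weighting, a pair at distance 2 contributes only a quarter of what the same MId would contribute at distance 1.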
All the found morphemes are numbered by their position in the sentence.</Paragraph> <Paragraph position="1"> After picking up all the morphemes in the sentence the system tries to put them together and build them back up into sentences (table 1).</Paragraph> <Paragraph position="2"> Input a Hiragana sentence.</Paragraph> <Paragraph position="3"> Cut out the morphemes. Make up sentences with the morphemes.</Paragraph> <Paragraph position="4"> Calculate the score of the sentences using the mutual information.</Paragraph> <Paragraph position="5"> Compare the scores of all the made-up sentences and take the best-marked one as the most 'sentence-like' sentence.</Paragraph> <Paragraph position="6"> Then the system compares those sentences made up of the found morphemes and sees which one is the most 'sentence-like'. For that purpose the system calculates the score of likelihood of each sentence (section 3.4).</Paragraph> <Paragraph position="8"/> </Section> <Section position="2" start_page="229" end_page="229" type="sub_section"> <SectionTitle> 4.2 The Corpus </SectionTitle> <Paragraph position="0"> A corpus is a set of sentences. These sentences are of the target language. For example, when we apply this system to Japanese morphological analysis we need a corpus of Japanese sentences which are already segmented.</Paragraph> <Paragraph position="1"> The corpus prepared for this paper is the translation of English textbooks for Japanese junior high school students. The reason why we selected junior high school textbooks is that the sentences in the textbooks are simple and do not include too many words. This is a good environment for evaluating this system.</Paragraph> </Section> <Section position="3" start_page="229" end_page="229" type="sub_section"> <SectionTitle> 4.3 The Dictionary </SectionTitle> <Paragraph position="0"> The dictionary for MSS is made of two parts. One is the heading words and the other is the morphemes corresponding to the headings. 
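The two-part dictionary can be pictured as a mapping from a heading to the list of morphemes attached to it (the entries below are invented examples, not taken from the paper's dictionary):

```python
# Each Hiragana heading maps to a list of corresponding morphemes,
# since one heading may have several Kanji spellings.
HEADINGS = {
    "にわ": ["庭", "二羽"],  # niwa: 'garden' / 'two (birds)'
    "とり": ["鳥"],          # tori: 'bird'
}

def lookup(heading):
    """All morphemes attached to a heading (empty list if unknown)."""
    return HEADINGS.get(heading, [])

print(lookup("にわ"))  # ['庭', '二羽']
```

An unknown heading simply yields an empty list, matching the paper's decision not to accept unknown words as morphemes.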
There may be more than one morpheme attached to one heading word.</Paragraph> <Paragraph position="1"> The second part, which holds the morphemes, is of list type, so that it can hold several morphemes.</Paragraph> </Section> </Section> </Paper>