<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1036">
  <Title>SEGMENTING A SENTENCE INTO MORPHEMES USING STATISTIC INFORMATION BETWEEN WORDS</Title>
  <Section position="7" start_page="229" end_page="230" type="evalu">
    <SectionTitle>
5 RESULTS
</SectionTitle>
    <Paragraph position="0"> Implement MSS to all input sentences and get the score of each segmentation. After getting the list of segmentations, look for the 'correct' segmentedsentence and see where in the list tile right one is. The data shows the scores the 'correct' segmentations  the very sentences in tile corpus replaced one rnorllheme in the sentence (the buried morpheme is in the corpus) replaced one morpheme in the sentence (tile buried morpbeme is not in the corpus) sentences not in the corpus (the morphemes are all in tim corpus) sentences not in the corpus (include morphemes not; in the corpus)</Paragraph>
    <Section position="1" start_page="229" end_page="230" type="sub_section">
      <SectionTitle>
5.1 Experiment in Japanese
</SectionTitle>
      <Paragraph position="0"> According to the experimental results(table 2), it is obvious that MSS is w.'ry useful. The table 2 shows that most of the sentences, no matter whether the sentences are in the. corpus or not, are segmented correctly. We find the right segmentation getting the best score in the list of possible segmentations, c~ is tile data when the input sentences are in corpus.</Paragraph>
      <Paragraph position="1"> That is, all the 'correct' morphemes have association between each other. That have a strong effect in calculating the sco,'es of sentences. The condition is almost same for fl and 7. Though the sentence has one word replaced, all other words in the sentence have relationship between them. Tim sentences in 7 inelude one word which is not in the corpus, but still tile 'correct' sentence can get the best score among the possibilities. We can say that the data c~, fl and 7 are very successfld.</Paragraph>
      <Paragraph position="2">  llowever, we shouhl remember that not all the sentences in the given corpus wouht get the best score through the list. MSS does trot cheek the corpus itself when it calculate the score, it just use the Mid, the essential information of the corpus. That is, whether the input sentence is written in the corpus or not does not make any effect in calculating scores directly. Ilowever, since MSS uses Mid to calculate the. scores, the fact that every two morphemes in the sentence have connection between them raises the score higher.</Paragraph>
      <Paragraph position="3"> When it comes to the sentences which are not in corpus themselves, the ratio that the 'correct' sentence get the best score gets down (see table 2, data ~, e).</Paragraph>
      <Paragraph position="4"> The sentences of 6 and g are not found in the corpus. Even some sentences which are of spoken language and not grammatically correct are included in the input sentences. It can be said that those ~ and e sentences arc nearer to the real worhl of Japanese language. For ti sentences we used only morphemes which are in the corpus. That means that all tim morphenres used in the 5 sentences have their own MI,I. And e sentences have both morphemes it( the corpus and the ones not in the corpus. The morphemes which arc not in the corpus do not have any Ml(l. Table 2 shows that MSS gets quite good result eve(, though the input sentences arc not in the corpus. MSS do not take the necessary information directly from the co&gt; pus and it uses the MIa instead. This method makes the information generalize.d and this is the reason why 5 and e can get good results too. Mid comes to }&gt;e the key to use the effect of the MI between morphemes indirectly so that wc can put the information of the mssoeiation between morphemes to practical use. This is what we expected and MSS works successfldly at this point.</Paragraph>
    </Section>
    <Section position="2" start_page="230" end_page="230" type="sub_section">
      <SectionTitle>
5.2 The Corpus
</SectionTitle>
      <Paragraph position="0"> In this paper we used the translation of English textbooks for Japanese junior high school students. Primary textbooks are a kind of closed world which has limited words in it, and the included sentences are mostly in some fixed styles, in good grammar. The corpus we used in this paper has about 630 sentences which have three types of Japanese letters all mixed.</Paragraph>
      <Paragraph position="1"> This corpus is too small to take ms a model of the ,'eal world, however, for this pal&gt;e( it is big enough. Actually, the results of this paper shows that this system works efficiently even though the corpus is small.</Paragraph>
      <Paragraph position="2"> The dictionary an&lt;l the statistical information are got from the given corpus. So, the experimental re= suit totally depends on the corpus. That is, selecting which corpus to take to implement, we can use I.his system ill many purposes(section 5.5).</Paragraph>
    </Section>
    <Section position="3" start_page="230" end_page="230" type="sub_section">
      <SectionTitle>
5.3 Comparison with the Other
Methods
</SectionTitle>
      <Paragraph position="0"> It is not easy to compare this system with other seg,nenting methods. We coral)are with tile least-bunsetsu method here ill this paper.</Paragraph>
      <Paragraph position="1"> The least-bunselsv method segment the given sentences into morphemes and fin(l the segmentations with least bunselsu. This method makes all the segmentation first an(l selects the seems-like-best segmentations. This is the same way MSS does. The difference is that the least-bdnsetsv method checkes the nmnber of tile bumselsu instead of calculating the scores of sen(elites.</Paragraph>
      <Paragraph position="2"> Let us think about implementing a sentence the morl)hcmes are l,ot in the dictionary. That means that the morphemes do not have any statistical informations between them. In this situation MSS can not use statistical informations to get the scores. Of course MSS caliculate the scores of sentences accord: ing to tile statistical informations between given morphemes, llowe.ver, all the Ml,l say that they have no association I)etween t\]le (~lorpherlles. When there is no possibility that the two morl&gt;hemes appears together ill the corpus, we give a minus score ~s tit('. Ml,t wdue, so, as the result, with more morphemes the score of the+ sentence gets lower. That is, tire segmentation which has less segments ill it gets better scores. Now compare it with the least-bunsetsu method. With using MSS the h.'ast-morpheme segme.ntations are selected as the goo(I answer, q'hat is tile same way the least-bunsetsu method selects the best one. '\['his means that MSS and the least-bttnscts.le method have the same efficiency when it comes to the sentences which morl(hemes are not in the corpus. It is obvious that when the sentence has morphemes in the corpus the ellicie.ncy of this systern gets umch higher(table 2).</Paragraph>
      <Paragraph position="3"> Now it is proved that MSS is, at least, as etli: cicnt as the least-b'unsets'~ nmthod, no matter what sentence it takes. We show a data which describes I.his(tabh~ 3).</Paragraph>
      <Paragraph position="4"> &amp;quot;Fable 3 is a good exanq)le of the c;use whelL the. input sentence has few morphemes which are in the corl)uS. This dal.a shows that in I.his situal.ion I.here is an outstanding relation between the number of morl)hemes and the scores of the segmented se.ntenees.</Paragraph>
      <Paragraph position="5"> This example(table 3) has an ambiguity how to segment the sentence using the registere(l morphemes, and all the morphemes which causes the alnbiguity are not in the given (:orpus. Those umrl)hemes not in the corpus do not have any statistical information betweel, them and we have no way to select which is bett&lt;.'r. So, the scores of sentences are Ul) to the length of the s&lt;~gmented sentence, that is, the number how many morl)hemes the sentence has. '\['he segmented sentence which has least segments gets the best score, since MSS gives a minus score for unknown mssociation between morphemes. That means that with more segments in the sentence the score gets lower. This sit-</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="230" end_page="230" type="evalu">
    <SectionTitle>
ZT/
</SectionTitle>
    <Paragraph position="0"> input : a non-segmented Japanese tliragana sentence not in the corpus all unknown morphemes in the sentence are registered in the (lictionary (some morphemes in the corpus are included) &amp;quot; sumomo mo nlonlo hie memo no ilCh\]  in the corpus : &amp;quot; no .... Ill(lllO &amp;quot; morphemes not included : &amp;quot; IAI .... ~4!. ~ &amp;quot; in the corpus : &amp;quot; uchi .... sunm &amp;quot; &amp;quot; sumomo &amp;quot; *' hie j~ &amp;quot; ~t&amp;quot; 'P nlOUlO ~p uation is resemble to the way how the least-bunseisu method selects the answer.</Paragraph>
    <Section position="1" start_page="230" end_page="230" type="sub_section">
      <SectionTitle>
5.4 Experiment in Chinese
</SectionTitle>
      <Paragraph position="0"> The theme of tiffs paper is to segment non-separaLe(\] language sentences into morphemes. In this paper we described on segmentation of Japanese non-segmented sentences only but we are working on Chinese sentences too. This MSS is not for Japanese only. It can be used for other non-separated languages too. &amp;quot;lb implement for other languages, we just need to prepare the corpus for that and make up the dictionary from it.</Paragraph>
      <Paragraph position="1"> llere is the example of implementing MSS for Chinese language(table 4). The input is a string of characters which shows the pronounciations of a Chinese sentence. MSS changes it into Chinese character senteces, segmenting the given string.</Paragraph>
    </Section>
    <Section position="2" start_page="230" end_page="230" type="sub_section">
      <SectionTitle>
5.5 Changing the Corpus
</SectionTitle>
      <Paragraph position="0"> To implement tiffs MSS system, we only need a eel pus. The dictionary is made from the corpus. This Tal)le 4: Experiment in Chinese input : nashiyizhangditu.</Paragraph>
      <Paragraph position="1"> correct answer output sentences scores -~ )J\[~ ~: --,~ ;t~. 15.04735 )Jl~ ~! --~ .tt~\]. -14.80836 )JI~ {0~ --'\]~ ~1~. -14.80836 gives MSS system a lot of usages and posibilities. Most of the NLP systems need grammatical i,ffof malleus, and it is very hard to make up a certain grammatical rule to use in a NLP. The corpus MSS needs to implement is very easy to get. As it is described in the previous section, a corpus is a set of real sentence.s. We can use IVISS in other languages or in other purposes just getting a certain corpus for that and making up a dictionary from the corpus. That is, MSS is available in many lmrposes with very simple, easy preparation.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>