<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2099"> <Title>Segmenting Sentences into Linky Strings Using D-bigram Statistics</Title> <Section position="6" start_page="588" end_page="689" type="evalu"> <SectionTitle> 5 RESULTS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="588" end_page="588" type="sub_section"> <SectionTitle> 5.1 Experimental Results Experiment Condition </SectionTitle> <Paragraph position="0"> LSS takes a set of non-separated sentences as its input and segments them into linky strings. For the test corpus we chose sentences at random from the training corpus.</Paragraph> <Paragraph position="1"> Sentences are segmented into 20-25 linky strings on average. In one sentence there are only one or two spots on average that break a morpheme into meaningless strings.</Paragraph> <Paragraph position="2"> With no linguistic knowledge, this can be said to be quite a good result.</Paragraph> <Paragraph position="3"> It is hard to check whether an extracted linky string is a right one; however, it is not that difficult to find over-segmented strings, for a linky string needs to hold the meaning. We checked those over-segmented linky strings against a dictionary, Iwanami Kokugo Jiten.</Paragraph> <Paragraph position="4"> Table 4 shows the numbers of over-segmented spots. The figure is the number of over-segmented spots, not the number of morphemes over-segmented. In Table 4, A and B are neighboring letters in a sentence which are forced to separate. 
The row &quot;kanji hiragana&quot; stands for over-segmented spots between a kanji letter and a hiragana letter.</Paragraph> <Paragraph position="5"> for test corpus: 302 To see the efficacy of d-bigram, we compare the experimental results of two sets of data: d-bigram data and bigram data.</Paragraph> </Section> <Section position="2" start_page="588" end_page="689" type="sub_section"> <SectionTitle> Experimental Results </SectionTitle> <Paragraph position="0"> As shown in Table 3, with d-bigram information only 7.39% of the segment spots are over-segmented.</Paragraph> <Paragraph position="1"> The ratio of over-segmented morphemes for each part of speech is shown in Table 5. 'K' stands for kanji, 'h' is for hiragana and 'k' is for katakana. There was no missegmentation between katakana and other character types. There also was not any</Paragraph> </Section> <Section position="3" start_page="689" end_page="689" type="sub_section"> <SectionTitle> 5.2 A Linky String Characteristics of Linky Strings </SectionTitle> <Paragraph position="0"> Linky strings in Japanese are not equal to conventional morphemes in Japanese. As discussed in section 1.2, it is not easy to decide an absolutely correct segmenting spot in a Japanese sentence.</Paragraph> <Paragraph position="1"> That is one of the reasons that we decided to extract linky strings instead of conventional morphemes. However, if those linky strings do not keep the meanings, they are useless.</Paragraph> <Paragraph position="2"> The result shows that the linking score works well enough not to segment sentences too much (Table 3). That is, we succeeded in extracting meaningful strings using only statistical information. Figure 7 shows some examples of extracted linky strings.</Paragraph> <Paragraph position="3"> Sometimes LSS extracts strings that look too long (Figure 8). This is not a bad result, though. When a linky string contains several morphemes in it, it is something like picking out idioms. 
A linky string with several morphemes may be a compound word, an idiom, or a fixed locution.</Paragraph> <Paragraph position="4"> The Concept of the Linky Strings Grammar-based NLP systems generally specify a target language. On the other hand, statistically-based approaches do not need rules or knowledge. This makes a statistically-based approach suitable for multilingual processing.</Paragraph> <Paragraph position="5"> LSS is not only for Japanese. With a corpus of non-separated sentences of any language, LSS can perform the same kind of segmentation.</Paragraph> <Paragraph position="6"> To deal with natural languages, most systems use conventional morphemes or words as their processing units. That is, most systems need to recognize morphemes or words in sentences, and they need to perform fairly good morphological analysis before the main processing. We have been working on processing natural languages in non-linguistic ways, though we do not know whether it is the right way in computational linguistics. A linky string is extracted with statistical information only, using no grammar or linguistic knowledge. The system does not need to behave like a native speaker of the target language; all it has to do is check statistical information, which is what computers are good at. We expect that linky strings can be a key to solving problems in NLP.</Paragraph> <Paragraph position="7"> Compound Words The results show that the system has 7.39% incorrect segmentation. This result is based on a Japanese dictionary, and when a morpheme listed in the dictionary gets separated, we count it as over-segmented. However, a dictionary often holds compound words. That is, some of the segment spots which we have counted as &quot;over-segmented&quot; are not really over-segmented. 
From this point of view, the percentage of over-segmentation is actually even lower.</Paragraph> <Paragraph position="8"> Inflections Verbs, adjectives, adverbs and auxiliary verbs are inflected in Japanese. In the experimental result, 89.7% (with d-bigram data) of the over-segmented spots between kanji and hiragana occur in inflective morphemes. We decided correct segmenting spots for inflective morphemes according to a Japanese dictionary. According to statistical information, the segmenting method for inflective morphemes is different from the grammatical one. So most of the over-segmented spots can be treated as correct segmenting spots according to statistical information.</Paragraph> </Section> <Section position="4" start_page="689" end_page="689" type="sub_section"> <SectionTitle> 5.3 D-bigram Statistics </SectionTitle> <Paragraph position="0"> According to Table 3, it seems that with the bigram method the output is apt to be more segmented than with the d-bigram method.</Paragraph> <Paragraph position="1"> This happens because bigram cannot pick out long strings. Bigram does not hold information between remote letters (actually, letters more than one letter apart). That makes long strings of letters easily segmented. When LSS checks a three-letter morpheme ABC, with bigram data it can see the string only as A-B and B-C. If the strings AB and BC do not appear often, the linking scores get low and LSS decides to segment between A-B and B-C. However, with d-bigram data LSS can get the information between A and C as well, which helps it recognize that A, B and C often occur together. This happens frequently between two katakana letters (Table 4), because of the usage of katakana letters in Japanese.</Paragraph> <Paragraph position="2"> This does not mean that with the d-bigram method sentences are less likely to be segmented. As shown in Figure 9, the distribution is not so different between the two methods. 
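The three-letter case described above, where bigram sees ABC only as the adjacent pairs A-B and B-C while d-bigram also counts the remote pair A and C, can be sketched as follows. This is a minimal illustration with hypothetical names and a simple co-occurrence score, not the paper's exact d-bigram formula:

```python
from collections import Counter

def train(corpus, max_d=3):
    # Count letter pairs at every distance d up to max_d; d = 1 is the
    # plain bigram case.  The counting scheme is an illustrative
    # simplification, not the paper's exact d-bigram statistics.
    uni, pair = Counter(), Counter()
    for sent in corpus:
        uni.update(sent)
        for d in range(1, max_d + 1):
            for i in range(len(sent) - d):
                pair[(sent[i], sent[i + d], d)] += 1
    return uni, pair

def link_score(uni, pair, sent, i, max_d=1):
    # Hypothetical linking score for the boundary between sent[i] and
    # sent[i + 1].  With max_d = 1 only the adjacent pair contributes
    # (bigram); with max_d > 1, remote pairs that straddle the boundary
    # also contribute, down-weighted by their distance (d-bigram).
    score = 0.0
    for d in range(1, max_d + 1):
        for j in range(max(0, i + 1 - d), i + 1):
            if j + d < len(sent):
                a, b = sent[j], sent[j + d]
                score += pair[(a, b, d)] / (uni[a] * uni[b]) / d
    return score

# Toy corpus in which the string ABC recurs with varying neighbors.
corpus = ["ABCD", "ABCE", "XABCY"]
uni, pair = train(corpus, max_d=3)
bigram_only = link_score(uni, pair, "ABCD", 1, max_d=1)   # B-C boundary, bigram view
with_dbigram = link_score(uni, pair, "ABCD", 1, max_d=3)  # adds the remote A..C evidence
```

Because the d-bigram score only adds nonnegative remote-pair terms, the score with d-bigram data is never lower than the bigram-only score at the same boundary; the extra evidence between A and C is what keeps a frequent long string such as ABC from being cut at a rare internal spot.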
The x-axis shows the number of linky strings in a sentence and the y-axis shows the number of sentences with x linky strings.</Paragraph> <Paragraph position="3"> According to Figure 9, the distributions of sentences are not so different between the method with d-bigram and the one with bigram.</Paragraph> </Section> </Section> </Paper>