<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1084"> <Title>Automatic Semantic Sequence Extraction from Unrestricted Non-Tagged Texts</Title> <Section position="3" start_page="0" end_page="579" type="metho"> <SectionTitle> 2 Japanese Characters and Terms </SectionTitle> <Paragraph position="0"> Taking a word as a basic semantic unit simplifies the confusing tasks of processing real languages.</Paragraph> <Paragraph position="1"> However, single words are often not a good unit with regard to the meaning of the context, because of the polysemy of words (Fung, 1998). Instead, a phrase or a term can be taken as the smallest semantic unit.</Paragraph> <Paragraph position="2"> Most phrase/term extraction systems are based on recognizing noun phrases, or domain-specific terms, from large corpora. Argamon et al. (1998) proposed a memory-based approach for noun phrases, which learns patterns with several sub-patterns. Ananiadou (1994) proposed a methodology based on term recognition using morphological rules.</Paragraph> <Section position="1" start_page="0" end_page="579" type="sub_section"> <SectionTitle> 2.1 Term Extraction in Japanese </SectionTitle> <Paragraph position="0"> Japanese has no separator between words. Much research on noun phrase extraction has been done in Japanese as well, in both stochastic and grammatical ways. Among stochastic approaches the n-gram is one of the most attractive models. Noun phrase extraction (Nagao and Mori, 1994), word segmentation (Oda and Kita, 1999) and dictionary extraction are the major issues. There is also much research on segmentation according to entropy. 
Since Japanese has a great number of characters, the use of letter-level information is also a very interesting issue.</Paragraph> </Section> <Section position="2" start_page="579" end_page="579" type="sub_section"> <SectionTitle> 2.2 Characters in Japanese </SectionTitle> <Paragraph position="0"> Unlike English, Japanese has a great amount of characters in daily use. Japanese is special not only for its huge character set but also for containing three character types. Hiragana is a set of 71 phonetic characters, which are mostly used for function words, inflections and adverbs.</Paragraph> <Paragraph position="1"> Katakana is also a set of phonetic characters, each related to a hiragana character. Its use is mainly restricted to the representation of foreign words. It is also used to represent pronunciations. Kanji is a set of Chinese-origin characters. There are thousands of kanji characters, and each kanji holds its own meaning. They are used to represent content words. We also use alphabetical characters and Arabic numerals.</Paragraph> </Section> </Section> <Section position="4" start_page="579" end_page="579" type="metho"> <SectionTitle> 3 Overview </SectionTitle> <Paragraph position="0"> This system takes Japanese sentences as input.</Paragraph> <Paragraph position="1"> It processes sentences one by one, and we obtain segments of the sentences which are recognized as meaningful sequences as output. The flow of this system is as follows (Figure 1): Our system</Paragraph> </Section> <Section position="5" start_page="579" end_page="580" type="metho"> <SectionTitle> TRAINING EXTRACTION </SectionTitle> <Paragraph position="0"> takes one sentence as input at one time, and calculates the scores between two neighboring letters according to the statistical data derived from the training corpus. 
After scoring, the system decides which sequences to extract.</Paragraph> <Section position="1" start_page="579" end_page="579" type="sub_section"> <SectionTitle> 3.1 Automatic Sequence Extraction </SectionTitle> <Paragraph position="0"> Nobesawa et al. (1996; 1999) proposed a system which estimates the likelihood that a string of letters is a meaningful block in a sentence. This method does not need any lexical knowledge, and they showed that it was possible to segment sentences in a meaningful way with only statistical information between letters. The experiment was in Japanese, and they also showed that the cooccurrence information between Japanese letters had enough information for estimating the connection of letters.</Paragraph> <Paragraph position="1"> We use this point in this paper, and ran experiments on extracting meaningful sequences in email message texts to make up for the lack of vocabulary in dictionaries.</Paragraph> </Section> <Section position="2" start_page="579" end_page="579" type="sub_section"> <SectionTitle> 3.2 Scoring </SectionTitle> <Paragraph position="0"> Our system introduces the linking score, which indicates the likelihood that two letters neighbor each other as (part of) a meaningful string (Nobesawa et al., 1996).</Paragraph> <Paragraph position="1"> With neighboring bigrams alone it is impossible to distinguish the event 'XY' in 'AXYB' from that in 'CXYD'. Thus we introduce the d-bigram, which is bigram cooccurrence information that takes the distance into account (Tsutsumi et al., 1993).</Paragraph> <Paragraph position="2"> Expression (1) calculates the score between two neighboring letters;</Paragraph> <Paragraph position="4"> score(w_i, w_{i+1}) = \sum_{d=1}^{d_max} \sum_{j \le i < j+d} g(d) I(w_j, w_{j+d}; d) (1) where w_i is an event (a letter), d is the distance between two events, d_max is the maximum distance used in the processing (we set d_max = 5), and g(d) is the weight function on distance (for this system g(d) = d^{-2} (Sano et al., 1996), to decrease the influence of the d-bigrams as the distance gets longer (Church and Hanks, 1989)). 
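To make the scoring concrete, here is a minimal Python sketch of the scheme in Section 3.2: d-bigram counts are collected from a training corpus, Expression (2) is estimated with maximum-likelihood probabilities from raw counts, and the linking score sums the distance-weighted mutual information of every letter pair spanning the boundary. This is our illustrative reading of the method, not the authors' implementation; all function names are ours.

```python
import math
from collections import Counter

def train_dbigram(sentences, dmax=5):
    """Collect unigram counts and distance-tagged bigram (d-bigram) counts."""
    uni, dbi, pairs_per_d = Counter(), Counter(), Counter()
    n_uni = 0
    for s in sentences:
        for ch in s:
            uni[ch] += 1
            n_uni += 1
        for i in range(len(s)):
            for d in range(1, dmax + 1):
                if i + d < len(s):
                    dbi[(s[i], s[i + d], d)] += 1   # pair (x, y) at distance d
                    pairs_per_d[d] += 1
    return uni, n_uni, dbi, pairs_per_d

def mi(x, y, d, uni, n_uni, dbi, pairs_per_d):
    """Expression (2): mutual information of the d-bigram (x, y; d),
    with maximum-likelihood probability estimates."""
    if not pairs_per_d[d] or not dbi[(x, y, d)]:
        return 0.0                                   # unseen pair contributes nothing
    p_xy = dbi[(x, y, d)] / pairs_per_d[d]
    p_x, p_y = uni[x] / n_uni, uni[y] / n_uni
    return math.log2(p_xy / (p_x * p_y))

def linking_score(s, i, model, dmax=5):
    """Expression (1): score between neighbours s[i] and s[i+1], summing
    g(d) = d**-2 times the MI of every letter pair spanning the boundary."""
    uni, n_uni, dbi, pairs_per_d = model
    score = 0.0
    for j in range(max(0, i - dmax + 1), i + 1):     # left letter of the pair
        for d in range(1, dmax + 1):
            k = j + d                                # right letter of the pair
            if k > i and k < len(s):                 # pair spans the boundary i / i+1
                score += (d ** -2) * mi(s[j], s[k], d, uni, n_uni, dbi, pairs_per_d)
    return score
```

In this sketch, letter pairs that cooccur often in training push the boundary score up (a mountain in the sense of Section 3.3), while accidental neighbors contribute little or nothing (a valley).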
When calculating the linking score between the letters w_i and w_{i+1}, the d-bigram information of the letter pairs around the target two (such as (w_{i-1}, w_{i+2}; 3)) is added.</Paragraph> <Paragraph position="5"> Expression (2) calculates the mutual information between two events with d-bigram data;</Paragraph> <Paragraph position="7"> I(x, y; d) = \log_2 (P(x, y; d) / (P(x) P(y))) (2) where x and y are events, d is the distance between the two events, and P(x) is the probability.</Paragraph> </Section> <Section position="3" start_page="579" end_page="580" type="sub_section"> <SectionTitle> 3.3 Sequence Extraction </SectionTitle> <Paragraph position="0"> Using the linking scores calculated from the statistical information, our system searches for the sequences to extract (thus we call our system the LSE (linky sequence extraction) system).</Paragraph> <Paragraph position="1"> Figure 2 shows an example graph of the linking scores for a sentence. Each alphabetic letter on the x-axis stands for a letter in a sentence. The linking scores between two neighboring letters are plotted on the y-axis. Since the linking score gets higher when a pair has a stronger connection, the mountain-shaped lines can be regarded as unsegmentable blocks of letters. The linking scores of the pairs in longer words/phrases can be raised by the influence of the statistical information of other letter pairs around them. On the other hand, the linking score between two letters which are only accidentally neighboring gets lower, and it makes a valley-shaped point on the score graph. Our system extracts the mountain-shaped parts of the sentence as the 'linky sequences', which are considered to be meaningful according to the statistical information. In the example of Figure 2, the strings AB, CDEF and HIJK might be extracted.</Paragraph> <Paragraph position="2"> The heights of the mountains are not fixed; they vary according to the likelihood of the letters forming a block as a string. 
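One simple way to turn the mountain/valley picture above into a procedure is to cut the sentence at every boundary whose linking score falls below a threshold, and keep the surviving runs of two or more letters as candidates. This is a hedged sketch of that idea, assuming precomputed boundary scores; the function name and the exact cutting rule are our assumptions, not the paper's specification.

```python
def extract_linky_sequences(sentence, scores, threshold):
    """Cut at every 'valley' (boundary score below threshold); contiguous
    runs of two or more letters survive as candidate linky sequences.
    scores[i] is the linking score between sentence[i] and sentence[i+1]."""
    sequences, start = [], 0
    for i, sc in enumerate(scores):
        if sc < threshold:              # valley: segment boundary
            if i + 1 - start >= 2:      # keep mountains of length >= 2
                sequences.append(sentence[start:i + 1])
            start = i + 1
    if len(sentence) - start >= 2:      # trailing mountain
        sequences.append(sentence[start:])
    return sequences

# Figure 2's example: high scores inside AB, CDEF and HIJK, valleys elsewhere.
scores = [1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]
print(extract_linky_sequences("ABCDEFGHIJK", scores, 0.5))
# → ['AB', 'CDEF', 'HIJK']
```

Raising the threshold turns more boundaries into valleys, so the extracted strings get shorter.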
Thus we need a threshold to decide which strings to extract, according to the required size and strength of connection. With a higher threshold the extracted strings get shorter, since a higher linking score means that neighboring letters can be connected only when they have a stronger connection between them.</Paragraph> </Section> <Section position="4" start_page="580" end_page="580" type="sub_section"> <SectionTitle> 3.4 How the System Uses the Statistical Information </SectionTitle> <Paragraph position="0"> Figure 3 shows the example graph for the sentence &quot;o-gen-ki-de-su-ka-?&quot; (: How are you?) (Sano, 1997). Each graph line indicates the linking scores of the sentence after learning some thousands of sentences of the target domain (for this graph we used a postcard corpus as the target domain, and for the base domain we took a newspaper corpus). When the system has no information on the postcard domain, it can indicate only that the pair of letters &quot;gen-ki&quot; is connectable (there is a mountain-shaped line for this pair). Given the information of the postcard corpus, the linking scores of every pair in this sentence get bigger, making a higher mountain. The shape of the mountain also changes, from a steep mountain with deep valleys to a flat mountain which contains the whole sentence.</Paragraph> </Section> </Section> <Section position="6" start_page="580" end_page="584" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We ran experiments on extracting semantic sequences based only on letter cooccurrence information. We also tried the dictionary-based Japanese morphological parser ChaSen ver. 
1.51 (1997) on the test corpus, to check sequences which a dictionary-based parser cannot recognize.</Paragraph> <Section position="1" start_page="580" end_page="581" type="sub_section"> <SectionTitle> 4.1 Corpus </SectionTitle> <Paragraph position="0"> We chose email messages as the corpora for the experiments of our system. Email messages are mostly written colloquially, especially when they are written by younger people to send to their friends. In Japanese, colloquial language has casual wording which differs from the literary style.</Paragraph> <Paragraph position="1"> Casual wording contains emphasis and terms not in the dictionary, such as slang. In English an emphasized word may be written in capital letters, as in &quot;it SURE is not true&quot;, which is easily connected to the basic word &quot;sure&quot;. We make the same kind of letter-type changes in Japanese for emphasis; however, since the relationship between letter types is not the same as in English, it is not easy to connect the emphasized terms to the basic terms.</Paragraph> <Paragraph position="2"> The training corpus we used to extract statistical information is a set of email messages sent between young female friends from 1998 to 1999. This corpus does not overlap the one used as the test corpus. All the messages were sent to one receiver, and the number of senders is 17. The email corpus contains 351 email messages, comprising 7,865 sentences (176,380 letters, i.e. 22.4 letters per sentence on average).</Paragraph> <Paragraph position="3"> We did not include quotations of other emails in the training corpus, to avoid double-counting of the same sentences, though email messages often contain quotations.</Paragraph> <Paragraph position="4"> The test corpus is a set of email messages sent between young female friends during 1999. This corpus is not a part of the training corpus. All the messages were sent to one receiver, and the number of senders is 3. 
This corpus contains 1,118 sentences (24,160 letters, i.e. 21.6 letters per sentence on average).</Paragraph> </Section> <Section position="2" start_page="581" end_page="581" type="sub_section"> <SectionTitle> 4.2 Preliminary Results </SectionTitle> <Paragraph position="0"> Figure 4 shows the distribution of the linking scores. The average of the scores is 0.34. The pairs of letters with higher linking scores are treated as highly 'linkable' pairs, that is, pairs with a strong connection according to the statistical information of the domain (actually of the training corpus).</Paragraph> <Paragraph position="2"> Pairs of letters with high scores are mainly found in high-scored sequences (Table 1).</Paragraph> <Paragraph position="3"> Table 1 shows a part of the sequences extracted by our system using letter cooccurrence information. The extraction threshold for Table 1 is 5.0.</Paragraph> <Paragraph position="4"> The sequences extracted frequently are ones often used in the target domain.</Paragraph> </Section> <Section position="3" start_page="581" end_page="583" type="sub_section"> <SectionTitle> 4.3 Undefined Words with ChaSen </SectionTitle> <Paragraph position="0"> Since ChaSen is a dictionary-based system, it outputs unknown strings of letters as they are, with the tag 'undefined word'.</Paragraph> <Paragraph position="1"> Table 2 shows the number of sequences which ChaSen output as &quot;undefined words&quot;. The row 'undefined words' indicates the sequences which were labeled as 'undefined word' by ChaSen, and the row 'parsing errors' stands for the sequences which were not undefined words with ChaSen but were not segmented correctly.1 The extraction threshold is 0.5.</Paragraph> <Paragraph position="2"> ChaSen had 627 undefined words in its output. Since the test corpus contains 1,118 sentences, 56.08% of the sentences had an undefined word on average. 
As it is impossible to divide an undefined sequence into two undefined words, when two or more undefined sequences are neighboring they are often connected into one undefined word.2 Thus the real number of undefined sequences can be larger than the count.</Paragraph> <Paragraph position="3"> Table 2 shows that our system, based on statistical information, can help to recover 69.06% of the undefined sequences detected by ChaSen.</Paragraph> <Paragraph position="4"> 1 Since our system does not assign POS tags, we do not count ChaSen tagging errors (i.e., we do not include tagging errors in the 'parsing errors').</Paragraph> <Paragraph position="5"> 2 ChaSen can divide two neighboring undefined sequences when the letter types of the sequences differ. (Table 2 notes: a suc.: succeeded to extract; b part.: partially extracted.) Table 2 also shows that this system has better precision on sequences with larger frequency. For the sequences with frequency over 10 (in the test corpus), 81.85% of the sequences were extracted correctly. Ignoring sequences which appeared in the test corpus only once, the rate of correct extraction rose to 77.71%. Table 3 shows how our system worked with the sequences which were found as undefined words by the ChaSen parsing system. The threshold for extraction is 0.5.
Table 3: Undefined Words with the LSE System
category           #total  suc.a  part.b  failed
proper nouns           60     39      17       4
new words              70     48      12      10
letter additions      119     89       4      26
changesc              276    194      28      54
term. marksd           58     43       0      15
smileys                15      9       6       0
etc.                   29     12       1      16
total                 627    433      68     126
(a suc.: succeeded to extract; b part.: partially extracted; c changes: representation changes; d term. marks: termination marks.)
Table 3 shows that the biggest reason for the undefined words is the problem of representation. As described in Section 4.3.2, we change the way of writing when we want to emphasize a sequence. The pronunciation extension, adding extra vowels or extension marks, is for the same reason as well. 
Adding these two categories, 356 sequences out of the 627 undefined words (56.78%) are caused by this emphasis.</Paragraph> <Paragraph position="6"> Termination marks as undefined words contain sequences such as &quot;......&quot; and &quot;!!&quot;. Termination marks not in the dictionary often indicate an impression, such as surprise, happiness, pondering and so on.</Paragraph> <Paragraph position="7"> New words, including proper nouns, are the actual 'undefined words'. ChaSen had 130 of them in its output, that is, 20.73% of the undefined words.</Paragraph> <Paragraph position="8"> Table 4 shows the types of letters included in the 'undefined words' from ChaSen. The figures indicate the numbers of letters.</Paragraph> <Paragraph position="9"> We had 627 undefined words in the test corpus with ChaSen (Table 2), which contain 1,493 letters in total. The average length of the undefined words is thus 2.38 letters. Table 4 (Letter Types of Undefined Words) lists, per letter type, the variety, the total, and the suc.a/part.b/failed counts with the LSE system (a suc.: succeeded to extract; b part.: partially extracted). 70.40% of the letters in undefined words were katakana letters (Table 4), which are phonetic and often used for describing new words. Katakana letters are also often used for emphasizing sequences.</Paragraph> <Paragraph position="10"> On the other hand, there was only one letter each for kanji and numeral figures. That is because each kanji letter and numeral figure has its own meaning, and those letters are mostly found in the dictionary, even though the tags are not semantically correct. As for kanji letters, they can also sometimes be tagged with incorrect segmentation.3 
Thus undefined words in kanji letters are mostly not counted as 'undefined words'; instead they cause segmentation failures (Section 4.4).</Paragraph> <Paragraph position="11"> Since Japanese has two phonetic character sets, we have several ways to represent one term: in kanji (if there is any for the term), in hiragana, in katakana, or in several character types mixed. It is also possible to use Romanization to represent a term.</Paragraph> <Paragraph position="12"> 3 &quot;kono&quot; (: this) / &quot;sekai&quot; (: the world) is incorrectly segmented as &quot;kono-yo&quot; (: the present life) / &quot;kai&quot; (: world); &quot;kono yo&quot; is a fixed phrase, and &quot;kai&quot; is a suffix for a noun adding the meaning of &quot;the world of&quot;, e.g. &quot;the literary world&quot;. Table 5 shows the numbers of ChaSen errors according to the representation change.
Table 5: Representation Changes with the LSE System
subcategory          #total  suc.a  part.b  failed
term changes             40     33       3       4
katakana                137    102      12      23
change & katakana        55     34      10      11
etc.                     44     25       3      16
total                   276    194      28      54
(a suc.: succeeded to extract; b part.: partially extracted.)
Most dictionaries have only one basic representation for one term as its entry.4 However, in casual writing we sometimes do not use the basic representation, to emphasize the term or simply to simplify the writing.</Paragraph> <Paragraph position="13"> In Japanese we have many kinds of function words to put at the end of sentences (or sometimes of 'bunsetsu' blocks). The function words for sentence ends change the sound of the sentences, representing friendliness, ordering, and other attitudes. These function words are basically used not in written texts but in colloquial sentences.</Paragraph> <Paragraph position="14"> In Japanese we put extra letters to represent the lengthening of a phone. 
Since almost all Japanese phones have vowels, to lengthen a phone for emphasis we put extra vowels or extension marks after the letter. Table 6: Extra Letters Output as Undefined</Paragraph> </Section> <Section position="4" start_page="583" end_page="583" type="sub_section"> <SectionTitle> Words </SectionTitle> <Paragraph position="0"> 
letter (romanized)   a   i   u   e   o   t   t   n   total
suc.a               39   2   5  32   7   3   1   0      89
part.b               0   0   0   0   0   4   0   0       4
failed               5   1   4   2   1   7   5   1      26
total               44   3   9  34   8  14   6   1     119
a suc.: succeeded to extract; b part.: partially extracted.</Paragraph> <Paragraph position="1"> These letters appear in the dictionary only as parts of entries, not as headings. The small letters in this table are extra letters that change the pronunciation; i.e. they are mostly not included in the dictionary. However, they are actually part of the word, since they cannot be separated from the previous sequences.</Paragraph> </Section> <Section position="5" start_page="583" end_page="584" type="sub_section"> <SectionTitle> 4.4 Segmentation Failure with ChaSen </SectionTitle> <Paragraph position="0"> Table 7 shows the result of the extraction of sequences on which ChaSen made parsing errors.</Paragraph> <Paragraph position="1"> It indicates that our system could recognize 70.88% of the sequences on which ChaSen made parsing errors.</Paragraph> <Paragraph position="2"> A: sequences incl. alphabetical characters
B: sequences incl. numeral figures
C: proper nouns
D: new words excl. proper nouns
E: fixed locutions
F: sequences with representation changes
G: sequences in other character types
H: emphasized expressions
I: termination marks
J: parsing errors
(a suc.: succeeded to extract; b part.: partially extracted.) Category F is for the sequences which changed their representations according to the terms' pronunciation changes for casual use. For example, &quot;ya-p-pa&quot; is a casual form of &quot;ya-ha-ri&quot; (: as I thought). 
In casual talk, using the original term &quot;yahari&quot; sounds a little too polite. Some common casual forms are in dictionaries, but not all.</Paragraph> <Paragraph position="3"> For category B, our system failed to extract 25 sequences. All the sequences in B have counting suffixes. 12 of the 25 sequences could not be connected with their counting suffixes; e.g. &quot;30-nichi&quot; (: 30 days, or the 30th day) got over-segmented between the zero and the suffix. There is a large variety of counting suffixes in Japanese, and since our system relies only on letter cooccurrence information we could not avoid the over-segmentation.</Paragraph> <Paragraph position="4"> Category G indicates the sequences which are written in other character types for emphasis. The major changes are: (1) writing in hiragana characters instead of kanji characters, and (2) writing in katakana characters to emphasize the term.</Paragraph> </Section> </Section> <Section position="7" start_page="584" end_page="584" type="metho"> <SectionTitle> 5 Conclusion </SectionTitle> <Paragraph position="0"> Dictionary-based NLP tools often have worse precision on texts written in casual wording and texts which contain many domain-specific terms. A term recognition system available for any corpus as preprocessing enables the use of NLP tools on many kinds of texts.</Paragraph> <Paragraph position="1"> In this paper we proposed a simple method for term recognition based on statistical information. We ran experiments on extracting semantically meaningful sequences according to the statistical information drawn from the training corpus, and our system recognized 69.06% of the sequences which were tagged as undefined words by a conventional morphological parser.</Paragraph> <Paragraph position="2"> Our system was efficient in recognizing different representations of terms, proper nouns, and other casual wording phrases. 
This helps to salvage semantically meaningful sequences that are not in dictionaries, and it can serve as an efficient preprocessing step.</Paragraph> </Section> </Paper>