<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2121"> <Title>Word Sense Disambiguation and Text Segmentation</Title> <Section position="4" start_page="755" end_page="755" type="metho"> <SectionTitle> 2 Lexical Cohesion </SectionTitle> <Paragraph position="0"> Consider the following example, which is the English translation of a fragment of one of the Japanese texts that we use for the experiment later.</Paragraph> <Paragraph position="1"> In the universe that continues expanding, a number of stars have appeared and disappeared again and again. And about ten billion years after the birth of the universe, in the same way as the other stars, a primitive galaxy was formed with the primitive sun as the center.</Paragraph> <Paragraph position="2"> The words {universe, star, universe, star, galaxy, sun} seem to be semantically identical or related to each other, and they are included in the same category in Roget's International Thesaurus. Like Morris and Hirst, we compute such sequences of related words (lexical chains) by using a thesaurus as the knowledge base, to take into account not only the repetition of the same word but also the use of superordinates, subordinates, and synonyms. We use a Japanese thesaurus, 'Bunrui-goihyo'[1]. Bunrui-goihyo has an organization similar to Roget's: it consists of 798 categories and has a hierarchical structure above this level. For each word, a list of category numbers corresponding to its multiple word senses is given. We count a sequence of words that are included in the same category as a lexical chain. It should be clear that this task is computationally trivial. Note that we regard only a sequence of words in the same category as a lexical chain, rather than using the complete framework of Morris and Hirst with five types of thesaural relations.</Paragraph> <Paragraph position="3"> The word sense of a word can be determined in its context.
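The category-based chaining described above can be sketched as follows. The toy thesaurus, its category numbers, and the helper name `build_chains` are illustrative assumptions, not Bunrui-goihyo itself:

```python
# Sketch of category-based lexical chaining over a toy thesaurus.
# Category numbers here are invented; Bunrui-goihyo assigns each word
# a list of category numbers corresponding to its word senses.

TOY_THESAURUS = {
    "universe": [101],
    "star":     [101],
    "galaxy":   [101],
    "sun":      [101, 205],  # ambiguous in this toy setup
    "ground":   [205],
}

def build_chains(words, thesaurus):
    """Group a word sequence into chains of words sharing a thesaurus category."""
    by_category = {}  # category number -> words seen in that category
    for w in words:
        for cat in thesaurus.get(w, []):
            by_category.setdefault(cat, []).append(w)
    # only sequences of two or more words count as lexical chains
    return {c: ws for c, ws in by_category.items() if len(ws) > 1}

chains = build_chains(["universe", "star", "universe", "star", "galaxy", "sun"],
                      TOY_THESAURUS)
```

With the example word sequence, all six words fall into the single toy category 101 and form one chain; the lone occurrence of category 205 is discarded.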
For example, in the context {universe, star, universe, star, galaxy, sun}, the word 'earth' has a 'planet' sense, not a 'ground' one. As is clear from this example, lexical chains can be used as a contextual aid to resolve word sense ambiguity[10]. In the generation process of lexical chains, by choosing the lexical chain that the current word is added to, its word sense is determined. Thus we regard word sense disambiguation as selecting the most likely category number of the thesaurus, similarly to [16].</Paragraph> <Paragraph position="4"> Earlier we proposed an incremental disambiguation method that uses intrasentential information, such as selectional restrictions and case frames[12]. In the next section, we describe an incremental disambiguation method that uses lexical chains as intersentential (contextual) information.</Paragraph> </Section> <Section position="5" start_page="755" end_page="758" type="metho"> <SectionTitle> 3 Generation of Lexical Chains </SectionTitle> <Paragraph position="0"> In the last section, we showed that lexical chains can play the role of local context. However, multiple lexical chains might co-occur in portions of a text, and they might vary in their plausibility as local context. For this reason, for lexical chains to function truly as local context, it is necessary to arrange them in the order of a salience that indicates the degree of this plausibility. We base the salience on the following two factors: recency and length. The more recently updated chains are considered to be the more activated context in the neighborhood and are given more salience.
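The sense-selection step just described can be sketched as follows. The category numbers (301 for the 'planet' sense, 302 for the 'ground' sense) are invented for illustration:

```python
# A minimal sketch of word sense disambiguation via lexical chains:
# pick the sense category of an ambiguous word that an existing chain supports.

def disambiguate(word_categories, chains):
    """Return the first category of the word that matches an existing chain."""
    for cat in word_categories:
        if cat in chains:
            return cat
    return None  # no chain supports any sense; the word stays ambiguous

# 'earth' is ambiguous between 'ground' (302) and 'planet' (301); the
# surrounding chain of celestial-body words selects the 'planet' sense.
context_chains = {301: ["universe", "star", "galaxy", "sun"]}
sense = disambiguate([302, 301], context_chains)
```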
The longer chains are considered to be more about the topic in the neighborhood and are given more salience.</Paragraph> <Paragraph position="1"> By checking lexical cohesion between the current word and the lexical chains in the order of their salience, the lexical chain selected to receive the current word determines its word sense and plays the role of local context.</Paragraph> <Paragraph position="2"> Based on this idea, incremental generation of lexical chains realizes incremental word sense disambiguation using the contextual information that lexical chains reveal. During the generation of lexical chains, their salience is also incrementally updated. We think incremental disambiguation[9] is a better strategy, because a combinatorial explosion of the number of total ambiguities might occur if ambiguity is not resolved as early as possible during the analytical process. Moreover, incremental word sense disambiguation is indispensable during the generation of lexical chains if lexical chains are used for incremental analysis, because word sense ambiguity might cause many undesirable lexical chains, which might degrade the performance of the analysis (in this case, the disambiguation itself).</Paragraph> <Section position="1" start_page="756" end_page="756" type="sub_section"> <SectionTitle> 3.1 The Algorithm </SectionTitle> <Paragraph position="0"> First of all, a Japanese text is automatically segmented into a sequence of words by morphological analysis[11]. From the result of the morphological analysis, candidate words are selected for inclusion in lexical chains. We consider only nouns, verbs, and adjectives, with some exceptions such as nouns in adverbial use and verbs in postpositional use.</Paragraph> <Paragraph position="1"> Next, lexical chains are formed. Lexical cohesion among candidate words inside a sentence is first checked by using the thesaurus. Here the word sense of the current word might be determined.
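The salience ordering of chains might be kept in a register along these lines. The class name is ours, and letting recency dominate is only one possible reading of how the two factors combine, since the text does not specify it:

```python
# An illustrative register of lexical chains, most salient first.
# Recency dominates in this sketch: a just-updated chain moves to the front.

class SalienceRegister:
    def __init__(self):
        self.chains = []  # each chain is a list of words; index 0 is most salient

    def touch(self, chain):
        """A chain that was just created or extended becomes the most salient."""
        if chain in self.chains:
            self.chains.remove(chain)
        self.chains.insert(0, chain)

register = SalienceRegister()
register.touch(["universe", "star"])
register.touch(["ground"])
register.touch(["universe", "star"])  # extended again, so promoted to the front
```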
This preference for lexical cohesion inside a sentence over intersentential cohesion reflects our observation that the former might be tighter.</Paragraph> <Paragraph position="2"> After the analysis inside a sentence, the system tries to add each candidate word to one of the lexical chains recorded in the register, in the order of the above salience. The first chain with which the current word has a lexical cohesion relation is selected. The salience of the selected lexical chain gets higher, and the arrangement in the register is then updated.</Paragraph> <Paragraph position="3"> Here not only is the word sense ambiguity of the current word resolved, but the word senses of the ambiguous words in the selected lexical chain can also be determined. Because the lexical chain gets higher salience, the other word senses of the ambiguous words in the lexical chain, which correspond to other lexical chains, can be rejected. Therefore, lexical chains can be used not only as prior context but also as later context for word sense disambiguation.</Paragraph> <Paragraph position="3"> If a candidate word cannot be added to an existing lexical chain, new lexical chains for each of its word senses are recorded in the register.</Paragraph> <Paragraph position="4"> As is clear from the algorithm, rather than a truly incremental method where the register of lexical chains is updated word by word within a sentence, we adopt an incremental method where updates are performed at the end of each sentence, because we regard intrasentential information as more important.</Paragraph> <Paragraph position="5"> The process of word sense disambiguation using lexical chains is illustrated in Figure 1. The most salient lexical chain is located at the top of the register. In the initial state the word W1 remains ambiguous. When the current unambiguous word W2 is added, the chain b is selected (top-left). The chain b becomes the most salient (top-right).
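The per-word update of this algorithm can be sketched as follows. Representing a chain as a (category, words) pair in a list kept most-salient-first is our own concrete choice, not the paper's:

```python
# A sketch of the per-word update step: try the chains in salience order,
# otherwise open one new candidate chain per remaining sense.

def add_word(word, senses, register):
    """Attach `word` to the first chain sharing one of its sense categories.
    That chain is promoted and the chosen category (the word sense) returned.
    If no chain matches, open a new candidate chain for each sense."""
    for chain in register:
        category, words = chain
        if category in senses:
            words.append(word)
            register.remove(chain)
            register.insert(0, chain)  # the selected chain gains salience
            return category            # the word sense is now determined
    for category in senses:
        register.append((category, [word]))
    return None                        # the word remains ambiguous for now

# most salient first; category numbers are invented for illustration
register = [(205, ["ground"]), (101, ["universe", "star"])]
chosen = add_word("galaxy", [101], register)
```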
Here the word sense ambiguity of the word W1 in the chain b is resolved (bottom-left). If the word to be added is ambiguous (W3), the word sense corresponding to the more salient lexical chain is selected (bottom-right).</Paragraph> </Section> <Section position="2" start_page="756" end_page="758" type="sub_section"> <SectionTitle> 3.2 The Evaluation </SectionTitle> <Paragraph position="0"> We apply the algorithm to five texts. Table 1 shows the system's performance.</Paragraph> <Paragraph position="1"> The 'correctness' of the disambiguation is judged by one of the authors. The system's performance is computed as the quotient of the number of correctly disambiguated words by the number of ambiguous words minus the number of wrongly segmented words (morphological analysis errors; the accuracy of the morphological analysis would be improved by adding new word entries or the like). Words that remain ambiguous are those that do not form any lexical chains with other words. Apart from the errors in the morphological analysis, most of the errors in the disambiguation are caused by being dragged into the wrong context.</Paragraph> <Paragraph position="2"> The average performance is 63.4 %. We think the system's performance is promising for the following reasons: 1. Lexical cohesion is not the only knowledge source for word sense disambiguation, and it proves to be useful at least as a source supplementary to our earlier framework that used case frames[12].</Paragraph> <Paragraph position="3"> 2. In fact, higher performance is reported in [16], which uses broader context acquired by</Paragraph> <Paragraph position="5"/> training on large corpora, but
our method can attain such a tolerable level of performance without any training.</Paragraph> <Paragraph position="6"> However, our salience of lexical chains is, of course, rather naive and must be refined by using other kinds of information, such as Japanese</Paragraph> <Paragraph position="8"/> </Section> </Section> <Section position="6" start_page="758" end_page="759" type="metho"> <SectionTitle> 4 Text Segmentation by Lexical Chains </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="758" end_page="758" type="sub_section"> <Paragraph position="0"> The second importance of lexical chains is that they provide a clue for the determination of segment boundaries. Certain spans of sentences in a text form semantic units and are usually called segments. It is crucial to identify the segment boundaries as a first step to constructing the structure of a text[2].</Paragraph> </Section> <Section position="2" start_page="758" end_page="758" type="sub_section"> <SectionTitle> 4.1 The Measure for Segment Boundaries </SectionTitle> <Paragraph position="0"> When a portion of a text forms a semantic unit, there is a tendency for related words to be used.</Paragraph> <Paragraph position="1"> Therefore, if lexical chains can be found, they will tend to indicate the segment boundaries of the text. When a lexical chain ends, there is a tendency for a segment to end. If a new chain begins, this might be an indication that a new segment has begun[10]. Taking into account this correspondence of lexical chain boundaries to segment boundaries, we measure the plausibility of each point in the text as a segment boundary: for each point between sentences n and n+1 (where n ranges from 1 to the number of sentences in the text minus 1), compute the sum of the number of lexical chains that end at the sentence n and the number of lexical chains that begin at the sentence n+1.
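The measure just defined can be sketched as follows, assuming each lexical chain is reduced to the pair (first sentence, last sentence) that it spans:

```python
# Sum of chains ending at sentence n and chains beginning at sentence n+1.

def boundary_strength(n, chains):
    """w(n, n+1) for chains given as (first_sentence, last_sentence) pairs."""
    ends = sum(1 for first, last in chains if last == n)
    starts = sum(1 for first, last in chains if first == n + 1)
    return ends + starts

# three illustrative chains spanning sentences 1-3, 4-7 and 8-12
spans = [(1, 3), (4, 7), (8, 12)]
strengths = {n: boundary_strength(n, spans) for n in range(1, 12)}
```

In this toy text the strength peaks between sentences 3 and 4 and between 7 and 8, exactly where one chain ends and another begins.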
We call this naive measure of the degree of agreement of the start and end points of lexical chains w(n, n+1) the boundary strength, as in [14]. The points in the text are selected in the order of boundary strength as candidates for segment boundaries.</Paragraph> <Paragraph position="2"> Consider for example the five lexical chains in the imaginary text that consists of 24 sentences in Figure 2. In this text, the boundary strength can be computed as follows: w(3,4) = 1, w(7,8) = ...</Paragraph> <Paragraph position="4"/> </Section> <Section position="3" start_page="758" end_page="759" type="sub_section"> <SectionTitle> 4.2 The Evaluation </SectionTitle> <Paragraph position="0"> We try to segment the texts of section 3.2 and apply the above measure to the lexical chains that were formed. We pick out three texts (No.3,4,5), from the exam questions of the Japanese language, that ask us to partition the texts into a given number of segments. The system's performance is judged by the comparison with segment boundaries marked as an attached model answer. Two more texts (No.6,7) from the questions are also segmented. Here we do not take into account the information of paragraph boundaries, such as the indentation, at all, for the following reasons: * Because our texts are from the exam questions, many of them have no marks of paragraph boundaries; * In the case of Japanese, it is pointed out that paragraph and segment boundaries do not always coincide with each other[13].</Paragraph> <Paragraph position="2"> Table 2 shows the performance in the case where the system generates the given number of segment boundaries in the order of the strength. From Table 2, we can compute the system's marks as an examinee in the test that consists of these five questions. Table 3 shows the performance in the case where segment boundaries are generated down to half of the maximum strength.
The metrics that we use for the evaluation are as follows: Recall is the quotient of the number of correctly identified boundaries by the total number of correct boundaries. Precision is the quotient of the number of correctly identified boundaries by the number of generated boundaries.</Paragraph> <Paragraph position="3"> We think the poor result for the text No.5 might be caused by the difficulty of the text itself, because it is written by one of the most difficult writers in Japan, KOBAYASHI Hideo. (The number of boundaries to be given is the number of segments given in the question minus 1.) Table 2 shows that our system gets 8 (1+3+3+1) out of 15 (1+6+1+4+3), i.e. 53 %, in the test. From Table 3, the average recall and precision rates are 0.52 and 0.25 respectively. Of course these results are unsatisfactory, but we think this measure for segment boundaries is promising and useful as a preliminary one.</Paragraph> <Paragraph position="6"> Since lexical chains are considered to differ in their degree of contribution to segment boundaries, we are now refining the measure by taking into account their importance. We base the importance of lexical chains on the following two factors: 1. The lexical chains that include more words with the topical marker 'wa' get more importance. 2. The longer lexical chains tend to represent a semantic unit and get more importance.</Paragraph> <Paragraph position="7"> The start and end points of the more important lexical chains get more boundary strength. This refinement of the measure is in progress and yields a certain degree of improvement in the system's performance.</Paragraph> <Paragraph position="8"> Moreover, this evaluation method is not necessarily adequate, since partitioning into a larger number of smaller segments might be possible and necessary for the given texts.
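The recall and precision metrics defined in this subsection amount to two simple ratios; the example counts below are invented for illustration:

```python
# Recall and precision as defined for the segmentation evaluation.

def recall(correct_found, total_correct):
    """Correctly identified boundaries over all correct boundaries."""
    return correct_found / total_correct

def precision(correct_found, total_generated):
    """Correctly identified boundaries over all generated boundaries."""
    return correct_found / total_generated

r = recall(2, 4)     # 2 of 4 true boundaries were found
p = precision(2, 8)  # 2 of 8 generated boundaries were correct
```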
And so, in future, we will have to consider an evaluation method in which the agreement with human subjects is tested. However, since human subjects do not always agree with each other on segmentation[6, 4, 14], our evaluation method using the texts of the questions with model answers is considered to be a good simplification.</Paragraph> <Paragraph position="9"> Several other approaches to text segmentation have been proposed. Kozima[7] and Youmans[17] proposed statistical measures (named LCP and VMP respectively) that indicate the plausibility of text points as segment boundaries. Their hills or valleys tend to indicate segment boundaries. However, they only showed the correlation between their measures and segment boundaries by an intuitive analysis of a few sample texts, and so we cannot compare our system's performance with theirs precisely.</Paragraph> <Paragraph position="10"> Hearst[5] independently proposes a similar measure for text segmentation and evaluates the performance of her method with precision and recall rates. However, her segmentation method depends heavily on the information of paragraph boundaries and always partitions a text at the points of paragraph boundaries.</Paragraph> </Section> </Section> </Paper>