<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2145">
<Title>Text Segmentation with Multiple Surface Linguistic Cues</Title>
<Section position="4" start_page="881" end_page="881" type="intro">
<SectionTitle> 2 Surface Linguistic Cues for Japanese Text Segmentation </SectionTitle>
<Paragraph position="0"> Many linguistic cues are available for identifying segment boundaries (or non-boundaries) in a Japanese text. However, it is not clear which cues actually yield better results for the text segmentation task. Therefore, we first enumerate all the linguistic cues. Then, we select the useful cues and combine them for text segmentation.</Paragraph>
<Paragraph position="1"> We use a method in which a weighted sum of the scores for all cues serves as the overall measure for ranking the possible segmentations.</Paragraph>
<Paragraph position="2"> We first explain this method. We represent the point between sentences n and n + 1 as p(n, n + 1), where n ranges from 1 to the number of sentences in the text minus 1. Each point p(n, n + 1) is a candidate for a segment boundary and has a score scr(n, n + 1), calculated as a weighted sum of the scores for each cue i, scr_i(n, n + 1), as follows: $$scr(n, n + 1) = \sum_{i} w_i \times scr_i(n, n + 1)$$ where w_i is the weight for cue i.</Paragraph>
<Paragraph position="4"> A point p(n, n + 1) with a higher score scr(n, n + 1) is a more plausible candidate. The points in the text are selected as candidate segment boundaries in descending order of score.</Paragraph>
<Paragraph position="5"> We use the following surface linguistic cues for Japanese text segmentation: * Occurrence of topical markers (i = 1..4). If the topical marker 'wa' or the subjective postposition 'ga' appears either just before or after p(n, n + 1), add 1 to scr_i(n, n + 1).</Paragraph>
<Paragraph position="6"> * Occurrence of conjunctives (i = 5..10). 
If one of the six types of conjunctives 1 appears at the head of sentence n + 1, add 1 to scr_i(n, n + 1).</Paragraph>
<Paragraph position="7"> * Occurrence of anaphoric expressions (i = 11..13). If one of the three types of anaphoric expressions 2 appears at the head of sentence n + 1, add 1 to scr_i(n, n + 1).</Paragraph>
<Paragraph position="8"> * Omission of the subject (i = 14). If the subject is omitted in sentence n + 1, add 1 to scr_i(n, n + 1).</Paragraph>
<Paragraph position="9"> * Succession of sentences of the same type (i = 15..18). If both sentences n and n + 1 are judged to be the same one of the four types of sentences 3, add 1 to scr_i(n, n + 1).</Paragraph>
<Paragraph position="10"> 2 [...] arises from the difference in the characteristics of their referents from the viewpoint of the mutual knowledge between the speaker/writer and hearer/reader (Seiho, 1992). 3 The classification of types of sentences originates in work in Japanese linguistics (Nagano, 1986). * Occurrence of lexical chains (i = 19..22). Following Morris and Hirst (1991), we call a sequence of words that have a lexical cohesion relation with each other a lexical chain. Like Morris and Hirst, we assume that lexical chains tend to indicate portions of a text that form a semantic unit. We use information on both the lexical chains and the gaps in lexical chains, i.e., the parts of a chain with no words. A gap in a lexical chain can be considered to indicate a small digression from the topic. If a lexical chain or a gap ends at sentence n or begins at sentence n + 1, add 1 to scr_i(n, n + 1). Here we assume that related words are words in the same class in a thesaurus 4.</Paragraph>
<Paragraph position="11"> * Change of the modifier of words in lexical chains (i = 23). If the modifier of the words in a lexical chain changes in sentence n + 1, add 1 to scr_i(n, n + 1). 
This cue originates in the idea that such a change might indicate that a different aspect of the topic has become the new topic.</Paragraph>
<Paragraph position="12"> The above cues indicate either the plausibility or the implausibility of a point as a segment boundary. Occurrence of the topical marker 'wa', for example, indicates that a segment boundary is plausible, while occurrence of anaphora or a succession of sentences of the same type indicates that it is implausible. The weight for each cue reflects whether the cue is a positive or a negative factor for a segment boundary. In the next section, we present our weighting method.</Paragraph>
</Section>
</Paper>
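<!--
The ranking step described in this section can be sketched as follows. This is a minimal illustration, assuming each point p(n, n + 1) already has its per-cue scores scr_i(n, n + 1) and that the overall score is the weighted sum over cues. The names `overall_score` and `rank_boundaries`, the weight symbol w_i, and every weight and cue hit in the toy data are hypothetical and are not values from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical type: cue index i (1..23 in the text) mapped to scr_i(n, n + 1).
CueScores = Dict[int, float]


def overall_score(cue_scores: CueScores, weights: Dict[int, float]) -> float:
    """Weighted sum over cues: scr(n, n + 1) = sum_i w_i * scr_i(n, n + 1)."""
    # Cues without a weight contribute nothing (assumed default weight 0.0).
    return sum(weights.get(i, 0.0) * s for i, s in cue_scores.items())


def rank_boundaries(
    points: List[CueScores], weights: Dict[int, float]
) -> List[Tuple[int, float]]:
    """Return (n, scr(n, n + 1)) pairs sorted by descending score.

    points[k] holds the cue scores for the point between
    sentences k + 1 and k + 2, so n starts at 1 as in the text.
    """
    scored = [(n, overall_score(cs, weights)) for n, cs in enumerate(points, start=1)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data with made-up weights: a positive weight for a topical marker cue
# (i = 1), negative weights for an anaphora cue (i = 11) and subject omission (i = 14).
weights = {1: 1.0, 11: -0.5, 14: -0.25}
points = [
    {1: 1.0},            # p(1, 2): topical marker cue fired
    {11: 1.0, 14: 1.0},  # p(2, 3): anaphora and subject omission cues fired
    {},                  # p(3, 4): no cue fired
]
ranking = rank_boundaries(points, weights)
```

With these toy values the highest-ranked candidate is p(1, 2), where only a positively weighted cue fired, and p(2, 3) ranks last because both of its cues carry negative weights.
-->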