<?xml version="1.0" standalone="yes"?>
<Paper uid="C82-1055">
<Title>LEXICAL PARALLELISM IN TEXT STRUCTURE DETERMINATION AND CONTENT ANALYSIS</Title>
<Section position="3" start_page="0" end_page="339" type="metho">
<SectionTitle> 1. INTRODUCTION </SectionTitle>
<Paragraph position="0"> Lexical parallelism, that is, the repetition of lexical items, is an important device for indicating the sentence connections in a text (discourse). The recurrent lexical items, or lexical equivalents, need not have the same syntactic function or part of speech in the two sentences in which they occur. They may be identical in form and in meaning, or they may be related by a lexico-semantic relationship such as synonymy, hyponymy or antonymy. In a special case they may be partly identical both in form and in meaning, as in ~ (ultrasonic wave), ~ (sound wave) and ~ (sound).</Paragraph>
<Paragraph position="1"> Another device for indicating the sentence connections is a syntactic device, such as substitutes, logical connecters, time and place relaters and structural parallelism [1]. For example, in Japanese substitutes---</Paragraph>
<Paragraph position="3"> mentioned), and logical connecters--~ (and), ~ (or), m--~ (secondly) belong to this device.</Paragraph>
<Paragraph position="4"> Sevbo studied lexical parallelism in normalized text, where substitutes were replaced by their lexical equivalents and complex sentences were decomposed into successive simple sentences (clauses).</Paragraph>
<Paragraph position="5"> She traced the repetition patterns of lexical items in Subject/Predicate opposition.
She assumed that the syntactic subject or its dependent, direct or indirect, corresponds to &quot;Subject (old information) of elementary thought&quot; and the syntactic predicate or its dependent to &quot;Predicate (new information) of elementary thought&quot; [2].</Paragraph>
<Paragraph position="6"> In Japanese, sentence components may occur in any position before the predicate, and old information, or the topic, is placed as a rule at or near the beginning of a sentence [3]. In the following discussion we analyze the repetition of lexical items in an unnormalized text without regard to their syntactic functions, parts of speech and topic/comment distinctions, assuming that the lexical equivalents at or near the beginning of the sentences function as the keywords in indicating the sentence connections and the contents of a text.</Paragraph>
<!-- running head, page 340: Y. SAKAMOTO and T. OKAMOTO -->
<Paragraph position="7"> In Japanese, nouns do not inflect, and most verbs and adjectives consist of an unchanging stem and inflectional suffixes. The important concepts and technical terms (noun, verb or adjective stems) are written in Kanji (Chinese ideographs) or Katakana (square Japanese syllabary). Katakana is used to transcribe foreign technical terms. Hiragana (Japanese cursive syllabary), on the other hand, is used to write post-positional particles and suffixes denoting case, topic, mood, tense, aspect, etc. In view of these facts we define a lexical item as a word or phrase written in Kanji or Katakana.</Paragraph>
<Paragraph position="8"> We have studied lexical parallelism in a short tale [4] and in technical and scientific texts [5,6], based on Sevbo's approach. The purpose of the present paper is to obtain the characteristics of lexical parallelism in Japanese technical and scientific texts and to explore the possibilities of utilizing these characteristics for automatic content analysis.</Paragraph>
<Paragraph position="9"> Five text samples are used for experiment and discussion.
They are the essays on &quot;Ultrasonic amplification&quot; (Text A), &quot;Brain and automaton&quot; (Text B), &quot;Petrochemical industry&quot; (Text C), &quot;Chemical industry in Japan&quot; (Text D) and &quot;Between organism and inanimate matter&quot; (Text E).</Paragraph>
</Section>
<Section position="4" start_page="339" end_page="339" type="metho">
<SectionTitle> 2. LEXICAL PARALLELISM RATIO </SectionTitle>
<Paragraph position="0"> The sentence connection of type t in position w is determined between the given j-th sentence Sj and the i-th sentence Si (i &lt; j) if and only if Si is the nearest preceding sentence which contains a lexical item lexically equivalent, through the type t repetition, to the w-th lexical item from the beginning of the given sentence Sj (t = 1,2,3; w = 1,2,3,4,5). The repetitions of type 1, 2 and 3 correspond to the identical, partly identical and lexico-semantic repetitions, respectively. The lexical equivalents in Sj and Si are called lexical parallelism indicators, and Sj is called a dependent on Si. The lexical parallelism ratio of type t in position w is defined as n / (N - 1), where n is the number of the determined connections of type t in position w in a text; N - 1 is the determinable maximum number of the sentence connections in a text, N being the total number of the sentences in the text; t is the type of lexical repetition; and w is the position, i.e. the sequence number from the beginning of the sentence.</Paragraph>
<Paragraph position="1"> The experiments were carried out to obtain the characteristics of the lexical parallelism in sample texts on computer and by hand.</Paragraph>
<Paragraph position="2"> In the computer experiment, lexical items, i.e. sequences in Kanji or Katakana, were identified and segmented by machine character codes, without syntactic or morphological analysis. Then the sentence connections of type 1 (identical repetition) were determined in each position and the lexical parallelism ratios were obtained (Table 1). On the same samples the optimal sentence connections were determined manually and the lexical parallelism ratios were calculated (Table 2). Except for Text E, the totals of the ratios amount to 72-83% (cf. Table 2), and in the computer experiment the ratios of type 1 in the initial position amount to 57-68% (cf. Table 1). Moreover, the initial lexical items (w=1) show the maxima in most samples in Table 1 and by far the highest value in all samples in Table 2, and the ratios decrease with increasing w in Table 2. It is clear from the results that lexical parallelism plays an important role in the intersentential dependency and that the lexical items at the beginning of the sentences are the most relevant lexical parallelism indicators.</Paragraph>
<Paragraph position="3"> 3. LEXICAL PARALLELISM INDICATOR DISTANCE
As an example, the intersentential dependency determined manually in Text A, the essay on &quot;Ultrasonic amplification&quot; with 123 sentences in four paragraphs, is shown in Table 3 and Figure 1. The lexical parallelism indicator distances are shown as well. The lexical parallelism indicator distance is defined as follows:</Paragraph>
<Paragraph position="5"> D = j - i</Paragraph>
<Paragraph position="7"> where D is the lexical parallelism indicator distance of a connection of type t in position w, t is the type of lexical repetition, w is the position of the lexical indicator, and i and j are the sequence numbers of the governor sentence and the dependent sentence, respectively.</Paragraph>
<Paragraph position="8"> The distance is supposed to represent the semantic extent of the lexical parallelism indicators, or rather of the concepts referred to by them. In Figure 1 a diagonal unit-distance line indicates the hypothetical situation in which every sentence depends on the immediately preceding sentence.
Data show a tendency to distribute near this line in all samples.</Paragraph>
<Paragraph position="9"> Lexical parallelism indicators show the progress of the author's thought in the text in Table 3. Sevbo pointed out the significance of the indicators with large D in indicating the contents of paragraphs and texts. The lexical items with large D are supposed to be the important topics, to which the author of the text returns after commenting on other topics. In the example, the items with large D (D>10) are shown in Figure 2. These indicators are distributed among paragraphs. For example, the indicator ~i~ (ultrasonic wave) extends over 15 sentences (from the 9th to the 24th) within paragraph 2, which ranges from the 2nd to the 40th sentence, and the indicator ~ (traveling-wave tube) likewise extends over 22 sentences (100th-122nd) within paragraph 4 (85th-123rd). The indicator ~m (traveling-wave amplification) covers paragraph 3 completely, ranging from the 41st sentence, the first sentence of the paragraph, through the 67th sentence to the 85th sentence, the first sentence of the next paragraph. In short, these indicators divide the text into the three paragraphs.</Paragraph>
<Paragraph position="10"> In addition, they appropriately reflect the contents of the paragraphs in the sample text, as suggested by the fact that they are partly identical with the following paragraph names: &quot;Introduction&quot; (paragraph 1), &quot;What is the ultrasonic wave?&quot; (paragraph 2), &quot;Microwave and traveling-wave tube&quot; (paragraph 3) and &quot;Ultrasonic wave and traveling-wave amplification&quot; (paragraph 4).</Paragraph>
<Paragraph position="11"> These data suggest that the indicators with large D may be useful as keywords to the contents of a text.</Paragraph>
</Section>
</Paper>
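<!--
The type 1 (identical repetition) procedure of Sections 2 and 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' program. The assumptions are: the Unicode ranges used to recognise Kanji and Katakana, the treatment of a mixed Kanji/Katakana run as a single lexical item, the reading of "contains the lexical item" as exact equality with one of the items of the earlier sentence, and the expression of the ratio as a percentage. All function names are hypothetical, and the four sample sentences are invented for illustration.

```python
import re

# Assumption: a lexical item is a maximal run of Kanji or Katakana characters,
# recognised purely from character codes (no morphological analysis).
LEXICAL_ITEM = re.compile("[\u4e00-\u9fff\u3005\u30a1-\u30fa\u30fc]+")


def lexical_items(sentence):
    """Return the Kanji/Katakana runs of a sentence in order of appearance."""
    return LEXICAL_ITEM.findall(sentence)


def type1_connections(sentences, max_w=5):
    """Find identical-repetition connections as (w, i, j, D) tuples.

    i and j are 1-based sentence numbers of the governor and the dependent
    sentence, and D = j - i is the indicator distance.
    """
    items = [lexical_items(s) for s in sentences]
    found = []
    for j in range(1, len(items)):
        for w, item in enumerate(items[j][:max_w], start=1):
            # Scan backwards so that the nearest preceding sentence wins.
            for i in range(j - 1, -1, -1):
                if item in items[i]:
                    found.append((w, i + 1, j + 1, j - i))
                    break
    return found


def parallelism_ratio(connections, n_sentences, w):
    """Percentage of the N - 1 possible connections found in position w."""
    n = sum(1 for c in connections if c[0] == w)
    return 100.0 * n / (n_sentences - 1) if n_sentences > 1 else 0.0


# Invented sample text (four sentences), used only to exercise the functions.
sample = [
    "超音波は音波の一種である。",
    "超音波の増幅は進行波管に似ている。",
    "進行波管はマイクロ波を増幅する。",
    "音波は空気を伝わる。",
]
connections = type1_connections(sample)
for w in range(1, 4):
    print(w, round(parallelism_ratio(connections, len(sample), w), 1))
```

With the invented sample, every dependent sentence is connected through its initial lexical item (w = 1), and the last sentence reaches back three sentences (D = 3) to the first one.
-->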