<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1018">
  <Title>Chinese Text Summarization Based on Thematic Area Detection</Title>
  <Section position="3" start_page="0" end_page="0" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> Research on automatic summarization began with H. P. Luhn's work. Since then, a large number of scholars have contributed to the field and achieved many results. Most researchers have concentrated on sentence-extraction summarization (the so-called shallower approach) (Wang et al., 2003; Nomoto and Matsumoto, 2001; Gong and Liu, 2001) rather than on sentence-generation summarization (the so-called deeper approach) (Yang and Zhong, 1998). On the one hand, this is due to the high complexity and the severely limited practical domains of rationalist natural language processing and knowledge engineering technology. On the other hand, it is closely associated with the great achievements that statistical, machine learning and pattern recognition methods have brought to many fields of natural language processing in recent years (Mani, 2001).</Paragraph>
    <Paragraph position="1"> Sentence-extraction summarization methods can be roughly divided into two kinds: supervised and unsupervised (Nomoto and Matsumoto, 2001). The former generally relies on a large number of manual summaries, the so-called &quot;gold standards&quot;, which help determine the parameters of the statistical summarization model. However, not everyone believes that manual summaries are reliable, so researchers have begun to investigate general unsupervised methods, which do not require manual summaries. Nevertheless, it was soon discovered that the summaries produced by unsupervised methods fail to cover all the themes of a document and at the same time contain considerable redundancy. Usually they cover only the intensively distributed themes and neglect the others. To overcome these problems, researchers at Nanjing University proposed a summarization method based on analysis of the discourse structure (Wang et al., 2003). The method counts the words repeated across adjacent paragraphs of the document to work out the semantic distances between them, then analyses the thematic structure of the document and extracts sentences from each theme to form a summary. It is well suited to documents with a standard discourse structure, because it effectively avoids the problems of summarization methods that do not analyse discourse structure. Yet when the writing style of a document is rather free and the distribution of themes is variable, that is, when the same theme can be spread over several non-adjacent paragraphs, the method is less effective.</Paragraph>
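The adjacent-paragraph idea described above can be illustrated with a minimal sketch. It is not the formulation of Wang et al. (2003), which this section does not spell out: the Jaccard distance, the threshold value, and the function names overlap_distance, adjacent_distances and split_into_themes are all assumptions made for illustration, and paragraphs are assumed to be already segmented into words.

```python
from typing import List, Sequence


def overlap_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard distance between the word sets of two paragraphs.

    Assumption: the source only says repeated words are counted;
    Jaccard distance is one simple way to turn that count into a distance.
    """
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty paragraphs are treated as identical
    return 1.0 - len(set_a.intersection(set_b)) / len(union)


def adjacent_distances(paragraphs: Sequence[Sequence[str]]) -> List[float]:
    """Distance between each paragraph and the one that follows it."""
    return [overlap_distance(p, q) for p, q in zip(paragraphs, paragraphs[1:])]


def split_into_themes(
    paragraphs: Sequence[Sequence[str]], threshold: float = 0.8
) -> List[List[int]]:
    """Group paragraph indices into runs of adjacent paragraphs.

    A new theme starts wherever two neighbours share few words,
    i.e. their distance exceeds the (hypothetical) threshold.
    """
    if not paragraphs:
        return []
    themes = [[0]]
    for i, dist in enumerate(adjacent_distances(paragraphs), start=1):
        if dist > threshold:
            themes.append([i])
        else:
            themes[-1].append(i)
    return themes
```

Because only neighbouring paragraphs are compared, a theme that returns after an unrelated paragraph is placed in a separate group, which is the limitation noted above for documents with freely distributed themes.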
    <Paragraph position="2"> To deal with the many Chinese documents that have a free writing style and flexible themes, a sentence-extraction summarization method based on detecting thematic areas is tried here, following such work as (Nomoto and Matsumoto, 2001; Salton et al., 1996; Salton et al., 1997; Carbonell and Goldstein, 1998; Lin and Hovy, 2000). The thematic areas in a document are detected through adaptive clustering of paragraphs (cf. Moens et al., 1999).</Paragraph>
    <Paragraph position="3"> As a result, the method can overcome, to a certain degree, the defects of the above methods in dealing with documents whose theme distribution is rather flexible.</Paragraph>
  </Section>
</Paper>