<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1102">
  <Title>A Practical Text Summarizer by Paragraph Extraction for Thai</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Electronic texts are becoming increasingly common. Newspapers and magazines are now widely available on the World-Wide Web. Summarizing these texts can help users access their information content more quickly. However, having humans perform this task is costly and time-consuming. Automatic text summarization offers a solution to this problem.</Paragraph>
    <Paragraph position="1"> Automatic text summarization can be broadly classified into two approaches: abstraction and extraction. In contrast to abstraction, which requires heavy machinery from natural language processing (NLP), including grammars and lexicons for parsing and generation (Hahn and Mani, 2000), extraction can be viewed simply as the process of selecting relevant excerpts (sentences, paragraphs, etc.) from the original document and concatenating them into a shorter form. Thus, most recent work in this research area is based on extraction (Goldstein et al., 1999). Although one may argue that the extraction approach makes the text hard to read due to the lack of coherence, this also depends on the objective of summarization. If we need to generate summaries that indicate what topics are addressed in the original document, and thus can be used to alert users to the source content, i.e., the indicative function (Mani et al., 1999), the extraction approach is capable of handling this kind of task.</Paragraph>
    <Paragraph position="2"> There has been much research on the text summarization problem. However, for Thai, we are still at an initial stage of developing mechanisms for automatically summarizing documents. Summarizing Thai documents is a challenge, since they differ substantially from documents written in English. As in Chinese or Japanese, the Thai writing system has no boundaries between adjoining words, and there are also no explicit sentence boundaries within a document. Fortunately, the Thai writing system does use paragraph structure, which is indicated by indentations and blank lines. Therefore, extracting text spans from Thai documents at the paragraph level is a more practical approach.</Paragraph>
    <Paragraph position="3"> In this paper, we propose a practical approach to Thai text summarization that extracts the most relevant paragraphs from the original document. Our approach considers both the local and global properties of these paragraphs, whose meaning will become clear later. We also present an efficient approach to the Thai word segmentation problem, which enhances a basic word segmentation algorithm so that it yields more useful output. We provide experimental evidence that our approach achieves acceptable performance. Furthermore, our approach does not require any external knowledge beyond the document itself, and is able to summarize general text documents.</Paragraph>
    <Paragraph position="4"> The remainder of this paper is organized as follows. In Section 2, we review some related work and contrast it with our work. Section 3 describes the preprocessing for Thai text, particularly word segmentation. In Section 4, we present our approach for extracting relevant paragraphs in detail, including how to find clusters of significant words, how to discover relations between paragraphs, and an algorithm for combining these two approaches. Section 5 describes our experiments. Finally, we conclude in Section 6 with some directions for future work.</Paragraph>
  </Section>
</Paper>