<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1102">
  <Title>A Practical Text Summarizer by Paragraph Extraction for Thai</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Preprocessing for Thai Text
</SectionTitle>
    <Paragraph position="0"> The first step for working with Thai text is to tokenize a given text into meaningful words, since the Thai writing system has no delimiters to indicate word boundaries. Thai words are not delimited by spaces. The spaces are only used to break the idea or draw readers' attention. In order to determine word boundaries, we employed the longest matching algorithm (Sornlertlamvanich, 1993). The longest matching algorithm starts with a text span that could be a phrase or a sentence. The algorithm tries to align word boundaries according to the longest possible matching character compounds in a lexicon. If no match is found in the lexicon, it drops the right-most character in that text according to the morphological rules and begins the same search. If a word is found, it marks a boundary at the end of the longest word, and then begins the same search starting at the remainder following the match.</Paragraph>
    <Paragraph position="1"> In our work, the lexicon contained 32675 words.</Paragraph>
    <Paragraph position="2"> However, the limitation of this algorithm is that if the target words are compound words or unknown words, it tends to produce incorrect results. For example, a compound word is segmented as the following: null `ngkh krs ithth imn usychn  `ngkh kr _s ithth i_mn u_sy _chn Since this compound word does not appear in the lexicon, it becomes small useless words after the word segmentation process. We further describe an efficient approach to alleviate this problem by using an idea of phrase construction (Ohsawa et al., 1998). Let wi be a word that is firstly tokenized by using the longest matching algorithm. We refer to w1w2 ...wn as a phrase candidate, if n &gt; 1, and no punctuation and stopwords occur between w1 and wn. It is well accepted in information retrieval community that words can be broadly classified into content-bearing words and stopwords. In Thai, we found that words that perform as function words can be used in place of stopwords similar to English.</Paragraph>
    <Paragraph position="3"> We collected 253 most frequently occurred words for making a list of Thai stopwords.</Paragraph>
    <Paragraph position="4"> Given a phrase candidate consisting of n words, we can generate a set of phrases in the following form:</Paragraph>
    <Paragraph position="6"> For example, if a phrase candidate consists of four words, w1w2w3w4, we then obtain W = {w1w2,w1w2w3,w1w2w3w4,w2w3,w2w3w4,w3w4}.</Paragraph>
    <Paragraph position="7"> Let l be the number of set elements that can be computed from l = (n*(n[?]1))/2 = (4*3)/2 = 6.</Paragraph>
    <Paragraph position="8"> Since we use both stopwords and punctuation for bounding the phrase candidate, this approach produces a moderate number of set elements.</Paragraph>
    <Paragraph position="9"> Let V be a temporary lexicon. After building all the phrase candidates in the document and generating their sets of phrases, we can construct V by adding phrases that the number of occurrences exceeds some threshold. This idea is to exploit redundancy of phrases occurring in the document.</Paragraph>
    <Paragraph position="10"> If a generated phrase frequently occurs, this indicates that it may be a meaningful phrase, and should be included in the temporary lexicon using for resegmenting words.</Paragraph>
    <Paragraph position="11"> We denote U to be a main lexicon. After obtaining the temporary lexicon V , we then re-segment words in the document by using U [?] V . With using the combination of these two lexicons, we can recover some words from the first segmentation. Although we have to do the word segmentation process twice, the computation time is not prohibitive. Furthermore, we obtain more meaningful words that can be extracted to form keywords of the document.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Generating Summaries by Extraction
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Finding Clusters of Significant Words
</SectionTitle>
      <Paragraph position="0"> In this section, we first describe an approach for finding clusters of significant words in each paragraph to calculate the local clustering score. Our approach is reminiscent of Luhn's approach (1959) but uses the other term weighting technique instead of the term frequency. Luhn suggested that the frequency of a word occurrence in a document, as well as its relative position determines its significance in that document. More recent works have also employed Luhn's approach as a basis component for extracting relevant sentences (Buyukkokten et al., 2001; Lam-Adesina and Jones, 2001). This approach performs well despite of its simplicity. In our previous work (Jaruskulchai et al., 2003), we also applied this approach for summarizing and browsing Thai documents through PDAs.</Paragraph>
      <Paragraph position="1"> Let b be a subset of a continuous sequence of words in a paragraph, {wu ...wv}. The subset b is called a cluster of significant words if it has these characteristics: * The first word wu and the last word wv in the sequence are significant words.</Paragraph>
      <Paragraph position="2"> * Significant words are separated by not more than a predefined number of insignificant words.</Paragraph>
      <Paragraph position="3"> For example, we can partition a continuous sequence of words in a paragraph into clusters as shown in Figure 1. The paragraph consists of twelve words. We use the boldface to indicate positions of significant words. Each cluster is enclosed with brackets. In this example, we define that a cluster is created whereby significant words are separated by not more than three insignificant words. Note that many clusters of significant words can be found in the paragraph. The highest score of the clusters found in the paragraph is selected to be the paragraph score. Therefore, the local clustering score for paragraph si can be calculated as follows:</Paragraph>
      <Paragraph position="5"> where ns(b,si) is the number of bracketed significant words, and n(b,si) is the total number of bracketed words.</Paragraph>
      <Paragraph position="6"> We can see that the first important step in this process is to mark positions of significant words for identifying the clusters. Our goal is to find topical words, which are indicative of the topics underlying the document. According to Luhn's approach, the term frequencies is used to weight all the words.</Paragraph>
      <Paragraph position="7"> The other term weighting scheme frequently used is TFIDF (Term Frequency Inverse Document Frequency) (Salton and Buckley, 1988). However, this technique needs a corpus for computing IDF score, causing the genre-dependent problem for generic text summarization task.</Paragraph>
      <Paragraph position="8"> In our work, we decide to use TLTF (Term Length Term Frequency) term weighting technique (Banko et al., 1999) for scoring words in the document instead of TFIDF. TLTF multiplies a monotonic function of the term length by a monotonic function of the term frequency. The basic idea of TLTF is based on the assumption that words that are used more frequently tend to be shorter. Such words are not strongly indicative of the topics underlying in the document, such as stopwords. In contrast, words that are used less frequently tend to be longer. One significant benefit of using TLTF term weighting technique for our task is that it does not require any external resources, only using the information within the document.</Paragraph>
      <Paragraph position="10"/>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Discovering Relations of Paragraphs
</SectionTitle>
      <Paragraph position="0"> We now move on to describe an approach for discovering relations of paragraphs. Given a document D, we can represent it by an undirected graph G = (V,E), where V = {s1,...,sm} is the set of paragraphs in that document. An edge (si,sj) is in E, if the cosine similarity between paragraphs si and sj is above a certain threshold, denoted a.</Paragraph>
      <Paragraph position="1"> A paragraph si is considered to be a set of words {wsi,1,wsi,2,...,wsi,t}. The cosine similarity between two paragraphs can be calculated by the following formula:</Paragraph>
      <Paragraph position="3"> The graph G is called the text relationship map of D (Salton et al., 1999). Let dsi be the degree of node si. We then refer to dsi as the global connectivity score. Generating a summary for a given document can be processed by sorting all the nodes with dsi in decreasing order, and then extracting n top-ranked nodes, where n is the target number of paragraphs in the summary.</Paragraph>
      <Paragraph position="4"> This idea is based on Salton et al.'s approach that also performs extraction at the paragraph level. They suggested that since a highly bushy node is linked to a number of other nodes, it has an overlapping vocabulary with several paragraphs, and is likely to discuss topics covered in many other paragraphs.</Paragraph>
      <Paragraph position="5"> Consequently, such nodes are good candidates for extraction. They then used a global bushy path that is constructed out of n most bushy nodes to form the summary. Their experimental results on encyclopedia articles demonstrates reasonable results. However, when we directly applied this approach for extracting paragraphs from moderately-sized documents, we found that using only the global connectivity score is inadequate to measure the informativeness of paragraphs in some case. In order to describe this situation, we consider an example of a text relationship map in Figure 2. The map is</Paragraph>
      <Paragraph position="7"/>
      <Paragraph position="9"> but using a = 0.20.</Paragraph>
      <Paragraph position="10"> constructed from an online newspaper article.1 The similarity threshold a is 0.1. As a result, edges with similarities less than 0.1 do not appear on the map. Node P4 obtains the maximum global connectivity score at 9. However, the global connectivity score of nodes P3, P5, and P6 is 7, and nodes P7 and P8 is 6, which are slightly different. When we increase the threshold a = 0.2, we obtain a text relationship map as shown in Figure 3. Nodes P4 and P7 now achieve the same maximum global connectivity score at 5.</Paragraph>
      <Paragraph position="11"> Nodes P3, P5, and P6 get the same score at 4.</Paragraph>
      <Paragraph position="12"> From above example, it is hard to determine that  Our preliminary experiments with many other documents lead to the suggestion that the global connectivity score of nodes in the text relation map tends to be slightly different on some document lengths.</Paragraph>
      <Paragraph position="13"> Given a compression rate (ratio of the summary length to the source length), if we immediately extract these nodes of paragraphs, many paragraphs with the same score are also included in the summary. null</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Combining Local and Global Properties
</SectionTitle>
      <Paragraph position="0"> In this section, we present an algorithm that takes advantage of both the local and global properties of paragraphs for generating extractive summaries.</Paragraph>
      <Paragraph position="1"> From previous sections, we describe two different approaches that can be used to extract relevant paragraphs. However, these extraction schemes are based on different views and concepts. The local clustering score only captures the content of information within paragraphs, while the global connectivity score mainly considers the structural aspect of the document to evaluate the informativeness of paragraphs. This leads to our motivation for unifying good aspects of these two properties. We can consider the local clustering score as the local property of paragraphs, and the global connectivity score as the global property. Here we propose an algorithm that combines the local clustering score with the global connectivity score to get a single measure reflecting the informativeness of each paragraph, which can be tuned according to the relative importance of properties.</Paragraph>
      <Paragraph position="2"> Our algorithm proceeds as follows. Given a document, we start by eliminating stopwords and extracting all unique words in the document. These unique words are used to be the document vocabulary. Therefore, we can represent a paragraph si as a vector. We then compute similarities between all the paragraph vectors using equation (3), and eliminate edges with similarities less than a threshold in order to build the text relationship map. This process automatically yields the global connectivity scores of the paragraphs. Next, we weight each word in the document vocabulary using TLTF term weighting technique. All the words are sorted by their TLTF scores, and top r words are selected to be significant words.</Paragraph>
      <Paragraph position="3"> We mark positions of significant words in each paragraph to calculate the local clustering score. After obtaining both scores, for each paragraph si, we can compute the combination score by using the following ranking function:</Paragraph>
      <Paragraph position="5"> where Gprime is the normalized global connectivity score, and Lprime is the normalized local clustering score. The normalized global connectivity score Gprime can be calculated as follows:</Paragraph>
      <Paragraph position="7"> where dmax is the degree of the node that has the maximum edges using for normalization, resulting the score in the range of [0,1]. Using equation (2), Lprime is given by:</Paragraph>
      <Paragraph position="9"> where Lmax is the maximum local clustering score using for normalization. Similarly, it results this score in the range of [0,1]. The parameter l is varied depending on the relative importance of the components Gprime and Lprime. Therefore, we can rank all the paragraphs according to their combination scores in decreasing order. We finally extract n top-ranked paragraphs corresponding to the compression rate, and rearrange them in chronological order to form the output summary.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>