<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1046">
  <Title>Knowledge-based Automatic Topic Identification</Title>
  <Section position="3" start_page="0" end_page="308" type="metho">
    <SectionTitle>
2 The Power of Generalization
</SectionTitle>
    <Paragraph position="0"> In order to count concept frequency, we employ a concept generalization taxonomy. Figure 1 shows a possible hierarchy for the concept digital computer.</Paragraph>
    <Paragraph position="1"> According to this hierarchy, if we find laptop and hand-held computer in a text, we can infer that the text is about portable computers, which is their parent concept. If, in addition, the text also mentions workstation and mainframe, it is reasonable to say that the topic of the text is related to digital computer.</Paragraph>
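As an illustration of counting concept frequency through a generalization taxonomy, here is a minimal Python sketch; the toy PARENT mapping and the function name are invented for this example and are not part of the paper's system:

```python
from collections import Counter

# Toy generalization taxonomy: child concept -> parent concept (illustrative).
PARENT = {
    "laptop": "portable computer",
    "hand-held computer": "portable computer",
    "portable computer": "digital computer",
    "workstation": "digital computer",
    "mainframe": "digital computer",
}

def concept_frequencies(mentions):
    """Count each mentioned concept and propagate the count to every ancestor."""
    counts = Counter()
    for concept in mentions:
        while concept is not None:
            counts[concept] += 1
            concept = PARENT.get(concept)
    return counts

freqs = concept_frequencies(
    ["laptop", "hand-held computer", "workstation", "mainframe"]
)
# "digital computer" accumulates the counts of all four leaf mentions.
```

Propagating each mention upward is what lets a mid-level concept such as portable computer collect evidence from several distinct leaf concepts.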
    <Paragraph position="2"> Using a hierarchy, the question is now how to find the most appropriate generalization. Clearly we cannot just use the leaf concepts, since at this level we have gained no power from generalization. On the other hand, neither can we use the very top concept: everything is a thing. We need a method of identifying the most appropriate concepts somewhere in the middle of the taxonomy. Our current solution uses a branch ratio threshold and a starting depth, described next.</Paragraph>
    <Section position="1" start_page="308" end_page="308" type="sub_section">
      <SectionTitle>
2.1 Branch Ratio Threshold
</SectionTitle>
      <Paragraph position="0"> We call the frequency of occurrence of a concept C and its subconcepts in a text the concept's weight.2</Paragraph>
      <Paragraph position="1"> We then define the ratio R(C) at any concept C as follows: R(C) = MAX(weight of all the direct children of C) / SUM(weight of all the direct children of C). R is a way to identify the degree of summarization informativeness. The higher the ratio, the less concept C generalizes over many children, i.e., the more it reflects only one child. Consider Figure 2. In case (a) the parent concept's ratio is 0.70, and in case (b) it is 0.30, by the definition of R. To generate a summary for case (a), we should simply choose Apple as the main idea instead of its parent concept, since it is by far the most mentioned. In contrast, in case (b), we should use the parent concept Computer Company as the concept of interest. Its small ratio, 0.30, tells us that if we go down to its children, we will lose too much important information. We define the branch ratio threshold (Rt) to serve as a cutoff point for the determination of interestingness, i.e., the degree of generalization: if a concept's ratio R(C) is less than Rt, it is an interesting concept.</Paragraph>
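The branch ratio can be sketched in a few lines of Python; the child weights below are illustrative, chosen to reproduce the 0.70 and 0.30 cases of Figure 2, and the threshold value 0.68 is the one reported later in the experiment section:

```python
def branch_ratio(child_weights):
    """R = MAX(child weights) / SUM(child weights) over a concept's direct children."""
    return max(child_weights) / sum(child_weights)

# Case (a): one child (e.g. Apple) dominates, so the ratio is high and we
# should descend to that child rather than keep the parent.
ratio_a = branch_ratio([7, 2, 1])      # 7 / 10 = 0.70
# Case (b): weight is spread across children, so the parent is the better topic.
ratio_b = branch_ratio([3, 3, 2, 2])   # 3 / 10 = 0.30

Rt = 0.68  # branch ratio threshold used in the paper's experiments
interesting_a = ratio_a < Rt  # False: go down to the dominant child
interesting_b = ratio_b < Rt  # True: the parent is an interesting concept
```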
    </Section>
    <Section position="2" start_page="308" end_page="308" type="sub_section">
      <SectionTitle>
2.2 Starting Depth
</SectionTitle>
      <Paragraph position="0"> We can use the ratio to find all the possible interesting concepts in a hierarchical concept taxonomy.</Paragraph>
      <Paragraph position="1"> If we start from the top of a hierarchy and proceed downward along each child branch whenever the branch ratio is greater than or equal to Rt, we will eventually stop with a list of interesting concepts. We call these interesting concepts the interesting wavefront. We can start another exploration of interesting concepts downward from this interesting wavefront, resulting in a second, lower wavefront, and so on. By repeating this process until we reach the leaf concepts of the hierarchy, we can get a set of interesting wavefronts. (Footnote 2: by this definition, a parent concept always has a weight greater than or equal to that of its maximum-weighted direct child; a concept is considered one of its own direct children.)</Paragraph>
      <Paragraph position="4"> Among these interesting wavefronts, which one is the most appropriate for generation of topics? Using the concept counting technique we have suggested so far, a concept higher in the hierarchy tends to be more general, while a concept lower in the hierarchy tends to be more specific. In order to choose an adequate wavefront with appropriate generalization, we introduce the parameter starting depth, Ds. We require that the branch ratio criterion defined in the previous section take effect only after the wavefront exceeds the starting depth; the first interesting wavefront generated after that point will be our collection of topic concepts. The appropriate Ds is determined by experimenting with different values and choosing the best one.</Paragraph>
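The descent described above can be sketched as follows; the tree encoding (children/weight dictionaries), the helper name, and the toy example are assumptions made for illustration, not the paper's implementation:

```python
def interesting_wavefront(children, weight, root, Rt=0.68, Ds=6):
    """Descend from the root; once past the starting depth Ds, stop at any
    concept whose branch ratio R = max(child weights) / sum(child weights)
    falls below the threshold Rt (i.e., the concept is "interesting")."""
    front, frontier = [], [(root, 0)]
    while frontier:
        concept, depth = frontier.pop()
        kids = children.get(concept, [])
        if not kids:                      # leaf: cannot generalize further
            front.append(concept)
            continue
        ratio = max(weight[k] for k in kids) / sum(weight[k] for k in kids)
        if depth >= Ds and ratio < Rt:    # low ratio: keep the parent concept
            front.append(concept)
        else:                             # high ratio (or above Ds): descend
            frontier.extend((k, depth + 1) for k in kids)
    return front

# Toy example (Ds lowered to 1 so the ratio criterion applies immediately):
children = {"thing": ["computer company", "peripheral"],
            "computer company": ["Apple", "IBM", "Dell"]}
weight = {"computer company": 10, "peripheral": 2,
          "Apple": 3, "IBM": 3, "Dell": 4}
front = interesting_wavefront(children, weight, "thing", Rt=0.68, Ds=1)
# "computer company" stays on the wavefront: R = 4 / 10 = 0.40 < 0.68.
```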
    </Section>
  </Section>
  <Section position="4" start_page="308" end_page="309" type="metho">
    <SectionTitle>
3 Experiment
</SectionTitle>
    <Paragraph position="0"> We have implemented a prototype system to test the automatic topic identification algorithm. As the concept hierarchy, we used the noun taxonomy from WordNet (Miller et al., 1990). WordNet has been used for other similar tasks, such as in (Resnik, 1993). For input texts, we selected articles about information processing, averaging 750 words each, from Business Week (93-94). We ran the algorithm on 50 texts, and for each text extracted the eight sentences containing the most interesting concepts.</Paragraph>
    <Paragraph position="1"> How should we evaluate the results? For each text, we obtained a professional's abstract from an online service. Each abstract contains 7 to 8 sentences on average. In order to compare the system's selection with the professional's, we identified in the text the sentences that contain the main concepts mentioned in the professional's abstract. We then scored how many sentences were selected by both the system and the professional abstracter. We are aware that this evaluation scheme is not very accurate, but it serves as a rough indicator for our initial investigation.</Paragraph>
    <Paragraph position="2"> We developed three variations for scoring text sentences based on the weights of the concepts in the interesting wavefront.</Paragraph>
    <Paragraph position="3">  1. the weight of a sentence is the sum of the weights of the parent concepts of the words in the sentence;</Paragraph>
    <Paragraph position="4"> 2. the weight of a sentence is the sum of the weights of the words in the sentence;</Paragraph>
    <Paragraph position="5"> 3. similar to variation 1, but each concept instance is counted only once per sentence.</Paragraph>
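The three variations might be sketched as follows; the dictionaries, function name, and example weights are hypothetical illustrations of the descriptions above, not the system's actual code:

```python
def score_sentence(words, weight, parent, variation):
    """Sentence weight under the three scoring variations described above."""
    if variation == 1:    # sum of weights of the words' parent concepts
        return sum(weight.get(parent.get(w, w), 0) for w in words)
    if variation == 2:    # sum of weights of the words themselves
        return sum(weight.get(w, 0) for w in words)
    if variation == 3:    # like 1, but each concept counted at most once
        return sum(weight.get(c, 0) for c in {parent.get(w, w) for w in words})
    raise ValueError("variation must be 1, 2, or 3")

# Illustrative weights: the parent concept carries the generalized count.
weight = {"portable computer": 5, "laptop": 2, "workstation": 1}
parent = {"laptop": "portable computer"}
words = ["laptop", "laptop", "workstation"]
scores = [score_sentence(words, weight, parent, v) for v in (1, 2, 3)]
```

Note how variation 3 differs from variation 1 only when a concept occurs more than once in the sentence.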
    <Paragraph position="6"> To evaluate the system's performance, we defined three counts: (1) hits, sentences identified by the algorithm and referenced by the professional's abstract; (2) mistakes, sentences identified by the algorithm but not referenced by the professional's abstract; (3) misses, sentences in the professional's abstract not identified by the algorithm. We then borrowed two measures from Information Retrieval research: recall = hits / (hits + misses), and precision = hits / (hits + mistakes).</Paragraph>
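A minimal sketch of the two borrowed measures, computed from the three counts just defined (these are the standard Information Retrieval definitions; the example numbers are illustrative):

```python
def recall_precision(hits, mistakes, misses):
    """recall = hits / (hits + misses); precision = hits / (hits + mistakes)."""
    recall = hits / (hits + misses) if hits + misses else 0.0
    precision = hits / (hits + mistakes) if hits + mistakes else 0.0
    return recall, precision

# e.g. 4 sentences picked by both the system and the abstracter, 4 picked
# only by the system, and 4 abstract sentences the system missed:
r, p = recall_precision(hits=4, mistakes=4, misses=4)  # both 0.5
```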
    <Paragraph position="8"> The closer these two measures are to unity, the better the algorithm's performance. The precision measure plays a central role in the text summarization problem: the higher the precision score, the higher the probability that the algorithm identifies the true topics of a text. We also implemented a simple plain word counting algorithm and a random selection algorithm for comparison.</Paragraph>
    <Paragraph position="9"> We averaged the results over the 50 input texts, with branch ratio threshold 0.68 and starting depth 6. The average scores for the three sentence scoring variations are 0.32 recall and 0.35 precision when the system produces extracts of 8 sentences, while the random selection method has 0.18 recall and 0.22 precision in the same experimental setting, and the plain word counting method has 0.23 recall and 0.28 precision.</Paragraph>
  </Section>
</Paper>