<?xml version="1.0" standalone="yes"?> <Paper uid="W95-0111"> <Title>Automatic Suggestion of Significant Terms for a Predefined Topic</Title> <Section position="3" start_page="131" end_page="135" type="metho"> <SectionTitle> 2. METHODOLOGY </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="131" end_page="132" type="sub_section"> <SectionTitle> 2.1 Manual versus Automatic Term Suggestion </SectionTitle> <Paragraph position="0"> To manually select significant terms for a predefined topic, the domain expert first creates a topic-focused sample from one specific source or a combination of sources. Then, he or she reads the documents, providing a relevance judgment (i.e. a reader-assigned score) for each document. By carefully examining the relevant documents in the focused sample, a list of terms deemed significant for the definition of the topic is identified. In many cases, the domain expert may also introduce terms based on his or her own professional knowledge of the topic. These terms may be highly prominent for the topic, yet may not necessarily occur in the focused sample.</Paragraph> <Paragraph position="1"> For automatic suggestion of topical terms, initial attempts were made using the sample documents the domain expert created. The results were not impressive. The statistical information generated from the sample documents was neither rich nor sufficient for any discriminative judgment. Our experience showed that, to draw terms reflective of a given topic, a much larger and more general base sample is required. Such a base sample should be randomly sampled from the same source as the focused sample and should contain an array of different topics. Once the baseline statistics are generated from both data collections, a meaningful comparison can spot terms that occur with unusual frequency in the focused sample. 
These terms would constitute good candidates for topically sensitive terminological units (Steier and Belew 1994).</Paragraph> </Section> <Section position="2" start_page="132" end_page="133" type="sub_section"> <SectionTitle> 2.2 Focused Sample and Base Sample </SectionTitle> <Paragraph position="0"> For our experiments in automatic term suggestion, we selected a predefined topic called &quot;European Politics and Business&quot;. The focused sample was originally created by the domain expert using 1988 United Press International (UPI) data. Table 1 presents statistical information about this dataset. After reading each of the relevant documents found in the focused sample, the domain expert manually determined 347 topical terms. Table 2 provides the statistical breakdown of these terms.</Paragraph> <Paragraph position="1"> Since the focused sample was drawn from the 1988 UPI source, the construction of its corresponding base sample was also initiated from the same source of the same year. Our experiments demonstrated that, in order to obtain a random assortment of topics for the base sample, it may be meaningful to sample documents from the time period before and after the focused documents. Therefore, the final base sample was created by randomly drawing documents from the years 1987, 1988 and 1989. The size of this dataset is about 27 times larger than the sample data file (see Table 1).</Paragraph> <Paragraph position="2"> Though the ratio between the focused and base samples was arbitrary, we felt that, in order to generate meaningful statistics, the base sample should be at least 20 times larger in size than the focused sample. 
(For the sake of discussion, we may hereafter refer to the focused sample as &quot;focused&quot; and the base sample as &quot;base&quot;.)</Paragraph> </Section> <Section position="3" start_page="133" end_page="133" type="sub_section"> <SectionTitle> 2.3 Experimental Procedure </SectionTitle> <Paragraph position="0"> The general method we adopted is as follows. First, we identified statistically significant terms from both samples. Next, a comparison algorithm was applied to these two sets of terms to single out those that were common to both samples, yet whose patterns of occurrence differed between the two samples. Finally, we analyzed and presented this set of terms as content-oriented candidates for the predefined topic, in this case &quot;European Politics and Business&quot;.</Paragraph> <Paragraph position="1"> The suggested terms are split into three categories: single-word terms, two-word terms and multi-word terms (or phrases). The following three sections describe in detail the methods for generating each of the three categories.</Paragraph> </Section> <Section position="4" start_page="133" end_page="134" type="sub_section"> <SectionTitle> 2.4 Suggesting Single Word Terms </SectionTitle> <Paragraph position="0"> Automatically suggesting single-word terms as topically oriented has been most challenging. Our experiments indicated that the &quot;first order&quot; statistics, probability and entropy, alone are not sufficient for gathering information about the topicality of a word in running text.</Paragraph> <Paragraph position="1"> The information in both measurements is essentially equivalent since entropy is just the log inverse of probability.</Paragraph> <Paragraph position="2"> We found that &quot;second-order&quot; statistics, such as the variance or standard deviation of term frequencies across documents, provide greater insight into topicality. We selected the interval between the occurrences of a word as the basis for analysis. 
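The interval notion can be illustrated with a short sketch (hypothetical code, not the authors' implementation; tokenization is assumed to be simple whitespace splitting):

```python
def occurrence_intervals(tokens, word):
    """Return the gaps (in tokens) between successive occurrences of word."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [b - a for a, b in zip(positions, positions[1:])]

# Example: "the" occurs at positions 0, 5 and 8, so the gaps are 5 and 3.
text = "the vote in parliament on the treaty split the coalition".split()
print(occurrence_intervals(text, "the"))  # → [5, 3]
```

A word that is topical for the sample would tend to show short, roughly even gaps of this kind throughout the focused documents.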
Our intuitions led us to believe that topical single words should appear more frequently and more regularly, i.e. at approximately even intervals, in the focused sample than in the base sample. The focused sample represents, more or less, a topical sublanguage set, while the base sample represents a general language set. Unlike probability and entropy statistics, which yield average scores for the whole document, the use of intervals makes it possible to get an &quot;instantaneous&quot; measure at any location in the document. More specifically, an interval can be measured &quot;instantaneously&quot; at any point in the text between the occurrences of a particular word. Though using intervals alone might still not be sufficient for identifying word topicality, it allowed us to measure the variance, which helps identify words that are always changing in their rate of occurrence.</Paragraph> <Paragraph position="3"> Thus, three scores were generated for each word: the mean log interval, the standard deviation of the mean log interval, and the normalized standard deviation of the mean log interval. The log scale for these measurements minimizes the effect of unduly large variations in words with long mean intervals. The normalized standard deviation is produced by simply dividing the raw standard deviation by the mean log interval. In most cases, the raw standard deviation is found to be larger for words having long mean intervals. 
In order to compare the standard deviations across words of different intervals, we found this normalization process quite useful.</Paragraph> <Paragraph position="5"> After scores were generated for all the words in both the focused sample and the base sample, score comparisons between the two samples were carried out in two ways: comparing the intervals and comparing the standard deviations.</Paragraph> <Paragraph position="6"> To compare the intervals, the &quot;base&quot; mean log interval was subtracted from the &quot;focused&quot; mean log interval and the difference divided by the raw standard deviation from the base sample. The result represents the change of mean log intervals. More explicitly, it yields the number of standard deviations by which the &quot;focused&quot; mean log interval differs from the &quot;base&quot; mean log interval. The more negative the value, the more significant the change, and the more prominent the word would appear in the focused sample.</Paragraph> <Paragraph position="7"> To compare the standard deviations, the normalized &quot;base&quot; standard deviation was subtracted from the normalized &quot;focused&quot; standard deviation. The difference symbolizes how the word is distributed in the focused sample. 
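The three scores and the two comparisons just described can be sketched as follows (a hypothetical illustration, not the authors' code, assuming the per-word intervals have already been extracted from each sample):

```python
import math

def interval_scores(intervals):
    """Mean log interval, its (raw) standard deviation, and the
    normalized standard deviation (raw std divided by the mean)."""
    logs = [math.log(i) for i in intervals]
    mean = sum(logs) / len(logs)
    std = math.sqrt(sum((x - mean) ** 2 for x in logs) / len(logs))
    return mean, std, std / mean

def compare_word(focused_intervals, base_intervals):
    """Two scores per word: change of mean log interval, expressed in
    units of the base sample's raw standard deviation, and the difference
    of the normalized standard deviations. Negative values on both
    suggest a topically prominent, 'bursty' word in the focused sample."""
    f_mean, f_std, f_norm = interval_scores(focused_intervals)
    b_mean, b_std, b_norm = interval_scores(base_intervals)
    interval_change = (f_mean - b_mean) / b_std
    burstiness_change = f_norm - b_norm
    return interval_change, burstiness_change
```

For a word with short, even gaps in the focused sample (say intervals of 2-3 tokens) and long, erratic gaps in the base sample, both returned scores come out negative, which is the selection criterion described below.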
The more negative the value is, the more &quot;bursty&quot; the distribution of the word, and the more likely it is content oriented, since &quot;content words tend to appear in bursts&quot; (Church and Mercer 1993).</Paragraph> <Paragraph position="8"> If a single-word term is found in both data samples and it receives negative scores from both the interval and standard deviation comparisons, it is included in the suggested list as topically oriented.</Paragraph> </Section> <Section position="5" start_page="134" end_page="135" type="sub_section"> <SectionTitle> 2.5 Suggesting Two-Word Terms </SectionTitle> <Paragraph position="0"> The method for suggesting two-word terms turned out to be much simpler than that for single-word terms, though the same techniques are equally applicable. Here, the traditional mutual information score was used. As stated in Church et al. (1991) and elsewhere, the mutual information measurement can be expressed as:</Paragraph> <Paragraph position="2"> I(w1;w2) = log2 [ p(w1w2) / (p(w1) p(w2)) ] where p(w1w2) is the frequency in the data collection of the two-word compound (w1,w2), and p(w1) and p(w2) are the frequencies of the word constituents. A high mutual information score indicates that the individual probabilities are low while the two words occur together frequently.</Paragraph> <Paragraph position="3"> Two steps led to our automatic suggestion of topic-oriented two-word terms. First, the mutual information score was computed for each pair of words occurring in each of the two samples. To capture topicality, we were only interested in pairs of words with high mutual information scores.</Paragraph> <Paragraph position="4"> Therefore, any pair which contained &quot;closed class&quot; words (such as determiners, prepositions or auxiliaries), single letters, digits, or overly common verbs like &quot;give&quot;, &quot;take&quot;, etc., was excluded. 
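The mutual information computation with the closed-class exclusion can be sketched as follows (a hypothetical illustration only; the stopword list and whitespace tokenization are stand-ins for the paper's actual closed-class filter):

```python
import math
from collections import Counter

# Illustrative stand-in for the paper's closed-class filter (determiners,
# prepositions, auxiliaries, single letters, digits, common verbs, ...).
CLOSED_CLASS = {"the", "a", "of", "in", "on", "is", "was", "give", "take"}

def bigram_mi(tokens):
    """I(w1;w2) = log2( p(w1w2) / (p(w1) p(w2)) ) for each adjacent
    word pair, skipping any pair that contains a closed-class word."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), count in bigrams.items():
        if w1 in CLOSED_CLASS or w2 in CLOSED_CLASS:
            continue  # the exclusion described in the text
        p12 = count / (n - 1)
        p1, p2 = unigrams[w1] / n, unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p12 / (p1 * p2))
    return scores
```

Running this separately over the focused and base samples yields the two score tables that the &quot;delta&quot; comparison described below is computed from.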
Such an exclusion not only helped in obtaining pairs of words with high mutual information scores, but also sped up computation significantly. A threshold was also set such that if any two-word unit occurred fewer than 3 times in the sample or received a mutual information score lower than 6.0, it was eliminated and did not participate in the next comparison measurement.</Paragraph> <Paragraph position="5"> With the mutual information scores in hand, a &quot;delta&quot; score was generated by subtracting the &quot;base&quot; mutual information score from the &quot;focused&quot; mutual information score. Topically prominent two-word terms normally have lower scores in the focused sample that is &quot;keyed&quot; to their topic. This is because the constituent words are distributed in a wider range of contexts: the probability of their occurring separately increases relative to the probability of their occurring together (Steier and Belew 1994). Therefore, the more negative the &quot;delta&quot; score, the more topically sensitive the two-word term.</Paragraph> <Paragraph position="6"> If a two-word term occurs in both data samples and receives a negative &quot;delta&quot; score, it is included in the suggested list as topically oriented.</Paragraph> </Section> <Section position="6" start_page="135" end_page="135" type="sub_section"> <SectionTitle> 2.6 Suggesting Multi-Word Terms </SectionTitle> <Paragraph position="0"> When automatically suggesting content two-word terms, we looked at the mutual information scores for adjacent words. For multi-word terms, the mutual information score was calculated for non-adjacent words. Our intuitions led us to believe that if there is a significant statistical linkage, i.e. 
a high mutual information score, between such a pair of words, it is highly possible that they belong to a larger linguistic unit.</Paragraph> <Paragraph position="1"> Our first step was to compute mutual information scores for word units separated by a distance of two (i.e. having one unspecified word between them). Two criteria apply when selecting &quot;interesting&quot; word units: their mutual information score must be 10 or greater, and, following the observations of Steier and Belew (1994), we only selected pairs which received a lower mutual information score in the focused sample than in the base sample.</Paragraph> <Paragraph position="2"> Once an &quot;interesting&quot; word unit of distance two was selected, a concordance was built of all sentences containing that word unit. These sentences were compared for matching text. If a string of text was found to include that word unit and, at the same time, occur most frequently in the concordance, its leading and trailing &quot;closed-set&quot; words (if any) were chopped off. The remaining text string was presented as a suggested multi-word term.</Paragraph> </Section> </Section> </Paper>