<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1125">
  <Title>A Measure of Term Representativeness Based on the Number of Co-occurring Salient Words</Title>
  <Section position="1" start_page="0" end_page="3" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We propose a novel measure of the representativeness (i.e., indicativeness or topic specificity) of a term in a given corpus. The measure embodies the idea that the distribution of words co-occurring with a representative term should be biased according to the word distribution in the whole corpus. The bias of the word distribution in the co-occurring words is defined as the number of distinct words whose occurrences are saliently biased in the co-occurring words. The saliency of a word is defined by a threshold probability that can be automatically defined using the whole corpus.</Paragraph>
    <Paragraph position="1"> Comparative evaluation clarified that the measure is clearly superior to conventional measures in finding topic-specific words in the newspaper archives of different sizes.</Paragraph>
    <Paragraph position="2"> Introduction Measuring the representativeness (i.e., the informativeness or domain specificity) of a term  is essential to various tasks in natural language processing (NLP) and information retrieval (IR). Such a measure is particularly crucial to automatic dictionary construction and IR interfaces to show a user words indicative of topics in retrievals that often consist of an intractably large number of documents (Niwa et al. 2000).</Paragraph>
    <Paragraph position="3"> This paper proposes a novel and effective measure of term representativeness that reflects the bias of the words co-occurring with a term. In the following, we focus on extracting topic words from an archive of newspaper articles.</Paragraph>
    <Paragraph position="4"> In the literature of NLP and IR, there have been a number of studies on term weighting, and these are strongly related to measures of term  A term is a word or a word sequence.</Paragraph>
    <Paragraph position="5"> representativeness (see section 1). In this paper we employ the basic idea of the 'baseline method' proposed by Hisamitsu (Hisamitsu et al. 2000). The idea is that the distribution of words co-occurring with a representative term should be biased according to the word distribution of the whole corpus. Concretely, for any term T and any measure M for the degree of bias of word occurrences in D(T), a set of words co-occurring with T, according to those of the whole corpus D  is an archive of newspaper articles and D(T) is defined as the set of all articles containing T.</Paragraph>
    <Paragraph position="6"> The normalization of M(D(T)) is done by a function B  are very different. We denote this normalized value by NormM(D(T)).</Paragraph>
    <Paragraph position="7"> Hisamitsu et al. reported that NormM(D(T)) is very effective in capturing topic-specific words when M(D(T)) is defined as the distance between two word distributions P D(T) and P  (see subsection 1.2), which we denote by Dist(D(T)).</Paragraph>
    <Paragraph position="8"> Although NormDist(D(T)) outperforms existing measures, it has still an intrinsic drawback shared by other measures, that is, words which are irrelevant to T and simply happen to occur in D(T) --- let us call these words non-typical words --- contribute to the calculation of M(D(T)). Their contribution accumulates as background noise in M(D(T)), which is the part to be offset by the baseline function. In other words, if M(D(T)) were to exclude the contribution of non-typical words, it would not need to be normalized and would be more precise. This consideration led us to propose a different approach to measure the bias of word occurrences in a discrete way: that is, we only take words whose occurrences are saliently biased in D(T) into account, and let the number of such words be the degree of bias of word occurrences in D(T). Thus, SAL(D(T), s), the number of words in D(T) whose saliency is over a threshold value s, is expected to be free from the background noise and sensitive to number of major subtopics in D(T). The essential problem now is how to define the saliency of bias of word occurrences and the threshold value of saliency. This paper solves this problem by giving a mathematically sound measure. Furthermore, it is shown that the optimal threshold value can be defined automatically. The newly defined measure SAL(D(T), s) outperforms existing measures in picking out topic-specific words from newspaper articles.</Paragraph>
    <Paragraph position="9">  1. Brief review of term representativeness measures</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
1.1 Conventional measures
</SectionTitle>
      <Paragraph position="0"> Regarding term weighting, various measures of importance or domain specificity of a term have been proposed in NLP and IR domains (Kageura et al. 1996). In his survey, Kageura introduced two aspects of a term: unithood and termhood. Unithood is &amp;quot;the degree of strength or stability of syntagmatic combinations or collocations,&amp;quot; and termhood is &amp;quot;the degree to which a linguistic unit is related to (or more straightforwardly, represents) domain-specific concepts.&amp;quot; Kageura's termhood is therefore what we call representativeness here.</Paragraph>
      <Paragraph position="1"> Representativeness measures were first introduced in the context of determining indexing words for IR (for instance, Salton et al. 1973; Spark-Jones et al. 1973; Nagao et al. 1976). Among a number of measures introduced there, the most commonly used one is tf-idf proposed by Salton et al.</Paragraph>
      <Paragraph position="2"> There are a variety of modifications of tf-idf (for example, Singhal et al. 1996) but all share the basic feature that a word appearing more frequently in fewer documents is assigned a higher value.</Paragraph>
      <Paragraph position="3"> In NLP domains several measures concentrating on the unithood of a word sequence have been proposed. For instance, the mutual information (Church et al. 1990) and log-likelihood ratio (Dunning 1993; Cohen 1995) have been widely used for extracting word bigrams. Some measures for termhood have also been proposed, such as Imp (Nakagawa 2000), C-value and NC-value (Mima et al. 2000).</Paragraph>
      <Paragraph position="4"> Although certain existing measures are widely used, they have major problems as follows: (1) classical measures such as tf-idf are so sensitive to term frequencies that they fail to avoid uninformative words that occur very frequently; (2) measures based on unithood cannot handle single-word terms; and (3) the threshold value for a term to be considered as being representative is difficult to define or can only be defined in an ad hoc manner. It is reported that measures defined by the baseline method do not have these problems (Hisamitsu et al. 2000).</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
1.2 Baseline method
</SectionTitle>
      <Paragraph position="0"> The basic idea of the baseline method stated in introduction can be summarized by the famous quote (Firth 1957) : &amp;quot;You shall know a word by the company it keeps.&amp;quot; This is interpreted as the following hypothesis: For any term T, if the term is representative, word occurrences in D(T), the set of words co-occurring with T, should be biased according to the word distribution in D  .</Paragraph>
      <Paragraph position="1"> This hypothesis is transformed into the following procedure: Given a measure M for the bias of word occurrences in D(T) and a term T, calculate M(D(T)), the value of the measure for D(T). Then compare  of M(D) when D is a randomly chosen document set of size #D(T).</Paragraph>
      <Paragraph position="2"> Here, as stated in introduction, D(T) is considered to be the set of all articles containing term T. Hisamitsu et al. tried a number of measures for M, and found that using Dist(D(T)), the distance between the word distribution P D(T) in D(T) and the word distribution P  in the whole corpus D  is effective in picking out topic-specific words in newspaper articles. The value of Dist(D(T)) can be defined in various ways, and they found that using log-likelihood ratio (see Dunning 1993) worked best which is represented as follows:</Paragraph>
      <Paragraph position="4"> are the frequency of a word w  ))}, where T varies over &amp;quot;cipher&amp;quot;, &amp;quot;do&amp;quot;, and &amp;quot;economy&amp;quot;, and D rand varies over a wide numerical range of randomly sampled articles. This figure shows that Dist(D(&amp;quot;do&amp;quot;)) is smaller than Dist(D(&amp;quot;electronic&amp;quot;)), which reflects our linguistic intuition that words co-occurring with &amp;quot;electronic&amp;quot; are more biased than those with &amp;quot;do&amp;quot;. However, Dist(D(&amp;quot;cipher&amp;quot;)) is smaller than Dist(D(&amp;quot;do&amp;quot;)), which contradicts our linguistic intuition. This is why values of Dist(D(T)) are not directly used to compare the representativeness of terms.</Paragraph>
      <Paragraph position="5"> Figure 1(a) Baseline curve and sample word distribution This phenomenon can be explained by the curve, referred to as the baseline curve, composed of</Paragraph>
      <Paragraph position="7"> )}. The curve indicates that a part of Dist(D(T)) systematically varies depending only on #D(T) and not on T itself. It indicates the very notion of background noise stated in introduction, and by offsetting this part using the baseline function B Dist (#D(T)), which approximates the baseline curve, the graph is converted into that of  frequent terms, such as &amp;quot;do&amp;quot; are treated in a special way: that is, if the number of documents in D(T) is larger than a threshold value N  , which was calculated from the average number of words contained in a document, N  documents are randomly chosen from D(T). This is because the coordinates of the point corresponding to &amp;quot;do&amp;quot; differ in Fig. 1(a) and Fig. 1(b). As stated in introduction, Hisamitsu et al. (2000) reported on that the superiority of NormDist(D(T)), normalized Dist(D(T)), in picking out topic-specific words over various measures including existing ones and other ones developed by using the baseline method. Figure 1(b)</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
1.3 Reconsideration of normalization
</SectionTitle>
      <Paragraph position="0"> The effectiveness of the baseline method's normalization indicates that Dist(D(T)) can be decomposed into two parts, one depending on T itself and another depending only on the size of D(T), which is considered to be background noise. The essence of the baseline method is to make the background noise explicit as a baseline function and to offset the noise by using the baseline function. To put it the other way round, if a term representativeness measure is designed so that this noise part does not exist in the first place, there is no need for the baseline function and calculation of representativeness becomes much simpler. More importantly, the precision of the measure itself should improve.</Paragraph>
      <Paragraph position="1"> The definition of Dist(D(T)) shows, as with other measures, that every word in D(T) contributes to the value of Dist(D(T)). This explains why background noise, B Dist (#D(T)), grows as #D(T) increases. One way to improve this situation is to eliminate the contribution of non-typical (see introduction) words. The simplest way to archive this is to focus only on saliently occurring words (precisely, words whose occurrences are saliently biased in D(T)) and let the number of words whose saliency is over a threshold value s, denoted by SAL(D(T), s), be the degree of bias of word  occurrences in D(T). SAL(D(T), s) should reflect the richness of subtopics in D(T) and should be free from the contribution of non-typical words in D(T). Thus, we need to define the saliency of occurrences of a word and a threshold value with which the occurrences of a word in D(T) is determined as salient.</Paragraph>
      <Paragraph position="2">  2. Term representativeness measure based on the number of co-occurring salient words</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.1 A measure of word occurrence saliency
</SectionTitle>
      <Paragraph position="0"> To define saliency of occurrences of a word w in D(T), we employ a probabilistic measure proposed by Hisamitsu et al. (2001) as follows: Let the total number (occurrences) of words in the whole corpus be N, the number (occurrences) of words in D(T) be n, the frequency of w in the whole corpus be K, and the frequency of w in D(T) be k. Denote the probability of &amp;quot;No less than k red balls are contained in n balls that are arbitrarily chosen from N balls containing K red balls&amp;quot; by hgs(N, K, n, k). Then the saliency of w in D(T) is defined as [?]log(hgs(N, K, n, k))  .</Paragraph>
      <Paragraph position="1"> Note that the probability of &amp;quot;k red balls are contained in n balls arbitrarily chosen from N balls containing K red balls&amp;quot;, which we denote as hg(N, K, n, k), is a hypergeometric distribution with variable k. We denote the value [?]log(hgs(N, K, n, k)) by HGS(w). HGS(w) is expressed as follows:  The reason why HGS(v) should be defined by [?]hgs(N, K, n, k) instead of [?]hg(N, K, n, k) is that the value of [?]hg(N, K, n, k) itself cannot tell whether occurrence of v k-times is saliently frequent or saliently infrequent. Only hgs(N, K, n, l), the sum of hg (N, K, n, l) over l (k[?]l[?]min{n,K}) can tell which is the case since the sum indicates how far the event &amp;quot;v occurs k-times in D(w)&amp;quot; is from the extreme event &amp;quot;v occurs min{n,K} times in D(w)&amp;quot;.</Paragraph>
      <Paragraph position="2"> value of HGS(w)= [?]log(hgs(N, K, n, k)) is always meaningful between any combination of N, K, n, and k. HGS(w) can be calculated very efficiently using an approximation technique (Hisamitsu et al. 2001).</Paragraph>
    </Section>
    <Section position="5" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
2.2 Definition of SAL(D(T), s)
</SectionTitle>
      <Paragraph position="0"> Now we can define SAL(D(T), s) using the saliency measure defined above and a parameter s [?] 0: },)(|)({)),(( swHGSTDwDIFFNUMsTDSAL [?][?]= where DIFFNUM(X) stands for the number of distinct items in set X. That is, SAL(D(T), s) is the number of distinct words in D(T) whose saliency of occurrence is not less than s. For instance, using the 1996 archive of Nihon Keizai Shimbun (a Japanese financial newspaper), SAL(D(&amp;quot;Aum</Paragraph>
      <Paragraph position="2"> is the threshold value stated in subsection 1.2.</Paragraph>
      <Paragraph position="3"> This strongly suggests that SAL(D(T), s) can discriminate topic-specific words from non-topical words.</Paragraph>
    </Section>
    <Section position="6" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
2.3 Optimizing threshold of saliency
</SectionTitle>
      <Paragraph position="0"> Note that SAL(D(T), 0) gives the number of distinct words in D(T), and as s increases to [?], SAL(D(T), s) becomes a constant function (zero). If we straightforwardly follow the baseline method, we have to construct the baseline function B SAL(D(T), s) for varying s and test the performance of NormSAL(D(T), s), the normalized SAL(D(T), s). There are, however, a problem that B</Paragraph>
      <Paragraph position="2"> be precisely approximated because SAL(D(T), s) is a discrete-valued function.</Paragraph>
      <Paragraph position="3"> By considering the meaning of the baseline function, we can solve the problem of determining the optimal value of saliency parameter s without approximating baseline functions. That is, since the baseline function is considered as background noise to be offset, the best situation should be that the baseline function is a constant-valued function while SAL(D(T), s) is a non-trivial function (i.e., not a constant function). If there exists s  be precisely approximated by using analytical functions, it can be seen that B SAL(D(T), s) changes from a monotone increasing function to a monotone decreasing function when s is greater than about 110, and the graph of B SAL(D(T), 110) is roughly parallel to the x-axis. Considering the meaning of baseline functions again, this means that s</Paragraph>
      <Paragraph position="5"> optimal value of saliency and that SAL(D(T), 110) can be used without normalization and is the most effective SAL. The important thing here is that this procedure to find the optimal value of s can be done automatically because it only requires random sampling of documents and curve fitting. Section 3 experimentally confirms the superiority of SAL(D(T),</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>