<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1047"> <Title>A Method of Measuring Term Representativeness - Baseline Method Using Co-occurrence Distribution -</Title> <Section position="3" start_page="320" end_page="325" type="intro"> <SectionTitle> 2. Existing measures of representativeness </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="320" end_page="320" type="sub_section"> <SectionTitle> 2.1 Overview </SectionTitle> <Paragraph position="0"> Various methods for measuring the informativeness or domain specificity of a word have been proposed in the domains of IR and term extraction in NLP (see the survey paper by Kageura 1996). In characterizing a term, Kageura introduced the concepts of &quot;unithood&quot; and &quot;termhood&quot;: unithood is &quot;the degree of strength or stability of syntagmatic combinations or collocations,&quot; and termhood is &quot;the degree to which a linguistic unit is related to (or more straightforwardly, represents) domain-specific concepts.&quot; Kageura's termhood is therefore what we call representativeness here.</Paragraph> <Paragraph position="1"> Representativeness measures were first introduced in the IR domain for determining indexing words. The simplest measure is calculated from only word frequency within a document. For example, the weight w_ij of word w_i in document d_j is defined by</Paragraph> <Paragraph position="3"> where f_ij is the frequency of word w_i in document d_j (Sparck-Jones 1973, Noreault et al. 1977). More elaborate measures for termhood combine word frequency within a document and word occurrence over a whole corpus. For instance, tf-idf, the most commonly used measure, was originally defined as tf-idf_i = f_i x log(N_total / N_i), where N_i and N_total are, respectively, the number of documents containing word w_i and the total number of documents (Salton et al. 1973). 
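The tf-idf weighting just defined can be sketched in a few lines of Python. The toy corpus, word choices, and function name below are our own illustration, not data or code from the paper:

```python
import math

# Toy corpus: each document is a list of words (hypothetical data).
docs = [
    "stock price rise market".split(),
    "stock market fall".split(),
    "weather rain".split(),
]

def tf_idf(word, doc_index):
    """tf-idf(w) = f * log(N_total / N_w), following the definition above."""
    f = docs[doc_index].count(word)            # frequency of w in the document
    n_total = len(docs)                        # total number of documents
    n_w = sum(1 for d in docs if word in d)    # documents containing w
    return f * math.log(n_total / n_w)

# A word occurring in fewer documents receives a higher weight.
print(tf_idf("weather", 2), tf_idf("stock", 0))
```

As the text notes, a word appearing more frequently in fewer documents is assigned a higher value: here "weather" (one document) outweighs "stock" (two documents).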
There are a variety of definitions of tf-idf, but its basic feature is that a word appearing more frequently in fewer documents is assigned a higher value. If documents are categorized beforehand, we can use a more sophisticated measure based on the chi-square test of the hypothesis that an occurrence of the target word is independent of categories (Nagao et al. 1976).</Paragraph> <Paragraph position="4"> Research on automatic term extraction in NLP domains has led to several measures for weighting terms mainly by considering the unithood of a word sequence. For instance, mutual information (Church et al. 1990) and the log-likelihood (Dunning 1993) methods for extracting word bigrams have been widely used.</Paragraph> <Paragraph position="5"> Other measures for calculating the unithood of n-grams have also been proposed (Frantzi et al.</Paragraph> <Paragraph position="6"> 1996, Nakagawa et al. 1998, Kita et al. 1994).</Paragraph> </Section> <Section position="2" start_page="320" end_page="320" type="sub_section"> <SectionTitle> 2.2 Problems </SectionTitle> <Paragraph position="0"> Existing measures suffer from at least one of the following problems: (1) Classical measures such as tf-idf are so sensitive to term frequencies that they fail to avoid very frequent non-informative words.</Paragraph> <Paragraph position="1"> (2) Methods using cross-category word distributions (such as the chi-square method) can be applied only if documents in a corpus are categorized. (3) Most measures in NLP domains cannot treat single-word terms because they use the unithood strength of multiple words.</Paragraph> <Paragraph position="2"> (4) The threshold value for being representative is defined in an ad hoc manner.</Paragraph> <Paragraph position="4"> The scheme that we describe here defines measures that are free of these problems. 3. 
Baseline method for defining representativeness measures</Paragraph> </Section> <Section position="3" start_page="320" end_page="321" type="sub_section"> <SectionTitle> 3.1 Basic idea </SectionTitle> <Paragraph position="0"> This subsection describes the method we developed for defining a measure of term representativeness.</Paragraph> <Paragraph position="1"> Our basic idea is summarized by the famous quote (Firth 1957): &quot;You shall know a word by the company it keeps.&quot; We interpreted this as the following working hypothesis: For any term T, if the term is representative, D(T), the set of all documents containing T, should have some characteristic property compared to the &quot;average&quot;. To apply this hypothesis, we need to specify a measure to obtain some &quot;property&quot; of a document set and the concept of &quot;average&quot;. Thus, we converted this hypothesis into the following procedure: Choose a measure M characterizing a document set. For term T, calculate M(D(T)), the value of the measure for D(T). Then compare M(D(T)) with B_M(#D(T)), where #D(T) is the number of words contained in D(T), and B_M estimates the value of M(D) when D is a randomly chosen document set of size #D(T).</Paragraph> <Paragraph position="2"> Here, M measures the property and B_M estimates the average. The size of a document set is defined as the number of words it contains.</Paragraph> <Paragraph position="3"> We tried two measures as M. One was the number of different words (referred to here as DIFFNUM) appearing in a document set. Teramoto conducted an experiment with a small corpus and reported that DIFFNUM was useful for picking out important words (Teramoto et al. 1999) under the hypothesis that the number of different words co-occurring with a topical (representative) word is smaller than that with a generic word. The other measure was the distance between the word distribution in D(T) and the word distribution in the whole corpus D_0. 
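The ingredients of the procedure above — D(T), the size #D(T), and the measure M = DIFFNUM — can be sketched as follows. The corpus and words are invented for illustration and are not the paper's data:

```python
# Sketch of the basic procedure with M = DIFFNUM (the number of different
# words in a document set). Toy documents, each a list of words.
docs = [
    "economy market stock bank rate".split(),
    "economy trade export bank".split(),
    "do go come take see make".split(),
    "do rain weather sun".split(),
]

def D(term):
    """D(T): the set (here, a list) of all documents containing term T."""
    return [d for d in docs if term in d]

def size(doc_set):
    """#D: the size of a document set = the number of words it contains."""
    return sum(len(d) for d in doc_set)

def diffnum(doc_set):
    """M = DIFFNUM: the number of different words in the document set."""
    vocab = set()
    for d in doc_set:
        vocab.update(d)
    return len(vocab)

# Under Teramoto's hypothesis, words co-occurring with a topical word are
# less varied than words co-occurring with a generic word.
print(diffnum(D("economy")), diffnum(D("do")))  # 7 9
```

Comparing diffnum(D(T)) against a baseline for document sets of the same size, as the procedure prescribes, is what the following subsections make precise.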
The distance between the two distributions can be measured in various ways; we used the log-likelihood ratio as in Hisamitsu et al. 1999, and denote this measure as LLR. Figure 2 plots the points (#D, M(D)) when M is DIFFNUM or LLR, where D varies over sets of randomly selected documents of various sizes from the articles in Nikkei-Shinbun 1996.</Paragraph> <Paragraph position="4"> For measure M, we define Rep(T, M), the representativeness of T, by normalizing M(D(T)) by B_M(#D(T)). The next subsection describes the construction of B_M and the normalization.</Paragraph> </Section> <Section position="4" start_page="321" end_page="322" type="sub_section"> <SectionTitle> 3.2 Baseline function and normalization </SectionTitle> <Paragraph position="0"> Using the case of LLR as an example, this subsection explains why normalization is necessary and describes the construction of a baseline function.</Paragraph> <Paragraph position="1"> Figure 3 superimposes the coordinates (#D(T), LLR(D(T))) for several sample words over the curve of LLR values for randomly chosen document sets.</Paragraph> <Paragraph position="2"> The sample words are (year), (month), (read), (one), (do), and (economy). Figure 3 shows that, for example, LLR(D(do)) is smaller than LLR(D(economy)), which reflects our linguistic intuition that words co-occurring with &quot;economy&quot; are more biased than those co-occurring with &quot;do&quot;. However, the LLR values of some representative words are smaller than those of highly frequent generic words. This contradicts our linguistic intuition, and is why values of LLR cannot be directly used to compare the representativeness of terms. This phenomenon arises because LLR(D(T)) generally increases as #D(T) increases. We therefore need some form of normalization to offset this underlying tendency. We used a baseline function to normalize the values. In this case, B_LLR(.) was designed so that it approximates the curve in Fig. 3. 
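A minimal sketch of such a baseline construction and of the normalization is given below. For simplicity it fits a single log-log line rather than the paper's part-wise approximation, and the (#D, LLR(D)) sample points are synthetic stand-ins for measured values:

```python
import math

# Synthetic (#D, LLR(D)) points standing in for random-sampling measurements;
# an exact power law here, so the log-log fit recovers it precisely.
points = [(n, 0.002 * n ** 1.02) for n in range(1000, 20000, 500)]

# Ordinary least squares on (log(#D), log(LLR(D))).
xs = [math.log(n) for n, _ in points]
ys = [math.log(v) for _, v in points]
m = len(xs)
b = (m * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / \
    (m * sum(x * x for x in xs) - sum(xs) ** 2)
a = (sum(ys) - b * sum(xs)) / m       # log(B(x)) = a + b * log(x)

def B(num_words):
    """Baseline B_LLR: expected LLR for a random set of this many words."""
    return math.exp(a + b * math.log(num_words))

def rep(llr_value, num_words):
    """Rep(T, LLR) = 100 * (log(LLR(D(T))) / log(B_LLR(#D(T))) - 1)."""
    return 100 * (math.log(llr_value) / math.log(B(num_words)) - 1)

# A term whose LLR exceeds the baseline for its size scores positive;
# a random document set scores about zero.
print(rep(60.0, 12000), rep(B(12000), 12000))
```

The normalization makes scores comparable across document-set sizes, which is exactly what raw LLR values fail to provide.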
From the definition of the distance, it is obvious that B_LLR(0) = B_LLR(#D_0) = 0. In the limit as #D(T) grows, B_LLR(.) becomes a monotonically increasing function.</Paragraph> <Paragraph position="3"> The curve could be approximated precisely through logarithmic linear approximation near (0, 0). To make an approximation, up to 300 documents are randomly sampled at a time. (Let each randomly chosen document set be denoted by D. The number of sampled documents is increased from one to 300, repeating each number up to five times.) Each (#D, LLR(D)) is converted to (log(#D), log(LLR(D))).</Paragraph> <Paragraph position="4"> The curve formulated by the (log(#D), log(LLR(D))) values, which is very close to a straight line, is further divided into multiple parts and is part-wise approximated by a linear function. For instance, in the interval I = {x | 10,000 ≤ x &lt; 15,000}, log(LLR(D)) could be approximated by 1.103 + 1.023 x log(#D) with R^2 = 0.996.</Paragraph> <Paragraph position="5"> For LLR, we define Rep(T, LLR), the representativeness of T, by normalizing LLR(D(T)) by B_LLR(#D(T)) as follows:</Paragraph> <Paragraph position="7"> For instance, when we used Nihon Keizai Shimbun 1996, the average of 100 x (log(LLR(D)) / log(B_LLR(#D)) - 1), Avr, was -0.00423 and the standard deviation, σ, was about 0.465 when D varies over randomly selected document sets. Every observed value fell within Avr ± 4σ and 99% of observed values fell within Avr ± 3σ. This happened in all corpora (7 corpora) we tested. Therefore, we can define the threshold of being representative as</Paragraph> <Paragraph position="9"> Baseline and sample word distribution</Paragraph> </Section> <Section position="5" start_page="322" end_page="322" type="sub_section"> <SectionTitle> 3.3 Treatment of very frequent terms </SectionTitle> <Paragraph position="0"> So far we have been unable to treat extremely frequent terms, such as (do). 
We therefore used random sampling to calculate the Rep(T, LLR) of a very frequent term T. If the number of documents in D(T) is larger than a threshold value N, which was calculated from the average number of words contained in a document, N documents are randomly chosen from D(T) (we used N = 150). This subset is denoted D(T) and Rep(T, LLR) is defined by 100 x (log(LLR(D(T))) / log(B_LLR(#D(T))) - 1).</Paragraph> <Paragraph position="1"> This is effective because we can use a well-approximated part of the baseline curve; it also reduces the amount of calculation required.</Paragraph> <Paragraph position="2"> Rep(T, M) has the following advantages by virtue of its definition: (1) Its definition is mathematically clear.</Paragraph> <Paragraph position="3"> (2) It can compare high-frequency terms with low-frequency terms.</Paragraph> <Paragraph position="4"> (3) The threshold value of being representative can be defined systematically.</Paragraph> <Paragraph position="5"> (4) It can be applied to n-gram terms for any n. 4. Experiments</Paragraph> </Section> <Section position="6" start_page="322" end_page="324" type="sub_section"> <SectionTitle> 4.1 Evaluation of monograms </SectionTitle> <Paragraph position="0"> Taking topic-word selection for a navigation window for IR (see Fig. 
1) into account, we examined the relation between the value of Rep(T, M) and a manual classification of words (monograms) extracted from 158,000 articles (excluding special-styled non-sentential articles such as company-personnel-affair articles) in the 1996 issues of the Nikkei Shinbun.</Paragraph> <Paragraph position="1"> We randomly chose 20,000 words from 86,000 words having document frequencies larger than 2, then randomly chose 2,000 of them and classified these into three groups: class a (acceptable) words useful for the navigation window, class d (delete) words not useful for the navigation window, and class u (uncertain) words whose usefulness in the navigation window was either neutral or difficult to judge. In the classification process, a judge used the DualNAVI system and examined the informativeness of each word as guidance. Classification into class d words was done conservatively because the consequences of removing informative words from the window are more serious than those of allowing useless words to appear.</Paragraph> <Paragraph position="2"> Table 1 shows part of the classification of the 2,000 words. Words marked &quot;p&quot; are proper nouns. The difference between proper nouns in class a and proper nouns in other classes is that the former are well-known. Most words classified as &quot;d&quot; are very common verbs (such as (do) and (have)), adverbs, demonstrative pronouns, conjunctions, and numbers. It is therefore impossible to define a stop-word list by only using parts-of-speech, because almost all parts of speech appear in class d words.</Paragraph> <Paragraph position="3"> To evaluate the effectiveness of several measures, we compared the ability of each measure to gather (avoid) representative (non-representative) terms. 
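The accumulated-count comparison used below can be sketched as follows; the words, scores, and class labels here are invented toy data, not the paper's:

```python
# For one sorting criterion, count how many words of a given class appear
# among the first N sorted words, for every N (an accumulated-count curve).
words = ["w1", "w2", "w3", "w4", "w5", "w6"]
label = {"w1": "a", "w2": "d", "w3": "a", "w4": "u", "w5": "a", "w6": "d"}
score = {"w1": 5.0, "w2": 4.5, "w3": 3.9, "w4": 2.0, "w5": 1.5, "w6": 0.1}

def accumulated_counts(cls):
    """Accumulated number of class-cls words among the first N ranked words."""
    ranked = sorted(words, key=lambda w: score[w], reverse=True)
    counts, seen = [], 0
    for w in ranked:
        seen += 1 if label[w] == cls else 0
        counts.append(seen)
    return counts

# A good measure pushes class-"a" counts up early and class-"d" counts late.
print(accumulated_counts("a"))  # [1, 1, 2, 2, 3, 3]
print(accumulated_counts("d"))  # [0, 1, 1, 1, 1, 2]
```

Plotting such curves for each criterion gives figures of the kind discussed next.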
We randomly sorted the 20,000 words and then compared the results with the results of sorting by other criteria: Rep(., LLR), Rep(., DIFFNUM), tf (term frequency), and tf-idf. The comparison was done by using the accumulated number of words marked with a specified class that appeared in the first N (1 ≤ N ≤ 2,000) words. The definition we used for tf-idf was tf-idf(T) = TF(T) x log(N_total / N(T)), where T is a term, TF(T) is the term frequency of T, N_total is the total number of documents, and N(T) is the number of documents that contain T. Figure 4 compares, for all the sorting criteria, the accumulated number of words marked &quot;a&quot;. The total number of class a words was 911. Rep(., LLR) clearly outperformed the other measures. Although Rep(., DIFFNUM) outperformed tf and tf-idf up to about the first 9,000 monograms, it otherwise under-performed them. If we use the threshold value of Rep(., LLR), from the first word to the 1,511th word is considered representative. In this case, the recall and precision of the 1,511 words against all class a words were 85% and 50%, respectively.</Paragraph> <Paragraph position="4"> When using tf-idf, the recall and precision of the first 1,511 words against all class a words were 79% and 47%, respectively (note, though, that tf-idf does not have a clear threshold value).</Paragraph> <Paragraph position="5"> Although the degree of out-performance by Rep(., LLR) may not seem large, this is a promising result because it has been pointed out that, in the related domain of term extraction, existing measures hardly outperform even the use of frequency (for example, Daille et al. 1994, Caraballo et al. 1999) when this type of comparison based on accumulated numbers is used.</Paragraph> <Paragraph position="6"> Figure 5 compares, for all the sorting criteria, the accumulated number of words marked &quot;d&quot; (454 in total); in this case, a smaller number of words is better. 
The difference is far clearer in this case: Rep(., LLR) obviously outperformed the other measures. In contrast, tf-idf and frequency barely outperformed random sorting. Rep(., DIFFNUM) outperformed tf and tf-idf until about the first 3,000 monograms, but under-performed them otherwise.</Paragraph> <Paragraph position="7"> Figure 6 compares, for all the sorting criteria, the accumulated number of words marked ap (acceptable proper nouns, 216 in total). Comparing this figure with Fig. 4, we see that the out-performance of Rep(., LLR) is more pronounced.</Paragraph> <Paragraph position="8"> Also, Rep(., DIFFNUM) globally outperformed tf and tf-idf, while the performance of tf and tf-idf was nearly the same as or even worse than that of random sorting.</Paragraph> <Paragraph position="10"> In the experiments, proper nouns generally have a high Rep-value, and some have particularly high scores. Proper nouns having particularly high scores are, for instance, the names of sumo wrestlers or horses. This is because they appear in articles with special formats such as sports reports.</Paragraph> <Paragraph position="11"> We attribute the difference in performance between Rep(., LLR) and Rep(., DIFFNUM) to the quantity of information used. Obviously, information on the distribution of words in a document is more comprehensive than information on the number of different words. This encourages us to try other measures of document properties that incorporate even more precise information.</Paragraph> </Section> <Section position="7" start_page="324" end_page="324" type="sub_section"> <SectionTitle> 4.2 Picking out frequent non-representative monograms </SectionTitle> <Paragraph position="0"> When we concentrate on the most frequent terms, Rep(., DIFFNUM) outperformed Rep(., LLR) in the following sense. 
We marked &quot;clearly non-representative terms&quot; in the 2,000 most frequent monograms, then counted the number of marked terms that were assigned Rep-values smaller than the threshold value of a specified representativeness measure.</Paragraph> <Paragraph position="1"> The total number of checked terms was 563, and 409 of them were identified as non-representative by Rep(., LLR). On the other hand, Rep(., DIFFNUM) identified 453 terms as non-representative.</Paragraph> </Section> <Section position="8" start_page="324" end_page="324" type="sub_section"> <SectionTitle> 4.3 Rank correlation between measures </SectionTitle> <Paragraph position="0"> We investigated the rank correlation of the sorting results for the 20,000 terms used in the experiments described in subsection 4.1. Rank correlation was measured by Spearman's method and Kendall's method (see Appendix) using 2,000 terms randomly selected from the 20,000 terms. Table 2 shows the correlation between Rep(., LLR) and other measures.</Paragraph> <Paragraph position="1"> It is interesting that the ranking by Rep(., LLR) and that by Rep(., DIFFNUM) had a very low correlation, even lower than with tf or tf-idf. This indicates that a combination of Rep(., LLR) and Rep(., DIFFNUM) should provide a strong discriminative ability in term classification; this possibility deserves further investigation.</Paragraph> </Section> <Section position="9" start_page="324" end_page="325" type="sub_section"> <SectionTitle> 4.4 Portability of baseline functions </SectionTitle> <Paragraph position="0"> We examined the robustness of the baseline functions; that is, whether a baseline function defined from one corpus can be used for normalization in a different corpus. This was investigated by using Rep(., LLR) with seven different corpora. Seven baseline functions were defined from the seven corpora and then used for normalization when defining Rep(., LLR) in the corpus used in the experiments described in subsection 4.1. 
The performance of the Rep(., LLR)s defined using the different baseline functions was compared in the same way as in subsection 4.1. The seven corpora used to construct baseline functions were as follows: NK96-ORG: the 158,000 articles used in the experiments in 4.1; NK96-50000: 50,000 randomly selected articles from the whole corpus NK96 (206,803 articles of Nikkei-Shinbun 1996); NK96-100000: 100,000 randomly selected articles from NK96; NK96-200000: 200,000 randomly selected articles from NK96; NK98-158000: 158,000 randomly selected articles from Nikkei-Shinbun 1998; NC-158000: 158,000 randomly selected abstracts of academic papers from the NACSIS corpus (Kando et al. 1999); NC-ALL: all abstracts (333,003 abstracts) in the NACSIS corpus. Statistics on their content words are shown in Table 3.</Paragraph> <Paragraph position="1"> Figure 7 shows the accumulated number of words marked &quot;a&quot; (see subsection 4.1). The performance decreased only slightly when the baseline defined from NC-ALL was used. In the other cases, the differences were so small that they are almost invisible in Fig. 7. The same results were obtained when using class d words and class ap words.</Paragraph> <Paragraph position="2"> We also examined the rank correlations between the rankings that resulted from each representativeness measure, in the same way as described in subsection 4.3 (see Table 4). They were close to 100% except for some combinations under Kendall's method. These results suggest that a baseline function constructed from one corpus can be used to rank terms in considerably different corpora. This is particularly useful when we are dealing with a corpus similar to a known corpus but do not know the precise word distributions in the corpus. The same kind of robustness was observed when we used Rep(., DIFFNUM). 
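The rank-correlation computations used in subsections 4.3 and 4.4 can be sketched as follows. These are simplified, tie-free forms of Spearman's and Kendall's coefficients, applied to toy score lists rather than the paper's rankings:

```python
def spearman(xs, ys):
    """Spearman's rho for two score lists without ties (simplified form)."""
    n = len(xs)
    rx = {v: i for i, v in enumerate(sorted(xs))}   # rank of each x-score
    ry = {v: i for i, v in enumerate(sorted(ys))}   # rank of each y-score
    d2 = sum((rx[x] - ry[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def kendall(xs, ys):
    """Kendall's tau without ties: (concordant - discordant) over all pairs."""
    n = len(xs)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            agree = (xs[i] - xs[j]) * (ys[i] - ys[j])
            s += 1 if agree > 0 else -1
    return 2 * s / (n * (n - 1))

# Two rankings that differ only in the last two positions correlate highly.
print(spearman([1, 2, 3, 4], [1, 2, 4, 3]))  # 0.8
print(kendall([1, 2, 3, 4], [1, 2, 4, 3]))
```

Spearman's method compares rank differences directly, while Kendall's method counts pairwise order agreements; the two can diverge, which is consistent with the Kendall exceptions noted above.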
This robustness of the baseline functions is an important feature of measures defined using the baseline method.</Paragraph> </Section> </Section> </Paper>