<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1059">
  <Title>Corpus-dependent Association Thesauri for Information Retrieval</Title>
  <Section position="3" start_page="0" end_page="406" type="metho">
    <SectionTitle>
2 Automatic Generation of an Association Thesaurus from a Corpus
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="404" type="sub_section">
      <SectionTitle>
2.1 Outline of the thesaurus generation method
</SectionTitle>
      <Paragraph position="0"> The proposed thesaurus generation method consists of term extraction, co-occurrence data extraction, and correlation analysis, as shown in Fig. 1. 2.1.1 Term extraction. A thesaurus should consist of terms, each representing a domain-specific concept. Most of the terms representing important concepts are nouns, simple or compound, that frequently occur in the corpus. Therefore, we extract both simple nouns and compound nouns whose occurrence frequencies exceed a predetermined threshold. We also use a list of stop words, since frequently occurring nouns are not always terms.</Paragraph>
      <Paragraph position="1"> Compound nouns are identified by a pattern matching method using a part-of-speech sequence pattern. Naturally, the pattern is language specific. The following is a pattern for Japanese compound nouns:</Paragraph>
      <Paragraph position="3"> A problem in extracting compound nouns is that a word sequence matched to the above pattern, which actually defines just the type of noun phrase, is not always a term. We filter out some kind of non-term noun phrases by using a list of stop words for the first and last elements of compound nouns. Stop words for the first element of compound nouns include referential nouns (e.g.</Paragraph>
      <Paragraph position="4"> jouki (above-mentioned)) and determiner nouns (e.g. kaku (each)). Stop words for the last element of compound nouns include time/place nouns (e.g. nai (inside)) and relational nouns (e.g.</Paragraph>
      <Paragraph position="5"> koyuu (peculiar)).</Paragraph>
      <Paragraph position="6"> Another importmat problem we are confronted with in term extraction is the structural ambiguity of compound nouns. For our purpose, we need to extract non-maximal compound nouns as well sts lnaxinlal comp()und nouns. Here a non-maximal compound noun means one that occurs as a part of a larger conlpound noun, and a nlaximal compound nonn means one thstt occurs not as a part of a larger compound noun. We must disambiguate tile structure of compound nouns to correctly extract non-maximal compound nouns. We have developed a statistical disambiguation method, the detail and evaluation of which are described in 2.2.</Paragraph>
    </Section>
    <Section position="2" start_page="404" end_page="404" type="sub_section">
      <SectionTitle>
2.1.2 Co-occurrence data extraction
</SectionTitle>
      <Paragraph position="0"> Our purpose is to collect pairs of semantically or contextually associated terms, no matter what kind of association. So we extract co-occurrence in a window. That is, every pair of terms occurring together within a window is extracted as the window is moved through a text.</Paragraph>
      <Paragraph position="1"> The window size can be specified rather arbitrarily. Considering our purpose, the window should accommodate a few sentences. At tile same time, the window size should not be too large from the viewpoint of computational load.</Paragraph>
      <Paragraph position="2"> Therefore, 20 to 50 words, excluding function words, seems to be an appropriate value.</Paragraph>
      <Paragraph position="3"> Note that we filter out any pair of words co-occurring within a compound noun. If such pairs were included in co-occurrence data, they would show high correlation. However, they would be redundant because compound nouns are treated as entities in our thesaurus.</Paragraph>
      <Paragraph position="4">  As a correlation measure between terms, we use mutual information (Church and Hanks 1990).</Paragraph>
      <Paragraph position="5"> The mutual inlbrmation between terms t~ and t i is defined by the following formula:</Paragraph>
      <Paragraph position="7"> where f(t~) is the occurrence frequency of term t~, and g(ti,ti ) is the co-occurrence frequency of terms t~ and tj. A rnaxinmm nunrber of associat-.</Paragraph>
      <Paragraph position="8"> ed terms for each term is predetermined as well as a threshold for tile mutual information, and associated terms are selected based on the de-o scending order of mutual information.</Paragraph>
      <Paragraph position="9"> Mutual infornaation involves a problem in that it is overestimated for low-frequency terms (I)unning 1993). Therefore, we determine whether two terms are related to each other by a log-likelihood ratio test, and we filter out pairs of terms that do not pass the test.</Paragraph>
    </Section>
    <Section position="3" start_page="404" end_page="405" type="sub_section">
      <SectionTitle>
2.2 Disambiguation of compound noun structure
</SectionTitle>
      <Paragraph position="0"> 2.2.1 Disambiguation based on corpus statistics. Our disambiguation method is described below for the case of a compound noun consisting of three elements. A compound noun W1W2W3 has two possible structures: W1(W2W3) and (W1W2)W3. We determine its structure based on the occurrence frequencies of maximal compound nouns as follows: If the maximal compound noun W2W3 occurs more frequently than the maximal compound noun W1W2, then the structure W1(W2W3) is preferred. On the contrary, if the maximal compound noun W1W2 occurs more frequently than the maximal compound noun W2W3, then the structure (W1W2)W3 is preferred.</Paragraph>
      <Paragraph position="1"> The generalized disambiguation rnle is as follows: If a compound noun CN includes two compound noun candidates CN~ and CN2, which are incompatible with each other, and the maximal compound noun CN~ occurs more frequently than the maximal compound noun CN&gt; then a structure of CN including CN~ is preferred to a structure of CN including CN,.</Paragraph>
      <Paragraph position="2"> We have two alternatives regarding the range where we count occurrence frequencies of maximal compound nouns. One is global-statistics which means that frequencies are counted in the whole corpus and they are used to disambiguate all compound nouns in the corpus. The other is local-statistics which means that frequencies are counted in each document in the corpus and they are used to disambiguate compound nouns in the corresponding document.</Paragraph>
      <Paragraph position="3">  statistics We evaluated both the global-statistics-based disambiguation and the local-statistics-based disambiguation by using a 23.7-M Byte corpus consisting of 800 patent documents. Table l(a) shows comparative examples of these methods. Evah, ation results for the 200 highest-frequency maximal COlnpound nouns consisting of three or Local-statistics-based more words are summa ~ rized in Table l(b).</Paragraph>
      <Paragraph position="4"> 14,921 words (73.7%) They prove that the local-statistics-based 5,332 words (26.3%) disambiguation method 20,253 words (100%) is superior to the global-statistics-based disambiguation method.</Paragraph>
      <Paragraph position="5"> Note that in the local-statistics-based disambiguation method, we resorted to the global-statistics when local-statistics were not available. The percentage of cases the local-statistics were not available was 25.1 percent.</Paragraph>
      <Paragraph position="6"> (Kobayasi et al. 1994) proposed a disambiguation method using collocation information and semantic categories, and reported that the structure of compound nouns was disambiguated at 83% accuracy. Note that their accuracy was calculated for compound nouns including unambiguous compound nouns, i.e. those consisting of only two words. If it were calculated for com-. pound nouns consisting three or more words, it would be less than that of our method. Thus, we can conclude that our local-statistics-based method compares quite well with rather sophisticated previous methods.</Paragraph>
    </Section>
    <Section position="4" start_page="405" end_page="406" type="sub_section">
      <SectionTitle>
2.3 Prototype and an experiment
</SectionTitle>
      <Paragraph position="0"> We implemented a prototype thesaurus generator in which the local-statistics-based method was used to disambiguate the structure of compound nouns. Using this thesaurus generator, we got a thesaurus consisting of 38,995 terms froln a 61-M Byte corpus consisting of almost 48,000 articles in the financial pages of a Japanese newspaper. In this experiment, the threshold for occurfence frequencies of terms in the term extraction step was set to 10, and the window size in the co-occurrence data extraction step was set to 25.</Paragraph>
      <Paragraph position="1">  The abow+&amp;quot; rtm took 5.4 hours on a HP9000 C200 workstation. The tlarouglaput is tolerable from a practical point of view. We should also note that a thesaurus can be updated as efficiently as it can be initially generated. Because \ve can run the first two steps (extraction of terms and extraction of co-occurrence data) in accumulative fashion, and we only need to run the third step over again when a considerable amount of terms and co-occurrence data are accunmlated.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="406" end_page="409" type="metho">
    <SectionTitle>
3 Navigation in an Association Thesaurus
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="406" end_page="406" type="sub_section">
      <SectionTitle>
3.1 Purpose and outline of the proposed thesaurus navigator
</SectionTitle>
      <Paragraph position="0"> A big problem with today's information retrieval systems based on search techniques is that they require users, who may not know exactly what they are looking for, to explicitly describe their information needs. Another problem is that mismatched vocabularies between users and the corpus can lead to poor retrieval results.</Paragraph>
      <Paragraph position="1"> To solve these problems, we propose a corptts-~ dependent association-thesaurus navigator enabling users to efficiently explore information through a corpus.</Paragraph>
      <Paragraph position="2"> Users' requirements are summarized as fob lows: They want to grasp the overall information structure of a domain.</Paragraph>
      <Paragraph position="3"> They want to know what topics or subdeg.</Paragraph>
      <Paragraph position="4"> domains arc contained in the corpus.</Paragraph>
      <Paragraph position="5"> - They want to know terms that appropriately describe their w~gue information needs.</Paragraph>
      <Paragraph position="6"> To meet the above requirements, our pro,posed thesaurus navigator has novel functions such as clustering of related terms, generation of a thesaurus overview, and zoom-in on a sub-domain of interest. A conceptual image of thesaurus navigation using these ftmctions is shown in Fig. 2. A typical informatio,&gt; exploration session proceeds as follows.</Paragraph>
      <Paragraph position="7"> At the beginning, the system displays an overview of a corpus-dependent thesaurus so that users can easily enter the information space of the corpus. The overview is a kind of summary of the corpus, it consists of clusters of generic terms of the domain, and makes it easy to understand what topics or sub-domains are contained in the corpus. Looking at the thesaurus over- null view, the users can select one or a few term clusters they have interest in, and the screen will zoom in on the cluster(s). The zoomed view consists of a nulnber of clusters, each including more specific terms than those in the overview. Users can repeat this zoom-in operation until they reach term clusters representing sufficiently specific topics.</Paragraph>
    </Section>
    <Section position="2" start_page="406" end_page="407" type="sub_section">
      <SectionTitle>
3.2 Functions of the thesaurus navigator
3.2.1 Clustering of related terms
</SectionTitle>
      <Paragraph position="0"> We made a preliminary experiment to evaluate standard agglomerative clustering algorithms including the single-linkage method, the con&gt; pletedinkage method, and the group-average-linkage method (Eldqamdouchi and Willett 1989)~ Among them, the group--average-linkage method resulted in the best results, ttowever, several potential clusters tended to merge into a large one when we repeated the merge operation until a predetermined number of clusters were obtained.</Paragraph>
      <Paragraph position="1"> Accordingly, we use the group-average-linkage method with an upper limit on the size of a cluster. null  Our method tot generating a thesaurus overview consists of major-term extraction and term clustering. The m~0or-term extracting algorithm, which is carried out beforehand in batch mode, is described below. See 3.2.1 for the term clustering algorithnl.</Paragraph>
      <Paragraph position="2"> An overview of the thesaurus should consist of generic terms included in the corpus, flowever, we do not have a definite criterion for get&gt; eric terms. So we collect m~oor terms from the corpus as follows. The number of m~!jor terms,  denoted by M below, was set to 300 in the prototype. null i) Determine a characteristic term set for each doculnent.</Paragraph>
      <Paragraph position="3"> Calculate the weight w~j of term tj for document 4 according to the tf-idf (term frequency - inverse document frequency) formula. Then select the first re(i) terms in the descending order of u, u for each document d,, where re(i), the number of characteristic terms for document 4, is set to 20% of the total number of distinct terms in 4. It is also limited to between 5 and 50.</Paragraph>
      <Paragraph position="4"> ii) Select major terms in the corpus.</Paragraph>
      <Paragraph position="5"> Select the first M terms in the descending order of the frequency of being contained in the characteristic term setsdeg 3.2.3 Zoom-in on a term cluster of interest Our method for zooming in on a term cluster consists of term-set expansion and term cluster~ ing. The term-set expanding algorithm is de~ scribed below. See 3.2.1 for the term clustering algorithm.</Paragraph>
      <Paragraph position="6"> A user-specified term set To = {t~, 6 ..... t,,,} is expanded into a term set T,. consisting of M terms as follows. M was set to 300 in the prototype. i) Set the initial value of 7&amp;quot;,. to 7&amp;quot;,,. ii) While IT,.I&lt; M for i = 1, 2 .... do; While IE, I &lt; Mforj = 1,2, ..., m do; Add the tenn having the i-th highest correlation with tj to T,,; end; end; The reason why the above-described procedure implements the zoom-in is that generic terms tend to have higher correlation with semigeneric terms than with specific terms. Assuming that high-frequency terms are generic and low-frequency terms are specific, we examined the distribution of terms by the distance from the major terms and the average occurrence frequency of terms for each distance. Here the distance is the length of the shortest path in a graph that is obtained by connecting every pair of associated terms with an edge. Table 2 shows the results for the example thesaurus mentioned in 2.3.</Paragraph>
      <Paragraph position="7"> According to it, the average occurrence frequency decreases with the distance from the major terms. Therefore, starting from an overview, our method is likely to produce more and more specific views.</Paragraph>
    </Section>
    <Section position="3" start_page="407" end_page="409" type="sub_section">
      <SectionTitle>
3.3 Prototype and an experiment
</SectionTitle>
      <Paragraph position="0"> We developed a prototype as a client/server system. The thesaurus navigator is available on WWW browsers. It also has an interface to text-retrieval engines, through which a term cluster is transferred as a query.</Paragraph>
      <Paragraph position="1"> Test use was made with the example thesaurus mentioned in 2.3. The response time for the zoom-in operation during the navigation sessions was about 8 seconds. This is acceptable given the rich information provided by the clustered view. Note that the response time is almost independent of the size of the thesaurus or corpus, because the number of temls to be clustered is always constant, as described in 3.2.2 and 3.2.3~ An example from navigation sessions is shown in Fig. 3. It demonstrates the useflflness of the corpus-dependent thesaurus navigation as a front-end for text retrieval. The effectiveness of our thesaurus navigator is summarized as folo lows.</Paragraph>
      <Paragraph position="2"> - Improved accessibility to text retrieval systems: Users are not necessarily required to input terms to describe their information need.</Paragraph>
      <Paragraph position="3"> They need only select from among terms pre~ sented on the screen. This makes text re-.</Paragraph>
      <Paragraph position="4"> trieval systems accessible even for those having vague information needs, or those unfamiliar with the domain.</Paragraph>
      <Paragraph position="5"> - Improved navigation efficiency: The unit of users' cognition is a topic rather than a term. That is, they can recognize a topic from a cluster of terms at a glance.</Paragraph>
      <Paragraph position="6"> Therefi)re, they can efficiently navigate through an information space.</Paragraph>
      <Paragraph position="7">  An overview of the thesaurus was di.sTJko,ed.</Paragraph>
      <Paragraph position="8"> Then the user selected the.fi/ih and seventh cluste#w which he was interested in: {China, col!\[L'rence, Asia, cooperation, meeting, Vietnam, region, development, technology, environment}, and/economy export, aid, toward, summit, Soviet Union, debt, Russia, reconstruction}. This means that the user was interested in &amp;quot;development assistance to developing countries or areas &amp;quot;.</Paragraph>
      <Paragraph position="9"> The Jil?h aml seventh cluste~w J)'om (a) were shown close up, and clustelw indicating more .Vwc~/ic domains were presented.</Paragraph>
      <Paragraph position="10"> The user couM undelwtand which topics the respective clustelw suggested: &amp;quot;Economic assis'tance for the development of the Asia-Pat(lie region &amp;quot;, &amp;quot;Global environmental problems &amp;quot;, &amp;quot;bTternational debt problems &amp;quot;, &amp;quot;'Mattetw on China &amp;quot;, &amp;quot;Energy resource development&amp;quot;, and so on. Since he wax e.vwcially interested in &amp;quot;International debt problems &amp;quot;, he selected the third cluster {debt, Egypt, Paris Club, creditor nation, q\[licial debt, de/'erred, Poland, .fin'eign debt, reimbursement, Paris, club, pro,merit, .fi, reign}.</Paragraph>
      <Paragraph position="11"> The third cluster fi'om (I 0 was shown close up. The resulting screen gave the user a choice el'many sT~ecific terms relevant to &amp;quot;htternational debt problems &amp;quot;, although not all oJ'the clustel:v indicated spec!/ic topics. The user was able to retrieve documents by simply selectinPS terms o/ interest./i'om those displayed on the screen.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="409" end_page="409" type="metho">
    <SectionTitle>
4 Comparison with related work
</SectionTitle>
    <Paragraph position="0"> Let us make a briefcolnparison with related work.</Paragraph>
    <Paragraph position="1"> Both scatter/gather document clustering (Cutting et al. 1992; Hearst and Pedersen 1996) and Kohonen's self-organizing map (Lin et al. 1991; Lagus et al. 1996; Kohonen 1998) enable exploration through a corpus. While they treat a corpus as a collection of documents, we treat it as a collection of terms. Therefore our method can elicit finer information structure than these previous methods, and moreover, it can be applied to a corpus that includes multi-topic doculnents.</Paragraph>
    <Paragraph position="2"> Our method compares quite well with the previotis methods for throughput and response time.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML