<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-2007">
  <Title>A Novel Approach to Semantic Indexing Based on Concept</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Semantic Indexing Based on Concept
</SectionTitle>
    <Paragraph position="0"> Current approaches to index weighting for information retrieval are based on the statistic method. We propose an approach that changes the basic index term weighting method by considering semantics and concepts of a document. In this approach, the concepts of a document are understood, and the semantic indexes and their weights are derived from those concepts.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 System Overview
</SectionTitle>
      <Paragraph position="0"> We have developed a system that performs the index term weighting semantically based on concept vector space. A schematic overview of the proposed system is as follows: A document is regarded as a complex concept that consists of various concepts; it is recognized as a vector in concept vector space. Then, each concept was extracted by lexical chains(Morris, 1988 and 1991). Extracted concepts and lexical items were scored at the time of constructing lexical chains. Each scored chain was represented as a concept vector in concept vector space, and the overall text vector was made up of those concept vectors. The semantic importance of concepts and words was normalized according to the overall text vector. Indexes that include their semantic weight are then extracted.</Paragraph>
      <Paragraph position="1"> The proposed system has four main components:  + Lexical chains construction + Chains and nouns weighting + Term reweighting based on concept + Semantic index term extraction  The former two components are based on concept extraction using lexical chains, and the latter two components are related with the index term extraction based on the concept vector space, which will be explained in the next section.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Lexical Chains and Concept Vector Space
Model
</SectionTitle>
      <Paragraph position="0"> Lexical chains are employed to link related lexical items in a document, and to represent the lexical cohesion structure in a document(Morris, 1991). In accordance with the accepted view in linguistic works that lexical chains provide representation of discourse structures(Morris, 1988 and 1991), we assume that</Paragraph>
      <Paragraph position="2"> each lexical chain is regarded as a concept that expresses the meaning of a document. Therefore, each concept was extracted by lexical chains.</Paragraph>
      <Paragraph position="3"> For example, Figure 1 shows a sample text composed of five chains. Since we can not deal all the concept of a document, we discriminate representative chains from lexical chains. Representative chains are chains delegated to represent a representative concept of a document. A concept of the sample text is mainly composed of representative chains, such as chain 1, chain 2, and chain 3. Each chain represents each different representative concept: for example man, machine and anesthetic.</Paragraph>
      <Paragraph position="4"> As seen in Figure 1, a document consists of various concepts. These concepts represent the semantic content of a document, and their composition generates a complex composition. Therefore we suggest the concept space model where a document is represented by a complex of concepts. In the concept space model, lexical items are discriminated by the interpretation of concepts and words that constitute a document.</Paragraph>
      <Paragraph position="5"> Definition 1 (Concept Vector Space Model) Concept space is an n-dimensional space composed of n-concept axes. Each concept axis represents one concept, and has a magnitude of Ci. In concept space, a document T is represented by the sum of n-dimensional concept vectors, ~Ci.</Paragraph>
      <Paragraph position="7"> Although each concept that constitutes the overall text is different, concept similarity may vary. In this paper, however, we assume that concepts are mutually independent without consideration of their similarity.</Paragraph>
      <Paragraph position="8"> Figure 2 shows the concept space version of the sample text.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Concept Extraction Using Lexical Chains
</SectionTitle>
      <Paragraph position="0"> Lexical chains are employed for concept extraction.</Paragraph>
      <Paragraph position="1"> Lexical chains are formed using WordNet and asso-</Paragraph>
      <Paragraph position="3"> ciated relations among words. Chains have four relations: synonym, hypernyms, hyponym, meronym.</Paragraph>
      <Paragraph position="4"> The definitions on the score of each noun and chain are written as definition 2 and definition 3.</Paragraph>
      <Paragraph position="5"> Definition 2 (Score of Noun) Let NRkNi denotes the number of relations that noun Ni has with relation k.</Paragraph>
      <Paragraph position="6"> SRkNi represents the weight of relation k. Then the score SNOUN(Ni) of a noun Ni in a lexical chain is defined as:</Paragraph>
      <Paragraph position="8"> where k 2 set of relations.</Paragraph>
      <Paragraph position="9"> Definition 3 (Score of Chain) The score SCHAIN(Chx) of a chain Chx is defined as:</Paragraph>
      <Paragraph position="11"> Representative chains are chains delegated to represent concepts. If the number of the chains was m, chain Chx, should satisfy the criterion of the definition 4.</Paragraph>
      <Paragraph position="12"> Definition 4 (Criterion of Representative Chain) The criterion of representative chain, is defined as:</Paragraph>
      <Paragraph position="14"/>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Information Quantity and Information Ratio
</SectionTitle>
      <Paragraph position="0"> We describe a method to normalize the semantic importance of each concept and lexical item on the concept vector space. Figure 3 depicts the magnitude of the text vector derived from concept vectors C1 and C2. When the magnitude of vector C1 is a and that of vector C2 is b, the overall text magnitude is pa2 +b2.</Paragraph>
      <Paragraph position="1">  Each concept is composed of words and its weight wi. In composing the text concept vector, the part that vector C1 contributes to a text vector is x, and the part that vector C2 contributes is y. By expanding the vector space property, the weight of lexical items and concepts was normalized as in definitions 5 and definition 6.</Paragraph>
      <Paragraph position="2"> Definition 5 (Information Quantity, Ohm) Information quantity is the semantic quantity of a text, concept or a word in the overall document information. OhmT , OhmC, OhmW are defined as follows. The magnitude of concept vector Ci is SCHAIN(Chi):</Paragraph>
      <Paragraph position="4"> The text information quantity, denoted by OhmT , is the magnitude generated by the composition of all concepts. OhmCi denotes the concept information quantity.</Paragraph>
      <Paragraph position="5"> The concept information quantity was derived by the same method in which x and y were derived in Figure 3. OhmWj represents the information quantity of a word. PsWjjT is illustrated below.</Paragraph>
      <Paragraph position="6"> Definition 6 (Information Ratio, Ps) Information ratio is the ratio of the information quantity of a comparative target to the information quantity of a text, concept or word. PsCjT , PsWjC and PsWjT are defined as follows:</Paragraph>
      <Paragraph position="8"> The weight of a word and a chain was given when forming lexical chains by definitions 2 and 3. PsWjjCi denotes the information ratio of a word to the concept in which it is included. PsCijT is the information ratio of a concept to the text. The information ratio of a word to the overall text is denoted by PsWijT .</Paragraph>
      <Paragraph position="9"> The semantic index and weight are extracted according to the numerical value of information quantity and information ratio. We extracted nouns satisfying  that represents the content of a document is defined as follows:</Paragraph>
      <Paragraph position="11"> Although in both cases information quantity is the same, the relative importance of each word in a document differs according to the document information quantity. Therefore, we regard information ratio rather than information quantity as the semantic weight of indexes. This approach has an advantage in that we need not consider document length when indexing because the overall text information has a value 1 and the weight of the index is provided by the semantic information ratio to overall text information value, 1, whether a text is long or not.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> In this section we discuss a series of experiments conducted on the proposed system. The results achieved below allow us to claim that the lexical chains and concept vector space effectively provide us with the semantically important index terms. The goal of the experiment is to validate the performance of the proposed system and to show the potential in search performance improvement.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Standard TF vs. Semantic Indexing
</SectionTitle>
      <Paragraph position="0"> Five texts of Reader's Digest from Web were selected and six subjects participated in this study. The texts were composed of average 11 lines in length(about five to seventeen lines long), each focused on a specific topic relevant to exercise, diet, holiday blues,yoga, and weight control. Most texts are related to a general topic, exercise. Each subject was presented with five short texts and asked to find index  terms and weight each with value from 0 to 1. Other than that, relevancy to a general topic, exercise, was rated for each text. The score that was rated by six subjects is normalized as an average.</Paragraph>
      <Paragraph position="1"> The results of manually extracted index terms and their weights are given in Table 1. The index term weight and the relevance score are obtained by averaging the individual scores rated by six subjects. Although a specific topic of each text is different, most texts are related to the exercise topic. The percent agreement to the selected index terms is shown in Table 2(Gale, 1992). The average percent agreement is about 0.86. This indicates the agreement among subjects to an index term is average 86 percent.</Paragraph>
      <Paragraph position="2"> We compared these ideal result with standard term frequency(standard TF, S-TF) and the proposed semantic weight. Table 3 and Figures 4-6 show the comparison results. We omitted a few words in representing figures and tables, because standard TF method extracts all words as index terms. From Table 3, subjects regarded exercise, back, and pain as index terms in Text 1, and the other words are recognized as relatively unimportant ones. Even though exercise was mentioned only three times in Text 1, it had considerable semantic importance in the document; yet its standard TF weight did not represent this point at all, because the importance of exercise was the same as that of muscle, which is also mentioned three times in a text. The proposed approach, however, was able to  differentiate the semantic importance of words. Figure 4 shows the comparison chart version of Table 3, which contains three weight lines. As the weight line is closer to the subject weight line, it is expected to show better performance. We find from the figure that the semantic weight line is analogous to the manually weighted value line than the the standard TF weight line is.</Paragraph>
      <Paragraph position="3"> Figures 5 and 6 show two of four texts(Text2, Text3, Text4, Text5). Figures on the other texts are omitted due to space consideration. In Figure 5, pound is mentioned most frequently in a text, consequently, standard TF rates the weight of pound very high. Nevertheless, subjects regarded it as unimportant word. Our approach discriminated its importance and computed its weight lower than diet and exerciese. From the results, we see the proposed system is more analogous to the user weight line than the standard TF weight line.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Applicability of Search Performance
Improvements
</SectionTitle>
      <Paragraph position="0"> When semantically indexed texts are probed with a single query, exercise, the ranking result is expected to be the same as the order of the relevance score to the general topic exercise, which was rated by subjects.</Paragraph>
      <Paragraph position="1"> Table 4 lists the weight comparison to the index term exercise of five texts, and the subjects' relevance rate to the general topic exercise. Subjects' relevance rate is closely related with the subjects' weight to the index term, exericise. The expected ranking result is as following Table 5. TF weight method hardly discerns the subtle semantic importance of each texts, for example, Text1 and Text2 have the same rank. Length normalization(LN) and standard TF discern each texts but fail to rank correctly. However, the proposed indexing method provides better ranking results than the other TF based indexing methods.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Conclusion
</SectionTitle>
      <Paragraph position="0"> In this paper, we intended to change the basic indexing methods by presenting a novel approach using a concept vector space model for extracting and weighting indexes. Our experiment for semantic indexing supports the validity of the presented approach, which is capable of capturing the semantic importance of  a word within the overall document. Seen from the experimental results, the proposed method achieves a level of performance comparable to major weighting methods. In an experiment, we didn't compared our method with inverse document frequency(IDF) yet, because we will develop more sophisticated weighting method concerning IDF in future work.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>