<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1098">
  <Title>A Trigger Language Model-based IR System</Title>
  <Section position="3" start_page="0" end_page="21" type="metho">
    <SectionTitle>
2 Trigger Language Model based IR System
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Inter-relationship of Indexing Words
</SectionTitle>
      <Paragraph position="0"> To discover the inter-relationship of words in a specific context, we count the number of times different words co-occur within a fixed-size text window of a document. When the co-occurrence count is large enough, we regard the relationship as meaningful. Mutual information is a common tool in this situation, so we compute it as follows:</Paragraph>
      <Paragraph position="2"> MI(w_i, w_j) = log( N * C_d(w_i, w_j) / (C(w_i) * C(w_j)) ), where N denotes the size of the vocabulary, C_d(w_i, w_j) is the number of times the words w_i and w_j co-occur within a window of size d in the training set, and C(w_i) and C(w_j) are the counts of the words w_i and w_j in the training set.</Paragraph>
      <Paragraph position="4"> We use the corpus provided by the IR task of NTCIR2 (NTCIR 2002) as the training set to compute the mutual information of words. This corpus contains nearly 100 thousand news articles encoded in the BIG5 character set. We regard mutual information values larger than 25 as meaningful. Since stop words in a document or query are useless for representing the content, we remove the 200 most frequent words from the documents before the computation. Table 1 shows some examples with higher mutual information.</Paragraph>
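      <Paragraph position="5"> The windowed co-occurrence counting and mutual information computation described above can be sketched as follows. This is a minimal illustrative implementation assuming tokenized documents; the function name, the base-2 logarithm, and the count-based MI form are our assumptions, not the paper's code.

```python
import math
from collections import Counter

def windowed_mi(docs, window=20, stop_top_n=200):
    """Pointwise mutual information of word pairs co-occurring
    within a fixed-size text window (illustrative sketch)."""
    unigram = Counter(tok for doc in docs for tok in doc)
    # Remove the top-N most frequent words as stop words.
    stops = set(w for w, _ in unigram.most_common(stop_top_n))
    docs = [[t for t in doc if t not in stops] for doc in docs]
    unigram = Counter(tok for doc in docs for tok in doc)
    total = sum(unigram.values())

    # Count co-occurrences of distinct word pairs inside the window.
    pair = Counter()
    for doc in docs:
        for i in range(len(doc)):
            for j in range(i + 1, min(i + window, len(doc))):
                if doc[i] != doc[j]:
                    pair[tuple(sorted((doc[i], doc[j])))] += 1

    # MI on counts: log2(total * C(a,b) / (C(a) * C(b))).
    return {p: math.log2(total * c / (unigram[p[0]] * unigram[p[1]]))
            for p, c in pair.items()}
```

In the paper's setting the window holds 20 content words and only pairs whose mutual information exceeds 25 are kept.</Paragraph>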
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Algorithm of Triggered Words by Query
</SectionTitle>
      <Paragraph position="0"> Generally speaking, a word often has several different meanings, and the exact meaning adopted in a specific topic can be determined by the co-occurring words in its context. Different meanings of a word often lead to different sets of related words. To find the exact meaning of the words contained in a query, we design an algorithm to compute the vocabularies triggered by the query. It is these triggered words that reveal the exact meaning of the query words in a specific context and help pin down the topic of the query more clearly. The basic idea behind the algorithm is as follows: by computing mutual information, we can derive the related words of a query word; these are the semantically related vocabularies of the query word under different contexts. We propose that if the intersection of the related-word sets derived for different query words is not empty, the words in the intersection are useful for judging the exact meaning of the query words.</Paragraph>
      <Paragraph position="1"> At the same time, the more times an intersection word appears in the related vocabulary sets of different query words, the higher its weight for pinning down the topic of the query. So we design the following algorithm to compute the triggered vocabulary set of a query: Algorithm 1: Triggered vocabularies by query. Input: the vocabulary set I of the query words and their co-occurring words, after removing the stop words in the query.</Paragraph>
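      <Paragraph position="2"> Since the steps of the algorithm are described only informally here, the intersection-and-weighting idea can be sketched as follows. The data structure holding the precomputed related-word sets and all names are hypothetical, not the paper's:

```python
from collections import Counter

def triggered_vocabulary(query_words, related):
    """Sketch of Algorithm 1: collect words that appear in the
    MI-related sets of several query words, weighted by how many
    such sets contain them (illustrative, not the paper's code)."""
    counts = Counter()
    for q in query_words:
        for w in related.get(q, set()):
            counts[w] += 1
    # Keep only words shared by at least two query words' related
    # sets, i.e. words lying in a non-empty pairwise intersection.
    return {w: c for w, c in counts.items() if c >= 2}
```

A word appearing in more related sets receives a higher weight, matching the weighting rule stated above.</Paragraph>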
    </Section>
    <Section position="3" start_page="0" end_page="21" type="sub_section">
      <SectionTitle>
2.3 Similarity Computation of Query and Document
</SectionTitle>
      <Paragraph position="0"> We use a strategy similar to the Ponte language model method (Ponte and Croft 1998) to compute the similarity between the query and the document. That is, we first construct a simple language model from the statistical information of the vocabulary and then compute the generative probability of the query.</Paragraph>
      <Paragraph position="1"> The difference is that the trigger language model method takes the context information of a word into account. So we compute the triggered word set of the query according to Algorithm 1. In this way we obtain the triggered vocabulary set</Paragraph>
      <Paragraph position="3"> This set contains the words triggered by the query, and it is these triggered words that determine the exact meaning of the query vocabularies among the several optional senses. This helps pin down the topic of the query more clearly.</Paragraph>
      <Paragraph position="4"> By introducing the triggered-words factor into the document language model, we form the trigger language model based information retrieval system.</Paragraph>
      <Paragraph position="5"> The similarity of the query and a document can be computed as follows: P(Q|D) = prod_{i=1..l(Q)} P(q_i|D) (2), where Q = q_1 q_2 ... q_{l(Q)} denotes the query and l(Q) is the length of the query; P(q_i|D) denotes the trigger language model of the document D; lambda is the weight parameter of words in a document; and C(w, D) means the count of the word w appearing in the document D.</Paragraph>
      <Paragraph position="7"> P_t(q_i|w) (5) denotes the probability of the query word q_i being triggered by the document word w. When the two words are the same, the probability equals 1. If they are different and the word belongs to the triggered vocabulary set of the query, the probability equals the corresponding parameter in that set; otherwise the probability is 0.</Paragraph>
      <Paragraph position="9"> A collection-level probability is used for data smoothing; here C(q_i, S) denotes the number of times the query word q_i appears in the document set and L(S) denotes the total length of the documents which contain the query word.</Paragraph>
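      <Paragraph position="10"> The scoring procedure of this section can be sketched as follows. The interpolation weight lam, the trigger parameter alpha, and the collection-probability table coll_prob are assumed names for the parameters described above, not the paper's exact notation:

```python
import math

def trigger_lm_score(query, doc, triggered, alpha, lam, coll_prob):
    """Log-probability of the query under a trigger language model
    smoothed with a collection model (illustrative sketch)."""
    doc_len = len(doc)
    score = 0.0
    for q in query:
        p_doc = 0.0
        for w in doc:
            # Trigger probability: 1 if the words are identical,
            # alpha if the document word is in the triggered set,
            # and 0 otherwise.
            if q == w:
                p_trig = 1.0
            elif w in triggered:
                p_trig = alpha
            else:
                p_trig = 0.0
            p_doc += p_trig / doc_len
        # Interpolate with the collection probability for smoothing.
        p = lam * p_doc + (1.0 - lam) * coll_prob.get(q, 1e-9)
        score += math.log(p)
    return score
```

Documents whose words trigger many query words receive higher scores, which is the intended effect of the trigger factor.</Paragraph>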
      <Paragraph position="11"/>
    </Section>
  </Section>
  <Section position="4" start_page="21" end_page="21" type="metho">
    <SectionTitle>
3 Experiment Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
3.1 Corpus
</SectionTitle>
      <Paragraph position="0"> The corpus we use to evaluate the performance of our proposed trigger language model IR system is the traditional Chinese document set offered by NTCIR3 for the IR task. The corpus consists of 381,681 news articles from Hong Kong and Taiwan on varied topics. After word segmentation, the document set contains 150,700,953 words, of which 127,519 distinct words are entries of the vocabulary. The average length of a document is 394 words.</Paragraph>
      <Paragraph position="1"> The 50 queries offered by the NTCIR3 IR task are contained in an XML file, and each query consists of the following elements: Topic Number (NUM), Topic Title (TITLE), Topic Question (DESC), Topic Narrative (NARR) and Topic Concepts (CONC). To make it easier to compare the performance of the different IR methods, we adopt the Topic Question field as the query and regard the top 1000 retrieved documents as the standard result of the experiment.</Paragraph>
    </Section>
    <Section position="2" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
3.2 Analysis of Experiment Results
</SectionTitle>
      <Paragraph position="0"> We design three comparative experiments to evaluate the trigger language model IR method: the vector space model, the Ponte language model based method, and the trigger language model approach.</Paragraph>
      <Paragraph position="1"> Precision and recall are the two main evaluation measures. For the trigger model IR method, the optimal size of the text window is 20 content words, and mutual information over 25 is regarded as meaningful.</Paragraph>
      <Paragraph position="2"> Experiment results can be seen in table 2.</Paragraph>
      <Paragraph position="3"> The data in column % of Table 2 shows the performance improvement of the Ponte language model over the vector space model: the precision of the language model based method increased by 10% and the recall increased by nearly 13.7%. The data in column %[?] of Table 2 shows the improvement of the trigger language model over the Ponte language model method: precision increased by 12% and recall increased by nearly 10.8%. We can conclude that the trigger language model improves the performance greatly.</Paragraph>
      <Paragraph position="4"> The performance comparison is shown more clearly in Figure 1.</Paragraph>
    </Section>
  </Section>
</Paper>