<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1115"> <Title>Temporal Ranking for Fresh Information Retrieval</Title> <Section position="4" start_page="4" end_page="4" type="metho"> <SectionTitle> 3. Cooperative Search Engine </SectionTitle> <Paragraph position="0"> First, we explain the basic idea of CSE. To minimize the update interval, every web site makes its own indices with a local indexer. However, these sites are not yet cooperative. Each site therefore sends information about what it knows (i.e., which words) to the manager. This information is called Forward Knowledge (FK); it is meta-knowledge indicating what each site knows. FK is the same as FI of Ingrid. At search time, the manager tells the client which sites have documents including any word in the query, and the client then sends the query to all of those sites. Because CSE thus needs two-pass communication at search time, its retrieval time is longer than that of a centralized search engine.</Paragraph> <Paragraph position="1"> CSE consists of the following components (see Figure 1).</Paragraph> <Paragraph position="2"> Location Server (LS): It manages FK exclusively.</Paragraph> <Paragraph position="3"> Using FK, LS performs Query based Site Selection, described later. LS also has a Site selection Cache (SC), which caches the results of site selection.</Paragraph> <Paragraph position="4"> Cache Server (CS): It caches FK and retrieval results. LS can be thought of as the top-level CS. CS realizes Next 10 searches by caching retrieval results.</Paragraph> <Paragraph position="5"> Furthermore, it realizes parallel search by calling the LMSEs, mentioned later, in parallel.</Paragraph> <Paragraph position="6"> Local Meta Search Engine (LMSE): It receives a query from a user, sends it to CS (User I/F in Figure 2), and performs the local search process by calling LSE, mentioned later (Engine I/F in Figure 2). 
It works as a meta search engine that abstracts the differences between LSEs.</Paragraph> <Paragraph position="7"> Local Search Engine (LSE): It gathers documents locally (Gatherer in Figure 2), makes a local index (Indexer in Figure 2), and retrieves documents by using the index (Engine in Figure 2). In CSE, Namazu[1] can be used as an LSE. Furthermore, we are developing an original indexer designed to realize high-level search functions such as parallel search and phrase search.</Paragraph> <Paragraph position="8"> Namazu has been widely used for the search services of various Japanese sites.</Paragraph> <Paragraph position="9"> Next, we explain how the update process is done. In CSE, the Update I/F of LSE carries out the update process periodically. The algorithm for the update process in CSE is as follows. 1. The Gatherer of LSE gathers all the documents (Web pages) in the target Web sites, using direct access (i.e., via NFS) if it is available, archived access (i.e., via CGI) if that is available but direct access is not, and HTTP access otherwise.</Paragraph> <Paragraph position="10"> Here, we explain archived access in detail. In archived access, a special CGI that provides mobile agent place functions is used. A mobile agent is sent to that place. The agent archives local files, compresses them, and sends them back to the gatherer.</Paragraph> <Paragraph position="11"> 2. The Indexer of LSE makes an index for the gathered documents by parallel processing based on the Boss-Worker model.</Paragraph> <Paragraph position="12"> 3. Update phase 1: Each LMSE_i updates as follows.</Paragraph> <Paragraph position="13"> 3.1. 
Engine I/F of LMSE_i obtains from the corresponding LSE the total number N_i of all the documents, the set of all the words appearing in some documents, and the number n_{k,i} of all the documents including word k, and sends all of them to CS together with its own URL.</Paragraph> <Paragraph position="20"> 3.2. CS sends all the contents received from each LMSE_i to the upper-level CS. The transmission of the contents is terminated when they reach the top-level CS (namely, LS).</Paragraph> <Paragraph position="21"> 3.3. LS calculates the value of $\mathrm{idf}(k) = \log(\sum_i N_i / \sum_i n_{k,i})$</Paragraph> <Paragraph position="23"/> <Paragraph position="25"> (d,k) is a score of document d containing k, D is the set of all the documents in the site, and sends to CS all of them together with its own URL.</Paragraph> <Paragraph position="26"> 4.3. CS sends all the contents received from each LMSE_i to the upper-level CS. The transmission of the contents is terminated when they reach the top-level CS (namely, LS).</Paragraph> <Paragraph position="29"> Note that the data transferred between the modules are mainly used for the distributed calculation of the score based on the tf*idf method. We call this method the distributed tf*idf method. The score based on the distributed tf*idf method is calculated during the search process, so we give the details of the score when we explain the search process in CSE.</Paragraph> <Paragraph position="30"> In CSE, the performance of the search process is sacrificed for the sake of the performance of the update process. Here we explain how the search process in CSE is done.</Paragraph> <Paragraph position="31"> 1. When LMSE_0 receives a query from a user, it sends the query to CS.</Paragraph> <Paragraph position="32"> 2. 
CS obtains from LS all the LMSEs expected to have documents satisfying the query.</Paragraph> <Paragraph position="33"> 3. CS sends the query to each of the LMSEs obtained. 4. Each LMSE searches for documents satisfying the query by using LSE, and returns the result to CS. 5. CS combines all the results received from the LMSEs, and returns them to LMSE</Paragraph> <Paragraph position="35"/> </Section> <Section position="5" start_page="4" end_page="4" type="metho"> <SectionTitle> 6. LMSE </SectionTitle> <Paragraph position="0"> displays the search result to the user.</Paragraph> <Paragraph position="1"> Here, we describe the design of a scalable architecture for the distributed search engine, CSE.</Paragraph> <Paragraph position="2"> In CSE, communication delay occurs at search time. This problem is addressed by the following techniques.</Paragraph> <Paragraph position="3"> Look Ahead Cache in Next 10 Search[3] To shorten the delay in the search process, CS prepares the next result for the Next 10 search. That is, the search result is divided into page units, and each page unit is cached in advance by a background process without increasing the response time.</Paragraph> <Paragraph position="4"> Score based Site Selection (SbSS)[4] In the Next 10 search, the score of the next ranked document in each site is gathered in advance, and requests to the sites with low-ranked documents are suppressed. Through this suppression, the network traffic does not increase unnecessarily. For example, there are more than 100,000 domain sites in Japan.</Paragraph> <Paragraph position="5"> However, by using this technique, requests to about ten sites are sufficient for each continued search.</Paragraph> <Paragraph position="6"> Global Shared Cache (GSC)[5] An LMSE sends a query to the nearest CS. Many CSs may send the same requests to LMSEs. So, in order to globally share cached retrieval results among CSs, we proposed the Global Shared Cache (GSC). 
In this method, LS memorizes the authority CS There is at least one CS in CSE in order to improve the response time of retrieval. However, the cache soon becomes invalid because the update interval is very short in CSE. The valuable first page is also lost. Therefore, we need a persistent cache, which holds valid cache data before and after updating. In this method, there are two update phases. In the first update phase, each LMSE sends the number of documents including each word to LS, and LS determines the idf of each word. In the second update phase, a preliminary search is performed using the new idfs in order to update the caches.</Paragraph> <Paragraph position="7"> Query based Site Selection (QbSS)[7][8] CSE supports Boolean search based on Boolean formulas. In the Boolean search of CSE, the operations and, or, and and-not are available. Let S_A and S_B be the sets of target sites for search queries A and B, respectively. Then the sets of target sites for the queries A and B, A or B, and A and-not B are S_A ∩ S_B, S_A ∪ S_B, and S_A, respectively. This selection of the target sites reduces the number of messages in the search process.</Paragraph> <Paragraph position="8"> These techniques are used as follows: if the previous page of Next 10 search has been</Paragraph> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 4.1 Temporal Query </SectionTitle> <Paragraph position="0"> Here, we describe the temporal queries used to support the retrieval of temporal information. CSE currently supports Boolean queries for keywords, and temporal queries in addition to keyword queries. Temporal queries are used to select documents existing at certain times or within certain time intervals.</Paragraph> <Paragraph position="1"> A temporal query is an expression of a time point or a time interval. First, we define a time point expression.</Paragraph> <Paragraph position="2"> Several conventional search engines can retrieve documents modified within a given number of days or months. 
However, this level of granularity is not sufficient for retrieving fresh information. A fresh information retrieval system has to retrieve documents modified within a matter of minutes at least. CSE updates the index within a few minutes, independent of the scale of the system. In the near future, we expect to allow retrieval in real time, which is ideal for the purpose of fresh information retrieval. Therefore, we employ the second as the granularity of a chronon.</Paragraph> <Paragraph position="3"> A computer stores time as an integer that represents the number of seconds since 1970-01-01 00:00:00 GMT.</Paragraph> <Paragraph position="4"> However, it is not natural for a human to count time using only seconds, so in this paper we represent time with the following expression.</Paragraph> <Paragraph position="5"> Y/M/D/h/m/s Here, Y is the year in A.D., M is the numerical month (1-12), D is the day of the month (1-31), h is the hour (0-23), m is the minute (0-59), and s is the second (0-59). If a granularity is omitted, it takes its initial value. For example, Y denotes Y/1/1/0/0/0.</Paragraph> <Paragraph position="6"> Furthermore, a time prefixed with a minus sign denotes the difference from the current time.</Paragraph> <Paragraph position="7"> -Y/M/D/h/m/s For example, -1/6 is a year and 6 months ago. If the accepted temporal query is negative, it is added to the current time. A negative temporal query is provided for the user's convenience.</Paragraph> <Paragraph position="8"> Next, we define the attributes of a document and their symbols as time point variables.</Paragraph> <Paragraph position="9"> /c: the created time of the document; /e: the effective modified time of the document; /m: the last modified time of the document; /now: the current time. Here, the effective modified time of the document denotes the last modified time at which the content of the version is nearly equal to that of the current version. 
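As an illustration of the Y/M/D/h/m/s notation defined earlier in this section, the following is a minimal sketch of how such an expression might be resolved to a concrete time. It is not the CSE implementation: the function name parse_time_point, the use of UTC, the treatment of omitted fields as zero in the relative form, and the clamping of the day when the target month is shorter are all assumptions made for this example.

```python
import calendar
from datetime import datetime, timedelta, timezone

# Initial values for omitted granularities (Y alone means Y/1/1/0/0/0).
INITIAL = (1, 1, 0, 0, 0)  # month, day, hour, minute, second


def parse_time_point(expr, now):
    """Resolve 'Y/M/D/h/m/s' (absolute) or '-Y/M/D/h/m/s' (relative to now).

    Hypothetical helper: the name and signature are not taken from CSE.
    """
    relative = expr.startswith("-")
    fields = [int(part) for part in expr.lstrip("-").split("/")]
    if len(fields) not in range(1, 7):
        raise ValueError("expected between one and six fields: " + expr)

    if not relative:
        # Absolute form: pad the omitted granularities with their initial values.
        padded = fields + list(INITIAL[len(fields) - 1:])
        return datetime(*padded, tzinfo=timezone.utc)

    # Relative form (assumption: omitted granularities count as zero).
    years, months, days, hours, minutes, seconds = fields + [0] * (6 - len(fields))
    # Step back whole years and months with calendar arithmetic.
    total = now.year * 12 + (now.month - 1) - (years * 12 + months)
    year, month0 = divmod(total, 12)
    # Assumption: clamp the day if the target month is shorter.
    day = min(now.day, calendar.monthrange(year, month0 + 1)[1])
    shifted = now.replace(year=year, month=month0 + 1, day=day)
    return shifted - timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)
```

With a current time of 2003/7/11/12/0/0, this sketch resolves 2002 to 2002/1/1/0/0/0 and -1/6 to 2002/1/11/12/0/0, i.e. a year and 6 months earlier.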
We will describe how to calculate /e in section 4.2. In the immutable document model, /m is used, and in the mutable document model, /c is used. The relationship /c ≤ /e ≤ /m ≤ /now always holds.</Paragraph> <Paragraph position="10"> The following queries exist concerning time points t</Paragraph> <Paragraph position="12"> Here, time point queries are compared with each other in the smallest granularity even if they form an elliptical representation.</Paragraph> <Paragraph position="13"> A time interval is represented as [t ] is easy for us to understand. In Allen's temporal interval logic, which lacks the concept of a time point, it is not clear whether both edges of the time interval are included in the range of the time interval or not. In our system, we allow an elliptical representation of a time interval such as [T] = [T,T+1], where T+1 denotes the increment of the smallest explicit granularity, e.g. [2000]=[2000,2001], [2002/1/31]=[2002/1/31,2002/2/1]. The lifetime of the document is represented as [/c,/now]. As mentioned in section 2.1, there are a large number of relationships between Allen's time intervals. However, they can all be reduced to relationships between time points and the functions giving the start point and the end point of the time interval. For this reason, CSE does not support interval queries but only point queries.</Paragraph> <Paragraph position="14"> Next, we discuss whether a temporal query should be mixed with a keyword query or not. With mixing, the semantics of a query is simple but its implementation is complex.</Paragraph> <Paragraph position="15"> Conversely, without mixing, the semantics of a query is complex but it can be implemented easily. 
For example, we can use the following query if mixing is allowed.</Paragraph> <Paragraph position="16"> FIFA World Cup and (((Korea or Japan) and (/c in [2002])) or (France and (/c in [1998]))) This query searches for both documents that describe the World Cup held in Korea and Japan in 2002 and documents that describe the World Cup held in France in 1998.</Paragraph> <Paragraph position="17"> On the other hand, if mixing is not allowed, the following query could be used.</Paragraph> <Paragraph position="18"> FIFA World Cup and (Korea or Japan or France) /c in [2002] or /c in [1998] Here, the relationship between keyword query and temporal query is conjunctive. This query searches for documents that describe both the World Cup of France and the World Cup of Korea and Japan in 1998 or 2002. In the latter method, a document describing Korea and Japan in 1998 and another document describing France in 2002 may both be retrieved. Therefore, we employ the former method.</Paragraph> <Paragraph position="19"> Temporal query TQ is represented with BNF as follows: constant, and TC is a temporal query. Note that TC alone cannot be the temporal query TQ. This is because all documents may be selected if only TC is the query, and such retrieval is not useful. Especially in distributed search engines, a traffic overload may occur because sites are not selected. TC is used to select from the result of Q using a temporal condition.</Paragraph> <Paragraph position="20"> The time in a temporal query is not the time interval where information is current but the time point of the origin of information. Therefore, the query =/now cannot match any document. 
The query &lt;/now can match the same documents as a non-temporal information retrieval.</Paragraph> </Section> <Section position="2" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 4.2 Content based Freshness </SectionTitle> <Paragraph position="0"> For a user who wants to know what was fresh at a certain point in time, it is useful to display a list of documents that were fresh at that time. However, selecting documents according to the last modification time recorded by the file system is not appropriate, because even if the last change to a document was only the correction of a slight typographical error, the document is regarded as having new content at that modification time. On the other hand, adopting the time when each document was published on the network is also undesirable, because we cannot then recognize that a document was fresh at the point in time when its content was completely changed.</Paragraph> <Paragraph position="1"> These shortcomings arise from the policy of treating the freshness of a document without taking into account changes in the meaning of its content. Unfortunately, it is difficult to determine whether the content of a document has largely changed or not. In this paper, we propose an alternative method of determining the change in content of a document, by using the change in TF*IDFs for keywords appearing in it. In CSE, a retrieval result is displayed to the user as a list of documents ranked according to TF*IDF for the retrieval query. As in other search engines adopting TF*IDF ranking, if an OR search for all keywords is submitted to CSE, all documents are ranked according to the largest TF*IDF for a keyword appearing in each document, which implies that we can think of a document as containing information regarding the keyword for which TF*IDF is the largest. 
Therefore, when the keyword having the largest TF*IDF is changed by editing a document, the content of the document is thought of as having changed, and the document is then ranked according to the keyword that has the largest TF*IDF after the change. The proposed method for determining whether or not the content of a document has changed obeys this policy of TF*IDF ranking.</Paragraph> <Paragraph position="2"> The concrete algorithm for the method is as follows: For any time, if the keyword that has the largest TF*IDF in the document has changed, then update the time stamp of the document being fresh to be the current time.</Paragraph> </Section> <Section position="3" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 4.3 Temporal Ranking </SectionTitle> <Paragraph position="0"> Ranking means sorting retrieved results. Conventional search engines sort retrieved results in descending order of document scores. However, in temporal information retrieval, temporal ranking is required. In temporal ranking, a search engine sorts retrieved results in order of time. Here, assume that ranking method is on Boolean formula of keywords in a query.</Paragraph> <Paragraph position="1"> In temporal ranking, QbSS and SbSS work as well as in score based ranking. These effects are summarized in Table 2. The first column of Table 2 gives the ranking order: newer or older. Here, the top item is the newest one in newer order, and it is the oldest one in older order. In second column, there are two kind of basic</Paragraph> <Paragraph position="3"> . The third column gives the relation of Tc in a query to the total time interval [min, max] of a server. The total time interval includes the modified times of all documents in a server. Finally, in the column effect, the site selection techniques that work well are listed. When QbSS works well, the site means that SbSS works well.</Paragraph> <Paragraph position="4"> PC (Persistent Cache) means that SbSS does not work well but PC may work. SbSS works well if max is the time of the top item in the newer order or if min is the time of the top item in the older order. 
A query is sent to the server iff either SbSS or is a key technique for scalability. SbSS does not work if a non-temporal query includes either AND or AND-NOT. However, in a temporal query, SbSS may work even if the temporal query includes AND and AND-NOT.</Paragraph> <Paragraph position="5"> A complex time interval query can be reduced to a range of one dimension of time. For an example, ]. In this way, all time interval queries can be reduced to the simple time point queries in Table 2. Therefore, SbSS is efficient in temporal ranking. However, SbSS does not work well if temporal queries and non-temporal queries are combined. From this point of view, a temporal query should not be used together with a non-temporal query. Although SbSS is not effective in that case, PC may work well. This is because PC works well if the query has already been retrieved once.</Paragraph> </Section> </Section> <Section position="6" start_page="4" end_page="4" type="metho"> <SectionTitle> 5. Implementation </SectionTitle> <Paragraph position="0"> In this section, we describe the implementation of fresh information retrieval.</Paragraph> <Paragraph position="1"> In CSE, LMSE searches for documents by calling LSE. LSE must support TF based scoring (not TF*IDF). Namazu, one of the most popular small search engines in Japan, supports TF scoring. We assume that Namazu is used as the implementation of LSE in our system.</Paragraph> <Paragraph position="2"> LSE constructs an index when updating occurs. Here, LSE changes the TF values in its index even if documents are only slightly modified. This is the original behavior of LSE.</Paragraph> <Paragraph position="3"> LMSE has yet another index. After LSE has finished updating its index, LMSE extracts the TF values of each document from LSE's index and compares them with the TF values in LMSE's own index. If they are different, LMSE copies the TF value of the document from LSE's index to LMSE's index, and changes the publish timestamp of the document to the time at which LSE began the update. 
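The comparison step just described can be sketched as follows. This is only an illustration under stated assumptions, not the actual LMSE code: the function name refresh_timestamps, the dictionary layout of the two indices, and the handling of documents that are new to LMSE's index are all hypothetical.

```python
def refresh_timestamps(lmse_index, lse_index, update_started):
    """Copy changed TF values into LMSE's index and move their timestamps.

    Hypothetical layout (an assumption for this sketch):
      lse_index:  doc_id -> {word: tf}
      lmse_index: doc_id -> {"tf": {word: tf}, "published": timestamp}
    Returns the ids of the documents whose publish timestamp was changed.
    """
    changed = []
    for doc_id, tf in lse_index.items():
        entry = lmse_index.get(doc_id)
        # Assumption: a document unknown to LMSE is treated like a changed one.
        if entry is None or entry["tf"] != tf:
            lmse_index[doc_id] = {"tf": dict(tf), "published": update_started}
            changed.append(doc_id)
    return changed
```

Documents whose TF values are identical in both indices keep their earlier publish timestamp, so only documents with changed TF values are stamped with the time at which the update began.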
Finally, LMSE extracts the highest score of each word and the range of timestamps (oldest and latest) of each document, and sends them to LS. Since LSE is used to search, slight changes to documents are reflected in their scores. However, the timestamp is replaced by the time recorded by LMSE.</Paragraph> <Paragraph position="4"> If a query includes a temporal expression, Query based Site Selection (QbSS)[7][8] is also used to select search target sites. Since LS has only the latest timestamps, LS cannot select sites. However, it is effective for fresh information retrieval, which is the main purpose of CSE.</Paragraph> <Paragraph position="5"> LMSE descends the query recursively, and requests each single-keyword expression from LSE. LSE returns a result sorted in TF order. LMSE multiplies the scores by IDF and carries out the set operations, selecting by the temporal condition. The search results are sorted by the specified ranking method. CS does not share the cache queues for different ranking methods.</Paragraph> </Section> <Section position="7" start_page="4" end_page="4" type="metho"> <SectionTitle> 6. Evaluations </SectionTitle> <Paragraph position="0"> First, we show that the distributed search engine can retrieve fresh information. In [2], we compared the update intervals of CSE and of a centralized search engine, which used Namazu and wget, on the same document set. The centralized search engine took 2 hours and 20 minutes, whereas CSE finished in a few minutes. CSE did not fail to search for fresh information within the bounds of these few minutes.</Paragraph> <Paragraph position="1"> Assume that there are three documents, A, A' and A'', which have similar subjects, and a fourth document, B, on a different subject. Let A be mixed with each of A', A'' and B in the ratio t : 1−t, giving tA+(1−t)A', tA+(1−t)A'' and tA+(1−t)B. Fig. 3 shows the relationship between t and the maximum values of TF*IDF. 
Here, the subjects of A, A', A'' and B are emacs, mule, xemacs and vi, respectively. The order of closeness to the subject of emacs is mule &lt; xemacs &lt; vi. The word that has the maximum TF*IDF value in the document changes at t=2 for mule, which has a subject similar to emacs. For vi, which has quite a different subject, the word with the maximum TF*IDF changes at t=3.</Paragraph> <Paragraph position="2"> Therefore, when the variation of the content is detected by the maximum value of TF*IDF, the content will be judged to have changed if 20 to 30% of a document has changed.</Paragraph> </Section> <Section position="8" start_page="4" end_page="4" type="metho"> <SectionTitle> 7. Related Works </SectionTitle> <Paragraph position="0"> There are two types of temporal information retrieval: retrieving documents by time and displaying documents in order of time. Namazu[1], Goo, Infoseek, NAVER[11], Google and so on can be used to search documents by time.</Paragraph> <Paragraph position="1"> Namazu searches HTML documents with HTTP headers and e-mail-like documents by using a regular expression involving time. Since these documents have a date: field in their header, they can easily be searched by time. However, normal HTML documents without headers have no date: field. In HTML documents with a header, the date: field often denotes the time at which they were downloaded. For this reason, Namazu cannot search web documents by time.</Paragraph> <Paragraph position="2"> In Goo, a user can select before/after a particular date.</Paragraph> <Paragraph position="3"> Goo searches for the newest information, since Goo does not distinguish between different versions of a document.</Paragraph> <Paragraph position="4"> However, searching documents by date is not efficient for fresh information retrieval. 
Searching by the second, or at most by the minute, is required.</Paragraph> <Paragraph position="5"> In Infoseek, a user can also select before/after a particular date, and Infoseek supports searching by a range of dates.</Paragraph> <Paragraph position="6"> NAVER supports specifying a range of months in its document search mode, which searches for non-HTML documents such as MS Word and Excel files, PDF and so on.</Paragraph> <Paragraph position="7"> However, specifying a range of months is completely unsuitable. Furthermore, NAVER does not support specifying a particular date or month.</Paragraph> <Paragraph position="8"> In Google, a user can select the past 3, 6 or 12 months in Advanced Search mode. However, this is not as efficient as NAVER.</Paragraph> <Paragraph position="9"> Among those mentioned above, Infoseek is the most similar to fresh information retrieval; however, its freshness is insufficient because Infoseek only supports specifying documents by date.</Paragraph> <Paragraph position="10"> Namazu, FreshEye and NAVER display search results in order of time. They can also display results in increasing or decreasing order. Other search engines such as Yahoo, AltaVista, Excite and Lycos do not support searching by time.</Paragraph> <Paragraph position="11"> In the field of databases, there is much work on temporal database management[12]. The Valid Web[13] realizes temporal retrieval by specifying the valid time of web documents using XML. However, no HTML documents are able to specify a valid time.</Paragraph> <Paragraph position="12"> Although search engines are a kind of database, few experiments have been conducted on retrieving temporal information. One of the reasons is the search engine architecture. The search engines mentioned above all have a centralized architecture. Centralized search engines spend a lot of time gathering documents. 
Therefore, it is difficult for these search engines to collect temporal information.</Paragraph> <Paragraph position="13"> However, with distributed search engines, almost real-time retrieval is practical, since they do not need to gather documents over the network.</Paragraph> <Paragraph position="14"> A number of distributed search engines exist, such as Whois++[9], Harvest[10], GlOSS and so on. Whois++ and Harvest use forward knowledge. Forward knowledge is also used in CSE; however, these systems place no limitation on retrieval response time, whereas CSE realizes a regular response time regardless of its scale. In addition, these search engines do not support temporal information retrieval.</Paragraph> </Section> </Paper>