<?xml version="1.0" standalone="yes"?> <Paper uid="X93-1008"> <Title>INQUERY System Overview</Title> <Section position="2" start_page="57" end_page="58" type="abstr"> <SectionTitle> 4. Accomplishments </SectionTitle> <Paragraph position="0"> The TIPSTER evaluations have demonstrated that the INQUERY approach to retrieval and routing is both effective and efficient. We have shown that the probabilistic framework is portable, trainable and improvable. The extensibility and robustness of this approach are further demonstrated in technology transfer efforts involving INQUERY. Apart from these general accomplishments, however, we can be more specific about the lessons that have been learned in the major areas of work.</Paragraph> <Paragraph position="1"> Indexing: * Stopwords are sometimes necessary (e.g. Ms. The, sit-in). In order to reduce the number of queries that fail due to incomplete indexing, we have extended the basic indexing algorithms to include indexing of stopwords when they are capitalized but not at the start of sentences, and when they are joined to other words.</Paragraph> <Paragraph position="2"> * It is sometimes difficult to decide what is a word (e.g. numbers, special characters). We have carried out a number of experiments to determine the most flexible and efficient way of indexing word tokens.</Paragraph> <Paragraph position="3"> * &quot;Real&quot; paragraph boundaries often do not indicate content shift in documents. Our experiments with paragraph-based retrieval indicate that real paragraphs are no more effective than text windows, although there are collection-specific exceptions, such as detecting Wall Street Journal articles that cover multiple short topics.</Paragraph> <Paragraph position="4"> * Feature recognition can add significant overhead.</Paragraph> <Paragraph position="5"> We have done the first experiments on the impact of including simple extraction (e.g. company names, dates, locations) in the indexing process. 
We have attempted to reduce the indexing overhead to a minimum in order to support high-volume updates (e.g. routing).</Paragraph> <Paragraph position="6"> * &quot;Justified&quot; stemming is hard (effective stems vs. understandable stems). We have developed a new stemming algorithm that produces much more understandable stems than the current standard (Porter). The effectiveness of the new algorithm varies across collections and we are continuing to work towards consistent improvements.</Paragraph> <Paragraph position="7"> Query processing: * Sophisticated query processing produces significant improvements. We have developed a variety of query processing techniques that together have improved the overall system effectiveness considerably. * Large queries resulted in less emphasis on word disambiguation. The TIPSTER topics are much longer than typical IR queries and the mutual disambiguation produced by the presence of so many terms has made word sense disambiguation a marginal technique. We continue to study this technique with other collections.</Paragraph> <Paragraph position="8"> * Large queries also make query expansion difficult. Simple query expansion techniques, such as using a general thesaurus, are not effective in this environment. * Automatic query expansion still looks promising.</Paragraph> <Paragraph position="9"> Query expansion based on the WORDFINDER system has produced the most significant results of their type to date.</Paragraph> <Paragraph position="10"> * Extracting query terms/phrases from the narrative is hard. The abstract nature of the narrative section of the TIPSTER topics makes automatic processing difficult. This is a promising area for future research.</Paragraph> <Paragraph position="11"> * Feature extraction/recognition is most effective in narrow domains. Our experiments with including extraction in the indexing and retrieval process showed only small effectiveness improvements in TIPSTER. 
Our experience with other collections has shown more promise.</Paragraph> <Paragraph position="12"> * Manual queries can improve results, but usually only in combination with automatically processed queries. In general, we have shown that automatically processed queries are competitive with hand-crafted queries.</Paragraph> <Paragraph position="13"> Retrieval: * Improvements depend heavily on baseline. In general, it is easy to get large percentage improvements if a baseline search with poor performance is used. The improvements obtained using INQUERY are often small, but the overall effectiveness is consistently better than other approaches.</Paragraph> <Paragraph position="14"> * Phrases were not as effective as on other collections. Despite considerable efforts on developing the phrase model, the overall improvements obtained using phrases are small (but consistent). We are continuing to work on this issue.</Paragraph> <Paragraph position="15"> * Estimation was a significant problem for large, heterogeneous databases. The estimation functions used for previous, smaller test collections proved to be inadequate for TIPSTER. We have developed new forms of these functions which have resulted in significant improvements.</Paragraph> <Paragraph position="16"> * Paragraph-level retrieval can help in combination with document-level retrieval. In full-text collections, the notion of a local match and a global match is important. We have shown that paragraph-level matching can produce significant improvements in effectiveness in two situations. One is when paragraph-level connections between query concepts are specified manually; the other is when automatic paragraph-level matching is combined (in the INQUERY framework) with document-level matching.</Paragraph> <Paragraph position="17"> Routing: * Automatic construction of routing profiles has consistently outperformed manually specified profiles in our experiments. 
Combining these forms of profile results in further improvements.</Paragraph> <Paragraph position="18"> * Proximity pairs are important in the automatic profile. We compared simple word-based learning with learning structure in the form of phrase and paragraph-level proximities and found that the structured profiles perform better.</Paragraph> <Paragraph position="19"> * Routing is different from relevance feedback. Techniques that were superior in relevance feedback experiments (small amounts of training data) have not been the best in routing experiments (large amounts of training data).</Paragraph> <Paragraph position="20"> * Amount of training data has a significant but limited effect on performance. We have shown that good performance can be obtained with limited amounts of training data or relevance judgements. This has important implications for practical applications. Japanese: * INQUERY works well with Japanese with minor changes. The only differences between the Japanese and English versions of INQUERY are the modules for the morphological processing and the interface. The operators in the query language are identical, although there is some evidence that a Japanese-specific phrase operator may be more effective. * Character-based indexing can be competitive with word-based indexing. Indexing all characters is much faster than segmenting Japanese into words, and our retrieval experiments have shown the effectiveness levels to be similar. The best performance was obtained using combinations of both representations. * Query processing is at least as important in Japanese as in English. Parsing a Japanese query and constructing the appropriate INQUERY query from that parse is a crucial part of getting effective performance, particularly with character-based indexing. Experiments with different approaches to query processing are continuing as more topics are obtained.</Paragraph> </Section> </Paper>