<?xml version="1.0" standalone="yes"?> <Paper uid="X93-1007"> <Title>DOCUMENT DETECTION SUMMARY OF RESULTS</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. 12-MONTH EVALUATION </SectionTitle>
<Paragraph position="0"> The work done for the 12-month evaluation was mainly a scaling effort. Not all data was available, so only partial results were completed. In particular, the University of Massachusetts turned in 7 runs using the adhoc topics, with experiments trying different parts of the topic to automatically create the query, and also adding phrases.</Paragraph>
<Paragraph position="1"> Additionally they tried some manually edited queries.</Paragraph>
<Paragraph position="2"> HNC Inc. turned in 4 runs using the adhoc topics, with experiments also using different parts of the topic to automatically generate queries. Additionally they tried various types of &quot;bootstrapping&quot; methodologies to generate context vectors. Syracuse University turned in no runs, but had completed the extensive design work as proposed in their timeline. The University of Massachusetts also did 4 runs on the routing topics, but the lack of good training data made this very difficult. In general the results for the systems were good, with the University of Massachusetts outperforming HNC Inc. on the adhoc runs, but it was felt by all that this evaluation represented a very &quot;baseline&quot; effort. For these reasons, no graphs of these results will be presented.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="33" type="metho"> <SectionTitle> 3. 18-MONTH EVALUATION </SectionTitle>
<Paragraph position="0"> By the 18-month mark, the systems had finished much more extensive sets of experiments. The University of Massachusetts continued to investigate the effects of using different parts of the topic for the adhoc runs, but this time trying different combinations using the inference net methodology. Figure 1 shows three INQRY runs for the adhoc topics done for the 18-month evaluation. The plot for INQRYV represents results from queries created automatically using most of the fields of the topics. The INQRYJ results are from the same queries, but including phrases and concept operators. The INQRYQ results show the effects of manually editing the INQRYJ queries, with those modifications restricted to eliminating words and phrases, adding additional words and phrases from the narrative field, and inserting paragraph-level operators around words and phrases. As can be seen, the use of phrases in addition to single terms helped somewhat, but the results from manual modification of the queries were the best adhoc runs.</Paragraph>
<Paragraph position="1"> Figure 1 also shows the two HNC adhoc results for the 18-month evaluation period. These runs represent final results from many sets of bootstrapping runs in which HNC evolved techniques for completely automatically creating context vectors for the documents. The plot marked HNC2aV represents the results of using these context vectors for retrieval, with queries built automatically from the concepts field of the topic. The HNC2aM results add a rough Boolean filter to the process, requiring that three terms in the concepts field match terms in the documents before the context vectors are used for ranking. This Boolean filter provided considerable improvements.</Paragraph>
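<Paragraph> The general shape of the HNC2aM approach is a Boolean prefilter followed by context-vector ranking. The following minimal sketch is illustrative only: the function and variable names are hypothetical, and cosine similarity stands in for whatever comparison the HNC context-vector system actually performed; only the requirement that at least three concept-field terms match before ranking comes from the description above.

import numpy as np

def rank_with_boolean_filter(query_vec, doc_vecs, doc_terms, concept_terms, min_match=3):
    """Rank documents by similarity of their context vectors to the query
    vector, restricted to documents containing at least min_match of the
    topic's concept-field terms."""
    # NOTE: cosine similarity and these names are illustrative assumptions,
    # not HNC's documented implementation.
    concepts = set(concept_terms)
    candidates = [d for d, terms in doc_terms.items()
                  if len(concepts & set(terms)) >= min_match]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    return sorted(candidates, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
</Paragraph>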
<Paragraph position="2"> Figure 2 shows the routing results for the 18-month evaluation. The three plots for the INQRY system represent a baseline method (INQRYA, the same as the INQRYJ adhoc approach) and two modifications to this method. The INQRYP results show the effect of manually modifying the query, similar to the method used in producing the INQRYQ adhoc results. The plot for INQRYR shows the first results from probabilistic query expansion and reweighting performed automatically using relevance feedback techniques on the training data. Both methods improve results, with the results of the automatic feedback method approaching those of the manual-modification method, especially at the high-recall area of the curve.</Paragraph>
<Paragraph position="3"> The HNC routing results shown in figure 2 represent the use of two different types of neural networks. The plot marked HNCrt1 is the baseline result, created by using adhoc methods similar to those used in run HNC2aV. The HNCrt2 results represent using neural network techniques to learn improved stem weights for the context vectors based on the training data. The HNCrt3 results come from using the training data to determine what type of routing query to use, i.e., an automated adhoc query (similar to HNC2aM), a manual query, or a query using the neural network techniques (HNCrt2). Clearly the neural network learning techniques significantly improve performance, with the per-topic &quot;customization&quot; approach (HNCrt3) working best.</Paragraph>
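<Paragraph> The per-topic selection behind HNCrt3 can be pictured as follows. This is a minimal sketch under stated assumptions: the names are hypothetical, and the effectiveness measure used to pick the winning strategy is not specified in the text, so it is left as a caller-supplied function.

def select_per_topic(topics, strategies, score_on_training):
    """For each routing topic, keep whichever query-construction strategy
    scores best on the training data.  strategies maps a strategy name to a
    function topic -> query; score_on_training(topic, query) returns an
    effectiveness score measured on the training documents."""
    # The scoring function is an assumption; the text does not specify
    # HNC's actual selection criterion.
    chosen = {}
    for topic in topics:
        name, build = max(strategies.items(),
                          key=lambda item: score_on_training(topic, item[1](topic)))
        chosen[topic] = (name, build(topic))
    return chosen
</Paragraph>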
<Paragraph position="4"> In terms of system comparison, the University of Massachusetts runs were consistently better than the HNC runs for the adhoc topics, whereas for the routing topics both groups were similar. For both groups, the results were a major improvement over their baseline (12-month) results.</Paragraph>
<Paragraph position="5"> At the 18-month evaluation period, Syracuse University had the first stage of their system in operation and turned in results for the first time. The results for adhoc and routing are shown in figures 3 and 4. Since the results are for only a subset of the data used by the other contractors, they cannot be directly compared. Additionally, since the results are only for the first stage of retrieval, which emphasizes high recall, they should not be viewed as the final results from the system.</Paragraph>
<Paragraph position="6"> Figure 3 shows four Syracuse runs on the adhoc topics.</Paragraph>
<Paragraph position="7"> The documents used are the subset of the collection containing only the Wall Street Journal. The first three plots, DRsfcl, DRpnal, and DRtsal, represent three operations in the DR-LINK system. The first operation does a rough filtering operation on the data, only retrieving documents with suitable subject field codes. The next two operations locate proper nouns and look at document structure.</Paragraph>
<Paragraph position="8"> There is a considerable improvement in performance between the first two operations. The fourth run (DRfull) used a manual version of the second stage to produce final results. These results are for only half the topics, so they cannot be strictly compared to the first three runs, but they do indicate the potential improvements to precision that can be expected from the second stage.</Paragraph>
<Paragraph position="9"> Figure 4 shows the same operations, and generally the same improvements, for the routing topics. In this case the subset of documents used was the AP newswire documents. The same three operations discussed above are plotted here. There was no second-stage trial for the routing topics. These two graphs represent the baseline of the Syracuse system.</Paragraph> </Section>
<Section position="5" start_page="33" end_page="41" type="metho"> <SectionTitle> 4. 24-MONTH EVALUATION </SectionTitle>
<Paragraph position="0"> For the 24-month evaluation, all groups turned in many runs. The runs were much more elaborate, with many different types of parameters being tried. The University of Massachusetts tried 7 experiments with the adhoc topics, using complex methods of combining the topic fields, proximity restrictions, noun phrases, and paragraph operators. Additionally, an automatically-built thesaurus was tried. They also did 15 runs with the routing topics, trying various experiments combining relevance feedback, query combinations, proximity operators, and special phrase additions. HNC Inc. did 4 adhoc runs using various types of learned context vectors. Additionally they tried a simulated feedback query construction run. For routing they did 5 runs, trying multiple experiments in different combinations of adhoc and neural net approaches. Syracuse University turned in 10 runs for their &quot;upstream processing module&quot; (3 adhoc and 7 routing), trying various types of ranking formulas. Additionally they did 13 runs using the full retrieval system (4 adhoc and 9 routing). Full descriptions of these runs are given in the system overviews.</Paragraph>
<Paragraph position="1"> Figures 5 through 12 show the results from the 24-month evaluation. Figures 5 and 6 show some of the adhoc results for the full collection, and figures 7 and 8 show some of the routing results. The results from Syracuse University on a smaller subset of the document collection are shown in figures 9 through 12.</Paragraph>
<Paragraph position="2"> Figure 5 shows the recall/precision curves for the adhoc topics. The three INQRY runs include their baseline method (INQ009), which is the same as the baseline method INQRYJ developed at the 18-month evaluation period.</Paragraph>
<Paragraph position="3"> The first modification (INQ012) uses the inference net to &quot;combine&quot; weights from the documents and weights from the best-matching paragraphs in the document. The second modification (INQ015) shows the new term expansion method using an automatically-built thesaurus. Both modifications show some improvements over the baseline method.</Paragraph>
<Paragraph position="4"> The three HNC runs shown in figure 5 include a baseline (HNCad1) that is similar to their best 18-month adhoc approach (HNC2aM), but that uses a required match of 4 terms rather than 3. The HNCad3 results show the effects of using a larger context vector of 512 terms rather than only 280 terms for the baseline results. This causes a slight improvement. The HNCad2 results use some manual relevance feedback.</Paragraph>
<Paragraph position="5"> The University of Massachusetts results are better than the HNC results, but there were improvements in both systems over the 18-month evaluation. Figure 6 shows the recall/fallout curves for the best runs of these two systems. Both plots show the same differences in performance, but it can be seen on the recall/fallout curve that both systems are retrieving with very high accuracy. At a recall of about 60 percent (i.e., about 60 percent of the relevant documents have been retrieved), the precision of the INQRY results is about 30 percent. The fallout, however, is about 0.0004, meaning that most non-relevant documents are being properly screened out. This corresponds to a false alarm probability of 0.0004 at this point, in ROC terminology.</Paragraph>
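<Paragraph> For reference, these measures can be written in terms of the retrieved set. With r the number of relevant documents retrieved, n the total number of documents retrieved, R the total number of relevant documents for the topic, and N the total number of non-relevant documents in the collection, the standard definitions are
\[
\mathrm{recall} = \frac{r}{R}, \qquad \mathrm{precision} = \frac{r}{n}, \qquad \mathrm{fallout} = \frac{n - r}{N}.
\]
Fallout is exactly the probability of false alarm of ROC terminology, so a fallout of 0.0004 means that only about 4 of every 10,000 non-relevant documents in the collection have been retrieved at this recall level.</Paragraph>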
<Paragraph position="6"> Figure 7 shows the recall/precision curves for the routing topics. The run marked INQ026 is the baseline run of the INQRY system and uses the same methodology as the adhoc INQ009 run. The other two runs add some type of relevance feedback using the training documents. The plot marked INQ023 uses both relevance feedback and proximity operators to add 30 terms and 30 pairs of terms from the relevant documents to the query. The most complex run, INQ030, constructed the queries similarly to run INQ023, but additionally weighted the documents using a combination method similar to adhoc run INQ012. These runs represented the best results from many different experiments, and the relevance feedback gives a significant improvement over the baseline run.</Paragraph>
<Paragraph position="7"> The HNC routing results also represent the best of many experiments. The results for HNCrt5 show the neural network learning of stem weights, similar to HNCrt2 at the 18-month evaluation. The second two sets of results represent data fusion techniques, with HNCrt1 being a fusion of four types of retrievals, using the same combinations for all topics, and HNCrt2 using different combinations for different topics. The data fusion combinations both work well, but the per-topic combination works best, just as the less sophisticated version of this run worked best at the 18-month evaluation.</Paragraph>
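<Paragraph> The exact combination scheme used in the HNC fusion runs is not described here; the sketch below shows one generic score-level fusion (a CombSUM-style sum of normalized scores) purely to illustrate the idea of merging several retrieval runs into a single ranking. The min-max normalization and the equal weighting of runs are assumptions.

def fuse_runs(runs):
    """Combine several retrieval runs by summing min-max normalized scores;
    runs is a list of dicts mapping doc_id -> retrieval score."""
    # CombSUM-style normalization chosen for illustration only; not HNC's
    # documented fusion method.
    fused = {}
    for scores in runs:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0   # guard against a run with identical scores
        for doc, s in scores.items():
            fused[doc] = fused.get(doc, 0.0) + (s - lo) / span
    return sorted(fused, key=fused.get, reverse=True)
</Paragraph>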
<Paragraph position="8"> Again the University of Massachusetts results were better than the HNC results, but with major improvements in both systems over the 18-month evaluation. Figure 8 shows the recall/fallout curves for the best runs of both groups.</Paragraph>
<Paragraph position="9"> The Syracuse runs were on a subset of the full collection, so they are not directly comparable. However, they also showed significant improvements over their 18-month baseline. Figure 9 shows three first-stage Syracuse runs, the results of trying different complex methods of combining the information (subject field code, text structure, and other information) that is detected in the first-stage modules. The results of this combination are passed to the second stage (figure 10). Note that due to processing errors there were documents lost between stages, and these official results are therefore inaccurate. Additionally, only 19 topics (out of 50) are shown in figure 10. The improvements that could have been expected are not apparent because of these problems.</Paragraph>
<Paragraph position="10"> Figures 11 and 12 show the Syracuse routing runs. The first-stage runs show not only the combinations from the adhoc runs, but also additional ways of integrating the data.</Paragraph>
<Paragraph position="11"> Again there were processing errors with the second-stage results, and therefore no improvement is shown using the second stage.</Paragraph> </Section>
<Section position="6" start_page="41" end_page="41" type="metho"> <SectionTitle> 5. COMPARISON WITH TREC RESULTS </SectionTitle>
<Paragraph position="0"> How do the TIPSTER results compare with the TREC-2 results? Two of the TIPSTER contractors submitted results for TREC-2, and these can be seen in Figures 13 and 14. These figures show the best TREC-2 adhoc and routing results for the full collection. More information about the various TREC-2 runs can be found in the TREC-2 proceedings \[1\]. The results marked &quot;INQ001&quot; are the TIPSTER INQUERY system, using methods similar to their baseline TIPSTER INQ009 run. The &quot;dortQ2&quot;, &quot;Brkly3&quot; and &quot;crnlL2&quot; runs are all based on the Cornell SMART system, but with important variations. The &quot;crnlL2&quot; run is the basic SMART system, but using less than optimal term weightings (by mistake). The &quot;dortQ2&quot; results come from using the training data to find parameter weights for various query factors, whereas the &quot;Brkly3&quot; results come from performing statistical regression analysis to learn term weighting. The &quot;CLARTA&quot; system adds noun phrases found in an automatically-constructed thesaurus to improve the query terms taken from the topic. The plot marked &quot;HNCad1&quot; is the baseline adhoc run for the TIPSTER 24-month evaluation. The TIPSTER INQUERY system is one of the best-performing systems for the TREC-2 adhoc topics.</Paragraph>
<Paragraph position="1"> The routing results from TREC-2 (shown in figure 14) exhibit more differences between the systems. Again three systems are based on the Cornell SMART system. The plot marked &quot;crnlC1&quot; is the actual SMART system, using the basic Rocchio relevance feedback algorithms and adding many terms (up to 500) from the relevant training documents to the terms in the topic. The &quot;dortP1&quot; results come from using a probabilistically-based relevance feedback instead of the vector-space algorithm. These two systems have the best routing results. The &quot;Brkly5&quot; system uses statistical regression on the relevant training documents to learn new term weights. The &quot;cityr2&quot; results are based on a traditional probabilistic reweighting from the relevant documents, adding only a small number of new terms (10-25) to the topic. The &quot;INQ003&quot; results also use probabilistic reweighting and add 30 new terms to the topics. The &quot;hnc2c&quot; results are similar to the HNCrt1 fusion results for the 24-month TIPSTER evaluation.</Paragraph>
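<Paragraph> The Rocchio relevance feedback mentioned above re-forms the query vector from the original query and the judged training documents. In its standard form (the weights and the number of expansion terms retained are system-specific choices and are not given here),
\[
\vec{q}\,' = \alpha\,\vec{q} + \frac{\beta}{|D_{rel}|} \sum_{\vec{d} \in D_{rel}} \vec{d} - \frac{\gamma}{|D_{nonrel}|} \sum_{\vec{d} \in D_{nonrel}} \vec{d},
\]
where D_{rel} and D_{nonrel} are the sets of relevant and non-relevant training documents. The probabilistic reweighting used by the other routing runs plays a similar role, but derives term weights from relevance statistics for individual terms rather than from vector averaging.</Paragraph>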
<Paragraph position="2"> These plots mask important information, as they are averages over the 50 adhoc or routing topics. Although the averages often show little difference between systems, the systems perform quite differently when viewed on a topic-by-topic basis. Table 1 shows the &quot;top 8&quot; TREC-2 systems for each adhoc topic. The various system tags illustrate that a wide variety of systems do well on these topics, and that often a system that does not do well on average may perform best for a given topic. This is an inherent performance characteristic of information retrieval systems, and it emphasizes the importance of getting beyond the averages in doing evaluation. Clearly systems that perform well on average reflect better overall methodologies, but often much can be learned by analyzing why a given system performs well or poorly on a given topic.</Paragraph>
<Paragraph position="3"> This is where more work is needed with respect to analyzing the TIPSTER and TREC results.</Paragraph>
<Paragraph position="4"> Tables 2 and 3 show some preliminary analysis of two of the topics with respect to the TIPSTER contractors. Table 2 gives the ranks of the relevant documents retrieved either by the HNCrt1 run or the INQ023 run. Clearly the HNC run is better for this topic, providing much higher ranks for most of the relevant documents. Note that five of the relevant documents were not retrieved by either system.</Paragraph>
<Paragraph position="5"> Table 3 shows a slightly different view of the same phenomenon, but for topic 121. There were a total of 55 relevant documents for this topic, with only 13 of them found by the TIPSTER systems. Table 3 lists those 13 documents, the rank at which they were retrieved, and the &quot;tag&quot; of the system retrieving them. Note that for this topic the INQUERY system is performing better than the HNC system. These tables illustrate the varying performance of different methods across the topics. A major challenge facing each group is to determine which strategies are successful for most topics, and which strategies are successful only for some topics (including how to identify this topic subset in advance).</Paragraph> </Section> </Paper>