<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4013">
<Title>Web Search Intent Induction via Automatic Query Reformulation</Title>
<Section position="7" start_page="0" end_page="0" type="evalu">
<SectionTitle> 6 Results and Analysis </SectionTitle>
<Paragraph position="0"> Our results are calculated on two metrics: relevance and predictivity, as described in the previous section.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 6.1 Relevance Results </SectionTitle>
<Paragraph position="0"> The results of the evaluation are summarized in Table 2.</Paragraph>
<Paragraph position="1"> The table reports four statistics for each of the systems compared. In the table, MSN is vanilla MSN search and QDSE is the system described in this paper.</Paragraph>
<Paragraph position="2"> The first row is the probability of success using the system (the number of successful searches divided by the total number of searches). The second row is the probability of success given that the user is allowed to read only the first 20 results. The next statistic, Avg. Success Cost, is the average cost of the relevant URL for that system; this average is taken only over the successes (queries for which a relevant URL was found). The next statistic, Avg. Cost, is the average cost including failures, where the cost of a failure is, in the case of vanilla MSN, the number of returned results and, in the case of QDSE, the cost of reading the top five results, all the labels, and one category expansion. (It has been observed (Dumais et al., 2001) that presenting users with structured results enables them to find relevant documents more quickly; doing timed studies on the linearization would be an unrealistic scenario, since one would never deploy the system in this configuration.) The last statistic, Avg. Mutual Cost, is the average cost over only those queries for which both systems found a relevant document. The last line reports inter-annotator agreement as calculated over the 12 pairs, which is low due partly to the small sample size and partly to the fact that the intents themselves were still somewhat underspecified.</Paragraph>
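To make the cost statistics concrete, the following is a minimal sketch, in Python, of how the Table 2 relevance statistics could be computed from per-query judgments. The record layout, function name, and failure-cost values are illustrative assumptions, not artifacts of the paper.

    # Sketch (not from the paper): each system maps to a list of per-query
    # reading costs for the first relevant URL, with None marking a failed
    # search. failure_costs gives the cost charged to a failure (for vanilla
    # MSN, the number of returned results; for QDSE, the cost of reading the
    # top five results, all labels, and one category expansion).
    def relevance_stats(costs_by_system, failure_costs):
        """costs_by_system: {system: [cost or None, ...]}, lists aligned by query."""
        stats = {}
        for system, costs in costs_by_system.items():
            n = len(costs)
            successes = [c for c in costs if c is not None]
            stats[system] = {
                "P(success)": len(successes) / n,
                "P(success, top 20 only)": sum(1 for c in successes if c <= 20) / n,
                "Avg. Success Cost": sum(successes) / len(successes),
                "Avg. Cost": sum(failure_costs[system] if c is None else c
                                 for c in costs) / n,
            }
        # Avg. Mutual Cost averages only over queries on which every system
        # found a relevant URL.
        systems = list(costs_by_system)
        mutual = [q for q in range(len(costs_by_system[systems[0]]))
                  if all(costs_by_system[s][q] is not None for s in systems)]
        for s in systems:
            stats[s]["Avg. Mutual Cost"] = (
                sum(costs_by_system[s][q] for q in mutual) / len(mutual)
                if mutual else float("nan"))
        return stats

    # Toy usage with three queries (numbers are illustrative only):
    print(relevance_stats({"MSN": [3, None, 18], "QDSE": [2, 9, None]},
                          failure_costs={"MSN": 100, "QDSE": 13}))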
</Section>
Examples of queries, relevant URLs, and associated intents:
Query: Soldering iron; URL: www.siliconsolar.com/accessories.htm; Intent: looking for accessories for soldering irons (but not soldering irons themselves)
Query: Whole Foods; URL: www.wholefoodsmarket.com/company/communitygiving.html; Intent: looking for the Whole Foods Market's community giving policy
Query: final fantasy; URL: www.playonline.com/ff11/home/; Intent: looking for a web forum for final fantasy games
Query: online computer course; URL: www.microsoft.com/traincert/; Intent: looking for information on Microsoft Certified Technical Education centers
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 6.2 Predictivity Results </SectionTitle>
<Paragraph position="0"> We performed two calculations on the results of the predictivity annotations. In the first calculation, we consider the relevance judgments on the QDSE system to be the gold standard, and we calculate the accuracy of choosing the correct first category; this measures the extent to which the oracle system is correct. On this task, accuracy was 0.54. The second calculation was to determine whether a user can predict, looking at the category headers only, whether their search has been successful. On the task of simply identifying failed searches, accuracy was 0.70.</Paragraph>
<Paragraph position="1"> Inter-annotator agreement for predictivity was somewhat low, with a kappa value of only 0.49.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 6.3 Analysis </SectionTitle>
<Paragraph position="0"> As can be seen from Table 2, a user is less likely to find a relevant document in the top 100 results using the QDSE system than using the MSN system. However, this is an artificial task: very few users will actually read through the top 100 returned documents before giving up. At a cutoff of 20 documents, the user is still more likely to succeed using MSN, but the difference is not nearly so large (note, however, that by cutting off at 20 in the QDSE linearization, the user will typically see only one result from each alternate query, thus relying heavily on the underlying search engine to do a good job). The rest of the numbers (not included for brevity) are consistent at 20.</Paragraph>
<Paragraph position="1"> Moreover, as seen in the evaluation of the predictivity results, users can decide, with 70% accuracy, whether their search has failed after reading only the category labels. This is in stark contrast to vanilla MSN search, where they cannot know whether their search has succeeded without reading all the results.</Paragraph>
<Paragraph position="2"> If one does not wish to give up on recall at all, we could simply list all the MSN search results immediately after the QDSE results. By doing this, we ensure that the probability of success is at least as high for the QDSE system as for MSN. We can upper-bound the additional cost this would incur for the QDSE system by 4.15, yielding an upper bound of 13.2, still superior to vanilla MSN.</Paragraph>
<Paragraph position="3"> If one is optimistic and willing to assume that a user will know, based only on the category labels, whether or not their search has succeeded, then the relevant comparison from Table 2 is between the Avg. Success Cost for QDSE and the Avg. Cost for MSN. In this case, our cost of 4.7 is a factor of five better than the MSN cost. If, on the other hand, one is pessimistic and believes that a user will not be able to tell from the category names whether or not their search has succeeded in the QDSE system, then the interesting comparison is between the Avg. Costs for MSN and QDSE. Both comparisons favor QDSE.</Paragraph>
<Paragraph position="4"> Lastly, the reciprocal rank statistic at 20 results confirms that the QDSE system is better able to direct the user to relevant documents than vanilla MSN search.</Paragraph>
</Section>
</Section>
</Paper>