<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1009">
  <Title>Alternative Phrases and Natural Language Information Retrieval</Title>
  <Section position="6" start_page="8" end_page="8" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="8" end_page="8" type="sub_section">
      <SectionTitle>
5.1 Frequency of Alternative Phrases
</SectionTitle>
      <Paragraph position="0"> First, it is useful to determine how many queries contain alternative phrases in order to judge how large a problem this really is. Unfortunately, this is complicated by the fact that users, in general, know that such constructions are not understood by search engines, so they avoid them. In fact, even in NLIR systems, users often use keywords even though doing so performs worse than asking natural language questions. They do this because they do not trust the system. Work will be necessary to improve users' awareness of NL capabilities through advertising and by implementing new user interfaces. Work will also be needed to take more account of the fact that search is often an iterative and even interactive process. As noted by Lewis and Sparck-Jones (1996), the results of large-scale document-retrieval competitions such as the Text Retrieval Conference (TREC, 2000) do not necessarily reflect the experience many users have with retrieval systems.</Paragraph>
      <Paragraph position="1"> In the meantime, I have attempted to find a baseline for this number by considering two corpora of human/human dialogs. The corpora are both from tutorial situations where a mentor helps a student through problems, in Physics for one corpus (VanLehn et al., 1998; VanLehn et al., in press) and in Basic Electricity and Electronics for the other (Rose et al., 1999). Tutoring dialogs are an ideal place to look for data relevant to NLIR because they consist entirely of one party attempting to elicit information from the other. In some cases, it is the tutor eliciting information from the student, and in others it is the other way around.</Paragraph>
      <Paragraph position="2"> Table 1 shows the frequencies of some alternative phrases. A more illuminating statistic is how often alternative phrases appear in a single dialog.</Paragraph>
      <Paragraph position="3"> I consider a dialog to be a single session between the student and tutor, where the discussion of each problem counts as a separate session. The table shows that in 269 total dialogs, each dialog contained, on average, 3.65 alternative phrases. If one considers only alternative phrases that occur in the context of a question, there are, on average, 1.54 alternative phrases per dialog. I consider an alternative phrase to be in a question context if it is in the same dialog turn as a question. Because tutors ask questions in order to lead the student to the answer, it is perhaps better to consider just the student data, where a quarter of the dialogs contained question contexts with alternative phrases.</Paragraph>
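The per-dialog statistics above can be sketched as follows. This is a hypothetical illustration, not the original analysis code: the corpus annotation (turns tagged with whether they contain an alternative phrase and whether they contain a question) is assumed.

```python
# Hypothetical sketch of the per-dialog counts described above.
# Each dialog is a list of turns; each turn is a (has_alt_phrase,
# has_question) pair. An alternative phrase is "in a question context"
# if it occurs in the same dialog turn as a question.

def dialog_stats(dialogs):
    total_alt = 0
    question_context_alt = 0
    for dialog in dialogs:
        for has_alt, has_question in dialog:
            if has_alt:
                total_alt += 1
                if has_question:
                    question_context_alt += 1
    n = len(dialogs)
    return total_alt / n, question_context_alt / n

# toy example: two dialogs with tagged turns
dialogs = [
    [(True, True), (True, False), (False, True)],
    [(False, False), (True, True)],
]
avg_all, avg_q = dialog_stats(dialogs)
print(avg_all, avg_q)  # 1.5 alternative phrases per dialog, 1.0 in question contexts
```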
      <Paragraph position="4"> This data is not meant to be considered a rigorous result, but it is a strong indication that any query-answering system is likely to confront an alternative phrase during the course of interacting with a user. Furthermore, the data shows that during the course of the interaction it will be appropriate for the system to respond using an alternative phrase, which is important when considering more responsive search engines than are available today.</Paragraph>
      <Paragraph position="5"> It is also interesting to note that a wide variety of alternative phrases occur in this data. (9) contains some examples. Because excision words, especially other, are by far the most frequent, I have only considered them in the evaluation presented here.</Paragraph>
      <Paragraph position="6">  (9) a. The battery is the green cylinder right? I don't see anything negative other than #5.</Paragraph>
      <Paragraph position="7"> b. And what do you have in addition to voltage?
c. Are there any other forces on that knob besides that one you've labeled W1?</Paragraph>
    </Section>
    <Section position="2" start_page="8" end_page="8" type="sub_section">
      <SectionTitle>
5.2 Potential Improvement
</SectionTitle>
      <Paragraph position="0"> Of interest is the potential for improving various search engines through a better way of treating alternative phrases. To start with, given a set of queries containing alternative phrases, it is important to see how well these systems perform without enhancement. Table 2 shows the performance of Alta Vista, Ask Jeeves, and the EK search engine on eight excision examples. In the table, x are the returned documents which contain an answer to the query, and z are the number of answers that are wrong because they are about the subject that was explicitly being excluded in the query.</Paragraph>
      <Paragraph position="1"> As the data shows, none of the search engines fare particularly well. The precision for all three is around 20%. That is, only about one in five of the responses contain an answer to the query. Furthermore, from a half to two thirds of the incorrect responses were specifically about the subject the query wanted to exclude, displaying little or no understanding of excision alternative phrases.</Paragraph>
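The two statistics quoted above can be made concrete with a small sketch. The counts here are hypothetical, chosen only to land in the ranges the paper reports; they are not the Table 2 figures.

```python
# Precision: the fraction of returned responses that contain an answer.
def precision(returned, good):
    return good / returned if returned else 0.0

# Share of *incorrect* responses that are specifically about the
# subject the query explicitly excluded.
def excluded_share(returned, good, about_excluded):
    incorrect = returned - good
    return about_excluded / incorrect if incorrect else 0.0

# hypothetical counts: 20 responses, 4 with an answer,
# 9 of the 16 misses about the excluded subject
print(precision(20, 4))          # 0.2, i.e. one in five
print(excluded_share(20, 4, 9))  # 0.5625, between a half and two thirds
```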
      <Paragraph position="2"> The point is not to draw any conclusions about the relative merits of the search engines from this test. Rather, it is that each NLIR system shows room for improvement. Since I will demonstrate that improvement, next, only for the Electric Knowledge search engine, this shows that the improvement is not due to exceptionally bad prior performance by the EK search engine.</Paragraph>
    </Section>
    <Section position="3" start_page="8" end_page="8" type="sub_section">
      <SectionTitle>
5.3 Enhancing the EK search engine
</SectionTitle>
      <Paragraph position="3"> Table 3 shows the results of asking the EK search engine questions in three different forms: without an alternative phrase, with an alternative phrase that has not been translated, and with the alternative phrase translated as described in Section 4.2. The first row of the table, for instance, refers to the questions in (10). The remaining sentences can be found in Bierner (2001). Although an implementation exists that is capable of performing the translation in Section 4.2, the translation was done by hand for this evaluation to abstract away from parsing issues.</Paragraph>
      <Paragraph position="4"> (10) a. What are some works by Edgar Allan Poe?
b. What are some works by Edgar Allan Poe other than the Raven?
c. What are some works by Edgar Allan Poe? :j: ANSWER NOT NEAR (j raven)
Unfortunately, at the time of this evaluation, Electric Knowledge had taken down their public portal in favor of providing search for the web pages of specific clients. This means that the large index used to process the queries in Table 2 was no longer available, and I therefore used different indices and different queries for this evaluation.</Paragraph>
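The translation illustrated in (10) can be sketched roughly as follows. This is a simplified, hypothetical stand-in: the real system derives the excision from a parse (Section 4.2), whereas here a pattern over a few excision words does the work, and the constraint string is only schematic.

```python
import re

# Hypothetical pattern for trailing excision phrases such as
# "other than X", "besides X", "not including X", "except (for) X".
EXCISION = re.compile(
    r"\s+(other than|besides|not including|except(?: for)?)\s+(.+?)([?.!]?)$",
    re.IGNORECASE,
)

def translate(query):
    """Split a query into a base question and a NOT NEAR constraint
    on the excluded "figure"; pass through queries with no excision."""
    m = EXCISION.search(query)
    if not m:
        return query, None
    figure = m.group(2)                       # e.g. "the Raven"
    base = query[:m.start()] + (m.group(3) or "")
    return base, "ANSWER NOT NEAR ({})".format(figure)

base, constraint = translate("What are some works by Edgar Allan Poe other than the Raven?")
print(base)        # What are some works by Edgar Allan Poe?
print(constraint)  # ANSWER NOT NEAR (the Raven)
```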
      <Paragraph position="5"> I used indices of pages about American history and literature, each about 11,000 pages. These new, more specialized indices have the benefit of abstracting this evaluation away from coverage issues (which explains the differences in precision between Table 2 and Table 3).</Paragraph>
      <Paragraph position="6"> I created the questions in two ways. For several, I began by asking a question without an alternative phrase. I then added an alternative phrase in order to remove some responses I was not interested in. For example, when I asked Who are the romantic poets?, all responses were about female romantic poets. I therefore used the query Who are the romantic poets not including women? in the evaluation.</Paragraph>
      <Paragraph position="7"> Some queries were made without first trying the non-alternative-phrase version: What are some epics besides Beowulf?, for example. This variation reflects the fact that it is unclear which is more common, excluding clutter from a set of responses or a priori excluding cases that the questioner is not interested in. The queries also vary in their syntactic structure, information requested, and alternative phrase used. The purpose of varying the queries in these ways is to ensure that the results do not simply reflect a quirk in the EK search engine's implementation.</Paragraph>
      <Paragraph position="8"> For each query, Table 3 shows total, the number of documents returned; good, the number of true positives; and top 5, the number of true positives in the top five returned documents. A true positive was given to a document if it contained an answer to the question, and half a point was given to a document that contained an obvious link to a document with an answer to the question. Precision is computed for all documents and for the top five. It is important to note that the scores for the queries without alternative phrases are still computed with respect to the alternative phrase.</Paragraph>
      <Paragraph position="9"> That is, documents only about the &amp;quot;Raven&amp;quot; are considered false positives. In this way, we can view these scores as a baseline: what would have happened had the system simply removed the alternative phrase. This should be taken with a grain of salt because, in many cases, I chose the query because there were documents to remove (as in the Romantics example). However, a concern was that the transformed query would cause numerous false negatives. This is not the case, as seen by the fact that the precision of the transformed query is not lower than the baseline. In fact, in no example was the precision less than the baseline, and at worst, the precision remained the same.</Paragraph>
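The scoring scheme above can be sketched as a small function. This is a hypothetical illustration of the scheme as described, with an invented toy result list: a full point for a document containing an answer, half a point for one with an obvious link to an answer, and precision reported over all returns and over the top five.

```python
# Hypothetical sketch of the document scoring scheme described above.
def score(judgements):
    """judgements: list of 'answer', 'link', or 'miss', in ranked order.
    Returns (overall precision, precision in the top 5)."""
    points = {"answer": 1.0, "link": 0.5, "miss": 0.0}
    good = sum(points[j] for j in judgements)
    top5 = sum(points[j] for j in judgements[:5])
    total = len(judgements)
    return good / total, top5 / min(total, 5)

# toy ranked result list: 6 returns, 2 answers, 1 linked answer
overall, at5 = score(["answer", "link", "miss", "miss", "answer", "miss"])
print(overall, at5)  # 2.5/6 overall, 2.5/5 = 0.5 in the top five
```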
      <Paragraph position="10"> Performance on questions containing alternative phrases was quite poor, with an average of 15% precision. This is significantly worse than the transformed query, and even the baseline. The performance drop is due to the fact that the complex syntax of the query confuses the EK search engine's analysis. The EK search engine is forced to reduce the query to a simple set of keywords including the figure, which we were trying to avoid.</Paragraph>
      <Paragraph position="11"> Thus, as predicted in the discussion of potential improvement, not accounting for alternative phrases can greatly increase the number of false positives containing the figure (FPF), the very thing the query is attempting to exclude. Table 4 shows that in the baseline case, where there is no alternative phrase, on average for 28% of the returned documents the only answer was the one we wanted to exclude. Adding the alternative phrase has the opposite of the intended effect, as the percentage of FPFs increases to 44% for the reasons described above. Transforming the query, on the other hand, causes the desired effect, more than halving the percentage of FPFs of the baseline.</Paragraph>
    </Section>
  </Section>
</Paper>