<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1026">
  <Title>The Effectiveness of Dictionary and Web-Based Answer Reranking</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 TREC-10 Q&amp;A Track
</SectionTitle>
    <Paragraph position="0"> The main task of the TREC-10 (Voorhees and Harman 2002) QA track required participants to return a ranked list of five answers of no more than 50 bytes long per question that were supported by the TREC-10 QA text collection.</Paragraph>
    <Paragraph position="1"> The TREC-10 QA document collection consists of newspaper and newswire articles on TREC disks 1 to 5. It contains about 3 GB of texts. Test questions were drawn from filtered MSNSearch and AskJeeves logs. NIST assessors then sifted 500 questions from the filtered logs as test set.</Paragraph>
    <Paragraph position="2"> The questions were closed-class fact-based (&amp;quot;factoid&amp;quot;) questions such as &amp;quot;How far is it from Denver to Aspen?&amp;quot; and &amp;quot;What is an atom?&amp;quot;. Mean reciprocal rank (MRR) was used as the indicator of system performance. Each question receives a score as the reciprocal of the rank of the first correct answer in the 5 submitted responses. No score is given if none of the 5 responses contain a correct answer. MRR is then computed for a system by taking the mean of the reciprocal ranks of all questions.</Paragraph>
    <Paragraph position="3"> Besides MRR score, we are also interested in learning how well a system places a correct answer within the five responses regardless of its rank. We called this percent of correctness in the top 5 (PCT5). PCT5 is a precision related metric and indicates the upper bound that a system can achieve if it always places the correct answer as its first response.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Webclopedia: An Automated
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Question Answering System
</SectionTitle>
      <Paragraph position="0"> Webclopedia's architecture follows the principle outlined in Section 1. We briefly describe each stage in the following. Please refer to (Hovy et al.</Paragraph>
      <Paragraph position="1">  2002) for more detail.</Paragraph>
      <Paragraph position="2"> (1) Question Analysis: We used an in-house parser, CONTEX (Hermjakob 2001), to parse and analyze questions and relied on BBN's IdentiFinder (Bikel et al., 1999) to provide basic named entity extraction capability.</Paragraph>
      <Paragraph position="3"> (2) Document Retrieval/Sentence Ranking:  The IR engine MG (Witten et al. 1994) was used to return at least 500 documents using Boolean queries generated from the query formation stage. However, fewer than 500 documents may be returned when very specific queries are given. To decrease the amount of text to be processed, the documents were broken into sentences. Each sentence was scored using a formula that rewards word and phrase overlap with the question and expanded query words. The ranked sentences were then filtered by expected answer types (ex: dates, metrics, and countries) and fed to the  answer extraction module.</Paragraph>
      <Paragraph position="4"> (3) Candidate Answer Extraction: We again used CONTEX to parse each of the top N  sentences, marked candidate answers by named entities and special answer patterns such as definition patterns, and then started the ranking process.</Paragraph>
      <Paragraph position="5"> (4) Answer Ranking: For each candidate answer several steps of matching were performed. The matching process considered question keyword overlaps, expected answer types, answer patterns, semantic type, and the correspondence .</Paragraph>
      <Paragraph position="6"> of question and answer parse trees. Scores were given according to the goodness of the matching. The candidate answers' scores were compared and ranked.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
(5) Answer Reranking, Duplication Removal,
</SectionTitle>
    <Paragraph position="0"> and Answer output: For some special question type such as definition questions (e.g., &amp;quot;What is cryogenics?&amp;quot;), we used WordNet glosses or web search results to rerank the answers. Duplicate answers were removed and only one instance was kept to increase coverage. The best 5 answers were output. Answer reranking is the main topic of this paper. Section 4 presents these methods in detail.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="5" type="metho">
    <SectionTitle>
4 Dictionary and Web-Based
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
Answer Reranking
4.1 Definition Questions
</SectionTitle>
      <Paragraph position="0"> Compared to other question types, definition questions are special. They are typically very short and in the form of &amp;quot;What is|are (a|an) X?&amp;quot;, where X is a 1 to 3 words term  , for example: &amp;quot;What is autism?&amp;quot;, &amp;quot;What is spider veins?&amp;quot; and &amp;quot;What is bangers and mash?&amp;quot;. As we learned from past TREC experience, it was more difficult to find relevant documents for short queries. As stated earlier, over 20% of questions in TREC-10 were of definition type, which was a reflection of real user queries mined from the web search engine logs (Voorhees 2001). Several top performing systems in the evaluation treated this type of question as a special category and most of them used definition answer patterns. The best performing system, InsightSoft-M, (Soubbotin and Soubbotin 2001) used a set of six definition patterns including P1:{&lt;Q; is/are; [a/an/the]; A&gt;,</Paragraph>
      <Paragraph position="2"> [a/an/the]; Q; [comma/period]&gt;}, where Q is the term to be defined and A is the candidate answer.</Paragraph>
      <Paragraph position="3"> The InsightSoft-M system returned 88 correct responses based on these patterns. The runner up system (Harabagiu et al. 2001) used 12 answer patterns with extension of WordNet hypernyms.</Paragraph>
      <Paragraph position="4"> They did not report their success rate for TREC-10 but according to Pasca (2001)  , this set  Among the 102 TREC-10 definition questions, 81 asked the definition of one word; 19, two words; 2, three words.</Paragraph>
      <Paragraph position="5">  Among them 31 were extracted through pattern of patterns with WordNet extension extracted 59 out of 67 definition questions in TREC-8 and TREC-9.</Paragraph>
      <Paragraph position="6"> The success stories of these systems indicated that carefully crafted answer patterns were effective in candidate answer extraction. However, just applying answer patterns blindly might lead to disastrous results, as shown by Hermjakob (2002), since correct and incorrect answers were equally likely to match these patterns. For example, for the question &amp;quot;What is autism?&amp;quot;, the following answers are found in the TREC-10 corpus using the patterns described by  Obviously, patterns alone cannot distinguish which one is the best answer. Some other mechanisms are necessary. We propose two different methods to solve this problem. One is a dictionary-based method using WordNet glosses and the other is to go directly to the web and compile web glosses on the fly to help select the best answers. The effect of combining both methods was also studied. We describe these two methods in the following sections.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
4.2 Dictionary-Based Reranking
</SectionTitle>
      <Paragraph position="0"> Using a dictionary to look up the definition of a term is the most straightforward solution for answering definition questions. For example, the definition of autism in the WordNet is: &amp;quot;an abnormal absorption with the self; marked by communication disorders and short attention span and inability to treat others as people&amp;quot;.</Paragraph>
      <Paragraph position="1"> However, we need to find a candidate answer string from the TREC-10 corpus that is equivalent to this definition. By inspection, we find that candidate answers , , and shown in the previous section are more compatible to the definition and seems to be the best one.</Paragraph>
      <Paragraph position="2"> To automate the decision process, we construct a definition database based on the WordNet noun matching and 27 were from WordNet hypernym expansion.</Paragraph>
      <Paragraph position="3"> .</Paragraph>
      <Paragraph position="4"> glosses. Closed class words are thrown away and each word w i in the glosses is assigned a gloss</Paragraph>
      <Paragraph position="6"> occurring in the WordNet noun glosses and N is total number of occurrences of all noun gloss words in the WordNet. The goodness of the matching M wn for each candidate answer is simply the sum of the weight of the matched word stems between its WordNet definition and itself. For example, candidate answer and autism's WordNet definition have these matches:  for each candidate answer is its original score multiplied by M wn . The final ranking is then sorted according to S</Paragraph>
      <Paragraph position="8"> answers are removed, and the top 5 answers are output. Table 1 shows the top 5 answers returned before and after applying dictionary-based reranking. It demonstrates that dictionary-based reranking not only pushes the best answer to the first place but also boosts other lower ranked good answers i.e. &amp;quot;a mental disorder&amp;quot; to the second place.</Paragraph>
      <Paragraph position="9"> Harabagiu et al. (2001) also used WordNet to assist in answering definition questions.</Paragraph>
      <Paragraph position="10"> However, they took the hypernyms of the term to be defined as the default answers while we used its glosses. The hypernym of &amp;quot;autism&amp;quot; is &amp;quot;syndrome&amp;quot;. In this case it would not boost the desired answer to the top but it would instead &amp;quot;validate&amp;quot; &amp;quot;Down's syndrome&amp;quot; as a good answer. Further research is needed to investigate the tradeoff between using hypernyms and glosses.</Paragraph>
      <Paragraph position="11"> WordNet glosses were incorporated in IBM's statistical question answering system as definition features (Ittycheriah et al. 2001).</Paragraph>
      <Paragraph position="12">  This is essentially inverse document (WordNet gloss entry) frequency (IDF) used in the information retrieval research.</Paragraph>
      <Paragraph position="13"> However, they did not report the effectiveness of the features in definition answer extraction.</Paragraph>
      <Paragraph position="14"> Out of vocabulary words is the major problem of dictionary-based reranking. For example, no WordNet entry is found for &amp;quot;e-coli&amp;quot; but searching the term &amp;quot;e-coli&amp;quot; at www.altavista.com and www.google.com yield the following: * E. coli is a food borne illness. Learn about prevention, symptoms and risks, detection, ... Risks Detection Recent Outbreaks Resources The term E. coli is an abbreviation for the bacteria</Paragraph>
      <Paragraph position="16"> * The E. coli Index (part of the WWW Virtual Library) - Description: Guide to information relating to the model organism Escherichia coli. From the WWW Virtual Library. (1</Paragraph>
      <Paragraph position="18"> This brings us to the web-based reranking method that we introduce in the next section.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="5" type="sub_section">
      <SectionTitle>
4.3 Web-Based Reranking
</SectionTitle>
      <Paragraph position="0"> The World Wide Web contains massive amounts of information covering almost any thinkable topic. The TREC-10 questions are typical instances of queries for which users tend to believe answers can be found from the web.</Paragraph>
      <Paragraph position="1"> However, the candidate answers extracted from the web have to find support in the TREC-10 corpus in order to be judged as correct otherwise they will be marked as unsupported.</Paragraph>
      <Paragraph position="2"> The search results of &amp;quot;e-coli&amp;quot; from two online search engines indicate that &amp;quot;e-coli&amp;quot; is an abbreviation for the bacteria Escherichia.</Paragraph>
      <Paragraph position="3"> However, to automatically identify &amp;quot;e-coli&amp;quot; as &amp;quot;Escherichia&amp;quot; from these two pages is the same QA problem that we set off to resolve. The only advantage of using the web instead of just the TREC-10 corpus is the assumption that the web contains many more redundant candidate answers due to its huge size. Compared to  corpus contains only about 979,000 articles.</Paragraph>
      <Paragraph position="4"> For a given question, we first query the web, apply answer extraction algorithms over a set of top ranked web pages (usually in the lower hundreds), and then rank candidate answers according to their frequency in the set. This assumes the more a candidate answer occurs in the set the more likely it is the correct answer. Clarke et al. (2001) and Brill et al. (2001) both applied this principle and achieved good results. Instead of using Webclopedia to extract candidate answers from the web and then project back to the TREC-10 corpus, we treat the web as a huge dynamic dictionary. We compile web glosses on the fly for each definition question and apply the same reranking procedure used in the dictionary-based method. We detail the procedure in the following.</Paragraph>
      <Paragraph position="5">  (1) Query a search engine (e.g., Altavista) with the term (e.g., &amp;quot;e-coli&amp;quot;) to be defined. (2) Download the first R pages (e.g., R = 70).</Paragraph>
      <Paragraph position="6"> (3) Extract context word</Paragraph>
      <Paragraph position="8"> defined from each page. Closed class words are  ignored. These context words are used as candidate web glosses.</Paragraph>
      <Paragraph position="9"> (4) The gloss weight</Paragraph>
      <Paragraph position="11"> w in the set of context words extracted in (3), N is the total number of training questions, and n</Paragraph>
      <Paragraph position="13"> This was the number that Google (www.google.com) advertised at its front page as of January 31, 2002.</Paragraph>
      <Paragraph position="14">  This is essentially TFIDF (product of term frequency and inverse document frequency) used in the information retrieval research.</Paragraph>
      <Paragraph position="15"> for each candidate answer is simply the sum of the weights of the matched word stems between its web gloss definition and itself. Only words with gloss weight Ts</Paragraph>
      <Paragraph position="17"> . The value of T serves as a cut-off threshold to filter out low confidence words.</Paragraph>
      <Paragraph position="18"> (6) The reranking score S</Paragraph>
      <Paragraph position="20"> duplicate answers are removed, and the top 5 answers are output. Table 2 shows the top 5 answers returned before and after applying web-based reranking for the question &amp;quot;What is Wimbledon?&amp;quot;. Google was used as the search engine with T=5, W=10, and R=70.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>