<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1025">
  <Title>Examining the Role of Statistical and Linguistic Knowledge Sources in a General-Knowledge Question-Answering System</Title>
  <Section position="4" start_page="181" end_page="181" type="metho">
    <SectionTitle>
3 The Vector Space Model for
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="181" end_page="181" type="sub_section">
      <SectionTitle>
Document Retrieval
</SectionTitle>
      <Paragraph position="0"> It is clear that a successful QA system will need some way to find the documents that are most relevant to the user's question. In a baseline system, we assume that standard IR techniques can be used for this task. In contrast to MURAX, however, we hypothesize that the vector space retrieval model will suffice. In the vector space model, both the question and the documents are represented as vectors with one entry for every unique word that appears in the collection. Each entry is the term weight, a real number that indicates the presence or absence of the word in the text. The similarity between a question vector, Q = q1, q2, ..., qn, and a document vector, D = d1, d2, ..., dn, is traditionally computed using a cosine similarity measure:</Paragraph>
      <Paragraph position="2"> Using this measure, the IR system returns a ranked list of those documents most similar to the question.</Paragraph>
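      <Paragraph> As a minimal sketch of the measure just described, the cosine similarity of a question vector and a document vector is their dot product divided by the product of their lengths. The function names below and the use of plain Python lists for term-weight vectors are assumptions made for illustration; this is not Smart's implementation.

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_q * norm_d)


def rank_documents(question_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    scores = [cosine_similarity(question_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

Calling rank_documents with the question vector and a list of document vectors yields the ranked list that the IR component would hand to later stages.
      </Paragraph>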
      <Paragraph position="3"> The Baseline QA System: The Smart Vector Space Model. For the IR component of the baseline QA system, we use Smart (Salton, 1971), a sophisticated text-processing system based on the vector space model and employed as the retrieval engine for a number of the top-performing systems at recent Text REtrieval Conferences (e.g. Buckley et al., 1998a, 1998b). Given a question, Smart returns a ranked list of the documents most relevant to the question. For the baseline QA system and all subsequent variations, we use Smart with standard term-weighting strategies [1] and do not use automatic relevance feedback (Buckley, 1995). In addition, the baseline system applies no linguistic filters. To generate answers for a particular question, the system starts at the beginning of the top-ranked document returned by Smart for the question and constructs five 50-byte chunks consisting of document text with stopwords removed.</Paragraph>
      <Paragraph position="4"> [1] We use Lnu term weighting for documents and ltu term weighting for the question (Singhal et al., 1996).</Paragraph>
      <Paragraph position="5"> Evaluation. As noted above, we evaluate each variation of our QA system on 38 TREC8 development questions and 200 TREC8 test questions.</Paragraph>
      <Paragraph position="6"> The indexed collection is TREC disks 4 and 5 (without Congressional Records). Results for the baseline Smart IR QA system are shown in the first row of Table 1 [...] questions and 29 out of 200 test questions correct.</Paragraph>
      <Paragraph position="7"> We judge the system correct if any of the five guesses contains each word of one of the answers. The final column of results shows the mean answer rank across all questions correctly answered.</Paragraph>
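      <Paragraph> The correctness criterion and the mean answer rank can be sketched as follows. Whitespace tokenization, case folding, and the function names are assumptions of this sketch; the text does not say how words were matched.

```python
def first_correct_rank(guesses, answers):
    """1-based rank of the first guess that contains every word of some
    acceptable answer, or None if no guess qualifies."""
    for rank, guess in enumerate(guesses, start=1):
        guess_words = set(guess.lower().split())
        for answer in answers:
            if all(word in guess_words for word in answer.lower().split()):
                return rank
    return None


def mean_answer_rank(ranks):
    """Average rank over correctly answered questions only."""
    hits = [r for r in ranks if r is not None]
    return sum(hits) / len(hits) if hits else None
```

A question counts as correct when first_correct_rank returns a rank for its five guesses; questions that return None are left out of the mean.
      </Paragraph>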
      <Paragraph position="8"> Smart is actually performing much better than its scores would suggest. For 18 of the 38 development questions, the answer appears in the top-ranked document; for 33 questions, the answer appears in one of the top seven documents. For only two questions does Smart fail to retrieve a good document in the top 25 documents. For the test corpus, over half of the 200 questions are answered in the top-ranked document (110); over 75% of the questions (155) are answered in top five documents. Only 19 questions were not answered in the top 20 documents.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="181" end_page="182" type="metho">
    <SectionTitle>
4 Query-Dependent Text
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="181" end_page="182" type="sub_section">
      <SectionTitle>
Summarization for Question
Answering
</SectionTitle>
      <Paragraph position="0"> We next hypothesize that query-dependent text summarization algorithms will improve the performance of the QA system by focusing the system on the most relevant portions of the retrieved documents. The goal for query-dependent summarization algorithms is to provide a short summary of a document with respect to a specific query. Although a number of methods for query-dependent text summarization are beginning to be developed and evaluated in a variety of realistic settings (Mani et al., 1999), we again propose the use of vector space  methods from IR, which can be easily extended to the summarization task (Salton et al., 1994): 1. Given a question and a document, divide the document into chunks (e.g. sentences, paragraphs, 200-word passages).</Paragraph>
      <Paragraph position="1"> 2. Generate the vector representation for the question and for each document chunk.</Paragraph>
      <Paragraph position="2"> 3. Use the cosine similarity measure to determine the similarity of each chunk to the question.</Paragraph>
      <Paragraph position="3"> 4. Return as the query-dependent summary the  most similar chunks up to a predetermined summary length (e.g. 10% or 20% of the original document).</Paragraph>
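      <Paragraph> The four steps above can be sketched as follows. Several simplifications are assumed here and are not taken from the paper: chunks are sentences found by naive punctuation splitting, term weights are raw counts rather than Smart's weighting, and the summary length is measured in chunks rather than bytes.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lowercase word -> raw count (assumed weighting)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def summarize(question, document, fraction=0.2):
    """Return the chunks most similar to the question, best first."""
    # Step 1: divide the document into chunks (naive sentence split).
    chunks = [s.strip() for s in re.split(r"[.!?]\s+", document) if s.strip()]
    # Step 2: build vectors for the question and for each chunk.
    q_vec = term_vector(question)
    # Step 3: score each chunk against the question.
    scored = [(cosine(q_vec, term_vector(c)), c) for c in chunks]
    # Step 4: keep the best chunks up to the summary-length budget.
    budget = max(1, round(fraction * len(chunks)))
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:budget]]
```

With fraction set to 0.1 or 0.2 this mirrors the 10% and 20% summary lengths mentioned in step 4, under the chunk-count simplification noted above.
      </Paragraph>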
      <Paragraph position="4"> This approach to text summarization was shown to be quite successful in the recent SUMMAC evaluation of text summarization systems (Mani et al., 1999; Buckley et al., 1999). Our general assumption here is that IR approaches can be used to quickly and accurately find both relevant documents and relevant document portions. In related work, Chali et al. (1999) also propose text summarization techniques as a primary component for their QA system. They employ a combination of vector-space methods and lexical chaining to derive their sentence-based summaries. We hypothesize that deeper analysis of the summary extracts is better accomplished by methods from NLP that can determine syntactic and semantic relationships between relevant constituents. There is a risk in using query-dependent summaries to focus the search for answer hypotheses, however: if the summarization algorithm is inaccurate, the desired answers will occur outside of the summaries and will not be accessible to subsequent components of the QA system.</Paragraph>
      <Paragraph position="5"> The Query-Dependent Text Summarization QA System. In the next version of the QA system, we augment the baseline system to perform query-dependent text summarization for the top k retrieved documents. More specifically, the IR subsystem returns the summary extracts (sentences or paragraphs) for the top k documents after sorting them according to their cosine similarity scores w.r.t. the question. As before, no linguistic filters are applied, and answers are generated by constructing 50-byte chunks from the ordered extracts after removing stopwords. In the experiments below, k = 7 for the development questions and k = 6 for the test questions [2]. Evaluation. Results for the Text Summarization QA system using sentence-based summaries are shown in the second row of Table 1. Here we see a relatively small improvement: the system now answers four development and 45 test questions correctly. The mean answer rank, however, improves noticeably from 3.33 to 2.25 for the development corpus and from 3.07 to 2.67 for the test corpus. Paragraph-based summaries yield similar but slightly smaller improvements; as a result, sentence summaries are used exclusively in subsequent sections. Unfortunately, the system's reliance on query-dependent text summarization actually limits its potential: in only 23 of the 38 development questions (61%), for example, does the correct answer appear in the summary for one of the top k = 7 documents.</Paragraph>
      <Paragraph position="6"> The QA system cannot hope to answer correctly any of the remaining 15 questions. For only 135 of the 200 questions in the test corpus (67.5%) does the correct answer appear in the summary for one of the top k = 6 documents [3]. It is possible that automatic relevance feedback or coreference resolution would improve performance. We are investigating these options in current work. [2] The value for k was chosen so that at least 80% of the questions in the set had answers appearing in the retrieved documents ranked 1-k. We have not experimented extensively with many values of k and expect that better performance can be obtained by tuning k for each text collection.</Paragraph>
      <Paragraph position="7"> The decision of whether or not to incorporate text summarization in the QA system depends, in part, on the ability of subsequent processing components (i.e. the linguistic filters) to locate answer hypotheses. If subsequent components are very good at discarding implausible answers, then summarization methods may limit system performance. Therefore, we investigate next the use of two linguistic filters in conjunction with the query-dependent text summarization methods evaluated here.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="182" end_page="183" type="metho">
    <SectionTitle>
5 Incorporating the Noun Phrase
Filter
</SectionTitle>
    <Paragraph position="0"> The restricted QA task that we investigate requires answers to be short -- no more than 50 bytes in length. This effectively eliminates how or why questions from consideration. Almost all of the remaining question types are likely to have noun phrases as answers. In the TREC8 development corpus, for example, 36 of 38 questions have noun phrase answers.</Paragraph>
    <Paragraph position="1"> As a result, we next investigate the use of a very simple linguistic filter that considers only noun phrases as answer hypotheses. The filter operates on the ordered list of summary extracts for a particular question and produces a list of answer hypotheses, one for each noun phrase (NP) in the extracts in the left-to-right order in which they appeared.</Paragraph>
    <Paragraph position="2"> The NP-based QA System. Our implementation of the NP-based QA system uses the Empire noun phrase finder, which is described in detail in Cardie and Pierce (1998). Empire identifies base NPs -- non-recursive noun phrases -- using a very simple algorithm that matches part-of-speech tag sequences based on a learned noun phrase grammar.</Paragraph>
    <Paragraph position="3"> The approach is able to achieve 94% precision and recall for base NPs derived from the Penn Treebank Wall Street Journal (Marcus et al., 1993). In the experiments below, the NP filter follows the application of the document retrieval and text summarization components. Pronoun answer hypotheses are discarded, and the NPs are assembled into 50-byte chunks.</Paragraph>
    <Paragraph position="4"> Evaluation. Results for the NP-based QA system are shown in the third row of Table 1. The noun phrase filter markedly improves system performance for the development corpus, nearly doubling the number of questions answered correctly. [3] Paragraph-based summaries provide better coverage on the test corpus than sentence-based summaries: for 151 questions, the correct answer appears in the summary for one of the top k documents. This suggests that paragraph summaries might be better suited for use with more sophisticated linguistic filters that are capable of discerning the answer in the larger summary.</Paragraph>
    <Paragraph position="5"> Answering Task. Results for 38 development and 200 test questions are shown. The mean answer rank (MAR) is computed w.r.t. all questions correctly answered.</Paragraph>
    <Paragraph position="6"> We found these results somewhat surprising since this linguistic filter is rather weak: we expected it to work well only in combination with the semantic filter described below. The noun phrase filter has much less of an effect on the test corpus, improving performance on questions answered from 45 to 50.</Paragraph>
    <Paragraph position="7"> In a separate experiment, we applied the NP filter to the baseline system that includes no text summarization component. Here the NP filter does not improve performance -- the system gets only two questions correct. This indicates that the NP filter depends critically on the text summarization component. As a result, we will continue to use query-dependent text summarization in the experiments below.</Paragraph>
    <Paragraph position="8"> The NP filter provides the first opportunity to look at single-phrase answers. The preceding QA systems produced answers that were rather unnaturally chunked into 50-byte strings. When such chunking is disabled, only one development and 20 test questions are answered. The difference in performance between the NP filter with chunking and the NP filter alone clearly indicates that the NP filter is extracting good guesses, but that subsequent linguistic processing is needed to promote the best guesses to the top of the ranked guess list.</Paragraph>
  </Section>
  <Section position="7" start_page="183" end_page="183" type="metho">
    <SectionTitle>
6 Incorporating Semantic Type
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="183" end_page="183" type="sub_section">
      <SectionTitle>
Information
</SectionTitle>
      <Paragraph position="0"> The NP filter does not explicitly consider the question in its search for noun phrase answers. It is clear, however, that a QA system must pay greater attention to the syntactic and semantic constraints specified in the question. For example, a question like Who was president of the US in 1995? indicates that the answer is likely to be a person. In addition, there should be supporting evidence from the answer document that the person was president, and, more specifically, held this office in the US and in 1995.</Paragraph>
      <Paragraph position="1"> We introduce here a second linguistic filter that considers the primary semantic constraint from the question. The filter begins by determining the question type, i.e. the semantic type requested in the question. It then takes the ordered set of summary extracts supplied by the IR subsystem, uses the syntactic filter from Section 5 to extract NPs, and generates an answer hypothesis for every noun phrase that is semantically compatible with the question type. Our implementation of this semantic class filter is described below. The filter currently makes no attempt to confirm other linguistic relations mentioned in the question.</Paragraph>
      <Paragraph position="2"> The Semantic Type Checking QA System.</Paragraph>
      <Paragraph position="3"> For most questions, the question word itself determines the semantic type of the answer. This is true for who, where, and when questions, for example, which request a person, place, and time expression as an answer. For many which and what questions, however, determining the question type requires additional syntactic analysis. For these, we currently extract the head noun in the question as the question type. For example, in Which country has the largest part of the Amazon rain forest? we identify country as the question type. Our heuristics for determining question type were based on the development corpus and were designed to be general, but have not yet been directly evaluated on a separate question corpus.</Paragraph>
      <Paragraph position="4"> Given the question type and an answer hypothesis, the Semantic Type Checking QA System then uses WordNet to check that an appropriate ancestor-descendant relationship holds. Given Brazil as an answer hypothesis for the above question, for example, WordNet's type hierarchy confirms that Brazil is a subtype of country, allowing the system to conclude that the semantic type of the answer hypothesis matches the question type.</Paragraph>
      <Paragraph position="5"> For words (mostly proper nouns) that do not appear in WordNet, heuristics are used to determine semantic type. There are heuristics to recognize 13 basic question types: Person, Location, Date, Month, Year, Time, Age, Weight, Area, Volume, Length, Amount, and Number. For Person questions, for example, the system relies primarily on a rule that checks for capitalization and abbreviations</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="183" end_page="184" type="metho">
    <SectionTitle>
</SectionTitle>
    <Paragraph position="0"> in order to identify phrases that correspond to people. There are approximately 20 such rules that together cover all 13 question types listed above. The rules effectively operate as a very simple named entity identifier.</Paragraph>
    <Paragraph position="1"> Evaluation. Results for the Semantic Type Checking variation of the QA system are shown in the fourth row of Table 1. Here we see a dramatic increase in performance: the system correctly answers three times as many development questions (21) as the previous variation. This is especially encouraging given that the IR and text summarization components limit the maximum number correct to 23. In addition, the mean answer rank improves from 2.29 to 1.38. A closer look at Table 1, however, indicates problems with the semantic type checking linguistic filter. While performance on the development corpus increases by 37 percentage points (from 18.4% correct to 55.3% correct), gains for the test corpus are much smaller: the improvement is only 18 percentage points, from 25.0% correct (50/200) to 43.0% correct (86/200). This is a clear indication that the heuristics used in the semantic type checking component, which were designed based on the development corpus, do not generalize well to different question sets. Replacing the current heuristics with a Named Entity identification component or learning the heuristics using standard inductive learning techniques should help with the scalability of this linguistic filter.</Paragraph>
    <Paragraph position="2"> Nevertheless, it is somewhat surprising that very weak syntactic information (the NP filter) and weak semantic class information (question type checking) can produce such improvements. In particular, it appears that it is reasonable to rely implicitly on the IR subsystems to enforce the other linguistic relationships specified in the query (e.g. that Clinton is president, that this office was held in the US and in 1995).</Paragraph>
    <Paragraph position="3"> Finally, when 50-byte chunking is disabled for the semantic type checking QA variation, there is a decrease in the number of questions correctly answered, to 19 and 57 for the development and test corpus, respectively.</Paragraph>
  </Section>
  <Section position="9" start_page="184" end_page="185" type="metho">
    <SectionTitle>
7 Syntactic Preferences for Ordering
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="184" end_page="185" type="sub_section">
      <SectionTitle>
Summary Extracts
</SectionTitle>
      <Paragraph position="0"> Syntactic and semantic linguistic knowledge has been used thus far as post-processing filters that locate and confirm answer hypotheses from the statistically specified summary extracts. We hypothesized that further improvements might be made by allowing this linguistic knowledge to influence the initial ordering of text chunks for the linguistic filters. In a final system, we begin to investigate this claim. Our general approach is to define a new scoring measure that operates on the summary extracts and can be used to reorder the extracts based on linguistic knowledge.</Paragraph>
      <Paragraph position="1"> The QA System with Linguistic Reordering of Summary Extracts. As described above, our final version of the QA system ranks summary extracts according to both their vector space similarity to the question as well as linguistic evidence that the answer lies within the extract. In particular, each summary extract E for question q is ranked according to a new score, Sq:</Paragraph>
      <Paragraph position="3"> The intuition behind the new score is to prefer summary extracts that exhibit the same linguistic relationships as the question (as indicated by LRq) and to give more weight (as indicated by w) to linguistic relationship matches in extracts from higher-ranked documents. More specifically, LRq(E) is the number of linguistic relationships from the question that appear in E. In the experiments below, LRq(E) is just the number of base NPs from the question that appear in the summary extract. In future work, we plan to include other pairwise linguistic relationships (e.g. subject-verb relationships, verb-object relationships, pp-attachment relationships).</Paragraph>
      <Paragraph position="4"> The weight w(E) is a number between 0 and 1 that is based on the retrieval rank r of the document that contains E: w(E) = max(m, 1 - p · r). In our experiments, m = 0.5 and p = 0.1. Both values were selected manually based on the development corpus; an extensive search for the best such values was not done.</Paragraph>
      <Paragraph position="5"> The summary extracts are sorted according to the new scoring measure and the ranked list of sentences is provided to the linguistic filters as before.</Paragraph>
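      <Paragraph> The reordering step can be sketched as follows. The weight w(E) and the NP-count form of LRq(E) follow the definitions given above. Three things are assumptions of this sketch: the way the two parts are combined into one score (the weighted NP count is simply added to the cosine similarity, because the defining equation is not reproduced in this text), the 1-based retrieval rank, and substring matching of noun phrases.

```python
def rank_weight(r, m=0.5, p=0.1):
    """w(E) = max(m, 1 - p * r), where r is the retrieval rank of the
    document containing the extract (assumed to start at 1)."""
    return max(m, 1.0 - p * r)


def count_shared_nps(question_nps, extract):
    """LRq(E) as used in the experiments: the number of base NPs from the
    question found in the extract (case-insensitive substring match,
    which is an assumption of this sketch)."""
    text = extract.lower()
    return sum(1 for np in question_nps if np.lower() in text)


def reorder(extracts, question_nps):
    """Sort (text, doc_rank, similarity) tuples by the combined score.
    ASSUMPTION: score = w(E) * LRq(E) + similarity."""
    def score(item):
        text, doc_rank, similarity = item
        return rank_weight(doc_rank) * count_shared_nps(question_nps, text) + similarity
    return sorted(extracts, key=score, reverse=True)
```

With m = 0.5 and p = 0.1, the weight falls from 0.9 for the top-ranked document to the floor of 0.5 at rank 5 and beyond, so NP matches in lower-ranked documents still count, but at half weight.
      </Paragraph>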
      <Paragraph position="6"> Evaluation. Results for this final variation of the QA system are shown in the bottom row of Table 1.</Paragraph>
      <Paragraph position="7"> Here we see a fairly minor increase in performance over the use of linguistic filters alone: the system answers only one more question correctly than the previous variation for the development corpus and answers five additional questions for the test corpus. The mean answer rank improves only negligibly. Sixteen of the 22 correct answers (73%) appear as the top-ranked guess for the development corpus; only 53 out of 91 correct answers (58%) appear as the top-ranked guess for the test corpus. Unfortunately, when 50-byte chunking is disabled, system performance drops precipitously, by 5% (to 20 out of 38) for the development corpus and by 13% (to 65 out of 200) for the test corpus. As noted above, this indicates that the filters are finding the answers, but more sophisticated linguistic sorting is needed to promote the best answers to the top. Through  its LRq term, the new scoring measure does provide a mechanism for allowing other linguistic relationships to influence the initial ordering of summary extracts. The current results, however, indicate that with only very weak syntactic information (i.e. base noun phrases), the new scoring measure is only marginally successful in reordering the summary extracts based on syntactic information.</Paragraph>
      <Paragraph position="8"> As noted above, the final system (with the liberal 50-byte answer chunker) correctly answers 22 out of 38 questions for the development corpus. Of the 16 errors, the text retrieval component is responsible for five (31.2%), the text summarization component for ten (62.5%), and the linguistic filters for one (6.3%). In this analysis we consider the linguistic filters responsible for an error if they were unable to promote an available answer hypothesis to one of the top five guesses. A slightly different situation arises for the test corpus: of the 109 errors, the text retrieval component is responsible for 39 (35.8%), the text summarization component for 26 (23.9%), and the linguistic filters for 44 (40.4%). As discussed in Section 6, the heuristics that comprise the semantic type checking filter do not scale to the test corpus and are the primary reason for the larger percentage of errors attributed to the linguistic filters for that corpus.</Paragraph>
    </Section>
  </Section>
</Paper>