<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1050">
  <Title>A Probabilistic Answer Type Model</Title>
  <Section position="6" start_page="397" end_page="399" type="evalu">
    <SectionTitle>
5 Experimental Setup &amp; Results
</SectionTitle>
    <Paragraph position="0"> We evaluate our answer typing system by using it to filter the contents of documents retrieved by the information retrieval portion of a question answering system. Each answer candidate in the set of documents is scored by the answer typing system and the list is sorted in descending order of score. We treat the system as a filter and observe the proportion of candidates that must be accepted by the filter so that at least one correct answer is accepted. A model that allows a low percentage of candidates to pass while still allowing at least one correct answer through is favorable to a model in which a high number of candidates must pass.</Paragraph>
    <Paragraph position="1"> This represents an intrinsic rather than extrinsic evaluation (Moll'a and Hutchinson, 2003) that we believe illustrates the usefulness of our model.</Paragraph>
    <Paragraph position="2"> The evaluation data consist of 154 questions from the TREC-2003 QA Track (Voorhees, 2003) satisfyingthefollowingcriteria, alongwiththetop 10 documents returned for each question as identified by NIST using the PRISE1 search engine.</Paragraph>
    <Paragraph position="3"> * the question begins with What, Which, or Who. We restricted the evaluation such questions because our system is designed to deal with questions whose answer types are often semantically open-ended noun phrases.</Paragraph>
    <Paragraph position="4"> * There exists entry for the question in the answer patterns provided by Ken Litkowski2.</Paragraph>
    <Paragraph position="5"> * One of the top-10 documents returned by PRISE contains a correct answer.</Paragraph>
    <Paragraph position="6"> We compare the performance of our probabilistic model with that of two other systems. Both comparison systems make use of a small, predefined set of manually-assigned MUC7 named-entity types (location, person, organization, cardinal, percent, date, time, duration, measure, money) augmented with thing-name (proper  names of inanimate objects) and miscellaneous (a catch-all answer type of all other candidates). Some examples of thing-name are Guinness Book of World Records, Thriller, Mars Pathfinder, and GreyCup. Examplesofmiscellaneousanswersare copper, oil, red, and iris.</Paragraph>
    <Paragraph position="7"> The differences in the comparison systems is with respectto how entitytypes are assignedto the words in the candidate documents. We make use of the ANNIE (Maynard et al., 2002) named entity recognition system, along with a manual assigned &amp;quot;oracle&amp;quot; strategy, to assign types to candidate answers. In each case, the score for a candidate is either 1 if it is tagged as the same type as the question or 0 otherwise. With this scoring scheme producingasortedlistwecancomputetheprobability null of the first correct answer appearing at rankR = k as follows:</Paragraph>
    <Paragraph position="9"> that are of the appropriate type andcis the number of unique candidate answers that are correct.</Paragraph>
    <Paragraph position="10"> Using the probabilities in equation (15), we compute the expected rank, E(R), of the first correct answer of a given question in the system as:</Paragraph>
    <Paragraph position="12"> Answer candidates are the set of ANNIEidentified tokens with stop words and punctuation removed. This yields between 900 and 8000 candidates for each question, depending on the top 10 documents returned by PRISE. The oracle system represents an upper bound on using the predefined set of answer types. The ANNIE system represents a more realistic expectation of performance. The median percentage of candidates that are accepted by a filter over the questions of our evaluation data provides one measure of performance and is preferred to the average because of the effect of large values on the average. In QA, a system accepting 60% of the candidates is not significantly better or worse than one accepting 100%,  but the effect on average is quite high. Another measure is to observe the number of questions with at least one correct answer in the top N% for various values of N. By examining the number of correctanswersfoundinthetopN%wecanbetter understand what an effective cutoff would be.</Paragraph>
    <Paragraph position="13"> The overall results of our comparison can be found in Table 2. We have added the results of a system that scores candidates based on their frequency within the document as a comparison with a simple, yet effective, strategy. The second column is the median percentage of where the highest scored correct answer appears in the sorted candidate list. Low percentage values mean the answer is usually found high in the sorted list. The remaining columns list the number of questions that have a correct answer somewhere in the top N% of their sorted lists. This is meant to show the effects of imposing a strict cutoff prior to running the answer type model.</Paragraph>
    <Paragraph position="14"> The oracle system performs best, as it benefits from both manual question classification and manual entity tagging. If entity assignment is performed by an automatic system (as it is for ANNIE), the performance drops noticeably. Our probabilistic model performs better than ANNIE and achieves approximately 2/3 of the performance of the oracle system. Table 2 also shows that the use of candidate contexts increases the performance of our answer type model.</Paragraph>
    <Paragraph position="15"> Table 3 shows the performance of the oracle system, our model, and the ANNIE system broken down by manually-assigned answer types. Due to insufficient numbers of questions, the cardinal, percent, time, duration, measure, and money types are combined into an &amp;quot;Other&amp;quot; category. When compared with the oracle system, our model performs worse overall for questions of all types except for those seeking miscellaneous answers. For miscellaneous questions, the oracle identifies all tokens that do not belong to one of the other known categories as possible answers. For all questions of non-miscellaneous type, only a small subset of the candidates are marked appropriate.</Paragraph>
    <Paragraph position="16"> In particular, our model performs worse than the oracle for questions seeking persons and thingnames. Person questions often seek rare person names, which occur in few contexts and are difficult to reliably cluster. Thing-name questions are easy for a human to identify but difficult for automatic system to identify. Thing-names are a diversecategoryandarenotstronglyassociatedwith null any identifying contexts.</Paragraph>
    <Paragraph position="17"> Our model outperforms the ANNIE system in general, and for questions seeking organizations, thing-names, and miscellaneous targets in particular. ANNIE may have low coverage on organization names, resulting in reduced performance. Like the oracle, ANNIE treats all candidates not assigned one of the categories as appropriate for miscellaneous questions. Because ANNIE cannot identify thing-names, they are treated as miscellaneous. ANNIE shows low performance on thing-names because words incorrectly assigned types are sorted to the bottom of the list for miscellaneous and thing-name questions. If a correct answer is incorrectly assigned a type it will be sorted near the bottom, resulting in a poor score.</Paragraph>
  </Section>
class="xml-element"></Paper>