File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/06/e06-2029_evalu.xml
Size: 2,838 bytes
Last Modified: 2025-10-06 13:59:33
<?xml version="1.0" standalone="yes"?> <Paper uid="E06-2029"> <Title>Adaptivity in Question Answering with User Modelling and a Dialogue Interface</Title> <Section position="8" start_page="200" end_page="201" type="evalu"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"> Since YourQA does not single out one correct answerphrase,TRECevaluationmetricsarenotsuit- null able for it. A user-centred methodology to assess how individual information needs are met is more appropriate. Webaseourevaluationon(Su,2003), which proposes a comprehensive search engine evaluation model, defining the following metrics: 1. Relevance: we define strict precision (P1) as the ratio between the number of results rated as relevant and all the returned results, and loose pre- null cision (P2) as the ratio between the number of results rated as relevant or partially relevant and all the returned results.</Paragraph> <Paragraph position="1"> 2. Usersatisfaction: a7-pointLikertscale7 isused to assess the user's satisfaction with loose precision of results (S1) and query success (S2). 3. Reading level accuracy: given the set R of results returned for a reading level r, Ar is the ratio between the number of results [?] R rated by the users as suitable for r and |R|.</Paragraph> <Paragraph position="2"> 4. Overall utility (U): the search session as a whole is assessed via a 7-point Likert scale.</Paragraph> <Paragraph position="3"> We performed our evaluation by running 24 queries (some of which in Tab. 2) on Google and YourQA and submitting the results -i.e. Google result page snippets and YourQA passages- of both to 20 evaluators, along with a questionnaire. The relevance results (P1 and P2) in Tab. 1 show a strict and loose precision. The coarse semantic processing applied and context visualisation thus contribute to creating more relevant passages. Both user satisfaction results (S1 and S2) in Tab. 1 also denote a higher level of satisfaction tributed toYourQA.Tab. 2showsthatevaluatorsfoundour Query Ag Am Ap When did the Middle Ages begin? 0,91 0,82 0,68 Who painted the Sistine Chapel? 0,85 0,72 0,79 When did the Romans invade Britain? 0,87 0,74 0,82 Who was a famous cubist? 0,90 0,75 0,85 Who was the first American in space? 0,94 0,80 0,72 results appropriate for the reading levels to which they were assigned. The accuracy tended to decrease (from 94% to 72%) with the level: it is indeed more constraining to conform to a lower reading level than to a higher one. Finally, the 7This measure - ranging from 1= &quot;extremely unsatisfactory&quot; to 7=&quot;extremely satisfactory&quot; - is particularly suitable to assess how well a system meets user's search needs. general satisfaction values for U in Tab. 1 show an improved preference for YourQA.</Paragraph> </Section> class="xml-element"></Paper>