<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1602">
  <Title>Interactive Authoring of Logical Forms for Multilingual Generation</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> The objectives of the SAUT authoring system are to provide the user with a fast, intuitive and accurate way to compose semantic structures that represent meaning s/he wants to convey, then presenting the meaning in various natural languages. Therefore, an evaluation of these aspects (speed, intuitiveness, accuracy and coverage) is required, and we have conducted an experiment with human subjects to measure them. The experiment measures a snapshot of these parameters at a given state of the implementation. In the error analysis we have isolated parameters which depend on specifics of the implementation and those which require essential revisions to the approach followed by SAUT.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 User Experiment
</SectionTitle>
      <Paragraph position="0"> We have conducted a user experiment, in which ten subjects were given three to four recipes in English (all taken from the Internet) from a total pool of ten. The subjects had to compose semantic documents for these recipes using SAUT 2. The ontology and lexicon for the specific domain of cooking recipes were prepared in advance, and we have tested the tool by composing these recipes with the system. The documents the authors prepared are later used as a 'gold standard' (we refer to them as &amp;quot;reference documents&amp;quot;). The experiment was managed as follows: first, a short presentation of the tool (20 minutes) was given. Then, each subject recieved a written interactive tutorial which took approximately half an hour to process. Finally, each subject composed a set of 3 to 4 documents. The overall time taken for each subject was</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> We have measured the following aspects of the system during the experiment.</Paragraph>
      <Paragraph position="1"> Coverage - answers the questions &amp;quot;can I say everything I mean&amp;quot; and &amp;quot;how much of the possible meanings that can be expressed in natural language can be expressed using the input language&amp;quot;. In order to check the coverage of the tool, we examined the reference documents. We compared the text generated from the reference documents with the original recipes and checked which parts of the information were 2All subjects were computer science students.</Paragraph>
      <Paragraph position="2"> included, excluded or expressed in a partial way with respect to the original. We counted each of these in number of words in the original text, and expressed these 3 counts as a percentage of the words in the original recipe. We summed up the result as a coverage index which combined the 3 counts (correct, missing, partial) with a factor of 70% for the partial count.</Paragraph>
      <Paragraph position="3"> The results were checked by two authors independently and we report here the average of these two verifications. On a total of 10 recipes, containing 1024 words overall, the coverage of the system is 91%. Coverage was uniform across recipes and judges. We performed error analysis for the remaining 9% of the un-covered material below.</Paragraph>
      <Paragraph position="4"> Intuitiveness - to assess the ease of use of the tool, we measured the &amp;quot;learning curve&amp;quot; for users first using the system, and measuring the time it takes to author a recipe for each successive document (1st, 2nd, 3rd, 4th). For 10 users first facing the tool, the time it took to author the documents is as follows:  The time distribution among 10 users was extremely uniform. We did not find variation in the quality of the authored documents across users and across number of document.</Paragraph>
      <Paragraph position="5"> The tool is mastered quickly, by users with no prior training in knowledge representation or natural language processing. Composing the reference documents (approximately 100-words recipes) by the authors took an average of 12 minutes.</Paragraph>
      <Paragraph position="6"> Speed - we measured the time required to compose a document as a semantic representation, and compare it to the time taken to translate the same document in a different language. We compare the average time for trained users to author a recipe (14 minutes) with that taken by 2 trained translators to translate 4 recipes (from English to Hebrew).</Paragraph>
      <Paragraph position="7"> Semantic Authoring Time Translation Time 14 (minutes) 6 (minutes) The comparison is encouraging - it indicates that a tool for semantic authoring could become cost-effective if it is used to generate in 2 or 3 languages.</Paragraph>
      <Paragraph position="8"> Accuracy - We analyzed the errors in the documents prepared by the 10 users according to the following breakup: + Words in the source document not present in the semantic form + Words in the source document presented inaccurately in the semantic form + Users' errors in semantic form that are not included in the former two parameters.</Paragraph>
      <Paragraph position="9"> We calculated the accuracy for each document produced by the subjects during the experiment. Then we compared each document with the corresponding reference document (used here as a gold standard). Relative accuracy of this form estimates a form of confidence - &amp;quot;how sure can the user be that s/he wrote what s/he meant&amp;quot;? This measurement depends on the preliminary assumption that for a given recipe, any two readers (in the experiment environment including the authors), will extract similar information. This assumption is warranted for cooking recipes. This measure takes into account the limitations of the tool and reflects the success of users to express all that the tool can express:</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
[Table: accuracy by document number (columns: Document #, Accuracy)]
</SectionTitle>
      <Paragraph position="0"> Accuracy is quite consistent during the experiment sessions, i.e., it does not change as practice increases. The average 92.5% accuracy is quite high.</Paragraph>
      <Paragraph position="1"> We have categorized the errors found in subjects' documents in the following manner: + Content can be accurately expressed with SAUT (user error) + Content will be accurately expressed with changes in the SAUT's lexicon and ontology (ontology deficit) + Content cannot be expressed in the current implementation, and requires further investigation of the concept  This breakdown indicates that the tool can be improved by investing more time in the GUI and feedback quality and by extending the ontology. The difficult conceptual issues (those which will require major design modifications, or put in question our choice of formalism for knowledge encoding) represent 33% of the errors - overall accounting for 2.5% of the words in the word count of the generated text.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>