<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1402">
  <Title>A Task-based Framework to Evaluate Evaluative Arguments</Title>
  <Section position="4" start_page="10" end_page="10" type="metho">
    <SectionTitle>
3 A threshold controlling verbosity was set to its maximum value.
</SectionTitle>
    <Paragraph position="1"> Several empirical methods have been proposed and applied in the literature for evaluating NLG models. We discuss now why, among the three main evaluation methods (i.e., human judges, corpus-based and task efficacy), task efficacy appears to be the most appropriate for testing the effectiveness of evaluative arguments that are tailored to a complex model of the user's preferences.</Paragraph>
    <Paragraph position="2"> The human judges evaluation method requires a panel of judges to score outputs of generation models (Chu-Carroll and Carberry 1998; Lester and Porter March 1997). The main limitation of this approach is that the input of the generation process needs to be simple enough to be easily understood by judges 4. Unfortunately, this is not the case for our argument generator, where the input consists of a possibly complex and novel argument subject (e.g., a new house with a large number of features), and a complex model of the user's preferences.</Paragraph>
    <Paragraph position="3"> The corpus-based evaluation method (Robin and McKeown 1996) can be applied only when a corpus of input/output pairs is available. A portion of the corpus (the training set) is used to develop a computational model of how the output can be generated from the input. The rest of the corpus (the testing set) is used to evaluate the model. Unfortunately, a corpus for our generator does not exist. Furthermore, it would be difficult and extremely time-consuming to obtain and analyze such a corpus, given the complexity of our generator's input/output pairs.</Paragraph>
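    <Paragraph> For readers unfamiliar with the corpus-based protocol ruled out above, the following minimal Python sketch shows the train/test split it relies on; the corpus, build_model and score arguments are hypothetical placeholders, not anything from this paper.</Paragraph>
```python
import random

def corpus_based_evaluation(corpus, build_model, score, train_fraction=0.8, seed=0):
    """Illustrative corpus-based evaluation: fit a generation model on a
    training split of input/output pairs and score it on the held-out pairs.
    `corpus` is a list of (input, reference_output) pairs; `build_model` and
    `score` are placeholders supplied by the experimenter."""
    pairs = list(corpus)
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_fraction)
    training_set, testing_set = pairs[:cut], pairs[cut:]

    model = build_model(training_set)               # learn an input -> output mapping
    scores = [score(model(inp), ref) for inp, ref in testing_set]
    return sum(scores) / len(scores)                # mean similarity to the references
```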
  </Section>
  <Section position="5" start_page="10" end_page="11" type="metho">
    <SectionTitle>
4 See (Chu-Carroll and Carberry 1998) for an illustration of how the specification of the context can become extremely complex when human judges are used to evaluate content selection strategies for a dialog system.
</SectionTitle>
    <Paragraph position="1"> When a generator is designed to generate output for users engaged in certain tasks, a natural way to evaluate its effectiveness is by experimenting with users performing those tasks. For instance, in (Young, to appear) different models for generating natural language descriptions of plans are evaluated by measuring how effectively users execute those plans given the descriptions.</Paragraph>
    <Paragraph position="2"> This evaluation method, called task efficacy, allows one to evaluate a generation model without explicitly evaluating its output, but by measuring the output's effects on the user's behaviors, beliefs and attitudes in the context of the task. The only requirement for this method is the specification of a sensible task.</Paragraph>
    <Paragraph position="3"> Task efficacy is the method we have adopted in our evaluation framework.</Paragraph>
  </Section>
  <Section position="6" start_page="11" end_page="141" type="metho">
    <SectionTitle>
4 The Evaluation Framework
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
4.1 The task
</SectionTitle>
      <Paragraph position="0"> Aiming at general results, we chose a rather basic and frequent task that has been extensively studied in decision analysis: the selection of a subset of preferred objects (e.g., houses) out of a set of possible alternatives by considering trade-offs among multiple objectives (e.g., house location, house quality). The selection is performed by evaluating objects with respect to their values for a set of primitive attributes (e.g., the house's distance from the park, the size of the garden). In the evaluation framework we have developed, the user performs this task by using a computer environment (shown in Figure 3) that supports interactive data exploration and analysis (IDEA) (Roth, Chuah et al. 1997). The IDEA environment provides the user with a set of powerful visualization and direct manipulation techniques that facilitate the user's autonomous exploration of the set of alternatives and the selection of the preferred alternatives.</Paragraph>
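      <Paragraph> The paper does not spell out the form of the preference model used to rank alternatives, but the MAUT-based models assumed by the framework (see Section 4.2) are standardly additive. The sketch below assumes that additive form, with invented attribute names, component values and weights for the house domain.</Paragraph>
```python
# Minimal sketch of an additive MAUT preference model (an assumed form; the
# paper does not give it explicitly).  Each alternative is described by
# component values v_i already normalized to [0, 1]; the user model supplies
# weights w_i summing to 1, and the overall value is V(a) = sum_i w_i * v_i(a).

def overall_value(alternative, weights):
    """Weighted additive value of one alternative under a user's weights."""
    return sum(weights[attr] * value for attr, value in alternative.items())

def most_preferred(alternatives, weights, k=4):
    """Rank alternatives and keep the k most preferred (the selection task)."""
    return sorted(alternatives, key=lambda a: overall_value(a, weights), reverse=True)[:k]

# Invented example: two houses and one user's weights.
houses = [
    {"distance_from_park": 0.9, "garden_size": 0.4, "commute_time": 0.6},
    {"distance_from_park": 0.3, "garden_size": 0.8, "commute_time": 0.7},
]
weights = {"distance_from_park": 0.2, "garden_size": 0.3, "commute_time": 0.5}
print(most_preferred(houses, weights, k=1))
```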
      <Paragraph position="1"> Let's now examine how the argument generator that we described in Section 1 can be evaluated in the context of the selection task, by going through the architecture of the evaluation framework.</Paragraph>
    </Section>
    <Section position="2" start_page="11" end_page="141" type="sub_section">
      <SectionTitle>
4.2 The framework architecture
</SectionTitle>
      <Paragraph position="0"> Figure 2 shows the architecture of the evaluation framework. The framework consists of three main sub-systems: the IDEA system, a User Model Refiner and the Argument Generator. The framework assumes that a model of the user's preferences based on MAUT has been previously acquired using traditional methods from decision theory (Edwards and Barron 1994), to assure a reliable initial model. At the onset, the user is assigned the task of selecting from the dataset the four most preferred alternatives and placing them in the Hot List (see Figure 3, upper right corner), ordered by preference. The IDEA system supports the user in this task (Figure 2 (1)). As the interaction unfolds, all user actions are monitored and collected in the User's Action History (Figure 2 (2a)). Whenever the user feels that she has accomplished the task, the ordered list of preferred alternatives is saved as her preliminary selection. At this point, the stage is set for argument generation. Given the Refined Model of the User's Preferences for the target selection task, the Argument Generator produces an evaluative argument tailored to the user's model (Figure 2 (5-6)). Finally, the argument is presented to the user by the IDEA system (Figure 2 (7)).</Paragraph>
      <Paragraph position="1"> The argument's goal is to introduce a new alternative (not included in the dataset initially presented to the user) and to persuade the user that the alternative is worth considering.</Paragraph>
      <Paragraph position="2"> The new alternative is designed on the fly to be preferable for the user, given her preference model. Once the argument is presented, the user may (a) decide to introduce the new alternative in her Hot List, or (b) decide to further explore the dataset, possibly making changes to the Hot List and introducing the new instance in the Hot List, or (c) do nothing. Figure 3 shows the display at the end of the interaction, when the user, after reading the argument, has decided to introduce the new alternative in the first position.</Paragraph>
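      <Paragraph> Read as a data flow, the interaction just described can be sketched as follows; the object and method names (idea, refiner, generator and their calls) are illustrative stand-ins for the three sub-systems, not an API from the paper, and the numbered comments follow the Figure 2 steps mentioned in the text.</Paragraph>
```python
def run_session(dataset, initial_model, idea, refiner, generator):
    """Illustrative control flow for one session of the evaluation framework.
    `idea`, `refiner` and `generator` stand in for the IDEA system, the User
    Model Refiner and the Argument Generator; all signatures are assumed."""
    # (1)-(2a): the user explores the dataset in IDEA, builds a preliminary
    # Hot List, and every action is logged in the User's Action History.
    action_history, preliminary_hot_list = idea.explore(dataset)

    # The User Model Refiner turns the initial MAUT model plus the observed
    # behavior into the Refined Model of the User's Preferences.
    refined_model = refiner.refine(initial_model, action_history, preliminary_hot_list)

    # (5)-(6): a new alternative is designed on the fly to be preferable under
    # the refined model, and a tailored evaluative argument for it is generated.
    new_alternative = generator.design_alternative(refined_model, dataset)
    argument = generator.generate(new_alternative, refined_model)

    # (7): IDEA presents the argument; the user may adopt the new alternative,
    # keep exploring and revising the Hot List, or do nothing.
    final_hot_list = idea.present(argument, new_alternative)

    # (8): once the user stops, effectiveness measures are computed from the
    # interaction record and from her self-reports.
    return final_hot_list, action_history
```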
      <Paragraph position="3"> Whenever the user decides to stop exploring and is satisfied and confident with her final selections, measures related to the argument's effectiveness can be assessed (Figure 2 (8)).</Paragraph>
      <Paragraph position="4"> These measures are obtained either from the record of the user interaction with the system or from user self-reports (see Section 2).</Paragraph>
      <Paragraph position="5"> First, and most important, are measures of behavioral intentions and attitude change: (a) whether or not the user adopts the new proposed alternative, (b) in which position in the Hot List she places it, (c) how much she likes it, (d) whether or not the user revises the Hot List and (e) how much the user likes the objects in the Hot List. Second, a measure can be obtained of the user's confidence that she has selected the alternatives that are best for her in the set. Third, a measure of argument effectiveness can also be derived by explicitly questioning the user at the end of the interaction about the rationale for her decision. This can be done either by asking the user to justify her decision in a written paragraph, or by asking the user to self-report, for each attribute of the new house, how important the attribute was in her decision (Olson and Zanna 1991). Both methods can provide valuable information on what aspects of the argument were most influential (i.e., better understood and accepted by the user).</Paragraph>
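      <Paragraph> One concrete way to record the measures just listed is a per-subject data structure like the following; the field names and the rating scales are assumptions about how the self-reports would be coded, not measures prescribed by the paper.</Paragraph>
```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EffectivenessMeasures:
    """One subject's argument-effectiveness record (illustrative field names)."""
    # (a)-(b): behavioral intentions toward the proposed new alternative
    adopted_new_alternative: bool
    hot_list_position: Optional[int]              # None if the alternative was not adopted
    # (c)-(e): attitude measures, here coded on an assumed 1-7 scale
    liking_of_new_alternative: int
    revised_hot_list: bool
    liking_of_hot_list_items: list = field(default_factory=list)
    # confidence of having selected the alternatives that are best for her
    selection_confidence: int = 0
    # per-attribute importance ratings elicited at the end of the interaction
    attribute_importance: dict = field(default_factory=dict)
```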
      <Paragraph position="6"> A fourth measure of argument effectiveness is to explicitly ask the user, at the end of the interaction, to judge the argument with respect to several dimensions of quality, such as content, organization, writing style and convincingness.</Paragraph>
      <Paragraph position="7"> Evaluations based on judgments along these dimensions are clearly weaker than evaluations measuring actual behavioral and attitudinal changes (Olson and Zanna 1991). However, these judgments may provide more information than judgments from independent judges (as in the &amp;quot;human judges&amp;quot; method discussed in Section 3), because they are performed by the addressee of the argument, when the experience of the task is still vivid in her memory.</Paragraph>
      <Paragraph position="8"> To summarize, the evaluation framework just described supports users in performing a realistic task at their own pace by interacting with the IDEA system. In the context of this task, an evaluative argument is generated and measurements related to its effectiveness can be performed.</Paragraph>
      <Paragraph position="9"> In the next section, we discuss an experiment that we are currently running by using the evaluation framework.</Paragraph>
      <Paragraph position="10"> Figure 3 The IDEA environment display at the end of the interaction.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="141" end_page="141" type="metho">
    <SectionTitle>
5 The Experiment
</SectionTitle>
    <Paragraph position="0"> As explained in Section 1, the argument generator has been designed to facilitate testing of the effectiveness of different aspects of the generation process. The experimenter can easily control whether the generator tailors the argument to the current user, the degree of conciseness of the argument, and what microplanning tasks are performed. In our initial experiment, because of limited financial and human resources, we focus on the first two aspects for arguments about a single entity. This is not because we are not interested in the effectiveness of performing microplanning tasks, but because we consider the effectiveness of tailoring and conciseness somewhat more difficult, and therefore more interesting, to prove.</Paragraph>
    <Paragraph position="1"> Thus, we designed a between-subjects experiment with four experimental conditions: No-Argument - subjects are simply informed that a new house came on the market.</Paragraph>
    <Paragraph position="2"> Tailored-Concise - subjects are presented with an evaluation of the new house tailored to their preferences and at a level of conciseness that we hypothesize to be optimal.</Paragraph>
    <Paragraph position="3"> Non-Tailored-Concise - subjects are presented with an evaluation of the new house which is not tailored to their preferences 5, but is at a level of conciseness that we hypothesize to be optimal.</Paragraph>
    <Paragraph position="4"> Tailored-Verbose - subjects are presented with an evaluation of the new house tailored to their preferences, but at a level of conciseness that we hypothesize to be too low.</Paragraph>
    <Paragraph position="5"> 5 The evaluative argument is tailored to a default average user, for whom all aspects of a house are equally important.</Paragraph>
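    <Paragraph> The four conditions above can be read as settings of two generator switches, tailoring and verbosity; the parameterization below is an assumption, with invented field names and illustrative threshold values (footnote 3 only tells us that a verbosity threshold exists and was set to its maximum for the verbose condition).</Paragraph>
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GeneratorConfig:
    """Hypothetical switches for the argument generator in this experiment."""
    tailored: bool            # tailor to the current user, or to a default average user
    verbosity: float          # 0.0 = maximally concise ... 1.0 = maximally verbose
    generate_argument: bool = True   # False reproduces the No-Argument condition

# The four between-subjects conditions (verbosity values are illustrative).
CONDITIONS = {
    "No-Argument":          GeneratorConfig(tailored=False, verbosity=0.0, generate_argument=False),
    "Tailored-Concise":     GeneratorConfig(tailored=True,  verbosity=0.3),
    "Non-Tailored-Concise": GeneratorConfig(tailored=False, verbosity=0.3),
    "Tailored-Verbose":     GeneratorConfig(tailored=True,  verbosity=1.0),
}
```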
    <Paragraph position="6"> (Figure 4, excerpt from the final questionnaire: a) How would you judge the houses in your Hot List? The more you like the house, the closer you should put a cross to &amp;quot;good choice&amp;quot;.) In the four conditions, all the information about the new house is also presented graphically. Our hypotheses on the outcomes of the experiment can be summarized as follows. We expect arguments generated for the Tailored-Concise condition to be more effective than arguments generated for both the Non-Tailored-Concise and Tailored-Verbose conditions. We also expect the Tailored-Concise condition to be somewhat better than the No-Argument condition, but to a lesser extent, because subjects, in the absence of any argument, may spend more time further exploring the dataset, therefore reaching a more informed and balanced decision. Finally, we do not have strong hypotheses on comparisons of argument effectiveness among the No-Argument, Non-Tailored-Concise and Tailored-Verbose conditions.</Paragraph>
    <Paragraph position="7"> The design of our evaluation framework, and consequently the design of this experiment, take into account that the effectiveness of arguments is determined not only by the argument itself, but also by the user's traits such as argumentativeness, need for cognition, self-esteem and intelligence (as described in Section 2). Furthermore, we assume that argument effectiveness can be measured by means of the behavioral intentions and self-reports described in Section 4.2.</Paragraph>
    <Paragraph position="8"> The experiment is organized in two phases. In the first phase, the subject fills out three questionnaires on the Web. One questionnaire implements a method from decision theory to acquire a model of the subject's preferences (Edwards and Barron 1994). The second questionnaire assesses the subject's argumentativeness (Infante and Rancer 1982).</Paragraph>
    <Paragraph position="9"> The last one assesses the subject's need for cognition (Cacioppo, Petty et al. 1984). In the second phase of the experiment, to control for other possible confounding variables (including intelligence and self-esteem), the subject is randomly assigned to one of the four conditions. Then, the subject interacts with the evaluation framework and, at the end of the interaction, measures of argument effectiveness are collected. Some details on measures based on subjects' self-reports can be examined in Figure 4, which shows an excerpt from the final questionnaire that subjects are asked to fill out at the end of the interaction.</Paragraph>
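    <Paragraph> As a small illustration of the second-phase assignment, the sketch below randomizes subjects into the four conditions with equal cell sizes; the paper only says that assignment is random, so the balancing scheme, the seed and the identifiers are assumptions.</Paragraph>
```python
import random

CONDITIONS = ["No-Argument", "Tailored-Concise", "Non-Tailored-Concise", "Tailored-Verbose"]

def balanced_assignment(subject_ids, per_condition=10, seed=0):
    """Randomly assign subjects to conditions with equal cell sizes, so that
    unmeasured traits (e.g. intelligence, self-esteem) are controlled by
    randomization rather than by explicit measurement."""
    slots = [c for c in CONDITIONS for _ in range(per_condition)]
    assert len(subject_ids) <= len(slots)
    random.Random(seed).shuffle(slots)
    return dict(zip(subject_ids, slots))

# Example: 40 subjects, 10 per condition (hypothetical identifiers).
assignment = balanced_assignment([f"subject_{i:02d}" for i in range(1, 41)])
```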
    <Paragraph position="10"> After running the experiment with 8 pilot subjects to refine and improve the experimental procedure, we are currently running a formal experiment involving 40 subjects, 10 in each experimental condition.</Paragraph>
    <Section position="1" start_page="141" end_page="141" type="sub_section">
      <SectionTitle>
Future Work
</SectionTitle>
      <Paragraph position="0"> In this paper, we propose a task-based framework to evaluate evaluative arguments.</Paragraph>
      <Paragraph position="1"> We are currently using this framework to run a formal experiment to evaluate arguments about a single entity. However, this is only a first step. The power of the framework is that it enables the design and execution of many different experiments about evaluative arguments. The goal of our current experiment is to verify whether tailoring an evaluative argument to the user and varying the degree of argument conciseness influence argument effectiveness.</Paragraph>
      <Paragraph position="2"> We envision further experiments along the following lines.</Paragraph>
      <Paragraph position="3"> In the short term, we plan to study more complex arguments, including comparisons between two entities, as well as comparisons between mixtures of entities and sets of entities. One experiment could assess the influence of tailoring and conciseness on the effectiveness of these more complex arguments. Another possible experiment could compare different argumentative strategies for selecting and organizing the content of these arguments. In the long term, we intend to evaluate techniques to generate evaluative arguments that combine natural language and information graphics (e.g., maps, tables, charts).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>