<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3007"> <Title>User-Centered Evaluation of Interactive Question Answering Systems</Title>
<Section position="4" start_page="51" end_page="51" type="metho"> <SectionTitle> 3 Data Collection </SectionTitle>
<Paragraph position="0"> System logs and Glass Box (Hampson & Crowley, 2005) were the core logging methods providing process data. Post-scenario, post-session, post-system, and cognitive workload questionnaires, interviews, focus groups, and other user-centered methods were applied to understand more about analysts' experiences and attitudes. Finally, cross-evaluation (Sun & Kantor, 2006) was the primary method for evaluating the reports produced.</Paragraph>
<Paragraph position="1"> Each experimental block had two sessions, corresponding to the two unique scenarios. Methods and instruments described below were administered either throughout the experimental block (e.g., observation and logging); at the end of each session, in which case the analyst completed two of these instruments during the block (e.g., a post-session questionnaire for each scenario); or once, at the end of the experimental block (e.g., a post-system questionnaire). We added several data collection efforts at the end of the workshop to understand more about analysts' overall experiences and to learn more about the study method.</Paragraph>
<Section position="1" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 3.1 Observation </SectionTitle>
<Paragraph position="0"> Throughout the experimental sessions, trained observers monitored analysts' interactions with systems. Observers were stationed behind analysts to be minimally intrusive while allowing an optimal viewing position. Observers used an Observation Worksheet to record activities and behaviors that were expected to be indicative of analysts' level of comfort and feelings of satisfaction or dissatisfaction. Observers noted analysts' apparent patterns of activities. Finally, observers used the Worksheet to note behaviors to follow up on during subsequent session interviews.</Paragraph> </Section>
<Section position="2" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 3.2 Spontaneous Self-Reports </SectionTitle>
<Paragraph position="0"> During the evaluation, we were interested in obtaining feedback from analysts in situ. Analysts were asked to report their experiences spontaneously during the experimental session in three ways: commenting into lapel microphones, using the &quot;SmiFro Console&quot; (described more fully below), and completing a three-item online Status Questionnaire at 30-minute intervals.</Paragraph>
<Paragraph position="1"> The SmiFro Console provided analysts with a persistent tool for commenting on their experiences using the system. It was rendered in a small display window, and analysts were asked to leave this window open on their desktops at all times. It displayed smile and frown faces, which analysts could select using radio buttons. The Console also displayed a text box, in which analysts could write additional comments. The goal in using smiles and frowns was to create a simple, recognizable, and quick way for analysts to provide feedback.</Paragraph>
<Paragraph position="2"> The SmiFro Console contained links to the Status Questionnaires, which were designed to solicit analysts' opinions and feedback about the progress of their work during the session. Each questionnaire contained the same three questions, worded differently to reflect different moments in time. There were four Status Questionnaires, corresponding to 30-minute intervals during the session: 30, 60, 90, and 120 minutes.</Paragraph> </Section>
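The paper does not describe how the SmiFro Console was implemented. As a rough illustration of the kind of widget described above, the sketch below shows a minimal desktop window with smile/frown radio buttons and a comment box that appends time-stamped feedback to a local log file; the file name, button label, and two-face design details are assumptions for illustration, not taken from the original system.

# Minimal sketch of a SmiFro-style feedback console; all names and details
# here (log file, two faces, button label) are assumptions for illustration.
import csv
import tkinter as tk
from datetime import datetime

LOG_PATH = "smifro_log.csv"  # hypothetical local log file

def submit(rating_var, comment_box):
    """Append the selected face and any free-text comment, time-stamped."""
    with open(LOG_PATH, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now().isoformat(timespec="seconds"),
            rating_var.get(),                       # "smile" or "frown"
            comment_box.get("1.0", "end").strip(),  # optional comment
        ])
    comment_box.delete("1.0", "end")  # window stays open for later feedback

root = tk.Tk()
root.title("SmiFro Console")
rating = tk.StringVar(value="smile")
tk.Radiobutton(root, text=":-)", variable=rating, value="smile").pack(anchor="w")
tk.Radiobutton(root, text=":-(", variable=rating, value="frown").pack(anchor="w")
comment = tk.Text(root, height=3, width=40)
comment.pack()
tk.Button(root, text="Send", command=lambda: submit(rating, comment)).pack()
root.mainloop()

A real console would also carry the links to the four Status Questionnaires described above and record which system and scenario the feedback refers to.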
<Section position="3" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 3.3 NASA TLX Questionnaire </SectionTitle>
<Paragraph position="0"> After completing each scenario, analysts completed the NASA Task Load Index (TLX).</Paragraph> </Section> </Section>
<Section position="5" start_page="51" end_page="53" type="metho">
<Paragraph position="0"> The NASA TLX is a standard instrument used in aviation research to assess pilot workload; it was used in this study to assess analysts' subjective cognitive workload while completing each scenario. It probes six dimensions:</Paragraph>
<Paragraph position="1"> 1. Mental demand: whether this searching task affects a user's attention, concentration, and focus.</Paragraph>
<Paragraph position="2"> 2. Physical demand: whether this searching task affects a user's health, makes a user tired, etc.</Paragraph>
<Paragraph position="3"> 3. Temporal demand: whether this searching task takes more time than can be afforded. 4. Performance: whether this searching task is heavy or light in terms of workload.</Paragraph>
<Paragraph position="4"> 5. Frustration: whether this searching task makes a user unhappy or frustrated.</Paragraph>
<Paragraph position="5"> 6. Effort: whether a user has spent a lot of effort on this searching task.</Paragraph>
<Section position="1" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 3.4 Post-Scenario Questionnaire </SectionTitle>
<Paragraph position="0"> Following the NASA TLX, analysts completed the six-item Post-Scenario Questionnaire. This Questionnaire was used to assess dimensions of the scenarios, such as their realism and difficulty.</Paragraph> </Section>
<Section position="2" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 3.5 Post-Session Questionnaire </SectionTitle>
<Paragraph position="0"> After completing the Post-Scenario Questionnaire, analysts completed the fifteen-item Post-Session Questionnaire. This Questionnaire was used to assess analysts' experiences using this particular system to prepare a pseudo-report. Each question was mapped to one or more of our research hypotheses. Observers examined these responses and used them to construct follow-up questions for subsequent Post-Session Interviews.</Paragraph> </Section>
<Section position="3" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 3.6 Post-Session Interview </SectionTitle>
<Paragraph position="0"> Observers used a Post-Session Interview Schedule to privately interview each analyst. The Interview Schedule contained instructions to the observer for conducting the interview and provided a list of seven open-ended questions. One of these questions required the observer to use notes from the Observation Worksheet, while two called for the observer to use analysts' responses to Post-Session Questionnaire items.</Paragraph> </Section>
<Section position="4" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 3.7 NASA TLX Weighting Instrument </SectionTitle>
<Paragraph position="0"> After using the system to complete two scenarios, analysts completed the NASA TLX Weighting instrument. This instrument was used to elicit from each analyst a ranking of the six factors probed by the NASA TLX. It presents all 15 pair-wise comparisons of the 6 factors, and analysts were forced to choose one factor in each pair as more important. A simple sum of &quot;wins&quot; is then used to assign a weight to each dimension for that analyst.</Paragraph> </Section>
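To make the weighting scheme concrete, the sketch below (not from the paper; the example responses and the 0-100 rating scale are assumptions based on the standard TLX) tallies each dimension's "wins" over the 15 pairwise comparisons and combines them with an analyst's ratings into a single weighted workload score.

from itertools import combinations

DIMENSIONS = ["mental", "physical", "temporal", "performance", "frustration", "effort"]

def tlx_weights(pair_choices):
    """Count 'wins' per dimension over the 15 pairwise comparisons.

    pair_choices maps each unordered pair of dimensions (a frozenset) to the
    dimension the analyst judged more important.
    """
    wins = {d: 0 for d in DIMENSIONS}
    for pair in combinations(DIMENSIONS, 2):  # all 15 pairs
        wins[pair_choices[frozenset(pair)]] += 1
    return wins

def weighted_workload(ratings, wins):
    """Combine ratings with pairwise weights (weights sum to 15, as in the standard TLX)."""
    return sum(wins[d] * ratings[d] for d in DIMENSIONS) / 15.0

# Hypothetical example: one analyst's pairwise choices and 0-100 ratings for one scenario.
choices = {frozenset(p): p[0] for p in combinations(DIMENSIONS, 2)}
ratings = {"mental": 70, "physical": 10, "temporal": 55,
           "performance": 40, "frustration": 60, "effort": 65}

print(tlx_weights(choices))                              # e.g., {'mental': 5, 'physical': 4, ...}
print(weighted_workload(ratings, tlx_weights(choices)))  # about 46.3

Because every one of the 15 comparisons awards exactly one win, the weights always sum to 15, so the weighted score stays on the same scale as the raw ratings.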
<Section position="5" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 3.8 Post-System Questionnaire </SectionTitle>
<Paragraph position="0"> After the NASA TLX Weighting instrument, analysts completed the thirty-three-item Post-System Questionnaire, which assessed their experiences using the specific system used during the block. As with the Post-Session Questionnaire, each question was mapped to one or more of our research hypotheses, and observers asked follow-up questions about analysts' responses to selected questions during the Post-System Interview.</Paragraph> </Section>
<Section position="6" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 3.9 Post-System Interview </SectionTitle>
<Paragraph position="0"> Observers used a Post-System Interview Schedule to privately interview each analyst at the end of a block. The Interview Schedule contained instructions to the observer for conducting the interview, as well as six open-ended questions. As in the Post-Session Interview, observers were instructed to construct content for two of these questions from analysts' responses to the Post-System Questionnaire.</Paragraph> </Section>
<Section position="7" start_page="52" end_page="53" type="sub_section"> <SectionTitle> 3.10 Cross-Evaluation </SectionTitle>
<Paragraph position="0"> The last component of each block was Cross-Evaluation (Sun & Kantor, 2006). Each analyst reviewed (using a paper copy) all seven reports prepared for each scenario in the block (14 reports in total). Analysts used an online tool to rate each report according to 7 criteria using 5-point scales.</Paragraph>
<Paragraph position="1"> After analysts completed independent ratings of each report according to the 7 criteria, they were asked to sort the stack of reports into rank order, placing the best report at the top of the pile. Analysts were then asked to use a pen to write the appropriate rank number at the top of each report, and to use an online tool to enter their report rankings. The criteria that analysts used for evaluating reports were: (1) covers the important ground; (2) avoids irrelevant material; (3) avoids redundant information; (4) includes selective information; (5) is well organized; (6) reads clearly and easily; and (7) overall rating.</Paragraph>
<Paragraph position="2"> Focus groups of analysts were then formed to discuss the results of the Cross-Evaluation. These focus groups had two purposes: to develop a consensus ranking of the seven reports for each scenario, and to elicit the aspects, or dimensions, that led each analyst to rank a report high or low in overall quality. These discussions were taped, and an observer took notes during the discussion.</Paragraph>
<Paragraph position="3"> 3.12 System Logs and Glass Box
Throughout much of the evaluation, logging and Glass Box software captured analysts' interactions with systems. The Glass Box software supports capture of analyst workstation activities, including keyboard/mouse data, window events, file open and save events, copy/paste events, and web browser activity. The Glass Box uses a relational database to store time-stamped events and a hierarchical file store in which files and the content of web pages are stored. The Glass Box copies every file the analyst opens, so that there is a complete record of the evolution of documents. Material on every web page analysts visit is explicitly stored, so that each web page can later be recreated by researchers as it existed at the time analysts accessed it; screen and audio capture are also available. The data captured by the Glass Box provide details about analysts' interactions with Microsoft desktop components, such as MS Office and Internet Explorer. User interaction with applications that do not run in a browser, and with Java applications that may run in a browser, is opaque to the Glass Box.</Paragraph>
<Paragraph position="4"> Although limited information (e.g., window title, application name, and information copied to the system clipboard) is captured, the quantity and quality of these data are not sufficient to serve as a complete log of user-system interaction. Thus, a set of logging requirements was developed and implemented by each system. These included: time stamp; the set of documents from which the user copied text; the number of documents viewed; the number of documents that the system said contained the answer; and the analyst's query/question.</Paragraph>
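The logging requirements listed above amount to a small per-query record. The sketch below shows one possible representation; the field names, types, and example values are illustrative, since the paper does not specify the log format each system implemented.

from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class InteractionLogRecord:
    """One record per analyst query, covering the fields required of every system."""
    timestamp: datetime          # time stamp of the interaction
    query: str                   # analyst's query/question
    documents_viewed: int        # number of documents viewed
    documents_with_answer: int   # number of documents the system said contained the answer
    copied_from: List[str] = field(default_factory=list)  # documents the analyst copied text from

# Hypothetical example record (values are invented for illustration):
record = InteractionLogRecord(
    timestamp=datetime(2005, 9, 1, 10, 30),
    query="Which organizations fund research on the scenario topic?",
    documents_viewed=12,
    documents_with_answer=3,
    copied_from=["doc_017", "doc_094"],
)
print(record)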
<Paragraph position="5"> On the final day of the workshop, analysts completed a Scenario Difficulty Assessment task, provided feedback to system developers, and participated in two focus group interviews. As part of the Scenario Difficulty Assessment, analysts rated each scenario on 12 dimensions and also rank-ordered the scenarios according to level of difficulty. After the Scenario Difficulty Assessment, analysts visited each of the three experimental system developers in turn for a 40-minute free-form discussion to provide feedback about the systems. As the last event in the workshop, analysts participated in two focus groups. The first was to obtain additional feedback about analysts' overall experiences, and the second was to obtain feedback from analysts about the evaluation process.</Paragraph> </Section> </Section> </Paper>