<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0406"> <Title>Text Summarizer in Use: Lessons Learned from Real World Deployment and Evaluation</Title> <Section position="4" start_page="62" end_page="62" type="metho"> <SectionTitle> 3.5 Technology-related Modifications </SectionTitle>
<Paragraph position="0"> customers for a summary service
* gains in efficiency are hard to measure for a task already performed efficiently in real-world situations.</Paragraph>
<Paragraph position="1"> In response, we have established a summary service in which retrieval results are routed directly to our summary server and await the user. We plan to integrate the summarization tool into the IR system. (Uploading batches and then submitting them to the server is still an option.) We also abandoned the naive idea that data overload equates to summarization requirements and realized that the technology does not apply to all users. We have selected users more effectively by profiling characteristics of active users (e.g., daily document viewing work practice, document volume, static query use) and have prioritized deployment to the population that could benefit most from it.</Paragraph>
<Paragraph position="2"> In order to demonstrate summarization tool efficiency, we needed to baseline full-text review. We considered, but rejected, a number of options: user self-report and timing, observations, and even the creation of a viewing tool to monitor and document full-text review. Instead, we baselined full-text scanning through information retrieval logs for a subgroup of users by tracking per-document viewing time over a one-month period. These users submit the same queries daily and view their documents through the IR system browser. For the heaviest system users, 75% of the documents were viewed in under 20 seconds per document, but note that users vary widely, with a tendency to spend a much longer browse time on a relatively small number of documents. We then identified a subgroup of these users and attempted to deploy the summarizer to this baseline group to compare the scanning time required over a similar time frame. We are currently analyzing these data. System use in a work environment is considered a good indicator of tool utility, but we wanted some gauge of summary quality and also anticipated user concerns about an emerging technology like automatic text summarization. We compromised and selected a method to measure the effectiveness of our summaries that serves a dual purpose: our users gain confidence in the utility of the summaries, and we can collect and measure the effectiveness of the generic summaries for some of our users on their data. We initially piloted and have now incorporated a data collection procedure into our software. In our on-line training, we guide users to explore tool capabilities through a series of experiments or tasks. In the first of these tasks, a user is asked to submit a batch for summarization and then, for each of five to seven user-selected summaries, to record an answer (yes/no), based on the summary alone, to the question: &quot;Is this document likely to be relevant to me?&quot; Then the user is directed to open the original document for each of the summaries and, after reading the original text, record an answer (yes/no) to the question: &quot;Is the document relevant to me?&quot; In a prototype collection effort, we asked users to review the first ten documents, but in follow-on interviews the users recommended review of fewer documents. We understand the limits this places on interpreting our data. Also, the on-line training is optional, so we are not able to collect these data for all our users uniformly.</Paragraph>
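To make the use of these paired judgments concrete, the sketch below shows one way the recorded yes/no answers could be tallied into recall, precision, and False Positive / False Negative counts; the record layout and names are illustrative assumptions, not part of the deployed software. Per-user results of this kind are what the next paragraph and Table 2 report.

```python
# Illustrative sketch only (assumed record layout, not the deployed software):
# tally the paired training-task judgments into recall, precision, and
# False Positive / False Negative counts.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Judgment:
    doc_id: str
    relevant_by_summary: bool   # "Is this document likely to be relevant to me?" (summary only)
    relevant_by_fulltext: bool  # "Is the document relevant to me?" (after reading the original)

def score(judgments: List[Judgment]) -> dict:
    tp = sum(j.relevant_by_summary and j.relevant_by_fulltext for j in judgments)
    fp = sum(j.relevant_by_summary and not j.relevant_by_fulltext for j in judgments)
    fn = sum(not j.relevant_by_summary and j.relevant_by_fulltext for j in judgments)
    tn = sum(not j.relevant_by_summary and not j.relevant_by_fulltext for j in judgments)

    def ratio(num: int, den: int) -> Optional[float]:
        return num / den if den else None

    return {
        "recall": ratio(tp, tp + fn),     # share of relevant documents caught from the summary
        "precision": ratio(tp, tp + fp),  # share of summary-based "yes" answers that were correct
        "false_negatives": fn,            # relevant documents a user would have discarded
        "false_positives": fp,            # irrelevant documents a user would have opened anyway
        "true_negatives": tn,
    }
```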
<Paragraph position="3"> Most of the users tested exhibited both high recall and precision, with six users judging relevance correctly for all documents (see Table 2 below). The False Negative error was high for only one user, while the majority of the users exhibited no False Negative errors, a worse error to commit than wasting time viewing irrelevant data (a False Positive).</Paragraph>
<Paragraph position="6"> Across all the users, 79% of all relevant documents and 81% of the irrelevant documents were accurately categorized by examination of the summary.</Paragraph>
<Section position="1" start_page="62" end_page="62" type="sub_section"> <SectionTitle> 3.5.2 User-centered Changes in Support </SectionTitle>
<Paragraph position="0"> On user support, we learned that
* our system did not effectively support user tool navigation
* our users did not fully exploit the system's tailorable features.
In response, we addressed user support needs from three different angles, each of which we discuss below: incorporation of Electronic Performance Support Systems, design and implementation of procedural on-line training and guided discovery training, and user analysis of summary quality.</Paragraph>
<Paragraph position="1"> Electronic Performance Support Systems (EPSS) are a widely acknowledged strategy for on-the-job performance support. Defined as &quot;an optimized body of co-ordinated on-line methods and resources that enable and maintain a person's or an organization's performance,&quot; EPSS interventions range from simple help systems to intelligent wizard-type support (Villachica and Stone, 1999; Gery, 1991).</Paragraph>
<Paragraph position="2"> We elected to incorporate EPSS rather than classroom instruction. Based on an analysis of tool usage data, user requirements, and user observations, experts in interface design and technology performance support prototyped an EPSS-enhanced interface. Active system users reviewed these changes before implementation. The on-line performance support, available at all times, includes system feature procedures, a term glossary, an FAQ, and a new interface design.</Paragraph>
<Paragraph position="3"> With the incorporation of the EPSS, we also addressed the under-utilization of the configurable features. Although simple technologies with few options, such as traditional telephones, do not require conceptual system understanding for effective use, more complex systems with multiple options are often underutilized when supported with procedural training alone. We decided to incorporate both procedural training in a &quot;Getting Started&quot; tutorial and conceptual training in &quot;The Lab&quot;. In &quot;Getting Started&quot;, users learn basic system actions (e.g., creating set-ups, submitting batches for summarization, viewing summaries). &quot;The Lab&quot;, on the other hand, supports guided discovery training in which users explore the system through a series of experiments, using their own data against various tool options and recording their observations.</Paragraph>
<Paragraph position="4"> Given our own experience with under-utilization and research reporting difficulties with unguided exploratory learning (Hsu et al., 1993; Tuovinen and Sweller, 1999), we built on the work of de Mul and Van Oostendorf (1996) and Van Oostendorf and de Mul (1999) and their finding that task-oriented exploratory support leads to more effective learning of computer systems. We created a series of experiments that the user conducts to discover how the summarization technology can best meet their needs. For example, users are directed to change summary length and to determine for themselves how the variation affects their ability to judge relevance using their own data.</Paragraph>
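As an illustration of the kind of experiment run in &quot;The Lab&quot;, the sketch below presents the same document at several summary lengths and records the user's own relevance judgment at each setting. The lead-sentence extractor is only a stand-in, since the deployed summarizer's algorithm and interface are not detailed here; the function names and length settings are hypothetical.

```python
# Illustrative sketch only: a naive lead-sentence extractor stands in for the
# deployed summarizer so the shape of a summary-length experiment is visible.
import re

def lead_summary(text: str, max_sentences: int) -> str:
    """Return the first `max_sentences` sentences as a stand-in extract."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])

def run_length_experiment(doc_text: str, lengths=(1, 3, 5)) -> list:
    """Show the document at each summary length and record the user's judgment."""
    observations = []
    for n in lengths:
        print(f"--- Summary at {n} sentence(s) ---")
        print(lead_summary(doc_text, n))
        answer = input("Is this document likely to be relevant to you? (yes/no) ")
        observations.append({"length": n, "judged_relevant": answer.strip().lower() == "yes"})
    return observations
```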
<Paragraph position="6"> In February, we conducted a study of two groups, one with the EPSS and the &quot;Getting Started&quot; tutorial and a second with the same level of support plus &quot;The Lab&quot;. Earlier work by Kieras and Bovair (1984) compared straight procedural training with conceptual training and showed that the conceptually trained users made more efficient use of system features. The goal of our study was to determine just what level of training support the summarization technology requires for effective use. Through surveys, we planned to collect attitudes toward the tool and the training, and through web logs, tool usage data and option trials. We also planned to assess the users' understanding of the features and benefits of the tool. We are currently analyzing these data.</Paragraph>
<Paragraph position="8"> In addition to the EPSS and the on-line training, we developed a method for taking into account user assessment of our summary quality in a systematic way. User feedback on summarization quality during the beta test was far too general and uneven. We recruited two users to join our technology team and become informed rather than typical naive users. They designed an analysis tool with which they catalog problematic machine-generated summaries in a database and assign them to error-type categories. Though we expected users to address issues like summary coherence, they have identified categories like the following:
* sentence identification errors
* formatting errors
* sentence extraction due to the &quot;rare&quot; word phenomenon
* sentence extraction in &quot;long&quot; documents
* failure to identify abstracts when available
We expect that this approach can complement a technology-driven one by helping us prioritize the changes we need based on methodical data collection and analysis.</Paragraph> </Section> </Section>
<Section position="5" start_page="62" end_page="62" type="metho"> <SectionTitle> 4.0 Summary </SectionTitle>
<Paragraph position="0"> Our experience with text summarization technology in use has been quite sobering. In this paper, we have shown how beta testing an emerging technology has helped us understand that, for technology to enhance job performance, many factors besides the algorithm need to be addressed.</Paragraph> </Section> </Paper>