<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0406"> <Title>Text Summarizer in Use: Lessons Learned from Real World Deployment and Evaluation</Title>
<Section position="3" start_page="49" end_page="62" type="intro"> <SectionTitle> 3.0 Overview </SectionTitle>
<Paragraph position="0"> This paper presents a user study of a summarization system and provides insights on a number of technical issues relevant to the summarization R&D community that arise in the context of use, concerning technology performance and user support. We describe the initial stages in the insertion of the SRA summarizer, in which (1) a large-scale beta test was conducted, and (2) analysis of tool usage data, user surveys and observations, and user requirements is leading to system enhancements and more effective summarization technology insertion. In our user study, we begin with a brief description of the task and technology (3.1). We then describe the beta test methodology (3.2) and the analysis of tool usage data (3.3). We focus on what we learned through our user-centered approach about how technology performance in a task and user support affect user acceptance (3.4), what significant technology-related modifications resulted, and what studies are in progress to measure tool efficacy, summarization effectiveness, and the impact of training on tool use (3.5).</Paragraph>
<Paragraph position="1"> Though work to enhance the text summarization system is underway, we focus in this paper on user-centered issues. Our work is predicated on the belief that there is no substitute for user-generated data to guide tool enhancement.</Paragraph>
<Section position="1" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 3.1 Task and Technology </SectionTitle>
<Paragraph position="0"> The task is indicative. Our users rely on machine-generated summaries (single document, either generic or query-based, with user adjustment of compression rates) to judge the relevance of full documents to their information need. As an information analyst, our typical user routinely scans summaries to stay current with fields of interest and to enhance domain knowledge. This scanning task is one of many jobs an analyst performs to support report writing for customers in other Government agencies. Our goal is to generate summaries that accelerate eliminating or selecting documents without misleading the user or causing the user to access the original text unnecessarily.</Paragraph>
<Paragraph position="1"> The system in this user study is a version of the SRA sentence extraction system described in Aone et al. (1997, 1998, 1999).</Paragraph>
<Paragraph position="2"> Users retrieve documents from a database of multiple text collections of reports and press. Documents are generally written in a journalistic style and average 2,000 characters in length. The number of documents in a batch may vary from a few to hundreds. Batches of retrieved texts may be routinely routed to our summary server or uploaded by the user. The system is web-based and provides the capability to tailor summary output by creating multiple summary set-ups. User options include: the number of sentences viewed, the summary type applied and sorting, other information viewed (e.g., title, date), and the high-frequency document terms and named entities viewed. Users can save, print, or view full-text originals with summaries appended. Viewed originals highlight extracted sentences.</Paragraph>
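<Paragraph> To give readers a concrete sense of the kind of sentence-extraction summarization at issue here, the sketch below (Python) implements a minimal generic versus query-based extractor with an adjustable number of output sentences. It is only an illustration of the general approach, not the SRA algorithm (see Aone et al. 1997, 1998, 1999 for that); the function names, stopword list, and scoring heuristics are our own simplified assumptions.

import re
from collections import Counter

# A tiny stopword list; a deployed system would use a fuller resource.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "for", "on", "that", "with"}

def tokenize(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def summarize(document, num_sentences=3, query=None):
    """Return the num_sentences highest-scoring sentences in document order.

    With query=None a generic summary is produced (frequent-term density);
    otherwise sentences are scored by overlap with the query terms.
    """
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", document)]
    doc_freq = Counter(tokenize(document))
    query_terms = set(tokenize(query)) if query else None

    def score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        if query_terms is None:
            # Generic: average document frequency of the sentence's terms.
            return sum(doc_freq[t] for t in tokens) / len(tokens)
        # Query-based: number of distinct query terms the sentence contains.
        return len(set(tokens).intersection(query_terms))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])  # restore original document order
    return [sentences[i] for i in keep]

if __name__ == "__main__":
    doc = ("The summary server receives batches of retrieved documents. "
           "Users tailor summary set-ups such as summary type and length. "
           "Extracted sentences are highlighted when the original document is viewed. "
           "Weather in the region was mild this week.")
    print(summarize(doc, num_sentences=2))                           # generic
    print(summarize(doc, num_sentences=2, query="summary set-ups"))  # query-based
</Paragraph>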
<Paragraph position="3"> All system use is voluntary. Our users are customers and, if dissatisfied, may elect to scan data without our technology.</Paragraph> </Section>
<Section position="2" start_page="49" end_page="50" type="sub_section"> <SectionTitle> 3.2 Beta Test Methodology </SectionTitle>
<Paragraph position="0"> In the fall of 1998, more than 90 users were recruited, primarily through an IR system news group, and provided access to the SRA summarizer to replace their full-text review process of scanning concatenated files.</Paragraph>
<Paragraph position="1"> Procedural (how-to) training was optional, but approximately 70 users opted to receive a one-on-one, hands-on demonstration (about forty-five minutes in length) on texts that the new user had retrieved. The beta testing took place over a six-month period. With no stipulation on the length of participation, many users simply tried out the system a limited number of times. Initial feedback gave us a clear picture of the likelihood of continued use. Our relatively low retention rate highlighted the fact that the experimental conditions in previous summary experiments may be misleading and may mask factors that do not surface until users use a system in their daily work in a real-world setting.</Paragraph> </Section>
<Section position="3" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 3.3 Analysis of Tool Usage Data </SectionTitle>
<Paragraph position="0"> Usage data were collected for all system users and analyzed through web logs. These logs were a record of what users did on their actual work data. For each user, our logs provided a rich source of information: the number of summary batches, the number of documents in each batch, whether documents were viewed, and the set-up features used, i.e., summary type, summary lines viewed, number of indicator (high-frequency signature term) lines viewed, number of entity (person, place, organization) lines viewed, and query terms. Table 1 below illustrates representative data collected, questions of interest, and findings.</Paragraph> </Section>
<Section position="4" start_page="50" end_page="50" type="sub_section"> <SectionTitle> Table 1: Questions, Data, and Findings </SectionTitle>
<Paragraph position="0"> Question: Were documents summarized? Data: number of summary events. Finding: Users routinely accessed our system to read machine-generated summaries.</Paragraph>
<Paragraph position="1"> Question: Did users actually tailor the system? Data: number of current set-ups. Finding: Most users did not appear to fully exploit the flexibility of the system; the beta test population had a median of only two set-up types active.</Paragraph>
<Paragraph position="2"> Question: Did the users select generic or query-based summaries? Data: type of summary. Finding: Usage data indicated that about half the population selected generic and the other half query-based summaries. (Note: the default set-up was the generic summarization.)</Paragraph>
<Paragraph position="3"> Question: Is there a difference among summary types in the number of sentences viewed? Data: number of sentences viewed by summary type (generic, query-based, lead). Finding: The hypothesis of an equal median number of sentences available for viewing was tested; the number of sentences viewed with the generic summary type (3) is significantly different from either query-based (5) or lead (6).</Paragraph>
<Paragraph position="4"> Question: Do users choose to use indicators and entities when tailoring browsing capability? Data: indicator/entity preferences for non-default set-ups (on or off). Finding: Users tended to retain indicator and entity preferences when tailoring capabilities (but users generally modified a default set-up in which both preferences have a line viewed).</Paragraph>
<Paragraph position="5"> Question: Does training make a difference on system use or user profile type? Data: training and use data; users were categorized (advanced, intermediate, novice) on the basis of usage features with Hartigan's K-Means clustering algorithm. Finding: A chi-squared test for independence between training and use reflected a significant relationship (p-value close to 0), i.e., training did impact the user's decision to use the system. However, training did not make a difference across the three user profile types: a Fisher Exact test on a 3x2 contingency table revealed that the relative numbers of trained and untrained users at the three user profile types were the same (p-value = 0.1916), i.e., training and profile type are independent.</Paragraph>
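<Paragraph> To make the training analysis above concrete, the sketch below (Python, using scipy) re-creates the two association tests on contingency tables. The counts shown are hypothetical placeholders, since only the resulting p-values are reported above, and because scipy's fisher_exact handles only 2x2 tables, a Monte Carlo permutation test of our own (permutation_test below) stands in for the Fisher Exact test on the 3x2 profile-by-training table.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = trained / untrained, columns = used / did not use.
training_by_use = np.array([[60, 10],
                            [8, 15]])
chi2, p, dof, expected = chi2_contingency(training_by_use)
print(f"training x use: chi2 = {chi2:.2f}, p = {p:.4g}")

# Hypothetical 3x2 table: rows = advanced / intermediate / novice profile,
# columns = trained / untrained.
profile_by_training = np.array([[20, 4],
                                [25, 10],
                                [15, 9]])

def permutation_test(table, n_perm=10000, seed=0):
    """Monte Carlo test of independence that conditions on both margins by
    shuffling column labels over the individual observations and comparing
    the chi-squared statistic of each shuffled table to the observed one."""
    rng = np.random.default_rng(seed)
    rows = np.repeat(np.arange(table.shape[0]), table.sum(axis=1))
    cols = np.concatenate([np.repeat(np.arange(table.shape[1]), r) for r in table])
    observed, _, _, _ = chi2_contingency(table, correction=False)
    hits = 0
    for _ in range(n_perm):
        perm = np.zeros_like(table)
        for r, c in zip(rows, rng.permutation(cols)):
            perm[r, c] += 1
        stat, _, _, _ = chi2_contingency(perm, correction=False)
        hits += stat >= observed
    return (hits + 1) / (n_perm + 1)

print("profile x training: permutation p =", permutation_test(profile_by_training))
</Paragraph>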
<Paragraph position="6"> As we began to analyze the data, we realized that we had only a record of use and were not sure what motivated the use patterns. Therefore, the team supplemented tool usage data with an on-line survey and one-on-one observations to help us understand and analyze user behavior. These additional data points motivated much of our work described in 3.5. Throughout the six-month cycle we also collected and categorized user requirements.</Paragraph> </Section>
<Section position="5" start_page="50" end_page="62" type="sub_section"> <SectionTitle> 3.4 Insights on Text Summarization </SectionTitle>
<Paragraph position="0"> 3.4.1 Technology Performance. Insight 1: For user acceptance, technology performance must go beyond a good summary. It requires an understanding of the users' work practices.</Paragraph>
<Paragraph position="1"> We learned that many factors in the task environment affect technology performance and user acceptance.</Paragraph>
<Paragraph position="2"> Underpinning much work in summarization is the view that summaries are time savers. Mani et al. (1999) report that summaries at a low compression rate reduced decision-making time by 40% (categorization) and 50% (ad hoc), with relevance assessments almost as accurate as with the full text. Although evaluators acknowledge the role of data presentation (e.g., Firmin and Chrzanowski, 1999; Merlino and Maybury, 1999), most studies use summary system output as the metric for evaluation. The question routinely posed seems to be "Do summaries save the user time without loss in accuracy?" However, we confirmed the observations of McKeown et al. (1998) on the integration of summarization and retrieval technologies and learned that users are not likely to consider using summaries as a time saver unless the summaries are efficiently accessed. For our users, a tight coupling of retrieval and summarization is a prerequisite. Batches automatically routed to the summary server and available for user review were preferred over those requiring the user to upload files for summarization. Users pointed out that the uploading took more time than they were willing to spend.</Paragraph>
<Paragraph position="3"> User needs and work practices often constrain how technology is applied.</Paragraph>
<Paragraph position="4"> For example, McKeown et al. (1998) focused on the needs of physicians who want to examine only data for patients with characteristics similar to their own patients', and Wasson (1998) focused on the needs of news information customers who want to retrieve documents likely to be on-topic.</Paragraph>
<Paragraph position="6"> We too discovered that user needs affect their interest in summarization technology, but from a more general perspective. Text REtrieval Conferences (e.g., Harman, 1996) have baselined system performance in terms of two types of tasks: routing and ad hoc. In our environment, the ad hoc users were less likely to want a summary. They simply wanted an answer to a question and did not want to review summaries. If too many documents were retrieved, they would simply craft a more effective query.</Paragraph>
<Paragraph position="7"> Measuring efficiency gains with a real population was quite problematic for technology in use. We faced a number of challenges. Note that in experimental conditions, subjects perform on both full and reduced versions. One challenge was to baseline non-intrusively the current (non-summary) full-text review process. A second was to measure both accuracy and efficiency gains for users performing on the job. These challenges were further exacerbated by the fact that users in an indicative task primarily use a summary to eliminate most documents.</Paragraph>
<Paragraph position="8"> They have developed effective skimming and scanning techniques and are already quite efficient at this task.</Paragraph>
<Paragraph position="9"> In short, our experience showed that technologists deploying a single-document summarization capability are likely to be constrained by the following factors: the ease of technology use, the type of user information need, and how effectively the user performs the task without the technology.</Paragraph>
<Paragraph position="10"> Insight 2: Users require more than just a good summary. They require the right level of technology support. Although the bulk of the research work still continues to focus on summarization algorithms, we now appreciate the importance of user support to text summarization use.</Paragraph>
<Paragraph position="11"> The SRA software was quite robust and fast. The task of judging relevance with a summary (even a machine-generated one) instead of the full-text version does not require a user to acquire a fundamentally different work practice. Yet our system apparently was not sufficiently supporting tool navigation. One of the reasons was that our on-line help was not developed from a user perspective and was rarely accessed. Another was that browse and view features did not maximize performance. For example, the interface employed a scroll bar for viewing summaries rather than more effective Next or Previous buttons. Users frequently asked the same questions, but we were answering them individually.</Paragraph>
<Paragraph position="12"> Terminology that was clear to the technologists was not understood by users. We also noticed that, though there were requirements for improvement of summarization quality, many requirements were associated with these user support issues.</Paragraph>
<Paragraph position="13"> One of the more unexpected findings was the under-utilization of tailoring features. The system offered the user many ways to tailor summaries to their individual needs, yet most users simply relied on default set-ups. Observations revealed little understanding of the configurable features and how these features corresponded to user needs, to say nothing of how the algorithm worked.
Some users did not understand the difference between the two summary types or the sorting effects of query-based summary selection.</Paragraph>
<Paragraph position="14"> Non-traditional summary types (indicators and named entities) did not appear to help render a relevance judgment. We came to understand that just because technologists see the value of these features does not mean that a user will, or that the features, in fact, have utility.</Paragraph>
</Section> </Section> </Paper>