<?xml version="1.0" standalone="yes"?> <Paper uid="N04-3001"> <Title>Columbia Newsblaster: Multilingual News Summarization on the Web</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The Columbia Newsblaster system has been online and providing summaries of topically clustered news daily since late 2001 (McKeown et al., 2002). The goal of the system is to aid daily news browsing by providing automatic, user-friendly access to important news topics, along with summaries and links to the original articles for further information. The system has six major phases: crawling, article extraction, clustering, summarization, classification, and web page generation.</Paragraph> <Paragraph position="1"> This paper presents the entire multilingual Columbia Newsblaster system as a platform for multilingual multi-document summarization experiments. The phases in the multilingual version of Columbia Newsblaster have been modified to take language and character encoding into account, and a new phase, translation, has been added. Figure 1 depicts the multilingual Columbia Newsblaster architecture. We describe the system, focusing on a machine learning method for extracting article text from web pages that is applicable to different languages, and on a baseline approach to multilingual multi-document summarization.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Related Research </SectionTitle> <Paragraph position="0"> Previous work in multilingual document summarization includes the SUMMARIST system (Hovy and Lin, 1999).</Paragraph> <Paragraph position="1"> SUMMARIST extracts sentences from documents in a variety of languages and translates the resulting summary. 
This system has been applied to information retrieval in the MuST System (Lin, 1999), which uses query translation to allow a user to search for documents in a variety of languages, summarize the documents using SUMMARIST, and translate the summary. The Keizai system (Ogden et al., 1999) uses query translation to allow users to search Japanese and Korean documents in English, and displays query-specific summaries focusing on passages containing query terms. Our work differs in the document clustering component: we cluster news to provide emergent topic structure from the data, instead of using an information retrieval model. This is useful in analysis, monitoring, and browsing settings, where a user does not have an a priori topic in mind. Our summarization strategy also differs from the approach taken by MuST: we focus our effort on the summarization system but target only a single language, shifting the majority of the multilingual knowledge burden to specialized machine translation systems. The Keizai system has the advantage of being able to generate query-specific summaries.</Paragraph> <Paragraph position="2"> Chen and Lin (2000) describe a system that combines multiple monolingual news clustering components, a multilingual news clustering component, and a news summarization component. Their system first clusters news in each language into topics; the multilingual clustering component then relates clusters that are similar across languages. A summary is generated by linking similar sentences from the two languages. The system has been implemented for Chinese and English, and an evaluation over six topics is presented. Our clustering strategy differs: we translate documents before clustering and cluster documents from all languages at the same time. This makes it easy to add support for an additional language by incorporating a new translation system for that language; no other changes need to be made. 
Our summarization model also provides summaries for documents from each language, allowing comparisons between them.</Paragraph> </Section> </Section> </Paper>