File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/n03-4008_intro.xml

Size: 3,629 bytes

Last Modified: 2025-10-06 14:01:42

<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-4008">
  <Title>Columbia's Newsblaster: New Features and Future Directions</Title>
  <Section position="2" start_page="0" end_page="1" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Columbia's Newsblaster provides news updates on a daily basis from news published on the Internet; it crawls news sites, categorizes stories into six broad areas, groups news into stories on the same event, and generates a summary of the multiple articles describing each event. In addition to demonstrating the robustness of current summarization and tracking technology, Newsblaster also serves as a research environment in which we explore new directions and problems. Currently, we are exploring the tasks of multilingual summarization, where input sources are drawn from multiple languages and a summary is generated in English on the same event (Figure 1); tracking events across days and generating summaries that update the user on what is new; and editing generated summaries to improve fluency and accuracy. Our focus here is on editing references to people, improving coherency of the summary and ensuring that references are accurate. Editing is particularly important as we add multilingual capabilities, given the errors inherent in machine translation. The multilingual version of Columbia Newsblaster is built upon the English version of Columbia Newsblaster, sharing the same structure and components. To add multilingual capability, the system first crawls web sites in foreign languages, and stores both the language and encoding for the files. To extract the article text from the HTML pages, we use a new article extraction component that uses language-independent statistical features computed over text blocks along with a machine learning component to classify text blocks as one of "Article Text", "Title", "Image", "Image Caption", or "Other". The article extraction component has been trained and tested on English, Japanese, and Russian data, but is also being successfully applied to French, Spanish, German, and Italian data. We plan to train the article extractor on other languages (Chinese, Arabic, Korean, Spanish, German, French, etc.) in the near future.</Paragraph>
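    The block-level features mentioned above can be illustrated with a minimal Python sketch. The specific features (character count, punctuation ratio, digit ratio, link-text ratio) and the names TextBlock and block_features are assumptions chosen for illustration; the text does not list the features or the learning algorithm actually used, and the classifier itself is not shown.

```python
from dataclasses import dataclass
import unicodedata


@dataclass
class TextBlock:
    """A contiguous run of text pulled from an HTML page (hypothetical type)."""
    text: str
    link_chars: int = 0  # number of characters that sit inside anchor tags


def block_features(block: TextBlock) -> dict[str, float]:
    """Compute statistics that do not rely on any language-specific lexicon."""
    chars = [c for c in block.text if not c.isspace()]
    total = len(chars)
    if total == 0:
        return {"length": 0.0, "punct_ratio": 0.0,
                "digit_ratio": 0.0, "link_ratio": 0.0}
    # Unicode general categories starting with "P" cover punctuation in any script.
    punct = sum(1 for c in chars if unicodedata.category(c).startswith("P"))
    digits = sum(1 for c in chars if c.isdigit())
    return {
        "length": float(total),
        "punct_ratio": punct / total,
        "digit_ratio": digits / total,
        "link_ratio": min(block.link_chars, total) / total,
    }
```

    A feature dictionary of this kind could then be passed to any standard classifier to assign one of the five block labels; which learner fits best is left open here.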
    <Paragraph position="1"> To cluster multilingual documents with English documents, we use the existing Newsblaster English document clustering module. Non-English documents are translated for clustering after the article extraction phase.</Paragraph>
    <Paragraph position="2"> Where available, we use simple and fast document translation techniques for clustering, since each run potentially processes thousands of documents per language. For Japanese and Russian, we have developed simple dictionary lookup techniques that translate documents for clustering; for other languages we use an interface to the Systran translation system via Babelfish. We plan on adding Arabic translation to the system in the near future.</Paragraph>
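    Dictionary lookup translation for clustering can be sketched as a word-by-word gloss. The sketch below assumes whitespace-tokenized input and a toy bilingual dictionary; the function name gloss_translate and the dictionary entries are hypothetical, and Japanese input would need a word segmenter before this step.

```python
def gloss_translate(tokens: list[str], dictionary: dict[str, str]) -> list[str]:
    """Replace each source token with its dictionary gloss when one exists.

    Unknown tokens are kept unchanged. That is an assumption of this sketch;
    dropping them is an equally plausible choice.
    """
    return [dictionary.get(tok.lower(), tok) for tok in tokens]


# Toy Russian-to-English dictionary (illustrative entries only).
RU_EN = {"президент": "president", "встреча": "meeting", "в": "in"}

glossed = gloss_translate("Президент встреча в Москва".split(), RU_EN)
```

    The output is a rough bag of English words rather than fluent text, which is typically enough for a clustering module that compares word overlap and keeps the cost per document low.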
    <Paragraph position="3"> Summarization is performed using the same summarization strategies as in the English Newsblaster. We are experimenting with different methods for improving summary quality when the translated text is noisy. For example, when an input cluster contains both English and foreign sources, we weight the English sources higher in cases where we determine that they are representative of both the English and foreign input documents. We are also experimenting with methods for determining similarity across documents using different levels of translation.</Paragraph>
  </Section>
</Paper>