<?xml version="1.0" standalone="yes"?> <Paper uid="N04-3001"> <Title>Columbia Newsblaster: Multilingual News Summarization on the Web</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Extracting article data </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Extracting article text </SectionTitle> <Paragraph position="0"> To move Columbia Newsblaster into a multilingual-capable environment, we must be able to extract the &quot;article text&quot; from web pages in multiple languages. The article text is the portion of a web page that contains the actual news content of the page, as opposed to site navigation links, ads, layout information, etc. Our previous approach to extracting article text in Columbia Newsblaster used regular expressions that were hand-tailored to specific web sites. Adapting this approach to new web sites is difficult, and adapting it to foreign-language sites is harder still. We solved this problem by incorporating a new article extraction module based on machine learning. The new module parses HTML into blocks of text based on HTML markup and computes a set of 34 features from simple surface characteristics of the text. We use features such as the percentage of text that is punctuation, the number of HTML links in the block, the percentage of question marks, the number of characters in the text block, and so on. Since the features are relatively language-independent, they can be computed for and applied to any language.</Paragraph> <Paragraph position="1"> Training data for the system is generated using a GUI that allows a human to annotate text candidates with one of five labels: &quot;ArticleText&quot;, &quot;Title&quot;, &quot;Caption&quot;, &quot;Image&quot;, or &quot;Other&quot;. The &quot;ArticleText&quot; label is associated with the actual text of the article, which we wish to extract. At the same time, we try to determine document titles, image caption text, and image blocks in the same framework. &quot;Other&quot; is a catch-all category for all other text blocks, such as links to related articles, navigation links, ads, and so on. The training data is used with the machine learning program Ripper (Cohen, 1996) to induce a hypothesis for categorizing text candidates according to the features.</Paragraph> <Paragraph position="2"> This approach has been trained on web pages from sites in English, Russian, and Japanese as shown in Table 1, but has been used with sites in English, Russian, Japanese, Chinese, French, Spanish, German, Italian, Portuguese, and Korean.</Paragraph> <Paragraph position="3"> The English training set was composed of 353 articles collected from 19 web sites. Using 10-fold cross-validation, the induced hypothesis classifies blocks into the article text category with a precision of 89.1% and a recall of 90.7%. Performance over Russian data was similar, with a precision of 90.59% and a recall of 95.06%. We evaluated the English hypothesis against the Russian data to observe whether the languages behave differently. As expected, the English hypothesis resulted in poor performance over the Russian data, and we saw comparable results for Japanese. The same English hypothesis performs adequately on other English sites not in the training set, so the differences between languages appear to be significant.</Paragraph> </Section>
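For illustration, the sketch below shows the kind of language-independent surface features and block-level classification described above. The feature set is only a small subset of the 34 features, and the scikit-learn decision tree stands in for Ripper; both are assumptions for illustration rather than the actual implementation.

```python
# Illustrative sketch of surface-feature extraction for article-text
# classification. The feature names and the decision-tree learner
# (standing in for Ripper) are assumptions, not the actual system.
import re
from sklearn.tree import DecisionTreeClassifier

def block_features(text, num_links):
    """Compute simple, language-independent surface features for one
    HTML text block (a small subset of the 34 features described)."""
    length = max(len(text), 1)
    punctuation = sum(1 for c in text if not c.isalnum() and not c.isspace())
    return [
        punctuation / length,           # fraction of punctuation characters
        text.count("?") / length,       # fraction of question marks
        num_links,                      # number of HTML links in the block
        len(text),                      # number of characters in the block
        len(re.findall(r"\S+", text)),  # rough token count
    ]

def train_article_classifier(blocks, labels):
    """blocks: list of (text, num_links) pairs; labels: "ArticleText",
    "Title", "Caption", "Image", or "Other" for each block."""
    X = [block_features(text, links) for text, links in blocks]
    clf = DecisionTreeClassifier()
    clf.fit(X, labels)
    return clf
```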
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Title and date extraction </SectionTitle> <Paragraph position="0"> The article extraction component also determines a title for each document and attempts to locate a publication date for each article. Title identification is important because, in a cluster that can contain as many as 60 articles, the titles are often the only information the user sees; if our system chooses poor titles, the user will have a difficult time discriminating between the articles. If the article extraction component finds a title, it is used. Unfortunately, this process is not always successful, so we have a variety of fall-back methods, including taking the title from the HTML TITLE tag, using heuristics to detect the title from the first text block, and using a portion of the first sentence. Since these fall-back methods were developed for English news, they produced many uninformative titles when applied to non-English sites. We therefore implemented a method, applicable to non-English text as well, for identifying titles that are clearly non-descriptive, such as &quot;Stock Market News&quot;. We record the titles seen and rejected over time and use this list to reject titles that recur with high frequency. A high-frequency title is assumed not to be descriptive enough to give a clear idea of the content of an article within a cluster of similar articles.</Paragraph> <Paragraph position="1"> To extract dates for articles, we use heuristics to identify sequences of possible dates, weight them, and choose the most likely candidate as the publication date. Regular expressions for Japanese date extraction were also added to the system.</Paragraph> </Section> </Section>
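A minimal sketch of the frequency-based title filter described above follows; the counting scheme and the rejection threshold are assumed for illustration and are not the values used in the system.

```python
# Minimal sketch of frequency-based rejection of non-descriptive titles.
# The threshold is an assumed parameter, not the value used in the system.
from collections import Counter

class TitleFilter:
    def __init__(self, max_count=3):
        self.seen = Counter()      # titles observed over time
        self.max_count = max_count

    def is_descriptive(self, title):
        """Reject titles that recur too often (e.g. "Stock Market News"):
        a high-frequency title is unlikely to describe one specific story."""
        key = title.strip().lower()
        self.seen[key] += 1
        return self.seen[key] <= self.max_count
```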
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Multilingual Clustering </SectionTitle> <Paragraph position="0"> The document clustering system that we use (Hatzivassiloglou et al., 2000) has been trained on, and extensively tested with, English. While it can cluster documents in other languages, our goal is to generate clusters containing documents from multiple languages, so a baseline approach is to translate all non-English documents into English and then cluster the translated documents. We take this approach, and we have also experimented with simple, fast techniques for glossing the input articles for clustering. We developed simple dictionary-lookup glossing systems for Japanese and Russian. Our experiments showed that full translation using Systran outperformed our glossing-based techniques, so the glossing techniques are not used in the current system.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Multilingual Summarization Baseline </SectionTitle> <Paragraph position="0"> Our baseline approach to multilingual multi-document summarization is to apply our English-based summarization system, the Columbia Summarizer (McKeown et al., 2001), to document clusters containing machine-translated versions of non-English documents. The Columbia Summarizer routes each cluster to one of two multi-document summarization systems based on the similarity of the documents in the cluster. If the documents are highly similar, the Multigen summarization system (McKeown et al., 1999) is used. Multigen clusters sentences based on similarity, and then parses and fuses information from similar sentences to form a summary.</Paragraph> <Paragraph position="1"> The second summarization system is DEMS, the Dissimilarity Engine for Multi-document Summarization (Schiffman et al., 2002), which takes a sentence extraction approach to summarization. The resulting summary is then run through a named entity recovery tool (Nenkova and McKeown, 2003), which repairs named entity references in the summary by making the first reference descriptive and shortening subsequent mentions. With an unmodified version of DEMS, summaries might contain sentences from translated documents that are not grammatically correct. We therefore modified DEMS to prefer choosing a sentence from an English article when sentences expressing similar content are available in multiple languages. By setting different weight penalties, we can take the quality of the translated text into account.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Similarity-based Summarization </SectionTitle> <Paragraph position="0"> As part of our multilingual summarization work, we are investigating approaches to summarization that use sentence-level similarity computation across languages to cluster sentences by similarity, and then generate a summary sentence using translated portions of the relevant sentences. The multilingual version of Columbia Newsblaster provides us with a platform to frame future experiments for this summarization technique. We are investigating translation at different levels: sentence level, clause level, and phrase level. Our initial similarity-based summarization system works at the sentence level. Starting with machine-translated sentences, we compute their similarity to English sentences that have been simplified (Siddharthan, 2002). Foreign-language sentences that are sufficiently similar to English text are replaced by (or augmented with) the similar English sentence.</Paragraph> <Paragraph position="1"> This first system, which uses full machine translation of the sentences and English-based similarity detection, will be extended with simple features for multilingual similarity detection in SimFinder MultiLingual (SimFinderML), a multilingual version of SimFinder (Hatzivassiloglou et al., 2001). We also plan an experiment evaluating the usefulness of noun phrase detection and noun phrase variant detection as a primitive for multilingual similarity detection, using tools such as Christian Jacquemin's FASTR (Jacquemin, 1994; Jacquemin, 1999).</Paragraph> </Section>
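As a rough illustration of the sentence-level substitution step just described: the bag-of-words cosine measure, the similarity threshold, and the function names below are placeholders for illustration, not SimFinder's actual similarity computation.

```python
# Sketch of the sentence-level replacement step: compare machine-translated
# sentences to simplified English sentences and substitute the English
# sentence when the two are similar enough. The bag-of-words cosine and the
# threshold are illustrative stand-ins for the actual similarity computation.
import math
from collections import Counter

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * \
           math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def substitute_similar_sentences(translated_sentences, simplified_english,
                                 threshold=0.5):
    """Replace each machine-translated sentence with the most similar
    simplified English sentence when similarity exceeds the threshold."""
    output = []
    for sent in translated_sentences:
        scored = [(cosine(sent, eng), eng) for eng in simplified_english]
        best_score, best_eng = max(scored, default=(0.0, None))
        if best_eng is not None and best_score >= threshold:
            output.append(best_eng)   # use the similar English sentence
        else:
            output.append(sent)       # keep the machine-translated sentence
    return output
```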
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Summary presentation </SectionTitle> <Paragraph position="0"> Multilingual Newsblaster presents multiple views of a cluster of documents to the user, broken down by language and by country. Summaries are generated for the entire cluster, as well as for subsets of the articles based on the country of origin and the language of the original articles. Users are first presented with a summary of the entire cluster using all documents, and can then focus on countries or languages of their choosing. We also allow users to view two summaries side by side so that they can easily compare differences between summaries from different countries. For example, figure 4.2 shows a summary of articles about talks between America, Japan, and Korea over nuclear arms, comparing the summaries generated from articles in English and German.</Paragraph> </Section> </Section> </Paper>