<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3256"> <Title>Multi-document Biography Summarization</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Corpus Description </SectionTitle> <Paragraph position="0"> In order to extract information related to a person from a large set of news texts not written exclusively about this person, we need to identify attributes shared among biographies.</Paragraph> <Paragraph position="1"> Biographies share certain standard components.</Paragraph> <Paragraph position="2"> We annotated a corpus of 130 biographies of 12 people (activists, artists, leaders, politicians, scientists, terrorists, etc.). We found 9 common elements: bio (info on birth and death), fame factor, personality, personal, social, education, nationality, scandal, and work. Collected biographies are appropriately marked at the clause level with one of the nine tags in XML format, for example: Martin Luther King <nationality> was born in Atlanta, Georgia </nationality>. ... He <bio>was assassinated on April 4, 1968 </bio>. ... King <education> entered the Boston University as a doctoral student </education>. ...</Paragraph> <Paragraph position="3"> In all, 3579 biography-related phrases were identified and recorded for the collection, among them 321 bio, 423 fame, 114 personality, 465 personal, 293 social, 246 education, 95 nationality, 292 scandal, and 1330 work. We then used 100 biographies for training and 30 for testing the classification module.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Sentence Classification </SectionTitle> <Paragraph position="0"> Three main points about human summarizing practice are relevant to aiding the automation process (Sparck Jones, 1993). The first is a strong emphasis on particular purposes, e.g., abstracting or extracting articles of particular genres. 
The second is the drafting, writing, and revision cycle in constructing a summary. Essentially as a consequence of these first two points, the summarizing process can be guided by the use of checklists. The idea of a checklist is especially useful for generating biographical summaries because a complete biography should cover various aspects of a person's life. From a careful analysis conducted while constructing the biography corpus, we believe that the checklist is common to all persons in question and consists of the 9 biographical elements introduced in Section 3.</Paragraph> <Paragraph position="1"> The task of fulfilling the biography checklist becomes a classification problem. Classification is defined as a task of classifying examples into one of a discrete set of possible categories (Mitchell, 1997). Text categorization techniques have been used extensively to improve the efficiency of information retrieval and organization. Here the problem is that sentences, from a set of documents, need to be categorized into different biography-related classes.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Task Definitions </SectionTitle> <Paragraph position="0"> We designed two classification tasks: 1) 10-Class: Given one or more texts about a person, the module must categorize each sentence into one of ten classes. The classes are the 9 biographical elements plus a class called none that collects all sentences without biographical information. This fine-grained classification task will be beneficial in generating comprehensive biographies on people of interest. The classes are: bio, fame, personality, personal, social, education, nationality, scandal, work, and none. 2) 2-Class: The module must make a binary decision of whether the sentence should be included in a biography summary. The classes are: bio and none. The label bio appears in both task definitions but bears different meanings. 
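The relation between the two label sets can be sketched concretely (a Python sketch; the label names follow Section 3, and the collapsing rule is how the 2-Class bio label "sums up" the nine elements):

```python
# The nine biographical elements from Section 3; any other label
# (i.e. "none") carries no biographical information.
BIO_ELEMENTS = {"bio", "fame", "personality", "personal", "social",
                "education", "nationality", "scandal", "work"}

def to_two_class(label: str) -> str:
    """Collapse a 10-Class label into the 2-Class scheme."""
    return "bio" if label in BIO_ELEMENTS else "none"
```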
Under 10-Class, class bio contains information on a person's birth or death, and under 2-Class it sums up all 9 biographical elements from the 10-Class.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Machine Learning Methods </SectionTitle> <Paragraph position="0"> We experimented with three machine learning methods for classifying sentences.</Paragraph> <Paragraph position="1"> Naive Bayes The Naive Bayes classifier is among the most effective algorithms known for learning to classify text documents (Mitchell, 1997), calculating explicit probabilities for hypotheses. Support Vector Machine The support vector machine (SVM) has been shown to be an effective classifier in text categorization. We extend the idea of classifying documents into predefined categories to classifying sentences into one of the two biography categories defined by the 2-Class task. Sentences are categorized based on their biographical saliency (the percentage of clearly identified biography words) and their non-biographical saliency (the percentage of clearly identified non-biography words). We used LIBSVM (Chang and Lin, 2003) for training and testing.</Paragraph> <Paragraph position="2"> Decision Tree (C4.5) In addition to SVM, we also used a decision-tree algorithm, C4.5 (Quinlan, 1993), with the same training and testing data as SVM.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Classification Results </SectionTitle> <Paragraph position="0"> The lower performance bound is set by a baseline system that randomly assigns a biographical class to each sentence, for both 10-Class and 2-Class. Table 1 shows the results of 10-Class classification on the 2599 testing sentences, using the Naive Bayes classifier. 
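A minimal sketch of such a sentence classifier (plain multinomial Naive Bayes with add-one smoothing, in Python; the toy tokens and labels below are illustrative assumptions, not the paper's actual feature sets):

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Collect class priors and per-class word counts from
    (token_list, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_nb(tokens, model):
    """Pick the label maximizing log P(label) + sum of log P(word|label),
    with add-one (Laplace) smoothing over the vocabulary."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_lp = None, float("-inf")
    for label, count in class_counts.items():
        lp = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label
```

With clause-level training data such as the corpus of Section 3, each sentence's tokens would be scored against the ten classes (or the two classes of the 2-Class task).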
Part-of-speech (POS) information (Brill, 1995) and word stems (Lovins, 1968) were used in some feature sets.</Paragraph> <Paragraph position="1"> We bootstrapped 10395 more biography-indicating words by recording the immediate hypernyms, using WordNet (Fellbaum, 1998), of the words collected from the controlled biography corpus described in Section 3. These words are called Expanded Unigrams, and their frequency scores are reduced to a fraction of the original word's frequency score.</Paragraph> <Paragraph position="2"> Some sentences in the testing set were labeled with multiple biography classes because the original corpus was annotated at the clause level. Since the classification was done at the sentence level, we relaxed the evaluation program to count a hit when any of the labeled classes was matched. This is shown in Table 1 as the Relaxed cases.</Paragraph> <Paragraph position="3"> A closer look at the instances where false negatives occur indicates that the classifier mislabeled instances of class work as instances of class none. To correct this error, we created a list of 5516 work-specific words, hoping that this would set a clearer boundary between the two classes. However, performance did not improve.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 2-Class Classification </SectionTitle> <Paragraph position="0"> All three machine learning methods were evaluated on classifying between the 2 classes. The results are shown in Table 2. The testing data is slightly skewed, with 68% of the sentences being none.</Paragraph> <Paragraph position="1"> In addition to using marked biographical phrases as training data, we also expanded the tagged spans to sentence boundaries. 
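The Expanded Unigram scoring described above could be sketched as follows (Python; the hypernym table and the 0.1 discount are illustrative assumptions — the paper only says scores are reduced to a fraction of the original):

```python
def expand_unigrams(word_freqs, hypernyms, discount=0.1):
    """Add each word's immediate hypernyms (e.g. as looked up in
    WordNet) to the lexicon, crediting each hypernym with a fraction
    of the source word's frequency score."""
    expanded = dict(word_freqs)
    for word, freq in word_freqs.items():
        for hyper in hypernyms.get(word, ()):
            expanded[hyper] = expanded.get(hyper, 0.0) + discount * freq
    return expanded
```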
As shown in the table, this introduces noise.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Biography Extraction </SectionTitle> <Paragraph position="0"> The biographical sentence classification module is only one of two components that supply the overall system with usable biographical content, and it is followed by other stages of processing (see the system design in Figure 1). We discuss the other modules next.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Name-filter </SectionTitle> <Paragraph position="0"> A filter scans through all documents in the set, eliminating sentences that are direct quotes or dialogue, or that are too short (under 5 words). Person-oriented sentences containing any variation of the person's name (first name only, last name only, or the full name) are kept for subsequent steps.</Paragraph> <Paragraph position="1"> Sentences classified as biography-worthy are merged with the name-filtered sentences, with duplicates eliminated.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Sentence Ranking </SectionTitle> <Paragraph position="0"> An essential capability of a multi-document summarizer is to combine text passages in a manner useful to the reader (Goldstein et al., 2000).</Paragraph> <Paragraph position="1"> This includes a sentence ordering parameter (Mani, 2001). Each of the sentences selected by the name-filter and the biography classifier is either related to the person-in-question via some news event or referred to as part of this person's biographical profile, or both. We need a mechanism that selects sentences of informative significance within the source document set. Using inverse term frequency (ITF), an estimate of information value, words with high information value (low ITF) are distinguished from those with low value (high ITF). 
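The name-filter of Section 5.1 above can be sketched as follows (Python; the double-quote heuristic for detecting direct quotes and the way name variants are derived are assumptions beyond what the paper specifies):

```python
def name_filter(sentences, full_name, min_words=5):
    """Keep person-oriented sentences: drop direct quotes (crudely,
    anything containing a double quote) and sentences under
    `min_words` words, then keep sentences mentioning any variation
    of the name (first name only, last name only, or full name)."""
    parts = full_name.split()
    variants = {full_name, parts[0], parts[-1]}
    kept = []
    for s in sentences:
        if '"' in s or len(s.split()) < min_words:
            continue
        if any(v in s for v in variants):
            kept.append(s)
    return kept
```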
A sorted list of words along with their ITF scores from a document set (the topic ITFs) displays the important events, persons, etc., from this particular set of texts. This allows us to identify passages that are unusual with respect to the texts about the person.</Paragraph> <Paragraph position="2"> However, we also need to identify passages that are unusual in general. We have to quantify how these important words compare to the rest of the world. The world is represented by 413,307,562 words from the TREC-9 corpus (http://trec.nist.gov/data.html), with corresponding ITFs.</Paragraph> <Paragraph position="3"> The overall informativeness of each word w is:</Paragraph> <Paragraph position="5"> The following is a set of sentences extracted according to the method described so far. The person-in-question is the famed cyclist Lance Armstrong.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Redundancy Elimination </SectionTitle> <Paragraph position="0"> Summaries that emphasize the differences across documents while synthesizing common information are the desirable final result. Removing similar information is part of all MDS systems. Redundancy is apparent in the Armstrong example from Section 5.2. To eliminate repetition while retaining interesting singletons, we modified the algorithm of Marcu (1999) so that an extract can be automatically generated by starting with a full text and systematically removing one sentence at a time as long as a stable semantic similarity with the original text is maintained. The original extraction algorithm was used to automatically create a large volume of (extract, abstract, text) tuples for training extraction-based summarization systems with (abstract, text) input pairs.</Paragraph> <Paragraph position="1"> Top-scoring sentences selected by the ranking mechanism described in Section 5.2 were the input to this component. 
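The removal loop just described can be sketched as follows (Python; bag-of-words cosine similarity stands in for the paper's semantic similarity measure, and the 0.9 stability threshold is an assumed value):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def bag(sentences) -> Counter:
    return Counter(w for s in sentences for w in s.lower().split())

def shrink_extract(sentences, max_sentences, min_sim=0.9):
    """Greedily drop one sentence at a time, always removing the
    sentence whose removal best preserves similarity with the full
    text, stopping at the target length or when similarity with the
    original would fall below `min_sim`."""
    full = bag(sentences)
    kept = list(sentences)
    while len(kept) > max_sentences:
        best_i, best_sim = None, -1.0
        for i in range(len(kept)):
            sim = cosine(bag(kept[:i] + kept[i + 1:]), full)
            if sim > best_sim:
                best_i, best_sim = i, sim
        if best_sim < min_sim:
            break  # further removal would change the content too much
        del kept[best_i]
    return kept
```

Duplicated material costs the least similarity to remove, so near-identical sentences are eliminated first while interesting singletons survive.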
The removal process was repeated until the desired summary length was achieved.</Paragraph> <Paragraph position="2"> Applied to the Armstrong example, this method leaves only one sentence that contains the topics &quot;chemotherapy&quot; and &quot;cancer&quot;. It chooses sentence 3, which is not bad, though sentence 1 might be preferable.</Paragraph> </Section> </Section> </Paper>