<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2151"> <Title>Automatic Text Summarization Based on the Global Document Annotation</Title> <Section position="4" start_page="917" end_page="919" type="metho"> <SectionTitle> 3 Text Summarization </SectionTitle> <Paragraph position="0"> As an example of a basic application of GDA, we have developed an automatic text summarization system. Summarization generally requires deep semantic processing and a lot of background knowledge. However, most previous work relies on superficial clues and heuristics tied to specific styles or configurations of documents.</Paragraph> <Paragraph position="1"> For example, clues for determining the importance of a sentence include (1) sentence length, (2) keyword count, (3) tense, (4) sentence type (such as fact, conjecture and assertion), (5) rhetorical relation (such as reason and example), and (6) position of the sentence in the whole text. Most of these are extracted by shallow processing of the text. Such a computation is rather robust.</Paragraph> <Paragraph position="2"> Present summarization systems (Watanabe, 1996; Hovy and Lin, 1997) use such clues to calculate an importance score for each sentence, choose sentences according to the score, and simply put the selected sentences together in order of their occurrence in the original document. In a sense, these systems are successful enough to be practical, and are based on reliable technologies. However, the quality of summarization cannot be improved beyond this basic level without deep content-based processing.</Paragraph> <Paragraph position="3"> We propose a new summarization method based on GDA. This method employs a spreading activation technique (Hasida et al., 1987) to calculate the importance values of elements in the text. Since the method does not employ any heuristics dependent on the domain and style of documents, it is applicable to any GDA-tagged document.
The method can also trim sentences in the summary because importance scores are assigned to elements smaller than sentences.</Paragraph> <Paragraph position="4"> A GDA-tagged document naturally defines an intra-document network in which nodes correspond to elements and links represent the semantic relations mentioned in the previous section. This network consists of sentence trees (syntactic head-daughter hierarchies of subsentential elements such as words or phrases), coreference/anaphora links, document/subdivision/paragraph nodes, and rhetorical relation links.</Paragraph> <Paragraph position="5"> Figure 1 shows a graphical representation of the intra-document network.</Paragraph> <Paragraph position="7"> The summarization algorithm is the following: 1. Spreading activation is performed in such a way that two elements have the same activation value if they are coreferent or one of them is the syntactic head of the other.</Paragraph> <Paragraph position="8"> 2. The unmarked element with the highest activation value is marked for inclusion in the summary. 3. When an element is marked, the other elements listed below are recursively marked as well, until no more elements can be marked.</Paragraph> <Paragraph position="9"> * its head * its antecedent * its compulsory or a priori important daughters, the values of whose relational attributes are agt, pat, obj, pos, cnt, cau, cnd, sbm, and so forth.</Paragraph> <Paragraph position="10"> * the antecedent of a zero anaphor in it with some of the above values for the relational attribute 4. All marked elements in the intra-document network are generated preserving the order of their positions in the original document.</Paragraph> <Paragraph position="11"> 5.
If the size of the summary reaches the user-specified value, then terminate; otherwise go back to Step 2.</Paragraph> <Paragraph position="12"> The following article from the Wall Street Journal was used for testing this algorithm.</Paragraph> <Paragraph position="13"> During its centennial year, The Wall Street Journal will report events of the past century that stand as milestones of American business history. THREE COMPUTERS THAT CHANGED the face of personal computing were launched in 1977. That year the Apple II, Commodore Pet and Tandy TRS came to market. The computers were crude by today's standards. Apple II owners, for example, had to use their television sets as screens and stored data on audiocassettes. But Apple II was a major advance from Apple I, which was built in a garage by Stephen Wozniak and Steven Jobs for hobbyists such as the Homebrew Computer Club. In addition, the Apple II was an affordable $1,298. Crude as they were, these early PCs triggered explosive product development in desktop models for the home and office. Big mainframe computers for business had been around for years. But the new 1977 PCs - unlike earlier built-from-kit types such as the Altair, Sol and IMSAI - had keyboards and could store about two pages of data in their memories. Current PCs are more than 50 times faster and have memory capacity 500 times greater than their 1977 counterparts. There were many pioneer PC contributors. William Gates and Paul Allen in 1975 developed an early language-housekeeper system for PCs, and Gates became an industry billionaire six years after IBM adapted one of these versions in 1981. Alan F. Shugart, currently chairman of Seagate Technology, led the team that developed the disk drives for PCs.</Paragraph> <Paragraph position="14"> Dennis Hayes and Dale Heatherington, two Atlanta engineers, were co-developers of the internal modems that allow PCs to share data via the telephone.
IBM, the world leader in computers, didn't offer its first PC until August 1981 as many other companies entered the market. Today, PC shipments annually total some $38.3 billion world-wide.</Paragraph> <Paragraph position="15"> Here is a short, computer-generated summary of this sample article:</Paragraph> </Section> <Section position="5" start_page="919" end_page="919" type="metho"> <SectionTitle> THREE COMPUTERS THAT </SectionTitle> <Paragraph position="0"> CHANGED the face of personal computing were launched. Crude as they were, these early PCs triggered explosive product development. Current PCs are more than 50 times faster and have memory capacity 500 times greater than their counterparts.</Paragraph> <Paragraph position="1"> The proposed method is flexible enough to dynamically generate summaries of various sizes. If a longer summary is needed, the user can change the window size of the summary browser, as described in Section 3.1. Then, the summary changes its size to fit into the new window. An example of a longer summary follows:</Paragraph> </Section> <Section position="6" start_page="919" end_page="919" type="metho"> <SectionTitle> THREE COMPUTERS THAT </SectionTitle> <Paragraph position="0"> CHANGED the face of personal computing were launched. The Apple II, Commodore Pet and Tandy TRS came to market. The computers were crude. Apple II owners had to use their television sets and stored data on audiocassettes. The Apple II was an affordable $1,298. Crude as they were, these early PCs triggered explosive product development. The new PCs had keyboards and could store about two pages of data in their memories. Current PCs are more than 50 times faster and have memory capacity 500 times greater than their counterparts. There were many pioneer PC contributors. William Gates and Paul Allen developed an early language-housekeeper system, and Gates became an industry billionaire after IBM adapted one of these versions.
IBM didn't offer its first PC.</Paragraph> <Paragraph position="1"> An observation obtained from this experiment is that tags for coreferences and thematic and rhetorical relations are almost enough to make a summary. In particular, coreferences and rhetorical relations help summarization very much.</Paragraph> <Paragraph position="2"> GDA tags allow us to apply more sophisticated natural language processing technologies to come up with better summaries. It is straightforward to incorporate sentence generation technologies to paraphrase parts of the document, rather than just selecting or pruning them. Annotations on anaphora can be exploited to produce context-dependent paraphrases. Also, the summary could be itemized to fit in a slide presentation.</Paragraph> <Section position="1" start_page="919" end_page="919" type="sub_section"> <SectionTitle> 3.1 Summary Browser </SectionTitle> <Paragraph position="0"> We developed a summary browser using a Java-capable WWW browser. Figure 2 shows an example screen of the summary browser.</Paragraph>
<Paragraph position="1"> [Figure 2: Example screen of the summary browser, showing the original article and its short summary.] It has the following functionalities: 1. A screen is divided into three parts (frames).</Paragraph> <Paragraph position="2"> One frame provides a user input form through which you can select documents and type keywords. The other frames are for displaying the original document and its summary.</Paragraph> <Paragraph position="3"> 2. The frame for the summary text is resizable by sliding the boundary with the original document frame. The size of the summary frame influences the size of the summary itself. Thus you can see the summary in a preferred size and change the size in an easy and intuitive way.</Paragraph> <Paragraph position="4"> 3. The frame for the original document is mouse-sensitive.
You can select any element of text in this frame. This function is used for the customization of the summary, as described later.</Paragraph> <Paragraph position="5"> 4. HTML tags are also handled by the browser.</Paragraph> <Paragraph position="6"> So, images are viewed and hyperlinks are managed in the summary. If a hyperlink is clicked in the original document frame, the linked document appears in the same frame.</Paragraph> <Paragraph position="7"> The hyperlinks are kept in the summary.</Paragraph> </Section> </Section> <Section position="7" start_page="919" end_page="920" type="metho"> <SectionTitle> 4 Personalization </SectionTitle> <Paragraph position="0"> A good summary might depend on the background knowledge of its creator. It should also change according to the interests or preferences of its reader. Let us refer to the adaptation of the summarization process to a particular user as personalization. GDA-based summarization can be easily personalized because our method is flexible enough to bias a summary toward the user's concerns. You can select any elements in the original document during summarization, to interactively provide information concerning your personal interests.</Paragraph> <Paragraph position="1"> We have been developing the following techniques for personalized summarization:</Paragraph> </Section> <Section position="8" start_page="920" end_page="920" type="metho"> <SectionTitle> * Keyword-based customization </SectionTitle> <Paragraph position="0"> The user can input any words of interest.</Paragraph> <Paragraph position="1"> The system relates those words with those in the document using cooccurrence statistics acquired from a corpus and a dictionary such as WordNet (Miller, 1995). The related words in the document are assigned numeric values that reflect closeness to the input words.
These values are used in spreading activation for calculating importance scores.</Paragraph> <Paragraph position="2"> * Interactive customization by selecting any elements from a document The user can mark any words, phrases, and sentences to be included in the summary. The summary browser allows the user to select those elements with pointing devices such as a mouse or stylus pen. The user can easily select elements by clicking on them. The click count corresponds to the level of elements. That is, the first click means the word, the second the next larger element containing it, and so on. The selected elements will have higher activation values in spreading activation.</Paragraph> <Paragraph position="3"> * Learning user interests by observation of WWW browsing The summarization system can customize the summary according to the user without any explicit user inputs. We implemented a learning mechanism for user personalization. The mechanism uses a weighted feature vector. The feature corresponds to the category or topic of documents. The category is defined according to a WWW directory such as Yahoo. The topic is detected using the summarization technique.</Paragraph> <Paragraph position="4"> Learning is roughly divided into data acquisition and model modification. The user's behavioral data is acquired by detecting her information access on the WWW. This data includes the time and duration of that information access and features related to that information.</Paragraph> <Paragraph position="5"> The first step of model modification is to estimate the degree of relevance between the input feature vector assigned to the information accessed by the user and the model of the user's interests acquired from previous data. The second step is to adjust the weights of features in the user model.</Paragraph> </Section> </Paper>