<?xml version="1.0" standalone="yes"?> <Paper uid="A00-1043"> <Title>Figure 3: Reduced form by a human. Figure 4: Reduced form by the program.</Title> <Section position="3" start_page="310" end_page="311" type="intro"> <SectionTitle> 2 Sentence reduction based on multiple sources of knowledge </SectionTitle> <Paragraph position="0"> The goal of sentence reduction is to &quot;reduce without major loss&quot;; that is, we want to remove as many extraneous phrases as possible from an extracted sentence so that it can be concise, but without detracting from the main idea the sentence conveys. Ideally, we want to remove a phrase from an extracted sentence only if it is irrelevant to the main topic. To achieve this, the system relies on multiple sources of knowledge to make reduction decisions. We first introduce the resources in the system and then describe the reduction algorithm.</Paragraph> <Section position="1" start_page="310" end_page="310" type="sub_section"> <SectionTitle> 2.1 The resources </SectionTitle> <Paragraph position="0"> (1) The corpus. One of the key features of the system is that it uses a corpus consisting of original sentences and their corresponding reduced forms written by humans for training and testing purposes. This corpus was created using a program we developed to automatically analyze human-written abstracts. The program, called the decomposition program, matches phrases in a human-written summary sentence to phrases in the original document (Jing and McKeown, 1999). The human-written abstracts were collected from the free daily news service &quot;Communications-related headlines&quot;, provided by the Benton Foundation (http://www.benton.org). 
The articles in the corpus are news reports on telecommunication-related issues, but they cover a wide range of topics, such as law, labor, and company mergers.</Paragraph> <Paragraph position="1"> (2) The lexicon. The system also uses a large-scale, reusable lexicon we combined from multiple resources (Jing and McKeown, 1998). The resources that were combined include the COMLEX syntactic dictionary (Macleod and Grishman, 1995), English Verb Classes and Alternations (Levin, 1993), the WordNet lexical database (Miller et al., 1990), and the Brown Corpus tagged with WordNet senses (Miller et al., 1993). The lexicon includes subcategorizations for over 5,000 verbs. This information is used to identify the obligatory arguments of verb phrases.</Paragraph> <Paragraph position="2"> (3) The WordNet lexical database. WordNet (Miller et al., 1990) is the largest lexical database to date. It provides lexical relations between words, including synonymy, antonymy, meronymy, entailment (e.g., eat → chew), and causation (e.g., kill → die). These lexical links are used to identify the focus in the local context.</Paragraph> <Paragraph position="3"> (4) The syntactic parser. We use the English Slot Grammar (ESG) parser developed at IBM (McCord, 1990) to analyze the syntactic structure of an input sentence and produce a sentence parse tree.</Paragraph> <Paragraph position="4"> The ESG parser not only annotates the syntactic category of a phrase (e.g., &quot;np&quot; or &quot;vp&quot;); it also annotates the thematic role of a phrase (e.g., &quot;subject&quot; or &quot;object&quot;).</Paragraph> </Section> <Section position="2" start_page="310" end_page="311" type="sub_section"> <SectionTitle> 2.2 The algorithm </SectionTitle> <Paragraph position="0"> There are five steps in the reduction program: Step 1: Syntactic parsing.</Paragraph> <Paragraph position="1"> We first parse the input sentence using the ESG parser and produce the sentence parse tree. 
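The later steps attach annotations to the nodes of this parse tree. As a minimal sketch, assuming an invented node structure (the field names are illustrative, not the actual ESG output format):

```python
# Hypothetical parse-tree node; field names are illustrative,
# not the actual ESG parser output format.
class Node:
    def __init__(self, label, role=None, word=None, children=None):
        self.label = label          # syntactic category, e.g. "np", "vp"
        self.role = role            # thematic role, e.g. "subject", "object"
        self.word = word            # lexical item (for leaf nodes)
        self.children = children or []
        # Annotation slots filled in by later steps:
        self.obligatory = False     # Step 2: grammatically required?
        self.context_score = 0.0    # Step 3: importance in local context

# A toy tree for "John convinced me to go"
tree = Node("s", children=[
    Node("np", role="subject", word="John"),
    Node("vp", children=[
        Node("v", word="convinced"),
        Node("np", role="object", word="me"),
        Node("inf", word="to go"),
    ]),
])
```

Each phrase corresponds to a subtree, so a reduction decision on a `Node` applies to everything beneath it.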
The operations in all other steps are performed on this parse tree. Each following step annotates each node in the parse tree with additional information, such as syntactic or context importance, which is used later to determine which phrases (represented as subtrees in the parse tree) can be considered extraneous and thus removed.</Paragraph> <Paragraph position="2"> Step 2: Grammar checking.</Paragraph> <Paragraph position="3"> In this step, we determine which components of a sentence must not be deleted to keep the sentence grammatical. To do this, we traverse the parse tree produced in the first step in top-down order and mark, for each node in the parse tree, which of its children are grammatically obligatory. We use two sources of knowledge for this purpose. One source includes simple, linguistic-based rules that use the thematic role structure produced by the ESG parser.</Paragraph> <Paragraph position="4"> For instance, for a sentence, the main verb, the subject, and the object(s) are essential if they exist, but a prepositional phrase is not; for a noun phrase, the head noun is essential, but an adjective modifier of the head noun is not. The other source we rely on is the large-scale lexicon we described earlier. The information in the lexicon is used to mark the obligatory arguments of verb phrases. For example, the lexicon entry for the verb &quot;convince&quot; indicates that the verb can be followed by a noun phrase and a prepositional phrase starting with the preposition &quot;of&quot; (e.g., he convinced me of his innocence). It can also be followed by a noun phrase and a to-infinitive phrase (e.g., he convinced me to go to the party). 
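The two knowledge sources of Step 2 can be sketched as follows. The subcategorization table format and the rule set are invented for illustration; only the &quot;convince&quot; frames follow the description above:

```python
# Toy subcategorization lexicon; the data format is invented for
# illustration. "convince" takes NP + PP(of) or NP + to-infinitive.
SUBCAT = {
    "convince": [["np", "pp:of"], ["np", "inf:to"]],
}

# Thematic roles that a simple linguistic rule set treats as obligatory.
OBLIGATORY_ROLES = {"subject", "object", "head"}

def mark_obligatory(role, parent_verb=None, frame_slot=None):
    """Return True if a child with this role/slot must be kept."""
    # Source 1: linguistic rules over thematic roles.
    if role in OBLIGATORY_ROLES:
        return True
    # Source 2: lexicon check -- is this slot part of one of the
    # verb's subcategorization frames?
    if parent_verb in SUBCAT and frame_slot is not None:
        return any(frame_slot in frame for frame in SUBCAT[parent_verb])
    return False
```

For example, `mark_obligatory("modifier", "convince", "pp:of")` returns `True` because the &quot;of&quot; prepositional phrase is part of a frame of &quot;convince&quot;, while an ordinary prepositional phrase not licensed by the verb would not be marked.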
This information prevents the system from deleting the &quot;of&quot; prepositional phrase or the to-infinitive that is part of the verb phrase.</Paragraph> <Paragraph position="5"> At the end of this step, each node in the parse tree -- including both leaf nodes and intermediate nodes -- is annotated with a value indicating whether it is grammatically obligatory. Note that whether a node is obligatory is relative to its parent node only. For example, whether a determiner is obligatory is relative to the noun phrase it is in; whether a prepositional phrase is obligatory is relative to the sentence or the phrase it is in.</Paragraph> <Paragraph position="6"> Step 3: Context information.</Paragraph> <Paragraph position="7"> In this step, the system decides which components in the sentence are most related to the main topic being discussed. To measure the importance of a phrase in the local context, the system relies on lexical links between words. The hypothesis is that the more connected a word is with other words in the local context, the more likely it is to be the focus of the local context. We link the words in the extracted sentence with words in its local context if they are repetitions, morphologically related, or linked in WordNet through one of the lexical relations. The system then computes an importance score for each word in the extracted sentence, based on the number of links it has with other words and the types of links. The formula for computing the context importance score for a word w is as follows:</Paragraph> <Paragraph position="8"> ContextWeight(w) = Sum_i ( Li x NUMi(w) ) </Paragraph> <Paragraph position="9"> Here, i represents the different types of lexical relations the system considers, including repetition, inflectional relation, derivational relation, and the lexical relations from WordNet. We assigned a weight to each type of lexical relation, represented by Li in the formula. 
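The score is a weighted sum of link counts over relation types, Sum_i ( Li x NUMi(w) ). A minimal sketch, with illustrative weights (the paper does not give the trained values):

```python
# Context importance score for a word w:
#   score(w) = sum over relation types i of L_i * NUM_i(w).
# The weights below are illustrative, not the system's actual values;
# closer relations (repetition, inflection) get higher weights than
# looser WordNet relations such as hypernymy.
WEIGHTS = {
    "repetition": 2.0,
    "inflectional": 2.0,
    "derivational": 1.5,
    "hypernym": 0.5,
}

def context_score(link_counts):
    """link_counts maps relation type -> number of links the word
    has with words in the local context."""
    return sum(WEIGHTS.get(rel, 0.0) * n for rel, n in link_counts.items())
```

For example, a word repeated twice nearby with one hypernym link would score 2 * 2.0 + 1 * 0.5 = 4.5.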
Relations such as repetition or inflectional relation are considered more important and are assigned higher weights, while relations such as hypernym are considered less important and assigned lower weights. NUMi(w) in the formula represents the number of lexical links of type i that the word w has with words in the local context.</Paragraph> <Paragraph position="10"> After an importance score is computed for each word, each phrase in the sentence gets a score by adding up the scores of its child nodes in the parse tree. This score indicates how important the phrase is in the local context.</Paragraph> <Paragraph position="11"> Step 4: Corpus evidence.</Paragraph> <Paragraph position="12"> The program uses a corpus consisting of sentences reduced by human professionals and their corresponding original sentences to compute how likely humans are to remove a certain phrase. The system first parsed the sentences in the corpus using the ESG parser. It then marked which subtrees in these parse trees (i.e., phrases in the sentences) were removed by humans. Using this corpus of marked parse trees, we can compute how likely a subtree is to be removed from its parent node. For example, we can compute the probability that the &quot;when&quot; temporal clause is removed when the main verb is &quot;give&quot;, represented as Prob(&quot;when-clause is removed&quot;|&quot;v=give&quot;), or the probability that the to-infinitive modifier of the head noun &quot;device&quot; is removed, represented as Prob(&quot;to-infinitive modifier is removed&quot;|&quot;n=device&quot;). These probabilities are computed using Bayes's rule. 
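A minimal sketch of such an estimate from corpus counts, via Bayes's rule; the counts are invented for illustration:

```python
# Estimating Prob("when-clause removed" | "v=give") from corpus
# counts via Bayes's rule. All counts below are invented.
n_sentences        = 1000  # parsed sentences containing a when-clause
n_removed          = 300   # when-clause removed by the human
n_give             = 50    # main verb is "give"
n_give_and_removed = 20    # verb is "give" AND when-clause removed

p_removed        = n_removed / n_sentences         # P(removed)
p_give           = n_give / n_sentences            # P(v=give)
p_give_given_rem = n_give_and_removed / n_removed  # P(v=give | removed)

# Bayes's rule:
#   P(removed | v=give) = P(v=give | removed) * P(removed) / P(v=give)
p_removed_given_give = p_give_given_rem * p_removed / p_give
```

With these counts the estimate works out to (20/300) * (300/1000) / (50/1000) = 0.4; the same form applies to the reduced and unchanged probabilities discussed below.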
For example, the probability that the &quot;when&quot; temporal clause is removed when the main verb is &quot;give&quot;, Prob(&quot;when-clause is removed&quot;|&quot;v=give&quot;), is computed as the product of Prob(&quot;v=give&quot;|&quot;when-clause is removed&quot;) (i.e., the probability that the main verb is &quot;give&quot; when the &quot;when&quot; clause is removed) and Prob(&quot;when-clause is removed&quot;) (i.e., the probability that the &quot;when&quot; clause is removed), divided by Prob(&quot;v=give&quot;) (i.e., the probability that the main verb is &quot;give&quot;).</Paragraph> <Paragraph position="13"> Besides computing the probability that a phrase is removed, we also compute two other types of probabilities: the probability that a phrase is reduced (i.e., the phrase is not removed as a whole, but some components in the phrase are removed), and the probability that a phrase is left unchanged (i.e., neither removed nor reduced).</Paragraph> <Paragraph position="14"> These corpus probabilities help us capture human practice. For example, for sentences like &quot;The agency reported that ...&quot;, &quot;The other source says that ...&quot;, and &quot;The new study suggests that ...&quot;, the that-clause following the say-verb (i.e., report, say, and suggest) in each sentence is rarely changed at all by professionals. The system can capture this human practice, since the probability that the that-clause of the verb &quot;say&quot; or &quot;report&quot; is left unchanged will be relatively high, which helps the system avoid removing components in the that-clause.</Paragraph> <Paragraph position="15"> These corpus probabilities are computed beforehand using a training corpus. They are then stored in a table and loaded at run time.</Paragraph> <Paragraph position="16"> Step 5: Final Decision.</Paragraph> <Paragraph position="17"> The final reduction decisions are based on the results from all the earlier steps. 
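Combining the three annotations into a removal decision can be sketched as follows; the thresholds are invented, since the paper does not give exact values:

```python
# Removal decision for one phrase, combining the annotations from
# Steps 2-4. Both thresholds are invented for illustration.
SCORE_THRESHOLD = 1.0  # below this, the phrase is not the local focus
PROB_THRESHOLD  = 0.5  # above this, humans remove the phrase often enough

def should_remove(obligatory, context_score, removal_prob):
    """A phrase is removed only if it is not grammatically obligatory,
    is not the focus of the local context (low importance score), and
    has a reasonable probability of being removed by humans."""
    return (not obligatory
            and context_score < SCORE_THRESHOLD
            and removal_prob >= PROB_THRESHOLD)
```

Applied top-down over the parse tree, a phrase failing any one of the three conditions is kept (possibly reduced internally) rather than removed.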
To decide which phrases to remove, the system traverses the sentence parse tree, which has now been annotated with the different types of information from the earlier steps, in top-down order and decides which subtrees should be removed, reduced, or left unchanged. A subtree (i.e., a phrase) is removed only if it is not grammatically obligatory, is not the focus of the local context (indicated by a low importance score), and has a reasonable probability of being removed by humans.</Paragraph> <Paragraph position="18"> Figure 1 shows sample output of the reduction program. The reduced sentences produced by humans are also provided for comparison.</Paragraph> </Section> </Section> </Paper>