<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0501">
  <Title>Hedge Trimmer: A Parse-and-Trim Approach to Headline Generation</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Feasibility Testing
</SectionTitle>
    <Paragraph position="0"> Our approach is based on selecting words from the original story, in the order in which they appear in the story, while allowing for morphological variation. To determine the feasibility of our headline-generation approach, we first attempted to apply our &amp;quot;select-words-in-order&amp;quot; technique by hand. We asked two subjects to write headlines for 73 AP stories from the TIPSTER corpus for January 1, 1989, by selecting words in order from the story. Of the 146 headlines, 2 did not meet the &amp;quot;select-words-in-order&amp;quot; criteria because of accidental word reordering. We found that at least one fluent and accurate headline meeting the criteria was created for each of the stories. The average length of the headlines was 10.76 words.</Paragraph>
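The &quot;select-words-in-order&quot; criterion amounts to a subsequence test over story words. A minimal sketch in Python (illustrative, not part of the original study; it uses exact word matching only, whereas the subjects were also allowed morphological variants):

```python
def meets_criterion(headline, story):
    """Check the "select-words-in-order" criterion: every headline word
    must occur in the story, in the same relative order.  Exact-match
    sketch; morphological variation is not handled here."""
    story_words = iter(story.lower().split())
    # `word in story_words` consumes the iterator up to the match,
    # so each later headline word must appear later in the story.
    return all(word in story_words for word in headline.lower().split())
```

The iterator trick makes each membership test resume where the previous one stopped, which is exactly the in-order constraint.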
    <Paragraph position="1"> Later we examined the distribution of the headline words among the sentences of the stories, i.e. how many came from the first sentence of a story, how many from the second sentence, etc. The results of this study are shown in Figure 1. We observe that 86.8% of the headline words were chosen from the first sentence of their stories. We performed a subsequent study in which two subjects created 100 headlines for 100 AP stories from August 6, 1990. 51.4% of the headline words in the second set were chosen from the first sentence. The distribution of headline words for the second set is shown in Figure 2.</Paragraph>
    <Paragraph position="2"> Although humans do not always select headline words from the first sentence, we observe that a large percentage of headline words are often found in the first sentence.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Approach
</SectionTitle>
    <Paragraph position="0"> The input to Hedge is a story, whose first sentence is immediately passed through the BBN parser. The parse-tree result serves as input to a linguistically-motivated module that selects story words to form headlines based on key insights gained from our observations of human-constructed headlines. That is, we conducted a human inspection of the 73 TIPSTER stories mentioned in Section 3 for the purpose of developing the Hedge Trimmer algorithm.</Paragraph>
    <Paragraph position="1"> Based on our observations of human-produced headlines, we developed the following algorithm for parse-tree trimming:
1. Choose lowest leftmost S with NP,VP
2. Remove low content units:
   o some determiners
   o time expressions
3. Iterative shortening:
   o XP Reduction
   o Remove preposed adjuncts
   o Remove trailing PPs
   o Remove trailing SBARs
More recently, we conducted an automatic analysis of the human-generated headlines that supports several of the insights gleaned from this initial study. We parsed 218 human-produced headlines using the BBN parser and analyzed the results. For this analysis, we used 72 headlines produced by a third participant.1 The parsing results included 957 noun phrases (NP) and 315 clauses (S).</Paragraph>
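The control flow of the algorithm is a fixed cascade of trimming rules gated by a length threshold. A hedged Python sketch, assuming parse trees represented as (label, children) tuples with word leaves; the function names are illustrative and a determiner-removal stub stands in for the full rule set:

```python
def leaves(tree):
    """Return the words at the frontier of a (label, children) tree."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in leaves(child)]

def remove_articles(tree):
    """Rule stub for step 2: drop the determiners "a" and "the"."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    kept = [remove_articles(c) for c in children
            if not (isinstance(c, str) and c.lower() in ("a", "the"))]
    return (label, kept)

def hedge_trim(tree, rules, threshold=10):
    """Sketch of the cascade: apply each trimming rule in order,
    stopping as soon as the headline is short enough."""
    for rule in rules:
        if len(leaves(tree)) <= threshold:
            break
        tree = rule(tree)
    return tree
```

Each rule returns a smaller tree; the cascade stops early once the word count falls below the threshold, so later (riskier) rules only fire when needed.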
    <Paragraph position="2"> We calculated percentages based on headline-level, NP-level, and Sentence-level structures in the parsing results. That is, we counted preposed adjuncts and conjoined S and VP nodes at the headline level; determiners and relative clauses at the NP level; and time expressions, trailing SBARs, and trailing PPs at the S level. Figure 3 summarizes the results of this automatic analysis. In our initial human inspection, we considered each of these categories to be reasonable candidates for deletion in our parse tree, and this automatic analysis indicates that we have made reasonable choices for deletion, with the possible exception of trailing PPs, which show up in over half of the human-generated headlines. This suggests that we should proceed with caution with respect to the deletion of trailing PPs; thus we consider this to be an option only if no other is available.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
HEADLINE-LEVEL PERCENTAGES
</SectionTitle>
    <Paragraph position="0"> preposed adjuncts = 0/218 (0%)
conjoined S = 1/218 (0.5%)
conjoined VP = 7/218 (3%)

NP-LEVEL PERCENTAGES
relative clauses = 3/957 (0.3%)
determiners = 31/957 (3%); of these, only 16 were &amp;quot;a&amp;quot; or &amp;quot;the&amp;quot; (1.6% overall)

S-LEVEL PERCENTAGES (counting only SBARs and PPs not designated as an argument of, i.e. contained in, a verb phrase)
time expressions = 5/315 (1.5%)
trailing PPs = 165/315 (52%)
trailing SBARs = 24/315 (8%)

For a comparison, we conducted a second analysis in which we used the same parser on just the first sentence of each of the 73 stories. In this second analysis, the parsing results included 817 noun phrases (NP) and 316 clauses (S). A summary of these results is shown in Figure 4. Note that, across the board, the percentages are higher in this analysis than in the results shown in Figure 3 (ranging from 12% higher--in the case of trailing PPs--to 1500% higher in the case of time expressions), indicating that our choices of deletion in the Hedge Trimmer algorithm are well-grounded.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Figure 4: Percentages for the first sentences of the 73 stories
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Choose the Correct S Node
</SectionTitle>
      <Paragraph position="0"> The first step relies on what is referred to as the Projection Principle in linguistic theory (Chomsky, 1981): Predicates project a subject (both dominated by S) in the surface structure. Our human-generated headlines always conformed to this rule; thus, we adopted it as a constraint in our algorithm.</Paragraph>
      <Paragraph position="1"> An example of the application of step 1 above is the following, where boldfaced material from the parse tree representation is retained and italicized material is eliminated: (2) Input: Rebels agree to talks with government officials said Tuesday.</Paragraph>
      <Paragraph position="2"> Parse: [S [S [NP Rebels] [VP agree to talks with government]] officials said Tuesday.] Output of step 1: Rebels agree to talks with government. When the parser produces a correct tree, this step provides a grammatical headline. However, the parser often produces an incorrect output. Human inspection of our 624-sentence DUC-2003 evaluation set revealed that there were two such scenarios, illustrated by the following cases:
(3) [S [SBAR What started as a local controversy] [VP has evolved into an international scandal.]]
(4) [NP [NP Bangladesh] [CC and] [NP [NP India] [VP signed a water sharing accord.]]]
In the first case, an S exists, but it does not conform to the requirements of step 1. This occurred in 2.6% of the sentences in the DUC-2003 evaluation data. We resolve this by selecting the lowest leftmost S, i.e., the entire string &amp;quot;What started as a local controversy has evolved into an international scandal&amp;quot; in the example above.</Paragraph>
      <Paragraph position="3"> In the second case, there is no S available. This occurred in 3.4% of the sentences in the evaluation data. We resolve this by selecting the root of the parse tree; this would be the entire string &amp;quot;Bangladesh and India signed a water sharing accord&amp;quot; above. No other parser errors were encountered in the DUC-2003 evaluation data.</Paragraph>
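Step 1 and its two fallbacks can be sketched as a leftmost depth-first search. A hedged Python illustration, assuming (label, children) tuples for parse trees; the function names are ours, not the paper's:

```python
def lowest_leftmost(tree, pred):
    """Depth-first, leftmost search for the deepest node satisfying pred."""
    if isinstance(tree, str):
        return None
    for child in tree[1]:
        found = lowest_leftmost(child, pred)
        if found is not None:
            return found
    return tree if pred(tree) else None

def has_np_vp(node):
    """True for an S whose immediate children include an NP and a VP."""
    kids = [c[0] for c in node[1] if not isinstance(c, str)]
    return node[0] == "S" and "NP" in kids and "VP" in kids

def choose_s(tree):
    """Step 1 with both fallbacks: prefer the lowest leftmost S with NP
    and VP children; else any lowest leftmost S; else the whole tree."""
    return (lowest_leftmost(tree, has_np_vp)
            or lowest_leftmost(tree, lambda n: n[0] == "S")
            or tree)
```

On example (2) this picks the inner S; on (3) the whole S; on (4), with no S at all, the root of the parse.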
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Removal of Low Content Nodes
</SectionTitle>
      <Paragraph position="0"> Step 2 of our algorithm eliminates low-content units. We start with the simplest low-content units: the determiners a and the. Other determiners were not considered for deletion because our analysis of the human-constructed headlines revealed that most of the other determiners provide important information, e.g., negation (not), quantifiers (each, many, several), and deictics (this, that).</Paragraph>
      <Paragraph position="1"> Beyond these, we found that the human-generated headlines contained very few time expressions which, although certainly not content-free, do not contribute toward conveying the overall &amp;quot;who/what content&amp;quot; of the story. Since our goal is to provide an informative headline (i.e., the action and its participants), the identification and elimination of time expressions provided a significant boost in the performance of our automatic headline generator.</Paragraph>
      <Paragraph position="2"> We identified time expressions in the stories using BBN's IdentiFinder(TM) (Bikel et al., 1999). We implemented the elimination of time expressions as a two-step process:
* Use IdentiFinder to mark time expressions
* Remove [PP ... [NP [X] ...] ...] and [NP [X]], where X is tagged as part of a time expression
The following examples illustrate the application of this step: (5) Input: The State Department on Friday lifted the ban it had imposed on foreign fliers.</Paragraph>
      <Paragraph position="3"> Parse: [Det The] State Department [PP [IN on] [NP [NNP Friday]]] lifted [Det the] ban it had imposed on foreign fliers.</Paragraph>
      <Paragraph position="4"> Output of step 2: State Department lifted ban it had imposed on foreign fliers.</Paragraph>
      <Paragraph position="5"> (6) Input: An international relief agency announced  Wednesday that it is withdrawing from North Korea.</Paragraph>
      <Paragraph position="6"> Parse: [Det An] international relief agency announced [NP [NNP Wednesday]] that it is withdrawing from North Korea.</Paragraph>
      <Paragraph position="7"> Output of step 2: International relief agency announced that it is withdrawing from North Korea. We found that 53.2% of the stories we examined contained at least one time expression which could be deleted. Human inspection of the 50 deleted time expressions showed that 38 were desirable deletions, 10 were locally undesirable because they introduced an ungrammatical fragment,3 and 2 were undesirable because they removed a potentially relevant constituent. However, even an undesirable deletion often pans out for two reasons: (1) the ungrammatical fragment is frequently deleted later by some other rule; and (2) every time a constituent is removed it makes room under the threshold for some other, possibly more relevant constituent. Consider the following examples.</Paragraph>
      <Paragraph position="8">  (7) At least two people were killed Sunday.</Paragraph>
      <Paragraph position="9"> (8) At least two people were killed when single-engine airplane crashed.</Paragraph>
      <Paragraph position="10"> Example (7) was produced by a system which did not  remove time expressions. Example (8) shows that if the time expression Sunday were removed, it would make room below the 10-word threshold for another important piece of information.</Paragraph>
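The two-part time-expression rule above can be sketched as follows. A hedged Python illustration over (label, children) tuple trees, in which a hard-coded word set stands in for IdentiFinder's time tags (the real system consumes IdentiFinder's markup, not a word list):

```python
TIME_WORDS = {"Friday", "Wednesday", "Sunday"}  # stand-in for IdentiFinder tags

def words(tree):
    if isinstance(tree, str):
        return [tree]
    return [w for c in tree[1] for w in words(c)]

def is_time_np(node):
    """An NP all of whose words are tagged as a time expression."""
    if isinstance(node, str) or node[0] != "NP":
        return False
    ws = words(node)
    return bool(ws) and all(w in TIME_WORDS for w in ws)

def contains_time_np(node):
    if isinstance(node, str):
        return False
    return is_time_np(node) or any(contains_time_np(c) for c in node[1])

def remove_time_expressions(tree):
    """Sketch of the two-part rule: delete [NP [X]] nodes whose words
    are all time-tagged, and any [PP ...] wrapping such an NP."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    kept = []
    for c in children:
        if is_time_np(c):
            continue
        if not isinstance(c, str) and c[0] == "PP" and contains_time_np(c):
            continue
        kept.append(remove_time_expressions(c))
    return (label, kept)
```

This removes both the bare-NP case of example (6) and the PP-wrapped case of example (5).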
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Iterative Shortening
</SectionTitle>
      <Paragraph position="0"> The final step, iterative shortening, removes linguistically peripheral material--through successive deletions--until the sentence is shorter than a given threshold. We took the threshold to be 10 for the DUC task, but it is a configurable parameter. Also, given that the human-generated headlines tended to retain earlier material more often than later material, much of our  iterative shortening is focused on deleting the rightmost phrasal categories until the length is below threshold. There are four types of iterative shortening rules.</Paragraph>
      <Paragraph position="1"> The first type is a rule we call &amp;quot;XP-over-XP,&amp;quot; which is implemented as follows: In constructions of the form [XP [XP ...] ...] remove the other children of the higher XP, where XP is NP, VP or S.</Paragraph>
      <Paragraph position="2"> This is a linguistic generalization that allowed us to apply a single rule to capture three different phenomena (relative clauses, verb-phrase conjunction, and sentential conjunction). The rule is applied iteratively, from the deepest rightmost applicable node backwards, until the length threshold is reached.</Paragraph>
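The XP-over-XP rule can be sketched in a few lines. A hedged Python illustration over (label, children) tuple trees; for simplicity this version applies the rule exhaustively in one bottom-up pass, whereas the actual system fires it node by node, deepest-rightmost first, and stops at the length threshold:

```python
def leaves(tree):
    if isinstance(tree, str):
        return [tree]
    return [w for c in tree[1] for w in leaves(c)]

def xp_over_xp(tree):
    """Wherever an NP, VP, or S has a same-category first child,
    remove the other children of the higher XP."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [xp_over_xp(c) for c in children]
    if (label in ("NP", "VP", "S") and children
            and not isinstance(children[0], str)
            and children[0][0] == label):
        return (label, [children[0]])  # drop the siblings of the inner XP
    return (label, children)
```

Applied to example (9) this discards the relative clause; on example (10) it discards the conjoined VP.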
      <Paragraph position="3"> The impact of XP-over-XP can be seen in these examples of NP-over-NP (relative clauses), VP-over-VP  (verb-phrase conjunction), and S-over-S (sentential conjunction), respectively: (9) Input: A fire killed a firefighter who was fatally  injured as he searched the house.</Paragraph>
      <Paragraph position="4"> Parse: [S [Det A] fire killed [Det a] [NP [NP firefighter] [SBAR who was fatally injured as he searched the house] ]] Output of NP-over-NP: fire killed firefighter  (10) Input: Illegal fireworks injured hundreds of people and started six fires.</Paragraph>
      <Paragraph position="5"> Parse: [S Illegal fireworks [VP [VP injured hundreds of people] [CC and] [VP started six fires] ]] Output of VP-over-VP: Illegal fireworks injured hundreds of people (11) Input: A company offering blood cholesterol  tests in grocery stores says medical technology has outpaced state laws, but the state says the company doesn't have the proper licenses.</Paragraph>
      <Paragraph position="6"> Parse: [S [Det A] company offering blood cholesterol tests in grocery stores says [S [S medical technology has outpaced state laws], [CC but] [S [Det the] state says [Det the] company doesn't have [Det the] proper licenses.]] ] Output of S-over-S: Company offering blood cholesterol tests in grocery stores says medical technology has outpaced state laws The second type of iterative shortening is the removal of preposed adjuncts. The motivation for this type of shortening is that all of the human-generated headlines ignored what we refer to as the preamble of the story. Assuming the Projection Principle has been satisfied, the preamble is viewed as the phrasal material occurring before the subject of the sentence. Thus, adjuncts are identified linguistically as any XP unit preceding the first NP (the subject) under the S chosen by step 1. This type of phrasal modifier is invisible to the XP-over-XP rule, which deletes material under a node only if it dominates another node of the same phrasal category.</Paragraph>
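Preposed-adjunct removal reduces to dropping every child of the chosen S that precedes its first NP. A hedged Python sketch over (label, children) tuple trees (function name ours):

```python
def leaves(tree):
    if isinstance(tree, str):
        return [tree]
    return [w for c in tree[1] for w in leaves(c)]

def remove_preposed_adjuncts(tree):
    """Under the chosen S, delete every phrasal unit that precedes
    the first NP (the subject); if no NP is found, leave the tree as is."""
    label, children = tree
    for i, child in enumerate(children):
        if not isinstance(child, str) and child[0] == "NP":
            return (label, children[i:])
    return tree
```

This is the linear-precedence complement to XP-over-XP: it keys on position relative to the subject rather than on category recursion.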
      <Paragraph position="7"> The impact of this type of shortening can be seen in the following example: (12) Input: According to a now finalized blueprint described by U.S. officials and other sources, the Bush administration plans to take complete, unilateral control of a post-Saddam Hussein Iraq Parse: [S [PP According to a now-finalized blueprint described by U.S. officials and other sources] [Det the] Bush administration plans to take complete, unilateral control of [Det a] post-Saddam Hussein Iraq ] Output of Preposed Adjunct Removal: Bush administration plans to take complete unilateral control of post-Saddam Hussein Iraq The third and fourth types of iterative shortening are the removal of trailing PPs and SBARs, respectively:
* Remove PPs from deepest rightmost node backward until length is below threshold.</Paragraph>
      <Paragraph position="8"> * Remove SBARs from deepest rightmost node backward until length is below threshold.</Paragraph>
      <Paragraph position="9"> These are the riskiest of the iterative shortening rules, as indicated in our analysis of the human-generated headlines. Thus, we apply these conservatively, only when there are no other categories of rules to apply. Moreover, these rules are applied with a backoff option to avoid over-trimming the parse tree. First the PP shortening rule is applied. If the threshold has been reached, no more shortening is done. However, if the threshold has not been reached, the system reverts to the parse tree as it was before any PPs were removed, and applies the SBAR shortening rule. If the threshold still has not been reached, the PP rule is applied to the result of the SBAR rule.</Paragraph>
      <Paragraph position="10"> Other sequences of shortening rules are possible. The one above was observed to produce the best results on a 73-sentence development set of stories from the TIPSTER corpus. The intuition is that, when removing constituents from a parse tree, it's best to remove smaller portions during each iteration, to avoid producing trees with undesirably few words. PPs tend to represent small parts of the tree while SBARs represent large parts of the tree. Thus we try to reach the threshold by removing small constituents, but if we can't reach the threshold that way, we restore the small constituents, remove a large constituent and resume the deletion of small constituents.</Paragraph>
      <Paragraph position="11"> The impact of these two types of shortening can be seen in the following examples: (13) Input: More oil-covered sea birds were found over the weekend.</Paragraph>
      <Paragraph position="12"> Parse: [S More oil-covered sea birds were found [PP over the weekend]] Output of PP Removal: More oil-covered sea birds were found.</Paragraph>
      <Paragraph position="13"> (14) Input: Visiting China Interpol chief expressed confidence in Hong Kong's smooth transition while assuring closer cooperation after Hong Kong returns.</Paragraph>
      <Paragraph position="14"> Parse: [S Visiting China Interpol chief expressed confidence in Hong Kong's smooth transition [SBAR while assuring closer cooperation after Hong Kong returns]] Output of SBAR Removal: Visiting China Interpol chief expressed confidence in Hong Kong's smooth transition</Paragraph>
    </Section>
  </Section>
</Paper>