File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/05/h05-1005_intro.xml
Size: 8,863 bytes
Last Modified: 2025-10-06 14:02:50
<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1005">
<Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 33-40, Vancouver, October 2005. (c) 2005 Association for Computational Linguistics. Improving Multilingual Summarization: Using Redundancy in the Input to Correct MT errors</Title>
<Section position="3" start_page="33" end_page="35" type="intro">
<SectionTitle> 2 References to people </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="33" end_page="33" type="sub_section">
<SectionTitle> 2.1 Data </SectionTitle>
<Paragraph position="0"> We used data from the DUC 2004 multilingual summarization task. The Document Understanding Conference (http://duc.nist.gov) has been run annually since 2001 and is the biggest summarization evaluation effort, with participants from all over the world. In 2004, for the first time, there was a multilingual multi-document summarization task. There were 25 sets to be summarized. For each set, consisting of 10 Arabic news reports, the participants were provided with 2 different machine translations into English (using translation software from ISI and IBM). The data provided under DUC includes 4 human summaries for each set for evaluation purposes; the human summarizers were provided with a human translation into English of each of the Arabic news reports, and did not have to read the MT output that the machine summarizers took as input.</Paragraph>
</Section>
<Section position="2" start_page="33" end_page="33" type="sub_section">
<SectionTitle> 2.2 Task definition </SectionTitle>
<Paragraph position="0"> An analysis of premodification in initial references to people in DUC human summaries for the monolingual task from 2001 to 2004 showed that 71% of premodifying words were either title or role words (e.g., Prime Minister, Physicist or Dr.) or temporal role-modifying adjectives such as former or designate. Country, state, location or organization names constituted 22% of premodifying words. All other kinds of premodifying words, such as moderate or loyal, constitute only 7%. Thus, assuming the same pattern in human summaries for the multilingual task (cf. section 2.6 on evaluation), our task for each person referred to in a document set is to:
1. Collect all references to the person in both translations of each document in the set.</Paragraph>
<Paragraph position="1"> 2. Identify the correct roles (including temporal modification) and affiliations for that person, filtering any noise.
3. Generate a reference using the above attributes and the person's name.</Paragraph>
</Section>
<Section position="3" start_page="33" end_page="34" type="sub_section">
<SectionTitle> 2.3 Automatic semantic tagging </SectionTitle>
<Paragraph position="0"> As the task definition above suggests, our approach is to identify particular semantic attributes for a person, and to generate a reference formally from this semantic input. Our analysis of human summaries tells us that the semantic attributes we need to identify are role, organization, country, state, location and temporal modifier. In addition, we also need to identify the person name.</Paragraph>
<Paragraph position="1"> We used BBN's IDENTIFINDER (Bikel et al., 1999) to mark up person names, organizations and locations. We marked up countries and (American) states using a list obtained from the CIA factsheet.</Paragraph>
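The paper gives no code for this markup step; the following is a minimal sketch (not from the paper) of how the list-based country and state tagging could be layered on top of the named-entity output. COUNTRY_LIST and STATE_LIST are hypothetical stand-ins for the lists derived from the CIA factsheet, and the SGML-style tags are illustrative only.

    # Minimal sketch: gazetteer-style markup of countries and (American) states.
    import re

    COUNTRY_LIST = {"Iraq", "Algeria", "France"}      # hypothetical excerpt
    STATE_LIST = {"California", "Texas", "New York"}  # hypothetical excerpt

    def tag_gazetteer(text):
        """Wrap known country and state names in simple SGML-style tags."""
        # Longest names first, so multi-word entries are matched whole.
        for name in sorted(COUNTRY_LIST | STATE_LIST, key=len, reverse=True):
            tag = "COUNTRY" if name in COUNTRY_LIST else "STATE"
            text = re.sub(r"\b%s\b" % re.escape(name),
                          "<%s>%s</%s>" % (tag, name, tag), text)
        return text

    print(tag_gazetteer("Nizar Hamdoon, Iraq's ambassador, spoke in New York."))
    # -> Nizar Hamdoon, <COUNTRY>Iraq</COUNTRY>'s ambassador, spoke in <STATE>New York</STATE>.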
<Paragraph position="2"> To mark up roles, we used a list derived from WordNet (Miller et al., 1993) hyponyms of the person synset. Our list has 2371 entries, including multi-word expressions such as chancellor of the exchequer, brother in law, senior vice president, etc. The list is quite comprehensive and includes roles from the fields of sports, politics, religion, military, business and many others. We also used WordNet to obtain a list of 58 temporal adjectives. WordNet classifies these as pre-nominal (e.g., occasional, former, incoming) or post-nominal (e.g., elect, designate, emeritus). This information is used during generation.</Paragraph>
<Paragraph position="3"> Further, we identified elementary noun phrases using the LT TTT noun chunker (Grover et al., 2000), and combined NP of NP sequences into one complex noun phrase. An example of the output of our semantic tagging module on a portion of machine-translated text follows: Our principal data structure for this experiment is the attribute value matrix (AVM). For example, we create the following AVM for the reference to Nizar Hamdoon in the tagged example above:</Paragraph>
<Paragraph position="5"> Note that we store the relative positions (arg 1 and arg 2) of the country and organization attributes.</Paragraph>
<Paragraph position="6"> This information is used both for error reduction and for generation, as detailed below. We also replace adjectival country attributes with the country name, using the correspondence in the CIA factsheet.</Paragraph>
</Section>
<Section position="4" start_page="34" end_page="35" type="sub_section">
<SectionTitle> 2.4 Identifying redundancy and filtering noise </SectionTitle>
<Paragraph position="0"> We perform coreference by comparing AVMs. Because of the noise present in MT (for example, words might be missing, or proper names might be spelled differently by different MT systems), simple name comparison is not sufficient. We form a coreference link between two AVMs if:
1. the last name and (if present) the first name match;
2. OR the role, country, organization and time attributes are the same.</Paragraph>
<Paragraph position="1"> The assumption is that in a document set to be summarized (which consists of related news reports), references to people with the same affiliation and role are likely to be references to the same person, even if the names do not match due to spelling errors. Thus we form one AVM for each person, by combining AVMs. For Nizar Hamdoon, to whom there is only one reference in the set (and thus two MT versions), we obtain the AVM:</Paragraph>
<Paragraph position="3"> where the numbers in brackets represent the counts of this value across all references. The arg values now represent the most frequent ordering of these organizations and countries in the input references. As an example of a combined AVM for a person with many references, consider: This example displays common problems when generating a reference. Zeroual has two affiliations, Leader of the Renovation Party and Algerian President. There is additional noise: the values AFP and former are most likely errors. As none of the organization or country values occur in the same reference, all are marked arg1; no relative ordering statistics are derivable from the input. For an example demonstrating noise in spelling, consider: Our approach to removing noise (see the sketch after this section) is to:
1. Select the most frequent name with more than one word (this is the most likely full name).</Paragraph>
<Paragraph position="4"> 2. Select the most frequent role.</Paragraph>
<Paragraph position="5"> 3. Prune the AVM of values that occur with a frequency below an empirically determined threshold.</Paragraph>
<Paragraph position="6"> Thus we obtain the following AVMs for the three examples above: This is the input semantics for our generation module described in the next section.</Paragraph>
</Section>
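To make the AVM operations of section 2.4 concrete, here is a minimal sketch (not code from the paper) of coreference linking, AVM combination and noise pruning. AVMs are represented as plain Python dicts; the attribute names and the frequency threshold of 2 are illustrative assumptions rather than the system's actual values.

    from collections import Counter

    def corefer(a, b):
        """Link two AVMs if the names match, or if role, country, organization
        and time all match (the coreference rule of section 2.4)."""
        a_name, b_name = a.get("name", "").split(), b.get("name", "").split()
        if a_name and b_name and a_name[-1] == b_name[-1]:            # last names match
            if len(a_name) == 1 or len(b_name) == 1 or a_name[0] == b_name[0]:
                return True                                           # first names match or absent
        keys = ("role", "country", "organization", "time")
        return all(a.get(k) and a.get(k) == b.get(k) for k in keys)

    def merge(avms):
        """Combine all references to one person, counting each attribute value."""
        combined = {}
        for avm in avms:
            for attr, value in avm.items():
                if value:
                    combined.setdefault(attr, Counter())[value] += 1
        return combined

    def prune(combined, threshold=2):
        """Keep the most frequent multi-word name and the most frequent role,
        and drop attribute values whose counts fall below the threshold."""
        names = combined.get("name", Counter())
        full = Counter({n: c for n, c in names.items() if len(n.split()) > 1}) or names
        result = {"name": full.most_common(1)[0][0]}
        if "role" in combined:
            result["role"] = combined["role"].most_common(1)[0][0]
        for attr, counts in combined.items():
            if attr not in ("name", "role"):
                result[attr] = [v for v, c in counts.items() if c >= threshold]
        return result

    # e.g., two MT versions of the same reference collapse into one pruned AVM:
    refs = [{"name": "Nizar Hamdoon", "role": "ambassador", "country": "Iraq"},
            {"name": "Nizar Hamdun", "role": "ambassador", "country": "Iraq"}]
    print(prune(merge(refs), threshold=1))
    # -> {'name': 'Nizar Hamdoon', 'role': 'ambassador', 'country': ['Iraq']}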
<Section position="5" start_page="35" end_page="35" type="sub_section">
<SectionTitle> 2.5 Generating references from AVMs </SectionTitle>
<Paragraph position="0"> In order to generate a reference from the words in an AVM, we need knowledge about syntax. The syntactic frame of a reference to a person is determined by the role. Our approach is to automatically acquire these frames from a corpus of English text. We used the Reuters News corpus for extracting frames. We performed the semantic analysis of the corpus, as in section 2.3; syntactic frames were extracted by identifying sequences involving locations, organizations, countries, roles and prepositions. An example of automatically acquired frames with their maximum likelihood probabilities for the role ambassador is:
ROLE=ambassador
(p=.35) COUNTRY ambassador PERSON
(.18) ambassador PERSON
(.12) COUNTRY ORG ambassador PERSON
(.12) COUNTRY ambassador to COUNTRY PERSON
(.06) ORG ambassador PERSON
(.06) COUNTRY ambassador to LOCATION PERSON
(.06) COUNTRY ambassador to ORG PERSON
(.03) COUNTRY ambassador in LOCATION PERSON
(.03) ambassador to COUNTRY PERSON
These frames provide us with the required syntactic information to generate from, including word order and choice of preposition. We select the most probable frame that matches the semantic attributes in the AVM (see the sketch below). We also use a default set of frames for instances where no automatically acquired frames exist:</Paragraph>
</Section>
</Section>
</Paper>
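As an illustration of the frame-based generation described in section 2.5, the following sketch (not code from the paper) selects the most probable frame whose slots an AVM can fill and instantiates it. The frame inventory mirrors part of the ambassador example above, but the probabilities are treated as plain data, and DEFAULT_FRAME is a hypothetical stand-in for the default frame set, which is not reproduced here.

    # Frames are (probability, token list) pairs; uppercase tokens are slots.
    FRAMES = {
        "ambassador": [
            (0.35, ["COUNTRY", "ambassador", "PERSON"]),
            (0.18, ["ambassador", "PERSON"]),
            (0.12, ["COUNTRY", "ORG", "ambassador", "PERSON"]),
            (0.06, ["COUNTRY", "ambassador", "to", "LOCATION", "PERSON"]),
        ],
    }
    DEFAULT_FRAME = ["ROLE", "PERSON"]   # hypothetical stand-in for the default frames

    def generate_reference(avm):
        """Pick the most probable frame whose slots the AVM can fill, then fill it."""
        values = {"PERSON": avm.get("name"), "ROLE": avm.get("role"),
                  "COUNTRY": avm.get("country"), "ORG": avm.get("organization"),
                  "LOCATION": avm.get("location")}
        for _prob, frame in sorted(FRAMES.get(avm.get("role", ""), []), reverse=True):
            if all(values.get(tok, tok) for tok in frame):   # every slot can be filled
                return " ".join(values.get(tok, tok) for tok in frame)
        return " ".join(values.get(tok, tok) for tok in DEFAULT_FRAME)

    print(generate_reference({"name": "Nizar Hamdoon", "role": "ambassador",
                              "country": "Iraq"}))
    # -> Iraq ambassador Nizar Hamdoon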