<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1137">
  <Title>A New Probabilistic Model for Title Generation</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Title generation is a complex task involving both natural language understanding and natural language synthesis. In this paper, we propose a new probabilistic model for title generation. Different from the previous statistical models for title generation, which treat title generation as a generation process that converts the 'document representation' of information directly into a 'title representation' of the same information, this model introduces a hidden state called 'information source' and divides title generation into two steps, namely the step of distilling the 'information source' from the observation of a document and the step of generating a title from the estimated 'information source'. In our experiment, the new probabilistic model outperforms the previous model for title generation in terms of both automatic evaluations and human judgments.</Paragraph>
    <Paragraph position="1"> Introduction Compared with a document, a title provides a compact representation of the information and therefore helps people quickly capture the main idea of a document without spending time on the details. Automatic title generation is a complex task, which not only requires finding the title words that reflects the document content but also demands ordering the selected title words into human readable sequence. Therefore, it involves in both nature language understanding and nature language synthesis, which distinguishes title generation from other seemingly similar tasks such as key phrase extraction or automatic text summarization where the main concern of tasks is identify important information units from documents (Mani &amp; Maybury., 1999).</Paragraph>
    <Paragraph position="2"> The statistical approach toward title generation has been proposed and studied in the recent publications (Witbrock &amp; Mittal, 1999; Kennedy &amp; Hauptmann, 2000; Jin &amp; Hauptmann, 2001).</Paragraph>
    <Paragraph position="3"> The basic idea is to first learn the correlation between the words in titles (title words) and the words in the corresponding documents (document words) from a given training corpus consisting of document-title pairs, and then apply the learned title-word-document-word correlations to generate titles for unseen documents.</Paragraph>
    <Paragraph position="4"> Witbrock and Mittal (1999) proposed a statistical framework for title generation where the task of title generation is decomposed into two phases, namely the title word selection phase and the title word ordering phase. In the phase of title word selection, each title word is scored based on its indication of the document content. During the title word ordering phase, the 'appropriateness' of the word order in a title is scored using ngram statistical language model. The sequence of title words with highest score in both title word selection phase and title word ordering phase is chosen as the title for the document. The follow-ups within this framework mainly focus on applying different approaches to the title word selection phase (Jin &amp; Hauptmann, 2001; Kennedy &amp; Hauptmann, 2000).</Paragraph>
    <Paragraph position="5"> However, there are two problems with this framework for title generation. They are: * A problem with the title word ordering phase. The goal of title word selection phase is to find the appropriate title words for document and the goal of title word ordering phase is to find the appropriate word order for the selected title words. In the framework proposed by Witbrock and Mittal (1999), the title word ordering phase is accomplished by using ngram language model (Clarkson &amp; Rosenfeld, 1997) to predict the probability P(T), i.e. how frequently the word sequence T is used as a title for a document. Of course, the probability for the word sequence T to be used as a title for any document is definitely influenced by the correctness of the word order in T. However, the factor whether the words in the sequence T are common words or not will also have great influence on the chance of seeing the sequence T as a title. Word sequence T with many rare words, even with a perfect word order, will be difficult to match with the content of most documents and has small chance to be used as a title. As the result, using probability P(T) for the purpose of ordering title words can cause the generated titles to include unrelated common title words. The obvious solution to this problem is to somehow eliminate the bias of favouring common title words from probability P(T) and leave it only with the task of the word ordering. * A problem with the title word selection phase. The title word selection phase is responsible for coming up with a set of title words that reflect the meaning of the document. In the framework proposed by Witbrock and Mittal (1999), every document word has an equal vote for title words. However, title only needs to reflect the main content of a document not every single detail of that document.</Paragraph>
    <Paragraph position="6"> Therefore, letting all the words in the document participate equally in the selection of title words can cause a large variance in choosing title words. For example, common words usually have little to do with the content of documents. Therefore, allowing common words of a document equally compete with the content words in the same document in choosing title words can seriously degrade the quality of generated titles.</Paragraph>
    <Paragraph position="7"> The solution we proposed to this problem is to introduce a hidden state called 'information source'. This 'information source' will sample the important content word out of a document and a title will be computed based on the sampled 'information source' instead of the original document. By striping off the common words through the 'information source' state, we are able to reduce the noise introduced by common words to the documents in selecting title words. The schematic diagram for the idea is shown in Figure 1, together with the schematic diagram for the framework by Witbrock and Mittal. As indicated by Figure 1, the old framework for title generation has only a single 'channel' connecting the document words to the title words while the new model contains two 'channels' with one connecting the document words to the 'information source' state and the other connecting the 'information source' state to the title words.</Paragraph>
    <Paragraph position="8">  model and new model for title generation.</Paragraph>
  </Section>
class="xml-element"></Paper>