<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0113"> <Title>Data Reliability and Its Effects on Automatic Abstracting</Title> <Section position="2" start_page="0" end_page="113" type="metho"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> The traditional approach to automatic abstracting aims at providing a reader with fast access to documents by facilitating a judgement on their relevance to his or her information needs. Another possible use of automatic abstracting can be found in work such as Bateman and Teich (1995) and Alexa et al. (1996), where computer-generated abstracts are used for editing purposes.</Paragraph> <Paragraph position="1"> In this paper, we discuss an approach to automatic abstracting where an abstract is created by extracting sentences from a text that are indicative of its content. In particular, the paper focuses on creating abstracts of Japanese newspaper texts. An approach to abstracting by extraction typically makes use of a text corpus with labelled extracts, indicating which sentences are summary extracts. However, as far as we know, no question has ever been raised about the empirical validity of the extracts used. Usually, extracts are manually supplied by the author himself (Watanabe, 1996) or by someone else (McKeown and Radev, 1995) (as in the TIPSTER Ziff-Davis corpus), or one takes a roundabout way of identifying extracts in a text through a human-supplied abstract (Kupiec et al., 1995). In this paper, we will propose a method for identifying summary extracts in a way that allows objective justification. We will do this by examining how humans perform on summary extraction and evaluating the reliability of their performance, using the kappa statistic, a metric standardly used in the behavioral sciences (Jean Carletta, 1996; Sidney Siegel and N. John Castellan Jr., 1988).</Paragraph> <Paragraph position="2"> Based on summary extracts supplied by humans, we construct a collection of texts annotated with information on sentence importance. They will be used as training and test data for a decision tree approach to abstracting, which we adopt in this paper (Quinlan, 1993). In a decision tree approach, the task of extracting summary sentences is treated as a two-way classification task, where a sentence is assigned to either the &quot;yes&quot; or the &quot;no&quot; category, depending on its likelihood of being a summary sentence. The merit of a decision tree method is that it provides a generic framework in which to combine knowledge from multiple sources, a property necessary for automatic abstracting, where information from a single source alone often fails to determine which sentences to extract.</Paragraph> </Section> <Section position="3" start_page="113" end_page="119" type="metho"> <SectionTitle> 2. METHODOLOGY </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="113" end_page="114" type="sub_section"> <SectionTitle> 2.1. Collecting Data on Summary Extraction by Humans </SectionTitle> <Paragraph position="0"> We conducted experiments with humans to collect data on how they perform on the sentence extraction task. We asked 112 naive subjects (students at graduate and undergraduate level) to extract the 10% of sentences in a text which they considered most important in making its summary. The number of extractions varied from two to four, depending on the length of the text. The age of the subjects varied from 18 to 45.
The experiments used 75 texts from three different text categories (25 for each category): COLUMN, EDITORIAL and NEWS REPORT. The texts were of about the same size in terms of character counts and the number of paragraphs, and were selected randomly from articles that appeared in a Japanese economics daily in 1995 (Nihon-Keizai-Shimbun-Sha, 1995). Table 1 provides some statistics on the corpus from which the extraction tests are constructed. A single test material consists of three extraction problems, each with a text from a different category. Though 85 of the 112 subjects were assigned to one test each, due to the lack of enough subjects we had to ask the remaining 27 subjects to work on five tests. On average, each test had about 7 subjects assigned to it.</Paragraph> </Section> <Section position="2" start_page="114" end_page="118" type="sub_section"> <SectionTitle> 2.2. Measurement of Reliability </SectionTitle> <Paragraph position="0"> The Kappa Statistic Following Jean Carletta (1996), we use the kappa statistic (Sidney Siegel and N. John Castellan Jr., 1988) to measure the degree of agreement among subjects. The reason for choosing the kappa over other measures of agreement (Passonneau and Litman, 1993) derives from our interest in discovering a relationship between the reliability or quality of data (as quantified by some metric) and the performance of automatic abstracting. As aptly pointed out in Jean Carletta (1996), agreement measures proposed so far in the computational linguistics literature have failed to ask an important question: whether results obtained using agreement data are in any way different from those obtained using random data. It has been left unclear just how high a level of agreement among subjects needs to be achieved before the data can be used reliably. It could be the case that data with high agreement are still too noisy to use for the task for which they were collected.</Paragraph> <Paragraph position="1"> We assume that the kappa coefficient gives a suitable way of measuring the reliability of data, where we take reliability to mean reproducibility of data, or the degree to which data are reproduced under different circumstances, with different coders (Krippendorff, 1980).</Paragraph> <Paragraph position="2"> The kappa coefficient (K) of agreement measures the ratio of observed agreements to possible agreements among a set of raters on category judgements, correcting for chance agreement:</Paragraph> <Paragraph position="3"> K = (P(A) - P(E)) / (1 - P(E))</Paragraph> <Paragraph position="4"> where P(A) is the proportion of the times that raters agree and P(E) is the proportion of the times that we would expect them to agree by chance. K = 1 if there is complete agreement among the raters. K = 0 if there is no agreement other than that which is expected by chance. Consider a set of k raters and a group of N objects, each of which is to be assigned to one of m categories. Each of the raters assigns each object to one category. We represent the assignments data as an N x m matrix (Table 2), where the value nij at each cell (i, j) (0 < i ≤ N, 0 < j ≤ m) denotes the number of raters assigning the ith object to the jth category. Let Cj be the total number of times that objects are assigned to the jth category, i.e., Cj = Σi nij, the sum running over all N objects. Si measures the proportion of pairwise agreements among the raters on category assignments for a particular object i. Si gives a measurement of agreement among raters on decisions regarding which category a given object i is to be assigned to. Let us define Si by Def. 2.</Paragraph>
<Paragraph position="5"> Si = Σj nij(nij - 1) / (k(k - 1)) (Def. 2)</Paragraph> <Paragraph position="6"> [Tables 2 and 3 appeared here: Table 2 shows the general N x m assignments matrix of counts nij; Table 3 shows an example in which 2m raters judge two objects a and b, with every rater assigning a to the same category and b receiving exactly two raters in each of the m categories.]</Paragraph> <Paragraph position="7"> For each object i, the agreement frequencies nij must sum to k, the total number of raters. Note that 0 ≤ Si ≤ 1. Si = 1 when all raters agree on a single category j for the ith object. Suppose that we asked 2m raters to assign two objects a and b to one of m categories and found results as in Table 3. For a, there is complete agreement on the object's category, while for b, decisions are spread evenly over the m categories. Since Sa = 1 and Sb = 1/(2m - 1) (m > 1), we have Sa > Sb.</Paragraph> <Paragraph position="8"> The proportion P(A) of the times that the raters agree is given as the average of Si across all objects (Def. 3).</Paragraph> <Paragraph position="9"> P(A) = (1/N) Σi Si (Def. 3)</Paragraph> <Paragraph position="10"> The expected probability that a category is chosen at random is estimated as Pj = Cj/(N · k).</Paragraph> <Paragraph position="11"> Then, the probability that any two raters agree on the jth category by chance would be Pj².</Paragraph> <Paragraph position="12"> P(E) is defined as the sum of the chance agreement for each category (Def. 4), representing the overall rate of agreement by chance.</Paragraph> <Paragraph position="13"> P(E) = Σj Pj² (Def. 4)</Paragraph> <Paragraph position="14"> The values of P(A) and P(E) are then combined to give the kappa coefficient K.</Paragraph> <Paragraph position="15"> Evaluation Judgements produced by subjects on a summary extraction task can be cast into an assignments matrix in a number of different ways. (Note that a single extraction task consists of extracting a specified number of sentences from one text.) We adopt here a representation scheme where we take N to be the number of choices made by a subject for a text and m to be the number of sentences in that text. 1 (Note that since we asked a subject to choose 10% of the sentences in the text, the number of extractions made for each text depends entirely on the text's length, but the number of extractions from a given text is the same across subjects.) Imagine, for instance, that nine subjects are asked to extract the three most important sentences from a text with ten sentences. Under the scheme here, the resulting data could be represented as a matrix of height N = 3 and width m = 10 with k = 9, like the one in Table 4, where the first object is the earliest occurring sentence a subject considers most important, the second object the second earliest such sentence, and the third object the third earliest.</Paragraph> <Paragraph position="16"> It is important to notice that a matrix is constructed for each extraction task and the agreement coefficient K is determined for each task, not for each sentence in the text. Table 5 lists the K values for subjects' judgements on sentence importance, averaged over texts. The number of subjects assigned to one extraction task varied from 4 to 9. 96% of the time, we had over 6 subjects working on the same task. The average number of subjects per text was 7.33. 2 We find in Table 5, however, that there is only marginal agreement among subjects.</Paragraph> <Paragraph position="17"> 1 Another possibility is to represent the data as an N x m matrix of height N = the number of sentences in the text and width m = 2 (yes/no), representing a binary judgement about whether a given sentence is relevant for summarizing.</Paragraph> <Paragraph position="18"> 2 In Table 5, there are more raters than subjects. This happens because subjects are multiply assigned to extraction tasks.</Paragraph>
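To make the agreement computation of Section 2.2 concrete, here is a minimal sketch, not taken from the paper, of how K can be computed from an N x m assignments matrix following Defs. 2-4 above. The closing example reuses the two-object illustration of Table 3 with m = 4 categories and 2m = 8 raters.

```python
# Kappa coefficient from an N x m assignments matrix.
# counts[i][j] = number of raters assigning object i to category j;
# every row is assumed to sum to k, the number of raters.

def kappa(counts):
    N = len(counts)               # number of objects
    k = sum(counts[0])            # number of raters (each row sums to k)
    m = len(counts[0])            # number of categories

    # Def. 2: per-object pairwise agreement S_i
    S = [sum(n * (n - 1) for n in row) / (k * (k - 1)) for row in counts]

    # Def. 3: P(A) is the average of S_i over all objects
    P_A = sum(S) / N

    # Def. 4: P(E) from the category marginals C_j, with P_j = C_j / (N * k)
    C = [sum(row[j] for row in counts) for j in range(m)]
    P_E = sum((Cj / (N * k)) ** 2 for Cj in C)

    return (P_A - P_E) / (1 - P_E)

# Example in the spirit of Table 3: object a gets all 8 raters in one category
# (S_a = 1), object b is spread evenly over the 4 categories (S_b = 1/(2m - 1) = 1/7).
print(kappa([[8, 0, 0, 0], [2, 2, 2, 2]]))
```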
<Paragraph position="19"> Level of Agreement and Data Reliability For a behavioral scientist, the results in Table 5 would indicate that judgements produced by humans on the summary extraction task are not to be trusted: on the reliability scale in Table 6, the rates we get for the extraction data fall somewhere between SLIGHT and FAIR. However, it is not immediately clear how an abstracting program trained on such 'untrustworthy' data would perform. How does the notion of level of agreement, or data reliability in a behavioral scientist's sense, relate to the performance of automatic abstracting? This is a question we are going to address in the following sections. We follow Passonneau and Litman (1993) in assuming that the majority opinion is correct, and drop decisions not in agreement with the majority. In fact, our approach here provides a principled basis, through the kappa statistic, for Passonneau and Litman's (1993) notion of majority opinion.</Paragraph> <Paragraph position="20"> Now a decision on whether or not a sentence should be included in a summary extract is said to be a majority opinion if it is positively agreed upon by n subjects, where n ranges anywhere from 2 to the total number of subjects assigned to a task. 3 Data with various levels of agreement can be obtained by removing from the agreement tables those decisions which are against the majority opinion, for various values of n. 4 Of these, only those data whose agreement rate is over a specific K threshold are used as training/test data for automatic abstracting. Table 7 lists average agreement rates for data with thresholds ranging from 0.1 to 0.8. The rows represent K thresholds, and the columns represent text types. Figures in parentheses are the numbers of texts at a given threshold.</Paragraph> <Paragraph position="21"> 4 (a) Let the threshold of agreement be t (0 < t < 1). For each text, set the size of the majority to 1. (b) Find K. If K ≥ t, stop. (c) Otherwise increase the size by one and remove decisions against the majority so defined. Go back to (b). Note that there will be no removal of disagreeing decisions if the text has a kappa coefficient greater than or equal to t at the start.</Paragraph> </Section> <Section position="3" start_page="118" end_page="118" type="sub_section"> <SectionTitle> 2.3. Extraction Method </SectionTitle> <Paragraph position="0"> We make use of the decision tree program C4.5 (Quinlan, 1993) to develop a sentence extraction algorithm. What it does, in essence, is classify sentences as either &quot;yes&quot; or &quot;no&quot;, based on a prediction it makes about whether a given sentence is to be included in a summary extract.</Paragraph> <Paragraph position="1"> C4.5 works with 'coded descriptions' of data (or cases). A coded description consists of a specification of the data in terms of a fixed set of attributes and a category to which the data are to be assigned. We use a corpus of coded texts, where each sentence is represented with a set of attributes and assigned to either a &quot;yes&quot; or a &quot;no&quot; category according to whether the sentence is a summary extract selected by a group of humans with some level of agreement among them. We constructed 15 sets of coded texts from the corpus by varying the threshold value of agreement from 0.1 to 0.8.</Paragraph> </Section>
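As an illustration of the coded-description setup in Section 2.3, the sketch below shows one way per-sentence attribute vectors and yes/no labels could be handed to a decision tree learner. This is not the authors' implementation: C4.5 is a standalone program, and scikit-learn's DecisionTreeClassifier (with an entropy split criterion) merely stands in for it here; the class and field names are illustrative and simply mirror the attributes introduced in Section 2.4.

```python
# A sketch of coding sentences and inducing a two-way (yes/no) classifier.
from dataclasses import dataclass
from typing import List
from sklearn.tree import DecisionTreeClassifier

@dataclass
class CodedSentence:
    text_type: int            # 0 = COLUMN, 1 = EDITORIAL, 2 = NEWS REPORT
    location_in_text: float   # ratio of preceding sentences to all sentences
    title_similarity: float   # normalized tf-idf similarity to the title
    length: int               # sentence length in characters
    attitudinal_type: int     # TYPE 1, 2 or 3
    within_text_tfidf: float
    location_in_paragraph: float
    label: str                # "yes" if judged a summary extract, else "no"

def train_extractor(cases: List[CodedSentence]) -> DecisionTreeClassifier:
    """Induce a yes/no sentence classifier from coded descriptions."""
    X = [[c.text_type, c.location_in_text, c.title_similarity, c.length,
          c.attitudinal_type, c.within_text_tfidf, c.location_in_paragraph]
         for c in cases]
    y = [c.label for c in cases]
    # criterion="entropy" gives information-gain splits, roughly in the spirit of C4.5
    return DecisionTreeClassifier(criterion="entropy").fit(X, y)
```

Note that a generic decision tree treats the integer-coded categorical attributes (text type, attitudinal type) as ordered values, whereas C4.5 handles categorical attributes natively, so a real reimplementation would want a proper categorical encoding.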
<Section position="4" start_page="118" end_page="119" type="sub_section"> <SectionTitle> 2.4. Attributes </SectionTitle> <Paragraph position="0"> Attributes provide the ways in which to code a sentence. The trouble is that there are many possible ways of choosing among potential attributes, and one has to go through some trial-and-error experimentation to find a set of attributes that works best for the task at hand. The selection of attributes is essentially heuristic and empirical. After some examination, we have settled on the following set of attributes, some of which are variations of those typically found in the summarization literature (Kupiec et al., 1995; Paice and Jones, 1993; Edmundson, 1969; Zechner, 1996).</Paragraph> <Paragraph position="1"> Text Type: This attribute is categorical and identifies the type of text to which a given sentence belongs. The possible values are &quot;C&quot; for COLUMN, &quot;E&quot; for EDITORIAL and &quot;N&quot; for NEWS REPORT.</Paragraph> <Paragraph position="2"> Location in Text: The location attribute records how far from the beginning of the text a given sentence appears. The value is the ratio of the number of preceding sentences to the total number of sentences in the text. The assumption is that where a sentence occurs in the text gives an important clue to predicting whether it is an extract chosen by human subjects (Edmundson, 1969).</Paragraph> <Paragraph position="3"> Similarity to Title: This attribute records how similar a given sentence is to the title. We use the normalized tf-idf as a similarity metric (Wilkinson, 1994). The similarity between a sentence S and the title T of the text in which it occurs is given by:</Paragraph> <Paragraph position="5"> where F(w, S) denotes the frequency of the word w in S and MAX_F(S) the frequency of the most frequent word in S.</Paragraph> <Paragraph position="7"> DF(w) is the number of sentences in the text which have an occurrence of w. N is the total number of sentences in the text. log N is a normalization factor.</Paragraph> <Paragraph position="8"> Within Text tf-idf: The within-text tf-idf is a metric quantifying how well a given sentence distinguishes itself from the rest of the text (Zechner, 1996). For a sentence S, its degree of distinction D(S) from other sentences is defined analogously to the Similarity function above:</Paragraph> <Paragraph position="10"/> </Section>
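The sketch below is a hedged illustration of how such a normalized tf-idf similarity could be computed. It assumes tf(w, S) = F(w, S)/MAX_F(S) and idf(w) = log(N/DF(w))/log N, i.e., exactly the factors named in the prose above, summed over the words of the title; the function name and argument types are illustrative and not necessarily the paper's exact formulation.

```python
# A hedged sketch of a normalized tf-idf similarity between a sentence and the title.
import math
from collections import Counter
from typing import List

def similarity(sentence: List[str], title: List[str],
               all_sentences: List[List[str]]) -> float:
    """Score a tokenized sentence against the tokenized title of its text."""
    N = len(all_sentences)                      # total number of sentences in the text
    freq = Counter(sentence)                    # F(w, S)
    max_f = max(freq.values()) if freq else 1   # MAX_F(S)
    score = 0.0
    for w in set(title):
        if w not in freq:
            continue
        # DF(w): number of sentences in the text containing w (at least 1 here)
        df = sum(1 for s in all_sentences if w in s) or 1
        idf = math.log(N / df) / math.log(N) if N > 1 else 0.0
        score += (freq[w] / max_f) * idf
    return score
```

The within-text tf-idf D(S) would then be obtained analogously, summing the same weights over the words of S itself rather than over the title.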
<Section position="5" start_page="119" end_page="119" type="sub_section"> <SectionTitle> Attitudinal Construct: Attitudinal constructs in Japanese include modal verbs/auxiliaries, </SectionTitle> <Paragraph position="0"> a class of verbal/sentential constructions expressing the speaker's subjective attitude (hitsuyo-da 'it is necessary', kiboo-suru 'it is hoped') and sentence-final particles such as interrogative and communicative markers (-ka, -yo, -ne) (Nagano, 1986; Unetaya, 1987). This attribute is categorical and takes one of three values, TYPE 1, TYPE 2 and TYPE 3, depending on whether the sentence ends with a verbal of non-attitudinal type (TYPE 1), with an attitudinal verbal or a modal (TYPE 2), or with a sentence-final particle (TYPE 3). The assumption here is that a sentence with attitudinal expressions has more of a chance of being chosen as a summary extract. Unetaya (1987) gives some supporting evidence.</Paragraph> <Paragraph position="1"> 5 Words here mean nominals, which are identified using a Japanese tokenizer program (Sakurai and Hisamitsu, 1997).</Paragraph> <Paragraph position="2"> Length: This attribute records the length (in characters) of a sentence. The idea is that short sentences may not be informative enough to serve as a summary line (Kupiec et al., 1995).</Paragraph> <Paragraph position="3"> Location in Paragraph: This attribute records the location of a given sentence within its paragraph. The value is continuous and determined similarly to the location attribute above.</Paragraph> <Paragraph position="4"> Shown in Table 8 are some sample encodings of sentences in terms of the attributes above. Each line encodes a sentence in terms of TEXT-TYPE, LOCATION-IN-TEXT, SIMILARITY, TEXT-LENGTH, ATTITUDINAL-TYPE, WITHIN-TEXT-TFIDF, LOCATION-IN-PARAGRAPH, and CLASS, in this order. The first line, for instance, represents a sentence from a column-type text; its location in the text is toward the end; its similarity to the title is nil; it is 28 characters long; its attitudinal type is 1; it has a tf-idf value of 2.9; it occurs about one third of the way into its paragraph; and finally its class is Y, meaning that it is judged important.</Paragraph> </Section> </Section> </Paper>