<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4027"> <Title>Summarizing Email Threads</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Data </SectionTitle> <Paragraph position="0"> Our corpus consists of 96 threads of email sent during one academic year among the members of the board of the student organization of the ACM at Columbia University. The emails dealt mainly with planning events of various types, though other issues were also addressed.</Paragraph> <Paragraph position="1"> On average, each thread contained 3.25 email messages, with all threads containing at least two messages, and the longest thread containing 18 messages.</Paragraph> <Paragraph position="2"> Two annotators each wrote a summary of the thread.</Paragraph> <Paragraph position="3"> We did not provide instructions about how to choose content for the summaries, but we did instruct the annotators on the format of the summary; specifically, we requested them to use the past tense, and to use speech-act verbs and embedded clauses (for example, Dolores reported she'd gotten 7 people to sign up instead of Dolores got 7 people to sign up). We requested the length to be about 5% to 20% of the original text length, but not longer than 100 lines.</Paragraph> <Paragraph position="4"> Writing summaries is not a task that competent native speakers are necessarily good at without specific training. Furthermore, there may be many different possible summary types that address different needs, and different summaries may satisfy a particular need. Thus, when asking native speakers to write thread summaries we cannot expect to obtain summaries that are similar.</Paragraph> <Paragraph position="5"> We then used the hand-written summaries to identify important sentences in the threads in the following manner. We used the sentence-similarity finder SimFinder (Hatzivassiloglou et al., 2001) in order to rate the similarity of each sentence in a thread to each sentence in the corresponding summary. SimFinder uses a combination of lexical and linguistic features to assign a similarity score to an input pair of texts. We excluded sentences that are being quoted, as well as signatures and the like. For each sentence in the thread, we retained the highest similarity score. We then chose a threshold; sentences with SimFinder scores above this threshold are then marked as &quot;Y&quot;, indicating that they should be part of a summary, while the remaining sentences are marked &quot;N&quot;. About 26% of sentences are marked &quot;Y&quot;. All sentences from the email threads along with their classification constitutes our data. For annotator DB, we have 1338 sentences, of which 349 are marked &quot;Y&quot;, for GR (who has annotated a subset of the threads that DB has annotated) there are 1296 sentences, of which 336 are marked &quot;Y&quot;. Only 193 sentences are marked &quot;Y&quot; using the summaries of both annotators, reflecting the difference in the summaries written by the two annotators. The kappa for the marking of the sentences is 0.29 (recall that this only indirectly reflects annotator choice). 
These figures bear out our expectation that human-written summaries show great variation; we discuss these differences further in Section 5.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Features for Sentence Extraction </SectionTitle> <Paragraph position="0"> We start out with features that are not specific to email.</Paragraph> <Paragraph position="1"> These features consider the thread as a single text. We call this feature set basic. Each sentence in the email thread is represented by a feature vector. We shall call the sentence under consideration s, the message in which the sentence appears m, the thread in which the sentence appears t, and the entire corpus c. (We omit some features we use for lack of space.) thread line num: The absolute position of s in t.</Paragraph> <Paragraph position="2"> centroid sim: Cosine similarity of s's TF-IDF vector (excluding stop words) with t's centroid vector.</Paragraph> <Paragraph position="3"> The centroid vector is the average of the TF-IDF vectors of all the sentences in t. The IDF component is derived from the ACM Corpus.</Paragraph> <Paragraph position="4"> centroid sim local: Same as centroid sim, except that the inverse document frequencies are derived from the thread.</Paragraph> <Paragraph position="5"> length: The number of content terms in s.</Paragraph> <Paragraph position="6"> tfidfsum: Sum of the TF-IDF weights of content terms in s. IDF weights are derived from c.</Paragraph> <Paragraph position="7"> tfidfavg: Average TF-IDF weight of the content terms in s. IDF weights are derived from c.</Paragraph> <Paragraph position="8"> t rel pos: Relative position of s in t: the number of sentences preceding s divided by the total number of sentences in t. All messages in a thread are ordered linearly by the time they were sent.</Paragraph> <Paragraph position="9"> is Question: Whether s is a question, as determined by punctuation.</Paragraph> <Paragraph position="10"> We now add two features that take into account the division of the thread into messages and the resulting dialog structure. The union of this feature set with basic is called basic+.</Paragraph> <Paragraph position="11"> msg num: The ordinality of m in t (i.e., the absolute position of m in t).</Paragraph> <Paragraph position="12"> m rel pos: Relative position of s in m: the number of sentences preceding s divided by the total number of sentences in m.</Paragraph> <Paragraph position="13"> Finally, we add features that address the specific structure of email communication. The full feature set is called full.</Paragraph> <Paragraph position="14"> subject sim: Overlap of the content words of the subject of the first message in t with the content words in s.</Paragraph> <Paragraph position="15"> num of res: Number of direct responses to m.</Paragraph> <Paragraph position="16"> num Of Recipients: Number of recipients of m.</Paragraph> <Paragraph position="17"> fol Quote: Whether s follows a quoted portion in m.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experiments and Results </SectionTitle> <Paragraph position="0"> This section describes experiments using the machine learning program Ripper (Cohen, 1996) to automatically induce sentence classifiers, using the features described in Section 4. 
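Before turning to the learning results, here is a minimal sketch (not from the paper) of how a few of the basic features from Section 4 might be computed for a single thread. scikit-learn's TF-IDF weighting is an assumed stand-in for the paper's own term weighting, the underscored key names are illustrative, and deriving IDF from the thread itself corresponds to the local variant of the centroid feature.

    # Illustrative computation of a few Section 4 features for one thread.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def basic_features(thread_sentences):
        """Return one feature dict per sentence s in thread t."""
        vectorizer = TfidfVectorizer(stop_words="english")
        tfidf = vectorizer.fit_transform(thread_sentences)  # IDF from the thread (local variant)
        centroid = np.asarray(tfidf.mean(axis=0))           # average TF-IDF vector of t
        n = len(thread_sentences)
        features = []
        for i, sentence in enumerate(thread_sentences):
            row = tfidf[i].toarray()
            features.append({
                "thread_line_num": i + 1,                        # absolute position of s in t
                "t_rel_pos": i / n,                              # preceding sentences / total
                "length": int((row > 0).sum()),                  # distinct content terms in s
                "tfidfsum": float(row.sum()),                    # sum of TF-IDF weights in s
                "centroid_sim": float(cosine_similarity(row, centroid)[0, 0]),
                "is_question": sentence.strip().endswith("?"),   # punctuation heuristic
            })
        return features

Such per-sentence vectors, with the Y/N labels from Section 3, form the training examples handed to the learner.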
Like many learning programs, Ripper takes as input the classes to be learned, a set of feature names and possible values, and training data specifying the class and feature values for each training example. In our case, the training examples are the sentences from the threads as described in Section 3. Ripper outputs a classification model for predicting the class (i.e., whether a sentence should be in a summary or not) of future examples; the model is expressed as an ordered set of if-then rules. We obtained the results presented here using five-fold cross-validation. In this paper, we only evaluate the results of the machine learning step; we acknowledge the need for an evaluation of the resulting summaries using word/string-based similarity metrics and/or human judgments, and leave that to future publications.</Paragraph> <Paragraph position="1"> We show results for the two annotators and different feature sets in Figure 1. First consider the results for annotator DB. Recall that basic includes only standard features that can be used for all text genres, and considers the thread a single text. basic+ takes the breakdown of the thread into messages into account. full also uses features that are specific to email threads. We can see that performance improves as more features are added beyond the baseline set basic. Specifically, using email-specific features improves performance over the basic baseline, as we expected. We also give a second baseline, ctroid, which we determined by choosing the top 20% of sentences most similar to the thread centroid. All results using Ripper improve on this baseline.</Paragraph> <Paragraph position="2"> If we perform exactly the same experiments on the summaries written by annotator GR, we obtain the results shown in the bottom half of Figure 1. The results are much worse, and the centroid-based baseline outperforms all but the full feature set. We leave to further research an explanation of why this may be the case; we speculate that GR, as an annotator, is less consistent in her choice of material than DB is when forming a summary. Thus, the machine learner has less regularity to learn from. However, we take this difference as evidence for the claim that one should not expect great regularity in human-written summaries.</Paragraph> <Paragraph position="3"> Finally, we investigated what happens when we combine the data from both sources, DB and GR. Using SimFinder, we obtained two scores for each sentence: one giving the similarity to the most similar sentence in DB's summary, and one giving the similarity to the most similar sentence in GR's summary. We can combine these two scores and then use the combined score in the same way that we used the score from a single annotator. We explore two ways of combining the scores: the average and the maximum. Both ways of combining the scores yield worse results than either annotator on his or her own, and the average is worse than the maximum (see Figure 2). We interpret these results again as meaning that there is little convergence in the human-written summaries, and it may be advantageous to learn from one particular annotator. (Of course, another option might be to develop and enforce very precise guidelines for the annotators.)
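For concreteness, here is a small sketch (not from the paper) of the two score-combination strategies; the arrays of per-sentence scores, the names, and the threshold are all illustrative.

    # Combining each sentence's best similarity score against DB's summary
    # with its best score against GR's summary; names and threshold are illustrative.
    import numpy as np

    def combine_and_label(scores_db, scores_gr, method="max", threshold=0.5):
        scores_db = np.asarray(scores_db, dtype=float)
        scores_gr = np.asarray(scores_gr, dtype=float)
        if method == "max":
            combined = np.maximum(scores_db, scores_gr)
        elif method == "avg":
            combined = (scores_db + scores_gr) / 2.0
        else:
            raise ValueError("method must be 'max' or 'avg'")
        return ["Y" if s > threshold else "N" for s in combined]

    # A sentence strongly matched by only one annotator's summary keeps its
    # "Y" label under max but can lose it under avg:
    print(combine_and_label([0.8, 0.3], [0.1, 0.4], method="max"))  # ['Y', 'N']
    print(combine_and_label([0.8, 0.3], [0.1, 0.4], method="avg"))  # ['N', 'N']

This asymmetry is one plausible reason the maximum combination degrades less than the average, though the paper does not analyze it further.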
Among the rules Ripper learned, rule 1, for example, states that questions at the beginning of a thread that are similar to the entire thread should be retained, and rule 2 states that sentences which are very similar to the thread and which have a high number of recipients should be retained. However, some rules show signs of overfitting; for example, rule 1 limits the average TF-IDF values to a rather narrow band. We hope that more data will alleviate the overfitting problem. (The data collection continues.)</Paragraph> [Figure 4 (example summary): Regarding &quot;acm home/bjarney&quot;, on Apr 9, 2001, Muriel Danslop wrote: Two things: Can someone be responsible for the press releases for Stroustrup? Responding to this on Apr 10, 2001, Theresa Feng wrote: I think Phil, who is probably a better writer than most of us, is writing up something for dang and Dave to send out to various ACM chapters. Phil, we can just use that as our &quot;press release&quot;, right? In another subthread, on Apr 12, 2001, Kevin Danquoit wrote: Are you sending out upcoming events for this ...] </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Postprocessing Extracted Sentences </SectionTitle> <Paragraph position="0"> Extracted sentences are sent to a module that wraps them with the names of the senders, the dates at which they were sent, and a speech-act verb. The speech-act verb is chosen as a function of the structure of the email thread in order to make this structure more apparent to the reader. Further, for readability, the sentences are sorted by the order in which they were sent. An example can be seen in Figure 4. Note that while the initial question is answered in the following sentence, two other questions are left unanswered in this summary (the answers are in fact in the thread).</Paragraph> </Section> </Paper>
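As a closing illustration of the Section 6 postprocessing step, here is a minimal sketch under stated assumptions: each extracted sentence is assumed to carry its sender, send time, and a flag indicating whether it opens a (sub)thread, and the speech-act verb choice below is a simple placeholder rather than the paper's actual function of thread structure.

    # Sketch of the postprocessing step: sort extracted sentences by send time
    # and wrap each with sender, date, and a speech-act verb.  The verb rule
    # below is a placeholder; the paper derives it from the thread structure.
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Extracted:
        sender: str
        sent_at: datetime
        text: str
        is_first_in_subthread: bool = False

    def render_summary(extracted):
        lines = []
        for e in sorted(extracted, key=lambda x: x.sent_at):
            verb = "wrote" if e.is_first_in_subthread else "responded"
            date = e.sent_at.strftime("%b %d, %Y")
            lines.append(f"On {date}, {e.sender} {verb}: {e.text}")
        return "\n".join(lines)

    # Example usage with made-up content:
    items = [
        Extracted("A. Sender", datetime(2001, 4, 9), "Can someone handle the press release?", True),
        Extracted("B. Replier", datetime(2001, 4, 10), "I think Phil is writing something up."),
    ]
    print(render_summary(items))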