<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1079"> <Title>Generating Overview Summaries of Ongoing Email Thread Discussions</Title> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Issue Detection </SectionTitle>
<Paragraph position="0"> To make the problem more manageable, we make the following assumptions about the types of threads that our algorithm will handle. To begin with, we assume that the threads have been correctly constructed and classified as discussions supporting decision-making. Needless to say, this first assumption is a little unrealistic given that thread construction is a difficult problem. For example, it is not uncommon to receive emails with recycled subject lines simply because replying to an email is often more convenient than typing in an address.</Paragraph>
<Paragraph position="1"> The other assumptions we make have to do with the dialogue structure of the threads. The first is that the issue being discussed (usually a statement describing the matter to be decided) is to be found in the first email. The second is that the email thread does not shift task, nor does it contain multiple issues.</Paragraph>
<Paragraph position="2"> The first assumption is based on what we have observed to be normal behavior. Exceptions to this rule are broken threads and cases where the participants have responded to a forwarded email. A broken thread can be seen as an error in thread construction and identification. Even so, the first email of such a thread usually contains a reference to the issue at hand, although it may be an impoverished paraphrase. Our algorithm extracts these paraphrases in lieu of the original wording. Cases where participants have responded to a forwarded email are not common. For such threads, we attempt to extract the sentence the participants respond to; again, this may not be the best formulation of the issue.</Paragraph>
<Paragraph position="3"> Secondly, we assume that a text segmentation algorithm (for examples see Hearst's &quot;Text-Tiling&quot; algorithm, 1997; Choi et al., 2000) has already segmented the threads according to shifts in task. Operationally, our detection of shifts in task would then be based on corresponding changes in the vocabulary used.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 The Algorithm </SectionTitle>
<Paragraph position="0"> Our summarization approach is to extract a set of sentences consisting of one issue and the corresponding responses, one per participant.</Paragraph>
<Paragraph position="1"> Our sentence extraction mechanisms borrow from information retrieval methods which represent text as weighted term frequency vectors (for an overview see Salton and McGill, 1983).</Paragraph>
<Paragraph position="2"> In Figure 3, we present the general framework of the algorithm. In this framework we divide the thread into two parts, the initiating email and the replies. We create a comparison vector that represents what the replies are about; variations of this framework are obtained by changing the way we build this comparison vector. The aim is to compare each sentence of the first email to the comparison vector for the replies. Thus, we build separate vector representations, called candidate vectors, for each sentence in the first email. Using the cosine similarity metric to compare candidate vectors with the comparison vector, we rank the sentences of the first email. Conceptually, the highest ranked sentence will be the one that is closest in content to the replies, and this is extracted as the issue of the discussion.</Paragraph>
<Paragraph position="3"> Figure 3. General framework for issue detection:
1. Separate the thread into issue_email and replies
2. Create a &quot;comparison vector&quot; V representing the replies
3. For each sentence s in issue_email
   3.1 Construct a vector representation S for sentence s
   3.2 Compare V and S using cosine similarity
4. Rank the sentences according to their cosine similarity scores
5. Extract the top-ranking sentence</Paragraph>
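<Paragraph> As an illustration only, the following Python sketch implements the framework of Figure 3. It assumes whitespace tokenization and raw term frequencies, and it builds the comparison vector as a simple average of the reply term vectors, which corresponds to the Centroid method described below; the identifiers are illustrative rather than drawn from our implementation.

# Minimal sketch of the Figure 3 framework; identifiers are illustrative.
import math
from collections import Counter

def term_vector(sentence):
    # Step 3.1: represent a sentence as a bag of raw term frequencies.
    return Counter(sentence.lower().split())

def cosine(v1, v2):
    # Step 3.2: cosine similarity between two sparse term vectors.
    dot = sum(freq * v2.get(term, 0) for term, freq in v1.items())
    norm1 = math.sqrt(sum(f * f for f in v1.values()))
    norm2 = math.sqrt(sum(f * f for f in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def centroid(vectors):
    # Step 2: one way to build the comparison vector V is to average the
    # reply term vectors (the Centroid method described below).
    total = Counter()
    for v in vectors:
        total.update(v)
    return {term: freq / len(vectors) for term, freq in total.items()}

def extract_issue(issue_email_sentences, reply_sentences):
    # Step 1 (separating the thread into issue_email and replies) is
    # assumed to have been done by the caller.
    comparison = centroid([term_vector(s) for s in reply_sentences])
    # Steps 3-5: score each candidate sentence of the first email against V,
    # rank by cosine similarity and extract the top-ranking sentence.
    scored = [(cosine(term_vector(s), comparison), s)
              for s in issue_email_sentences]
    return max(scored)[1]
</Paragraph>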
<Paragraph position="4"> We now discuss the four methods for building the comparison vector. These are: 1. the Centroid method; 2. the SVD Centroid method; 3. the SVD Key Sentence method; and 4. combinations of methods (oracles).</Paragraph>
<Paragraph position="5"> In the Centroid method, we first build a term by sentence (t x s) matrix, A, from the reply emails. In this matrix, rows represent the unique words found in the thread and columns represent the reply sentences; thus, the cells of a column store the term frequencies of the words in a particular sentence. From this matrix, we form a centroid to represent the content of the replies by summing the sentence vectors and normalizing by the number of sentences. This centroid is then what we use as our comparison vector.</Paragraph>
<Paragraph position="6"> Our interpretation of the SVD results is based on that of Gong and Liu (1999) and Hofmann (1999). Gong and Liu use SVD for text segmentation and summarization purposes. Hofmann describes the results of SVD within a probabilistic framework. For a more complete summary of our interpretation of the SVD analysis, see Wan et al. (2003).</Paragraph>
<Paragraph position="7"> In the SVD Centroid method, we begin by constructing the matrix A as in the Centroid method. The matrix A provides a representation of each sentence in w dimensions, where w is the size of the vocabulary of the thread. The SVD analysis factorizes A into the product of three matrices, U, S and V transpose. In the following equation, dimensionality is indicated by the subscripts.</Paragraph>
<Paragraph position="8"> SVD(A_{t x s}) = U_{t x r} S_{r x r} (V_{s x r})^T. Conceptually, the analysis maps the sentences into a smaller dimensionality r, which we interpret as the main &quot;concepts&quot; discussed in the sentences. These dimensions, or concepts, are identified automatically by the SVD analysis on the basis of similarities of co-occurrences. The rows of the V matrix represent the sentences of the replies, and each row vector describes how a given sentence relates to the discovered concepts. Importantly, the number of discovered concepts is less than or equal to the size of the vocabulary of the thread in question. If it is less than the vocabulary size, then the SVD analysis has been able to combine several related terms into a single concept. Conceptually, this corresponds to finding word associations such as synonyms, though in general the association may not conserve part-of-speech. In contrast to the values of the A matrix, which are always non-negative (since they are based on frequencies), the values of the cells in the V matrix can be negative; each value represents the degree to which a sentence relates to a particular concept. We build a centroid from the rows of the V matrix to form our comparison vector.</Paragraph>
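<Paragraph> As a sketch only, the SVD Centroid comparison vector can be computed with numpy as follows, assuming the term by sentence matrix A has already been built from the replies as above; the function name is illustrative.

# Sketch of the SVD Centroid method; A is the t x s term-by-sentence matrix.
import numpy as np

def svd_centroid_comparison_vector(A):
    # Factorize A into U (t x r), the singular values (length r) and
    # V transpose (r x s), where r is at most min(t, s).
    U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)
    V = Vt.T  # each row of V describes one reply sentence in concept space
    return V.mean(axis=0)  # centroid of the sentence rows

Note that the candidate vectors of the first email live in the original term space, so they must be mapped into the same r-dimensional concept space before the cosine comparison, as discussed below.</Paragraph>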
<Paragraph position="9"> The SVD Key Sentence method is similar to the preceding method. We build the matrix A, apply the SVD analysis and obtain the matrix V. Instead of constructing a vector which represents all of the replies, we choose the one sentence from the replies that is most representative of the thread content. This is done by selecting the most important concept and finding the sentence that contains the most words related to it. The SVD analysis by default sorts the concepts according to the degree to which sentences are associated with them. By this definition, the most important concept is represented by the values in the first column of the matrix V. We then take the maximum of this column vector and note its row index, which identifies a sentence; we use that row vector of the V matrix as the comparison vector.</Paragraph>
<Paragraph position="10"> In both the SVD Centroid method and the SVD Key Sentence method, the comparison vector has a different dimensionality from the candidate vectors. To perform the comparison, we must map the candidate vectors into this new dimensionality. This is done by pre-multiplying each candidate vector with the result of the matrix multiplication U^T x S; both of these matrices are obtained from the SVD analysis.</Paragraph>
<Paragraph position="11"> Since we have three alternatives for constructing the comparison vector, we consider the possibility of combining the approaches. In Wan et al. (2003) we showed that combining traditional TF-IDF approaches with SVD approaches was useful, given that SVD provided additional information about word associations. Similarly, our two SVD methods provide complementary information. The vector computed by the SVD Centroid method provides information about the replies and accounts for word associations such as synonyms. However, like the Centroid method, this vector will include all topics discussed in the replies, even small digressions. In contrast, the SVD Key Sentence method is potentially better at ignoring these digressions by focusing on a single concept. We therefore present three heuristic oracles which re-rank the candidate issue sentences identified by each of the three methods. Re-ranking is based on a voting mechanism. The rules for the oracles are presented in Figures 4 and 5.</Paragraph>
<Paragraph position="12"> Figure 4. Rules for the first oracle:
1. If a majority exists, return it.
2. If there is a tie, retrieve the sentence with the lowest index number i, where i ≠ 1.</Paragraph>
<Paragraph position="13"> The oracle in Figure 4 attempts to choose the best sentence, retrieving a single sentence. Rule 2 encodes the intuition that the issue sentence is likely to occur early in the email, though not usually at the very top. Finally, we use the Centroid method as a default because it is less prone to errors arising from the low vocabulary sizes found in shorter threads; for such threads, we found that the SVD approaches tend not to perform as well.</Paragraph>
<Paragraph position="14"> The second oracle again relies on a majority vote. However, it relaxes the constraint of returning just a single sentence when the majority choice is the first sentence of the email. Since we tend not to find issue sentences at the very top of emails, we return all possible issue sentences in rule 1.</Paragraph>
<Paragraph position="15"> Figure 5. Rules for the second oracle:
1. If a majority exists, return it, UNLESS i = 1, in which case return all choices.
2. If there is a tie, retrieve the sentence with the lowest index number i, where i ≠ 1.</Paragraph>
<Paragraph position="16"> Finally, as a baseline, the third oracle returns all of the possible issue sentences identified by the contributing methods.</Paragraph>
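<Paragraph> As an illustrative sketch, the first two oracles can be written as follows. The inputs are the 1-based issue-sentence indices chosen by the Centroid, SVD Centroid and SVD Key Sentence methods; we read the default rule as falling back to the Centroid method's choice when neither rule applies, and the identifiers are ours rather than from the implementation.

# Sketch of the voting oracles of Figures 4 and 5; indices are 1-based.
from collections import Counter

def oracle_one(choices, centroid_choice):
    # Figure 4: returns a single sentence index.
    index, votes = Counter(choices).most_common(1)[0]
    if votes > 1:
        return index                    # rule 1: a majority exists
    non_initial = [i for i in choices if i != 1]
    if non_initial:
        return min(non_initial)         # rule 2: lowest index i, where i != 1
    return centroid_choice              # assumed default: the Centroid choice

def oracle_two(choices, centroid_choice):
    # Figure 5: may return several sentence indices.
    index, votes = Counter(choices).most_common(1)[0]
    if votes > 1:
        # rule 1: majority, UNLESS it is sentence 1, then return all choices
        return [index] if index != 1 else sorted(set(choices))
    non_initial = [i for i in choices if i != 1]
    return [min(non_initial)] if non_initial else [centroid_choice]

The third oracle, the baseline, would simply return sorted(set(choices)).</Paragraph>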
</Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Extracting the Responses to the Issue </SectionTitle>
<Paragraph position="0"> To extract the responses to the issue, we simply take the first sentence of each responding participant's reply, making sure to extract only one response per participant.</Paragraph>
<Paragraph position="1"> An alternative solution, analogous to that of issue detection, was also considered: applying the issue detection algorithm to the reply email in question. However, it turned out that most of the tagged responses occurred at the start of each reply email, so a more complex approach was unnecessary and potentially introduced more errors.</Paragraph>
</Section> </Paper>