<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1027"> <Title>Learning to Detect Conversation Focus of Threaded Discussions</Title>
<Section position="6" start_page="211" end_page="214" type="evalu"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="211" end_page="212" type="sub_section"> <SectionTitle> 5.1 Experimental Setup </SectionTitle>
<Paragraph position="0"> We tested our conversation-focus detection approach using a corpus of threaded discussions from three semesters of a USC undergraduate course. From the complete corpus, we selected only threads containing more than two and fewer than nine messages. Threads of length two would bias the random guess of our baseline system, while threads longer than eight messages make up only 3.7% of the total number of threads (640) and are the least coherent, owing to topic switching and off-topic remarks. Thus, our evaluation corpus included 314 threads consisting of 1307 messages, with an average of 4.16 messages per thread. Table 2 gives the distribution of thread lengths.</Paragraph>
<Paragraph position="1"> The input to our system requires the identification of speech act (SA) relations between messages. Collective classification approaches, similar to the dependency-network-based approach that Carvalho and Cohen (2005) used to classify email speech acts, might also be applied to discussion threads. However, because this paper investigates how SA analysis, along with other features, can benefit conversation focus detection, we used manually annotated SA relationships for our analysis in order to avoid propagating errors from speech act labeling into subsequent processing.</Paragraph>
<Paragraph position="2"> The corpus contains 1339 speech acts. Table 3 gives the frequencies and percentages of the speech acts found in the data set. Each SA generates feature-oriented weighted links in the thread graph, as discussed previously.</Paragraph>
<Paragraph position="3"> We then read each thread and chose the message containing the best answer to the initial query as the gold standard. If there were multiple best-answer messages, all of them were ranked as best, i.e., chosen for the top position; for example, different authors may have provided suggestions that were each correct for a specific situation. Table 4 gives the distribution of the number of correct messages in our gold standard. We experimented with further segmenting the messages so as to narrow down the best-answer text, under the assumption that long messages probably include some less-than-useful information. We applied TextTiling (Hearst, 1994) to segment the messages, the technique used by Zhou and Hovy (2005) to summarize discussions.</Paragraph>
<Paragraph position="4"> For our corpus, though, the ratio of segments to messages was only 1.03, which indicates that our messages are relatively short and coherent and that segmenting them would not provide additional benefit.</Paragraph> </Section>
<Section position="2" start_page="212" end_page="212" type="sub_section"> <SectionTitle> 5.2 Baseline System </SectionTitle>
<Paragraph position="0"> To compare the effectiveness of our approach with different features, we designed a baseline system that uses random guessing: given a discussion thread, the baseline system randomly selects one message as the most important. The result was evaluated against the gold standard. Performance comparisons between the baseline system and the feature-induced approaches are presented below.</Paragraph> </Section>
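As a concrete illustration of this evaluation setup, the following minimal Python sketch (not the authors' code; the thread representation, field layout, and random seed are assumptions) scores a random-guess baseline for precision against a gold standard in which a thread may have several best-answer messages:

import random

def random_baseline_precision(threads, seed=0):
    # threads: list of (message_ids, gold_best_ids) pairs -- an assumed
    # representation, not the paper's actual data format.
    rng = random.Random(seed)
    correct = 0
    for message_ids, gold_best_ids in threads:
        guess = rng.choice(message_ids)      # randomly pick the "most important" message
        if guess in gold_best_ids:           # count a hit if it is any gold best answer
            correct += 1
    return correct / len(threads)

# Toy example: two threads of lengths 4 and 3 (corpus threads range from 3 to 8 messages).
toy_threads = [
    (["m1", "m2", "m3", "m4"], {"m3"}),
    (["m5", "m6", "m7"], {"m6", "m7"}),
]
print(random_baseline_precision(toy_threads))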
<Section position="3" start_page="212" end_page="214" type="sub_section"> <SectionTitle> 5.3 Result Analysis and Discussion </SectionTitle>
<Paragraph position="0"> We conducted extensive experiments to investigate the performance of our approach with different combinations of features. As discussed in Section 4.2, each poster acquires a trustworthiness score based on his or her behavior, computed via an analysis of the whole corpus. Table 5 lists a sample of posters with their poster id, the total number of responses (to their messages), the total number of positive responses, and their poster scores. Based on the poster scores, we computed the strength score of each SA with Equation 7 and projected the scores onto [0, 1]. Table 6 shows the strength scores for all of the SAs. Each SA has a different strength score, and those in the NEGATIVE category have smaller scores (i.e., weaker recommendations). We tested the graph-based HITS algorithm with different feature combinations, setting the convergence threshold (error rate) to 0.0001. In our experiments, we computed the precision and the MRR (Mean Reciprocal Rank) score (Voorhees, 2001) of the most informative message chosen (the first, if there was more than one). Table 7 shows the performance scores for the system with different feature combinations; the performance of the baseline system is shown at the top. The HITS algorithm assigns both a hub score and an authority score to each message node, resulting in two sets of results: the HITS_AUTHORITY rows of Table 7 give the results using authority scores, while the HITS_HUB rows give the results using hub scores.</Paragraph>
<Paragraph position="1"> Because thread length is bounded, the lower bound of the MRR score is 0.263. As shown in the table, the random-guess baseline system achieves a precision of 27.71% and an MRR score of 0.539.</Paragraph>
<Paragraph position="2"> When we consider only lexical similarity, the results are relatively poor, which supports the notion that in human conversation, context is often more important than surface-level text. When we consider the poster and lexical scores together, performance improves. As expected, the best performances use speech act analysis. Adding features does not always improve performance; for example, the lexical feature sometimes decreases it. Our best configuration produced a precision of 70.38% and an MRR score of 0.825, a significant improvement over the baseline's precision of 27.71% and MRR score of 0.539.</Paragraph>
<Paragraph position="3"> Another widely used graph algorithm in IR is PageRank (Brin and Page, 1998), which models the connections between hyperlinks in web-page retrieval. PageRank uses a &quot;random walk&quot; model of a web surfer's behavior: the surfer begins at a random node m_i and at each step either follows an outgoing hyperlink with probability d or jumps to a random node with probability (1-d). A weighted PageRank algorithm is used to model weighted relationships among a set of objects; in the iterative updating expression (a standard weighted form of which is sketched below), r and r+1 index successive iterations.</Paragraph>
<Paragraph position="4"> We also tested this algorithm in our situation, but the best performance had a precision of only 47.45% and an MRR score of 0.669. It may be that PageRank's definition and modeling approach do not fit our situation as well as HITS: the authority- and hub-based approach of HITS is better suited to the analysis of human conversation than PageRank, which considers only the contributions from the backward links of each node in the graph.</Paragraph>
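For concreteness, here is a short Python sketch of the two graph algorithms compared above. It is not the authors' implementation; the adjacency-matrix layout, the damping factor d = 0.85, and the toy weights are assumptions, and the weighted PageRank update in the comment is the standard form that the description of the algorithm corresponds to. Both routines iterate until the change between iterations falls below the 0.0001 threshold mentioned in the text.

import numpy as np

def weighted_hits(W, tol=1e-4, max_iter=1000):
    # Weighted HITS. W[i, j] is the total feature-induced link weight from
    # message i to message j. Returns (authority, hub) vectors, each
    # normalised to sum to 1; a higher authority score marks a more
    # informative message.
    n = W.shape[0]
    auth = np.ones(n) / n
    hub = np.ones(n) / n
    for _ in range(max_iter):
        new_auth = W.T @ hub                      # strong incoming links confer authority
        new_hub = W @ new_auth                    # linking to authorities confers hub value
        new_auth /= new_auth.sum() if new_auth.sum() > 0 else 1.0
        new_hub /= new_hub.sum() if new_hub.sum() > 0 else 1.0
        converged = (np.abs(new_auth - auth).sum() < tol and
                     np.abs(new_hub - hub).sum() < tol)
        auth, hub = new_auth, new_hub
        if converged:
            break
    return auth, hub

def weighted_pagerank(W, d=0.85, tol=1e-4, max_iter=1000):
    # Standard weighted PageRank update (assumed form):
    #   PR[i] <- (1 - d) + d * sum_j (W[j, i] / sum_k W[j, k]) * PR[j]
    n = W.shape[0]
    out_sums = W.sum(axis=1)
    out_sums[out_sums == 0] = 1.0                 # guard against sink nodes
    pr = np.ones(n)
    for _ in range(max_iter):
        new_pr = (1 - d) + d * ((W / out_sums[:, None]).T @ pr)
        if np.abs(new_pr - pr).sum() < tol:
            return new_pr
        pr = new_pr
    return pr

# Toy thread of four messages; W encodes feature-induced weighted links.
W = np.array([[0.0, 0.8, 0.0, 0.3],
              [0.0, 0.0, 0.0, 0.9],
              [0.0, 0.5, 0.0, 0.4],
              [0.0, 0.0, 0.0, 0.0]])
auth, hub = weighted_hits(W)
print(int(np.argmax(auth)), int(np.argmax(weighted_pagerank(W))))

In this sketch the feature combination is assumed to be already folded into the link weights W; ranking by the authority and hub vectors corresponds to the HITS_AUTHORITY and HITS_HUB rows of Table 7, respectively.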
</Section> </Section> </Paper>