<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1014">
  <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 105-112, Vancouver, October 2005. c(c)2005 Association for Computational Linguistics Novelty Detection: The TREC Experience</Title>
  <Section position="7" start_page="110" end_page="110" type="concl">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> The novelty track in TREC examined a particular kind of novelty detection: finding novel, on-topic sentences within documents that the user is reading. Both statistical and linguistic techniques, as well as filtering and learning approaches, can be applied to detecting novel relevant information within documents, but it remains a hard problem for several reasons. First, because the unit of interest is a sentence, there is not much data in each unit on which to base the decision. When the document as a whole is relevant, techniques designed for document retrieval seem unable to make fine distinctions about which sentences within the document contain the relevant information. In addition, initial threshold setting is critical and difficult.</Paragraph>
    <Paragraph position="1"> When we examined human performance on this task, it was clear that users do make very fine distinctions. Looking particularly at the 2004 set of relevant and novel sentences, fewer than 20% of the sentences in relevant documents were marked as relevant, and only 40% of those (or 8% of the total sentences) were marked as both relevant and novel.</Paragraph>
    <Paragraph position="2"> The TREC novelty data sets themselves support some interesting uses outside of the novelty track.</Paragraph>
    <Paragraph position="3"> Whereas the data from 2002 is clearly flawed and should not be used, the data from 2003 and 2004 can be regarded as valid samples of user input, both in the selection of relevant sentences and in the further reduction of those sentences to the ones presenting new information. One obvious use is in the passage retrieval arena, e.g., using the relevant sentences to test passage retrieval, either at the single-sentence level or using consecutive sentences to test when to retrieve multiple sentences. A second use is for summarization, where the relevant AND novel sentences can serve as the truth data for the extraction phase (to be compressed in some manner afterwards). Other uses of the data include manual analysis of user behavior when processing documents in response to a question, or looking further into user agreement issues, particularly in the summarization area.</Paragraph>
    <Paragraph position="4"> The novelty data is also unique in that it deliberately contains a mix of topics on events and topics on opinions regarding controversial subjects. The opinion topics are quite different in this regard from other TREC topics, which have historically focused on events or narrative information about a subject or person. This exploration has been an interesting and fruitful one. By mixing the two topic types within each task, we see that identifying opinions within documents is hard, even with training data, while detecting new opinions (given relevance) seems analogous to detecting new information about an event.</Paragraph>
  </Section>
</Paper>