File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/n06-2045_metho.xml
Size: 11,788 bytes
Last Modified: 2025-10-06 14:10:11
<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2045"> <Title>Lycos Retriever: An Information Fusion Engine</Title> <Section position="3" start_page="2" end_page="3" type="metho"> <SectionTitle> 2 Motivations </SectionTitle> <Paragraph position="0"> The presentation of search results as ranked lists of document links has become so ingrained that it is hard now to imagine alternatives to it. Other interfaces, such as graphical maps or visualizations, have not been widely adopted. Question-answering interfaces on the Web have not had a high adoption http://www.wikipedia.org rate, either: it is hard to get users to venture beyond the 2.5 word queries they are accustomed to, and if question-answering results are not reliably better than keyword search, users quickly return to key-word queries. Many user queries specify nothing more than a topic anyway.</Paragraph> <Paragraph position="1"> But why treat common queries exactly like unique queries? For common queries we know that incentives for ranking highly have led to techniques for artificially inflating a site's ranking at the expense of useful information. So the user has many useless results to sift through. Furthermore, users are responsive to filtered information, as the upsurge in popularity of Wikipedia and Answers.com demonstrate.</Paragraph> <Paragraph position="2"> Retriever responds to these motivations by automatically generating a narrative summary that answers, &quot;What do I need to know about this topic?&quot; for the most popular topics on the Web.</Paragraph> </Section> <Section position="4" start_page="3" end_page="177" type="metho"> <SectionTitle> 3 Lycos Retriever pages </SectionTitle> <Paragraph position="0"> Figure 1 shows a sample Retriever page for the topic &quot;Mario Lemieux&quot;.</Paragraph> <Paragraph position="1"> The topic is indicated at the upper left. Below it is a category assigned to the topic, in this case Sports > Hockey > Ice Hockey > National Hockey League > Lemieux, Mario. The main body of the page is a set of paragraphs beginning with a biographical paragraph complete with Lemieux's birth date, height, weight and position extracted from Nationmaster.com, followed by paragraphs outlining his career from See (Liu, 2003) for a similarly motivated system. other sources. The source for each extract is indicated in shortened form in the left margin of the page; mousing over the shortened URL reveals the full title and URL. Associated images are thumbnailed alongside the extracted paragraphs. Running down the right side of the page under More About is a set of subtopics. Each subtopic is a link to a page (or pages) with paragraphs about the topic (Lemieux) with respect to such subtopics as Games, Seasons, Pittsburgh Penguins, Wayne Gretzky, and others, including the unpromising subtopic ice.</Paragraph> </Section> <Section position="5" start_page="177" end_page="177" type="metho"> <SectionTitle> 4 Topic Selection </SectionTitle> <Paragraph position="0"> An initial run of about 60K topics was initiated in December, 2005; this run yielded approximately 30K Retriever topic pages, each of which can have multiple display pages. Retriever topics that had fewer than three paragraphs or which were categorized as pornographic were automatically deleted. The biggest source of topic candidates was Lycos's own query logs. A diverse set of topics was chosen in order to see which types of topics generated the best Retriever pages.</Paragraph> </Section> <Section position="6" start_page="177" end_page="178" type="metho"> <SectionTitle> 5 Topic Categorization & Disambiguation </SectionTitle> <Paragraph position="0"> After a topic was input to the system, the Retriever system assigned it a category using a naive Bayes classifier built on a spidered DMOZ hierarchy.</Paragraph> <Paragraph position="1"> Various heuristics were implemented to make the returned set of categories uniform in length and depth, up-to-date, and readable.</Paragraph> <Paragraph position="2"> Once the categorizer assigned a set of categories to a topic, a disambiguator module determined whether the assigned categories could be assigned to a single thing using a set of disambiguating features learned from the DMOZ data itself. For example, for the topic 'Saturn', the assigned categories included 'Science/Astronomy', 'Recreation/Autos' and 'Computers/Video Games' (Sega Saturn). The disambiguator detected the presence of feature pairs in these that indicated more than one topic. Therefore, it clustered the assigned categories into groups for the car-, astronomy- and video-game-senses of the topic and assigned each group a discriminative term which was used to disambiguate the topic: Saturn (Auto), Saturn (Solar System), Saturn (Video Game). Retriever returned pages only for topics that were believed to be disambiguated according to DMOZ. If no categories were identified via DMOZ, a default Other category was assigned unless the system guessed that the topic was a personal name, based on its components. null The live system assigns non-default categories with 86.5% precision; a revised algorithm achieved 93.0% precision, both based on an evaluation of 982 topics. However, our precision on identifying unambiguous topics with DMOZ was only 83%.</Paragraph> <Paragraph position="3"> Still, this compares well with the 75% precision achieved on by the best-performing system on a similar task in the 2005 KDD Cup (Shen 2005).</Paragraph> </Section> <Section position="7" start_page="178" end_page="178" type="metho"> <SectionTitle> 6 Document Retrieval </SectionTitle> <Paragraph position="0"> After a topic was categorized and disambiguated, the disambiguated topic was used to identify up to 1000 documents from Lycos' search provider. For ambiguous topics various terms were added as optional 'boost' terms, while terms from other senses of the ambiguous topic categories were prohibited.</Paragraph> <Paragraph position="1"> Other query optimization techniques were used to get the most focused document set, with non-English and obscene pages filtered out</Paragraph> </Section> <Section position="8" start_page="178" end_page="178" type="metho"> <SectionTitle> 7 Passage Extraction </SectionTitle> <Paragraph position="0"> Each URL for the topic was then fetched. An HTML parser converted the document into a sequence of contiguous text blocks. At this point, contiguous text passages were identified as being potentially interesting if they contained an expression of the topic in the first sentence.</Paragraph> <Paragraph position="1"> When a passage was identified as being potentially interesting, it was then fully parsed to see if an expression denoting the topic was the Discourse Topic of the passage. Discourse Topic is an under-theorized notion in linguistic theory: not all linguists agree that the notion of Discourse Topic is required in discourse analysis at all (cf.</Paragraph> <Paragraph position="2"> Asher, 2004). For our purposes, however, we formulated a set of patterns for identifying Discourse Topics on the basis of the output of the CMU Link</Paragraph> <Section position="1" start_page="178" end_page="178" type="sub_section"> <SectionTitle> Parser </SectionTitle> <Paragraph position="0"> the system uses.</Paragraph> <Paragraph position="1"> Paradigmatically, we counted ordinary subjects of the first sentence of a passage as expressive of the Discourse Topic. So, if we found an expression of the topic there, either in full or reduced form, we took that as an instance of the topic appearing as Discourse Topic in that passage http://www.link.cs.cmu.edu/link/ and ranked that passage highly. Of course, not all Discourse Topics are expressed as subjects, and the system recognized this.</Paragraph> <Paragraph position="2"> A crucial aspect of this functionality is to identify how different sorts of topics can be expressed in a sentence. To give a simple illustration, if the system believes that a topic has been categorized as a personal name, then it accepted reduced forms of the name as expressions of the topic (e.g. &quot;Lindsay&quot; and &quot;Lohan&quot; can both be expressions of the topic &quot;Lindsay Lohan&quot; in certain contexts); but it does not accept reduced forms in all cases.</Paragraph> <Paragraph position="3"> Paragraphs were verified to contain a sequence of sentences by parsing the rest of the contiguous text. The verb associated with the Discourse Topic of the paragraph was recorded for future use in assembling the topic report. Various filters for length, keyword density, exophoric expressions, spam and obscenity were employed. A score of the intrinsic informativeness of the paragraph was then assigned, making use of such metrics as the length of the paragraph, the number of unique NPs, the type of verb associated with the Discourse Topic, and other factors.</Paragraph> <Paragraph position="4"> Images were thumbnailed and associated with the extracted paragraph on the basis of matching text in the image filename, alt-text or description elements of the tag as well as the size and proximity of the image to the paragraph at hand. We did not analyze the image itself.</Paragraph> </Section> </Section> <Section position="9" start_page="178" end_page="179" type="metho"> <SectionTitle> 8 Subtopic Selection and Report Assembly </SectionTitle> <Paragraph position="0"> Once the system had an array of extracted paragraphs, ranked by their intrinsic properties, we began constructing the topic report by populating an initial 'overview' portion of the report with some of the best-scoring paragraphs overall.</Paragraph> <Paragraph position="1"> First, Retriever eliminated duplicate and near-duplicate paragraphs using a spread-activation algorithm.</Paragraph> <Paragraph position="2"> Next the system applied question-answering methodology to order the remaining paragraphs into a useful overview of the topic: first, we found the best two paragraphs that say what the topic is, by finding the best paragraphs where the topic is the Discourse Topic of the paragraph and the associated verb is a copula or copula-like (e.g. be known as). Then, in a similar way, we found the best few paragraphs that said what attributes the topic has. Then, a few paragraphs that said what the topic does, followed by a few paragraphs that said what happens to the topic (how it is used, things it has undergone, and so on). The remaining paragraphs were then clustered into subtopics by looking at the most frequent NPs they contain, with two exceptions. First, superstrings of the topic were favored as subtopics in order to discover complex nominals in which the topic appears. Secondly, non-reduced forms of personal names were required as subtopics, even if a reduced form was more frequent.</Paragraph> <Paragraph position="3"> Similar heuristics were used to order paragraphs within the subtopic sections of the topic report as in the overview section.</Paragraph> <Paragraph position="4"> Additional constraints were applied to stay within the boundaries of fair use of potentially copyrighted material, limiting the amount of contiguous text from any one source.</Paragraph> <Paragraph position="5"> Topic reports were set to be refreshed by the system five days after they were generated in order to reflect any new developments.</Paragraph> <Paragraph position="6"> In an evaluation of 642 paragraphs, 88.8% were relevant to the topic; 83.4% relevant to the topic as categorized. For images, 85.5% of 83 images were relevant, using a revised algorithm, not the live system. Of 1861 subtopic paragraphs, 88.5% of paragraphs were relevant to the assigned topic and subtopic.</Paragraph> </Section> class="xml-element"></Paper>