<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1083"> <Title>Browsing Help for Faster Document Retrieval</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Intuition search engine </SectionTitle> <Paragraph position="0"> Intuition, the search engine of Sinequa, is based on both deep statistics and linguistic knowledge and processing (Loupy et al., 2003). During indexing, documents are analysed by part-of-speech tagging and lemmatization.</Paragraph> <Paragraph position="1"> The most original linguistic feature of Intuition, however, is its semantic lexicon based on the &quot;see also&quot; relation (Manigot and Pelletier, 1997). More precisely, the lexicon consists of bags of words whose members are linked by a common seme. For instance, the bag of words &quot;Wind&quot; contains wind, hurricane, to blow, tornado, etc. In total, 800 bags of words describe the &quot;Universe&quot;. This may seem very coarse, but it is sufficient for most applications. These bags of words define a Salton-like vector space (Salton, 1983) of 800 dimensions. For French, 120,000 lemmas are represented in this space (a word can belong to several dimensions). During the analysis of a document, the vectors of its terms are summed to obtain a representation of the document in this space.</Paragraph> <Paragraph position="2"> First, this analysis allows a thematic characterization of a document. Secondly, it increases both precision and recall. When a query is submitted to Intuition, two searches are run in parallel. The first is the standard search for documents containing the words (lemmas) of the query or their synonyms. The second searches for documents on similar subjects, that is, documents whose vectors are close to that of the query. Each document of the corpus thus receives two scores, which are merged according to a user-defined heuristic. 
The advantage of this approach is that the top-ranked documents not only contain the words of the query but are also closely related to its subject. Lastly, this vector representation of words and documents allows semantically ambiguous words to be disambiguated.</Paragraph> </Section> </Paper>