<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1102"> <Title>Exploiting Lexical Expansions and Boolean Compositions for Web Querying</Title> <Section position="2" start_page="13" end_page="14" type="metho"> <SectionTitle> 1 Lexical expansion </SectionTitle> <Paragraph position="0"> Two kinds of lexical expansion have been used in the experiment: morphological derivations and synonym expansions. Both expand a &quot;basic-keyword&quot;, that is, a keyword directly derived from a natural language question. The language used in the experiments is Italian.</Paragraph> <Section position="1" start_page="13" end_page="13" type="sub_section"> <SectionTitle> 1.1 Basic keywords </SectionTitle> <Paragraph position="0"> The idea is that this level of keywords should reflect as closely as possible the words an average user would use to query a web search engine.</Paragraph> <Paragraph position="1"> Given a question expressed as a natural language sentence, its basic keywords are derived by selecting the lemma of each content word of the question. Verbs are transformed into their corresponding nominalizations. Furthermore, we decided to treat collocations and multiwords as single keywords, as most currently available search engines allow the user to specify &quot;phrases&quot; in a very simple way. In the experiments presented in the paper, multiword expressions are manually recognized and then added to the basic keyword list.</Paragraph> <Paragraph position="2"> Figure 1 shows a couple of questions with their respective basic keywords.</Paragraph> <Paragraph position="3"> NL-QUESTION: Chi ha inventato la luce elettrica? 
(Who invented the electric light?)</Paragraph> </Section> <Section position="2" start_page="13" end_page="14" type="sub_section"> <SectionTitle> 1.2 Morphological derivation </SectionTitle> <Paragraph position="0"> Morphological derivations are considered because they introduce new lemmas that we might find in possible correct answers to the question, thereby improving the engine's recall. For instance, for a question like &quot;Chi ha inventato la luce elettrica?&quot; (&quot;Who invented the electric light?&quot;) we can imagine different contexts for the correct answer, such as &quot;La luce elettrica fu inventata da Edison&quot; (&quot;Electric light was invented by Edison&quot;), &quot;L'inventore della luce elettrica fu Edison&quot; (&quot;The inventor of electric light was Edison&quot;), &quot;L'invenzione della luce elettrica è dovuta a Edison&quot; (&quot;The invention of electric light is due to Edison&quot;), where different morphological derivations of the same basic keyword &quot;inventore&quot; (&quot;inventor&quot;) appear. Derivations have been automatically extracted from an Italian monolingual dictionary (Disc, 1997) and collected without considering the derivation order (i.e. &quot;inventare&quot; belongs to the derivation set of &quot;inventore&quot; even if in the actual derivation it is the noun that derives from the verb).</Paragraph> </Section> <Section position="3" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 1.3 Synonyms </SectionTitle> <Paragraph position="0"> Keyword expansion based on synonyms can potentially improve the system recall, as the answer to the question might contain synonyms of the basic keyword. 
For instance, the answer to the question &quot;Chi ha inventato la luce elettrica?&quot; (&quot;Who invented the electric light?&quot;) might be one among &quot;Lo scopritore della luce elettrica fu Edison&quot; (&quot;The discoverer of electric light was Edison&quot;), &quot;L'inventore della illuminazione elettrica fu Edison&quot; (&quot;The inventor of electric illumination was Edison&quot;), &quot;Lo scopritore della illuminazione elettrica fu Edison&quot; (&quot;The discoverer of electric illumination was Edison&quot;), where different synonyms of &quot;inventore&quot; (&quot;inventor&quot;) and &quot;luce elettrica&quot; (&quot;electric light&quot;) appear. In the experiment reported in section 3, Italian synonyms have been manually extracted from the ItalianWordnet database (Roventini et al., 2000), a further extension of the Italian Wordnet produced by the EuroWordNet project (Vossen, 1998). Once the correct synset for a basic keyword is selected, its synonyms are added to the expansion list. In the near future we plan to automate the process of synset selection using word domain disambiguation, a variant of word sense disambiguation based on subject field code information added to WordNet (Magnini and Cavaglià, 2000).</Paragraph> </Section> <Section position="4" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 1.4 Expansion chains </SectionTitle> <Paragraph position="0"> The expansions described in the previous sections could be recursively applied to every lemma obtained through a morphological or a synonym expansion. For example, at the first expansion level we can pass from &quot;inventore&quot; (&quot;inventor&quot;) to its synonym &quot;scopritore&quot; (&quot;discoverer&quot;), from which in turn we can morphologically derive the noun &quot;scoperta&quot; (&quot;discovery&quot;), and so on (cf. Figure 2). 
This would allow the retrieval of answers such as &quot;La scoperta della lampada ad incandescenza è dovuta a Edison&quot; (&quot;The discovery of the incandescent lamp is due to Edison&quot;).</Paragraph> <Paragraph position="1"> Although in the experiment reported in this paper we do not use recursive expansions (i.e.</Paragraph> <Paragraph position="2"> we stop at the first level of the expansion chain), a long-term goal of this work is to verify their effects on document relevance.</Paragraph> <Paragraph position="4"/> </Section> </Section> <Section position="3" start_page="14" end_page="14" type="metho"> <SectionTitle> 2 Query compositions </SectionTitle> <Paragraph position="0"> We wanted to take advantage of the &quot;advanced&quot; capabilities of the search engine. In particular, we experimented with the &quot;Boolean phrase&quot; modality, which allows the user to submit queries with keywords composed by means of logical operators. However, we quickly realised that realistic choices were restricted to disjunctive compositions of short AND clauses (i.e. with a limited number of elements, typically not more than four). This constrained us to two hypotheses, described in sections 2.2 and 2.3, which have been compared with a baseline composition strategy, described in 2.1.</Paragraph> <Section position="1" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 2.1 Keyword &quot;and&quot; composition search (KAS) </SectionTitle> <Paragraph position="0"> This search strategy corresponds to the default method that most search engines implement. Given a list of basic keywords, no expansion is performed and the keywords are composed in a single AND clause. An example is reported in Figure 3.
</Paragraph> </Section> <Section position="2" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 2.2 Keyword expansion insertion search (KIS) </SectionTitle> <Paragraph position="0"> In this composition modality a disjunctive expression is constructed where each disjoint element is an AND clause formed by the basic keywords plus the insertion of a single expansion. In addition, to guarantee that at least the same documents as in the KAS modality are retrieved, both an AND clause with the basic keywords and all the single basic keywords are added as disjoint elements. Figure 4 reports an example. If the AND combination of the basic keywords produces a non-empty set of documents, then the KIS modality should return the same set of documents rearranged by the presence of the keyword expansions. What we expect is an improvement in the position of a significant document, which is relevant when huge amounts of documents are retrieved.</Paragraph> <Paragraph position="1"> NL-QUESTION: Chi ha inventato la luce elettrica? (Who invented the electric light?) In this composition modality a disjunctive expression is constructed where each disjoint element is an AND clause formed by one of the possible tuples derived from the expansion sets of the basic keywords. In addition, to guarantee that at least the same documents as in the KAS modality are retrieved, the single basic keywords are added as disjoint elements. Figure 5 reports an example.</Paragraph> <Paragraph position="2"> As in the previous case, we expect that at least the same results as in the KAS search are returned, because the AND composition of the basic keywords is guaranteed. 
We also expect a possible improvement in recall, because new AND clauses are inserted.</Paragraph> <Paragraph position="3"> NL-QUESTION: Chi ha inventato la luce</Paragraph> </Section> </Section> <Section position="4" start_page="14" end_page="18" type="metho"> <SectionTitle> 3 Comparison experiment </SectionTitle> <Paragraph position="0"> This section reports on the problems we faced in comparing the three search strategies presented in section 2. The question set, the document assessment and the scoring used in the experiment are described.</Paragraph> <Section position="1" start_page="14" end_page="14" type="sub_section"> <SectionTitle> 3.1 Creating the Question Set </SectionTitle> <Paragraph position="0"> Initially, a question set of 40 fact-based, short-answer questions such as &quot;Chi è l'autore della Divina Commedia?&quot; (&quot;Who is the author of The Divine Comedy?&quot;) was created. The language was Italian, and each question was guaranteed to have at least one web document that answered it. Ambiguous questions (about 15%) were not eliminated (see Voorhees, 2000 for a discussion). A total of 20 questions were then randomly selected from the initial question set, thereby preventing possible bias in favour of queries that would perform better with lexical expansions. Figure 6 reports the final question set of the experiment.</Paragraph> <Paragraph position="1"> Chi ha inventato la luce elettrica? (Who invented the electric light?) Come si chiama l'autore del libro &quot;I Each question was then associated with a corresponding human-generated set of basic keywords, resulting in an ordered list of [nl-question, basic-keywords] pairs. We assumed a maximum of 3 basic keywords for each question, obtaining an average of 2.25. This is in line with (Jansen et al., 1998), where it is reported that, over a sample of 51,473 queries submitted to a major search service (Excite), the average query length was 2.35. 
Basic keywords are then expanded with their morphological derivations and synonyms (see Section 1), with an average of two expansions per question (min=0, max=6).</Paragraph> </Section> <Section position="2" start_page="14" end_page="17" type="sub_section"> <SectionTitle> 3.2 Document assessment </SectionTitle> <Paragraph position="0"> An automatic query generator has been implemented that, given a question with its basic keywords and lexical expansions, builds three queries, corresponding to KAS, KIS and KCS, and submits them to the search engine. Results are collected considering up to ten documents per search; then the union set is used for the evaluation experiment. There was no way for the assessor to relate a document to the search modality by which it was retrieved. Query generation, web querying and result display were all performed at runtime, during the evaluation session.</Paragraph> <Paragraph position="1"> Fifteen researchers at ITC-irst were selected as assessors in the experiment. They were asked to judge the web documents returned by the query generator with respect to a given question, choosing a value among the following five: 1) answer_in_context: The answer corresponding to the question is recovered and the document context is appropriate. For example, if the question is &quot;Who is the inventor of the electric light?&quot; then &quot;Edison&quot; is reported in the document, in some way, as the inventor of the electric light and the whole document deals with inventors and/or Edison's life.</Paragraph> <Paragraph position="2"> 2) answer_no_context: The answer to the question is recovered but the document context is not appropriate (e.g. the document deals neither with inventors nor with Edison's life). 
3) no_answer_in_context: The answer corresponding to the question is not recovered but the document context is appropriate.</Paragraph> <Paragraph position="3"> 4) no_answer_no_context: The answer corresponding to the question is not recovered and the document context is not appropriate.</Paragraph> <Paragraph position="4"> 5) no_document: The requested document is not retrieved.</Paragraph> <Paragraph position="5"> The following instructions were provided to assessors: * The judgement has to be based on the document text only, that is, no further link exploration is allowed.</Paragraph> <Paragraph position="6"> * If a question is considered ambiguous, then give it just one interpretation and use that interpretation to judge all question-related documents consistently. For example, if the question &quot;Chi è il vincitore del Tour de France?&quot; (&quot;Who is the winner of the Tour de France?&quot;) is considered ambiguous because the answer may change over time, then the assessor could decide that the correct interpretation is &quot;Who is the winner of the 1999 Tour de France?&quot; and judge all the documents consistently.</Paragraph> <Paragraph position="7"> * A document contains the answer only if it is explicitly reported in the text. That is, if the question is &quot;Who is the author of Options?&quot; it is not sufficient that the string &quot;Robert Sheckley&quot; or &quot;Sheckley&quot; is in the text; the document has to say that Robert Sheckley is the author of Options.</Paragraph> <Paragraph position="8"> Each question was judged independently by three assessors. The number of texts to be judged for a question ranged from 10 to 18, with an average of 12. 
For each question k we obtained three sets V_{KAS,k}, V_{KIS,k} and V_{KCS,k} of (pos, assessment) pairs corresponding to the three search methods, where pos is the position of the document in the ordered list returned by the search method, and assessment is the assessment of one participant.</Paragraph> </Section> <Section position="3" start_page="17" end_page="18" type="sub_section"> <SectionTitle> 3.3 Assessment scoring </SectionTitle> <Paragraph position="0"> We eliminated all the (pos, assessment) pairs whose assessment was equal to no_document.</Paragraph> <Paragraph position="1"> Given a (pos, assessment) pair i belonging to V_{KAS,k}, V_{KIS,k} or V_{KCS,k}, we define:</Paragraph> <Paragraph position="3"> r(i) = 3 if assessment is answer_in_context. Given a question k and a set V_k of (pos, assessment) pairs corresponding to an ordered list L_k of documents, to evaluate the relevance of L_k with respect to k we have defined two relevance functions, defined in [1]: f+, which considers the document position, and f-, which does not.</Paragraph> <Paragraph position="5"> where - p(i) is the position of the web document in the ordered list.</Paragraph> <Paragraph position="6"> - v(i) = α(r(i))·r(i) + β(r(i)), where α(x), β(x): {0,1,2,3} → (0,1) are tuning functions that allow the assessments to be weighted.</Paragraph> <Paragraph position="7"> - m is the maximum length of an ordered list of web documents.</Paragraph> <Paragraph position="8"> For each search method we obtained a set of 20 (f-, f+) pairs from the assessment process, i.e., we obtained 20 (f-, f+)_{KAS,k} pairs, 20 (f-, f+)_{KIS,k} pairs and 20 (f-, f+)_{KCS,k} pairs.</Paragraph> </Section> </Section> </Paper>
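<!--
The three composition strategies of Section 2 (KAS, KIS, KCS) can be illustrated with a minimal sketch. It is not the authors' query generator: the function names, the AND/OR operator spelling, the parenthesised clauses and the double-quote phrase syntax are all assumptions made for illustration, and the expansion lists only reuse words quoted in Section 1.

```python
from itertools import product


def quote(term):
    """Wrap multiword terms in double quotes (assumed phrase syntax)."""
    return f'"{term}"' if " " in term else term


def and_clause(terms):
    """Join terms into one parenthesised AND clause (assumed operator syntax)."""
    return "(" + " AND ".join(quote(t) for t in terms) + ")"


def kas(basic):
    """KAS: a single AND clause over the basic keywords, with no expansion."""
    return and_clause(basic)


def kis(basic, expansions):
    """KIS: one AND clause per single expansion inserted next to the basic
    keywords, plus the plain AND clause and each basic keyword on its own."""
    clauses = [
        and_clause(list(basic) + [exp])
        for kw in basic
        for exp in expansions.get(kw, [])
    ]
    clauses.append(and_clause(basic))
    clauses.extend(quote(kw) for kw in basic)
    return " OR ".join(clauses)


def kcs(basic, expansions):
    """KCS: one AND clause per tuple drawn from the expansion set of each
    basic keyword (the keyword itself included), plus each basic keyword."""
    choices = [[kw] + list(expansions.get(kw, [])) for kw in basic]
    clauses = [and_clause(combo) for combo in product(*choices)]
    clauses.extend(quote(kw) for kw in basic)
    return " OR ".join(clauses)


if __name__ == "__main__":
    # Hypothetical expansion table for the running example question.
    basic = ["inventore", "luce elettrica"]
    expansions = {
        "inventore": ["scopritore", "invenzione"],
        "luce elettrica": ["illuminazione elettrica"],
    }
    for name, query in (("KAS", kas(basic)),
                        ("KIS", kis(basic, expansions)),
                        ("KCS", kcs(basic, expansions))):
        print(name, query)
```

In this sketch the KIS query has one clause per expansion, so it grows linearly with the number of expansions, while the KCS query grows with the product of the expansion set sizes, which is consistent with the remark in Section 2 that only short AND clauses are realistic.
-->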