<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1903">
<Title>A Reliable Indexing Method for a Practical QA System</Title>
<Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Previous works </SectionTitle>
<Paragraph position="0"> Current QA approaches can be classified into two groups: text-snippet extraction methods and noun-phrase extraction methods (also called closed-class QA) (Vicedo and Ferrandex (2000)).</Paragraph>
<Paragraph position="1"> Text-snippet extraction methods locate and extract the sentences or paragraphs most relevant to the query, on the assumption that this text will probably contain the correct answer. These methods were the most commonly used by participants in the last TREC QA Track (Moldovan et al. (1999); Prager, Radev, Brown and Coden (1999)). Noun-phrase extraction methods find concrete information, mainly noun phrases, requested by users' closed-class questions. A closed-class question is a question stated in natural language that assumes a definite answer typified by a noun phrase rather than a procedural answer.</Paragraph>
<Paragraph position="2"> ExtrAns (Berri, Molla and Hess (1998)) is a representative QA system that uses the text-snippet extraction method. The system locates the phrases in a document from which a user can infer an answer. However, the system is difficult to port to other domains because it uses syntactic and semantic information that covers only a very limited domain (Vicedo and Ferrandex (2000)).</Paragraph>
<Paragraph position="3"> FALCON (Harabagiu et al. (2000)) is another text-snippet system. The system returns answer phrases with high precision because it integrates different forms of syntactic, semantic and pragmatic knowledge to achieve better performance. The answer engine of FALCON handles reformulations of previously posed questions, finds the expected answer type from a large hierarchy that incorporates WordNet (Miller (1990)), and extracts answers after performing unifications on the semantic forms of the question and its answer candidates. Although FALCON achieves good performance, it is not appropriate as a practical QA system because it is difficult to construct domain-specific knowledge such as a semantic net.</Paragraph>
<Paragraph position="4"> MURAX (Kupiec (1993)) is one of the noun-phrase extraction systems. MURAX uses modules for shallow linguistic analysis: a Part-Of-Speech (POS) tagger and a finite-state recognizer for matching lexico-syntactic patterns. The finite-state recognizer decides users' expectations and filters out various answer hypotheses. For example, the answers to questions beginning with the word Who are likely to be people's names. Some QA systems participating in TREC use shallow linguistic knowledge and take approaches similar to those used in MURAX (Vicedo and Ferrandex (2000)).</Paragraph>
<Paragraph position="5"> These QA systems use specialized shallow parsers to identify the asking point (who, what, when, where, etc.). However, they have long response times because they apply rules to every sentence containing answer candidates and score each answer at retrieval time.
To overcome this weak point, the GuruQA system (Prager, Brown and Coden (2000)), one of the text-snippet systems, uses a method that indexes answer candidates in advance (so-called Predictive Annotation).</Paragraph>
<Paragraph position="6"> Predictive Annotation identifies answer candidates in a text, annotates them accordingly, and indexes them. Although the GuruQA system replies quickly to users' queries and performs well, it passes over useful information that lies outside a document boundary. In other words, the system restricts the size of the context window containing an answer candidate, from a single sentence up to a whole document, and calculates the similarity between the keywords in a query and the keywords in the window. The system does not consider any information outside the window at all.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="21" type="metho"> <SectionTitle> 2 Approach of MAYA </SectionTitle>
<Paragraph position="0"> MAYA has been designed as a separate component that interfaces with a traditional IR system. In other words, it can be run without an IR system. As shown in Figure 1, it consists of two engines: an indexing engine and a searching engine.</Paragraph>
<Paragraph position="2"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Predictive answer indexing </SectionTitle>
<Paragraph position="0"> The answer indexing phase can be separated into two stages: answer finding and term scoring. For answer finding, we classify users' asking points into 105 semantic categories. As shown in Table 1, the 105 semantic categories are organized in two layers: a first layer and a second layer. The semantic categories in the first layer have broader meanings than those in the second layer.</Paragraph>
<Paragraph position="1"> To define the 105 categories, we referred to the categories of QA systems participating in TREC and analyzed users' query logs collected by a commercial IR system (DiQuest.com (n.d.)).</Paragraph>
<Paragraph position="2"> To extract answer candidates belonging to each category from documents, the indexing engine uses a POS tagger and an NE recognizer. The NE recognizer consists of a named-entity dictionary (the so-called PLO dictionary) and a pattern matcher. The PLO dictionary contains not only the names of people, countries, cities, and organizations, but also many units, such as units of length (e.g. cm, m, km) and units of weight (e.g. mg, g, kg). After looking up the dictionary, the NE recognizer assigns a semantic category to each answer candidate, disambiguating with POS tagging. For example, the NE recognizer extracts four answer candidates annotated with four semantic categories from the sentence &quot;Yahoo Korea (CEO Jinsup Yeom www.yahoo.co.kr) expanded the size of the storage for free email service to 6 mega-bytes.&quot;: Yahoo Korea belongs to company, Jinsup Yeom is person, www.yahoo.co.kr is URL, and 6 mega-bytes is size. Complex lexical candidates such as www.yahoo.co.kr are extracted by the pattern matcher. The pattern matcher extracts formatted answers such as telephone numbers, email addresses, and URLs. The patterns are described as regular expressions; a small illustrative sketch is given below.</Paragraph>
<Paragraph position="3"> In the next stage, the indexing engine gives scores to content words that occur with answer candidates within a context window. The maximum size of the context window is three sentences: the previous sentence, the current sentence, and the next sentence.</Paragraph>
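As a rough illustration of the pattern-matching step described above, the sketch below extracts formatted answer candidates with regular expressions. The category names and the concrete expressions are illustrative assumptions only; MAYA's actual patterns are not given in this paper.

    import re

    # Hypothetical patterns for "formatted" answer candidates (email, URL, telephone).
    # These are illustrative guesses, not the regular expressions used by MAYA.
    PATTERNS = {
        "email":   re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
        "URL":     re.compile(r"(?:https?://)?www\.[\w-]+(?:\.[\w-]+)+[\w./?=&%-]*"),
        "tel_num": re.compile(r"\+?\d{2,3}[-.\s]\d{3,4}[-.\s]\d{4}"),
    }

    def extract_formatted_candidates(sentence):
        """Return (semantic category, surface form) pairs found in a sentence."""
        found = []
        for category, pattern in PATTERNS.items():
            for match in pattern.finditer(sentence):
                found.append((category, match.group()))
        return found

    # Example from the text: the URL candidate is picked up by the URL pattern.
    print(extract_formatted_candidates(
        "Yahoo Korea (CEO Jinsup Yeom www.yahoo.co.kr) expanded the free email service."))

Dictionary-based candidates (e.g. Yahoo Korea as company, Jinsup Yeom as person) would instead come from the PLO dictionary lookup and POS disambiguation, not from such patterns.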
<Paragraph position="4"> The window size can be changed dynamically. When the indexing engine decides the window size, it checks whether neighboring sentences contain anaphors or lexical chains. If the next sentence contains anaphors or lexical chains of the current sentence and the current sentence does not contain anaphors or lexical chains of the previous sentence, the indexing engine sets the window size to 2. If neighboring sentences contain no anaphors or lexical chains, the window size is 1. Figure 2 shows an example of how the window size is set.</Paragraph>
<Paragraph position="6"> After setting the context window, the indexing engine assigns scores to the content words in the window using a 2-pass scoring method. In the first pass, the indexing engine calculates local scores of the content words. These scores indicate the magnitude of the influence that each content word has on the answer candidates in a document. For example, when www.yahoo.co.kr is an answer candidate in the sentence &quot;Yahoo Korea (www.yahoo.co.kr) starts a new service.&quot;, Yahoo Korea has a higher score than service because it is a much stronger clue to www.yahoo.co.kr. We call this score a local score because it is obtained from information between two adjacent words in a document. The indexing engine assigns local scores to content words according to the two scoring features described below.</Paragraph>
<Paragraph position="7"> - Term frequency: the frequency of each content word in a context window. The indexing engine gives high scores to content words that frequently occur with answer candidates. For example, email receives a higher score than members in Figure 2.</Paragraph>
<Paragraph position="8"> - Distance: the distance between an answer candidate and a target content word. The indexing engine gives high scores to content words that are near answer candidates. For example, when Jinsup Yeom is an answer candidate in Figure 2, CEO obtains a higher score than service.</Paragraph>
<Paragraph position="10"> The indexing engine does not use high-level information such as definition characteristics (IS-A relations between words in a sentence) or grammatical roles, because it is difficult to extract such high-level information correctly from real-world documents. Most web documents are written in a free style with additional tags and include many images and tables, which makes it harder for the indexing engine to detect sentence boundaries and to extract topic words from sentences. Therefore, after weighing the cost of the additional analysis and indexing time, the indexing engine uses low-level information such as term frequencies and distances.</Paragraph>
<Paragraph position="11"> The indexing engine calculates local scores in two steps. It first calculates the distance weight between an answer candidate and a target content word, as shown in Equation 1.</Paragraph>
<Paragraph position="13"> In Equation 1, distw_{d,k}(a_i, w_j) is the distance weight of the content word w located at the jth position in the kth context window of a document d. dist(i, j) is the distance between the answer candidate a_i, located at the ith position, and the content word w_j, located at the jth position. c is a constant, and we set c to 1 in the experiments.</Paragraph>
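The body of Equation 1 is not preserved in this version of the paper; the sketch below assumes one plausible form, a weight that decays with the distance dist(i, j) and equals c at distance zero. Only dist(i, j) and the constant c (set to 1) are given in the text.

    def distance_weight(i, j, c=1.0):
        """Hypothetical distw_{d,k}(a_i, w_j): a weight that is largest when the
        content word w_j is adjacent to the answer candidate a_i and decays as
        dist(i, j) grows. The exact form of Equation 1 may differ."""
        dist = abs(i - j)         # dist(i, j): token distance within the context window
        return c / (dist + 1.0)   # assumed inverse-distance decay

    # Example: a word one token from the candidate gets 0.5,
    # a word five tokens away gets about 0.17.
    print(distance_weight(3, 4), distance_weight(3, 8))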
<Paragraph position="14"> The indexing engine then adds up the distance weights of content words with an identical lexical form in each context window, as shown in Equation 2. Equation 2 is a well-known dynamic-programming formulation: the more frequent a content word is, the higher the score it receives. In Equation 2, LS_{d,k,n}(a_i, w_{pos(n)}) is the local score of the nth content word w when n identical content words exist in the kth context window of a document d, and pos(n) is the position of the nth content word. After solving Equation 2 recursively, the indexing engine obtains a local score, LS_{d,k}(a_i, w), between the ith answer candidate and the content word w in the kth context window.</Paragraph>
<Paragraph position="15"> Figure 3 shows the calculation process of local scores. After calculating the local scores, the indexing engine saves them in the answer DB together with the position information of the relevant answer candidate.</Paragraph>
<Paragraph position="16"> [Figure 3: local scores calculated between Yahoo Korea and each occurrence of service located in the two adjacent sentences.]</Paragraph>
<Paragraph position="18"> The second pass is divided into three steps: construction of pseudo-documents, calculation of global scores, and summation of global and local scores. In the first step, the indexing engine constructs pseudo-documents. A pseudo-document is a virtual document consisting of the content words that occur with an answer candidate in some documents. The pseudo-document is named after the answer candidate. Figure 4 shows an example of the pseudo-documents.</Paragraph>
<Paragraph position="19"> In the next step, the indexing engine calculates global scores for each answer candidate, as shown in Equation 3. The global score means how strongly the answer candidate is associated with each term that occurs across several documents.</Paragraph>
<Paragraph position="21"> Equation 3 is similar to the well-known TF-IDF equation (Fox (1983)). However, it differs in its concept of a document: we assume that there is no difference between a pseudo-document and a real document. The TF component in Equation 3, 0.5 + 0.5 * (tf_w / Max_tf), is the normalized frequency of the content word w in the pseudo-document pseudo_d_a that is named after the answer candidate a. The IDF component, log(N/n) / log(N), is the normalized reciprocal frequency of the pseudo-documents that include the content word w.</Paragraph>
<Paragraph position="22"> The TF-IDF value, GS(pseudo_d_a, w), is the global score between the answer candidate a and the content word w. In detail, tf_w is the term frequency of the content word w in pseudo_d_a.</Paragraph>
<Paragraph position="23"> Max_tf is the maximum frequency among the content words in pseudo_d_a. n is the number of pseudo-documents that include the content word w. N is the total number of pseudo-documents. Figure 5 shows a calculation process of the global scores.</Paragraph>
<Paragraph position="25"> In the last step, the indexing engine adds up the global scores and the local scores, as shown in Equation 4.</Paragraph>
<Paragraph position="27"> In Equation 4, LS_{d,k}(a_i, w) is the local score between the answer candidate a_i and the content word w in the kth context window of the document d, and GS(pseudo_d_{a_i}, w) is the global score. a and b are weighting factors. After summing up the two scores, the indexing engine updates the answer DB with the scores.</Paragraph>
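Under the definitions above, the global score of Equation 3 can be sketched directly from its quoted TF and IDF components; the combination in Equation 4 is assumed here to be a simple weighted sum of the local and global scores with factors a and b, which matches the description but not necessarily the lost equation body.

    import math

    def global_score(tf_w, max_tf, n, N):
        """GS(pseudo_d_a, w): TF-IDF-style score over pseudo-documents.
        tf_w: frequency of w in the pseudo-document named after candidate a.
        max_tf: maximum content-word frequency in that pseudo-document.
        n: number of pseudo-documents containing w; N: total pseudo-documents."""
        tf = 0.5 + 0.5 * (tf_w / max_tf)       # TF component quoted from Equation 3
        idf = math.log(N / n) / math.log(N)    # IDF component quoted from Equation 3
        return tf * idf

    def combined_score(local, global_, a=0.5, b=0.5):
        """Assumed form of Equation 4: a weighted sum of LS_{d,k}(a_i, w) and
        GS(pseudo_d_{a_i}, w); the actual weights a and b are not given here."""
        return a * local + b * global_

    # Example: a content word occurring twice (maximum frequency 4) in a pseudo-document
    # and appearing in 10 of 1000 pseudo-documents.
    gs = global_score(tf_w=2, max_tf=4, n=10, N=1000)
    print(gs, combined_score(0.8, gs))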
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Lexico-syntactic query processing </SectionTitle>
<Paragraph position="0"> To identify users' asking points, the searching engine takes a user's query and converts it into a suitable form using the PLO dictionary. The PLO dictionary contains the semantic markers of words. Query words are converted into semantic markers before pattern matching. For example, the query "Who is the CEO of Yahoo Korea?" is translated into "%who auxiliary-verb %person preposition Yahoo Korea symbol". In this example, %person and %who are semantic markers. Content words that are not in the PLO dictionary keep their lexical forms. Functional words (e.g. auxiliary verbs, prepositions) are converted into POS tags.</Paragraph>
<Paragraph position="1"> After conversion, the searching engine matches the converted query against the predefined lexico-syntactic patterns and classifies the query into one of the 105 semantic categories. When two or more patterns match the query, the searching engine returns the first matched category. Table 2 shows some lexico-syntactic patterns. The sample query above matches the first pattern in Table 2.</Paragraph>
<Paragraph position="2"> Table 2: some lexico-syntactic patterns.
person:
(%person|@person) j? (sf)* $
(%person|@person) j? %ident j? (sf)* $
(%person|@person) j? (%about)? @req
(%person|@person) j? (%ident)? @req
(%person|@person) jp ef (sf)* $
%which (%person|@person)
tel_num:
(%tel_num|@tel_num) (%num)? j? (sf)*$
(%tel_num|@tel_num) (%num)? j? %what
(%tel_num|@tel_num) j? (%about)? @req
(%tel_num|@tel_num) j? (%what_num)
</Paragraph> </Section>
<Section position="3" start_page="0" end_page="21" type="sub_section"> <SectionTitle> 2.3 Answer scoring and ranking </SectionTitle>
<Paragraph position="0"> The searching engine calculates the similarities between the query and the answer candidates, and ranks the answer candidates according to these similarities. To compute the similarities, the searching engine uses the AND operation of the well-known p-Norm model (Salton, Fox and Wu (1983)), as shown in Equation 5.</Paragraph>
<Paragraph position="2"/>
<Paragraph position="4"> In Equation 5, A is an answer candidate, a_{t_i} is the ith term score in the context window of the answer candidate, q_i is the ith term score in the query, and p is the P-value of the p-Norm model.</Paragraph>
<Paragraph position="5"> MAYA spends relatively little time on the answer scoring and ranking phase because the indexing engine has already calculated the scores of the terms that affect the answer candidates.</Paragraph>
<Paragraph position="6"> In other words, the searching engine simply adds up the weights of co-occurring terms, as shown in Equation 5, and then ranks the answer candidates according to the similarities. The method for answer scoring is similar to the document scoring of traditional IR engines. However, MAYA differs in that it indexes, retrieves, and ranks answer candidates rather than documents.</Paragraph>
</Section> </Section>
</Paper>