<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1118"> <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 939-946, Vancouver, October 2005. ©2005 Association for Computational Linguistics. Integrating linguistic knowledge in passage retrieval for question answering</Title> <Section position="3" start_page="939" end_page="940" type="metho"> <SectionTitle> 2 Question answering with dependency relations </SectionTitle> <Paragraph position="0"> Our Dutch question answering system, Joost (Bouma et al., 2005), consists of two streams: a table look-up strategy using off-line information extraction and an on-line strategy using passage retrieval and on-the-fly answer extraction. In both strategies we use syntactic information produced by a wide-coverage dependency parser for Dutch, Alpino (Bouma et al., 2001). In the off-line strategy we use syntactic patterns to extract information from unrestricted text, which is stored in fact tables (Jijkoun et al., 2004). For the on-line strategy, we assume that there is a certain overlap between the syntactic relations in a question and those in the passages containing its answers. Furthermore, we also use strategies for reasoning over dependency rules to capture semantic relationships that are expressed by different syntactic patterns (Bouma et al., 2005).</Paragraph> <Paragraph position="1"> We focus on open-domain question answering using data from the CLEF competition on Dutch QA. We have parsed the entire corpus provided by CLEF, which contains about 4,000,000 sentences in about 190,000 documents. The dependency trees are stored in XML and are directly accessible from the QA system. Syntactic patterns for off-line information extraction are run on the entire corpus. For the on-line QA strategy we use traditional information retrieval to select relevant passages from the corpus, which are then processed by the answer extraction modules.</Paragraph> <Paragraph position="2"> This step is necessary to reduce the search space for the QA system and to make on-line QA feasible.</Paragraph> <Paragraph position="3"> As segmentation level we use the paragraphs marked in the corpus (about 1.1 million).</Paragraph> <Paragraph position="4"> Questions are parsed within the QA system using the same parser. Based on this analysis, the system determines the question type and, hence, the expected answer type. According to this type, we try to find the answer first in the fact database (if an appropriate table exists) and then, as a fallback, in the corpus using the on-line QA strategy.</Paragraph> <Section position="1" start_page="939" end_page="940" type="sub_section"> <SectionTitle> 2.1 Passage retrieval in Joost </SectionTitle> <Paragraph position="0"> Information retrieval is one of the bottlenecks in the on-line strategy of our QA system. The system relies on the passages retrieved by this component and fails if IR does not provide relevant documents. Traditional IR uses a bag-of-words approach in which plain-text keywords are matched against word vectors describing documents. The result is usually a ranked list of documents. Simple techniques such as stemming and stop word removal are used to improve the performance of such a system. This is also the baseline approach for passage retrieval in our QA system. The passage retrieval component in Joost includes an interface to seven off-the-shelf IR systems. One of the systems supported is Lucene from the Apache Jakarta project (Jakarta, 2004).
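As an illustration of what this bag-of-words baseline amounts to, the following is a minimal sketch (not the actual Joost/Lucene setup): paragraph-sized passages are ranked by TF-IDF overlap with the stemmed, stop-word-filtered query. The tiny stop-word list and the simple_stem heuristic are invented placeholders for the public Dutch analyzer mentioned below.

```python
import math
from collections import Counter, defaultdict

DUTCH_STOP_WORDS = {"de", "het", "een", "en", "van", "in", "op", "tegen", "werd", "na"}  # illustrative only

def simple_stem(token):
    # crude suffix stripping; a real analyzer would use a proper Dutch stemmer
    for suffix in ("en", "e", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def analyze(text):
    tokens = [t.strip(".,?!()\"'").lower() for t in text.split()]
    return [simple_stem(t) for t in tokens if t and t not in DUTCH_STOP_WORDS]

class BagOfWordsIndex:
    """Ranks paragraph passages by TF-IDF overlap with a keyword query."""

    def __init__(self, passages):
        self.passages = passages
        self.term_freqs = [Counter(analyze(p)) for p in passages]
        doc_freq = defaultdict(int)
        for tf in self.term_freqs:
            for term in tf:
                doc_freq[term] += 1
        n = len(passages)
        self.idf = {t: math.log(n / (1 + df)) + 1.0 for t, df in doc_freq.items()}

    def search(self, query, k=5):
        q_terms = analyze(query)
        scored = []
        for passage, tf in zip(self.passages, self.term_freqs):
            score = sum(tf[t] * self.idf.get(t, 0.0) for t in q_terms)
            if score > 0:
                scored.append((score, passage))
        return sorted(scored, key=lambda x: -x[0])[:k]

index = BagOfWordsIndex([
    "Het embargo tegen Irak werd ingesteld na de inval in Koeweit in 1990.",
    "De Verenigde Naties vergaderden in New York.",
])
print(index.search("Wanneer stelde de Verenigde Naties een embargo in tegen Irak?"))
```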
Lucene is a widely-used open-source Java library with several extensions and useful features. This was the IR engine of our choice in the experiments described here. For the baseline we use standard settings and a public Dutch text analyzer for stemming and stop word removal. The goal is now to extend this baseline by incorporating linguistic information produced by the syntactic analyzer. Figure 1 shows the dependency tree produced for one of the sentences in the CLEF corpus: 'Het embargo tegen Irak werd ingesteld na de inval in Koeweit in 1990.' (The embargo against Iraq was imposed after the invasion of Kuwait in 1990.) We would like to include as much information from the parsed data as possible in order to find better matches between an analyzed question and the passages that contain its answers. From the parse trees, we extract various kinds of linguistic features and syntactic units to be stored in the index. Besides the dependency relations, the parser also produces part-of-speech (POS) tags, named entity labels and linguistic root forms. It also recognizes compositional compounds and particle verbs. All this information might be useful for our passage retrieval component.</Paragraph> <Paragraph position="1"> Lucene supports multiple index fields that can be filled with different types of data. This is a useful feature since it allows one to store various kinds of information in different fields of the index. Henceforth, we will call these data fields index layers and, thus, the index will be called a multi-layer index. We distinguish between token layers, type layers and annotation layers. Token layers include one item per token in the corpus. Table 1 lists the token layers defined in our index.</Paragraph> <Paragraph position="2"> We included various combinations of features derived from the dependency trees to make it possible to test their impact on IR. Features are simply concatenated (using special delimiting symbols between the various parts) to create the individual items in a layer. For example, the RootHead layer contains concatenated dependent-head bigrams taken from the dependency relations in the tree. Tokens in the text layer and in the root layer have been split at hyphens and underscores in order to split compositional compounds and particle verbs (Alpino adds underscores between the compositional parts). Type layers include only specific types of tokens in the corpus, e.g. named entities or compounds (see Table 2).</Paragraph> <Paragraph position="3"> Annotation layers include only the labels of (certain) token types. So far, we defined only one annotation layer, for named entity labels. This layer may contain the items 'ORG', 'PER' or 'LOC' if such a named entity occurs in the text passage.</Paragraph> </Section> </Section> <Section position="4" start_page="940" end_page="942" type="metho"> <SectionTitle> 3 Query formulation </SectionTitle> <Paragraph position="0"> Questions are analyzed in the same way as the sentences in documents. Hence, we can extract appropriate units from analyzed questions to be matched with the various layers in the index (illustrated in the sketch below). For example, we can extract root-head word pairs to be matched with the RootHead layer. In this way, each layer can be queried using keywords of the same type. Furthermore, we can also use linguistic labels to restrict our query terms in several ways.
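To make the layer idea concrete, the following sketch derives items for a handful of layers from one analyzed sentence or question. The Token record, the relation labels, the 'neLabel' layer name and the '/' delimiter are illustrative assumptions, not the paper's actual data format; the real layers are those listed in Tables 1 and 2 and are extracted from Alpino's dependency trees.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    """Hypothetical token record standing in for a node of an Alpino dependency tree."""
    text: str        # surface form
    root: str        # linguistic root form
    pos: str         # part-of-speech label
    rel: str         # dependency relation to the head
    head_root: str   # root form of the head word
    ne: Optional[str] = None  # named entity label ('PER', 'ORG', 'LOC'), if any

def layer_items(tokens):
    """Derive items for a few index layers from one analyzed sentence or question."""
    layers = {"text": [], "root": [], "RootHead": [], "RootRel": [], "ne": [], "neLabel": []}
    for t in tokens:
        # compounds and particle verbs are split at underscores (Alpino marks the parts that way)
        layers["text"].extend(t.text.lower().split("_"))
        layers["root"].extend(t.root.split("_"))
        # token layers with concatenated features; '/' is an arbitrary delimiter for illustration
        layers["RootHead"].append(f"{t.root}/{t.head_root}")
        layers["RootRel"].append(f"{t.root}/{t.rel}")
        if t.ne:
            layers["ne"].append(t.root)      # type layer: the named-entity tokens themselves
            layers["neLabel"].append(t.ne)   # annotation layer: only the labels
    return layers

# a fragment of the question in Figure 2, with invented analysis values
question = [
    Token("Verenigde_Naties", "Verenigde_Naties", "name", "su", "stel", ne="ORG"),
    Token("embargo", "embargo", "noun", "obj1", "stel"),
    Token("Irak", "Irak", "name", "obj1", "tegen", ne="LOC"),
]
print(layer_items(question))
```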
For example, we can use part-of-speech labels to exclude keywords of a certain word class. We can also use the syntactic relation name to define query constraints. Each token layer can be restricted in this way (even if the feature used for the restriction is not part of the layer). For example, we can limit our set of root keywords to nouns even though part-of-speech labels are not part of the root layer. We can also combine constraints; for example, RootPOS keywords can be restricted to nouns that are in an object relation within the question. Another feature of Lucene is its support for keyword weights. Keywords can be "boosted" using so-called "boost factors". Furthermore, keywords can also be marked as "required". These two features can be applied to all kinds of keywords (token layer, type layer and annotation layer keywords, as well as restricted keywords).</Paragraph> <Paragraph position="1"> The following list summarizes the possible keyword types in our passage retrieval component:
* basic: a keyword in one of the index layers
* restricted: token-layer keywords can be restricted to a certain word class and/or a certain relation type. We use only the following word class restrictions: noun, name, adjective, verb; and the following relation type restrictions: direct object, modifier, apposition and subject
* weighted: keywords can be weighted using a boost factor
* required: keywords can be marked as required
Query keywords of all types can be combined into a single query. We connect them in a disjunctive way, which is the default operation in Lucene. The query engine provides ranked query results and, therefore, each disjunct may contribute to the ranking of the retrieved documents but does not harm the query if it does not produce any matching results. We may, for example, form a query with the following elements: (1) all plain text tokens; (2) named entities (ne) boosted with factor 2; (3) RootHead bigrams where the root is in an object relation; (4) RootRel keywords for all nouns. Applying these parameters to the question in Figure 2, 'Wanneer stelde de Verenigde Naties een embargo in tegen Irak?' (When did the United Nations declare the embargo against Iraq?), we obtain the corresponding query (note that stop words have been removed).</Paragraph> <Paragraph position="3"> Now, query terms from various keyword types may refer to the same index layer. For example, we may use weighted plain text keywords restricted to nouns together with unrestricted plain text keywords. To combine them we use a preference mechanism that keeps queries simple and avoids disjunctions with conflicting keyword parameters: (a) restricted keyword types are more specific than basic keywords; (b) keywords restricted in both relation type and POS are more specific than keywords with only one restriction; (c) relation type restrictions are more specific than POS restrictions. Using these rules, we define that the weights of more specific keywords overwrite the weights of less specific ones. Furthermore, we define that the "required" marker ('+') overwrites keyword weights. Using these definitions, we can, for example, extend the query from above with two further elements: (5) plain text keywords in an object relation with boost factor 3, and (6) plain text keywords labeled as names, marked as required (see the sketch below for a schematic rendering).</Paragraph> <Paragraph position="6"> Finally, we can also use the question type determined by question analysis in the retrieval component. The question type corresponds to the expected answer type, i.e. we expect an entity of that type in the relevant text passages.
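The sketch below shows how query elements like (1)-(4) and (6) above might be rendered in Lucene's query-parser syntax (field:term^boost, with '+' marking required terms), together with a much simplified stand-in for the preference mechanism. The layer names, the analyzed keywords and the tuple-based representation are illustrative assumptions, not the actual Joost implementation.

```python
def render_keyword(layer, item, boost=1.0, required=False):
    """One keyword in Lucene query-parser syntax: layer:item^boost, '+' marks required terms."""
    term = f"{layer}:{item}"
    if required:
        return f"+{term}"           # the 'required' marker overrides any weight
    if boost != 1.0:
        return f"{term}^{boost:g}"  # boost factor
    return term

def build_query(elements):
    """elements: (layer, item, boost, required) tuples, ordered from least to most specific.
    Later entries overwrite the boost for the same (layer, item) and 'required' sticks once set,
    which crudely mimics the preference rules described above."""
    chosen = {}
    for layer, item, boost, required in elements:
        _, was_required = chosen.get((layer, item), (1.0, False))
        chosen[(layer, item)] = (boost, required or was_required)
    # whitespace-separated terms form a disjunction, Lucene's default operator
    return " ".join(render_keyword(l, i, b, r) for (l, i), (b, r) in chosen.items())

query = build_query([
    ("text", "stel", 1.0, False),              # (1) plain text tokens
    ("text", "embargo", 1.0, False),
    ("text", "Irak", 1.0, False),
    ("ne", "Verenigde_Naties", 2.0, False),    # (2) named entities, boost factor 2
    ("ne", "Irak", 2.0, False),
    ("RootHead", "embargo/stel", 1.0, False),  # (3) RootHead bigram with the root in object relation
    ("RootRel", "embargo/obj1", 1.0, False),   # (4) RootRel keyword for a noun
    ("text", "Irak", 1.0, True),               # (6) plain text keyword labeled as a name, required
])
print(query)
# text:stel text:embargo +text:Irak ne:Verenigde_Naties^2 ne:Irak^2 RootHead:embargo/stel RootRel:embargo/obj1
```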
In some cases, the question type can be mapped to one of the named entity labels assigned by the parser, e.g. a name question is looking for the names of persons (ne = PER), a question for a capital is looking for a location (ne = LOC) and a question for organizations is looking for the name of an organization (ne = ORG). Hence, we can add another keyword type, the expected answer type, to be matched with the named entity labels in the ne layer, cf. (Prager et al., 2000).</Paragraph> <Paragraph position="7"> There are many possible combinations of restrictions even with the small set of POS labels and relation types listed above. However, many of them are useless because they cannot be instantiated. For example, an adjective cannot appear in a subject relation to its head. For simplicity we limit ourselves to the following eight combined restrictions (POS + relation type): names + {direct object, modifier, apposition, subject} and nouns + {direct object, modifier, apposition, subject}. These can be applied to all token layers in the same way as the single-constraint restrictions.</Paragraph> <Paragraph position="8"> Altogether we have 109 different keyword types using the layers and the restrictions defined above.</Paragraph> <Paragraph position="9"> Now the question is how to select appropriate keyword types among them, with optimal parameters (weights), so as to maximize retrieval performance. The following section describes the optimization procedure used to adjust the query parameters.</Paragraph> </Section> <Section position="5" start_page="942" end_page="944" type="metho"> <SectionTitle> 4 Optimization of query parameters </SectionTitle> <Paragraph position="0"> In the previous sections we have seen the internal structure of the multi-layer index and the queries we use in our passage retrieval component. Now we have to address the question of how to select layers and restrict keywords so as to optimize the performance of the system on the QA task. For this we employ an automatic optimization procedure that learns appropriate parameter settings from example data. We use annotated training material, which is described in the next section. Thereafter, the optimization procedure is introduced.</Paragraph> <Section position="1" start_page="942" end_page="942" type="sub_section"> <SectionTitle> 4.1 CLEF questions and evaluation </SectionTitle> <Paragraph position="0"> We used results from the CLEF competition on Dutch QA from the years 2003 and 2004 for training and evaluation. They contain natural language questions annotated with their answers found in the CLEF corpus (answer strings and the IDs of the documents in which the answers were found). Most of the questions are factoid questions such as 'Hoeveel inwoners heeft Zweden?' (How many inhabitants does Sweden have?). Altogether there are 631 questions with 851 answers (each question may have multiple possible answers; we also added some obvious answers that were not in the original test set when we encountered them in the corpus, since names and numbers can be spelled differently, e.g. Kim Jong Il vs. Kim Jong-Il, Saoedi-Arabië vs. Saudi-Arabië, bijna vijftig jaar vs. bijna 50 jaar). Standard measures for evaluating information retrieval results are precision and recall. However, for QA several other, more specialized measures have been proposed, e.g. mean reciprocal rank (MRR) (Voorhees, 1999) and coverage and redundancy (Roberts and Gaizauskas, 2004). MRR accounts only for the first retrieved passage that contains an answer and disregards all following passages. Coverage and redundancy, on the other hand, disregard the ranking completely and focus on the sets of passages retrieved.
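For reference, these three measures are commonly computed along the lines of the following sketch (a simplified reading of the cited definitions, not code from the paper; the input format, one list of answer-bearing ranks per question, is an assumption):

```python
def mrr(ranks_per_question):
    """Mean reciprocal rank: 1/rank of the first answer-bearing passage, 0 if none was retrieved."""
    return sum(1.0 / min(r) if r else 0.0 for r in ranks_per_question) / len(ranks_per_question)

def coverage(ranks_per_question):
    """Fraction of questions with at least one answer-bearing passage among those retrieved."""
    return sum(1 for r in ranks_per_question if r) / len(ranks_per_question)

def redundancy(ranks_per_question):
    """Average number of answer-bearing passages retrieved per question."""
    return sum(len(r) for r in ranks_per_question) / len(ranks_per_question)

# ranks (1-based) of answer-bearing passages in the retrieved lists, one entry per question
ranks = [[1, 4], [], [3]]
print(mrr(ranks), coverage(ranks), redundancy(ranks))  # 0.444..., 0.666..., 1.0
```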
However, in our QA system, the IR score (on which the retrieval ranking is based) is one of the clues used by the answer identification modules.</Paragraph> <Paragraph position="1"> Therefore, we use the mean of the total reciprocal ranks (MTRR), cf. (Radev et al., 2002), to combine features of all three measures:</Paragraph> <Paragraph position="2"> MTRR = 1/|Q| * sum_{i=1..|Q|} sum_{d in A_i} 1/rank_{R_i}(d)</Paragraph> <Paragraph position="3"> Here, A_i is the set of retrieved passages containing an answer to question number i (a subset of the retrieved list R_i), rank_{R_i}(d) is the rank of document d in R_i, and |Q| is the number of questions.</Paragraph> <Paragraph position="4"> In our experiments we used the provided answer string rather than the document ID to judge whether a retrieved passage was relevant. In this way, the IR engine may provide passages with correct answers from documents other than the ones marked in the test set. We do simple string matching between the answer strings and the words in the retrieved passages.</Paragraph> <Paragraph position="5"> Obviously, this introduces errors where the matching string does not correspond to a valid answer in its context. However, we believe that this does not influence the global evaluation figures significantly, and we therefore use this approach as a reasonable compromise for automatic evaluation.</Paragraph> </Section> <Section position="2" start_page="942" end_page="944" type="sub_section"> <SectionTitle> 4.2 Learning query parameters </SectionTitle> <Paragraph position="0"> As discussed earlier, there is a large variety of possible keyword types that can be combined to query the multi-layer index. Furthermore, we have a number of parameters to set when formulating a query, e.g. the keyword weights. Selecting the appropriate keywords and parameters is not straightforward. We would rather carry out a systematic search for optimal parameters than rely on our intuition. Here, we use the information retrieval engine as a black box with certain input parameters. We do not know how the ranking is done internally or how the output is influenced by parameter changes. However, we can inspect and evaluate the output of the system.</Paragraph> <Paragraph position="1"> Hence, we need an iterative approach that tests many settings in order to optimize the query parameters. The output for each setting has to be evaluated according to a certain objective function. For this, we need an automatic procedure because we want to check many different settings in a batch run. The performance of the system can be measured in several ways, e.g. using the MTRR scores described in the previous section. We have chosen to use this measure and the annotated CLEF questions to evaluate the retrieval performance automatically.</Paragraph> <Paragraph position="2"> We decided to use a simplified genetic algorithm to optimize the query parameters. The algorithm is implemented as an iterative "trial-and-error beam search" through the space of possible parameter settings. The optimization loop works as follows (using a subset of the CLEF questions): 1. Run initial queries (one keyword type per IR run) with default weights.</Paragraph> <Paragraph position="3"> 2. Produce a number of new settings by combining two previous ones (= crossover). For this, select two settings from an N-best list of the previous IR runs.
Apply mutation operations (see the next step) until the new settings are unique (among all settings tried so far).</Paragraph> <Paragraph position="4"> 3. Change some of the new settings at random (= mutation) using pre-defined mutation operations.</Paragraph> <Paragraph position="5"> 4. Run the queries using the new settings and evaluate the retrieval output (determine fitness).</Paragraph> <Paragraph position="6"> 5. Continue with step 2 until some stop condition is satisfied. This optimization algorithm is very simple but requires some additional parameters. First of all, we have to set the size of the population, i.e. the number of IR runs (individuals) to be kept for the next iteration. We decided to keep the population small, with only 25 individuals. Then we have to decide how to evaluate fitness in order to rank the retrieval results. This is done using the MTRR measure. Natural selection using these rankings is simplified to a top-N search, without giving individuals with lower fitness values a chance to survive. This also means that we can update the population directly whenever a new IR run is finished. We also have to set a maximum number of new settings to be created. In our experiments we limit the process to a maximum of 50 settings that may be tried simultaneously. A new setting is created as soon as there is a spot available.</Paragraph> <Paragraph position="7"> An important part of the algorithm is the combination of parameters. We simply merge the settings of two previous runs (the parents) to produce a new setting (the child). This means that all keyword types (with their restrictions) from both parents are included in the child's setting. Parents are selected at random without any preference mechanism. We also use a very simple strategy in cases where both parents contain the same keyword type: we compute the arithmetic mean of the weights assigned to this type in the parents' settings (the default weight is one). If the keyword type is marked as required in one of the parents, it will also be marked as required in the child's setting (which overwrites the keyword weight if one is set in the other parent).</Paragraph> <Paragraph position="8"> Another important principle in genetic optimization is mutation. It refers to a randomized modification of settings when new individuals are created. First, we apply mutation operations when new settings are not unique.3 Second, mutation operations are applied with fixed probabilities to new settings.</Paragraph> <Paragraph position="9"> In most genetic algorithms, settings are converted to genes consisting of bit strings. A mutation operation is then defined as flipping the value of one randomly chosen bit. In our approach, we do not use bit strings but define several mutation operations that modify the parameters directly. The following operations have been defined:
* a new keyword type is added to the settings with a chance of 0.2
* a keyword type is removed from the settings with a chance of 0.1
* a keyword weight (boost factor) is modified by a random value between -5 and 5 with a chance of 0.2 (but only if the weight remains a positive value)
* a keyword type is marked as required with a chance of 0.01
All these parameters are chosen intuitively. We assigned rather high probabilities to the mutation operations to reduce the risk of getting trapped in local maxima.</Paragraph>
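The loop described above can be summarized in the following sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: the settings representation, the helper names and the toy fitness function are invented, and in the real system fitness is the MTRR score obtained by actually running the retrieval engine with a given setting on the training questions. The mutation probabilities are the ones listed above.

```python
import random

# A setting maps a keyword type to a (weight, required) pair; 109 keyword types exist in the paper.
ALL_KEYWORD_TYPES = [f"type{i}" for i in range(109)]  # stand-in names for those keyword types

def crossover(a, b):
    """Merge two parent settings: union of their keyword types; shared types get the mean weight,
    and 'required' in either parent carries over (overriding the weight, as described above)."""
    child = {}
    for kw in set(a) | set(b):
        if kw in a and kw in b:
            child[kw] = ((a[kw][0] + b[kw][0]) / 2.0, a[kw][1] or b[kw][1])
        else:
            child[kw] = a.get(kw, b.get(kw))
    return child

def mutate(setting):
    """Randomized modifications with the probabilities given in the bullet list above."""
    s = dict(setting)
    if random.random() < 0.2:                      # add a new keyword type
        s[random.choice(ALL_KEYWORD_TYPES)] = (1.0, False)
    if s and random.random() < 0.1:                # remove a keyword type
        del s[random.choice(list(s))]
    if s and random.random() < 0.2:                # perturb one weight, keeping it positive
        kw = random.choice(list(s))
        weight, required = s[kw]
        s[kw] = (max(0.1, weight + random.uniform(-5, 5)), required)
    if s and random.random() < 0.01:               # mark one keyword type as required
        kw = random.choice(list(s))
        s[kw] = (s[kw][0], True)
    return s

def optimize(fitness, population_size=25, iterations=200):
    # step 1: initial settings, one keyword type per run with default weight
    population = [({kw: (1.0, False)}, fitness({kw: (1.0, False)}))
                  for kw in ALL_KEYWORD_TYPES[:population_size]]
    seen = {frozenset(s) for s, _ in population}
    for _ in range(iterations):
        a, b = random.sample([s for s, _ in population], 2)   # step 2: pick two parents at random
        child = mutate(crossover(a, b))                       # steps 2-3: crossover, then mutation
        while frozenset(child) in seen:                       # keep settings unique
            child = mutate(child)
        seen.add(frozenset(child))
        population.append((child, fitness(child)))            # step 4: evaluate fitness (MTRR in the paper)
        population = sorted(population, key=lambda x: -x[1])[:population_size]  # top-N selection
    return population[0]

# toy fitness for demonstration only: favors settings with many keyword types
best_setting, best_score = optimize(lambda s: len(s) + random.random())
print(best_score, len(best_setting))
```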
<Paragraph position="10"> Note that there is no obvious condition for termination: in randomized approaches like this one, the development of the fitness score is most likely not monotonic, and it is therefore hard to predict when we should stop the optimization process. However, we expect the scores to converge at some point, and we may stop if a certain number of consecutive new settings fails to improve the scores.</Paragraph> <Paragraph position="11"> 3We require unique settings in our implementation because we want to avoid re-computing fitness values for settings that have already been tried. "Good" settings survive anyway with our top-N selection approach.</Paragraph> </Section> </Section> </Paper>