<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2506">
  <Title>A Novel Approach to Focus Identification in Question/Answering Systems</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Textual Question Answering
</SectionTitle>
    <Paragraph position="0"> First the target question is processed to derive (a) the semantic class of the expected answer and (b) what key-words constitute the queries used to retrieve relevant paragraphs. Question processing relies on external resources to identify the class of the expected answer, typically in the form of semantic ontologies (Answer Type Ontology).</Paragraph>
    <Paragraph position="1"> Second, the semantic class of the expected answer is later used to (1) filter out paragraphs that do not contain any word that can be cast in the same class as the expected answer, and (2) locate and extract the answers from the paragraphs. Finally, the answers are extracted and ranked based on their unification with the question.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Question Processing
</SectionTitle>
      <Paragraph position="0"> To determine what a question asks about, several forms of information can be used. Since questions are expressed in natural language, sometimes their stems, e.g., who, what or where indicate the semantic class of the expected answer, i.e. PERSON, ORGANIZATION or LO-CATION, respectively. To identify words that belong to such semantic classes, Name Entity Recognizers are used, since most of these words represent names. Name Entity (NE) recognition is a natural language technology that identifies names of people, organizations, locations and dates or monetary values.</Paragraph>
      <Paragraph position="1"> However, most of the time the question stems are either ambiguous or they simply do not exist. For example, questions having what as their stem may ask about anything. In this case another word from the question needs to be used to determine the semantic class of the expected answer. In particular, the additional word is semantically classified against an ontology of semantic classes. To determine which word indicates the semantic class of the expected answer, the syntactic dependencies1 between the question words may be employed (Harabagiu 1Syntactic parsers publicly available, e.g., (Charniak, 2000; et al., 2000; Pasca and Harabagiu, 2001; Harabagiu et al., 2001).</Paragraph>
      <Paragraph position="2"> Sometimes the semantic class of the expected answers cannot be identified or is erroneously identified causing the selection of erroneous answers. The use of text classification aims to filter out the incorrect set of answers that Q/A systems provide.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Paragraph Retrieval
</SectionTitle>
      <Paragraph position="0"> Once the question processing has chosen the relevant keywords of questions, some term expansion techniques are applied: all nouns and adjectives as well as morphological variations of nouns are inserted in a list. To find the morphological variations of the nouns, we used the CELEX (Baayen et al., 1995) database. The list of expanded keywords is then used in the boolean version of the SMART system to retrieve paragraphs relevant to the target question. Paragraph retrieval is preferred over full document retrieval because (a) it is assumed that the answer is more likely to be found in a small text containing the question keywords and at least one other word that may be the exact answer; and (b) it is easier to process syntactically and semantically a small text window for unification with the question than processing a full document. null</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Answer Extraction
</SectionTitle>
      <Paragraph position="0"> The procedure for answer extraction that we used is reported in (Pasca and Harabagiu, 2001), it has 3 steps: Step 1) Identification of Relevant Sentences: The Knowledge about the semantic class of the expected answer generates two cases: (a) When the semantic class of the expected answers is known, all sentences from each paragraph, that contain a word identified by the Named Entity recognizer as having the same semantic classes as the expected answers, are extracted. (b) The semantic class of the expected answer is not known, all sentences, that contain at least one of the keywords used for paragraph retrieval, are selected.</Paragraph>
      <Paragraph position="1"> Step 2) Sentence Ranking: We compute the sentence ranks as a by product of sorting the selected sentences. To sort the sentences, we may use any sorting algorithm, e.g., the quicksort, given that we provide a comparison function between each pair of sentences. To learn the comparison function we use a simple neural network, namely, the perceptron, to compute a relative comparison between any two sentences. This score is computed by considering four different features for each sentence as explained in (Pasca and Harabagiu, 2001).</Paragraph>
      <Paragraph position="2"> Step 3) Answer Extraction: We select the top 5 ranked sentences and return them as Collins, 1997), can be used to capture the binary dependencies between the head of each phrase.</Paragraph>
      <Paragraph position="3"> answers. If we lead fewer than 5 sentences to select from, we return all of them.</Paragraph>
      <Paragraph position="4"> Once the answers are extracted we can apply an additional filter based on text categories. The idea is to match the categories of the answers against those of the questions. Next section addresses the problem of question and answer categorization.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="8" type="metho">
    <SectionTitle>
3 Text and Question Categorization
</SectionTitle>
    <Paragraph position="0"> To exploit category information for Q/A we categorize both answers and questions. For the former, we define as categories of an answer a the categories of the document that contain a. For the latter, the problem is more critical as it is not clear what can be considered as categories of a question.</Paragraph>
    <Paragraph position="1"> To define question categories we assume that users have a specific domain in mind when they formulate their requests. Although, this can be considered a strong assumption, it is verified in practical cases. In fact, to formulate a sound question about a topic, the questioner needs to know some basic concepts about that topic. As an example consider a random question from TREC-92: &amp;quot;How much folic acid should an expectant mother get daily?&amp;quot; The folic acid and get daily concepts are related to the expectant mother concept since medical experts prescribe such substance to pregnant woman with a certain frequency. The hypothesis that the question was generated without knowing the relations among the above concepts is unlikely. Additionally, such specific relations are frequent and often they characterize domains. Thus, the user, by referring to some relations, automatically determines specific domains or categories. In summary, the idea of question categorization is: (a) users cannot formulate a consistent question on a domain that do not know, and (b) specific questions that express relation among concepts automatically define domains.</Paragraph>
    <Paragraph position="2"> It is worth noting that the specificity of the questions depends on the categorization schemes which documents are divided in. For example the following TREC question: null &amp;quot;What was the name of the first Russian astronaut to do a spacewalk?&amp;quot; may be considered generic, but if a categorization scheme includes categories like Space Conquest History or Astronaut and Spaceship the above question is clearly specific on the above categories.</Paragraph>
    <Paragraph position="3"> The same rationale cannot be applied to very short questions like: Where is Belize located?, Who 2TREC-9 questions are available at http:// trec.nist.gov/qa questions 201-893.</Paragraph>
    <Paragraph position="4"> invented the paper clip? or How far away is the moon? In these cases we cannot assume that a question category exists. However, our aim is to provide an additional answer filtering mechanism for stand-alone Q/A systems. This means that when question categorization is not applicable, we can deactivate such a mechanism. null The automatic models that we have study to classify questions and answers are: Rocchio (Ittner et al., 1995) and SVM (Vapnik, 1995) classifiers. The former is a very efficient TC that can be used for real scenario applications. This is a very appealing property considering that Q/A systems are designed to operate on the web. The second is one of the best figure TC that provides good accuracy with a few training data.</Paragraph>
    <Section position="1" start_page="0" end_page="8" type="sub_section">
      <SectionTitle>
3.1 Rocchio and SVM Text Classifiers
</SectionTitle>
      <Paragraph position="0"> Rocchio and Support Vector Machines are both based on the Vector Space Model. In this approach, the document d is described as a vector ~d =&lt;wdf1;::;wdfjFj&gt; in a jFjdimensional vector space, where F is the adopted set of features. The axes of the space, f1;::;fjFj 2 F, are the features extracted from the training documents and the vector components wdfj 2&lt; are weights that can be evaluated as described in (Salton, 1989).</Paragraph>
      <Paragraph position="1"> The weighing methods that we adopted are based on the following quantities: M, the number of documents in the training-set, Mf, the number of documents in which the features f appears and ldf, the logarithm of the term frequency defined as:</Paragraph>
      <Paragraph position="3"> where, odf are the occurrences of the features f in the document d (TF of features f in document d).</Paragraph>
      <Paragraph position="4"> Accordingly, the document weights is:</Paragraph>
      <Paragraph position="6"> where the IDF(f) (the Inverse Document Frequency) is defined as log( MM f ).</Paragraph>
      <Paragraph position="7"> Given a category C and a set of positive and negative examples, P and -P, Rocchio and SVM learning algorithms use the document vector representations to derive a hyperplane3, ~a PS ~d + b = 0. This latter separates the documents that belong to C from those that do not belong to C in the training-set. More precisely, 8~d positive examples (~d 2 P), ~a PS ~d + b , 0, otherwise (~d 2 -P) ~aPS ~d + b &lt; 0. ~d is the equation variable, while the gradient ~a and the constant b are determined by the target learning algorithm. Once the above parameters are available, it is possible to define the associated classification 3The product between vectors is the usual scalar product. function, `c : D !fC;;g, from the set of documents D to the binary decision (i.e., belonging or not to C). Such decision function is described by the following equation:</Paragraph>
      <Paragraph position="9"> Eq. 2 shows that a category is accepted only if the product ~a PS ~d overcomes the threshold !b. Rocchio and SVM are characterized by the same decision function4. Their difference is the learning algorithm to evaluate the b and the~a parameters: the former uses a simple heuristic while the second solves an optimization problem.</Paragraph>
      <Paragraph position="10">  The learning algorithm of the Rocchio text classifier is the simple application of the Rocchio's formula (Eq. 3) (Rocchio, 1971). The parameters ~a is evaluated by the equation:</Paragraph>
      <Paragraph position="12"> where P is the set of training documents that belongs to C and %0 is a parameter that emphasizes the negative information. This latter can be estimated by picking-up the value that maximizes the classifier accuracy on a training subset called evaluation-set. A method, named the Parameterized Rocchio Classifier, to estimate good parameters has been given in (Moschitti, 2003b).</Paragraph>
      <Paragraph position="13"> The above learning algorithm is based on a simple and efficient heuristic but it does not ensure the best separation of the training documents. Consequently, the accuracy is lower than other TC algorithms.</Paragraph>
      <Paragraph position="14">  The major advantage of SVM model is that the parameters ~a and b are evaluated applying the Structural Risk Minimization principle (Vapnik, 1995), stated in the statistical learning theory. This principle provides a bound for the error on the test-set. Such bound is minimized if the SVMs are chosen in a way that j~aj is minimal. More precisely the parameters ~a and b are a solution of the following optimization problem:  It can be proven that the minimum j~aj leads to a maximal margin5 (i.e. distance) between negative and positive examples.</Paragraph>
      <Paragraph position="15">  sification algorithm for SVM is described in (Joachims, 1999) and it can be downloaded from the web site http://svmlight.joachims.org/.</Paragraph>
      <Paragraph position="16"> In summary, SVM provides a better accuracy than Rocchio but this latter is better suited for real applications. null</Paragraph>
    </Section>
    <Section position="2" start_page="8" end_page="8" type="sub_section">
      <SectionTitle>
3.2 Question Categorization
</SectionTitle>
      <Paragraph position="0"> In (Moschitti, 2003b; Joachims, 1999), Rocchio and SVM text classifiers have reported to generate good accuracy. Therefore, we use the same models to classify questions. These questions can be considered as a particular case of documents, in which the number of words is small. Due to the small number of words, a large collection of questions needs to be used for training the classifiers when reaching a reliable statistical word distribution. Practically, large number of training questions is not available. Consequently, we approximate question word statistics using document statistics and we learn question categorization functions on category documents.</Paragraph>
      <Paragraph position="1"> We define for each question q a vector ~q = &lt;wq1;::;wqjFqj&gt;, where wqi 2 &lt; are the weights associated to the question features in the feature set Fq, e.g. the set of question words. Then, we evaluate four different methods computing the weights of question features, which in turn determine five models of question categorization: null Method 1: We use lqf, the logarithm (evaluated similarly to Eq. 1) of the word frequency f in the question q, together with the IDF derived from training documents as follows:</Paragraph>
      <Paragraph position="3"> This weighting mechanism uses the Inverse Document Frequency (IDF) of features instead of computing the Inverse Question Frequency. The rationale is that question word statistics can be estimated from the word document distributions. When this method is applied to the Rocchio-based Text Categorization model, by substituting wdf with wqf we obtain a model call the RTC0 model. When it is applied to the SVM model, by substituting wdf with wqf, we call it SVM0.</Paragraph>
      <Paragraph position="4"> Method 2: The weights of the question features are computed by the formula 5 employed in Method 1, but they are used in the Parameterized Rocchio Model (Moschitti, 2003b). This entails that %0 from formula 3 as well as the threshold b are chosen to maximize the categorization accuracy of the training questions. We call this model of categorization PRTC.</Paragraph>
      <Paragraph position="5"> Method 3: The weights of the question features are computed by formula 5 employed in Method 1, but they are used in an extended SVM model, in which two additional conditions enhance the optimization problem expressed by Eq. 4. The two new conditions are:  where Pq and -Pq are the set of positive and negative examples of training questions for the target category C. We call this question categorization model QSVM.</Paragraph>
      <Paragraph position="6"> Method 4: We use the output of the basic Q/A system to assign a category to questions. Each question has associated up to five answer sentences. In turn, each of the answers is extracted from a document, which is categorized. The category of the question is chosen as the most frequent category of the answers. In case that more than one category has the maximal frequency, the set of categories with maximal frequency is returned. We named this ad-hoc question categorization method QATC (Q/A and TC based model).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="8" end_page="8" type="metho">
    <SectionTitle>
4 Answer Filtering and Re-Ranking Based
on Text Categorization
</SectionTitle>
    <Paragraph position="0"> Many Q/A systems extract and rank answers successfully, without employing any TC information. For such systems, it is interesting to evaluate if TC information improves the ranking of the answers they generate. The question category can be used in two ways: (1) to re-rank the answers by pushing down in the list any answer that is labeled with a different category than the question; or (2) to simply eliminate answers labeled with categories different than the question category.</Paragraph>
    <Paragraph position="1"> First, a basic Q/A system has to be trained on documents that are categorized (automatically or manually) in a predefined categorization scheme. Then, the target questions as well as the answers provided by the basic Q/A system are categorized. The answers receive the categorization directly from the categorization scheme, as they are extracted from categorized documents. The questions are categorized using one of the models described in the previous section. Two different impacts of question categorization on Q/A are possible: + Answers that do not match at least one of the categories of the target questions are eliminated. In this case the precision of the system should increase if the question categorization models are enough accurate. The drawback is that some important answers could be lost because of categorization errors.</Paragraph>
    <Paragraph position="2"> + Answers that do not match the target questions (as before) get lowered ranks. For example, if the first answer has categories different from the target question, it could shift to the last position in case of all other answers have (at least) one category in common with the question. In any case, all questions will be shown to the final users, preventing the lost of relevant answers.</Paragraph>
    <Paragraph position="3"> An example of the answer elimination and answer re-ranking is given in the following. As basic Q/A system we adopted the model described in Section 2. We trained6 it with the entire Reuters-21578 corpus7. In particular we adopted the collection Apt'e split. It includes 12,902 documents for 90 classes, with a fixed splitting between test-set and learning data (3,299 vs. 9,603). A description of some categories of this corpus is given in Table 1.</Paragraph>
    <Paragraph position="4">  Table 2 shows the five answers generated (with their corresponding rank) by the basic Q/A system, for one example question. The category of the document from which the answer was extracted is displayed in column 1. The question classification algorithm automatically assigned the Crude category to the question.</Paragraph>
    <Paragraph position="5"> The processing of the question identifies the word say as indicating the semantic class of the expected answer and for paragraph retrieval it used the keywords</Paragraph>
    <Paragraph position="7"> as well as all morphological variations for the nouns.</Paragraph>
    <Paragraph position="8"> For each answer from Table 2, we have underlined the words matched against the keywords and emphasized the word matched in the class of the expected answer, whenever such a word was recognized (e.g., for answers 1 and 3 only). For example, the first answer was extracted because words producers, product and directorate general could be matched against the key-words production, Director and General from the question and moreover, the word said has the same semantic class as the word say, which indicates the semantic class of the expected answer.</Paragraph>
    <Paragraph position="9"> The ambiguity of the word plants cause the basic Q/A system to rank the answers related to Cocoa and Grain plantations higher than the correct answer, which is ranked as the third one. If the answer re-ranking or elimination methods are adopted, the correct answer reaches  interview in Algiers that such imports.</Paragraph>
    <Paragraph position="10"> the top as it was assigned the same category as the question, namely the Crude category.</Paragraph>
    <Paragraph position="11"> Next section describes in detail our experiments to prove that question categorization add some important information to select relevant answers.</Paragraph>
  </Section>
class="xml-element"></Paper>