<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2029">
  <Title>Automatic Derivation of Surface Text Patterns for a Maximum Entropy Based Question Answering System</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 KM Corpus
</SectionTitle>
    <Paragraph position="0"> A corpus of question-answer pairs was obtained from Knowledge Master (1999). We refer to this corpus as the 1Work done while the author was an intern at IBM TJ Watson Research Center during Summer 2002.</Paragraph>
    <Paragraph position="1"> KM database. Each of the pairs in KM represents a trivia question and its corresponding answer, such as the ones used in the trivia card game. The question-answer pairs in KM were filtered to retain only questions that look similar to the ones presented in the TREC task2. Some examples of QA pairs in KM:  1. Which country was invaded by the Libyan troops in 1983? - Chad 2. Who led the 1930 Salt March in India? - Mohandas</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Unsupervised Construction of Training Set for Pattern Extraction
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Set for Pattern Extraction
</SectionTitle>
      <Paragraph position="0"> We use an unsupervised technique that uses the QA in KM as seeds to learn patterns. This method was first described in Ravichandran and Hovy (2002). However, in this work we have enriched the pattern format by inducing specific semantic types of QTerms, and have learned many more patterns using the KM.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Algorithm for sentence construction
</SectionTitle>
      <Paragraph position="0"> 1. For every question, we run a Named Entity Tagger HMMNE 3 and identify chunks of words, that signify entities. Each such entity obtained from the Question is defined as a Question term (QTerm).</Paragraph>
      <Paragraph position="1"> The Answer Term (ATerm) is the Answer given by the KM corpus.</Paragraph>
      <Paragraph position="2"> 2. Each of the question-answer pairs is submitted as query to a popular Internet search engine4. We use the top 50 relevant documents after stripping off the HTML tags. The text is then tokenized to smoothen white space variations and chopped to individual sentences.</Paragraph>
      <Paragraph position="3"> 3. For every sentence obtained from Step (3) apply  HMMNE and retain only those sentences that contains at least one of the QTerms plus the ATerm. For example, we obtain the following sentences for the QA pair &amp;quot;Which country was invaded by the Libyan troops in 1983? - Chad&amp;quot;:  1. More than 7,000 Libyan troops entered Chad. 2. An OUA peacekeeping force of 3,500 troops replaced the Libyan forces in the remainder of Chad.</Paragraph>
      <Paragraph position="4"> 3. In the summer of 1983, GUNT forces launched an offen null sive against government positions in northern and eastern Chad.</Paragraph>
      <Paragraph position="5"> The underlined words indicate the QTerms and the ATerms that helped to select the sentence as a potential way of answering the Question. The algorithm described above was applied to each of the 16,228 QA pairs in our KM database. A total of more than 250K sentences was obtained.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Sentence Canonicalization
</SectionTitle>
      <Paragraph position="0"> Every sentence obtained from the sentence construction algorithm is canonicalized. Canonicalization of a sentence is performed on the basis of the information provided by HMMNE, the QTerms and the ATerm. Canonicalization in this context may be defined as the generalization of a sentence based on the following process:  1. Apply HMMNE to each sentence obtained from the sentence construction algorithm.</Paragraph>
      <Paragraph position="1"> 2. Identify the QTerms and ATerm in the answer sentence. null 3. Replace the ATerm by the tag &amp;quot;a3 ANSWERa4 &amp;quot;. 4. Replace each identified Named Entity by the class of entity it represents.</Paragraph>
      <Paragraph position="2"> 5. If a given Named Entity is also a QTerm, indicate it by the tag &amp;quot;QT&amp;quot;.</Paragraph>
      <Paragraph position="3">  1. Every sentence obtained from sentence canonicalization algorithm is delimited by the tags &amp;quot;a3 STARTa4 &amp;quot; and &amp;quot;a3 ENDa4 &amp;quot; and then passed through a Suffix Tree. The Suffix Tree algorithm obtains the counts of all sub-strings of the sentence. 2. From the Suffix Tree we obtain only those sub-strings that are at least a trigram, contain both the &amp;quot;a3 ANSWERa4 &amp;quot; and the &amp;quot;a3 QTa4 &amp;quot; tag and have at least a count of 3 occurrences.</Paragraph>
      <Paragraph position="4">  Some examples of patterns obtained from the Suffix Tree algorithm are as follows:  1. son of a5 PERSON QTa6 and a5 ANSWERa6 2. of the a5 ANSWERa6a8a5 DISEASE QTa6 3. of a5 ANSWERa6 at a5 LOCATION QTa6 4. a5 ANSWERa6 was the a5 ORDINALa6a9a5 OCCUPATION QTa6 to 5. a5 ANSWERa6 was elected a5 OCCUPATION QTa6 of the  A set of 22,353 such patterns were obtained by the application of the pattern extraction algorithm from more than 250,000 sentences. Some patterns are very general and applicable to many questions, such as the ones in examples (7) and (8) while others are more specific to a few questions, such as examples (9) and (10). Having obtained these patterns we now can learn the appropriate &amp;quot;weights&amp;quot; to use these patterns in a Question Answering System.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Maximum Entropy Training
</SectionTitle>
    <Paragraph position="0"> For these experiments we use the Maximum Entropy formulation (Della Pietra et al., 1995) and model the distri-</Paragraph>
    <Paragraph position="2"> The patterns derived above are used as features to model the distribution a37a39a38a41a40a43a42a44a46a45a48a47a49a45a48a50a52a51 , which predicts the &amp;quot;correctness&amp;quot; of the configuration of the question, a47 , the predicted answer tag, a44 , and the answer candidate, a50 . The training data for the algorithm consists of TREC-8, TREC-9, and a subset of the KM questions which have been judged to have answers in the TREC corpus5. The total number of questions available for training is shown in Table 1.</Paragraph>
    <Paragraph position="3"> We perform 3 sets of experiment with different choice of feature sets for training: 1. In the first experiment, the patterns obtained automatically from the web are trained along with the expected type of answer using the Maximum Entropy Framework. We refer to this system as the  30 different expected answer types (the ones recognized by HMMNE).</Paragraph>
    <Paragraph position="4"> 2. In the second experiment we use a Statistical QA system that contains bag of words, syntactic and named-entity features. We refer to this system as the IBM TREC11 System. Details of this system appear in (Ittycheriah and Roukos, 2002). This system has approximately 8,000 features.</Paragraph>
    <Paragraph position="5"> 3. In the third experiment we add the patterns as ad null ditional features to the base system IBM TREC11 and train the system. We refer to this system as the ME PAT System. Hence, the total number of features in this system is equal to the sum of the ones in Pat Only and IBM TREC11 system.</Paragraph>
    <Paragraph position="6"> These systems were trained on TREC-9 and KM and for picking the optimum model we used TREC-8 as held-out test data.</Paragraph>
  </Section>
class="xml-element"></Paper>