<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1006">
  <Title>Learning Surface Text Patterns for a Question Answering System</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Learning of Patterns
</SectionTitle>
    <Paragraph position="0"> We describe the pattern - learning algorithm with an example. A table of patterns is constructed for each individual question type by the following procedure (Algorithm 1) .</Paragraph>
    <Paragraph position="1">  1. Select an example for a given question type. Thus for BIRTHYEAR questions we select &amp;quot;Mozart 1756&amp;quot; (we refer to &amp;quot;Mozart&amp;quot; as the question term and &amp;quot;1756&amp;quot; as the answer term).</Paragraph>
    <Paragraph position="2"> 2. Submit the question and the answer term as queries to a search engine. Thus, we give the query +&amp;quot;Mozart&amp;quot; +&amp;quot;1756&amp;quot; to AltaVista (http://www.altavista.com).</Paragraph>
    <Paragraph position="3"> 3. Download the top 1000 web documents provided by the search engine.</Paragraph>
    <Paragraph position="4"> 4. Apply a sentence breaker to the documents.</Paragraph>
    <Paragraph position="5"> 5. Retain only those sentences that contain both the question and the answ er term.</Paragraph>
    <Paragraph position="6"> Tokenize the input text, smooth variations in white space characters, and remove html and other extraneous tags, to allow simple regular expression matching tools such as egrep to be used.</Paragraph>
    <Paragraph position="7"> 6. Pass each retained sentence through a suffix tree constru ctor. This finds all substrings, of all lengths, along with their counts. For example consider the sentences &amp;quot;The great composer Mozart (1756 -1791) achieved fame at a young age&amp;quot; &amp;quot;Mozart (1756-1791) was a genius&amp;quot;, and &amp;quot;The whole world would always be inde bted to the great music of Mozart (1756 -1791)&amp;quot;. The longest matching  substring for all 3 sentences is &amp;quot;Mozart (1756-1791)&amp;quot;, which the suffix tree would extract as one of the outputs along with the score of 3.</Paragraph>
    <Paragraph position="8"> 7. Pass each phrase in the suffix tree through a f ilter to retain only those phrases that contain both the question and the answer term. For the example, we extract only those phrases from the suffix tree that contain the words &amp;quot;Mozart&amp;quot; and &amp;quot;1756&amp;quot;. 8. Replace the word for the question term by the tag &amp;quot;&lt;NAME &gt;&amp;quot; and the word for the answer term by the term &amp;quot;&lt;ANSWER&gt;&amp;quot;. This procedure is repeated for different examples of the same question type. For BIRTHDATE we also use &amp;quot;Gandhi 1869&amp;quot;, &amp;quot;Newton 1642&amp;quot;, etc.</Paragraph>
    <Paragraph position="9"> For BIRTHDATE, the above steps produce the following o utput:  a. born in &lt;ANSWER&gt; , &lt;NAME&gt; b. &lt;NAME&gt; was born on &lt;ANSWER&gt; , c. &lt;NAME&gt; ( &lt;ANSWER&gt; d. &lt;NAME&gt; ( &lt;ANSWER - )  ...</Paragraph>
    <Paragraph position="10"> These are some of the most common substrings of the extracted sentences that contain both &lt;NAME&gt; and &lt;ANSWER&gt;.</Paragraph>
    <Paragraph position="11"> Since the suffix tre e records all substrings, partly overlapping strings such as c and d are separately saved, which allows us to obtain separate counts of their occurrence frequencies. As will be seen later, this allows us to differentiate patterns such as d (which records a still living person, and is quite precise) from its more general substring c (which is less precise).</Paragraph>
    <Paragraph position="12">  pattern.</Paragraph>
    <Paragraph position="13"> 1. Query the search engine by using only the question term (in the example, only &amp;quot;Mozart&amp;quot;).</Paragraph>
    <Paragraph position="14"> 2. Down load the top 1000 web documents provided by the search engine.</Paragraph>
    <Paragraph position="15"> 3. As before, segment these documents into individual sentences.</Paragraph>
    <Paragraph position="16"> 4. Retain only those sentences that contain the question term.</Paragraph>
    <Paragraph position="17"> 5. For each pattern obtained from Algorithm 1, check the presence of each pattern in the sentence obtained from above for two instances: i) Presence of the pattern with &lt;ANSWER&gt; tag matched by any word.</Paragraph>
    <Paragraph position="18"> ii) Presence of the pattern in the sentence  with &lt;ANSWER&gt; tag matched by the correct answer term.</Paragraph>
    <Paragraph position="19"> In our example, for the pattern &amp;quot;&lt;NA ME&gt; was born in &lt;ANSWER&gt;&amp;quot; we check the presence of the following strings in the answer sentence i) Mozart was born in &lt;ANY_WORD&gt; ii) Mozart was born in 1756 Calculate the precision of each pattern by the formula P = C a / C o where C a = total number of patterns wi th the answer term present C o = total number of patterns present with answer term replaced by any word 6. Retain only the patterns matching a sufficient number of examples (we choose the number of examples &gt; 5).</Paragraph>
    <Paragraph position="20"> We obtain a table of regular expression patte rns for a given question type, along with the precision of each pattern. This precision is the probability of each pattern containing the answer and follows directly from the principle of maximum likelihood estimation. For BIRTHDATE the following table is  For a given question type a good range of p atterns was obtained by giving the system as few as 10 examples. The rather long list of patterns obtained would have been very difficult for any human to come up with manually.</Paragraph>
    <Paragraph position="21"> The question term could appear in the documents obtained from the web in va rious ways. Thus &amp;quot;Mozart&amp;quot; could be written as &amp;quot;Wolfgang Amadeus Mozart&amp;quot;, &amp;quot;Mozart, Wolfgang Amadeus&amp;quot;, &amp;quot;Amadeus Mozart&amp;quot; or &amp;quot;Mozart&amp;quot;. To learn from such variations, in step 1 of Algorithm 1 we specify the various ways in which the question term could be spe cified in the text. The presence of any of these names would cause it to be tagged as the original question term &amp;quot;Mozart&amp;quot;. The same arrangement is also done for the answer term so that presence of any variant of the answer term would cause it to be treat ed exactly like the original answer term. While easy to do for BIRTHDATE, this step can be problematic for question types such as DEFINITION, which may contain various acceptable answers. In general the input example terms have to be carefully selected so that the questions they represent do not have a long list of possible answers, as this would affect the confidence of the precision scores for each pattern. All the answers need to be enlisted to ensure a high confidence in the precision score of each p attern, in the present framework.</Paragraph>
    <Paragraph position="22"> The precision of the patterns obtained from one QA -pair example in algorithm 1 is calculated from the documents obtained in algorithm 2 for other examples of the same question type. In other words, the precision scores are calculated by cross - checking the patterns across various examples of the same type. This step proves to be very significant as it helps to eliminate dubious patterns, which may appear because the contents of two or more websites may be the same, or t he same web document reappears in the search engine output for algorithms 1 and 2.</Paragraph>
    <Paragraph position="23"> Algorithm 1 does not explicitly specify any particular question type. Judicious choice of the QA example pair therefore allows it to be used for many question types without change.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Finding Answers
</SectionTitle>
    <Paragraph position="0"> Using the patterns to answer a new question we employ the following algorithm:  1. Determine the question type of the new question. We use our existing QA system (Hovy et al., 2002b; 2001) to do so. 2. The question term in the que stion is identified, also using our existing system. 3. Create a query from the question term and perform IR (by using a given answer document corpus such as the TREC - 10 collection or web search otherwise).</Paragraph>
    <Paragraph position="1"> 4. Segment the documents obtained into sentences and smooth out white space variations and html and other tags, as before.</Paragraph>
    <Paragraph position="2"> 5. Replace the question term in each sentence by the question tag (&amp;quot;&lt;NAME&gt;&amp;quot;, in the case of BIRTHYEAR).</Paragraph>
    <Paragraph position="3"> 6. Using the pattern table developed for that particular question type, search for the presence of each pattern. Select words matching the tag &amp;quot;&lt;ANSWER&gt;&amp;quot; as the answer.</Paragraph>
    <Paragraph position="4"> 7. Sort these answers by their pattern's precision scores. Discard duplicates (by elementary string comparisons). Return the top 5 answers.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> From our Webclopedia QA Typology (Hovy et al., 2002a) we selected 6 different question types: BIRTHDATE, LOCATION, INVENTOR, DISCOVERER, DEFINITION, WHY -FAMOUS. The pattern table for each of these question types was constructed using Algorithm 1.</Paragraph>
    <Paragraph position="1"> Some of the patterns obtaine d along with their precision are as follows  For each question type, we extracted the corresponding questions from the TREC - 10 set. These questions were run through the testing phase of the algorithm. Two sets of experiments were performed. In the first case, the TREC corpus was used as the input source and IR was perfo rmed by the IR component of our QA system (Lin, 2002). In the second case, the web was the input source and the IR was performed by the AltaVista search engine.</Paragraph>
    <Paragraph position="2"> Results of the experiments, measured by  The results indicate that the system performs better on the Web data than on the TREC corpus. The abundance of data on the web makes it easier for the system to loc ate answers with high precision scores (the system finds many examples of correct answers among the top 20 when using the Web as the input source). A similar result for QA was obtained by Brill et al. (2001). The TREC corpus does not have enough candidat e answers with high precision score and has to settle for answers extracted from sentences matched by low precision patterns. The WHY -FAMOUS question type is an exception and may be due to the fact that the system was tested on a small number of questions .</Paragraph>
  </Section>
class="xml-element"></Paper>