<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1006"> <Title>Learning Surface Text Patterns for a Question Answering System</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Shortcoming and Extensions </SectionTitle> <Paragraph position="0"> No external knowledge has been added to these patterns. We frequently observe the need for matching part of speech and/or semantic types, however. For example, the question: &quot;Where are the Rocky Mountains located?&quot; is answ ered by &quot;Denver's new airport, topped with white fiberglass cones in imitation of the Rocky Mountains in the background , continues to lie empty&quot;, because the system picked the answer &quot;the background&quot; using the pattern &quot;the <NAME> in <ANSWER>,&quot;. Using a nam ed entity tagger and/or an ontology would enable the system to use the knowledge that &quot;background&quot; is not a location.</Paragraph> <Paragraph position="1"> DEFINITION questions pose a related problem. Frequently the system's patterns match a term that is too general, though correct technicall y. For &quot;what is nepotism?&quot; the pattern &quot;<ANSWER>, <NAME>&quot; matches &quot;...in the form of widespread bureaucratic abuses: graft , nepotism...&quot;; for &quot;what is sonar?&quot; the pattern &quot;<NAME> and related <ANSWER>s&quot; matches &quot;...while its sonar and related underseas systems a re built...&quot;.</Paragraph> <Paragraph position="2"> The patterns cannot handle long - distance dependencies. For example, for &quot;Where is London?&quot; the system cannot locate the answer in &quot;London, which has one of the most busiest airports in the world, lies on the banks of the river Thames&quot; due to t he explosive danger of unrestricted wildcard matching, as would be required in the pattern &quot;<QUESTION>, (<any_word>)*, lies on <ANSWER>&quot;. This is one of the reasons why the system performs very well on certain types of questions from the web but performs poorly with documents obtained from the TREC corpus. The abundance and variation of data on the Internet allows the system to find an instance of its patterns without losing answers to long term dependencies. The TREC corpus, on the other hand, typically contains fewer candidate answers for a given question and many of the answers present may match only long - term dependency patterns.</Paragraph> <Paragraph position="3"> More information needs to be added to the text patterns regarding the length of the answer phrase to be expected. The syst em searches in the range of 50 bytes of the answer phrase to capture the pattern. It fails to perform under certain conditions as exemplified by the question &quot;When was Lyndon B. Johnson born?&quot;. The system selects the sentence &quot;Tower gained national attent ion in 1960 when he lost to democratic Sen. Lyndon B. Johnson, who ran for both re election and the vice presidency&quot; using the pattern &quot;<NAME> <ANSWER>-&quot;. The system lacks the information that the <ANSWER> tag should be replaced exactly by one word. Sim ple extensions could be made to the system so that instead of searching in the range of 50 bytes for the answer phrase it could search for the answer in the range of 1-2 chunks (basic phrases in English such as simple NP, VP, PP, etc.).</Paragraph> <Paragraph position="4"> A more serious limi tation is that the present framework can handle only one anchor point (the question term) in the candidate answer sentence. 
<Paragraph position="3"> More information needs to be added to the text patterns regarding the expected length of the answer phrase. The system searches within a range of 50 bytes of the answer phrase to capture the pattern, and it fails under certain conditions, as exemplified by the question &quot;When was Lyndon B. Johnson born?&quot;. The system selects the sentence &quot;Tower gained national attention in 1960 when he lost to democratic Sen. Lyndon B. Johnson, who ran for both re-election and the vice presidency&quot; using the pattern &quot;<NAME> <ANSWER>-&quot;. The system lacks the information that the <ANSWER> tag should be replaced by exactly one word. A simple extension would be to search for the answer within a range of 1-2 chunks (basic phrases in English such as simple NP, VP, PP, etc.) instead of within a range of 50 bytes.</Paragraph>
<Paragraph position="4"> A more serious limitation is that the present framework can handle only one anchor point (the question term) in the candidate answer sentence. It cannot work for types of questions that require multiple words from the question to be present in the answer sentence, possibly far apart from each other. For example, in &quot;Which county does the city of Long Beach lie?&quot;, the answer &quot;Long Beach is situated in Los Angeles County&quot; requires the pattern &quot;<QUESTION_TERM_1> situated in <ANSWER> <QUESTION_TERM_2>&quot;, where <QUESTION_TERM_1> and <QUESTION_TERM_2> represent the terms &quot;Long Beach&quot; and &quot;county&quot; respectively. The performance of the system depends significantly on there being only one anchor word, which allows a single-word match between the question and the candidate answer sentence. The presence of multiple anchor words would help eliminate many candidate answers simply by requiring that all the anchor words from the question be present in the candidate answer sentence.</Paragraph>
<Paragraph position="5"> The system does not make any distinction between upper- and lower-case letters. For example, &quot;What is micron?&quot; is answered by &quot;In Boise, Idaho, a spokesman for Micron, a maker of semiconductors, said Simms are 'a very high volume product for us ...'&quot;. The answer returned by the system would have been perfect if the word &quot;micron&quot; had been capitalized in the question.</Paragraph>
<Paragraph position="6"> Canonicalization of words is also an issue.</Paragraph>
<Paragraph position="7"> When providing examples for the bootstrapping procedure, say for BIRTHDATE questions, the answer term can be written in many ways (for example, Gandhi's birth date can be written as &quot;1869&quot;, &quot;Oct. 2, 1869&quot;, &quot;2nd October 1869&quot;, &quot;October 2 1869&quot;, and so on). Instead of listing all the possibilities, a date tagger could be used to cluster all the variations and tag them with the same term (a sketch of such normalization follows below).</Paragraph>
<Paragraph position="8"> The same idea could also be extended to smooth out variations in the question term for names of persons (Gandhi could be written as &quot;Mahatma Gandhi&quot;, &quot;Mohandas Karamchand Gandhi&quot;, etc.).</Paragraph>
</Section>
</Paper>
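Following the canonicalization discussion above, here is a minimal sketch, not part of the original system and assuming the third-party python-dateutil package, of how a date tagger could map surface variants of a BIRTHDATE answer onto a single canonical term. A bare year such as "1869" would need extra handling, since dateutil fills missing fields from the current date.

from dateutil import parser  # assumed dependency: python-dateutil

def canonical_date(surface_form):
    """Map a surface date string to a canonical ISO term, or None on failure."""
    try:
        return parser.parse(surface_form).date().isoformat()
    except (ValueError, OverflowError):
        return None

variants = ["Oct. 2, 1869", "2nd October 1869", "October 2 1869"]
print({v: canonical_date(v) for v in variants})
# expected: all three variants map to the same term, '1869-10-02'

Tagging every variant with the canonical term would let the bootstrapping procedure treat them as a single answer string, as suggested in the text.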