<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1608"> <Title>Extracting Structural Paraphrases from Aligned Monolingual Corpora</Title> <Section position="7" start_page="2" end_page="2" type="evalu"> <SectionTitle> 6 Paraphrases and Question Answering </SectionTitle> <Paragraph position="0"> The ultimate goal of our work on paraphrases is to enable the development of high-precision question answering system (cf. (Katz and Levin, 1988; Soubbotin and Soubbotin, 2001; Hermjakob et al., 2002)). We believe that a knowledge base of paraphrases is the key to overcoming challenges presented by the expressiveness of natural languages.</Paragraph> <Paragraph position="1"> Because the same semantic content can be expressed in many different ways, a question answering system must be able to cope with a variety of alternative phrasings. In particular, an answer stated in a form that differs from the form of the question presents significant problems: When did Colorado become a state? (1a) Colorado became a state in 1876.</Paragraph> <Paragraph position="2"> (1b) Colorado was admitted to the Union in 1876.</Paragraph> <Paragraph position="3"> Who killed Abraham Lincoln? (2a) John Wilkes Booth killed Abraham Lincoln.</Paragraph> <Paragraph position="4"> (2b) John Wilkes Booth ended Abraham Lincoln's life with a bullet.</Paragraph> <Paragraph position="5"> In the above examples, question answering systems have little difficulty extracting answers if the answers are stated in a form directly derived from the question, e.g., (1a) and (2a); simple keyword matching techniques with primitive named-entity detection technology will suffice. However, question answering systems will have a much harder time extracting answers from sentences where they are not obviously stated, e.g., (1b) and (2b). To relate question to answers in those examples, a system would need access to rules like the following: X became a state in Y== X was admitted to the Union in Y X killed Y== X ended Y's life We believe that such rules are best formulated at the syntactic level: structural paraphrases represent a good level of generality and provide much more accurate results than keyword-based approaches. The simplest approach to overcoming the &quot;paraphrase problem&quot; in question answering is via key-word query expansion when searching for candidate answers:</Paragraph> <Paragraph position="7"> The major drawback of such techniques is over-generation of bogus answer candidates. For example, it is a well-known result that query expansion based on synonymy, hyponymy, etc. may actually degrade performance if done in an uncontrolled manner (Voorhees, 1994). 
<Paragraph position="7"> The simplest approach to overcoming the &quot;paraphrase problem&quot; in question answering is keyword query expansion when searching for candidate answers. The major drawback of such techniques is the over-generation of bogus answer candidates. For example, it is a well-known result that query expansion based on synonymy, hyponymy, etc. may actually degrade performance if done in an uncontrolled manner (Voorhees, 1994). Typically, keyword-based query expansion techniques sacrifice significant amounts of precision for little (if any) increase in recall.</Paragraph> <Paragraph position="8"> The problems associated with keyword query expansion stem from the fundamental deficiencies of &quot;bag-of-words&quot; approaches: in short, they simply cannot accurately model the semantic content of text, as illustrated by the following pairs of sentences and phrases, which have the same word content but dramatically different meanings: (3a) The bird ate the snake.</Paragraph> <Paragraph position="9"> (3b) The snake ate the bird.</Paragraph> <Paragraph position="10"> (4a) the largest planet's volcanoes (4b) the planet's largest volcanoes (5a) the house by the river (5b) the river by the house (6a) The Germans defeated the French.</Paragraph> <Paragraph position="11"> (6b) The Germans were defeated by the French.</Paragraph> <Paragraph position="12"> The above examples are nearly indistinguishable in terms of lexical content, yet their meanings are vastly different. Because one fragment of such a pair might be an appropriate answer to a question while the other is not, a question answering system seeking high precision must provide mechanisms for differentiating their semantic content.</Paragraph> <Paragraph position="13"> While keyword-level paraphrase techniques vastly overgenerate, phrase-level paraphrase techniques undergenerate; that is, they are often too specific. Although paraphrase rules can easily be formulated at the string level, e.g., using regular expression matching and substitution (Soubbotin and Soubbotin, 2001; Hermjakob et al., 2002), such a treatment fails to capture important linguistic generalizations. For example, the addition of an adverb typically does not alter the validity of a paraphrase; thus, a string-level rule &quot;X killed Y&quot; == &quot;X ended Y's life&quot; would not be able to match an answer like &quot;John Wilkes Booth suddenly ended Abraham Lincoln's life with a bullet&quot;. String-level paraphrases are also unable to handle syntactic phenomena like passivization, which are easily captured at the syntactic level.</Paragraph> <Paragraph position="14"> We believe that answering questions at the level of syntactic relations, that is, matching parsed representations of questions against parsed representations of candidate answers, addresses the issues presented above. Syntactic relations, essentially simplified versions of the dependency structures produced by the Link Parser, can capture significant portions of the meaning present in text documents while providing a flexible foundation on which to build machinery for paraphrases.</Paragraph>
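<Paragraph> The contrast can be made concrete with a small Python sketch: under a bag-of-words model, sentences (3a) and (3b) are identical, whereas their subject-verb-object relations differ. The triples below are hand-supplied stand-ins for parser output, an assumption made purely for illustration.

# Bag-of-words content versus subject-verb-object relations for (3a)/(3b).

def bag_of_words(sentence):
    """Reduce a sentence to its set of lowercased word tokens."""
    return set(sentence.lower().rstrip(".").split())

s3a = "The bird ate the snake."
s3b = "The snake ate the bird."

# Indistinguishable as bags of words:
print(bag_of_words(s3a) == bag_of_words(s3b))   # True

# Distinguishable as subject-verb-object triples (hand-built here;
# a real system would derive them from a dependency parse):
t3a = ("bird", "eat", "snake")
t3b = ("snake", "eat", "bird")
print(t3a == t3b)                               # False
</Paragraph>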
<Paragraph position="15"> Our position is that question answering should be performed at the level of &quot;key relations&quot; in addition to keywords. We have begun to experiment with the relations indexing and matching techniques described above, using an electronic encyclopedia as the test corpus, and have identified a particular set of linguistic phenomena where relation-based indexing can dramatically boost the precision of a question answering system (Katz and Lin, 2003). As an example, consider sample output from a baseline keyword-based IR system: What do frogs eat? (R1) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.</Paragraph> <Paragraph position="16"> (R2) Some bats catch fish with their claws, and a few species eat lizards, rodents, birds, and frogs.</Paragraph> <Paragraph position="17"> (R3) Bowfins eat mainly other fish, frogs, and crayfish. (R4) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders. ...</Paragraph> <Paragraph position="18"> (R32) Kookaburras eat caterpillars, fish, frogs, insects, small mammals, snakes, worms, and even small birds.</Paragraph> <Paragraph position="19"> Of the 32 sentences returned, only (R4) correctly answers the user query; the other results answer a different question--&quot;What eats frogs?&quot; A bag-of-words approach fundamentally cannot differentiate between a query in which the frog is in the subject position and one in which the frog is in the object position. Compare this to the results produced by our relations matcher: What do frogs eat? (R4) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders. By examining subject-verb-object relations, our system can filter out irrelevant results and return only the correct responses.</Paragraph> <Paragraph position="20"> We are currently working on combining this relations-indexing technology with the automatic paraphrase generation technology described earlier. For example, our approach would be capable of automatically learning a paraphrase like X eat Y == Y is a prey of X; a large collection of such paraphrases would go a long way toward overcoming the brittleness associated with a relations-based indexing scheme.</Paragraph>
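<Paragraph> The filtering step can be sketched in Python as follows; the candidate triples are hand-built stand-ins for indexed parser output, and the identifiers mirror the results above (an illustrative assumption, not our actual index format).

# Relation-based answer filtering for "What do frogs eat?". Keep only
# candidates in which "frog" is the subject of "eat"; candidates in
# which "frog" is the object answer "What eats frogs?" instead.

candidates = {
    "R1":  ("alligator",  "eat", "frog"),
    "R2":  ("bat",        "eat", "frog"),
    "R3":  ("bowfin",     "eat", "frog"),
    "R4":  ("frog",       "eat", "insect"),
    "R32": ("kookaburra", "eat", "frog"),
}

# The query fixes "frog" as the subject of "eat" and asks for the object.
query_subject, query_relation = "frog", "eat"

matches = {rid: triple for rid, triple in candidates.items()
           if triple[0] == query_subject and triple[1] == query_relation}

print(matches)   # {'R4': ('frog', 'eat', 'insect')}
</Paragraph> </Section> </Paper>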