<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0506">
  <Title>Cooperative Question Answering in Restricted Domains: the WEBCOOP Experiment</Title>
  <Section position="3" start_page="0" end_page="1" type="metho">
    <SectionTitle>
2 The WEBCOOP Architecture
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 A Corpus Based Approach
</SectionTitle>
      <Paragraph position="0"> To get a more accurate picture of how cooperativity is realized in human-to-human communication, we collected a corpus of question-answer pairs (QA pairs) from a number of web sites dedicated to different kinds of large public domains. 60% of the corpus is dedicated to tourism (the application domain of our implementation), 22% to health, and the remaining QA pairs to sport, shopping and education. The analysis of this corpus aims at identifying the external form and the conceptual categories of questions, as well as categorizing the different cooperative functions deployed by humans in their discourse. Our main claim is that an automatic cooperative QA system could be induced from natural productions without losing too much of the cooperative content produced by humans. We noted that human responses are much more diverse than anything a machine could produce in the near future. Nevertheless, it is possible to normalize these forms into more stereotyped utterances.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Architecture
</SectionTitle>
      <Paragraph position="0"> The general architecture of the system (figure 1) is inspired by our corpus analysis. Since our system is a direct QA system, it does not have any user model.</Paragraph>
      <Paragraph position="1"> In WEBCOOP, NL responses are produced from first-order logical formulas constructed from reasoning processes carried out by an inference engine. Our approach requires the development of a knowledge extractor from web pages (Benamara and Saint Dizier, 2004b) (viewed as a passage retrieval component) and the elaboration of a robust question parser. We assume that the documents most relevant to the user's question are found using standard information retrieval techniques and that the relevant paragraphs that respond to the question keywords are correctly extracted from those documents (Harabagiu and Maiorano, 1999). Then, our knowledge extractor transforms each relevant paragraph into a logical representation. The WEBCOOP inference engine has to decide, via cooperative rules, what is relevant and how to organize it in a way that allows for the realization of a coherent and informative response. Responses are structured in two parts. The first part contains explanation elements in natural language. It is a first level of cooperativity that reports user misconceptions in relation to the domain knowledge (answer explanation). The second part is the most important and the most original. It reflects the know-how of the cooperative system, going beyond the cooperative statements given in part one. It is based on intensional description techniques and on intelligent relaxation procedures that go beyond the classical generalization methods used in AI. This component also includes additional dedicated cooperative rules that make thorough use of the domain ontology and of general knowledge. In WEBCOOP, responses provided to users are built in web style by integrating natural language generation (NLG) techniques with hypertexts in order to produce dynamic responses (Dale et al., 1998).</Paragraph>
      <Paragraph position="2"> We claim that responses in natural language must make explicit in some way, via explanations and justifications, the mechanisms that led to the answer. For each type of inference used in WEBCOOP, we define general and underspecified natural language templates (Reiter, 1995) that translate the reasoning mechanisms into accessible terms. A template is composed of three parts, S, F and R, where: - S are specified elements, - F are functions that choose, for each concept in the ontology, its appropriate lexicalization, - R are logical formulas representing the rest of the response to be generated.</Paragraph>
      <Paragraph position="3"> The underspecified elements, F and R, depend on the question, on local semantic factors and on the type of solution elaborated. Their generation relies on ontological knowledge, general linguistic knowledge, and lexicalisation and aggregation functions. Templates have been induced from a number of QA pairs found in large public domains. Responses have been normalized without losing too much of their accuracy in order to get stereotyped response forms usable in NL generation frameworks. A large portion of the underspecified elements within a template is presented to the user as a hyperlink, as illustrated in the examples in the next section. Here is an example of a template dedicated to one of our relaxation schemas. It is used when the question focus is relaxed using its sister nodes in the ontology. Specified elements are in italic: un autre type de ("another type of") lexicalisation(mother node):</Paragraph>
      <Paragraph position="5"> At the moment, in WEBCOOP we have 28 basic templates.</Paragraph>
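The three template parts can be illustrated with a minimal Python sketch. It is not the WEBCOOP implementation: the toy lexicon, the class name Template, the function lexicalise and the English wording "another type of" are all assumptions made for illustration. S is a fixed string, F is a function from concepts to words, and R is reduced here to a list of sister concepts to be realised.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical toy lexicon: concept -> possible lexicalizations.
LEXICON: Dict[str, List[str]] = {
    "tourist_accommodation": ["accommodation"],
    "hotel": ["hotel"],
    "pension": ["pension"],
}


def lexicalise(concept: str) -> str:
    """F: choose a lexicalization for a concept (here, simply the first)."""
    return LEXICON[concept][0]


@dataclass
class Template:
    specified: str                        # S: fixed wording of the template
    lexicalisation: Callable[[str], str]  # F: concept -> word

    def realise(self, mother: str, sisters: List[str]) -> str:
        # R: the rest of the response, reduced here to the sister concepts.
        rest = ", ".join(self.lexicalisation(c) for c in sisters)
        return f"{self.specified} {self.lexicalisation(mother)}: {rest}"


# Assumed English rendering of the sister-node relaxation template.
sister_template = Template("another type of", lexicalise)
print(sister_template.realise("tourist_accommodation", ["hotel", "pension"]))
```

Keeping F as a separate function means the same template can be reused with another lexicon, which is one plausible reading of why the templates are described as underspecified.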
    </Section>
    <Section position="3" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
2.3 Two Typical Examples
</SectionTitle>
      <Paragraph position="0"> The following examples illustrate WEBCOOP outputs.</Paragraph>
      <Paragraph position="1"> Example 1. Suppose one wishes to rent a 15-person country cottage in Corsica, and that either (1) observations made on the related web pages or (2) a constraint or a regulation indicate that the maximum capacity of a country cottage in Corsica is 10 persons (figure 1).</Paragraph>
      <Paragraph position="2"> The first part of the response reports the detection of a false presupposition or the violation of an integrity constraint, for cases (1) and (2) above respectively. Case (2) entails the production of the following message, generated by a process that evaluates the question logical formula against the knowledge base: A chalet [...]</Paragraph>
      <Paragraph position="3"> [Figure: query relaxation] In a second step, the know-how component of the cooperative system generates a set of flexible solutions, as shown in the figure above, since the first part of the response is informative but not really productive for the user. The three flexible solutions proposed emerge from know-how cooperative rules based on relaxation procedures designed to be minimal and conceptually relevant. The first flexible solution is based on a cardinality relaxation, while in the last two solutions relaxation operates gradually on concepts such as the type of accommodation (hotel or pension) or the region (possibly a nearby region with similar characteristics), via the domain model and the ontology. Dynamically created links are underlined. The user can then, at will, get more precise information, dynamically generated from the database of indexed web pages. For technical details on how relaxed responses are elaborated and generated in NL, see (Benamara and Saint Dizier, 2004a). Footnote: a template fragment of the form (fragment)+ indicates that the fragment occurs at least once in the generated response.</Paragraph>
      <Paragraph position="4"> Example 2. Suppose a user asks for means of transportation to go to Geneva airport. In WEBCOOP, we have a variable-depth intensional calculus which allows us, experimentally, to tune the degree of intensionality of responses in terms of the abstraction level of the generalizers in the ontology. This choice is based on a conceptual metric that determines the ontological proximity between two concepts. The goal is to have a level of abstraction adequate for the user. A supervisor manages both the abstraction level and the display of the elaborated intensional answers (IA). The retrieved IA are structured in two parts. First, a response with generalizations and exceptions is generated: all trains, buses and taxis go to the airport. Then, a list of the retrieved extensional answers is generated, sorted according to the frequency and the cost of transportation. This strategy avoids the problem of having to guess the user's intent. For technical details on how IA are elaborated and generated in NL, see (Benamara, 2004).</Paragraph>
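A rough Python sketch of the two ingredients mentioned in Example 2 follows. The text does not define the conceptual metric, so a plain edge-counting distance through the closest common ancestor is used here as a stand-in. The ontology fragment, the tuple layout of the answers and the sort order (most frequent first, then cheapest) are likewise assumptions.

```python
# Hypothetical fragment of a transportation ontology: child -> mother node.
PARENT = {
    "train": "public_transport",
    "bus": "public_transport",
    "public_transport": "means_of_transportation",
    "taxi": "means_of_transportation",
}


def ancestors(concept):
    """The concept followed by its chain of mother nodes up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def distance(a, b):
    """Edge-counting distance through the closest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = next(c for c in up_a if c in up_b)
    return up_a.index(common) + up_b.index(common)


def sort_answers(answers):
    """Sort (means, departures_per_hour, cost): frequent first, then cheap."""
    return sorted(answers, key=lambda a: (-a[1], a[2]))


# Hypothetical extensional answers with made-up frequencies and costs.
answers = [("train", 6, 3.0), ("taxi", 12, 35.0), ("bus", 6, 2.5)]
print(distance("train", "bus"), distance("train", "taxi"))
print([name for name, _, _ in sort_answers(answers)])
```

A smaller distance between two answers suggests that they can be grouped under a nearby generalizer, while a larger one pushes the generalization higher in the ontology.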
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Knowledge Representation for the Tourism Domain: a Typology
</SectionTitle>
      <Paragraph position="0"> A first question about knowledge, for automating the production of cooperative responses, concerns the type and the typology of knowledge involved and where such knowledge can best be represented: in databases, in knowledge bases, or in texts (involving knowledge extraction or extraction of text fragments). So far, the different forms of knowledge we have identified are, roughly: 1. general-purpose, factual information (places, distances, proper names, etc.), 2. descriptive information like flight schedules, hotel fares, etc., which is generally found in databases, 3. common sense knowledge and constraints such as: for a given trip, the arrival time is greater than the departure time, 4. hierarchical knowledge, such as: a hotel is a kind of tourist accommodation. This knowledge is often associated with properties that define the object; for example, a restaurant is characterized by its type of food, category, localization, etc.</Paragraph>
      <Paragraph position="1"> 5. procedures or instructions that describe how to prepare a trip or how to book a room in a given hotel category.</Paragraph>
      <Paragraph position="2"> 6. definitions, 7. regulations, warnings, 8. classification criteria of objects according to specific properties, such as sorting hotels according to their category.</Paragraph>
      <Paragraph position="3"> 9. interpretation functions, for example, of fuzzy terms (e.g. expensive, far from the beach).</Paragraph>
      <Paragraph position="4"> Items 8 and 9 have a quite different nature, but they are closely related to the domain at stake.</Paragraph>
    </Section>
    <Section position="5" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.2 Knowledge Representation in WEBCOOP
</SectionTitle>
      <Paragraph position="0"/>
    </Section>
  </Section>
  <Section position="4" start_page="1" end_page="4" type="metho">
    <SectionTitle>
WEBCOOP
</SectionTitle>
    <Paragraph position="0"> Let us now consider how these forms of knowledge are represented. WEBCOOP has two main forms for encoding knowledge: (1) general knowledge and domain knowledge represented by means of a deductive knowledge base, that includes facts, rules and integrity constraints and (2) a large set of indexed texts, where indexes are logical formulae. Our semantic representation is based on a simplified version of the Lexical Conceptual Structure (LCS). Let us review these below.</Paragraph>
    <Paragraph position="1"> The kernel-satellite structure of the tourism domain requires that we study, for this application, portability and data integration aspects for each satellite domain. At this level of complexity there is no ready-made method that we can use; furthermore, most of the work is done manually. The results of the integration reflect our own intuitions, combined with and applied to generic data available on the web.</Paragraph>
    <Paragraph position="2"> a. The knowledge base is coded in Prolog.</Paragraph>
    <Paragraph position="3"> It includes basic knowledge, e.g. country names coded as facts, or distance graphs between towns coded as facts and rules. It also includes rules which play at least two roles: data abstraction (e.g. to describe the structure of an object, besides e.g. part-of descriptions found in the ontology): hotel_stay_cost(Hotel_ID, NbNights, Total) :- hotel(Hotel_ID, Night_rate), Total is NbNights * Night_rate.</Paragraph>
    <Paragraph position="4"> and the encoding of conditional situations: book_flight(A) :- person(A), age(A, AG), AG &gt; 17.</Paragraph>
    <Paragraph position="5"> which says that you can book a flight if you are at least 18 years old. Finally, the knowledge base contains integrity constraints. For example, the constraint: constraint([chalet(X), capacity(X, C), C &gt; 10], fail).</Paragraph>
    <Paragraph position="6"> indicates that 'a chalet cannot accommodate more than 10 persons'.</Paragraph>
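For readers less familiar with Prolog, the two rules and the integrity constraint above can be paraphrased in Python. This is only a reading aid under stated assumptions, not the WEBCOOP code: the dictionaries NIGHT_RATE and AGE stand in for KB facts and contain made-up values, and the function names are hypothetical.

```python
# Hypothetical facts: hotel id -> night rate, person -> age.
NIGHT_RATE = {"h1": 80}
AGE = {"ann": 18, "tom": 16}


def hotel_stay_cost(hotel_id, nb_nights):
    """Data abstraction rule: total is number of nights times night rate."""
    return nb_nights * NIGHT_RATE[hotel_id]


def can_book_flight(person):
    """Conditional rule: booking requires an age greater than 17."""
    return AGE[person] > 17


def violates_chalet_capacity(kind, capacity):
    """Integrity constraint: a chalet with capacity above 10 is rejected."""
    return kind == "chalet" and capacity > 10


print(hotel_stay_cost("h1", 3))
print(can_book_flight("ann"), can_book_flight("tom"))
print(violates_chalet_capacity("chalet", 15))
```

Unlike the Prolog versions, these functions only run in one direction. A Prolog rule such as the stay-cost rule can also be queried with unbound arguments, which is part of what makes a deductive knowledge base convenient for the inference engine.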
    <Paragraph position="7"> The ontology, described below, contains data which can be interpreted as facts (e.g. hierarchical relations), rules or integrity constraints (as simple as domain constraints for property values). Currently, our KB contains 170 rules and 47 integrity constraints, which seems to cover a large number of situations.</Paragraph>
    <Paragraph position="8"> b. The ontology is basically conceptual: nodes are associated with concept lexicalizations and essential properties. Each node is represented by the predicate: onto-node(concept, lex, properties), where concept is described using properties and lex are the possible lexicalisations of concept. Most lexicalisations are entries in the lexicon (except for paraphrases), where morphological and grammatical aspects are described. For example, for hotel, we have: onto-node(hotel, [[hôtel], [hôtel, résidence]], [night-rate, nb-of-rooms]).</Paragraph>
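The shape of an ontology node can be mirrored by a small Python record. The class OntoNode and the helper surface_forms are hypothetical names, and the accented French spellings are assumed. Only the three fields (concept, lex, properties) come from the predicate described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OntoNode:
    concept: str
    lex: List[List[str]]   # each lexicalization is a list of words
    properties: List[str]  # essential properties of the concept


hotel = OntoNode(
    concept="hotel",
    lex=[["hôtel"], ["hôtel", "résidence"]],
    properties=["night-rate", "nb-of-rooms"],
)


def surface_forms(node):
    """Join the words of each lexicalization into a printable string."""
    return [" ".join(words) for words in node.lex]


print(surface_forms(hotel))
```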
    <Paragraph position="9"> There are several well-designed public domain ontologies on the net. Our ontology is inspired by two existing French ontologies that we had to customize: TourinFrance and the bilingual (French and English) thesaurus of tourism and leisure activities, which includes 2800 French terms. We manually integrated these ontologies in WEBCOOP (Doan et al., 2002) by removing concepts that are either too specific (i.e. too low level), like some basic aspects of ecology, or rarely considered, e.g. the economy of tourism. We also removed quite surprising classifications, like sanatorium under tourist accommodation. We then reorganized some concept hierarchies so that they look more intuitive to a large public. Finally, we found that some hierarchies are a little odd: for example, we found accommodation capacity and holiday accommodation at the same level whereas, in our case, we consider that capacity is a property of the concept tourist accommodation.</Paragraph>
    <Paragraph position="11"> We have, at the moment, an organization of 1000 concepts in our tourism ontology, which describe accommodation and transportation and a few other satellite elements (geography, health, immigration).</Paragraph>
    <Paragraph position="12"> c. The lexicon contains nouns, verbs and adjectives related to the tourism domain, extracted from both corpora and ontologies. It also contains determiners, connectors and prepositions. For nouns, the lexicon is constructed directly from the revised ontologies. Nouns contain basic information (e.g. predicative or not, count/mass, deverbal) coded by hand, their 'semantic' type, directly characterized by their ancestor in the ontology, and a simple semantic representation. Verbs are those found in our corpora. We have a large verb KB (VOLEM project) (Fernandez et al., 2002) of 1700 verbs in French, Spanish and Catalan.</Paragraph>
    <Paragraph position="13"> The verb lexicon is extracted from this KB almost without modification. For tourism, including request verbs, we have 150 verbs. Since verbs are central in NLG, it is crucial that they carry rich information in our system: thematic roles, selectional restrictions, syntactic alternations, WordNet classification, and semantic representation (a conceptual representation, a simplification of the Lexical Conceptual Structure). d. Indexed texts. Our knowledge extractor, which is based on the domain ontology, transforms each text fragment into the following logical representation: text(F, http), where F is a first-order formula that represents knowledge extracted (in general) from a web page with address http (or explicit text).</Paragraph>
    <Paragraph position="14"> For example, indexed texts about airport transportations in various countries have the following form:</Paragraph>
    <Paragraph position="16"> [...] ∧ localization(cointrin, in(geneva)), www.gva.ch).</Paragraph>
    <Paragraph position="17"> Indexed paragraphs also describe categories such as: procedures, regulations, warnings or classifications. Texts identified as such are indexed by indicating (1) the category in which they fall, (2) a keyword or a formula that identifies the nature of the procedure, regulation, etc., and (3) the text itself, generally used as such in a response.</Paragraph>
    <Paragraph position="18"> e. Query representation and evaluation. Processing a query allows for the identification of the type of the query (yes/no, Boolean or entity, etc.) and of the question focus, and for the construction of its semantic representation in first-order logic. For example, the question: what are the means of transportation to go to Geneva airport? has the following logical representation: (entity, meansoftransportation(Y),</Paragraph>
    <Paragraph position="20"> We infer that a fragment of text is an answer to a question in two different ways: (1) from the deductive knowledge base, in which case responses are variable instances, or (2) from the indexed text base, in which case responses are formulae which unify with the query formula. In this latter case, roughly, unification proceeds as follows.</Paragraph>
    <Paragraph position="22"> Let Q (a conjunction of terms q_i) be the question formula and F (a conjunction of f_j) be a formula associated with an indexed text. F is a response to Q iff for all q_i [...] q_i rewrites, via rules of the knowledge base, into a conjunction of f_j, e.g.: airportof(Z, geneva) rewrites into: airport(Z) ∧ localisation(Z, in(geneva)).</Paragraph>
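The matching step can be approximated by the following Python sketch. It is a simplification under stated assumptions and not the WEBCOOP procedure: terms are encoded as tuples, variables are strings starting with an upper-case letter, the single rewrite rule is hard-coded in the hypothetical function rewrite, and a query term is considered matched when it unifies with some term of the indexed formula after rewriting. There is no occurs check.

```python
def is_var(t):
    """Variables are strings starting with an upper-case letter, e.g. Z."""
    return isinstance(t, str) and t[:1].isupper()


def walk(t, subst):
    """Follow variable bindings until a non-variable or unbound variable."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t


def unify(a, b, subst):
    """Return an extended substitution, or None if a and b do not unify."""
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return {**subst, a: b}
    if is_var(b):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None


def rewrite(term):
    """Hypothetical KB rule: airportof(Z, C) gives airport(Z), localisation(Z, in(C))."""
    if term[0] == "airportof":
        _, z, city = term
        return [("airport", z), ("localisation", z, ("in", city))]
    return [term]


def answers(query, facts):
    """Yield substitutions under which every rewritten query term matches a fact."""
    goals = [g for q in query for g in rewrite(q)]

    def solve(remaining, subst):
        if not remaining:
            yield subst
            return
        for fact in facts:
            s = unify(remaining[0], fact, subst)
            if s is not None:
                yield from solve(remaining[1:], s)

    return solve(goals, {})


# Terms of a hypothetical indexed text about Geneva airport.
facts = [("airport", "cointrin"), ("localisation", "cointrin", ("in", "geneva"))]
print(list(answers([("airportof", "Z", "geneva")], facts)))
```

The substitution that comes back binds Z to cointrin, which is the kind of variable instance an entity question would report.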
    <Section position="1" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
3.3 Inference Needs for Providing
Cooperative Responses
</SectionTitle>
      <Paragraph position="0"> We develop a general typology of cooperative functions. The aim is to identify the types and sources of knowledge associated with each of these functions. In terms of portability, we think that annotating the various cooperative functions used in the QA corpora of a specific domain should help identify the knowledge needed to develop each cooperative function. It then remains to evaluate the validity and the adequacy of the inference schemas, but these can only be evaluated a posteriori, whereas the types of knowledge can be evaluated a priori.</Paragraph>
      <Paragraph position="1"> Another perspective is that, given the description of the forms of knowledge associated with an application, it may be possible to anticipate what kinds of cooperative functions could be implemented for this application.</Paragraph>
      <Paragraph position="2"> We decompose cooperative functions into two main classes: Response Elaboration (ER) and Additional Information (ADR). The first class includes response units that propose alternatives to the question, whereas the latter contains a variety of complements of information which are useful but not absolutely necessary, such as precisions, suggestions or warnings. Figure 4 shows the different kinds of knowledge involved for each of the cooperative functions that belong to the ER class. In the tourism domain, queries are very diverse in form and content. From that point of view, they are closer to open domains than to closed domains, as advocated in the introduction. Questions about tourism, as revealed by our corpus studies, include false presuppositions (FP), misunderstandings (MIS), concept relaxations (RR) and intensional responses (IR).</Paragraph>
      <Paragraph position="3"> For the moment, we investigate only Boolean questions and questions about entities, and we use the inference schemas FP, MIS, RR and IR cited above. We think it is important to make explicit in the response the types of knowledge used in the inferences and to show how they are organized and lexicalized. As described in example 1 of section 2.3, the explanation given in italic in the response, another accommodation type: hotel, pension, indicates that a relaxation based on the ontological type of the concept chalet was carried out.</Paragraph>
    </Section>
  </Section>
</Paper>