<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4014"> <Title>HITIQA: A Data Driven Approach to Interactive Analytical Question Answering</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Factoid vs. Analytical QA </SectionTitle> <Paragraph position="0"> The process of automated question answering is now fairly well understood for most types of factoid questions. Factoid questions display a fairly distinctive &quot;answer type&quot;, which is the type of the information item needed for the answer, e.g., &quot;person&quot; or &quot;country&quot;. Most existing factoid QA systems deduce this expected answer type from the form of the question using a finite list of possible answer types. For example, &quot;How long was the Titanic?&quot; expects some length measure as an answer, probably in yards and feet, or meters. This is generally a very good strategy that has been exploited successfully in a number of automated QA systems that appeared in recent years, especially in the context of</Paragraph> </Section> <Section position="4" start_page="0" end_page="1" type="metho"> <SectionTitle> TREC QA </SectionTitle> <Paragraph position="0"> evaluations (Harabagiu et al., 2002; Hovy et al., 2000; Prager et al., 2001).</Paragraph> <Paragraph position="1"> This answer-typing process is not easily applied to analytical questions because the type of an answer for analytical questions cannot always be anticipated due to their inherently exploratory character. In contrast to a factoid question, an analytical question has an unlimited variety of syntactic forms with only a loose connection between their syntax and the expected answer.
Given the unlimited potential of the formation of analytical questions, it would be counter-productive to restrict them to a limited number of question/answer types.</Paragraph> <Paragraph position="2"> Therefore, the formation of an answer in analytical QA should instead be guided by the user's interest as expressed in the question, as well as through an interactive dialogue with the system.</Paragraph> <Paragraph position="3"> In this paper we argue that the semantics of an analytical question is more likely to be deduced from the information that is considered relevant to the question than through a detailed analysis of its particular form. Determining &quot;relevant&quot; information is not the same as finding an answer; indeed we can use relatively simple information retrieval methods (keyword matching, etc.) to obtain perhaps 200 &quot;relevant&quot; documents from a database. This gives us an initial answer space to work from in order to determine the scope and complexity of the answer, but we are nowhere near the answer yet. In our project, we use structured templates, which we call frames, to map out the content of pre-retrieved documents, and subsequently to delineate the possible meaning of the question before we can attempt to formulate an answer. (Note: TREC QA is the annual Question Answering evaluation sponsored by the U.S. National Institute of Standards and Technology, www.trec.nist.gov.)</Paragraph> </Section> <Section position="5" start_page="1" end_page="4" type="metho"> <SectionTitle> 3 Text Framing </SectionTitle> <Paragraph position="0"> In HITIQA we use a text framing technique to delineate the gap between the meaning of the user's question and the system's &quot;understanding&quot; of this question.
The framing process does not attempt to capture the entire meaning of the passages; instead it imposes a partial structure on the text passages that allows the system to systematically compare different passages against each other and against the question. Framing is just sufficient to communicate with the user about the differences between their question and the returned text. In particular, the framing process may uncover topics or aspects within the answer space which the user has not explicitly asked for, and of which the user may therefore be unaware. If these topics or aspects align closely with the user's question, we may want to make the user aware of them and let him/her decide if they should be included in the answer.</Paragraph> <Paragraph position="1"> Frames are built from the retrieved data, after clustering it into several topical groups. Retrieved documents are first broken down into passages, mostly exploiting the naturally occurring paragraph structure of the original sources, filtering out duplicates. The remaining passages are clustered using a combination of hierarchical clustering and n-bin classification (Hardy et al., 2002). Typically three to six clusters are generated.</Paragraph> <Paragraph position="2"> Each cluster represents a topic theme within the retrieved set: usually an alternative or complementary interpretation of the user's question. Since clusters are built out of small text passages, we associate a frame with each passage that serves as a seed of a cluster. We subsequently merge passages and their associated frames whenever anaphoric and other cohesive links are detected.</Paragraph> <Paragraph position="3"> HITIQA starts by building a general frame on the seed passages of the clusters and any of the top N (currently N=10) scored passages that are not already in a cluster.
The general frame represents an event or a relation involving any number of entities, which make up the frame's attributes, such as LOCATION, PERSON, COUNTRY, ORGANIZATION, etc. Attributes are extracted from text passages by BBN's Identifinder, which tags 24 types of named entities. The event/relation itself could be almost anything, e.g., accident, pollution, trade, etc., and it is captured in the TOPIC attribute from the central verb or noun phrase of the passage. In general frames, attributes have no assigned roles; they are loosely grouped around the TOPIC.</Paragraph> <Paragraph position="4"> We have also defined three slightly more specialized typed frames by assigning roles to selected attributes in the general frame. These three &quot;specialized&quot; frames are: (1) a Transfer frame with three roles: FROM, TO and OBJECT; (2) a two-role Relation frame with AGENT and OBJECT roles; and (3) a one-role Property frame.</Paragraph> <Paragraph position="5"> These typed frames represent certain generic events/relationships, which then map into more specific event types in each domain. Other frame types may be defined if needed, but we do not anticipate there will be more than a handful altogether.</Paragraph> <Paragraph position="6"> Where the general frame is little more than just a &quot;bag of attributes&quot;, the typed frames capture some internal structure of an event, but only to the extent required to enable an efficient dialogue with the user. Typed frames are &quot;triggered&quot; by the appearance of specific words in text; for example, the word export may trigger a Transfer frame. A single text passage may invoke one or more typed frames, or none at all. When no typed frame is invoked, the general frame is used as the default. If a typed frame is invoked, HITIQA will attempt to identify the roles, e.g. FROM, TO, OBJECT, etc. This is done by mapping general frame attributes selected from text onto the typed attributes in the frames.
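The general and typed frames described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not HITIQA's implementation: the class and function names, the contents of the trigger table beyond the word export, and the name of the Property frame's single role are all hypothetical. Roles are left unfilled because the role-identification rules are not specified in the text.

```python
from dataclasses import dataclass, field

# Hypothetical trigger table: a word in the passage selects a typed frame.
# Only "export" -> Transfer comes from the text; "import" is an assumed entry.
TRIGGERS = {"export": "Transfer", "import": "Transfer"}

# Roles of the three typed frames named in the text.
# The single role of the Property frame is not named there; "OBJECT" is assumed.
ROLES = {
    "Transfer": ("FROM", "TO", "OBJECT"),
    "Relation": ("AGENT", "OBJECT"),
    "Property": ("OBJECT",),
}


@dataclass
class Frame:
    frame_type: str                                 # "General" or a typed frame name
    topic: str                                      # central verb or noun phrase
    attributes: dict = field(default_factory=dict)  # e.g. {"COUNTRY": ["Iraq"]}
    roles: dict = field(default_factory=dict)       # stays empty for general frames


def triggered_types(tokens):
    """Return the typed frames triggered by the words of a passage, in order."""
    found = []
    for tok in tokens:
        ftype = TRIGGERS.get(tok.lower())
        if ftype and ftype not in found:
            found.append(ftype)
    return found


def build_frames(tokens, topic, attributes):
    """Build one frame per triggered type, or a general frame by default."""
    types = triggered_types(tokens)
    if not types:
        return [Frame("General", topic, attributes)]
    # Each role starts as None; filling it is the job of the role-ID rules.
    return [Frame(t, topic, attributes, dict.fromkeys(ROLES[t])) for t in types]
```

Calling build_frames on a tokenized passage that contains the word export yields a single Transfer frame with three unfilled roles, while a passage with no trigger word falls back to one general frame with no roles.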
In any given domain, e.g., weapon non-proliferation, both the trigger words and the role identification rules can be specialized from a training corpus of typical documents and questions. For example, the role-ID rules rely both on syntactic cues and the expected entity types, which are domain adaptable.</Paragraph> <Paragraph position="7"> Domain adaptation is desirable for obtaining more focused dialogue, but it is not necessary for HITIQA to work. We used both setups under different conditions: the generic frames were used with the TREC document collection to measure the impact of IR precision on QA accuracy (Small et al., 2004). The domain-adapted frames were used for sessions with intelligence analysts working with the WMD domain (see below). Currently, the adaptation process includes manual tuning followed by corpus bootstrapping using an unsupervised learning method (Strzalkowski &amp; Wang, 1996). We generally rely on BBN's Identifinder for extraction of basic entities, and use bootstrapping to define additional entity types as well as to assign roles to attributes.</Paragraph> <Paragraph position="8"> Scalability is certainly an outstanding issue here, and we are working on effective frame acquisition methods, a topic that is outside the scope of this paper.</Paragraph> <Paragraph position="9"> The version of HITIQA reported here and used by analysts during the evaluation has been adapted to the Weapons of Mass Destruction Non-Proliferation domain (WMD domain, henceforth). Figure 1b contains an example passage from this data set. In the WMD domain, the typed frames were mapped onto the 3-role WMDTransfer frame and two 2-role frames, WMDTreaty and WMDDevelop.
Adapting the frames to the WMD domain required only minimal modification, such as adding a WEAPON entity to augment the Identifinder entity set, specializing the OBJECT attribute in WMDTransfer to WEAPON, generating a list of international weapon control treaties, etc.</Paragraph> <Paragraph position="10"> HITIQA frames define top-down constraints on how to interpret a given text passage, which is quite different from the MUC template filling task (Humphreys et al., 1998). What we are trying to do here is to &quot;fit&quot; a frame over a text passage. This also means that multiple frames can be associated with a text passage, or to be exact, with a cluster of passages. Since most of the passages that undergo the framing process are part of some cluster of very similar passages, the added redundancy helps to reinforce the most salient features for extraction. This makes the framing process potentially less error-prone than MUC-style template filling.</Paragraph> <Paragraph position="11"> The Bush Administration claimed that Iraq was within one year of producing a nuclear bomb. On 30 November 1990... Leonard Spector said that Iraq possesses 200 tons of natural uranium imported and smuggled from several countries. Iraq possesses a few working centrifuges and the blueprints to build them. Iraq imported centrifuge materials from Nukem of the FRG and from other sources. One decade ago, Iraq imported 27 pounds of weapons-grade uranium from France, ... FIGURE 1b: A text passage from the WMD domain data A very similar framing process is applied to the user's question, resulting in one or more Goal frames, which are subsequently compared to the data frames obtained from retrieved text passages. A Goal frame can be a general frame or any of the typed frames. The Goal frame generated from the question, &quot;Has Iraq been able to import uranium?&quot; is shown in Figure 2.
This frame is of WMDTransfer type, with 3 role attributes TRF_TO, TRF_FROM and TRF_OBJECT, plus the relation type (TRF_TYPE). Each role attribute is defined over an underlying general frame attribute (given in parentheses), which is used to compare frames of different types.</Paragraph> <Paragraph position="12"> HITIQA automatically judges a particular data frame as relevant, and subsequently the corresponding segment of text as relevant, by comparison to the Goal frame. The data frames are scored based on the number of conflicts found with the Goal frame. The conflicts are mismatches on values of corresponding attributes. If a data frame is found to have no conflicts, it is given the highest relevance rank, and a conflict score of zero.</Paragraph> <Paragraph position="13"> MUC, the Message Understanding Conference, funded by DARPA, involved the evaluation of information extraction systems applied to a common task.</Paragraph> <Paragraph position="14"> We do not have enough data to make a definite comparison at this time.</Paragraph> <Paragraph position="15"> All other data frames are scored with an increasing value based on the number of conflicts, score 1 for frames with one conflict with the Goal frame, score 2 for two conflicts, etc. Frames that conflict with all information found in the query are given the score 99 indicating the lowest rank. Currently, frames with a conflict score of 99 are excluded from further processing as outliers. The frame in Figure 3 is scored as relevant to the user's query and included in the answer space.</Paragraph> </Section> <Section position="6" start_page="4" end_page="4" type="metho"> <SectionTitle> 4 Clarification Dialogue </SectionTitle> <Paragraph position="0"> Data frames with a conflict score of zero form the initial kernel answer space and HITIQA proceeds by generating an answer from this space. Depending upon the presence of other frames outside of this set, the system may initiate a dialogue with the user.
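The conflict scoring from Section 3 and the selection of the kernel answer space can be sketched in Python as follows. This is a minimal illustration, not HITIQA's implementation. It assumes that frames are plain dictionaries from attribute names to values, that only attributes present in the Goal frame are compared, and that a missing or different value counts as one conflict. The function names and the example values are hypothetical.

```python
OUTLIER_SCORE = 99  # score the text assigns to frames that conflict on everything


def conflict_score(goal, data):
    """Count attribute mismatches between a Goal frame and a data frame.

    Assumption: a Goal attribute that is absent from the data frame, or that
    has a different value there, counts as exactly one conflict.
    """
    if not goal:
        return 0
    conflicts = sum(1 for attr, value in goal.items() if data.get(attr) != value)
    # A frame that conflicts with all information in the query is an outlier.
    return OUTLIER_SCORE if conflicts == len(goal) else conflicts


def partition(goal, data_frames):
    """Split data frames into the kernel answer space and near-miss frames."""
    kernel, near_miss = [], []
    for frame in data_frames:
        score = conflict_score(goal, frame)
        if score == 0:
            kernel.append(frame)               # zero conflicts: kernel answer space
        elif score != OUTLIER_SCORE:
            near_miss.append((score, frame))   # candidates for clarification
        # frames scored 99 are dropped as outliers
    return kernel, near_miss


# Illustrative values only; they are not taken from the figures of the paper.
goal = {"TOPIC": "import", "LOCATION": "Iraq", "WEAPON": "uranium"}
frames = [
    {"TOPIC": "import", "LOCATION": "Iraq", "WEAPON": "uranium"},
    {"TOPIC": "develop", "LOCATION": "Iraq", "WEAPON": "uranium"},
    {"TOPIC": "treaty", "LOCATION": "elsewhere", "WEAPON": "missile"},
]
kernel, near_miss = partition(goal, frames)
```

In this example the first frame lands in the kernel, the second is a 1-conflict near miss on TOPIC, and the third is discarded as an outlier.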
HITIQA begins asking the user questions about these near-miss frame groups (groups with one or more conflicts), starting with the largest group. In order to keep the dialogue from becoming too long-winded, we set thresholds on the number of conflicts and the group size that are considered by the dialogue manager.</Paragraph> <Paragraph position="1"> A 1-conflict frame has only a single attribute mismatch with the Goal frame. This could be a mismatch on any of the general frame attributes, for example, LOCATION, ORGANIZATION, TIME, etc., or in one of the role-assigned attributes, TO, FROM, OBJECT, etc. A special case arises when the conflict occurs on the TOPIC attribute, which indicates the event type. Since all other attributes match, we may be looking at potentially different events of the same kind involving the same entities, possibly occurring at the same location or time. The purpose of the clarification dialogue in this case is to probe which of these topics may be of interest to the user. Another special case arises when the Goal frame is of a different type than a data frame. The purpose of the clarification dialogue in this case is to see if the user wishes to expand the answer space to include events of a different type. This situation is illustrated in the exchange shown in Figure 4. Note that the user can examine a partial answer prior to answering clarification questions.</Paragraph> <Paragraph position="2"> User: &quot;Has Iraq been able to import uranium?&quot; [a partial answer displayed in an answer window] HITIQA: &quot;Are you also interested in background information on the uranium development program in Iraq?&quot; FIGURE 4: Clarification question generated for the Iraq/uranium question The clarification question in Figure 4 is generated by comparing the Goal frame in Figure 2 to a partly matching frame (Figure 5) generated from some other text passage.
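The ordering of near-miss groups described at the start of this section can be sketched in Python as follows. This is a hedged illustration only. The text does not say how frames are grouped or what the threshold values are, so this sketch assumes that frames are grouped by the set of Goal attributes on which they conflict. The default thresholds and all names are hypothetical.

```python
from collections import defaultdict


def conflicting_attributes(goal, data):
    """Return the sorted tuple of Goal attributes whose values mismatch."""
    return tuple(sorted(a for a, v in goal.items() if data.get(a) != v))


def dialogue_groups(goal, data_frames, max_conflicts=1, min_group_size=2):
    """Group near-miss frames and order the groups, largest first.

    max_conflicts and min_group_size stand in for the thresholds mentioned
    in the text; their default values here are assumed, not reported.
    """
    groups = defaultdict(list)
    for frame in data_frames:
        key = conflicting_attributes(goal, frame)
        # Skip zero-conflict frames (empty key) and frames over the threshold.
        if key and max_conflicts >= len(key):
            groups[key].append(frame)
    kept = [(k, g) for k, g in groups.items() if len(g) >= min_group_size]
    # The dialogue manager asks about the largest group first.
    return sorted(kept, key=lambda item: len(item[1]), reverse=True)
```

Under this grouping, a group keyed by ("TOPIC",) corresponds to the special case above in which all attributes other than the event type match.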
We note first that the Goal frame for this example is of WMDTransfer type, while the data frame in Figure 5 is of the type WMDDevelop. Nonetheless, both frames match on their general-frame attributes WEAPON and LOCATION. Therefore, HITIQA asks the user if it should expand the answer space to include development of uranium in Iraq as well.</Paragraph> <Paragraph position="3"> During the dialogue, as new information is obtained from the user, the Goal frame is updated and the scores of all the data frames are reevaluated. If the user responds with the equivalent of &quot;yes&quot; to the system's clarification question in the dialogue in Figure 4, a corresponding WMDDevelop frame will be added to the set of active Goal frames and all WMDDevelop frames obtained from text passages will be re-scored for possible inclusion in the answer.</Paragraph> <Paragraph position="4"> tion that generated the dialogue in Figure 4.</Paragraph> <Paragraph position="5"> The user may end the dialogue at any point using the generated answer given the current state of the frames.</Paragraph> <Paragraph position="6"> Currently, the answer is simply composed of text passages from the zero-conflict frames. In addition, HITIQA will generate a &quot;headline&quot; for the text passages in the answer space. This is done using a combination of text templates and simple grammar rules applied to the attributes of the passage frame.</Paragraph> </Section> <Section position="7" start_page="4" end_page="5" type="metho"> <SectionTitle> 5 HITIQA Qualitative Evaluations </SectionTitle> <Paragraph position="0"> In order to assess our progress thus far, and also to develop metrics to guide future evaluation, we invited a group of analysts employed by the US government to participate in two three-day workshops, held in September and October 2003.</Paragraph> <Paragraph position="1"> The two basic objectives of the workshops were: 1.
To perform a realistic assessment of the usefulness and usability of HITIQA as an end-to-end system, from the information seeker's initial questions to completion of a draft report.</Paragraph> <Paragraph position="2"> 2. To develop metrics to compare the answers obtained by different analysts and evaluate the quality of the support that HITIQA provides.</Paragraph> <Paragraph position="3"> The analysts' primary task was preparation of reports in response to scenarios - complex questions that usually encompassed multiple sub-questions. The scenarios were developed in conjunction with several U.S. government offices. These scenarios, detailing information required for the final report, were not normally used directly as questions to HITIQA; instead, they were treated as a basis for issues possibly leading to a series of questions, as shown in Figure 1a.</Paragraph> <Paragraph position="4"> The results of these evaluations strongly validated our approach to analytical QA. At the same time, we learned a great deal about how analysts work, and about how to improve the interface.</Paragraph> <Paragraph position="5"> Analysts completed several questionnaires designed to assess their overall experience with the system. Many of the questions required the analysts to compare HITIQA to other tools they were currently using in their work. HITIQA scores were quite high, with a mean score of 3.73 out of 5. We scored particularly high in comparison to current analytic tools. We also asked the analysts to cross-evaluate their product reports obtained from interacting with HITIQA. Again, the results were quite good, with a mean answer quality score of 3.92 out of 5. While this evaluation was only preliminary, it nonetheless gave us confidence that our design is &quot;correct&quot; in a broad sense.</Paragraph> </Section> </Paper>