<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1074">
  <Title>The Use of Dynamic Segment Scoring for Language-Independent Question Answering</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. SYSTEM DESCRIPTION
</SectionTitle>
    <Paragraph position="0"> In this section we present the system architecture of the proposed Q/A system and describe its components in detail. The system contains five different modules as shown in Figure 1. The top module is responsible for translating input queries and a set of documents to a common language. The common coalition language system developed at MIT Lincoln Laboratory (CCLINC)[8] performs the translation tasks. For the work reported here, we assume that queries are in English, documents are in either English or Korean, and answers are returned in English. Our focus in this paper is on the four modules between the two translation modules (modules contained in the box with a dotted line) in Figure 1.</Paragraph>
    <Paragraph position="1"> The Query Processing module and the Data Processing module use natural language processing techniques such as parsing, morphological stemming and part of speech and concept tagging for word sense disambiguation to extract critical query and document information. In addition, the Query Processing module categorizes queries and assigns appropriate answer concepts associated with each query. In the next two modules, candidate segments with optimal matching scores of keywords and answer concepts are extracted using dynamic sliding windowing techniques. The candidate segments are then further analyzed based on the similarities of proximity distributions of search keywords and rank ordered.</Paragraph>
    <Paragraph position="2"> A case example, a query and a document segment from the TREC-8 official data, is used throughout this section to illustrate functions of the four processing modules. Our illustration starts with the following query entering the Query Processing module.</Paragraph>
    <Paragraph position="3"> Query: In what year did Joe DiMaggio compile his 56-game hitting streak? Several processes take place within the Query Processing module: a preprocessing unit removes punctuation marks and extra spaces; a trained Brill tagger[1] tags each word with corresponding part of speech tags; a set of morphological rules and a concept trained Brill tagger convert words into their root forms and determine answer concepts; a proximity indexing unit records the keyword positions in queries; and a query identification/post processing unit removes stop words and formats the output, as shown below.</Paragraph>
    <Paragraph position="4"> Output of the Query Processing module: Question Special 101</Paragraph>
    <Paragraph position="6"> The output contains critical query information including answer concepts which are identified by categorizing queries using a method similar in spirit to extracting named entities[5, 4], named focuses[2], and question-answer tokens[3]. Each stemmed keyword is tagged with a POS tag, a concept tag, and an index number. The POS tags are used to discriminate search terms by assigning different weights, the concept tags are used to identify answer concepts, and the index numbers are used to compute proximity values between terms for matching.</Paragraph>
    <Paragraph position="7"> Documents, represented with symbol B in Figure 1, go through a similar procedure in the Data Processing Module as did a query in the Query Processing Module. Due to the large data size of the document collection, the documents are processed off line. The input and the output of the module for an example document segment is shown in Figure 2. The output of the data processing module is processed documents with stemmed words and their associated concepts, represented with symbol D in Figure 1.</Paragraph>
    <Paragraph position="8"> The Extraction of Candidate Segments module selects candidate segments that contain answers. The size of each candidate segment is determined by a dynamic sliding window, which uses an iterative procedure to maximize the score of a segment as its size changes.</Paragraph>
    <Paragraph position="9"> To ensure the optimal segmentation of a document, adjacent segments are overlapped while the size of the window can vary from one sentence to tens of sentences, as shown in Figure 3. To determine the optimal size for a current sliding window, the score for an initial window with one sentence is compared to scores corre-</Paragraph>
    <Paragraph position="11"> techniques: Three adjacent optimally formulated windows are shown. The top window segment with four sentences contains the query concept &amp;quot;TIME&amp;quot; and matching word &amp;quot;joe.&amp;quot; The second window with five sentences contains the query concept and six keywords. The last window with two sentences contains the query concept and five keywords.</Paragraph>
    <Paragraph position="12"> sponding to windows with increasing number of sentences. The scoring criteria is based on appearances of answer concepts and query keywords in candidate segments. Weighted scores are assigned to keywords in segments; the contribution of a match varies according to the query keyword's part of speech tag. Specifically, the score for a match decreases according to the following priority list in the order shown: (1) answer concept, (2) quoted keyword, (3) proper noun keyword, (4) noun keyword, and (5) all other keyword.</Paragraph>
    <Paragraph position="13"> Figure 3 shows an example case of using the dynamic sliding window technique. In this figure, the darkened window contains the answer to the example query, 1941. Optimally sized windows form candidate segments that are rank ordered based on their scores.</Paragraph>
    <Paragraph position="14"> Currently, we select and send top 200 segments per query (symbol E in Figure 1) to the Final Answer Formulation module.</Paragraph>
    <Paragraph position="15"> The Final Answer Formulation module takes an advantage of the keyword proximity distributions in queries and the corresponding statistical keyword distributions in candidate segments to further distinguish segments with high likelihoods of containing answers from those that merely contain search terms and query concepts.</Paragraph>
    <Paragraph position="16"> The module creates a list of proximity distributions from a keyword to the rest of keywords as shown in Figure 4. In this figure, the left hand column shows the distance distributions from a query key-word to the rest of query keywords. The index numbers for query keywords are used here to compute the distributions. The right column shows the corresponding distance distributions in a candidate segment. Once the distributions are available, the job of the Final Answer Formulation module is to search for candidate segments with similar keyword proximity distributions to those appeared in queries. By distance, we mean the word counts that separate two  a query and a candidate segment keywords.</Paragraph>
    <Paragraph position="17"> Recall the format of the output from the query processing module. Using the differences between index numbers to specify physical distance relationships among query keywords, we can compute the corresponding proximity distributions of keywords in candidate segments. We create a list of distributions by computing proximity distances from a keyword to the rest of keywords.</Paragraph>
    <Paragraph position="18">  The distance values grow from 2 for keyword joe to 8 for keyword streak. The solid line shows the distance distribution of the same keywords appearing in a candidate segment. The numbers vary from 6 for keyword joe to 11 for keyword streak. The pattern of gradual increase, however, in both lines indicates a similarity between the two distributions. The break in the solid line is caused by the missing term, compile, in the candidate segment. Frame (b) again shows the proximity distributions from keyword 56-game to the rest of keywords in the query and the candidate segment.</Paragraph>
    <Paragraph position="19"> The distance values for the candidate segment are 9, 3, 2, 1, and 2 while the corresponding distances in the query are 6, 4, 3, 1, and 2. Note that the last two data points are identical for both distributions. Again, we find a similar distribution pattern in both the query and the candidate segment. The similarities between the variances of the distributions in both a query and a candidate segment determine the likelihood of the particular segment containing an answer to the query. Table 1 shows the actual distance differences between keywords in the query and the candidate segment. Key-words year, joe, dimaggio, compile, 56-game, hit, and streak are represented by I, II, III, IV, V, VI, and VII, respectively. For each pair in the table, the first number represents the distance between the corresponding keywords (row/column) in the query while the second number shows the distance between the same keywords in the candidate segment. Blanks represent that distances can not be computed because the particular keyword pair could not be found in the candidate segment.</Paragraph>
    <Paragraph position="20"> The similarities between the variances of the distributions in both a query and a candidate segment determine the likelihood of the particular segment containing an answer to the query. For the experiments, we used a simplified version of the distribution matching where only adjacent query term distances were compared.</Paragraph>
    <Paragraph position="21"> The equation for assigning a final score for each candidate segment is as follows.</Paragraph>
    <Paragraph position="22">  number of term pairs in query  number of term pairs processed in query number of term pairs in query where symbol max is a normalization factor and symbol diff is the proximity difference between a query and a candidate segment for a given pair of keywords. Symbol std is the standard deviation of the distance values between two keywords in the candidate segments.</Paragraph>
    <Paragraph position="23"> The standard deviation term helps further differentiate scoring between a common pair and pairs which do not appear often.</Paragraph>
    <Paragraph position="24"> Once all candidate segments are scored, the top five1 segments are selected based on their final scores: a segment with the minimum length was chosen in cases when scores for multiple segments are equal. The top segment for the example candidate at this point is They wanted something about Joe. One day, though, someone ran a different notion by Dom: A book about 1941. If ever the major leagues had a magical, almost mythic year, it was 1941. There was Joe Dimaggio's 56-game hitting streak.</Paragraph>
    <Paragraph position="25"> The selected segments are then sent to the final answer framing stage where only the corresponding keywords matching desired question concepts are extracted. The final answer for the example query is &amp;quot;1941&amp;quot; which had associated concept tag &amp;quot;TIME.&amp;quot; This answer is the output fed into the translation module, if necessary, shown as symbol F in Figure 1. Presently, our system does not perform the final answer framing process using the concept tags. The system simply applys a set of rules to remove stop words to reduce the final answer size.</Paragraph>
  </Section>
class="xml-element"></Paper>