<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1021">
  <Title>Evaluating Question-Answering Techniques in Chinese</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
TEMPLATE | QUESTION TYPE | TRANSLATION
&amp;quot; &amp;quot; | PERSON | which person
&amp;quot; &amp;quot; | LOCATION | which city
</SectionTitle>
    <Paragraph position="0"> (2) Question words are removed from the query. This is a form of &amp;quot;stop word&amp;quot; removal. Words like &amp;quot; &amp;quot; (which person) are removed from the query since they are unlikely to occur in relevant text.</Paragraph>
    <Paragraph position="1"> (3) Named entities in the query are marked up using BBN's IdentiFinder system. A named entity is kept as a word after segmentation.</Paragraph>
    <Paragraph position="2"> (5) The query is segmented to identify Chinese words.</Paragraph>
    <Paragraph position="3"> (6) Stop words are removed.</Paragraph>
    <Paragraph position="4"> (7) The query is formulated for the Hanquery search engine.  Hanquery is the Chinese version of Inquery (Broglio, Callan and Croft, 1996) and uses the Inquery query language that supports the specification of a variety of evidence combination methods. To support question answering, documents containing most of the query words were strongly preferred. If the number of query words left after the previous steps is greater than 4, then the operator #and (a probabilistic AND) is used. Otherwise, the probabilistic passage operator #UWn (unordered window) is used. The parameter n is set to twice the number of words in the query.</Paragraph>
    <Paragraph position="5"> Hanquery is used to retrieve the top 10 ranked documents. The answer extraction module then goes through the following steps:  (8) IdentiFinder is used to mark up named entities in the documents.</Paragraph>
    <Paragraph position="6"> (9) Passages are constructed from document sentences. We used passages based on sentence pairs, with a 1-sentence overlap.</Paragraph>
    <Paragraph position="7"> (10) Scores are calculated for each passage. The score is based on five heuristics: * First Rule: Assign 0 to a passage if no expected name entity is present. * Second Rule:  Calculate the number of match words in a passage. Assign 0 to the passage if the number of matching words is less than the threshold. Otherwise, the score of this passage is equal to the number of matching words (count_m). The threshold is defined as follows:</Paragraph>
    <Paragraph position="9"> count_q is the number of words in the query.</Paragraph>
    <Paragraph position="10"> * Third Rule: Add 0.5 to score if all matching words are within one sentence.</Paragraph>
    <Paragraph position="11"> * Fourth Rule: Add 0.5 to score if all matching words are in the same order  as they are in the original question.  * Fifth Rule: score = score + count_m/(size of matching window) (11) Pick the best passage for each document and rank them. (12) Extract the answer from the top passage:  Find all candidates according to the question type. For example, if the question type is LOCATION, then each location marked by IdentiFinder is an answer candidate. An answer candidate is removed if it appears in the original question. If no candidate answer is found, no answer is returned.</Paragraph>
    <Paragraph position="12"> Calculate the average distance between an answer candidate and the location of each matching word in the passage. Pick the answer candidate that has the smallest average distance as the final answer.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. Evaluating the System
</SectionTitle>
    <Paragraph position="0"> We used 51 queries to do the initial evaluation of the question-answering system. We selected 26 queries from 240 questions collected from Chinese students in our department, because only these had answers in the test collection. The other 25 queries were constructed by either reformulating a question or asking a slightly different question. For example, given the question &amp;quot;which city is the biggest city in China?&amp;quot; we also generated the questions &amp;quot;where is the biggest city in China?&amp;quot; and &amp;quot;which city is the biggest city in the world?&amp;quot;.</Paragraph>
    <Paragraph position="1"> The results for these queries were evaluated in a similar, but not identical way to the TREC question-answering track.</Paragraph>
    <Paragraph position="2"> An &amp;quot;answer&amp;quot; in this system corresponds to the 50 byte responses in TREC and passages are approximately equivalent to the 250 byte TREC responses.</Paragraph>
    <Paragraph position="3"> For 33 of 51 queries, the system suggested answers. 24 of the 33 were correct. For these 24, the &amp;quot;reciprocal rank&amp;quot; is 1, since only the top ranked passage is used to extract answers.</Paragraph>
    <Paragraph position="4"> Restricting the answer extraction to the top ranked passage also means that the other 27 queries have reciprocal rank values of 0. In TREC, the reciprocal ranks are calculated using the highest rank of the correct answer (up to 5). In our case, using only the top passage means that the mean reciprocal rank of 0.47 is a lower bound for the result of the 50 byte task.</Paragraph>
    <Paragraph position="5"> As an example, the question &amp;quot; &amp;quot; (Which city is the biggest city in China?), the answer returned is (Shanghai). In the top ranked passage, &amp;quot;China&amp;quot; and &amp;quot;Shanghai&amp;quot; are the two answer candidates that have the smallest distances. &amp;quot;Shanghai&amp;quot; is chosen as the final answer since &amp;quot;China&amp;quot; appears in the original question. As an example of an incorrect response, the question &amp;quot; &amp;quot; (In which year did Jun Xie defeat a Russian player and win the world chess championship for the first time?) produced an answer of (today). There were two candidate answers in the top passage, &amp;quot;October 18&amp;quot; and &amp;quot;today&amp;quot;. Both were marked as DATE by Identifinder, but &amp;quot;today&amp;quot; was closer to the matching words. This indicates the need for more date normalization and better entity classification in the system.</Paragraph>
    <Paragraph position="6"> For 44 queries, the correct answer was found in the top-ranked passage. Even if the other queries are given a reciprocal rank of 0, this gives a mean reciprocal rank of 0.86 for a task similar to the 250 byte TREC task. In fact, the correct answer for 4 other queries was found in the top 5 passages, so the mean reciprocal rank would be somewhat higher. For 2 of the remaining 3 queries, Hanquery did not retrieve a document in the top 10 that contained an answer, so answer extraction could not work.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. Further Improvements
</SectionTitle>
    <Paragraph position="0"> These results, although preliminary, are promising. We have made a number of improvements in the new version (v2) of the system. Some of these are described in this section.</Paragraph>
    <Paragraph position="1"> One of the changes is designed to improve the system's ability to extract answers for the questions that ask for a number. A number recognizer was developed to recognize numbers in Chinese documents. The numbers here are numbers other than DATE, MONEY and PERCENTAGE that are recognized by IdentiFinder. The version of IdentiFinder used in our system can only mark up seven types of name entities and this limits the system's ability to answer other types of questions. The number recognizer is the first example of the type of refinement to named entity recognition that must be done for better performance.</Paragraph>
    <Paragraph position="2"> An example of a question requiring a numeric answer is: &amp;quot; ? (What is the number of Clinton's presidency?)&amp;quot;. This question could be answered in Marsha v2 by extracting the marked up number from the best passage in the answer extraction part, while Marsha v1 could only return the top 5 passages that were likely to have the answer to this question.</Paragraph>
    <Paragraph position="3"> Another improvement relates to the best matching window of a passage. The size of the matching window in each passage is an important part of calculating the belief score for the passage. Locating the best matching window is also important in the answer-extraction processing because the final answer picked is the candidate that has the smallest average distance from the matching window. The best matching window of a passage here is the window that has the most query words in it and has the smallest window size. In the previous version of our system, we only consider the first occurrence of each query word in a passage and index the position accordingly. The matching window is thus from the word of the smallest index to the word of the largest index in the passage. It is only a rough approximation of the best matching window though it works well for many of the passages. In the second version of Marsha, we developed a more accurate algorithm to locate the best matching window of each passage. This change helped Marsha v2 find correct answers for some questions that previously failed. The following is an example of such a question.</Paragraph>
    <Paragraph position="4"> For the question &amp;quot; ? (How many people in the United States are below the poverty line?)&amp;quot; The best passage is as follows:</Paragraph>
    <Paragraph position="6"> This passage has two occurrences of query word &amp;quot; &amp;quot;.</Paragraph>
    <Paragraph position="7"> In v1, the first occurrence of &amp;quot; &amp;quot; is treated as the start of the matching window, whereas the second occurrence is actually the start of the best matching window. There are two numbers &amp;quot; &amp;quot; (more than 2 million) and &amp;quot; &amp;quot; (33.585 million) in the passage. The right answer &amp;quot; &amp;quot; (33.585 million) is nearer to the best matching window and &amp;quot; &amp;quot; (more than 2 million) is nearer to the estimated matching window.</Paragraph>
    <Paragraph position="8"> Therefore, the right answer can be extracted after correctly locating the best matching window.</Paragraph>
    <Paragraph position="9"> The third improvement is with the scoring strategies of passages. Based on the observation that the size of the best matching window of a passage plays a more important role than the order of the query words in a passage, we adjusted the score bonus for same order satisfaction from 0.5 to 0.05. This adjustment makes a passage with a smaller matching window get a higher belief score than a passage that satisfies the same order of query words but has a bigger matching window. As an example, consider the question: &amp;quot; ? (Who was the first president in the United States?)&amp;quot;.</Paragraph>
    <Paragraph position="10"> Passage 1 is the passage that has the right answer &amp;quot;  Passage 1 and Passage 2 both have all query words. The size of the best matching window in Passage 1 is smaller than that in Passage 2 while query words in Passage 2 have the same order as that in the question. The scoring strategy in Marsha v2 selects Passage 1 and extracts the correct answer while Marsha v1 selected Passage 2.</Paragraph>
    <Paragraph position="11"> Special processing of ordinals has also been considered in Marsha v2. Ordinals in Chinese usually start with the Chinese character &amp;quot; &amp;quot; and are followed by a cardinal. It is better to retain ordinals as single words during the query generation in order to retrieve better relevant documents. However, the cardinals (part of the ordinals in Chinese) in a passage are marked up by the number recognizer for they might be answer candidates for questions asking for a number. Thus ordinals in Chinese need special care in a QA system. In Marsha v2, ordinals appearing in a question are first retained as single words for the purpose of generating a good query and then separated in the post processing after relevant documents are retrieved to avoid answer candidates being ignored.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. Comparison with English Question Answering Systems
</SectionTitle>
      <Paragraph position="0"> Some techniques used in Marsha are similar to the techniques in English question answering systems developed by other researchers. The template matching in Marsha for deciding the type of expected answer for a question is basically the same as the one used in the GuruQA (Prager et al., 2000) except that the templates consist of Chinese word patterns instead of English word patterns. Marsha has the ability of providing answers to eight types of questions: PERSON, LOCATION, ORGANIZATION, DATE, TIME, MONEY, PERCENTAGE, and NUMBER. The first seven types correspond to the named entities from IdentiFinder developed by BBN. We developed a Chinese numberrecognizer ourselves which marks up numbers in the passages as answer candidates for questions asking for a number. The number could be represented as a digit number or Chinese characters. David A. Hull used a proper name tagger ThingFinder developed at Xerox in his question answering system. Five of the answer types correspond to the types of proper names from ThingFinder (Hull, 1999). The scoring strategy in Marsha is similar to the computation of score for an answer window in the LASSO QA system (Moldovan et al., 1999) in terms of the factors considered in the computation. Factors such as the number of matching words in the passage, whether all matching words in the same sentence, and whether the matching words in the passage have the same order as they are in the question are common to LASSO and Marsha.</Paragraph>
      <Paragraph position="1"> We have also implemented an English language version of Marsha. The system implements the answer classes PERSON, ORGANIZATION, LOCATION, and DATE.</Paragraph>
      <Paragraph position="2"> Queries are generated in the same fashion as Marsha. If there are any phrases in the input query (named entities from IdentiFinder, quoted strings) these are added to an Inquery query in a #N operator all inside a #sum operator. For example: Question: &amp;quot;Who is the author of &amp;quot;Bad Bad Leroy Brown&amp;quot; Inquery query: #sum( #uw8(author Bad Bad Leroy Brown) #6(Bad Bad Leroy Brown)) Where N is number of terms + 1 for named entities, and number of terms + 2 for quoted phrases. If a query retrieves no documents, a &amp;quot;back off&amp;quot; query uses #sum over the query terms, with phrases dropped. The above would become #sum(author Bad Bad Leroy Brown).</Paragraph>
      <Paragraph position="3"> The system was tested against the TREC9 question answering evaluation questions. The mean reciprocal rank over 682/693 questions was 0.300 with 396 questions going unanswered. The U.Mass. TREC9 (250 byte) run had a score of 0.367. Considering only the document retrieval, we find a document containing an answer for 471 of the questions, compared to 477 for the official TREC9 run which used expanded queries. This indicates that the Marsha heuristics have applicability to the English question answering task and are not limited to the Chinese question answering task.</Paragraph>
  </Section>
</Paper>