<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1041">
  <Title>Answer Extraction</Title>
  <Section position="3" start_page="0" end_page="297" type="metho">
    <SectionTitle>
2 The Question-Answering System
</SectionTitle>
    <Paragraph position="0"> The system takes a natural-language query as input and produces a list of answers ranked in order of confidence. The top five answers were submitted to the TREC evaluation.</Paragraph>
    <Paragraph position="1"> Queries are processed in two stages. In the information retrieval stage, the most promising passages of the most promising documents are retrieved. In the linguistic processing stage, potential answers are extracted from these passages and ranked.</Paragraph>
    <Paragraph position="2"> The system can be divided into five main components. The information retrieval stage consists of a  single component, passage retrieval, and the linguistic processing stage circumscribes four components: entity extraction, entity classification, query classification, and entity ranking.</Paragraph>
    <Paragraph position="3"> Passage Retrieval Identify relevant documents, and within relevant documents, identify the passages most likely to contain the answer to the question.</Paragraph>
    <Paragraph position="4"> Entity Extraction Extract a candidate set of possible answers from the passages.</Paragraph>
    <Paragraph position="5"> Entity Classification The candidate set is a list of entities falling into a number of categories, including people, locations, organizations, quantities, dates, and linear measures. In some cases (dates, quantities, linear measures), entity classification is a side effect of entity extraction, but in other cases (proper nouns, which may be people, locations, or organizations), there is a separate classification step after extraction.</Paragraph>
    <Paragraph position="6"> Query Classification Determine what category of entity the question is asking for. For example, if the query is Who is the author of the book, The Iron Lady: A Biography of Margaret Thatcher? the answer should be an entity of type Person.</Paragraph>
    <Paragraph position="7"> Entity Ranking Assign scores to entities, representing roughly belief that the entity is the correct answer. There are two components of the score. The most-significant bit is whether or not the category of the entity (as determined by entity classification) matches the category that the question is seeking (as determined by query classification). A finer-grained ranking is imposed on entities with the correct category, through the use of frequency and other information. null The following sections describe these five components in detail.</Paragraph>
    <Section position="1" start_page="296" end_page="296" type="sub_section">
      <SectionTitle>
2.1 Passage Retrieval
</SectionTitle>
      <Paragraph position="0"> The first step is to find passages likely to contain the answer to the query. We use a modified version of the SMART information retrieval system (Buckley and Lewit, 1985; Salton, 1971) to recover a set of documents which are relevant to the question. We define passages as overlapping sets consisting of a sentence and its two immediate neighbors. (Passages are in one-one correspondence with with sentences, and adjacent passages have two sentences in common.) The score for passage i was calculated as</Paragraph>
      <Paragraph position="2"> where Sj, the score for sentence j, is the sum of IDF weights of non-stop terms that it shares with the query, plus an additional bonus for pairs of words (bigrams) that the sentence and query have in common. null The top 50 passages are passed on as input to linguistic processing.</Paragraph>
    </Section>
    <Section position="2" start_page="296" end_page="296" type="sub_section">
      <SectionTitle>
2.2 Entity Extraction
</SectionTitle>
      <Paragraph position="0"> Entity extraction is done using the Cass partial parser (Abney, 1996). From the Cass output, we take dates, durations, linear measures, and quantities.</Paragraph>
      <Paragraph position="1"> In addition, we constructed specialized code for extracting proper names. The proper-name extractor essentially classifies capitalized words as intrinsically capitalized or not, where the alternatives to intrinsic capitalization are sentence-initial capitalization or capitalization in titles and headings. The extractor uses various heuristics, including whether the words under consideration appear unambiguously capitalized elsewhere in the document.</Paragraph>
    </Section>
    <Section position="3" start_page="296" end_page="297" type="sub_section">
      <SectionTitle>
2.3 Entity Classification
</SectionTitle>
      <Paragraph position="0"> The following types of entities were extracted as potential answers to queries.</Paragraph>
      <Paragraph position="1"> Person, Location, Organization, Other Proper names were classified into these categories using a classifier built using the method described in (Collins and Singer, 1999). 1 This is the only place where entity classification was actually done as a separate step from entity extraction.</Paragraph>
      <Paragraph position="2"> Dates Four-digit numbers starting with 1... or 20.. were taken to be years. Cass was used to extract more complex date expressions (such as Saturday, January 1st, 2000).</Paragraph>
      <Paragraph position="3"> Quantities Quantities include bare numbers and numeric expressions' like The Three Stooges, 4 1//2 quarts, 27~o. The head word of complex numeric expressions was identified (stooges, quarts or percent); these entities could then be later identified as good answers to How many questions such as How many stooges were there ? Durations, Linear Measures Durations and linear measures are essentially special cases of quantities, in which the head word is a time unit or a unit of linear measure. Examples of durations are three years, 6 1/2 hours. Examples of linear measures are 140 million miles, about 12 feet.</Paragraph>
      <Paragraph position="4"> We should note that this list does not exhaust the space of useful categories. Monetary amounts (e.g., ~The classifier makes a three way distinction between Person, Location and Organization; names where the classifier makes no decision were classified as Other Named E~tity.  $25 million) were added to the system shortly after the Trec run, but other gaps in coverage remain. We discuss this further in section 3.</Paragraph>
    </Section>
    <Section position="4" start_page="297" end_page="297" type="sub_section">
      <SectionTitle>
2.4 Query Classification
</SectionTitle>
      <Paragraph position="0"> This step involves processing the query to identify the category of answer the user is seeking. We parse the query, then use the following rules to determine the category of the desired answer:</Paragraph>
      <Paragraph position="2"> * How few, great, little, many, much -+ Quemtity. We also extract the head word of the How expression (e.g., stooges in how many stooges) for later comparison to the head word of candidate answers.</Paragraph>
      <Paragraph position="3"> * How long --+ Duration or Linear Measure.</Paragraph>
      <Paragraph position="4"> How tall, wide, high, big, far --+ Linear Measure.</Paragraph>
      <Paragraph position="5"> * The wh-words Which or What typically appear with a head noun that describes the category of entity involved. These questions fall into two formats: What X where X is the noun involved, and What is the ... X. Here are a couple of examples: What company is the largest Japanese ship builder? What is the largest city in Germany? For these queries the head noun (e.g., company or city) is extracted, and a lexicon mapping nouns to categories is used to identify the category of the query. The lexicon was partly hand-built (including some common cases such as number --+ Quantity or year --~ Date). A large list of nouns indicating Person, Location or Organization categories was automatically taken from the contextual (appositive) cues learned in the named entity classifier described in (Collins and Singer, 1999).</Paragraph>
      <Paragraph position="6"> * In queries containing no wh-word (e.g., Name the largest city in Germany), the first noun phrase that is an immediate constituent of the matrix sentence is extracted, and its head is used to determine query category, as for What X questions.</Paragraph>
      <Paragraph position="7"> * Otherwise, the category is the wildcard Any.</Paragraph>
    </Section>
    <Section position="5" start_page="297" end_page="297" type="sub_section">
      <SectionTitle>
2.5 Entity Ranking
</SectionTitle>
      <Paragraph position="0"> Entity scores have two components. The first, mostsignificant, component is whether or not the entity's category matches the query's category. (If the query category is Any, all entities match it.) In most cases, the matching is boolean: either an entity has the correct category or not. However, there are a couple of special cases where finer distinctions are made. If a question is of the Date type, and the query contains one of the words day or month, then &amp;quot;full&amp;quot; dates are ranked above years. Conversely, if the query contains the word year, then years are ranked above full dates. In How many X questions (where X is a noun), quantified phrases whose head noun is also X are ranked above bare numbers or other quantified phrases: for example, in the query How many lives were lost in the Lockerbie air crash, entities such as 270 lives or almost 300 lives would be ranked above entities such as 200 pumpkins or 150. 2 The second component of the entity score is based on the frequency and position of occurrences of a given entity within the retrieved passages. Each occurrence of an entity in a top-ranked passage counts 10 points, and each occurrence of an entity in any other passage counts 1 point. (&amp;quot;Top-ranked passage&amp;quot; means the passage or passages that received the maximal score from the passage retrieval component.) This score component is used as a secondary sort key, to impose a ranking on entities that are not distinguished by the first score component.</Paragraph>
      <Paragraph position="1"> In counting occurrences of entities, it is necessary to decide whether or not two occurrences are tokens of the same entity or different entities. To this end, we do some normalization of entities. Dates are mapped to the format year-month-day: that is, last Tuesday, November 9, 1999 and 11/9/99 are both mapped to the normal form 1999 Nov 9 before frequencies are counted. Person names axe aliased based on the final word they contain. For example, Jackson and Michael Jackson are both mapped to the normal form Jackson. a</Paragraph>
    </Section>
  </Section>
</Paper>