<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0505">
  <Title>BioGrapher: Biography Questions as a Restricted Domain Question Answering Task</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 A Naïve Baseline
</SectionTitle>
    <Paragraph position="0"> In this section we describe our baseline QA system.</Paragraph>
    <Paragraph position="1"> This system was used at TREC 2003, to produce answers to so-called person definition questions (Jijkoun et al., 2004; Voorhees, 2004). We present the results and give a short analysis of the system's performance; as we will see, this provides further motivation for the use of text classification for identifying biography-like documents.</Paragraph>
    <Paragraph position="2"> Definition questions at TREC 2003 The QA track at TREC 2003 featured a subtask devoted to definition questions. The latter came in three flavors: person definitions (e.g., Who is Colin Powell?), organization definitions (e.g., What is the U.N.?), and concept definitions (e.g., What is goth?). Here, we are only interested person definitions. null In response to a definition question, systems had to return an unordered set of snippets; each snippet was supposed to be a facet in the definition of the target. There were no limits placed on either the length of an individual answer string or on the number of snippets in the list, although systems were penalized for retrieving extraneous information.</Paragraph>
    <Paragraph position="3"> As our primary strategy for handling person definition questions, we consulted external resources. The main resource used is biography.com.</Paragraph>
    <Paragraph position="4"> While such resources contain biographies of many historical and well-known people, they often lack biographies of contemporary people that are not too well-known. To be able to deal with such cases we backed-off to using a web search engine (Google), and applied a na&amp;quot;ive heuristic approach. We hand-crafted a set of features (such as &amp;quot;born&amp;quot;, &amp;quot;graduated&amp;quot;, &amp;quot;suffered&amp;quot;, etc.) that we felt would trigger for biography-like snippets. Various subsets of the large feature set, together with the target of the definition question, were combined to form queries for the web search engine.</Paragraph>
    <Paragraph position="5"> Given a set of candidate answer snippets, we performed two filtering steps before presenting the final answer: we separated non-relevant snippets from valuable snippets and we identified semanticallyclose snippets. We addressed the first step by analyzing the distances between query terms submitted to the search engine and the sets of features, and by means of shallow syntactic aspects of the different features such as sentence boundaries. To address the second step we developed a snippet similarity metric based on stemming, stopword removal and keyword overlap by sorting and calculating the Levenshtein distance measure of similarity.1. An example of the snippets filtering can be found in Table 1. The table presents 3 of the returned snippets for the question Who is Sir John Hale?. The first and third snippet are filtered out, the first one for non-relevancy and the third for its semantic similarity with the second, shorter, snippet.</Paragraph>
    <Paragraph position="6"> 1The Levenshtein measure is a measure of the similarity between two strings, which are refered to as the source string s and the target string t. The distance is the number of deletions, insertions, or substitutions required to transform s into t  1 Sir Malcolm Bradbury (writer/teacher) Dead.</Paragraph>
    <Paragraph position="7"> Heart trouble. . . . Heywood Hale Broun (commentator, writer ) - Dead. John Brunner (author) Dead. Stroke. . . . Description: Debunks the rumor of his death. . .</Paragraph>
    <Paragraph position="8"> 2 . . . Professor Sir John Hale woke up, had . . . For her birthday in 1996, he wrote on the . . . John Hale died in his sleep - possibly following another stroke . . .</Paragraph>
    <Paragraph position="9"> 3 Observer On 29 July 1992, Professor Sir John Hale woke up . . . her birthday in 1996, he wrote on the . . . John Hale died in his sleep - possibly following another stroke.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Evaluation
</SectionTitle>
      <Paragraph position="0"> Evaluation of individual person definition questions was done using the F-measure: F = (b2 + 1)P * R)/(b2P + R), where P is precision (to be defined shortly), R is recall (to be defined shortly), and b was set to 5, indicating that precision was more important than recall. Length was used as a crude approximation to precision; it gives a system an allowance of 100 (non-white-space) characters for each correct snippet it retrieved. The precision score is set to one if the response is no longer than this allowance, otherwise it is downgraded using the function P = 1[?]((length [?] allowance)/length).</Paragraph>
      <Paragraph position="1"> As to recall, for each question, the TREC assessors marked some snippets as vital and the remainder as non-vital. The non-vital snippets act as &amp;quot;don't care&amp;quot; condition. That is, systems should be penalized for not retrieving vital nuggets, and for retrieving snippets that are not in the assessors' snippet lists at all, but should be neither penalized nor rewarded for returning a non-vital snippet. To implement the &amp;quot;don't care&amp;quot; condition, snippet recall is computed only over vital snippets (Voorhees, 2004).</Paragraph>
      <Paragraph position="2"> In total, 30 person definition questions were evaluated at the TREC 2003 QA track. The overall F score of a run was obtained by averaging over all the individual questions.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Results and Analysis
</SectionTitle>
      <Paragraph position="0"> The F score obtained by the naive system described in this section, on the TREC 2003 person definition questions, was 0.392. An analysis of the results shows that, for questions that could be answered from external biography resources, the baseline system obtains an F score of 0.586.</Paragraph>
      <Paragraph position="1"> In post-submission experiments we changed the subsets of features we use in the queries sent to Google as well as the number of queries/subsets we use. The snippet similarity threshold was also tuned in order to filter out more snippets. This resulted in fewer unanswered questions, while the average answer length was decreased as well, by close to 50%.</Paragraph>
      <Paragraph position="2"> All in all, an informal evaluation showed increase in recall, precision and in the overall F score.</Paragraph>
      <Paragraph position="3"> From our experience with our baseline system we learned the following important lesson: having a (relatively) small number of high quality biography sources as a basis for each question's answer is far better than using a broad and large variety of snippets returned by a web search engine. While extending available biography resources so as to seriously boost their coverage is not a feasible option, we want to do the next best thing: make sure we identify good biography-like documents online, so that we can use these to mine snippets from; to this end we will use text classification.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Preparing for Text Classification
</SectionTitle>
    <Paragraph position="0"> In the previous section we suggested that using a text classifier might improve the performance of biography QA. Using text classifiers, we aim to identify biography-like documents from which we can extract answers. In this section we detail the document representations on which we will operate.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Document and Text Representation
</SectionTitle>
      <Paragraph position="0"> Text classifiers represent a document as a set of features d = {f1, f2,. . . , fn} where n is the number of active features, that is, features that occur in the document. A feature f can be a word, a set of words, a stem or any phrasal structure, depending on its text representation. Each feature has a weight, usually representing the number of occurrences of this feature in the document.</Paragraph>
      <Paragraph position="1"> What is a suitable abstract representation of documents for our biography domain? We have defined 7 clusters, groups of words (terms/tokens) with a high degree of pair-wise semantic relatedness. Each cluster has a meta-tag symbol (as can be seen in Table 2) and all occurrences of members of a cluster were substituted by the cluster's meta-tag. An example of a document abstraction can be found in Table 3. This abstraction captures typical similarities between biographical strings; e.g., for the two sentences John Kennedy was born in 1917 and William Shakespeare was born in 1564 we get the same abstraction &lt;NAME&gt;&lt;NAME&gt; was born in &lt;YEAR&gt;.</Paragraph>
      <Paragraph position="2"> It is worth noting that some of the clusters, such as &lt;CAP&gt; and &lt;PLACE&gt;, &lt;CAP&gt; and &lt;PN&gt; and others may overlap. Looking at the example in Table 3, we see that Abbey was born in Chicago, Illinois, but the automatic abstractor misinterpreted the token &amp;quot;Ill.,&amp;quot; marking it is &lt;CAP&gt; for  capitalized (possibly meaningful) word, but not as &lt;NAME&gt; the name of the subject of the biography &lt;YEAR&gt; four digits surrounded by white space, probably a year &lt;D&gt; sequence of number of digits other than four digits, can be part of a date, age etc.</Paragraph>
      <Paragraph position="3"> &lt;CAP&gt; a capitalized word in the middle of a sentence that wasn't substituted by any other tag &lt;PN&gt; a proper name that is not the subject of the biography It substitutes any name out of a list of thousand names &lt;PLACE&gt; denotes a name of a place, city or country out of a list of more than thousand places &lt;MONTH&gt; denotes one of the twelve months &lt;NOM&gt; denotes a nominative  straction &lt;PLACE&gt;. A similar thing happens with the name &amp;quot;Wooldridge&amp;quot; that is not very common; it should have been &lt;PN&gt; instead of &lt;CAP&gt;.</Paragraph>
      <Paragraph position="4"> All procedures described below are preformed on abstract-meta-tagged documents.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Identifying Biography Documents
</SectionTitle>
    <Paragraph position="0"> Given a document, the task of a biography classifier is to decide whether a given document is a biography or not. In this section we address the problem of acquiring biography classifiers by training machine learning algorithms on data. We present two biography classification algorithms: a naive classifier based on Ripper (Cohen and Singer, 1996), and another based on SVM (Joachims, 1998). The two methods differ radically both in the way they represent the training data (i.e., document representation), and in their learning approaches. The naive classifier is obtained by a repetitive rule learning al-Original Lincoln, Abbey (b. Anna Marie Wooldridge) 1930 - Jazz singer, composer/arranger, movie actress; born in Chicago, Ill. While a teenager she sang at school and church functions and then toured locally with a dance band.</Paragraph>
    <Paragraph position="1"> Abstraction &lt;NAME&gt;, &lt;NAME&gt; ( b . &lt;PN&gt;&lt;CAP&gt; &lt;CAP&gt; ) &lt;YEAR&gt; - &lt;CAP&gt; singer , composer/arranger , movie actress ; born in &lt;PLACE&gt; &lt;CAP&gt; . While a teenager &lt;NOM&gt; sang at school and church functions and then toured locally with a dance band .</Paragraph>
    <Paragraph position="2">  gorithm. We modified this algorithm to specifically fit the task of identifying biographies. The SVM learns &amp;quot;linear decision boundaries&amp;quot; between the different classes. We employ here the implementation of SVMs by (Joachims, 1998). Next we discuss the details of how each algorithm was used for learning a biography classifier.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Naive Classifier
</SectionTitle>
      <Paragraph position="0"> We employ this algorithm for its simplicity and scalability. This algorithm learns user-friendly rules, i.e., human-readable conjunctions of propositions, which can be converted to queries for a Boolean search engine. Furthermore, it is known to exhibit relatively good results across a wide variety of classfication problems, including tasks that involve large collections of noisy data, similar to the large document collections that we face in definitional QA.</Paragraph>
      <Paragraph position="1"> The naive classifier consists of two main stages: (1) Rules building. This is similar to Ripper's first stage of building an initial rule set. Our algorithm deviates from standard implementations of Ripper in that the terms that serve as the literals in the rules are n-grams of various lengths. We feel that n-grams, as opposed to individual literals (as in (Cohen and Singer, 1996), better capture contextual effects, which could be crucial in text classification.</Paragraph>
      <Paragraph position="2"> Our learner learns the rules as follows. The set of k-most frequent n-grams representing the training documents is split into two frequency-ordered lists: TLP (term-list-positive) containing the positive example set and TLN (term-list-negative) containing the negative examples set. The vector vectorw is initialized to be TLP/(TLP [?] TLN), i.e., the most frequent n-grams extracted from the positive set that are not top frequent in the negative set.</Paragraph>
      <Paragraph position="3"> (2) Rule optimization. Instead of Ripper's rule pruning stage, our algorithm assigns a weight to each rule/n-gram r in the rules vector according to the formula g(n)*f(r)C , where g(n) is an increasing function in the length of the n-gram (longer n-grams receive higher weights), f(r) is the ratio of the frequency of r in the positive examples to its frequency in the negative examples, and C is the size of the training set. The normalization by C is merely for the purpose of tracking variations of the weights in different sizes of training sets. The preference for longer n-grams can be justified by the intuition that longer n-grams are more informative as they stand for stricter contexts. For example, the string &amp;quot;(&lt;NAME&gt; , &lt;NAME&gt; born in &lt;YEAR&gt;&amp;quot; seems more informative than the shorter string in &lt;YEAR&gt;).</Paragraph>
      <Paragraph position="4"> Training material. The corpus we used as our training set is a collection of 350 biographies. Most of the biographies were randomly sampled from biography.com, while the rest were collected from the web. About 130 documents from the New York Times (NYT) 2000 collection were randomly selected as negative example set. The volumes of the positive and negative sets are equal.</Paragraph>
      <Paragraph position="5"> Various considerations played a role in building this corpus. The biographies from biography.</Paragraph>
      <Paragraph position="6"> com are 'clean' in the sense that all of them were written as biographies. To enable the learning of features of informal biographies, some other 'noisy' biographies such as biography-like newspaper reviews were added. Furthermore, a small number of different biographies of the same person were manually added in order to enforce variation in style.</Paragraph>
      <Paragraph position="7"> We also added a small number of biographies from other different sources to avoid any bias towards the biography.com domain.</Paragraph>
      <Paragraph position="8"> Validation and tuning. We tuned the naive algorithm on a separate validation set of documents. The validation set was collected in the same way as the training set. It contained 60 biographies, of which 40 were randomly sampled from biography.</Paragraph>
      <Paragraph position="9"> com, 10 'clean' biographies were collected from various online sources, 10 other documents were noisy biographies such as newspaper reviews. In addition, another 40 non-biographical documents were randomly retrieved from the web.</Paragraph>
      <Paragraph position="10"> The vector vectorw is now used to rank the documents of the validation set V in order to set a threshold th that minimizes the false-positive and the falsenegative errors. Each document dj [?] V in the validation set is represented by a vector vectorx, where xi counts the occurrences of wi in dj. The score of the document is the normalized inner product of vectorx and vectorw given by the function score(dj) = vectorx*vectorwlength(dj). In the validation stage some heuristic modifications were applied by the algorithm. For example, when the person name tag is absent, the document gets the score of zero even though other parameters of the vector may be present. We also normalized document scores by document length.</Paragraph>
      <Paragraph position="11"> Support Vector Machines (SVMs) Now we describe the learning of a biography classifier using SVMs. Unlike many other classifiers, SVMs are capable of learning classification even of non-linearly-separable classes. It is assumed that classes that are non-linearly separable in one dimension may be linearly separable in higher dimension. SVMs offer two important advantages for text classification (Joachims, 1998; Sebastiani, 2002): (1) Term selection is often not needed, as SVMs tend to be fairly robust to overfitting and can scale up to considerable dimensionalities, and (2) No human and machine effort in parameter tuning on a validation set is needed, as there is a theoretically motivated &amp;quot;default&amp;quot; choice of parameter settings which have been shown to provide best results. null The key idea behind SVMs is to boost the dimension of the representation vectors and then to find the best line or hyper-plane from the widest set of parallel hyper-planes. This hyper-plane maximizes the distance between two elements in the set. The elements in the set are the support vectors. Theoretically, the classifier is determined by a very small number of examples defining the category frontier, the support vectors. Practically, finding the support vectors is not a trivial task.</Paragraph>
      <Paragraph position="12"> Training SVMs. The implementation used is SVM-light v.5 (Joachims, 1999). The classifier was run with its default setting, with linear kernel function and no kernel optimization tricks. The SVM-light was trained on the very same (meta-tagged) training corpus the naive classifier was trained on.</Paragraph>
      <Paragraph position="13"> Since SVM is supposed to be robust and to fit big and noisy collections, no feature selection method was applied. The special feature underlying SVMs is the high dimensional representation of a document, allowing categorization by a hyper-plane of high dimension; therefore each document was represented by the vector of its stems. The dimension was boosted to include all the stems from the positive set. The boosted vector dimension was 7935, the number of different stems in the collection. The number of support vectors discovered was 17, which turned out to be too small for this task.</Paragraph>
      <Paragraph position="14"> Testing this model on the test set (the same test set used to test the naive classifier from previous section) yielded very poor results. It seemed that the classification was totally random. Testing the classifier on smaller subsets of the training set (200, 300, 400 documents) exposed signs of convergence, suggesting the training set is too sparse for the SVM.</Paragraph>
      <Paragraph position="15"> To overcome sparse data, more documents were needed. The size of the training set was more than doubled. A total of 9968 documents was used as the training set. Just like the original training set, most of the biographies were randomly extracted from biography.com, while a few dozen biographies were manually extracted from various online sources to correct for a possible bias in biography.com. Training SVM-light on the new training set yielded 232 support vectors, which seems enough to perform this classification tasks.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Experimental Results
</SectionTitle>
    <Paragraph position="0"> In order to test the effectiveness of the biography classifiers in improving question answering, we integrated each one of them with the naive baseline biographical QA system and tested the integrated system, called BioGrapher (Figure 1). Before discussing the results of this experiment, we briefly mention how the two classifiers performed on the pure classification task. For this purpose, we created a test set including 47 documents that were retrieved from the web. The evaluation measure was the accuracy of the classifiers in recognizing biographies. The Ripper-based algorithm achieved 89% success, outranking the SVM which achieved 83%.</Paragraph>
    <Paragraph position="1"> A discussion of this difference is beyond the scope of this paper (see (Tsur, 2003) for details).</Paragraph>
    <Paragraph position="2"> We tested BioGrapher on 11 out of the 30 biographical questions in the TREC 2003 QA track.</Paragraph>
    <Paragraph position="3"> Those 11 questions were chosen as a test set because the baseline system (Section 3) scored poorly on them, suggesting that our baseline heuristics are incapable of effectively dealing with this type of questions.</Paragraph>
    <Paragraph position="4">  Two experiments were carried out, one for the Ripper-based classifier and another for the SVM-based one. For each definitional question Biographer submits two simple queries to a web search engine (e.g., Sir John Hale and Sir John Hale biography). It retrieves the top 20 documents returned by the search engine, thus obtaining, for each question, 40 documents amongst which it should find a biography. BioGrapher then classifies the documents into biographies and non-biographic documents. The distribution of documents that were classified as biographies can be found in Table 4.</Paragraph>
    <Paragraph position="5"> To simplify the experiments, and especially the  error analysis, we set up BioGrapher to return answer snippets from a single biography or biography-like document only. Recall, the test questions were such that there were no biographies for the question targets in the biography collection we used (biography.com): the biographies used were ones that BioGrapher identified on the web.</Paragraph>
    <Paragraph position="6"> We evaluated BioGrapher in the following manner.</Paragraph>
    <Paragraph position="7"> The assessor first determines whether the document from which BioGrapher extracts answer snippets is a proper biography or not. In case the document is not a pure biography the F-score given to this question is zero. Otherwise, the F-score was determined in the manner described in Section 3.2 BioGrapher with the Ripper-based Classifier The total number of documents that were classified as biographies is 41 (out of 440 retrieved documents). However, analysis of the results reveals that the false positive ratio is high; only 20 of the 41 chosen documents were proper biography-like pages, the other &amp;quot;biographies&amp;quot; were very noisy.</Paragraph>
    <Paragraph position="8"> For 4 out of the 11 test questions, a proper biography was returned as the top ranking document. While all 4 questions scored 0 at the original TREC evaluation, now their average F-score is 0.732, improving the average F-score over all biography questions by 9.6% to 0.4659.</Paragraph>
    <Paragraph position="9"> BioGrapher with the SVM Classifier The total number of documents that were classified as biographies is 38 (out of 440 retrieved docu2Obviously, the F-score for snippets extracted from documents incorrectly classified as biographies could be higher than zero because these documents could still contain valuable pieces of biographical information that would contribute to the answer's F-score. However, we decided to compute precision and recall only for snippets extracted from documents correctly classified as biographies as we think of the biography classifier as a means to identify (&amp;quot;on-the-fly&amp;quot;) quality documents that could in principle be added to a biography collection.</Paragraph>
    <Paragraph position="10"> ments). However, just like in the case of the Ripper-based classifier, an analysis of the results reveals that the false positive ratio is high; only 18 of the 38 chosen documents were biography-like.</Paragraph>
    <Paragraph position="11"> The SVM classifier managed to return proper biographies (as top ranking documents) for 5 out of 11 questions. The average F-score for those questions is 0.674 instead of the original zero, improving the average F-score over all biography questions by 9.7% to 0.4665.</Paragraph>
    <Paragraph position="12"> No biographies at all were retrieved for 4 of the 11 definition targets in the test set, the same four definition targets for which the Ripper-based classifier did not find biographies. A closer look reveals the same problems as with the Ripper-based classifier: a relatively high false-positive error ratio and weak ranking of the classified biographies.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Discussion
</SectionTitle>
    <Paragraph position="0"> The results of the experiments using both classifiers are quite similar. The system integrated with the SVM-based classifier achieved a slightly higher F-score but it still falls within the standard deviation.</Paragraph>
    <Paragraph position="1"> Our experiment serves as a proof of concept for the hypothesis that using text classification methods improves the baseline QA system in a principled way.</Paragraph>
    <Paragraph position="2"> In spite of the major improvement in the system's performance, we have found two main problems with the classifiers. First, although the classifiers managed to identify biography-like documents, they have a high false-positive ratio and too many errors in filtering out some of the non-pure-biography documents. This happens when the documents retrieved by the web search engine simply cannot be regarded as clean biographies by human assessors, although they do contain many biographic details.</Paragraph>
    <Paragraph position="3"> Second, most of the definition targets had biographies retrieved and even classified as biographies, but the biographies were ranked below other noisy biographical documents, therefore the best biography was not presented as a source from which to extract answer snippets. There are various obvious paths to improve over the current system: (1) Improve the classifiers by better training and other classification algorithms; (2) Enable the extraction of answers from 'noisy' biography-like documents in such a way that the gain in recall is not reversed by a loss of precision; and (3) Allow for the extraction of answer snippets from multiple biography-like documents, while avoiding to return overlapping snippets.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML