<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1063">
  <Title>QuestionBank: Creating a Corpus of Parse-Annotated Questions</Title>
  <Section position="5" start_page="497" end_page="497" type="metho">
    <SectionTitle>
3 Data Sources
</SectionTitle>
    <Paragraph position="0"> The raw question data for QuestionBank comes from two sources, the TREC 8-11 QA track test sets, and a question classifier training set produced by the Cognitive Computation Group (CCG) at the University of Illinois at Urbana-Champaign. We use equal amounts of data from each source so as not to bias the corpus to either data source.</Paragraph>
    <Section position="1" start_page="497" end_page="497" type="sub_section">
      <SectionTitle>
3.1 TREC Questions
</SectionTitle>
      <Paragraph position="0"> The TREC evaluations have become the standard evaluation for QA systems. Their test sets consist primarily of fact-seeking questions with some imperative statements which request information, e.g. List the names of cell phone manufacturers. We included 2000 TREC questions in the raw data from which we created the question treebank. These 2000 questions consist of the test questions for the first three years of the TREC QA track (1893 questions) and 107 questions from the 2003 TREC test set.</Paragraph>
    </Section>
    <Section position="2" start_page="497" end_page="497" type="sub_section">
      <SectionTitle>
3.2 CCG Group Questions
</SectionTitle>
      <Paragraph position="0"> The CCG provides a number of resources for developing QA systems. One of these resources is a set of 5500 questions and their answer types for use in training question classifiers. The 5500 questions were stripped of answer type annotation, duplicate TREC questions were removed, and 2000 questions were used for the question treebank.</Paragraph>
      <Paragraph position="1"> The CCG 5500 questions come from a number of sources (Li and Roth, 2002), and some of these questions contain minor grammatical mistakes, so that, in essence, this corpus is more representative of genuine questions that would be put to a working QA system. A number of tokenisation issues were corrected (e.g. separating contractions), but the minor grammatical errors were left unchanged because we believe that a parser used for question analysis must be able to cope with this sort of data if it is to be used in a working QA system.</Paragraph>
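The tokenisation fix mentioned above (separating contractions, PTB-style) can be sketched with a simple regular expression. This is our own illustrative snippet, not the preprocessing actually used to prepare the CCG questions.

import re

# Rough sketch of PTB-style contraction splitting; the pattern and helper
# name are illustrative only, not the corpus preparation script.
CONTRACTION = re.compile(r"(\w)('(?:s|re|ve|ll|d|m)|n't)\b", re.IGNORECASE)

def split_contractions(text):
    # "What's the capital?"   ->  "What 's the capital?"
    # "Why don't they work?"  ->  "Why do n't they work?"
    return CONTRACTION.sub(r"\1 \2", text)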
    </Section>
  </Section>
  <Section position="6" start_page="497" end_page="499" type="metho">
    <SectionTitle>
4 Creating the Treebank
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="497" end_page="498" type="sub_section">
      <SectionTitle>
4.1 Bootstrapping a Question Treebank
</SectionTitle>
      <Paragraph position="0"> The algorithm used to generate the question treebank is an iterative process of parsing, manual correction, retraining, and parsing.</Paragraph>
      <Paragraph position="1"> Algorithm 1: Induce a parse-annotated treebank from raw data
  repeat
    Parse a new section of raw data
    Manually correct errors in the parser output
    Add the corrected data to the training set
    Extract a new grammar for the parser
  until all the data has been processed
Algorithm 1 summarises the bootstrapping algorithm. A section of raw data is parsed. The parser output is then manually corrected and added to the parser's training corpus. A new grammar is then extracted, and the next section of raw data is parsed. This process continues until all the data has been parsed and hand corrected.</Paragraph>
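A minimal Python sketch of this loop, under simplifying assumptions: parse_with_grammar, manually_correct and train_grammar are hypothetical stand-ins for Bikel's parser, the human correction step and grammar extraction, and are not part of the paper's actual tooling.

# Sketch of the iterative bootstrapping procedure (Algorithm 1).
def bootstrap_treebank(raw_sections, seed_training_set):
    training_set = list(seed_training_set)     # e.g. WSJ Sections 02-21
    treebank = []
    grammar = train_grammar(training_set)      # grammar for the first iteration
    for section in raw_sections:               # one batch of raw questions
        parsed = [parse_with_grammar(grammar, question) for question in section]
        corrected = [manually_correct(tree) for tree in parsed]
        treebank.extend(corrected)
        training_set.extend(corrected)         # growing training corpus
        grammar = train_grammar(training_set)  # retrain before the next section
    return treebank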
    </Section>
    <Section position="2" start_page="498" end_page="498" type="sub_section">
      <SectionTitle>
4.2 Parser
</SectionTitle>
      <Paragraph position="0"> The parser used to process the raw questions prior to manual correction was that of Bikel (2002), a retrainable emulation of the Collins (1999) Model 2 parser. Bikel's parser is a history-based parser which uses a lexicalised generative model to parse sentences. We used WSJ Sections 02-21 of the Penn-II Treebank to train the parser for the first iteration of the algorithm. The training corpus for subsequent iterations consisted of the WSJ material and increasing amounts of processed questions.</Paragraph>
    </Section>
    <Section position="3" start_page="498" end_page="498" type="sub_section">
      <SectionTitle>
4.3 Basic Corpus Development Statistics
</SectionTitle>
      <Paragraph position="0"> Our question treebank was created over a period of three months at an average annotation speed of about 60 questions per day, which is quite rapid for treebank development. The speed of the process was helped by two main factors: the questions are generally quite short (typically about 10 words long), and, due to retraining on the continually increasing training set, the quality of the parses output by the parser improved dramatically during the development of the treebank, with the effect that corrections during the later stages were generally quite small and not as time-consuming as during the initial phases of the bootstrapping process.</Paragraph>
      <Paragraph position="1"> For example, in the first week of the project the trees from the parser were of relatively poor quality and over 78% of the trees needed to be corrected manually. This slowed the annotation process considerably, and parse-annotated questions were being produced at an average rate of 40 trees per day. During the later stages of the project this had changed dramatically. The quality of trees from the parser was much improved, with fewer than 20% of the trees requiring manual correction. At this stage parse-annotated questions were being produced at an average rate of 90 trees per day.</Paragraph>
    </Section>
    <Section position="4" start_page="498" end_page="499" type="sub_section">
      <SectionTitle>
4.4 Corpus Development Error Analysis
</SectionTitle>
      <Paragraph position="0"> Some of the more frequent errors in the parser output pertain to the syntactic analysis of WH-phrases (WHNP, WHPP, etc.). In Sections 02-21 of the Penn-II Treebank, these are used more often in relative clause constructions than in questions.</Paragraph>
      <Paragraph position="1"> As a result many of the corpus questions were given syntactic analyses corresponding to relative clauses (SBAR with an embedded S) instead of to questions (SBARQ with an embedded SQ). Figure 1 illustrates this. Because the questions are typically short, an error like this has quite a large effect on the accuracy of the overall tree; in this case the f-score for the parser output (Figure 1(a)) would be only 60%. Errors of this nature were quite frequent in the first section of questions analysed by the parser, but with increased training material becoming available during successive iterations, this error became less frequent, and towards the end of the project it was only seen in rare cases.</Paragraph>
      <Paragraph position="2"> WH-XP marking was the source of a number of consistent (though infrequent) errors during annotation. This occurred mostly in PP constructions containing WHNPs. The parser would output a structure like Figure 2(a), where the PP mother of the WHNP is not correctly labelled as a WHPP, as it is in the corrected tree in Figure 2(b).</Paragraph>
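This particular correction can be pictured with a small nltk.Tree sketch that relabels a PP as WHPP whenever it immediately dominates a WHNP; the snippet is illustrative only and is not the annotation tool used to build the treebank.

from nltk import Tree

def relabel_whpp(tree):
    # Relabel PP as WHPP when it immediately dominates a WHNP
    # (the correction described above).
    for subtree in tree.subtrees(lambda t: t.label() == "PP"):
        if any(isinstance(child, Tree) and child.label() == "WHNP"
               for child in subtree):
            subtree.set_label("WHPP")
    return tree

# Simplified example (not copied from the paper's figures):
# relabel_whpp(Tree.fromstring("(PP (IN in) (WHNP (WDT which) (NN city)))"))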
      <Paragraph position="3"> The parser output often had to be rearranged structurally to varying degrees. This was common in the longer questions. A recurring error in the parser output was failing to identify VPs in SQs with a single object NP. In these cases the verb and the object NP were left as daughters of the SQ node. Figure 3(a) illustrates this, and Figure 3(b) shows the corrected tree with the VP node inserted. On inspection, we found that the problem was caused by copular constructions, which, according to the Penn-II annotation guidelines, do not feature VP constituents. Since almost half of the question data contain copular constructions, the parser trained on this data would sometimes misanalyse non-copular constructions or, conversely, incorrectly bracket copular constructions using a VP constituent (Figure 4(a)).</Paragraph>
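A companion sketch of the missing-VP correction, again using nltk.Tree for illustration: when a verb and its following object NP are bare daughters of an SQ, they are wrapped in a new VP node. The tree shapes are simplified assumptions, not the paper's figures.

from nltk import Tree

def insert_vp(tree):
    # Wrap a verb and its immediately following object NP in a VP when
    # both appear as bare daughters of an SQ node.
    for sq in tree.subtrees(lambda t: t.label() == "SQ"):
        children = list(sq)
        for i, (left, right) in enumerate(zip(children, children[1:])):
            left_is_verb = isinstance(left, Tree) and left.label().startswith("VB")
            right_is_np = isinstance(right, Tree) and right.label() == "NP"
            if left_is_verb and right_is_np:
                sq[i:i + 2] = [Tree("VP", [left, right])]
                break
    return tree

# Simplified example: (SQ (VBD killed) (NP (NNP Harvey) (NNP Oswald)))
# becomes (SQ (VP (VBD killed) (NP (NNP Harvey) (NNP Oswald)))).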
      <Paragraph position="4"> The predictable nature of these errors meant that they were simple to correct, due to the particular context in which they occur and the finite number of forms of the copular verb.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="499" end_page="501" type="metho">
    <SectionTitle>
5 Experiments with QuestionBank
</SectionTitle>
    <Paragraph position="0"> To test the effect that training on the question corpus has on parser performance, we carried out a number of experiments. In cross-validation experiments with 90%/10% splits we use all 4000 trees in the completed QuestionBank as the test set. We also performed ablation experiments to investigate the effect of varying the amount of question and non-question training data on the parser's performance. For these experiments we divided the 4000 questions into two sets: we randomly selected 400 trees to be held out as a gold standard test set against which to evaluate, and the remaining 3600 trees were then used as a training corpus.</Paragraph>
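A trivial sketch of the 400/3600 split described above; the seed, function name and use of Python's random module are our own assumptions rather than the paper's actual procedure.

import random

def split_questionbank(trees, test_size=400, seed=0):
    # Randomly hold out test_size trees; keep the rest for training.
    rng = random.Random(seed)
    shuffled = list(trees)
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]   # (train, test)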
    <Section position="1" start_page="499" end_page="499" type="sub_section">
      <SectionTitle>
5.1 Establishing the Baseline
</SectionTitle>
      <Paragraph position="0"> The baseline we use for our experiments is provided by Bikel's parser trained on WSJ Sections 02-21 of the Penn-II Treebank. We test on all 4000 questions in our question treebank, and also on Section 23 of the Penn-II Treebank.</Paragraph>
      <Paragraph position="1"> While the coverage for both tests is high, the parser underperforms significantly on the question test set, with a labelled bracketing f-score of 78.77 compared to 82.97 on Section 23 of the Penn-II Treebank (Table 1). Note that, unlike the published results for Bikel's parser, our evaluations test on Section 23 with punctuation included.</Paragraph>
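For readers unfamiliar with the metric, the following is a rough re-implementation of labelled bracketing f-score over (label, start, end) constituents. It is only a sketch (sets rather than multisets, simplified tuple-encoded trees), not the evaluation software actually used for the scores reported here.

def brackets(tree, start=0):
    # tree is a (label, children) tuple; leaves are plain token strings.
    # Returns a list of (label, start, end) spans, the tree's own span first.
    label, children = tree
    pos, spans = start, []
    for child in children:
        if isinstance(child, str):
            pos += 1
        else:
            child_spans = brackets(child, pos)
            spans.extend(child_spans)
            pos = child_spans[0][2]    # end position of that child
    spans.insert(0, (label, start, pos))
    return spans

def labelled_f_score(gold_tree, test_tree):
    gold = set(brackets(gold_tree))
    test = set(brackets(test_tree))
    matched = len(gold.intersection(test))
    if matched == 0:
        return 0.0
    precision = matched / len(test)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)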
    </Section>
    <Section position="2" start_page="499" end_page="500" type="sub_section">
      <SectionTitle>
5.2 Cross-Validation Experiments
</SectionTitle>
      <Paragraph position="0"> We carried out two cross-validation experiments.</Paragraph>
      <Paragraph position="1"> In the first experiment we perform a 10-fold cross-validation experiment using our 4000 question treebank. In each case a randomly selected set of 10% of the questions in QuestionBank was held out during training and used as a test set. In this way parses from unseen data were generated for all 4000 questions and evaluated against the QuestionBank trees.</Paragraph>
      <Paragraph position="2"> The second cross-validation experiment was similar to the first, but in each of the 10 folds we train on 90% of the 4000 questions in QuestionBank and on all of Sections 02-21 of the Penn-II Treebank.</Paragraph>
      <Paragraph position="3"> In both experiments we also backtest each of the ten grammars on Section 23 of the Penn-II Treebank and report the average scores.</Paragraph>
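The two cross-validation set-ups can be sketched as follows; the questions are assumed to be shuffled in advance, and train_grammar and parse_and_score are hypothetical helpers standing in for grammar extraction with Bikel's parser and for PARSEVAL evaluation.

def ten_fold_cv(questions, wsj_trees=None, folds=10):
    # First experiment: wsj_trees is None (questions only).
    # Second experiment: wsj_trees holds WSJ Sections 02-21, added to every fold.
    fold_size = len(questions) // folds
    scores = []
    for k in range(folds):
        test = questions[k * fold_size:(k + 1) * fold_size]
        train = questions[:k * fold_size] + questions[(k + 1) * fold_size:]
        if wsj_trees is not None:
            train = train + wsj_trees
        grammar = train_grammar(train)
        scores.append(parse_and_score(grammar, test))
    return sum(scores) / folds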
      <Paragraph position="4"> Table 2 shows the results of the first cross-validation experiment, using only the question training data. The results show a significant improvement of over 10% on the baseline f-score for questions. However, the tests on the non-question Section 23 data show not only a significant drop in accuracy but also a drop in coverage.</Paragraph>
      <Paragraph position="5"> Table 3 shows the results for the second cross-validation experiment, using Sections 02-21 of the Penn-II Treebank and the 4000 questions in QuestionBank. The results show an even greater increase on the baseline f-score than the experiments using only the question training set (Table 2). The non-question results are also better and are comparable to the baseline (Table 1).</Paragraph>
    </Section>
    <Section position="3" start_page="500" end_page="501" type="sub_section">
      <SectionTitle>
5.3 Ablation Runs
</SectionTitle>
      <Paragraph position="0"> In a further set of experiments we investigated the effect of varying the amount of data in the parser's training corpus. We experiment with varying both the amount of QuestionBank and Penn-II Treebank data that the parser is trained on. In each experiment we use the 400 question test set and Section 23 of the Penn-II Treebank to evaluate against, and the 3600 question training set described above and Sections 02-21 of the Penn-II Treebank as the basis for the parser's training corpus. We report on three experiments: In the first experiment we train the parser using only the 3600 question training set. We performed ten training and parsing runs in this experiment, incrementally reducing the size of the QuestionBank training corpus by 10% of the whole on each run.</Paragraph>
      <Paragraph position="1"> The second experiment is similar to the first, but in each run we add Sections 02-21 of the Penn-II Treebank to the (shrinking) training set of questions. The third experiment is the converse of the second: the amount of questions in the training set remains fixed (all 3600) and the amount of Penn-II Treebank material is incrementally reduced by 10% on each run.</Paragraph>
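The three ablation schedules can be sketched in a few lines; build_and_evaluate is a hypothetical helper wrapping grammar extraction and the two evaluations (the 400 question test set and WSJ Section 23), and the slicing below assumes pre-shuffled training sets.

def ablation_runs(question_train, wsj_train):
    results = []
    for run in range(10):
        keep = 1.0 - 0.1 * run                     # 100%, 90%, ..., 10%
        q_subset = question_train[:int(len(question_train) * keep)]
        w_subset = wsj_train[:int(len(wsj_train) * keep)]
        results.append({
            "questions_only": build_and_evaluate(q_subset),                  # experiment 1
            "questions_plus_wsj": build_and_evaluate(q_subset + wsj_train),  # experiment 2
            "wsj_reduced": build_and_evaluate(question_train + w_subset),    # experiment 3
        })
    return results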
      <Paragraph position="2"> In the first ablation experiment the parser was tested on the 400 question test set and on Section 23 of the Penn-II Treebank over ten parsing runs, with the amount of data in the 3600 question training corpus reduced incrementally on each run. The results show that, trained on only a small amount of question data, the parser can parse questions with high accuracy. For example, when trained on only 10% of the 3600 questions used in this experiment, the parser successfully parses all of the 400 question test set and achieves an f-score of 85.59. However, the results for the tests on WSJ Section 23 are considerably worse: the parser never manages to parse the full test set, and the best score, at 59.61, is very low.</Paragraph>
      <Paragraph position="3"> Figure 6 graphs the results for the second ablation experiment, in which the training set consists of a fixed amount of Penn-II Treebank data (Sections 02-21) and a reducing amount of question data from the 3600 question training set. Each grammar is tested on both the 400 question test set and WSJ Section 23. The results here are significantly better than in the previous experiment. In all of the runs the coverage for both test sets is 100%; f-scores for the question test set decrease as the amount of question data in the training set is reduced (though they are still quite high). There is little change in the f-scores for the tests on Section 23: the results all fall in the range 82.36 to 82.46, which is comparable to the baseline score.</Paragraph>
      <Paragraph position="4"> In the third ablation experiment the training set is a fixed amount of the question training set described above (all 3600 questions) and a reducing amount of data from Sections 02-21 of the Penn Treebank. The graph for this experiment shows that the parser performs consistently well on the question test set in terms of both coverage and accuracy. The tests on Section 23, however, show that as the amount of Penn-II Treebank material in the training set decreases, the f-score also decreases.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="501" end_page="503" type="metho">
    <SectionTitle>
6 Long Distance Dependencies
</SectionTitle>
    <Paragraph position="0"> Long distance dependencies are crucial in the proper analysis of question material. In English wh-questions, the fronted wh-constituent refers to an argument position of a verb inside the interrogative construction. Compare the superficially similar
1. Who1 [t1] killed Harvey Oswald?
2. Who1 did Harvey Oswald kill [t1]?
(1) queries the agent (syntactic subject) of the described eventuality, while (2) queries the patient (syntactic object). In the Penn-II and ATIS treebanks, dependencies such as these are represented in terms of empty productions, traces and coindexation in CFG tree representations (Figure 8). With few exceptions (Collins' Model 3 computes a limited number of wh-dependencies in relative clause constructions), the trees produced by current treebank-based probabilistic parsers do not represent long distance dependencies (Figure 9). Johnson (2002) presents a tree-based method for reconstructing LDDs in Penn-II trained parser output trees. Cahill et al. (2004) present a method for resolving LDDs at the level of Lexical-Functional Grammar f-structure (attribute-value structure encodings of basic predicate-argument structure or dependency relations) without the need for empty productions and coindexation in parse trees. Their method is based on learning finite approximations of functional uncertainty equations (regular expressions over paths in f-structure) from an automatically f-structure annotated version of the Penn-II treebank and resolves LDDs at f-structure. In our work we use the f-structure-based method of Cahill et al. (2004) to reverse engineer empty productions, traces and coindexation in parser output trees. We explain the process by way of a worked example. We take the parser output tree in Figure 9(a) (without empty productions and coindexation), automatically annotate it with f-structure information, and compute LDD resolution at the level of f-structure using the resources of Cahill et al. (2004). This generates the f-structure annotated tree and the LDD resolved f-structure in Figure 10.</Paragraph>
    <Paragraph position="1"> Note that the LDD is indicated in terms of a reentrancy 1 between the question FOCUS and the SUBJ function in the resolved f-structure. Given the correspondence between the f-structure and f-structure annotated nodes in the parse tree, we compute that the SUBJ function newly introduced and reentrant with the FOCUS function is an argument of the PRED 'kill' and the verb form 'killed' in the tree. In order to reconstruct the corresponding empty subject NP node in the parser output tree, we need to determine candidate anchor sites for the empty node. These anchor sites can only be realised along the path up to the maximal projection of the governing verb, indicated by ↑=↓ annotations in LFG. This establishes three anchor sites: VP, SQ and the top level SBARQ. From the automatically f-structure annotated Penn-II treebank, we extract f-structure annotated PCFG rules for each of the three anchor sites whose RHSs contain exactly the information (daughter categories plus LFG annotations) in the tree in Figure 10 (in the same order) plus an additional node (of whatever CFG category) annotated ↑SUBJ=↓, located anywhere within the RHSs. This will retrieve rules of the form</Paragraph>
    <Paragraph position="3"> each with their associated probabilities. We select the rule with the highest probability and cut the rule into the tree in Figure 10 at the appropriate anchor site (as determined by the rule LHS). In our case this selects SQ → NP[↑SUBJ=↓] VP[↑=↓], and the resulting tree is given in Figure 11. From this tree, it is now easy to compute the tree with the coindexed trace in Figure 8(a).</Paragraph>
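The rule selection step can be pictured with a small sketch: among annotated PCFG rules for the candidate anchor categories, keep those whose right-hand side equals the anchor node's existing annotated daughters plus exactly one extra SUBJ-annotated daughter, and apply the most probable one. The plain-string encoding of rules and daughters below is our own illustration, not the representation used with the Cahill et al. (2004) resources.

def best_cutin_rule(anchor_daughters, rules):
    # anchor_daughters: {category: tuple of existing annotated daughters}
    # rules: iterable of (lhs_category, rhs_daughters, probability) triples
    candidates = []
    for lhs, rhs, prob in rules:
        existing = anchor_daughters.get(lhs)
        if existing is None:
            continue
        extras = [d for d in rhs if d not in existing]
        kept = tuple(d for d in rhs if d in existing)
        if kept == existing and len(extras) == 1 and "SUBJ" in extras[0]:
            candidates.append((prob, lhs, rhs, extras[0]))
    return max(candidates) if candidates else None

# Hypothetical usage: with anchor_daughters = {"SQ": ("VP[↑=↓]",)} and a rule
# ("SQ", ("NP[↑SUBJ=↓]", "VP[↑=↓]"), 0.42), the NP[↑SUBJ=↓] daughter would be
# cut in at the SQ node as the site of the empty subject.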
    <Paragraph position="4"> In order to evaluate our empty node and coindexation recovery method, we conducted two experiments, one using 146 gold-standard ATIS question trees and one using parser output on the corresponding strings for the 146 ATIS question trees.</Paragraph>
    <Paragraph position="5"> In the first experiment, we delete empty nodes and coindexation from the ATIS gold standard trees and reconstruct them using our method and the preprocessed ATIS trees. In the second experiment, we parse the strings corresponding to the ATIS trees with Bikel's parser and reconstruct the empty productions and coindexation. In both cases we evaluate against the original (unreduced) ATIS trees and score if and only if all of insertion site, inserted CFG category and coindexation match.</Paragraph>
    <Paragraph position="6"> Table 4 shows that currently the recall of our method is quite low, at 39.38%, while the accuracy is very high, with precision at 96.82% on the ATIS trees. Encouragingly, evaluating parser output for the same sentences shows little change in the scores, with recall at 38.75% and precision at 96.77%.</Paragraph>
  </Section>
</Paper>