<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3215"> <Title>Object-Extraction and Question-Parsing using CCG</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Object Extraction </SectionTitle> <Paragraph position="0"> Steedman (1996) presents a detailed study of various extraction phenomena. Here we focus on object extraction, since the dependencies in such cases are unbounded, and CCG has been designed to handle them. Correct dependency recovery for object extraction is also difficult for shallow methods such as those of Johnson (2002) and Dienes and Dubey (2003).</Paragraph> <Paragraph position="1"> We consider three types of object extraction: object relative clauses, free object relatives, and tough-adjectives (Hockenmaier, 2003a). Examples of the first two from CCGbank (e.g. an excellent publication that I enjoy reading) are given in Figures 1 and 2, together with the normal-form derivation.</Paragraph> <Paragraph position="2"> The caption gives the number of sentences containing such a case in Sections 2-21 of CCGbank (the training data) and Section 00 (the development data).</Paragraph> <Paragraph position="3"> The pattern of the two derivations is similar: the subject of the verb phrase missing an object is type-raised (T); the type-raised subject composes (B) with the verb phrase; and the category for the relative pronoun ((NP\NP)/(S[dcl]/NP) or NP/(S[dcl]/NP)) applies to the sentence-missing-its-object (S[dcl]/NP). Clark et al. (2002) show how the dependency between the verb and object can be captured by co-indexing the heads of the NPs in the relative pronoun category.</Paragraph> <Paragraph position="4"> Figure 3 gives the derivation for a tough-adjective. The dependency between take and That can be recovered by co-indexing the heads of NPs in the categories for hard and got. These cases are relatively rare, with around 50 occurring in the whole of the treebank, and only two in the development set; the parser correctly recovers one of the two object dependencies for the tough-adjective cases in Section 00.</Paragraph> <Paragraph position="7"> For the free object relative cases in Section 00, the parser recovers 14 of the 17 gold-standard dependencies between the relative pronoun and the head of the relative clause (one of the 16 sentences contains two such dependencies). The precision is 14/15.</Paragraph> <Paragraph position="8"> For the three gold-standard cases that are misanalysed, the category NP/S[dcl] is assigned to the relative pronoun, rather than NP/(S[dcl]/NP).</Paragraph> <Paragraph position="9"> For the cases involving object relative clauses the parser makes a range of errors, which we analyse in detail below.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Analysis of Object Extraction Cases </SectionTitle> <Paragraph position="0"> Figure 4 gives the 20 sentences in Section 00 which contain a relative pronoun with the category (NP\NP)/(S[dcl]/NP). There are 24 object dependencies in total, since some sentences contain more than one extraction (11), and some extractions involve more than one head (8, 18, 19). For evaluation, we determined whether the parser correctly recovered the dependency between the head of the extracted object and the verb.
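Concretely, this evaluation reduces to comparing sets of (verb, object-head) pairs between the parser output and the gold standard. The following is a minimal sketch of that comparison; the set-of-pairs representation and the example data are illustrative only and do not reproduce the parser's actual output.

    # Sketch: scoring extracted-object dependencies as (verb, object_head) pairs.
    # The set-of-pairs representation is a simplification assumed for illustration.
    def score(gold, predicted):
        """Return (recall, precision) over extracted-object dependency pairs."""
        correct = len(gold.intersection(predicted))
        recall = correct / len(gold) if gold else 0.0
        precision = correct / len(predicted) if predicted else 0.0
        return recall, precision

    # Made-up example, loosely based on sentences 1 and 2 of Figure 4.
    gold = {("estimated", "refund"), ("placed", "countries")}
    predicted = {("estimated", "refund"), ("placed", "many")}
    print(score(gold, predicted))  # (0.5, 0.5)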
For example, to get the two dependencies in sentence 18 correct, the parser would have to assign the correct lexical category to had, and return respect and confidence as objects.</Paragraph> <Paragraph position="1"> The parser correctly recovers 15 of the 24 object dependencies (these figures are based on automatically assigned, rather than gold-standard, POS tags). Overall the parser hypothesises 20 extracted object dependencies, giving a precision of 15/20. Hockenmaier (2003a) reports similar results for a CCG parser using a generative model: 14/24 recall and 14/21 precision. The results here are a significant improvement over those in Clark et al. (2002), in which only 10 of the 24 dependencies were recovered correctly. Below is a detailed analysis of the mistakes made by the parser.</Paragraph> <Paragraph position="3"> For Sentence 1 the parser cannot provide any analysis. This is because the correct category for estimated, ((S[pt]\NP)/PP)/NP, is not in the tag dictionary's entry for estimated. Since estimated occurs around 200 times in the data, the supertagger only considers categories from the tag dictionary entry, and thus cannot provide the correct category as an option.</Paragraph> <Paragraph position="5"> Figure 4 (the 20 sentences): 1. Commonwealth Edison now faces an additional court-ordered refund on its summer/winter rate differential collections that the Illinois Appellate Court has estimated at $140 million. 2. Mrs. Hills said many of the 25 countries that she placed under varying degrees of scrutiny have made genuine progress on this touchy issue.p 3. It's the petulant complaint of an impudent American whom Sony hosted for a year while he was on a Luce Fellowship in Tokyo - to the regret of both parties.p 4. It said the man, whom it did not name, had been found to have the disease after hospital tests. 5. Democratic Lt. Gov. Douglas Wilder opened his gubernatorial battle with Republican Marshall Coleman with an abortion commercial produced by Frank Greer that analysts of every political persuasion agree was a tour de force. 6. Against a shot of Monticello superimposed on an American flag, an announcer talks about the strong tradition of freedom and individual liberty that Virginians have nurtured for generations.p 7. Interviews with analysts and business people in the U.S. suggest that Japanese capital may produce the economic cooperation that Southeast Asian politicians have pursued in fits and starts for decades. 8. Another was Nancy Yeargin, who came to Greenville in 1985, full of the energy and ambitions that reformers wanted to reward. 9. Mostly, she says, she wanted to prevent the damage to self-esteem that her low-ability students would suffer from doing badly on the test.p 10. Mrs. Ward says that when the cheating was discovered, she wanted to avoid the morale-damaging public disclosure that a trial would bring.p 11. In CAT sections where students' knowledge of two-letter consonant sounds is tested, the authors noted that Scoring High concentrated on the same sounds that the test does - to the exclusion of other sounds that fifth graders should know.p 12. Interpublic Group said its television programming operations - which it expanded earlier this year - agreed to supply more than 4,000 hours of original programming across Europe in 1990.
13. Interpublic is providing the programming in return for advertising time, which it said will be valued at more than $75 million in 1990 and $150 million in 1991.p 14. Mr. Sherwood speculated that the leeway that Sea Containers has means that Temple would have to substantially increase their bid if they're going to top us.p 15. The Japanese companies bankroll many small U.S. companies with promising products or ideas, frequently putting their money behind projects that commercial banks won't touch.p 16. In investing on the basis of future transactions, a role often performed by merchant banks, trading companies can cut through the logjam that small-company owners often face with their local commercial banks. 17. A high-balance customer that banks pine for, she didn't give much thought to the rates she was receiving, nor to the fees she was paying.p 18. The events of April through June damaged the respect and confidence which most Americans previously had for the leaders of China.p 19. He described the situation as an escrow problem, a timing issue, which he said was rapidly rectified, with no losses to customers.p 20. But Rep. Marge Roukema (R., N.J.) instead praised the House's acceptance of a new youth training wage, a subminimum that GOP ... (Figure 4 caption, partly recovered: ... italics; for sentences marked with a p the parser correctly recovers all dependencies involved in the object extraction.)</Paragraph> <Paragraph position="6"> For Sentence 2 the correct category is assigned to the relative pronoun that, but a wrong attachment results in many, rather than countries, being returned as the object of placed.</Paragraph> <Paragraph position="10"> In Sentence 5 the incorrect lexical category ((S\NP)\(S\NP))/S[dcl] is assigned to the relative pronoun that. In fact, the correct category is provided as an option by the supertagger, but the parser is unable to select it. This is because the category for agree is incorrect, since again the correct category, ((S[dcl]\NP)/NP)/(S[dcl]\NP), is not in the verb's entry in the tag dictionary.</Paragraph> <Paragraph position="11"> In Sentence 6 the correct category is assigned to the relative pronoun, but a number of mistakes elsewhere result in the wrong noun attachment.</Paragraph> <Paragraph position="12"> In Sentences 8 and 9 the complementizer category S[em]/S[dcl] is incorrectly assigned to the relative pronoun that. For Sentence 8 the correct analysis is available but the parsing model chose incorrectly. For Sentence 9 the correct analysis is unavailable because the correct category for suffer, ((S[b]\NP)/PP)/NP, is not in the verb's entry in the tag dictionary.</Paragraph> <Paragraph position="13"> In Sentence 13 the correct category is again assigned to the relative pronoun, but a wrong attachment results in return, rather than time, being analysed as the extracted object.</Paragraph> <Paragraph position="14"> In Sentence 17 the wrong category S[em]/S[b] is assigned to the relative pronoun that. Again the problem is with the category for the verb, but for a different reason: the POS tagger incorrectly tags pine as a base form (VB), rather than VBP, which completely misleads the supertagger.</Paragraph>
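<Paragraph> Several of these errors trace back to the supertagger's tag dictionary and its use of POS tags. The sketch below illustrates a generic frequency-cutoff tag dictionary with a POS back-off for rare words; it is a simplified picture consistent with the behaviour described above, not the actual supertagger implementation, and the cutoff value is hypothetical.

    # Simplified sketch of tag-dictionary lookup in a supertagger.
    FREQ_CUTOFF = 20  # hypothetical frequency threshold

    def candidate_categories(word, pos, tag_dict, pos_dict, word_freq):
        """Return the set of lexical categories the supertagger may consider."""
        if word_freq.get(word, 0) >= FREQ_CUTOFF:
            # Frequent word: only categories seen with this word in training.
            # If the correct category never co-occurred with the word (as for
            # estimated, agree and suffer above), it can never be proposed,
            # whatever the statistical model prefers.
            return tag_dict.get(word, set())
        # Rare word: back off to categories seen with the word's POS tag.
        # POS tags also act as features in the real supertagger, which is how
        # the wrong VB tag for pine can mislead it.
        return pos_dict.get(pos, set())

On this picture, the estimated, agree and suffer errors are coverage gaps in the dictionary entries, while the pine error enters through the POS tags. </Paragraph>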
<Paragraph position="15"> This small study only provides anecdotal evidence for the reasons the parser is unable to recover some long-range object dependencies. However, the analysis suggests that the parser fails largely for the same reasons it fails on other WSJ sentences: wrong attachment decisions are being made; the lexical coverage of the supertagger is lacking for some verbs; the model is sometimes biased towards incorrect lexical categories; and the supertagger is occasionally led astray by incorrect POS tags.</Paragraph> <Paragraph position="16"> Note that the recovery of these dependencies is a difficult problem, since the parser must assign the correct categories to the relative pronoun and verb, and make two attachment decisions: one attaching the relative pronoun to the verb, and one attaching it to the noun phrase. The recall figures for the individual dependencies in the relative pronoun category are 16/21 for the verb attachment and 15/24 for the noun attachment.</Paragraph> <Paragraph position="17"> In conclusion, the kinds of errors made by the parser suggest that general improvements in the coverage of the lexicon and parsing models based on CCGbank will lead to better recovery of long-range object dependencies.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Parsing Questions </SectionTitle> <Paragraph position="0"> Wide-coverage parsers are now being successfully used as part of open-domain QA systems, e.g. Pasca and Harabagiu (2001). The speed and accuracy of our CCG parser suggest that it could be used to parse answer candidates, and we are currently integrating the parser into a QA system. We would also like to apply the parser to the questions, for two reasons: the use of CCG allows the parser to deal with extraction cases, which occur relatively frequently in questions; and the comparison of potential answers with the question, performed by the answer extraction component, is simplified if the same parser is used for both.</Paragraph> <Paragraph position="1"> Initially we tried some experiments applying the parser to questions from previous TREC competitions. The results were extremely poor, largely because the questions contain constructions which appear very infrequently, if at all, in CCGbank. (An earlier version of our QA system used RASP (Briscoe and Carroll, 2002) to parse the questions, but this parser also performed extremely poorly on some question types.) For example, there are no What questions with the general form of What President became Chief Justice after his presidency? in CCGbank, but this is a very common form of Wh-question. (There is a very small number (3) of similar question types beginning How or Which in Sections 2-21.) One solution is to create new annotated question data and retrain the parser, perhaps combining the data with CCGbank. However, the creation of gold-standard derivation trees is very expensive.</Paragraph> <Paragraph position="2"> A novel alternative, which we pursue here, is to annotate questions at the lexical category level only. Annotating sentences with lexical categories is simpler than annotating with derivations, and can be done with the tools and resources we have available. The key question is whether training only the supertagger on new question data is enough to give high parsing accuracy; in Section 6 we show that it is. The next section describes the creation of the question corpus.</Paragraph>
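<Paragraph> To make the annotation task concrete, labelling a question at the lexical category level simply means pairing each token with a CCG lexical category, with no derivation structure. The sketch below shows the idea for one corpus question; the category for What is the object-question category discussed in the next section, while the remaining assignments are plausible but assumed for illustration rather than taken from the corpus.

    # Sketch of category-level annotation as (token, lexical category) pairs.
    # Only the What category is documented in the text; the others are
    # illustrative assumptions.
    annotated_question = [
        ("What",     r"S[wq]/(S[q]/NP)"),
        ("do",       r"(S[q]/(S[b]\NP))/NP"),
        ("penguins", r"N"),
        ("eat",      r"(S[b]\NP)/NP"),
        ("?",        r"."),
    ]

A derivation-level annotation would additionally have to specify how these categories combine, which is what makes full treebank annotation so much more expensive. </Paragraph>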
</Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 A What-Question Corpus </SectionTitle> <Paragraph position="0"> We have created a corpus consisting of 1,171 questions beginning with the word What, taken from the TREC 9-12 competitions (2000-2003). We chose to focus on What-questions because these are a common form of question, and many contain cases of extraction, including some unbounded object extraction. A sample of questions from the corpus is given in Figure 5.</Paragraph> <Paragraph position="1"> Figure 5 (sample questions): 1. What are Cushman and Wakefield known for? 2. What are pomegranates? 3. What is hybridization? 4. What is Martin Luther King Jr.'s real birthday? 5. What is one of the cities that the University of Minnesota is located in? 6. What do penguins eat? 7. What amount of folic acid should an expectant mother take daily? 8. What city did the Flintstones live in? 9. What instrument is Ray Charles best known for playing? 10. What state does Martha Stewart live in? 11. What kind of a sports team is the Wisconsin Badgers? 12. What English word contains the most letters? 13. What king signed the Magna Carta? 14. What caused the Lynmouth floods?</Paragraph> <Paragraph position="2"> The questions were tokenised according to the Penn Treebank convention and automatically POS tagged. Some of the obvious errors made by the tagger were manually corrected. The first author then manually labelled 500 questions with lexical categories. The supertagger was trained on the annotated questions, and used to label the remaining questions, which were then manually corrected. The performance of the supertagger was good enough at this stage to significantly reduce the effort required for annotation. The second author has verified a subset of the annotated sentences. The question corpus took less than a week to create.</Paragraph> <Paragraph position="3"> Figure 6 gives the derivations for some example questions. The lexical categories, which make up the annotation in the question corpus, are in bold. Note that the first example contains an unbounded object extraction, indicated by the question clause missing an object (S[q]/NP) which is an argument of What. Table 1 gives the distribution of categories assigned to the first word What in each question in the corpus: the first row gives the category of the object question What; the second row, the object question determiner; the third row, the subject question determiner; and the final row, the root subject question What.</Paragraph> <Paragraph position="4"> For the examples in Figure 5, S[wq]/(S[q]/NP) appears in questions 1-6, (S[wq]/(S[q]/NP))/N in 7-11, (S[wq]/(S[dcl]\NP))/N in 12-13, and S[wq]/(S[dcl]\NP) in 14.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"> A development set was created by randomly selecting 171 questions. For development purposes the remaining 1,000 questions were used for training; these were also used as a final cross-validation training/test set. The average length of the tokenised questions in the whole corpus is 8.6 tokens.</Paragraph> <Paragraph position="1"> The lexical category set used by the parser contains all categories which occur at least 10 times in CCGbank, giving a set of 409 categories. In creating the question corpus we used a small number of new category types, of which 3 were needed to cover common question constructions. One of these, (S[wq]/(S[dcl]\NP))/N, applies to What, as in the second example in Figure 6.
This category does appear in CCGbank, but so infrequently that it is not part of the parser's lexical category set.</Paragraph> <Paragraph position="2"> Two more apply to question words like did and is; for example, (S[q]/(S[pss]\NP))/NP applies to is in What instrument is Ray Charles best known for playing?, and (S[q]/PP)/NP applies to is in What city in Florida is Sea World in?.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Supertagger Accuracy </SectionTitle> <Paragraph position="0"> As an initial evaluation we tested the accuracy of just the supertagger on the development data. The supertagger was run in two modes: one in which a single category was assigned to each word, and one in which 1.5 categories were assigned to each word, on average. Table 2 gives the per-word accuracy on the development question data for a number of supertagging models; SENT accuracy gives the percentage of sentences for which every word is assigned the correct category. Four supertagging models were used: one trained on CCGbank only; one trained on the 1,000 questions; one trained on the 1,000 questions plus CCGbank; and one trained on 10 copies of the 1,000 questions plus CCGbank.</Paragraph> <Paragraph position="1"> The supertagger performs well when trained on the question data, and benefits from a combination of the questions and CCGbank. To increase the influence of the questions, we tried adding 10 copies of the question data to CCGbank, but this had little impact on accuracy. However, the supertagger performs extremely poorly when trained only on CCGbank. One reason for the very low SENT accuracy figure is that many of the questions contain lexical categories which are not in the supertagger's category set derived from CCGbank: 56 of the 171 development questions have this property.</Paragraph> <Paragraph position="2"> The parsing results in Clark and Curran (2004b) rely on a supertagger per-word accuracy of at least 97%, and a sentence accuracy of at least 60%. The sentence accuracy of only 11% for the model trained on CCGbank alone confirms that our parsing system based only on CCGbank is quite inadequate for accurate question parsing.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Parser Accuracy </SectionTitle> <Paragraph position="0"> Since the gold-standard question data is only labelled at the lexical category level, we are only able to perform a full evaluation at that level. However, the scores in Clark and Curran (2004b) give an indication of how supertagging accuracy corresponds to overall dependency recovery. In addition, in Section 6.3 we present an evaluation on object extraction dependencies in the development data.</Paragraph> <Paragraph position="1"> We applied the parser to the 171 questions in the development data, using the supertagger model from the third row in Table 2, together with a log-linear parsing model trained on CCGbank. We used the supertagging approach described in Section 2.1, in which a small number of categories is initially assigned to each word, and the parser requests more categories if a spanning analysis cannot be found. We used 4 different values for the parameter (which determines the average number of categories per word): 0.5, 0.25, 0.075 and 0.01.</Paragraph> <Paragraph position="2"> The average number of categories at each level for the development data is 1.1, 1.2, 1.6 and 3.8.</Paragraph>
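<Paragraph> The interaction between the supertagger and the parser under this strategy can be sketched as follows; assign_categories and parse are hypothetical stand-ins for the real components, so this is an outline of the control flow rather than the actual implementation.

    # Sketch of adaptive supertagging: start with a tight category assignment
    # and relax it only if no spanning analysis is found.
    BETAS = [0.5, 0.25, 0.075, 0.01]  # tightest to loosest, as in Section 6.2

    def parse_with_adaptive_supertagging(sentence, assign_categories, parse):
        for beta in BETAS:
            # A higher value means fewer categories per word (about 1.1 on
            # average at the first level for the development questions).
            categories = assign_categories(sentence, beta)
            chart = parse(sentence, categories)
            if chart is not None:  # a spanning analysis was found
                return chart
        return None  # no analysis at any level

As discussed below, most of the questions are parsed at the first, tightest level. </Paragraph>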
<Paragraph position="3"> The parser provided an analysis for all but one of the 171 questions.</Paragraph> <Paragraph position="4"> The first row of Table 3 gives the per-word, and sentence, category accuracy for the parser output. Figures are also given for the accuracy of the categories assigned to the first word What. The figures show that the parser is more accurate at supertagging than the single-category supertagger.</Paragraph> <Paragraph position="5"> The second row gives the results if the original supertagging approach of Clark et al. (2002) is used, i.e. starting with a high number of categories per word, and reducing the number if the sentence cannot be parsed within reasonable space and time constraints. The third row corresponds to our new supertagging approach, but chooses a derivation at random, by randomly traversing the packed chart representation used by the parser. The fourth row corresponds to the supertagging approach of Clark et al. (2002), together with a random selection of the derivation. The baseline method in the fifth row assigns to a word the category most frequently seen with it in the data; for unseen words N is assigned.</Paragraph> <Paragraph position="6"> The results in Table 3 demonstrate that our new supertagging approach is very effective. The reason is that the parser typically uses the first supertagger level, where the average number of categories per word is only 1.1, and the per-word and sentence category accuracies are 95.5% and 70.8%, respectively. 136 of the 171 questions (79.5%) are parsed at this level. Since the number of categories per word is very small, the parser has little work to do in combining the categories; the supertagger is effectively an almost-parser (Bangalore and Joshi, 1999). Thus the parsing model, which is not tuned for questions, is hardly used by the parser. This interpretation is supported by the high scores for the random method in row 3 of the table.</Paragraph> <Paragraph position="7"> In contrast, the previous supertagging method of Clark et al. (2002) results in a large derivation space, which must be searched using the parsing model. Thus the accuracy of the parser is greatly reduced, as shown in rows 2 and 4.</Paragraph> <Paragraph position="8"> As a final test of the robustness of our results, we performed a cross-validation experiment using the 1,000 training questions. The 1,000 questions were randomly split into 10 chunks. Each chunk was used as a test set in a separate run, with the remaining chunks used as training data together with CCGbank.</Paragraph>
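<Paragraph> The set-up can be sketched as follows; train_supertagger and evaluate are hypothetical stand-ins for the actual tools, so this is an outline of the procedure rather than the scripts that were used.

    # Sketch of the 10-fold cross-validation over the 1,000 training questions.
    import random

    def cross_validate(questions, ccgbank, train_supertagger, evaluate, folds=10):
        questions = questions[:]      # work on a copy before shuffling
        random.shuffle(questions)
        chunk = len(questions) // folds
        scores = []
        for i in range(folds):
            test = questions[i * chunk : (i + 1) * chunk]
            rest = questions[:i * chunk] + questions[(i + 1) * chunk:]
            # Each chunk is held out in turn; the remaining chunks are combined
            # with CCGbank as training data.
            model = train_supertagger(rest + ccgbank)
            scores.append(evaluate(model, test))
        # Report the accuracy averaged over the 10 runs, as in Table 4.
        return sum(scores) / len(scores)
</Paragraph>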
<Paragraph position="9"> Table 4 gives the results averaged over the 10 runs for the two supertagging approaches (per-word / SENT / WHAT category accuracy): increasing av. cats, 94.4 / 79 / 92; decreasing av. cats, 89.5 / 64 / 81.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.3 Object Extraction in Questions </SectionTitle> <Paragraph position="0"> For the object extraction evaluation we considered the 36 questions in the development data which have the category (S[wq]/(S[q]/NP))/N assigned to What. Table 7 gives examples of the questions. We assume these are fairly representative of the kinds of object extraction found in other question types, and thus present a useful test set.</Paragraph> <Paragraph position="1"> We parsed the questions using the best performing configuration from the previous section. All but one of the sentences was given an analysis. The per-word and sentence category accuracies were 90.2% and 71.4%, respectively. These figures are lower than for the corpus as a whole, suggesting that these object extraction questions are more difficult than average.</Paragraph> <Paragraph position="2"> We inspected the output to see if the object dependencies had been recovered correctly. To get the object dependency correct in the first question in Table 7, for example, the parser would need to assign the correct category to take and return amount as the object of take. Of the 37 extracted object dependencies (one question had two such dependencies), 29 (78.4%) were recovered correctly. Given that the original parser trained on CCGbank performs extremely poorly on such questions, we consider this to be a highly promising result.</Paragraph> </Section> </Section> </Paper>