<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1604"> <Title>Detecting Parser Errors Using Web-based Semantic Filters</Title> <Section position="4" start_page="27" end_page="31" type="metho"> <SectionTitle> 2 Semantic Filtering </SectionTitle> <Paragraph position="0"> This section describes semantic filtering as implemented in the WOODWARD system. WOODWARD consists of two components: a semantic interpreter that takes a parse tree and converts it to a conjunction of first-order predicates, and a sequence of four increasingly sophisticated methods that check semantic plausibility of conjuncts on the Web. Below, we describe each component in turn.</Paragraph> <Section position="1" start_page="28" end_page="28" type="sub_section"> <SectionTitle> 2.1 Semantic Interpreter </SectionTitle> <Paragraph position="0"> The semantic interpreter aims to make explicit the relations that a sentence introduces, and the arguments to each of those relations. More specifically, the interpreter identifies the main verb relations, preposition relations, and semantic type relations in a sentence; identifies the number of arguments to each relation; and ensures that for every argument that two relations share in the sentence, they share a variable in the logical representation.</Paragraph> <Paragraph position="1"> Given a sentence and a Penn-Treebank-style parse of that sentence, the interpreter outputs a conjunction of First-Order Logic predicates. We call this representation a relational conjunction (RC). Each relation in an RC consists of a relation name and a tuple of variables and string constants representing the arguments of the relation. As an example, Figure 1 contains a sentence taken from the TREC 2003 corpus, parsed by the Collins parser. Figure 2 shows the correct RC for this sentence and the RC derived automatically from the incorrect parse.</Paragraph> <Paragraph position="2"> Due to space constraints, we omit details about the algorithm for converting a parse into an RC, but Moldovan et al. (2003) describe a method similar to ours.</Paragraph> </Section> <Section position="2" start_page="28" end_page="28" type="sub_section"> <SectionTitle> 2.2 Semantic Filters </SectionTitle> <Paragraph position="0"> Given the RC representation of a parsed sentence as supplied by the Semantic Interpreter, we test the parse using four web-based methods. Fundamentally, the methods all share the underlying principle that some form of co-occurrence of terms in the vast Web corpus can help decide whether a proposed relationship is semantically plausible.</Paragraph> <Paragraph position="1"> Traditional statistical parsers also use co-occurrence of lexical heads as features for making parse decisions. We expand on this idea in two ways: first, we use a corpus several orders of magnitude larger than the tagged corpora traditionally used to train statistical parsers, so that the fundamental problem of data sparseness is ameliorated.</Paragraph> <Paragraph position="2"> Second, we search for targeted patterns of words to help judge specific properties, like the number of complements to a verb. We now describe each of our techniques in more detail.</Paragraph> </Section>
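Before turning to the individual filters, here is a minimal sketch of how an RC might be represented as data. The sentence, relation names, and argument conventions are invented for illustration (loosely mirroring the style of the producing(VP1, NP3, NP2, NP1) predicate from Figure 2); they are not the Figure 1 example.

# Minimal sketch of a relational conjunction (RC); names are illustrative only.
from collections import namedtuple

# Each relation is a relation name plus a tuple of variables/string constants.
Relation = namedtuple("Relation", ["name", "args"])

# Hypothetical sentence "Shakespeare wrote plays in London": one verb relation,
# one preposition relation, and semantic type relations. Arguments shared by
# two relations share a variable (e.g., NP2 below).
rc = [
    Relation("wrote", ("VP1", "NP1", "NP2")),   # main verb relation
    Relation("in", ("VP1", "NP3")),             # preposition relation (verb attachment)
    Relation("person", ("NP1",)),               # semantic type relations
    Relation("play", ("NP2",)),
    Relation("London", ("NP3",)),
]
for r in rc:
    print(f"{r.name}({', '.join(r.args)})")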
<Section position="3" start_page="28" end_page="29" type="sub_section"> <SectionTitle> 2.3 A PMI-Based Filter </SectionTitle> <Paragraph position="0"> A number of authors have demonstrated important ways in which search engines can be used to uncover semantic relationships, especially Turney's notion of pointwise mutual information (PMI) based on search-engine hit counts (Turney, 2001).</Paragraph> <Paragraph position="1"> WOODWARD's PMI-Based Filter (PBF) uses PMI scores as features in a learned filter for predicates. Following Turney, we use the formula below for the PMI between two terms t1 and t2:</Paragraph> <Paragraph position="2"> PMI(t1, t2) = Hits(t1 AND t2) / (Hits(t1) x Hits(t2))   (1) </Paragraph> <Paragraph position="3"> We use PMI scores to judge the semantic plausibility of an RC conjunct as follows. We construct a number of different phrases, which we call discriminator phrases, from the name of the relation and the head words of each argument. For example, the prepositional attachment &quot;operations of 65 cents&quot; would yield phrases like &quot;operations of&quot; and &quot;operations of * cents&quot;. (The '*' character is a wildcard in the Google interface; it can match any single word.) We then collect hitcounts for each discriminator phrase, as well as for the relation name and each argument head word, and compute a PMI score for each phrase, using the phrase's hitcount as the numerator in Equation 1.</Paragraph> <Paragraph position="4"> Given a set of such PMI scores for a single relation, we apply a learned classifier to decide if the PMI scores justify calling the relation implausible.</Paragraph> <Paragraph position="5"> This classifier (as well as all of our other ones) is trained on a set of sentences from TREC and the Penn Treebank; our training and test sets are described in more detail in section 3. We parsed each sentence automatically using Daniel Bikel's implementation of the Collins parsing model,1 trained on sections 2-21 of the Penn Treebank, and then applied our semantic interpreter algorithm to come up with a set of relations. We labeled each relation by hand for correctness. Correct relations are positive examples for our classifier; incorrect relations are negative examples (and likewise for all of our other classifiers). We used the LIBSVM software package2 to learn a Gaussian-kernel support vector machine model from the PMI scores collected for these relations.</Paragraph> <Paragraph position="6"> We can then use the classifier to predict if a relation is correct or not depending on the various PMI scores we have collected.</Paragraph> <Paragraph position="7"> Because we require different discriminator phrases for preposition relations and verb relations, we actually learn two different models.</Paragraph> <Paragraph position="8"> After extensive experimentation, optimizing for training set accuracy using leave-one-out cross-validation, we ended up using only two patterns for verbs: &quot;noun verb&quot; (&quot;verb noun&quot; for non-subjects) and &quot;noun * verb&quot; (&quot;verb * noun&quot; for non-subjects). We use the PMI scores from the argument whose PMI values add up to the lowest value as the features for a verb relation, the intuition being that the relation is correct only if every argument to it is valid.</Paragraph>
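A minimal sketch of this verb-relation scoring is shown below. The hits() lookup is a stand-in for a search-engine hitcount query (it is not PBF's actual interface), and the toy counts are invented for illustration.

# Sketch of PBF feature construction for verb relations (illustrative only).

def pmi(joint_hits, hits1, hits2):
    """Equation 1: joint hitcount over the product of the individual hitcounts
    (a floor of 1 guards against zero counts)."""
    return joint_hits / max(hits1 * hits2, 1)

def verb_discriminators(verb, arg_head, is_subject):
    """The two verb patterns described above: "noun verb" and "noun * verb"
    (argument order flipped for non-subjects)."""
    if is_subject:
        return [f'"{arg_head} {verb}"', f'"{arg_head} * {verb}"']
    return [f'"{verb} {arg_head}"', f'"{verb} * {arg_head}"']

def pbf_features(verb, args, hits):
    """Compute PMI scores per argument and keep those of the argument whose
    PMI values sum to the lowest value, as described in the text."""
    per_arg = []
    for arg_head, is_subject in args:
        scores = [pmi(hits(p), hits(f'"{verb}"'), hits(f'"{arg_head}"'))
                  for p in verb_discriminators(verb, arg_head, is_subject)]
        per_arg.append(scores)
    return min(per_arg, key=sum)

# Toy hitcount table standing in for web queries:
toy_counts = {'"name"': 900, '"committee"': 500,
              '"committee name"': 40, '"committee * name"': 25}
print(pbf_features("name", [("committee", True)],
                   lambda q: toy_counts.get(q, 1)))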
<Paragraph position="9"> For prepositions, we use a larger set of patterns.</Paragraph> <Paragraph position="10"> Letting arg1 and arg2 denote the head words of the two arguments to a preposition, and letting prep denote the preposition itself, we used the patterns &quot;arg1 prep&quot;, &quot;arg1 prep * arg2&quot;, &quot;arg1 prep the arg2&quot;, &quot;arg1 * arg2&quot;, and, for verb attachments, &quot;arg1 it prep arg2&quot; and &quot;arg1 them prep arg2&quot;. These last two patterns are helpful for preposition attachments to strictly transitive verbs.</Paragraph> </Section> <Section position="4" start_page="29" end_page="30" type="sub_section"> <SectionTitle> 2.4 The Verb Arity Sampling Test </SectionTitle> <Paragraph position="0"> In our training set from the Penn Treebank, 13% of the time the Collins parser chooses too many or too few arguments to a verb. In this case, checking the PMI between the verb and each argument independently is insufficient, and there is not enough data to check the verb together with all of its arguments at once. We therefore use a different type of filter in order to detect these errors, which we call the Verb Arity Sampling Test (VAST).</Paragraph> <Paragraph position="1"> Instead of testing a verb to see if it can take a particular argument, we test if it can take a certain number of arguments. The verb predicate producing(VP1, NP3, NP2, NP1) in interpretation 2 of Figure 2, for example, has too many arguments.</Paragraph> <Paragraph position="2"> To check if this predicate can actually take three noun phrase arguments, we can construct a common phrase containing the verb, with the property that if the verb can take three NP arguments, the phrase will often be followed by an NP in text, and will rarely be followed by one if it cannot. An example of such a phrase is &quot;which it is producing.&quot; Since &quot;which&quot; and &quot;it&quot; are so common, this phrase will appear many times on the Web. Furthermore, for verbs like &quot;producing,&quot; there will be very few sentences in which this phrase is followed by an NP (mostly temporal noun phrases like &quot;next week&quot;). But for verbs like &quot;give&quot; or &quot;name,&quot; which can accept three noun phrase arguments, there will be significantly more sentences where the phrase is followed by an NP.</Paragraph> <Paragraph position="3"> The VAST algorithm is built upon this observation. For a given verb phrase, VAST first counts the number of noun phrase arguments. The Collins parser also marks clause arguments as being essential by annotating them differently. VAST counts these as well, and considers the sum of the noun and clause arguments as the number of essential arguments. If the verb is passive and the number of essential arguments is one, or if the verb is active and the number of essential arguments is two, VAST performs no check. We call these strictly transitive verb relations. If the verb is passive and there are two essential arguments, or if the verb is active and there are three, it performs the ditransitive check below. If the verb is active and there is one essential argument, it does the intransitive check described below. We call these two cases collectively nontransitive verb relations. In both cases, the checks produce a single real-valued score, and we use a linear kernel SVM to identify an appropriate threshold such that predicates above the threshold have the correct arity.</Paragraph>
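The case analysis above can be summarized in a short sketch (illustrative only; the voice and argument counts are assumed to come from the Collins parse as described).

# Sketch of VAST's arity-based case analysis (illustrative).

def vast_case(is_passive, num_np_args, num_clause_args):
    """Return which check, if any, VAST applies to a verb predicate."""
    essential = num_np_args + num_clause_args
    if (is_passive and essential == 1) or (not is_passive and essential == 2):
        return "no check (strictly transitive verb relation)"
    if (is_passive and essential == 2) or (not is_passive and essential == 3):
        return "ditransitive check"
    if not is_passive and essential == 1:
        return "intransitive check"
    return "no check"

# The over-parsed predicate producing(VP1, NP3, NP2, NP1) from Figure 2 is an
# active verb with three essential arguments, so it gets the ditransitive check.
print(vast_case(is_passive=False, num_np_args=3, num_clause_args=0))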
<Paragraph position="4"> The ditransitive check begins by querying Google for two hundred documents containing the phrase &quot;which it verb&quot; or &quot;which they verb&quot;. It downloads each document and identifies the sentences containing the phrase. It then POS-tags and NP-chunks the sentences using a maximum entropy tagger and chunker. It filters out any sentences for which the word &quot;which&quot; is preceded by a preposition. Finally, if there are enough sentences remaining (more than ten), it counts the number of sentences in which the verb is directly followed by a noun phrase chunk, which we call an extraction. It then calculates the ditransitive score for verb v as the ratio of the number of extractions E to the number of filtered sentences F:</Paragraph> <Paragraph position="5"> ditransitiveScore(v) = E / F </Paragraph> <Paragraph position="6"> The intransitive check performs a very similar set of operations. It fetches up to two hundred sentences matching the phrases &quot;but it verb&quot; or &quot;but they verb&quot;, tags and chunks them, and extracts noun phrases that directly follow the verb.</Paragraph> <Paragraph position="7"> It calculates the intransitive score for verb v using the number of extractions E and sentences S as:</Paragraph> <Paragraph position="8"> intransitiveScore(v) = 1 - E / S </Paragraph> </Section>
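A minimal sketch of the two checks, under the reconstructed scores given above (higher score means the parsed arity is more plausible). The sentence lists are assumed to come from the Google queries just described, and np_follows() stands in for the maximum-entropy tagger/chunker decision; neither is the paper's actual tooling.

# Sketch of the VAST web checks (illustrative; score forms as reconstructed above).

PREPOSITIONS = {"in", "of", "to", "for", "with", "at", "by", "on", "from"}

def which_follows_preposition(tokens):
    """True if the 'which' of the matched phrase is preceded by a preposition."""
    return any(tok == "which" and i and tokens[i - 1] in PREPOSITIONS
               for i, tok in enumerate(tokens))

def ditransitive_score(sentences, verb, np_follows):
    """E / F over 'which it/they verb' sentences, after filtering."""
    filtered = [s for s in sentences if not which_follows_preposition(s)]
    if len(filtered) > 10:                    # "more than ten" sentences required
        extractions = sum(1 for s in filtered if np_follows(s, verb))
        return extractions / len(filtered)
    return None                               # too little evidence; no score

def intransitive_score(sentences, verb, np_follows):
    """1 - E / S over 'but it/they verb' sentences: high when the verb is
    rarely followed by an NP, i.e., when an intransitive parse is plausible."""
    if not sentences:
        return None
    extractions = sum(1 for s in sentences if np_follows(s, verb))
    return 1.0 - extractions / len(sentences)

# Toy usage: token lists stand in for chunked sentences, and the toy chunker
# treats a capitalized next token as the start of an NP chunk.
sample = [["which", "it", "is", "producing", "Next", "week"],
          ["products", "which", "it", "is", "producing", "at", "low", "cost"]]
def toy_np_follows(tokens, verb):
    i = tokens.index(verb)
    return i + 1 != len(tokens) and tokens[i + 1][0].isupper()
print(ditransitive_score(sample * 6, "producing", toy_np_follows))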
<Section position="5" start_page="30" end_page="30" type="sub_section"> <SectionTitle> 2.5 TextRunner Filter </SectionTitle> <Paragraph position="0"> TextRunner is a new kind of web search engine.</Paragraph> <Paragraph position="1"> Its design is described in detail elsewhere (Cafarella et al., 2006), but we utilize its capabilities in WOODWARD. TextRunner provides a search interface to a set of over a billion triples of the form (object string, predicate string, object string) that have been extracted automatically from approximately 90 million documents to date.</Paragraph> <Paragraph position="2"> The search interface takes queries of the form (string1,string2,string3), and returns all tuples for which each of the three tuple strings contains the corresponding query string as a substring. TextRunner's object strings are very similar to the standard notion of a noun phrase chunk. The notion of a predicate string, on the other hand, is loose in TextRunner; a variety of POS sequences will match the patterns for an extracted relation.</Paragraph> <Paragraph position="3"> For example, a search for tuples with a predicate containing the word 'with' will yield the tuple (risks, associated with dealing with, waste wood), among thousands of others.</Paragraph> <Paragraph position="4"> TextRunner embodies a trade-off with the PMI method for checking the validity of a relation. Its structure provides a much more natural search for the purpose of verifying a semantic relationship, since it has already arranged Web text into predicates and arguments. It is also much faster than querying a search engine like Google, both because we have local access to it and because commercial search engines tightly limit the number of queries an application may issue per day. On the other hand, the TextRunner index is at present still about two orders of magnitude smaller than Google's search index, due to limited hardware.</Paragraph> <Paragraph position="5"> The TextRunner semantic filter checks the validity of an RC conjunct in a natural way: it asks TextRunner for the number of tuples that match the argument heads and relation name of the conjunct being checked. Since TextRunner predicates only have two arguments, we break the conjunct into trigrams and bigrams of head words, and average over the hitcounts for each. For a predicate P(A1, ..., An) with n >= 2, the score is the average of the TextRunner hitcounts over these bigram and trigram queries.</Paragraph> <Paragraph position="7"> As with PBF, we learn a threshold for good predicates using the LIBSVM package.</Paragraph> </Section>
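A minimal sketch of this scoring scheme follows. textrunner_count() is a stand-in for the actual tuple-search interface, and the particular choice of bigram and trigram queries is one plausible reading of the description above, not a specification from the paper.

# Sketch of the TextRunner filter score (illustrative decomposition).

def conjunct_score(predicate, args, textrunner_count):
    """Average hitcount over bigram and trigram queries built from the
    relation name and the argument head words."""
    queries = []
    for i, a in enumerate(args):
        queries.append((a, predicate, ""))         # bigram: (arg, P, anything)
        for b in args[i + 1:]:
            queries.append((a, predicate, b))      # trigram: (arg, P, arg)
    counts = [textrunner_count(q) for q in queries]
    return sum(counts) / len(counts)

# Toy stand-in index: count tuples whose fields contain the query strings.
toy_tuples = [("operations", "of", "65 cents"), ("profit", "of", "65 cents")]
def toy_count(query):
    return sum(all(q in field for q, field in zip(query, t)) for t in toy_tuples)

print(conjunct_score("of", ["operations", "cents"], toy_count))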
<Section position="6" start_page="30" end_page="31" type="sub_section"> <SectionTitle> 2.6 Question Answering Filter </SectionTitle> <Paragraph position="0"> When parsing questions, an additional method of detecting incorrect parses becomes available: use a question answering (QA) system to find answers.</Paragraph> <Paragraph position="1"> If a QA system using the parse can find an answer to the question, then the question was probably parsed correctly.</Paragraph> <Paragraph position="2"> To test this theory, we implemented a lightweight, simple, and fast QA system that directly mirrors the semantic interpretation. It relies on TextRunner and KnowItNow (Cafarella et al., 2005) to quickly find possible answers, given the relational conjunction (RC) of the question.</Paragraph> <Paragraph position="3"> KnowItNow is a state-of-the-art Information Extraction system that uses a set of domain-independent patterns to efficiently find hyponyms of a class.</Paragraph> <Paragraph position="4"> We formalize the process as follows: define a question as a set of variables Xi corresponding to noun phrases, a set of noun type predicates Ti(Xi), and a set of relational predicates Pi(Xi1,...,Xik) which relate one or more variables and constants. The conjunction of type and relational predicates is precisely the RC.</Paragraph> <Paragraph position="5"> We define an answer as a set of values for each variable that satisfies all types and predicates.</Paragraph> <Paragraph position="7"> The algorithm first computes the RC of the question sentence, then uses KnowItNow to find instances of each type predicate and TextRunner to find tuples matching each relational predicate, and returns the variable assignments that satisfy the full conjunction. The QA semantic filter runs this Question Answering algorithm. If the number of returned answers is above a threshold (1 in our case), it indicates the question has been parsed correctly. Otherwise, it indicates an incorrect parse. This differs from the TextRunner semantic filter in that it tries to find subclasses and instances, rather than just argument heads.</Paragraph> </Section> <Section position="7" start_page="31" end_page="31" type="sub_section"> <SectionTitle> 2.7 The WOODWARD Filter </SectionTitle> <Paragraph position="0"> Each of the above semantic filters has its strengths and weaknesses. On our training data, TextRunner had the most success of any of the methods on classifying verb relations that did not have arity errors. Because of sparse data problems, however, it was less successful than PMI on preposition relations. The QA system had the interesting property that when it predicted an interpretation was correct, it was always right; however, when it made a negative prediction, its results were mixed.</Paragraph> <Paragraph position="1"> WOODWARD combines the four semantic filters in a way that draws on each of their strengths.</Paragraph> <Paragraph position="2"> First, it checks if the sentence is a question that does not contain prepositions. If so, it runs the QA module and returns true if that module does.</Paragraph> <Paragraph position="3"> After trying the QA module, WOODWARD checks each predicate in turn. If the predicate is a preposition relation, it uses PBF to classify it. For nontransitive verb relations, it uses VAST.</Paragraph> <Paragraph position="4"> For strictly transitive verb relations, it uses TextRunner. WOODWARD accepts the RC if every relation is predicted to be correct; otherwise, it rejects it.</Paragraph> </Section> </Section>
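A sketch of this combination logic is given below (illustrative; the four filter functions are assumed to wrap the components of Sections 2.3-2.6, and the simple attribute classes are invented for the example).

# Sketch of how WOODWARD combines the filters (illustrative only).
from collections import namedtuple

Sentence = namedtuple("Sentence", "is_question has_prepositions")
Relation = namedtuple("Relation", "kind name args")   # adds a relation kind

def woodward_accepts(sentence, rc, qa_filter, pbf, vast, textrunner_filter):
    """Return True if WOODWARD predicts the parse behind this RC is correct."""
    # Questions without prepositions go to the QA filter first; a positive
    # answer is trusted outright, since the QA filter's positive predictions
    # were always right on the training data.
    if sentence.is_question and not sentence.has_prepositions:
        if qa_filter(rc):
            return True
    # Otherwise every relation must pass the filter suited to its type.
    for relation in rc:
        if relation.kind == "preposition":
            ok = pbf(relation)
        elif relation.kind == "nontransitive_verb":
            ok = vast(relation)
        else:                          # strictly transitive verb relations
            ok = textrunner_filter(relation)
        if not ok:
            return False
    return True

# Toy usage with trivial stand-in filters:
rc = [Relation("preposition", "of", ("operations", "cents"))]
print(woodward_accepts(Sentence(False, True), rc,
                       qa_filter=lambda conj: False,
                       pbf=lambda r: True,
                       vast=lambda r: True,
                       textrunner_filter=lambda r: True))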
<Section position="5" start_page="31" end_page="33" type="metho"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> In our experiments we tested the ability of WOODWARD to detect bad parses. Our experiments proceeded as follows: we parsed a set of sentences, ran the semantic interpreter on them, and labeled each parse and each relation in the resulting RCs for correctness. We then extracted all of the necessary information from the Web and TextRunner.</Paragraph> <Paragraph position="1"> We divided the sentences into a training and test set, and trained the filters on the labeled RCs from the training sentences. Finally, we ran each of the filters and WOODWARD on the test set to predict which parses were correct. We report the results below, but first we describe our datasets and tools in more detail.</Paragraph> <Section position="1" start_page="31" end_page="31" type="sub_section"> <SectionTitle> 3.1 Datasets and Tools </SectionTitle> <Paragraph position="0"> Because question-answering is a key application, we began with data from the TREC question-answering track. We split the data into a training set of 61 questions (all of the TREC 2002 and TREC 2003 questions), and a test set of 55 questions (all list and factoid questions from TREC 2004). We preprocessed the questions to remove parentheticals (this affected 3 training questions and 1 test question). We removed 12 test questions because the Collins parser did not parse them as questions,3 and that error was too easy to detect.</Paragraph> <Paragraph position="1"> 25 training questions had the same error, but we left them in to provide more training data.</Paragraph> <Paragraph position="2"> We used the Penn Treebank as our second data set. Training sentences were taken from section 22, and test sentences from section 23. Because PBF is time-consuming, we took a subset of 100 sentences from each section to expedite our experiments. We extracted from each section the first 100 sentences that did not contain conjunctions, and for which all of the errors, if any, were contained in preposition and verb relations.</Paragraph> <Paragraph position="3"> For our parser, we used Bikel's implementation of the Collins parsing model, trained on sections 2-21 of the Penn Treebank. We used only the top-ranked parse for each sentence. For the TREC data only, we first POS-tagged each question using Ratnaparkhi's MXPOST tagger. We judged each of the TREC parses manually for correctness, but scored the Treebank parses automatically.</Paragraph> </Section> <Section position="2" start_page="31" end_page="32" type="sub_section"> <SectionTitle> 3.2 Results and Discussion </SectionTitle> <Paragraph position="0"> Our semantic interpreter was able to produce the appropriate RC for every parsed sentence in our data sets, except for a few minor cases. Two idiomatic expressions in the WSJ caused the semantic interpreter to find noun phrases outside of a clause to fill gaps that were not actually there. And in several sentences with infinitive phrases, the semantic interpreter did not find the extracted subject of the infinitive expression. It turned out that none of these mistakes caused the filters to reject correct parses, so we were satisfied that our results mainly reflect the performance of the filters, rather than the interpreter.</Paragraph> <Paragraph position="1"> [Table 2 (Baseline vs. WOODWARD) - columns: sents., parser eff., then filter prec., filter rec., and F1 for the baseline and for WOODWARD, plus red. err.; the table values are not reproduced here. Caption (partial): ... the Collins parser parsed correctly. See the text for a discussion of our baseline and the precision and recall metrics. We weight precision and recall equally in calculating F1. Reduction in error rate (red. err.) reports the relative decrease in error (error calculated as 1 - F1) over baseline.]</Paragraph> <Paragraph position="2"> In Table 1 we report the accuracy of our first three filters on the task of predicting whether a relation in an RC is correct. We break these results down into three categories for the three types of relations we built filters for: strictly transitive verb relations, nontransitive verb relations, and preposition relations. Since the QA filter works at the level of an entire RC, rather than a single relation, it does not apply here. These results show that the trends on the training data mostly held true: VAST was quite effective at verb arity errors, and TextRunner narrowly beat PBF on the remaining verb errors. However, on our training data PBF narrowly beat TextRunner on preposition errors, and the reverse was true on our test data.</Paragraph> <Paragraph position="3"> Our QA filter predicts whether a full parse is correct with an accuracy of 0.76 on the 17 TREC 2004 questions that had no prepositions. The Collins parser achieves the same level of accuracy on these sentences, so the main benefit of the QA filter for WOODWARD is that it never misclassifies an incorrect parse as a correct one, as was observed on the training set. This property allows WOODWARD to correctly predict a parse is correct whenever it passes the QA filter.</Paragraph> <Paragraph position="4"> Classification accuracy is important for good performance, and we report it to show how effective each of WOODWARD's components is. However, it fails to capture the whole story of a filter's performance. Consider a filter that simply predicts that every sentence is incorrectly parsed: it would have an overall accuracy of 55% on our WSJ corpus, not too much worse than WOODWARD's classification accuracy of 66% on this data. However, such a filter would be useless because it filters out every correctly parsed sentence.</Paragraph> <Paragraph position="5"> Let the filtered set be the set of sentences that a filter predicts to be correctly parsed. The performance of a filter is better captured by two quantities related to the filtered set: first, how &quot;pure&quot; the filtered set is, or how many good parses it contains compared to bad parses; and second, how wasteful the filter is in terms of losing good parses from the original set. We measure these two quantities using metrics we call filter precision and filter recall. Filter precision is defined as the ratio of correctly parsed sentences in the filtered set to total sentences in the filtered set. Filter recall is defined as the ratio of correctly parsed sentences in the filtered set to correctly parsed sentences in the unfiltered set. Note that these metrics are quite different from the labeled constituent precision/recall metrics that are typically used to measure statistical parser performance.</Paragraph>
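A small worked example of filter precision, filter recall, and the equally weighted F1 used in Table 2; the counts below are invented for illustration only.

# Filter precision/recall on (kept_by_filter, parse_is_correct) labels.

def filter_metrics(labels):
    """labels: list of (kept_by_filter, parse_is_correct) booleans."""
    kept = [correct for kept_flag, correct in labels if kept_flag]
    correct_total = sum(correct for _, correct in labels)
    precision = sum(kept) / len(kept) if kept else 0.0
    recall = sum(kept) / correct_total if correct_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented example: 100 sentences, 70 correctly parsed overall; the filter
# keeps 60 sentences, of which 54 are correctly parsed.
labels = [(True, True)] * 54 + [(True, False)] * 6 + \
         [(False, True)] * 16 + [(False, False)] * 24
print(filter_metrics(labels))   # (0.9, 0.771..., 0.830...)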
<Paragraph position="6"> Table 2 shows our overall results for filtering parses using WOODWARD. We compare against a baseline model that predicts every sentence is parsed correctly. WOODWARD outperforms this baseline in precision and F1 measure on both of our data sets.</Paragraph> <Paragraph position="7"> Collins (2000) reports a decrease in error rate of 13% over his original parsing model (the same model as used in our experiments) by performing a discriminative reranking of parses. Our WSJ test set is a subset of the set of sentences used in Collins' experiments, so our results are not directly comparable, but we do achieve a roughly similar decrease in error rate (20%) when we use our filter precision/recall metrics. We also measured the labeled constituent precision and recall of both the original test set and the filtered set, and found a decrease in error rate of 37% according to this metric (corresponding to a jump in F1 from 90.1 to 93.8). Note that in our case, the error is reduced by throwing out bad parses, rather than trying to fix them. The 17% difference between the two decreases in error rate is probably due to the fact that WOODWARD is more likely to detect the worse parses in the original set, which contribute a proportionally larger share of the error in labeled constituent precision/recall in the original test set.</Paragraph> <Paragraph position="8"> WOODWARD performs significantly better on the TREC questions than on the Penn Treebank data. One major reason is that there are far more clause adjuncts in the Treebank data, and adjunct errors are intrinsically harder to detect. Consider the Treebank sentence: &quot;The S&P pit stayed locked at its 30-point trading limit as the Dow average ground to its final 190.58 point loss Friday.&quot; The parser incorrectly attaches the clause beginning &quot;as the Dow . . . &quot; to &quot;locked&quot;, rather than to &quot;stayed.&quot; Our current methods aim to use key words in the clause to determine if the attachment is correct. However, with such clauses there is no single key word that can allow us to make that determination. We anticipate that as the paradigm matures we and others will design filters that can use more of the information in the clause to help make these decisions.</Paragraph> </Section> </Section> </Paper>