<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0902"> <Title>Extracting and Evaluating General World Knowledge from the Brown Corpus</Title> <Section position="9" start_page="0" end_page="0" type="concl"> <SectionTitle> 3 Conclusions and further work </SectionTitle> <Paragraph position="0"> We now know that large numbers of intuitively reasonable general propositions can be extracted from a corpus that has been bracketed in the manner of the Penn Treebank. The number of &quot;surviving&quot; propositions for the Brown corpus, based on the judgements of multiple judges, is certainly in the tens of thousands, and the duplication rate is a rather small fraction of the overall number (about 15%).</Paragraph> <Paragraph position="1"> Of course, there is the problem of screening out, as far as possible, the not-so-reasonable propositions. One step strongly indicated by our experiment on the effect of style is to restrict extraction to the kinds of texts that yield higher success rates - namely those written in straightforward, unadorned language. As we indicated, both style analysis techniques and our own proposition extraction methods could be used to select stylistically suitable materials from large online corpora.</Paragraph> <Paragraph position="2"> Even so, a significant residual error rate will remain.</Paragraph> <Paragraph position="3"> There are two remedies - a short-term, brute-force remedy, and a longer-term computational remedy. The brute-force remedy would be to hand-select acceptable propositions. 
This would be tedious work, but it would still be far less arduous than &quot;dreaming up&quot; such propositions; besides, most of the propositions are of a sort one would not readily come up with spontaneously (&quot;A person may paint a porch&quot;, &quot;A person may plan an attack&quot;, &quot;A house may have a slate roof&quot;, &quot;Superstition may blend with fact&quot;, &quot;Evidence of shenanigans may be gathered by a tape recorder&quot;, etc.). The longer-term computational remedy is to use a well-founded parser and grammar, providing syntactic analyses better suited to semantic interpretation than Treebank trees. Our original motivation for using the Penn Treebank, apart from the fact that it instantly provides a large number of parsed sentences from miscellaneous genres, was to determine how readily such parses might permit semantic interpretation. The Penn Treebank pays little heed to many of the structural principles and features that have preoccupied linguists for decades.</Paragraph> <Paragraph position="4"> Would these turn out to be largely irrelevant to semantics? We were actually rather pessimistic about this, since the Treebank data tacitly posit tens of thousands of phrase structure rules with inflated, heterogeneous right-hand sides, and phrase classifications are very coarse (notably, with no distinctions between adjuncts and complements, and with many clause-like constructs - whether infinitives, subordinate clauses, clausal adverbials, or nominalized questions - lumped together as &quot;SBAR&quot;, though these are surely semantically crucial distinctions). We were therefore surprised at our degree of success in extracting sensible general propositions on the basis of such rough-and-ready syntactic annotations.</Paragraph> <Paragraph position="5"> Nonetheless, our extracted propositions in the &quot;something missing&quot; and &quot;hard to judge&quot; categories do quite often reflect the limitations of the Treebank analyses. 
For example, the incompleteness of the proposition &quot;A male-individual may attach an importance&quot;, seen above as an illustration of judgement category 5, can be attributed to the lack of any indication that the PP[to] constituent of the verb phrase in the source sentence is a verb complement rather than an adjunct. Though our heuristics try to sort out complements from adjuncts, they cannot fully make up for the shortcomings of the Treebank annotations. It therefore seems clear that we will ultimately need to base knowledge extraction on more adequate syntactic analyses than those provided by the Brown annotations.</Paragraph> <Paragraph position="6"> Another general conclusion concerns the ease or difficulty of broad-coverage semantic interpretation. Even though our interpretive goals up to this point have been rather modest, our success in providing rough semantic rules for much of the Brown corpus suggests to us that full, broad-coverage semantic interpretation is not very far out of reach. The reason for optimism lies in the &quot;systematicity&quot; of interpretation. There is no need to hand-construct semantic rules for each and every phrase structure rule. We were able to provide reasonably comprehensive semantic coverage of the many thousands of distinct phrase types in Brown with just 80 regular-expression patterns (each aimed at a class of related phrase types) and corresponding semantic rules. Although our semantic rules do omit some constituents (such as prenominal participles, non-initial conjuncts in coordination, adverbials injected into the complement structure of a verb, etc.) and gloss over subtleties involving gaps (traces), comparatives, ellipsis, presupposition, etc., they are not radical simplifications of what would be required for full interpretation. 
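The pattern-to-rule idea can be sketched roughly as follows. This is a hypothetical illustration only: the flattened tag sequences, the rule pairing, and the output notation are our assumptions for exposition, not the actual 80 patterns or the system's logical-form syntax.

```python
import re

# Hypothetical sketch: each regular expression matches a class of flattened
# Treebank expansions (a parent label followed by its children's labels),
# and the paired rule assembles a schematic proposition from the already
# interpreted constituents. All names and notation here are illustrative.
RULES = [
    # S -> NP VP: the subject-predicate pattern behind "X may Y" propositions.
    (re.compile(r"^S: NP VP$"), lambda subj, pred: f"(a {subj} may {pred})"),
    # NP -> NP PP: a head nominal with an attached prepositional modifier.
    (re.compile(r"^NP: NP PP$"), lambda head, mod: f"(a {head} {mod})"),
]

def interpret(expansion, constituents):
    """Apply the first rule whose pattern matches the flattened expansion."""
    for pattern, rule in RULES:
        if pattern.match(expansion):
            return rule(*constituents)
    return None  # no pattern covers this phrase type

print(interpret("S: NP VP", ["person", "paint a porch"]))
# -> (a person may paint a porch)
```

Because a single pattern can absorb a whole class of related expansions (for instance, via alternation over optional adverbials), a small inventory of patterns can span thousands of distinct Treebank phrase types, which is the source of the &quot;systematicity&quot; noted above.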
The simplicity of our outputs is due not so much to oversimplification of the semantic rules as to the deliberate abstraction and culling of information that we perform in extracting general propositions from a specific sentence. Of course, what we mean here by semantic interpretation is just a mapping to logical form. Our project sheds no light on the larger issues in text understanding such as referent determination, temporal analysis, inference of causes, intentions and rhetorical relations, and so on. It was the relative independence of the kind of knowledge we are extracting from these issues that made our project attractive and feasible in the first place.</Paragraph> <Paragraph position="7"> Among the miscellaneous improvements under consideration are the use of lexical distinctions and WordNet abstraction to arrive at more reliable interpretations; the use of modules to determine the types of neuter pronouns and of traces (e.g., in &quot;She looked in the cookie jar, but it was empty&quot;, we should be able to abstract the proposition that a cookie jar may be empty, using the referent of &quot;it&quot;); and extracting properties of events by making use of information in adverbials (e.g., from &quot;He slept soundly&quot; we should be able to abstract the proposition that sleep may be sound; also, many causal propositions can be inferred from adverbial constructions). We also hope to demonstrate extraction results through knowledge elicitation questions (e.g., &quot;What do you know about books?&quot;, etc.).</Paragraph> </Section></Paper>