<?xml version="1.0" standalone="yes"?> <Paper uid="A92-1027"> <Title>An Efficient Chart-based Algorithm for Partial-Parsing of Unrestricted Texts</Title> <Section position="2" start_page="0" end_page="194" type="intro"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Much of the research being done in parsing today is directed towards the problem of information extraction, sometimes referred to as &quot;message processing&quot; or &quot;populating data-bases&quot;. The goal of this work is to develop systems that can robustly extract information from massive corpora of unrestricted (&quot;open&quot;) texts. We have developed such a system and applied it to the task of extracting information about changes in employment found in articles from the &quot;Who's News&quot; column of the Wall Street Journal. We call the system &quot;Sparser&quot;. There are many possible design goals for a parser. For Sparser we have three: * It must handle unrestricted texts, to be taken directly from online news services without human intervention or preprocessing.</Paragraph> <Paragraph position="1"> * It must operate efficiently and robustly, and be able to handle articles of arbitrary size using a fixed, relatively small set of resources.</Paragraph> <Paragraph position="2"> * Its purpose is the identification and extraction of specifically targeted, literal information in order to populate a database. Linguistically motivated structural descriptions of the text are a means to that end, not an end in themselves.</Paragraph> <Paragraph position="3"> The rest of this introduction considers the consequences of these goals for the parser's design. The most important of these is how to cope with the fact that many of the words in the text will be unknown, which we will take up first. We then look at the consequences of designing the parser for the specific purpose of information extraction.
In section two we will look at how well Sparser has been able to meet these goals after roughly fifteen months of development following more than two years of experimentation with other designs, and then go on in the later sections to situate and describe its phrase structure algorithm.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 &quot;Partial Parsing&quot; </SectionTitle> <Paragraph position="0"> Attempting to understand an entire text, or even to give all of it a structural description, is well beyond the state of the art for today's parsers on unrestricted texts. Instead, text extraction systems are typically designed to recover only a single kind of information. They focus their analyses on only those portions of the text where this information occurs, skipping over the other portions or giving them only a minimal analysis. Following an emerging convention, we will call such a system a partial parser.</Paragraph> <Paragraph position="1"> The competence of a partial parser may be compared with that of people who are learning a second language. They can be expected to know most of the language's syntax, function words, and morphology, to know certain idioms, the conventions for forming common constructions such as names, dates, times, or amounts of money, and the high-frequency open-class vocabulary such as &quot;said&quot;, &quot;make&quot;, &quot;new&quot;, etc. However, their knowledge of the vocabulary and phrasings for specific subjects will be severely limited and particular.</Paragraph> <Paragraph position="2"> A few topics will be understood quite well, others not at all.</Paragraph> <Paragraph position="3"> Their limited comprehension notwithstanding, such people can nevertheless scan entire articles, accurately picking out and analyzing those parts that are about topics within their competence, while ignoring the rest except perhaps for isolated names and phrasal fragments.
They will know when they have correctly understood a portion of the text that fell within their competence, and can gauge how thorough or reliable their understanding of these segments is.</Paragraph> </Section> <Section position="2" start_page="0" end_page="193" type="sub_section"> <SectionTitle> 1.2 The impact of unknown words </SectionTitle> <Paragraph position="0"> Mirroring this kind of partial but precise competence in a parser is not simply a matter of finding the portions of the text on the understood topics and then proceeding to analyze them as a normal system would. Such a strategy will not work because instances of unknown words and subject matter can occur at any granularity--not just paragraphs and sentences, but appositives, clausal adjuncts, adverbials, all the way down to adjectives and compound nouns within otherwise understandable NPs. For example, an understandable subject-verb combination may be separated by an appositive outside the system's competence, understandable PP-adjuncts separated by incomprehensible ones, and so on.</Paragraph> <Paragraph position="1"> The example below (from the Wall Street Journal for February 14, 1991) shows a case of a relative clause, off the topic of employment change, situated between an understood subject NP and an understood VP. (Understood segments shown in bold.) ... Robert A. Beck, a 65-year-old former Prudential chairman who originally bought the brokerage firm, was named chief executive of Prudential Bache ....</Paragraph> <Paragraph position="2"> As a result of this and other factors, the design of a partial parser must be adapted to a new set of expectations quite different from the customary experience working with carefully chosen example sentences or even with most question-answering systems.
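The fragmentation that unknown vocabulary produces can be pictured with a small sketch (our illustration only, not Sparser's code; the known-vocabulary set and the notion of a span are invented here for the example):

```python
# Toy illustration: split a token stream into "understood" spans
# (runs of known vocabulary) and unanalyzed gaps, mimicking the
# partial analyses described in the text. The vocabulary below is
# invented for this example; a real lexicon is far richer.
KNOWN = {"Robert", "A.", "Beck", "was", "named", "chief",
         "executive", "of", "Prudential", "Bache"}

def partial_spans(tokens):
    """Return a list of (is_known, [tokens]) runs, in order."""
    spans = []
    for tok in tokens:
        known = tok in KNOWN
        if spans and spans[-1][0] == known:
            spans[-1][1].append(tok)
        else:
            spans.append((known, [tok]))
    return spans

sentence = ("Robert A. Beck , a 65-year-old former Prudential chairman "
            "who originally bought the brokerage firm , was named chief "
            "executive of Prudential Bache").split()

for known, toks in partial_spans(sentence):
    tag = "UNDERSTOOD" if known else "skipped"
    print(tag, " ".join(toks))
```

Run on the Beck sentence above, the sketch recovers the subject NP and the VP as understood runs, while the off-topic appositive falls into skipped gaps except for the isolated known name &quot;Prudential&quot;, loosely mirroring the second-language-reader behavior described in section 1.1.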
In particular: (1) Don't expect to complete full sentences. Because the unknown vocabulary can occur at any point, one cannot assume that the parser will be able to reliably recover sentence boundaries, and its grammar should not depend on that ability.</Paragraph> <Paragraph position="3"> To this end, Sparser parses opportunistically and bottom up rather than predicting that an S will be completed. Its structural descriptions are typically a &quot;forest&quot; of minimal phrasal trees interspersed with unknown words. The only reliable boundaries are those signalled orthographically, such as paragraphs.</Paragraph> <Paragraph position="4"> (2) Develop new kinds of algorithms for connecting constituents separated by unparsed segments of the text.</Paragraph> <Paragraph position="5"> The standard phrase structure algorithms are based on the completion of rewrite rules that are driven by the adjacency of labeled constituents. When an off-topic and therefore uncompleted text segment intervenes, as in the example just above, an adjacency-based mechanism will not work, and some other mechanism will have to be employed.</Paragraph> <Paragraph position="6"> Sparser includes a semantically-driven search mechanism that scans the forest for compatible phrases whenever a syntactically incomplete or unattached phrase is left after conventional rules have been applied. It is sensitive to the kind of grammatical relation that would have to hold between the two constituents, e.g. subject-predicate, and constrains the search to be sensitive to the features of the partially parsed text between them, e.g.
that if in its search it finds evidence of a tensed verb that is not contained inside a relative clause, then it should abort the search.</Paragraph> <Paragraph position="7"> (3) Supplement the phrase structure rule backbone of the parser with an independent means of identifying phrase boundaries.</Paragraph> <Paragraph position="8"> Very often, the off-topic, unknown vocabulary is encapsulated within quite understandable phrasal contexts. Consider the real example &quot;... this gold mining company was ...&quot;. Here the initial determiner establishes the beginning of a noun phrase, and the tensed auxiliary verb establishes that whatever phrase preceded it has finished (barring adverbs). Forming an NP over the entire phrase is appropriate, even when the words &quot;gold&quot; and &quot;mining&quot; are unknown because they are part of an open-ended vocabulary.</Paragraph> <Paragraph position="9"> Sparser includes a set of function-word-driven phrase boundary rules. It also has a very successful heuristic for forming and categorizing text segments such as this example (&quot;successful&quot; in that it generated no false positives in the test described in §2). Simply stated, if there is a rule in the grammar that combines the first and last edges in a bounded segment (e.g.
a rule that would succeed on the phrase &quot;this company&quot;), then allow that rule to complete, covering the unknown words as well as the known.</Paragraph> </Section> <Section position="3" start_page="193" end_page="194" type="sub_section"> <SectionTitle> 1.3 Objects rather than expressions </SectionTitle> <Paragraph position="0"> Sparser was written to support tasks based on populating databases with commercially significant literal information extracted in real time from online news services, and this requirement has permeated nearly every aspect of its design.</Paragraph> <Paragraph position="1"> In particular, it has meant that it is not adequate to have the output of an analysis be just a syntactic structural description (a parse tree), or even a logical form or its rendering into a database access language like SQL, as is done in many question-answering systems. Instead, the output must be tuples relating individual companies, people, titles, etc., most of which will have never been seen by the system before and which are not available in pre-stored tables.</Paragraph> <Paragraph position="2"> These requirements led to the following features of Sparser's design: * The system includes a domain model, wherein classes and individuals are represented as unique, first-class objects and indexed by keys and tables as they would be in a database, rather than stored as descriptive expressions.</Paragraph> <Paragraph position="3"> * While many individuals will have never been seen before, a very significant number will continually recur: specific titles, months, dates, numbers, etc., and they should be referenced directly. The system includes facilities for defining rules for the parser as a side-effect of defining object classes or individuals in the domain.</Paragraph> <Paragraph position="4"> * Interpretation is done as part of the parsing of linguistic form, rather than as a follow-on process as is customary.
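A minimal sketch of this coupling, using our own invented names rather than Sparser's actual interfaces: defining a domain individual both stores it as a unique indexed object and registers a surface rule, so that when the rule completes, the resulting edge points directly at the object.

```python
# Hypothetical sketch of the side-effect coupling described above:
# defining a domain individual also registers a rule for the parser,
# and a completed edge carries a pointer to the unique domain object
# rather than a descriptive expression. All names are invented.
DOMAIN = {}      # (category, name) key -- unique first-class object
RULES = {}       # surface string -- the domain object it denotes

def define_individual(category, name):
    obj = {"category": category, "name": name}
    DOMAIN[(category, name)] = obj   # indexed as in a database
    RULES[name] = obj                # side-effect: a parser rule
    return obj

def complete_edge(surface_string):
    """On rule completion, the edge points straight at the object."""
    return RULES.get(surface_string)

define_individual("company", "Prudential Bache")
edge = complete_edge("Prudential Bache")
print(edge["category"])   # prints: company
```

The point of the sketch is only the coupling itself: one act of definition yields both a database entry and a parsing rule, so interpretation needs no separate lookup phase after parsing.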
This is greatly facilitated by the next point: * semantic categories are used as the terms in the phrase structure rules.1 Space limitations do not permit properly describing the tie-in from the parsing of structural form (the realm of the parser proper) to the domain model/database. Briefly, a rule-by-rule correspondence is supported between the syntax and the semantics,2 whereby each rewrite rule is given a corresponding interpretation in the model. Individuals and types in the model are structured so that their compositionality mimics the compositional structure of the corresponding English phrase(s). (Footnote 1: This is sometimes referred to as using a &quot;semantic grammar&quot;. This nomenclature can be misleading, as the form and use of the phrase structure grammar is just the same as in a conventional syntactically labeled grammar, i.e. phrases are formed on the basis of the adjacency of labeled constituents or terminals. All that changes is that most of the terms in rules are now labels like &quot;company&quot; or &quot;year&quot;, rather than &quot;NP&quot; or &quot;verb&quot;.)</Paragraph> <Paragraph position="5"> The correspondence is grounded in the means by which individual content words are defined for the parser. Briefly, the point is that whenever one defines a class of objects or particular individuals so that they can be represented in one's domain model, that same act of definition can be made to result in a rule (or rules) being written for the parser so that whenever such a rule completes, the resulting edge can immediately include a pointer to the domain object. Compound objects such as events are formed through the composition of these basic individuals, under the control of the interpretation rules that accompany the rules that dictate the composition of the phrases in the English text.</Paragraph> </Section> </Section> </Paper>