<?xml version="1.0" standalone="yes"?> <Paper uid="C92-1033"> <Title>TTP: A FAST AND ROBUST PARSER FOR NATURAL LANGUAGE</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. SKIP-AND-FIT RECOVERY IN PARSING </SectionTitle> <Paragraph position="0"> A robust parser must deal efficiently with difficult input, whether it is an extra-grammatical string or a string whose complete analysis could be considered too costly.</Paragraph> <Paragraph position="1"> Frequently, these two situations are not distinguishable, especially for long and complex sentences found in free running text. The parser must be able to analyze such strings quickly and produce at least partial structures, imposing preferences when necessary, and even removing or inserting small input fragments, if the data-driven processing falters.</Paragraph> <Paragraph position="2"> For example, in the following sentence, The method is illustrated by the automatic construction of both recursive and iterative programs operating on natural numbers, lists, and trees, in order to construct a program satisfying certain specifications a theorem induced by those specifications is proved, and the desired program is extracted from the proof. the italicized part is likely to cause additional complications in parsing this lengthy string, and the parser may be better off ignoring the fragment altogether. To do so successfully, the parser must close the constituent which is being currently parsed, and possibly a few of its parent constituents, removing corresponding productions from further consideration, until an appropriate production is reactivated. The parser then jumps over the intervening material so as to restart processing of the remainder of the sentence using the newly reactivated production. In the example at hand, suppose that the parser has just read the word specifications and is looking at the following article a. 
Rather than continuing at the present level, the parser reduces the phrase a program satisfying certain specifications to NP, and then traces further reductions: SI → to V NP; SA → SI; S → NP V NP SA, until production S → S and S is reached. [4] Subsequently, the parser skips input to find and, then resumes normal processing. Footnote: The idea of parse &quot;fitting&quot; was partly inspired by the IBM parser (Jensen et al., 1983), as well as by the standard error recovery techniques used in shift-reduce parsing.</Paragraph> <Paragraph position="3"> As may be expected, this kind of action involves a great deal of indeterminacy which, in case of natural language strings, is compounded by the high degree of lexical ambiguity. If the purpose of this skip-and-fit technique is to get the parser smoothly through even the most complex strings, the amount of additional backtracking caused by the lexical level ambiguity is certain to defeat it. Without lexical disambiguation of input, the parser's performance will deteriorate, even if the skipping is limited only to certain types of adverbial adjuncts.</Paragraph> <Paragraph position="4"> The most common cases of lexical ambiguity are those of a plural noun (nns) vs. a singular verb (vbz), a singular noun (nn) vs. a plural or infinitive verb (vbp, vb), and a past tense verb (vbd) vs. a past participle (vbn), as illustrated in the following example.</Paragraph> <Paragraph position="5"> The notation used (vbn or vbd?) explicitly associates (nns or vbz?) a data structure (vb or nn) shared (vbn or vbd?) by concurrent processes (nns or vbz?) with operations defined (vbn or vbd?) on it.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. 
PART OF SPEECH TAGGER </SectionTitle> <Paragraph position="0"> One way of dealing with lexical ambiguity is to use a tagger to preprocess the input, marking each word with a tag that indicates its syntactic categorization: a part of speech with selected morphological features such as number, tense, mode, case and degree.</Paragraph> <Paragraph position="1"> The following are tagged sentences from the CACM-</Paragraph> <Paragraph position="3"> The tags are understood as follows: dt - determiner, nn - singular noun, nns - plural noun, in - preposition, jj - adjective, vbz - verb in present tense third person singular, to - particle &quot;to&quot;, vbg - present participle, vbn - past participle, vbd - past tense verb, vb - infinitive verb, cc - coordinate conjunction.</Paragraph> <Paragraph position="4"> [4] The decision to force a reduction rather than to back up could be triggered by various means. In case of TTP parser, it is always induced by the time-out signal. [5] Tagged using the 35-tag Penn Treebank Tagset created at the University of Pennsylvania.</Paragraph> <Paragraph position="5"> ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 199 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992</Paragraph> <Paragraph position="6"> Tagging of the input text substantially reduces the search space of a top-down parser since it resolves most of the lexical level ambiguities. In the examples above, tagging of presents as &quot;vbz&quot; in the first sentence cuts off a potentially long and costly &quot;garden path&quot; with presents as a plural noun followed by a headless relative clause starting with (that) a proposal .... In the second sentence, tagging resolves ambiguity of used (vbn vs. vbd), and associates (vbz vs. 
nns).</Paragraph> <Paragraph position="7"> Perhaps more importantly, elimination of word-level lexical ambiguity allows the parser to make projections about the input which is yet to be parsed, using a simple lookahead; in particular, phrase boundaries can be determined with a degree of confidence (Church, 1988). This latter property is critical for implementing the skip-and-fit recovery technique outlined in the previous section.</Paragraph> <Paragraph position="8"> Tagging of input also helps to reduce the number of parse structures that can be assigned to a sentence, decreases the demand for consulting the dictionary, and simplifies dealing with unknown words. Since every item in the sentence is assigned a tag, so are the words for which we have no entry in the lexicon. Many of these words will be tagged as &quot;np&quot; (proper noun); however, the surrounding tags may force other selections. In the following example, chinese, which does not appear in the dictionary, is tagged as &quot;jj&quot;: [6]</Paragraph> <Paragraph position="10"> We use a stochastic tagger to process the input text prior to parsing. The tagger is based upon a bi-gram model; it selects the most likely tag for a word given co-occurrence probabilities computed from a small training set. [7] 4. PARSING WITH TTP PARSER TTP (Tagged Text Parser) is a top-down English parser specifically designed for fast, reliable processing of large amounts of text.</Paragraph> <Paragraph position="11"> [6] We use the machine readable version of the Oxford Advanced Learner's Dictionary (OALD). [7] The program, supplied to us by Bolt Beranek and Newman, operates in two alternative modes, either selecting a single most likely tag for each word (best-tag option, the one we use at present), or supplying a short ranked list of alternatives (Meteer et al., 1991). TTP is based on the Linguistic String Grammar developed by Sager (1981). 
Written in Quintus Prolog, the parser currently encompasses more than 400 grammar productions. [8] TTP produces a regularized representation of each parsed sentence that reflects the sentence's logical structure.</Paragraph> <Paragraph position="12"> This representation may differ considerably from a standard parse tree, in that the constituents get moved around (e.g., de-passivization, de-dativization), and the phrases are organized recursively around their head elements. An important novel feature of TTP parser is that it is equipped with a time-out mechanism that allows for fast closing of more difficult sub-constituents after a preset amount of time has elapsed without producing a parse. Although a complete analysis is attempted for each sentence, the parser may occasionally ignore fragments of input to resume &quot;normal&quot; processing after skipping a few words. These fragments are later analyzed separately and attached as incomplete constituents to the main parse tree.</Paragraph> <Paragraph position="13"> As the parsing proceeds, each sentence receives a new slot of time during which its parse is to be returned. The amount of time allotted to any particular sentence can be regulated to obtain an acceptable compromise between parser's speed and precision. In our experiments we found that 0.5 sec/sentence time slot was appropriate for the CACM abstracts, while 0.7 sec/sentence was more appropriate for generally longer sentences in MUC-3 articles. [9] The actual length of the time interval allotted to any one sentence may depend on this sentence's length in words, although this dependency need not be linear. Such adjustments will have only limited impact on the parser's speed, but they may affect the quality of produced parse trees. Unfortunately, there is no obvious way to evaluate quality of parsing except by using its results to attain some measurable ends. 
We used the parsed CACM collection to generate domain-specific word correlations for query processing in an information retrieval system, and the results were satisfactory. For other applications, such as information extraction and deep understanding, a more accurate analysis may be required. [10] Initially, a full analysis of each sentence is attempted. If a parse is not returned before the allotted time elapses, the parser enters the time-out mode. [8] See (Strzalkowski, 1990) for Prolog implementation details. [9] Giving the parser more time per sentence doesn't always mean that a better (more accurate) parse will be obtained. For complex or extra-grammatical structures we are likely to be better off if we do not allow the parser to wander around for too long: the most likely interpretation of an unexpected input is probably the one generated early (the grammar rule ordering enforces some preferences). [10] A qualitative method for parser evaluation has been proposed in (Harrison et al., 1990), and it may be used to make a relative comparison of parser's accuracy. What is not clear is how accurate a parser needs to be for any particular application.</Paragraph> <Paragraph position="14"> From this point on, the parser is permitted to skip portions of input to reach a starter terminal for the next constituent to be parsed, closing the currently open one (or ones) with whatever partial representation has been generated thus far. The result is an approximate partial parse, which shows the overall structure of the sentence, from which some of the constituents may be missing. 
The fragments skipped in the first pass are not thrown out; instead they are analyzed by a simple phrasal post-processor that looks for noun phrases and relative clauses and then attaches the recovered material to the main parse structure.</Paragraph> <Paragraph position="15"> The time-out mechanism is implemented using straightforward parameter passing and is at present limited to only a subset of nonterminals used by the grammar. Suppose that X is such a nonterminal, and that it appears on the right-hand side of a production S → X Y Z. The set of &quot;starters&quot; is computed for Y, which consists of the word tags that can occur as the left-most constituent of Y. This set is passed as a parameter while the parser attempts to recognize X in the input. If X is recognized successfully within a preset time, then the parser proceeds to parse a Y, and nothing else happens. On the other hand, if the parser cannot determine whether there is an X in the input or not, that is, it neither succeeds nor fails in parsing X before being timed out, the unfinished X constituent is closed with a partial parse, and the parser is restarted at the closest element from the starters set for Y that can be found in the remainder of the input. If Y can rewrite to an empty string, the starters for Z to the right of Y are added to the starters for Y and both sets are passed as a parameter to X. 
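
The starter computation just described can be illustrated with a minimal Python sketch. It is not TTP's Prolog implementation: the toy grammar, the function names, and the `inherited` parameter (standing in for a starter set handed down by the calling production) are assumptions made for illustration only.

```python
# Hypothetical toy grammar; symbols that are not keys are word tags (terminals).
GRAMMAR = {
    "S": [["X", "Y", "Z"]],
    "X": [["dt", "nn"]],
    "Y": [["rb"], []],          # Y may rewrite to the empty string
    "Z": [["vbz"], ["vbd"]],
}


def nullable_symbols(grammar):
    """Return the nonterminals that can rewrite to the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head in nullable:
                continue
            if any(all(sym in nullable for sym in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable


def leftmost_tags(grammar, nullable):
    """Map each nonterminal to the word tags that can begin it."""
    first = {head: set() for head in grammar}
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for sym in body:
                    new = first[sym] if sym in grammar else {sym}
                    if not new.issubset(first[head]):
                        first[head] |= new
                        changed = True
                    if sym not in nullable:
                        break
    return first


def starters_after(body, index, grammar, inherited=()):
    """Starter tags to pass down while the parser works on body[index]."""
    nullable = nullable_symbols(grammar)
    first = leftmost_tags(grammar, nullable)
    starters = set()
    for sym in body[index + 1:]:
        starters |= first[sym] if sym in grammar else {sym}
        if sym not in nullable:
            return starters
    # Everything to the right may be empty: fall back on the caller's set.
    return starters | set(inherited)
```

With this toy grammar, `starters_after(["X", "Y", "Z"], 0, GRAMMAR)` yields the tags that can begin Y together with those that can begin Z, because Y may be empty.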
As an example consider the following clauses in the TTP parser: [11] sentence(P) :- assertion([],P).</Paragraph> <Paragraph position="16"> assertion(SR,P) :- clause(SR,P1), s_coord(SR,P1,P).</Paragraph> <Paragraph position="17"> clause(SR,P) :- sa([pdt,dt,cd,pp,pp$,jj,jjr,jjs,nn,nns,np,nps],PA1), subject([vbd,vbz,vbp],Tail,P1), verbphrase(SR,Tail,P1,PA1,P), subtail(Tail).</Paragraph> <Paragraph position="18"> thats(SR,P) :- that, assertion(SR,P).</Paragraph> <Paragraph position="19"> [11] The clauses are slightly simplified, and some arguments are removed for expository reasons.</Paragraph> <Paragraph position="20"> In the clause production above, a (finite) clause rewrites into an (optional) sentence adjunct (SA), a subject, a verbphrase and subject's right adjunct (SUBTAIL, also optional). With the exception of subtail, each predicate has a parameter that specifies the list of &quot;starter&quot; tags for restarting the parser, should the evaluation of this predicate exceed the allotted portion of time. Thus, in case sa is aborted before its evaluation is complete, the parser will jump over some elements of the unparsed portion of the input looking for a word that could begin a subject phrase (either a predeterminer, a determiner, a count word, a pronoun, an adjective, a noun, or a proper name). Likewise, when subject is timed out, the parser will restart with verbphrase at either vbz, vbd or vbp (finite forms of a verb). Note that if verbphrase is timed out, then subtail will be ignored, both verbphrase and clause will be closed, and the parser will restart at an element of set SR passed down to clause from assertion. Note also that in the top-level production for a sentence the starter set for assertion is initialized to be empty: if the failure occurs at this level, no continuation is possible. 
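
The restart behaviour described for sa and subject can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `skip_to_starter`, the (word, tag) pair representation, and the sample sentence are hypothetical, while the two starter sets are copied from the clause production above.

```python
# Starter sets taken from the clause production above.
SUBJECT_STARTERS = {"pdt", "dt", "cd", "pp", "pp$", "jj", "jjr", "jjs",
                    "nn", "nns", "np", "nps"}
VERB_STARTERS = {"vbd", "vbz", "vbp"}

# Hypothetical tagged input, one (word, tag) pair per token.
TAGGED = [("frequently", "rb"), ("however", "rb"), ("the", "dt"),
          ("parser", "nn"), ("skips", "vbz"), ("input", "nn")]


def skip_to_starter(tagged, position, starters):
    """Find the closest token at or after `position` whose tag is a starter.

    Returns (restart_position, skipped_words), or None when no starter is
    left, which corresponds to the case where no continuation is possible.
    """
    for i in range(position, len(tagged)):
        word, tag = tagged[i]
        if tag in starters:
            # Words jumped over are kept so a later pass can analyze them.
            skipped = [w for w, _ in tagged[position:i]]
            return i, skipped
    return None
```

If sa were timed out at the start of this sample input, `skip_to_starter(TAGGED, 0, SUBJECT_STARTERS)` would restart at the determiner and report the two adverbs as skipped; with an empty starter set, as at the top-level sentence production, the function returns None.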
When a non-terminal is timed out and the parser jumps over a non-zero length fragment of input, it is assumed that the skipped part was some sub-constituent of the closed non-terminal. Accordingly, a place holder is left in the parse structure under the node dominated by this non-terminal, which will be later filled by some nominal material recovered from the fragment. The examples given in the Appendix show approximate parse structures generated by TTP.</Paragraph> <Paragraph position="21"> There are a few caveats in the skip-and-fit parsing strategy just outlined which warrant further explanation. In particular, the following problems must be resolved to assure parser's effectiveness: how to select starter tags for non-terminals, how to select non-terminals at which to place the starter tags, and finally how to select non-terminals at which input skipping can occur.</Paragraph> <Paragraph position="22"> Obviously some tags are more likely to occur at the left-most position of a constituent than others.</Paragraph> <Paragraph position="23"> Theoretically, a subject phrase can start with a word tagged with any element from the following list: pdt, dt, cd, jj, jjr, jjs, pp, pp$, nn, nns, np, nps, vbg, vbn, rb, in. [12] In practice, however, we may select only a subset of these, as shown in the clause production above.</Paragraph> <Paragraph position="24"> Although we now risk missing the left-hand boundary of subject phrases in some sentences, while skipping an adjunct to their left, most cases are still covered and the chances of making a serious misinterpretation of input are significantly lower.</Paragraph> <Paragraph position="25"> [12] This list is not complete. In addition to the tags explained before: pdt - predeterminer, jjr - comparative adjective, jjs - superlative adjective, pp - pronoun, pp$ - genitive, np, nps - proper noun, rb - adverb.</Paragraph> <Paragraph position="26"> We also need to decide on how input skipping is to be done. In a most straightforward design, when a nonterminal X is timed out, the parser would skip input until it has reached a starter element of a nonterminal Y adjacent to X from the right, according to the top-down predictions. [13] On the other hand, certain adjunct phrases may be of little interest, possibly because of their typically low information contents, and we may choose to ignore them altogether. Therefore, if X is timed out, and Y is a low contents adjunct phrase, we can make the parser jump right to the next nonterminal Z. In the clause production discussed before, subtail is skipped over if verbphrase is timed out. [14] Finally, it is not an entirely trivial task to select non-terminals at which the input skipping can occur. If wrong non-terminals are chosen the parser may generate rather uninteresting structures that would be next to useless, or it may become trapped in inadvertently created dead ends, hopelessly trying to fit the parse.</Paragraph> <Paragraph position="27"> Consider, for example, the following sentence, taken from MUC-3 corpus of news messages:</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> HONDURAN NATIONAL POLICE ON MONDAY PRESENTED TO THE PRESS HONDURAN JUAN BAUTISTA NUNEZ AMADOR AND NICARAGUAN LUIS FERNANDO ORDONEZ REYES, WHO TOLD REPORTERS THAT COMMANDER AURELIANO WAS ASSASSINATED ON ORDERS FROM JOSE DE JESUS PENA, THE NICARAGUAN EMBASSY CHIEF OF SECURITY. </SectionTitle> <Paragraph position="0"> After reaching the verb PRESENTED, the parser consults the lexicon and finds that one of the possible subcategorizations of this verb is [pun,to], that is, its object string can be a prepositional phrase with 'to' followed by a noun phrase. 
The parser thus begins to look for a prepositional phrase starting at &quot;TO THE PRESS ...&quot;, but unfortunately misses the end of the phrase at PRESS (the following word is tagged as a noun), and continues until reaching the end of sentence. At this point it realizes that it went too far (there is no noun phrase left), and starts backing up. Before the parser has a chance to back up to the word PRESS and correct the early mistake, however, the time-out mode is turned on, and instead of abandoning the current analysis, the parser now tries hard to fix it by skipping varying portions of input. This may take a considerable amount of time if the skip points are badly placed. On the other hand, we wouldn't like to allow an easy exit by accepting an empty noun phrase at the end of the sentence! [15] [13] Note that the top-down predictions are crucial for the skipping parser, whether the parser's processing is top-down or bottom-up. [14] subtail is the remainder of a discontinued subject phrase.</Paragraph> <Paragraph position="1"> One of the essential properties of the input skipping mechanism is its flexibility to jump over varying-size chunks of the input string. The goal is to fit the input with a closest matching parse structure while leaving the minimum number of words unaccounted for. In TTP, the skipping mechanism is implemented by adding extra productions for selected nonterminals, and these are always tried first whenever the nonterminal is to be expanded. We illustrate this with rn productions covering right adjuncts to a noun.</Paragraph> <Paragraph position="2"> rn(SR,P) :- timed_out, !, skip(SR), store(P).</Paragraph> <Paragraph position="3"> rn(_,[]) :- la([[pdt,dt,vbz,vbp,vbd,md,com,ha,rmr]]), \+ la([[com],[wdt,wp,wps]]).</Paragraph> <Paragraph position="4"> rn(SR,P) :- rn1(SR,P).</Paragraph> <Paragraph position="5"> In the rn predicate, SR is the list of starter tags and P is the parse tree fragment. 
The first production checks if the time-out mode has already been entered, in which case the input is skipped until a starter tag is found, while the skipped words are stored into P to be analyzed later in the parser's second pass. Note that in this case all other rn productions are cut off; however, should the first skip-and-fit attempt fail to lead to a successful parse, backtracking may eventually force predicate skip(SR) to evaluate again and make a longer leap. In a top-down left to right parser, each input skipping location becomes potentially a multiple backtracking point which needs to be controlled in order to avoid a combinatorial explosion of possibilities. This is accomplished by supplementing top-down predictions with bottom-up, data-driven fragmentation of input, and a limited lookahead. For example, in the second of the rn productions above, a right adjunct to a noun can be considered empty if the item following the noun is either a period, a semicolon, a comma, or a word tagged as pdt, dt, vbz, vbp, vbd, or md, but not a comma followed by a relative pronoun. [16] [15] In the present implementation, when the skipping mode is entered, it will stay on for the balance of the first pass in parsing of the current sentence. This way, one skip-and-fit attempt may lead to another before any backtracking is considered. An alternative is to do time-out on a nonterminal by nonterminal basis, that is, to time out processing of selected nonterminals only and then resume regular parsing. This design leads to a far more complex implementation and somewhat inferior performance, but it might be worth considering in the future.</Paragraph> <Paragraph position="6"> [16] md - modal verb; vbp - plural verb; wdt, wp, wps - relative pronouns.</Paragraph> <Paragraph position="7"> </Paragraph> </Section> </Paper>