<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-3156">
  <Title>Agent AGT Beneficiary BENF Experiencer EXPR Instrument INST Object OBJ Recipient RECP Direction DIR Location_at LAT Location_from LFRM Location_to LTO Location_through LTRU Orientation ORNT Frequency FREQ Time_at TAT Time_from TFRM Time_to TTO Time_through TTRU Cause CAUS Contradiction CNTR Effect EFF Purpose PURP Accompaniment ACMP Content CONT Manner MANR Material MATR Measure MEAS</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE DIPETT PARSER
</SectionTitle>
    <Paragraph position="0"> TANKA requires a broad-coverage parser because it uses a limited semantic model based on Case relations, and domain-specific knowledge is not available to it a priori. Without rich semantics, syntax is the only basis for inferring meaning. In TANKA, the broader the parser's coverage, the more accurate the ultimate knowledge representation can be. This is in opposition to approaches in which semantic knowledge is fed in beforehand, and syntax is limited to restricted patterns or even just keywords. Our approach lies at the other end of the spectrum: we are concerned with realistic large-scale texts and need realistic syntactic coverage. This enables HAIKU, the interactive semantic interpreter, to extract overt meaning from DIPETT's detailed parse trees, and helps organize interaction with the user.</Paragraph>
    <Paragraph position="1"> DIPETT (Domain-Independent Parser for English Technical Texts) is a linguistic-theory-neutral parser with a broad surface-syntactic coverage of English. It handles most sentences in our unedited sample text, a guide to the fourth-generation database language Quiz. DIPETT's coverage encompasses every fundamental syntactic structure in the language, including coordination, and most syntactic phenomena encountered in typical expository technical texts. The core of its grammar is based on general and NLP-oriented English grammars, in particular Quirk et al. (1985) and Winograd (1983).</Paragraph>
    <Paragraph position="2"> DIPETT's major components are a dictionary, a lexical analyzer, a syntactic analyzer, a memorizing device with a helper mechanism, plus its own trace mechanism. An input is usually given to the lexical analyzer and then to the syntactic analyzer; this makes for conceptually clear and easily implemented modules. More than half of the parser's 5000 lines of code are DCG rules. A 15-word sentence can typically be processed in 15 to 20 CPU seconds on a Sun SparcStation.</Paragraph>
    <Paragraph position="3"> Novel features of DIPETT are a dynamic dictionary expansion facility, its memorizing device (well-formed substring table), a helper (error explanation mechanism), and an internal trace mechanism for debugging.</Paragraph>
    <Paragraph position="4"> The parser's surface-syntactic dictionary contains most English function words. It includes a table that associates legal adverbial particles with verbs (this is used to disambiguate particles and prepositions). Another table contains word groups such as &amp;quot;as much as&amp;quot; or &amp;quot;even if&amp;quot; that usually play the same role as single function words. The dictionary will be expanded with semantic information when it is integrated with the Case Analyzer. The lexical analyzer builds a list of annotated words with the root form and the syntactic parameters. If the input contains a word for which the dictionary has no entry, this module allows the user to augment the dictionary dynamically. Such temporary additions are saved on a file for future permanent addition. [Actes de COLING-92, Nantes, 23-28 août 1992 / Proc. of COLING-92, Nantes, Aug. 23-28, 1992]</Paragraph>
    <Paragraph position="5"> DIPETT's grammar recognizes the following major syntactic units: sentence (simple, complex and multiply-coordinated), question, verb phrase (simple and conjoined), verbal clause, complement, subordinate clause, adverbial clause, noun phrase (simple and conjoined) and their substantive forms, that-clause, relative clause, ing-clause, to-infinitive clause, whether-if clause, noun phrase post-modifier (e.g. appositive), prepositional phrase (simple and conjoined), noun pre- and post-modifier, determinative, adjectival phrase.</Paragraph>
    <Paragraph position="6"> The purpose of the memorizer is to minimize the reparsing of syntactic substructures that are reconsidered on backtracking. The helper shows the user information that may help identify the reasons for an input's rejection. Both features can be switched on or off for the session. These two modules use notes--assertions that record essential syntactic information about major well-formed substrings that constitute the prepositional, noun and verb phrases. A note stores a substring, its type and its syntactic structure produced by the parser. Corresponding DCG rules contain Prolog assertions invoked if the user has activated the memorizer or the helper.</Paragraph>
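The note mechanism described above can be illustrated as a simple well-formed substring table. The following Python sketch is only an illustration under assumed data structures; DIPETT itself realizes this with Prolog assertions embedded in its DCG rules, and all names here (Memorizer, parse_np) are hypothetical.

```python
class Memorizer:
    """Well-formed substring table: caches 'notes' keyed by
    (category, start position) so backtracking need not reparse them."""
    def __init__(self):
        self.notes = {}
        self.enabled = True                  # switchable for the session

    def lookup(self, category, start):
        if self.enabled:
            return self.notes.get((category, start))
        return None

    def record(self, category, start, end, tree):
        if self.enabled:
            self.notes.setdefault((category, start), []).append((end, tree))


def parse_np(tokens, start, memo):
    """Toy determiner-noun NP rule that consults the memo table first."""
    cached = memo.lookup("np", start)
    if cached is not None:
        return cached                        # reuse the note, no reparse
    results = []
    if len(tokens) > start + 1 and tokens[start] in ("the", "a"):
        tree = ("np", tokens[start], tokens[start + 1])
        results.append((start + 2, tree))
        memo.record("np", start, start + 2, tree)
    return results
```

For brevity the sketch records only successful notes; recording failures as well would save further reparsing on backtracking.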
    <Paragraph position="7"> Testing and fine-tuning a complex parser can be difficult. Prolog debugging facilities are often cumbersome for logic grammars, where it is only interesting to know what rule is being examined by the parser, for which part of the input string, and what has been successfully recognized. We have therefore implemented our own trace mechanism, which employs trace instructions (activated by a flag) inserted in all rules related to prepositional, noun and verb phrases. The parser implementor can activate and control the trace mechanism through a menu interface.</Paragraph>
    <Paragraph position="8"> Conjoined verb phrases and sentences are usually very expensive to parse. We have devised two look-ahead mechanisms to treat co-ordination efficiently. These mechanisms check the lexical categories of tokens ahead in the input string.</Paragraph>
    <Paragraph position="9"> The first looks for coordinated clauses, while the second checks inputs that are supposed to contain at least one verb (such as the to-infinitive clause). This information is used by the parser to identify potential joining-points for conjoined sentences and to avoid applying rules that cannot succeed. The parser also handles elided modals and auxiliaries in conjoined verb phrases. For example, &amp;quot;John has printed the letters and read the report&amp;quot; is analyzed as &amp;quot;\[\[John\] \[\[has printed the letters\] and \[has read the report\]\]\]&amp;quot;. Scoping of negation and adverbs in conjoined verbs is handled, too. For example &amp;quot;John did not accidentally print and read my personal messages&amp;quot; is analyzed as &amp;quot;\[\[John\] \[\[did not accidentally print\] and \[did not accidentally read\]\] \[my personal messages\]\]&amp;quot;.</Paragraph>
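The look-ahead check and the restoration of elided auxiliaries can be sketched as follows. This is a simplified Python illustration over assumed token-list representations, not DIPETT's DCG code; the word lists and lexicon format are assumptions.

```python
AUX = {"has", "have", "had", "did", "does", "do", "will", "is", "was"}
SCOPAL = {"not", "accidentally"}   # negation/adverbs sharing the aux's scope


def contains_verb(tokens, lexicon):
    """Look-ahead: inspect lexical categories of tokens ahead so that
    rules requiring a verb are never attempted on verbless input."""
    return any("verb" in lexicon.get(tok, ()) for tok in tokens)


def restore_ellipsis(conjuncts):
    """Copy the auxiliary (plus in-scope negation and adverbs) of the
    first conjoined verb phrase into later conjuncts that elide it."""
    prefix = []
    for word in conjuncts[0]:
        if word in AUX or word in SCOPAL:
            prefix.append(word)
        else:
            break
    restored = [conjuncts[0]]
    for vp in conjuncts[1:]:
        if vp[0] in AUX:
            restored.append(vp)            # auxiliary present: keep as-is
        else:
            restored.append(prefix + vp)   # restore the elided prefix
    return restored
```

On the paper's example, the second conjunct ["read", "the", "report"] inherits "has" from the first conjunct, and under negation the whole prefix "did not accidentally" is copied.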
    <Paragraph position="10"> DIPETT does not have access to semantic knowledge, so prepositional phrase (PP) attachment must use syntax-based heuristics. Two examples: an 'of' PP is attached to the preceding noun by default; if a PP which is not an initial modifier occurs in a pre-verbal position, it is attached to the noun (whatever the preposition may be).</Paragraph>
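The two heuristics just cited can be expressed compactly. The function below is a hypothetical simplification for illustration, not DIPETT's actual rule set, and its interface is an assumption.

```python
def attach_pp(preposition, position, sentence_initial):
    """Syntax-only PP attachment decision: return 'noun' or 'verb'.
    position is 'preverbal' or 'postverbal'."""
    if preposition == "of":
        return "noun"    # an 'of' PP modifies the preceding noun by default
    if position == "preverbal" and not sentence_initial:
        return "noun"    # non-initial pre-verbal PPs attach to the noun,
                         # whatever the preposition
    return "verb"        # otherwise attach to the verb
```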
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
CURRENT WORK IN DIPETT
</SectionTitle>
    <Paragraph position="0"> It is our experience that sooner or later an extra-grammatical or highly ambiguous input will engage the parser in an excessively lengthy computation. We must be able to deal with such extreme situations because our knowledge acquisition method requires finding, for any input, the first parse tree that is linguistically reasonable.</Paragraph>
    <Paragraph position="1"> The reshuffling of the tree's components is left to HAIKU. At present, we discontinue a parse operation that exceeds the time allowed for a single parse (specified by the user at the beginning of a session). Timing-out in this manner causes loss of information from a partially parsed sentence, but it is preferable to the user's waiting unrealistically long for the system's feedback. DIPETT also applies look-ahead and heuristics to fail unpromising partial parses quickly (e.g. it will not try verb phrase analysis if there is no verb). This helps produce the first reasonable parse tree as fast as possible.</Paragraph>
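The per-parse time budget can be sketched as a deadline check invoked at rule entry. This is an illustrative Python sketch only; the callback-style interface and the function names are assumptions, not DIPETT's implementation.

```python
import time


class ParseTimeout(Exception):
    """Raised when a parse exceeds the session's per-sentence budget."""


def parse_with_budget(parse_fn, tokens, budget_seconds):
    """Run parse_fn under a time budget; return None on time-out.
    parse_fn receives a check() callback to invoke at each rule entry."""
    deadline = time.monotonic() + budget_seconds

    def check():
        if time.monotonic() > deadline:
            raise ParseTimeout()

    try:
        return parse_fn(tokens, check)
    except ParseTimeout:
        return None          # the partial parse is abandoned
```

Returning None on time-out mirrors the behaviour described above: information from the partial parse is lost, but the user is never left waiting indefinitely.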
    <Paragraph position="2"> The ultimate goal of the TANKA system is to process free-form technical texts. Texts often contain non-textual material such as tables or examples (e.g. data, programs, results). We assume all non-textual elements have been removed from our source texts, but each removal leaves a &amp;quot;hole&amp;quot; behind. Most holes are located between sentences and do not affect the structure of the text, but some cause fragments to appear in the text. Fragments are valid sub-structures of English sentences, such as &amp;quot;For example&amp;quot; in &amp;quot;For example, &gt; SORT ON DATEJOINED D.&amp;quot; DIPETT can parse such fragments.</Paragraph>
    <Paragraph position="3"> Three areas of grammar are currently under active development in DIPETT: 1) References: the parser will be capable of resolving simple references, in particular anaphora, on syntactic grounds alone (we mean references whose resolution requires little or no semantic knowledge)---see Hobbs (1978).</Paragraph>
    <Paragraph position="4"> 2) Topic and focus: the parser will maintain some knowledge about topic and focus. As a first indication, a text's title should tell us about its topic while the current input indicates focus; this could benefit the Conceptual Knowledge Processor in TANKA by tentatively relating the topic to a cluster in the conceptual network.</Paragraph>
    <Paragraph position="5"> 3) Paragraph parsing: the parser's default mode of operation is one sentence at a time. Parsing longer inputs, a number of consecutive sentences  or even paragraphs, means much more elaborate processing than parsing single sentences.</Paragraph>
    <Paragraph position="6"> Nothing is gained by simply finding a sequence of parse trees--one for each sentence, in order; see Jensen (1989) for a similar statement. We have plans for a more intelligent type of parsing that would be able to summarize the contents of these longer inputs by highlighting the main conceptual elements more closely related to the current topic (see Zadrozny &amp; Jensen (1991) for a theory of the paragraph). Topic and focus information will probably help here.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
CASE ANALYSIS WITH LEARNING
</SectionTitle>
    <Paragraph position="0"> In TANKA, knowledge is expressed in terms of entities engaged in acts that serve to link them into a graph; see Sowa (1984) for a general discussion of this type of representation. This graph is the conceptual network that TANKA will gradually build for a technical text. It is constructed from Case frames of verbs recognized in the sentence. We have put together a set of Cases suitable for our class of domains; it is inspired by lists found in Fillmore (1968), Bruce (1975), Grimes (1975), Cook (1979), Larson (1984) and Sparck Jones &amp; Boguraev (1987). This set (Figure 1) is not entirely settled; we continue to review the work of other authors and we are currently testing our selections against those Somers (1987) presents in his Case grid.</Paragraph>
    <Paragraph position="1"> Case Analysis (CA) extracts the acts and Case constellations around them from the structure produced by the parser on a sentence-by-sentence basis. Only one parse is used, but the system will allow the user to override all its suggestions. Subsequent processing can adjust the understanding of a sentence enough to encompass most alternative parses, and only fails to cover situations when a word can be legitimately parsed twice as different parts of speech. Items extracted from the parse are mapped quite directly into Case structures. A verb denotes an act. A Case is marked most often by a preposition or an adverb, and a noun or nominalization (marked by a preposition) serves as a Case object. Initial processing of a parse tree identifies elements of interest; others such as noun modifiers are not used by CA but are kept in the representation for the Conceptual Knowledge Processor. Two questions must then be answered for each Case-Marker in TANKA: to which verb does it attach, and which Case does it realize? The HAIKU module does not attempt to answer these questions itself, at least not in a definitive way. It asks the user to answer by selecting among alternatives in a list, which may include syntactic elements from the original sentence copied exactly, illustrative phrases and sentences, and possibly short descriptions of the meaning of Cases. Our goal is to minimize the number of interactions the user must engage in to give the right answer. This can be done by letting all answers be specified in one interaction, and that in turn is possible if HAIKU proposes correct Case-Marker attachments and semantics at the outset. In practice a minimum of two interactions per complex sentence appears to be necessary: one to correctly link Case Markers to verbs and a second to validate Case Marker semantics for each verb. 
Our work on HAIKU thus concentrates on ensuring it produces the correct configuration, preferably on the first interaction.</Paragraph>
    <Paragraph position="2"> Attachment of Case-Markers to verbs is inferred solely from the parse structure. Semantics could help were they known in advance (a verb has only one Case of a given type), but semantic inference is also aided by knowledge of syntax, and something must come first. Once the user has endorsed an assignment of Case-Markers to verbs, each clause in the nested structure of coordinated and subordinated clauses received from the parser is considered in isolation. Because the pattern of Case-Markers (CMP) associated with a given verb is known when the second user interaction is undertaken, HAIKU can check a dictionary of these patterns to see if this particular one has been encountered earlier with any verb. If it has, the matching CMPs will be ordered according to a closeness metric discussed below. Otherwise HAIKU will use this closeness metric to search its CMP dictionary for the pattern that most nearly resembles the input CMP.</Paragraph>
    <Paragraph position="3"> This pattern may lack certain Case-Markers, have extra ones, or not match on both grounds.</Paragraph>
    <Paragraph position="4"> However, a candidate pattern will always be found, it will be the best possible, and HAIKU can provide additional, next-best patterns should the first be deemed unsatisfactory.</Paragraph>
    <Paragraph position="5"> For example, the sentence &amp;quot;The parcel was moved from the house to the car&amp;quot; has the CMP SUBJ-OBJ-FROM-TO (where SUBJ is nil here), associated with the verb move. A dictionary of CMPs is searched to see if this pattern has previously been associated with move. If not, the analyzer will look at the entry for move. Suppose it finds { SUBJ-OBJ, SUBJ-OBJ-WITH, SUBJ-FROM-AT}. It could try to add Case alternatives realized by FROM and TO to the SUBJ-OBJ pattern, or it might return to the CMP dictionary and seek an instance of SUBJ-OBJ-FROM-TO associated with a different verb.</Paragraph>
    <Paragraph position="6"> Eventually the algorithm selects the CMP closest to the input pattern. Closeness is a metric based on factors such as the number, types and agreement of CMs in each pattern and the verb associated with each (Copeck et al. 1992). It may be extended to use a very simple noun semantics for Case Objects or counts of the frequency of previous selection.</Paragraph>
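A toy version of the CMP-dictionary search and closeness ranking follows. The scoring here is a hypothetical simplification for illustration: the actual metric of Copeck et al. (1992) also weighs the types and agreement of the Case-Markers and the associated verb.

```python
def closeness(pattern, candidate):
    """Toy closeness score between two CMPs: shared Case-Markers count
    in favour; missing and extra ones count against."""
    shared = len(set(pattern).intersection(candidate))
    missing = len(set(pattern).difference(candidate))
    extra = len(set(candidate).difference(pattern))
    return shared - missing - extra


def best_cmps(pattern, cmp_dictionary):
    """cmp_dictionary maps each verb to the list of CMPs seen with it.
    Returns (verb, cmp) pairs ordered best-first, so a candidate is
    always found and next-best patterns remain available to the user."""
    candidates = [(verb, pat)
                  for verb, pats in cmp_dictionary.items()
                  for pat in pats]
    return sorted(candidates,
                  key=lambda vp: closeness(pattern, vp[1]),
                  reverse=True)
```

On the move example above, the full SUBJ-OBJ-FROM-TO pattern recorded for another verb outranks move's own partial patterns, matching the fallback behaviour described in the text.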
    <Paragraph position="7"> The HAIKU dictionaries--an incrementally growing store of verb-CMP associations, Case Patterns and examples--are searched for sentences that exemplify the Case Patterns associated with the CMPs. For example, if SUBJ-OBJ-FROM-TO is associated with take, the sentence might be &amp;quot;our guests took the train from Montreal to Ottawa&amp;quot;. The sentence is shown to the user, who can accept the underlying Case Pattern as correct, edit it by invoking a mode whereby a new Case is associated with a selected Case-Marker, or ask to see the next sentence in the list. The decision to view another sentence will probably be dictated by the number of changes required in the pattern illustrated by the current example. The user's selections are used to update the HAIKU dictionaries and to freeze the sense and structure of the conceptual fragment expressed by the clause which the pattern represents: the system has learned a new pattern of Case Markers, associated them with a particular verb, and recorded the meaning they convey in this instance. The resulting conceptual fragment is then passed on to the Conceptual Knowledge Processor to be integrated into the main Conceptual Network.</Paragraph>
    <Paragraph position="8"> The representation produced by HAIKU is essentially a reorganized parse tree, augmented with elements of meaning. Discourse relations communicated by conjunctions (e.g. causality) are not analyzed by CA. The representation also includes constituents irrelevant to the overall Case structure of the sentence, e.g. adjectives, relative clauses, PPs attached to nouns, clauses with stative verbs expressing noun-noun relations, and so on. These are passed to the next module of TANKA, the Mini-Network Builder.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
FUTURE RESEARCH
</SectionTitle>
    <Paragraph position="0"> The new version of DIPETT is operational. It is now being integrated into the INTELLA system (Delisle et al. 1991), which combines text analysis with explanation-based learning. A Case Analysis prototype is running and work in this area is actively under way. It includes investigating the character of technical texts, validating the set of Cases used in TANKA, and refining the process of confirming the design principles behind example-driven interaction with the user by experiment. A re-implementation of the HAIKU module will be completed in the coming months.</Paragraph>
  </Section>
</Paper>