<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1082"> <Title>References</Title> <Section position="2" start_page="0" end_page="501" type="metho"> <SectionTitle> 1 LHIP Overview </SectionTitle> <Paragraph position="0"> This paper describes LHIP (Left-Head corner Island Parser), a parser designed for broad-coverage handling of unrestricted text. The system interprets an extended DCG formalism to produce a robust analyser that finds parses of the input made from 'islands' of terminals (corresponding to terminals consumed by successful grammar rules). It is currently in use for processing dialogue transcripts from the HCRC Map Task Corpus (Anderson et al., 1991), although we expect its eventual applications to be much wider. Transcribed natural speech contains a number of frequent characteristic 'ungrammatical' phenomena: filled pauses, repetitions, restarts, etc. (as in e.g. Right I'll have ... you know, like I'll have to ... so I'm going between the picket fence and the mill, right.). While a full analysis of a conversation might well take these into account, for many purposes they represent a significant obstacle to analysis. LHIP provides a processing method which allows selected portions of the input to be ignored or handled differently.</Paragraph> <Paragraph position="1"> The chief modifications to the standard Prolog 'grammar rule' format are of two types: one or more right-hand side (RHS) items may be marked *This work was carried out under grants nos. 2033903.92 and 12-36505.92 from the Swiss National Fund.</Paragraph> <Paragraph position="2"> Note that the input consists of written texts within the Map Task Corpus; LHIP is not intended for use in speech processing.</Paragraph> <Paragraph position="3"> This example is taken from the Map Task Corpus.</Paragraph> <Paragraph position="4"> as 'heads', and one or more RHS items may be marked as 'ignorable'.
We expand on these points and introduce other differences below.</Paragraph> <Paragraph position="5"> The behaviour of LHIP can best be understood in terms of the notions of island, span, cover and threshold: Island: Within an input string consisting of the terminals (t1, t2, ..., tn), an island is a sub-sequence (ti, ti+1, ..., ti+m), whose length is m + 1.</Paragraph> <Paragraph position="6"> Span: The span of a grammar rule R is the length of the longest island (ti, ..., tj) such that terminals ti and tj are both consumed (directly or indirectly) by R.</Paragraph> <Paragraph position="7"> Cover: A rule R is said to cover m items if m terminals are consumed within the island described by R. The coverage of R is then m.</Paragraph> <Paragraph position="8"> Threshold: The threshold of a rule is the minimum value for the ratio of its coverage c to its span s which must hold in order for the rule to succeed. Note that c <= s, and that if c = s the rule has completely covered the span, consuming all terminals.</Paragraph> <Paragraph position="9"> As implied here, rules need not cover all of the input in order to succeed. More specifically, the constraints applied in creating islands are such that islands do not have to be adjacent, but may be separated by non-covered input. Moreover, an island may itself contain input which is unaccounted for by the grammar. Islands do not overlap, although when multiple analyses exist they will in general involve different segmentations of the input into islands.</Paragraph> <Paragraph position="10"> There are two notions of non-coverage of the input: sanctioned and unsanctioned non-coverage. The latter case arises when the grammar simply does not account for some terminal. Sanctioned non-coverage means that some number of special 'ignore' rules have been applied which simulate coverage of input material lying between the islands, thus in effect making the islands contiguous.
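The success condition defined by these four notions can be sketched as a small check. The following Python fragment is a conceptual illustration only, not part of LHIP; the function name and the representation of islands as position lists are my own assumptions.

```python
def rule_succeeds(covered, threshold):
    """Sketch of the LHIP success condition: 'covered' is the sorted
    list of input positions consumed by a rule (directly or
    indirectly). The span runs from the first to the last covered
    terminal; the rule succeeds when coverage/span meets the
    threshold."""
    if not covered:
        return False
    span = covered[-1] - covered[0] + 1   # length of the rule's island
    return len(covered) / span >= threshold
```

For instance, a rule covering positions 2, 3 and 5 has coverage 3 over a span of 4, so it succeeds at threshold 0.75 but not at threshold 1.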
Those parts of the input that have been 'ignored' are considered to have been consumed. These ignore rules can be invoked individually or as a class. It is this latter capability which distinguishes ignore rules from regular rules, as they are functionally equivalent otherwise, mainly serving as a notational aid for the grammar writer.</Paragraph> <Paragraph position="11"> Strict adjacency between RHS clauses can be specified in the grammar. It is possible to define global and local thresholds for the proportion of the spanned input that must be covered by rules; in this way, the user of an LHIP grammar can exercise quite fine control over the required accuracy and completeness of the analysis.</Paragraph> <Paragraph position="12"> A chart is kept of successes and failures of rules, both to improve efficiency and to provide a means of identifying unattached constituents. In addition, feedback is given to the grammar writer on the degree to which the grammar is able to cope with the given input; in a context of grammar development, this may serve as notification of areas to which the coverage of the grammar might next be extended.</Paragraph> <Paragraph position="13"> The notion of 'head' employed here is connected more closely with processing control than linguistics. In particular, nothing requires that a head of a rule should share any information with the LHS item, although in practice it often will. Heads serve as anchor-points in the input string around which islands may be formed, and are accordingly treated before non-head items (RHS items are re-ordered during compilation; see below).
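The head-first treatment can be pictured as a compile-time reordering. The sketch below is a conceptual Python illustration under my own representation of rule bodies; it is not LHIP's internal code.

```python
def evaluation_order(rhs):
    """Partition RHS items into heads and non-heads, keeping each
    item's source position so that ordering and adjacency constraints
    can still be checked at parse time.

    rhs: list of (name, is_head) pairs in source order.
    Returns (source_index, name) pairs in evaluation order."""
    heads = [(i, name) for i, (name, h) in enumerate(rhs) if h]
    others = [(i, name) for i, (name, h) in enumerate(rhs) if not h]
    return heads + others

# A body 's(S1), * conjunction(Conj), s(S2)' evaluates its head first:
order = evaluation_order([("s1", False), ("conjunction", True), ("s2", False)])
```

The retained source indices are what allow the compiled rule to constrain each clause with respect to its original neighbours, as described in section 2.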
In the central role of heads, LHIP resembles parsers devised by Kay (1989) and van Noord (1991); in other respects, including the use which is made of heads, the approaches are rather different, however.</Paragraph> </Section> <Section position="3" start_page="501" end_page="504" type="metho"> <SectionTitle> 2 The LHIP System </SectionTitle> <Paragraph position="0"> In this section we describe the LHIP system.</Paragraph> <Paragraph position="1"> First, we define what constitutes an acceptable LHIP grammar, second, we describe the process of converting such a grammar into Prolog code, and third, we describe the analysis of input with such a grammar.</Paragraph> <Paragraph position="2"> LHIP grammars are an extended form of Prolog DCG grammars. The extensions can be summarized as follows: 1. one or more RHS clauses may be nominated as heads; (A version of LHIP exists which permits a form of negation on RHS clauses. That version is not described here.)</Paragraph> <Paragraph position="3"> 2. one or more RHS clauses may be marked as optional; 3. 'ignore' rules may be invoked; 4. adjacency constraints may be imposed between RHS clauses; 5. a global threshold level may be set to determine the minimum fraction of spanned input that may be covered in a parse, and 6. a local threshold level may be set in a rule to override the global threshold within that rule.</Paragraph> <Paragraph position="4"> We provide a syntactic definition (below) of a LHIP grammar rule, using a notation with syntactic rules of the form C -> F1 | F2 | ... | Fn which indicates that the category C may take any of the forms F1 to Fn. An optional item in a form is denoted by surrounding it with square brackets '[...]'. Syntactic categories are italicised, while terminals are underlined: '...'.</Paragraph> <Paragraph position="5"> A LHIP grammar rule has the form: lhip_rule -> [ - ] term [ # T ] ~~> rule_body where T is a value between zero and one.
If present, this value defines the local threshold fraction for that rule. This local threshold value overrules the global threshold. The symbol '-' before the name of a rule marks it as being an 'ignore' rule. Only a rule defined this way can be invoked as an ignore rule in an RHS clause.</Paragraph> <Paragraph position="7"> The connectives ',' and ';' have the same precedence as in Prolog, while '~' and ':' have the same precedence as ','. Parentheses may be used to resolve ambiguities. The connective '~' is used to indicate that strings subsumed by two RHS clauses are ordered but not necessarily adjacent in the input. Thus 'A ~ B' indicates that A precedes B in the input, perhaps with some intervening material. The stronger constraint of immediate precedence is marked by ':'; 'A : B' indicates that the span of A precedes that of B, and that there is no uncovered input between the two. Disjunction is expressed by ';', and optional RHS clauses are surrounded by '(? ... ?)'.</Paragraph> <Paragraph position="9"> The symbol '*' is used to indicate a head clause. A rule name is a Prolog term, and only rules and terminal items may act as heads within a rule body. The symbol '@' introduces a terminal string. As previously said, the purpose of ignore rules is simply to consume input terminals, and their intended use is in facilitating repairs in analysing input that contains the false starts, restarts, filled pauses, etc. mentioned above. These rules are referred to individually by preceding their name with the '-' symbol. They can also be referred to as a class in a rule body by the special RHS clause '[]'. If used in a rule body, they indicate that input is potentially ignored; the problems that ignore rules are intended to repair will not always occur, in which case the rules succeed without consuming any input.
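The effect of ignore rules on the coverage computation can be sketched as follows. This is a conceptual Python illustration under my own representation of input positions; in LHIP itself the bookkeeping happens inside the compiled Prolog rules.

```python
def coverage_with_ignores(covered, ignored):
    """Sanctioned non-coverage, sketched: positions consumed by
    'ignore' rules count as covered, so ignored material lying between
    islands makes them effectively contiguous for the coverage/span
    ratio. Returns (coverage, span)."""
    consumed = sorted(set(covered) | set(ignored))
    span = consumed[-1] - consumed[0] + 1
    return len(consumed), span

# Islands {1,2} and {5,6} with positions 3-4 ignored: coverage equals
# span, so the rule can succeed even at threshold 1.
```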
There is a semantic restriction on the body of a rule which is that it must contain at least one clause which necessarily covers input (optional clauses and ignore rules do not necessarily cover input).</Paragraph> <Paragraph position="10"> The following is an example of a LHIP rule.</Paragraph> <Paragraph position="11"> Here, the sub-rule 'conjunction(Conj)' is marked as a head and is therefore evaluated before either</Paragraph> <Paragraph position="13"> s(S2).</Paragraph> <Paragraph position="14"> How is such a rule converted into Prolog code by the LHIP system? First, the rule is read and the RHS clauses are partitioned into those marked as heads, and those not. A record is kept of their original ordering, and this record allows each clause to be constrained with respect to the clause that precedes it, as well as with respect to the next head clause which follows it. Additional code is added to maintain a chart of known successes and failures of each rule. Each rule name is turned into the name of a Prolog clause, and additional arguments are added to it. These arguments are used for the input, the start and end points of the area of the input in which the rule may succeed, the start and end points of the actual part of the input over which it in fact succeeds, the number of terminal items covered within that island, a reference to the point in the chart where the result is stored, and a list of pointers to sub-results. The converted form of the above rule is given below (minus the code for chart maintenance): s(conjunct(H,I,J), A, B, C, D, E, F, [L|K]-K, G) :- lhip_threshold_value(M), conjunction(H, A, B, C, O, P, Q, R-S, _), s(I, A, B, O, D, _, T, C-R, _), s(J, A, P, C, _, E, U, S-L, _), F is U+Q+T, F/(E-D) >= M.</Paragraph> <Paragraph position="16"> The important points to note about this converted form are the following: 1. the conjunction clause is searched for before either of the two s clauses; 2.
the region of the input to be searched for the conjunction clause is the same as that for the rule's LHS (B-C); its island extends from O to P and covers Q items; 3. the search region for the first s clause is B-O (i.e. from the start of the LHS search region to the start of the conjunction island), its island starts at D and covers T items; 4. the search region for the second s clause is P-C (i.e. from the end of the conjunction island to the end of the LHS search region), its island ends at E and covers U items; 5. the island associated with the rule as a whole extends from D to E and covers F items, where F is U + Q + T; 6. lhip_threshold_value/1 unifies its argument M with the current global threshold value.</Paragraph> <Paragraph position="17"> In the current implementation of LHIP, compiled rules are interpreted depth-first and left-to-right by the standard Prolog theorem-prover, giving an analyser that proceeds in a top-down, 'left-head-corner' fashion. Because of the reordering carried out during compilation, the situation regarding left-recursion is slightly more subtle than in a conventional DCG. The 's(conjunct(...))' rule shown above is a case in point. While at first sight it appears left-recursive, inspection of its converted form shows its true leftmost subrule to be 'conjunction'. Naturally, compilation may induce left-recursion as well as eliminating it, in which case LHIP will suffer from the same termination problems as an ordinary DCG formalism interpreted in this way. And as with an ordinary DCG formalism, it is possible to apply different parsing methods to LHIP in order to circumvent these problems (see e.g. Pereira and Shieber, 1987). A related issue concerns the interpretation of embedded Prolog code.
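The coverage arithmetic at the end of the converted clause (points 5 and 6 above) can be restated concretely. This Python fragment mirrors that arithmetic only; the lowercase variable names follow the Prolog variables in the converted rule.

```python
def conjunct_rule_succeeds(q, t, u, d, e, m):
    """Combined success test of the s(conjunct(...)) rule, sketched:
    q, t and u are the item counts covered by the conjunction and the
    two s sub-rules, d and e delimit the rule's island as a whole, and
    m is the global threshold."""
    f = u + q + t                  # F is U+Q+T
    return f / (e - d) >= m        # F/(E-D) >= M
```

With q=1, t=2, u=3 over an island from 0 to 6, coverage 6 over span 6 satisfies even a threshold of 1.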
Reordering of RHS clauses will result in code which precedes a head within a LHIP rule being evaluated after it; judicious freezing of goals and avoidance of unsafe cuts are therefore required.</Paragraph> <Paragraph position="18"> LHIP provides a number of ways of applying a grammar to input. The simplest allows one to enumerate the possible analyses of the input with the grammar. The order in which the results are produced will reflect the lexical ordering of the rules as they are converted by LHIP. With the threshold level set to 0, all analyses possible with the grammar by deletion of input terminals can be generated. Thus, supposing a suitable grammar, for the sentence John saw Mary and Mark saw them there would be analyses corresponding to the sentence itself, as well as John saw Mary, John saw Mark, John saw them, Mary saw them, Mary and Mark saw them, etc.</Paragraph> <Paragraph position="19"> By setting the threshold to 1, only those partial analyses that have no unaccounted-for terminals within their spans can succeed. Hence, Mark saw them would receive a valid analysis, as would Mary and Mark saw them, provided that the grammar contains a rule for conjoined NPs; John saw them, on the other hand, would not. As this example illustrates, a partial analysis of this kind may not in fact correspond to a true sub-parse of the input (since Mary and Mark was not a conjoined subject in the original). Some care must therefore be taken in interpreting results.
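The contrast between the two threshold settings can be sketched as a filter over candidate analyses. The Python below is my own framing, with word indices standing in for terminals; it is not an LHIP predicate.

```python
def valid_at_threshold_1(word_indices):
    """At threshold 1, a partial analysis may omit terminals outside
    its span but none within it, so the consumed indices must form a
    contiguous block."""
    idx = sorted(word_indices)
    return idx == list(range(idx[0], idx[-1] + 1))

# John(0) saw(1) Mary(2) and(3) Mark(4) saw(5) them(6)
mark_saw_them = [4, 5, 6]   # contiguous: admissible at threshold 1
john_saw_them = [0, 1, 6]   # gap over positions 2-5: rejected
```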
A number of built-in predicates are provided which allow the user to constrain the behaviour of the parser in various ways, based on the notions of coverage, span and threshold: lhip_phrase(+C, +S) Succeeds if the input S can be parsed as an instance of category C.</Paragraph> <Paragraph position="20"> lhip_cv_phrase(+C, +S) As for lhip_phrase/2, except that all of the input must be covered.</Paragraph> <Paragraph position="21"> lhip_phrase(+C, +S, -B, -E, -Cov) As for lhip_phrase/2, except that B binds to the beginning of the island described by this application of C, E binds to the position immediately following the end, and Cov binds to the number of terminals covered.</Paragraph> <Paragraph position="22"> lhip_mc_phrases(+C, +S, -Cov, -Ps) The maximal coverage of S by C is Cov. Ps is the set of parses of S by C with coverage Cov. lhip_minmax_phrases(+C, +S, -Cov, -Ps) As for lhip_mc_phrases/4, except that Ps is additionally the set of parses with the least span.</Paragraph> <Paragraph position="23"> lhip_seq_phrase(+C, +S, -Seq) Succeeds if Seq is a sequence of one or more parses of S by C such that they are non-overlapping and each consumes input that precedes that consumed by the next.</Paragraph> <Paragraph position="24"> lhip_maxT_phrases(+C, +S, -MaxT) MaxT is the set of parses of S by C that have the highest threshold value. On backtracking it returns the set with the next highest threshold value.</Paragraph> <Paragraph position="25"> In addition, other predicates can be used to search the chart for constituents that have been identified but have not been attached to the parse tree. These include: lhip_success Lists successful rules, indicating island position and coverage.</Paragraph> <Paragraph position="26"> lhip_ms_success As for lhip_success, but lists only the most specific successful rules (i.e.
those which have themselves succeeded but whose results have not been used elsewhere).</Paragraph> <Paragraph position="27"> lhip_ms_success(N) As for lhip_ms_success, but lists only successful instances of rule N.</Paragraph> <Paragraph position="28"> Even if a sentence receives no complete analysis, it is likely to contain some parsable substrings; results from these are recorded together with their position within the input. By using these predicates, partial but possibly useful information can be extracted from a sentence despite a global failure to parse it (see section 4).</Paragraph> <Paragraph position="29"> The conversion of the grammar into Prolog code means that the user of the system can easily develop analysis tools that apply different constraints, using the tools provided as building blocks.</Paragraph> </Section> <Section position="4" start_page="504" end_page="504" type="metho"> <SectionTitle> 3 Using LHIP </SectionTitle> <Paragraph position="0"> As previously mentioned, LHIP facilitates a cyclic approach to grammar development. Suppose one is writing an English grammar for the Map Task Corpus, and that the following is the first attempt at a rule for noun phrases (with appropriate rules for determiners and nouns):</Paragraph> <Paragraph position="2"> While this rule will adequately analyse simple NPs such as your map, or a missionary camp, on an NP such as the bottom right-hand corner it will give analyses for the bottom, the right-hand and the corner. Worse still, in a long sentence it will join determiners from the start of the sentence to nouns that occur in the latter half of the sentence. The number of superfluous analyses can be reduced by imposing a local threshold level, of say 0.5.
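The pruning effect of such a threshold on this rule can be seen with a little arithmetic. The helper below is a hypothetical Python illustration, not part of LHIP.

```python
def det_noun_ratio(det_pos, noun_pos):
    """Coverage/span ratio for an analysis covering only a determiner
    and a noun: the farther apart the two words are, the lower the
    ratio, so a local threshold of 0.5 rejects long-distance
    pairings."""
    span = noun_pos - det_pos + 1
    return 2 / span

# Adjacent 'the corner': 2/2 = 1.0, accepted.
# Determiner five words before the noun: 2/6, rejected at 0.5.
```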
By looking at the various analyses of sentences in the corpus, one can see that this rule gives the skeleton for noun phrases, but from the fraction of coverage of these parses one can also see that it leaves out an important feature, adjectives, which are optionally found in noun phrases. np(N, D, A) # 0.5 ~~> determiner(D), (? adjectives(A) ?), * noun(N).</Paragraph> <Paragraph position="3"> With this rule, one can now handle such phrases as the left-hand bottom corner, and a banana tree. Suppose further that this rule is now applied to the corpus, and then the rule is applied again but with a local threshold level of 1. By looking at items parsed in the first case but not in the second, one can identify features of noun phrases found in the corpus that are not covered by the current rules. This might include, for instance, phrases of the form a slightly dipping line. One can then go back to the grammar and see that the noun phrase rule needs to be changed to account for certain types of modifier including adjectives and adverbial modifiers.</Paragraph> <Paragraph position="4"> It is also possible to set local thresholds dynamically, by making use of the '{ prolog code }' facility:</Paragraph> <Paragraph position="6"> In this way, the strictness of a rule may be varied according to information originating either within the particular run-time invocation of the rule, or elsewhere in the current parse.
For example, it would be possible, by providing a suitable definition for set_dynamic_threshold/2, to set T to 0.5 when more than one optional adjective has been found, and 0.9 otherwise.</Paragraph> <Paragraph position="7"> Once a given rule or set of rules is stable, and the writer is satisfied with the performance of that part of the grammar, a local threshold value of 1 may be assigned so that superfluous parses will not interfere with work elsewhere.</Paragraph> <Paragraph position="8"> The use of the chart to store known results and failures allows the user to develop hybrid parsing techniques, rather than relying on the default depth-first top-down strategy given by analysing with respect to the top-most category.</Paragraph> <Paragraph position="9"> For instance, it is possible to analyse the input in 'layers' of linguistic categories, perhaps starting by analysing noun-phrases, then prepositions, verbs, relative clauses, clauses, conjuncts, and finally complete sentences. Such a strategy allows the user to perform processing of results between these layers, which can be useful in trying to find the 'best' analyses first.</Paragraph> </Section> <Section position="5" start_page="504" end_page="505" type="metho"> <SectionTitle> 4 Partial results </SectionTitle> <Paragraph position="0"> The discussion of built-in predicates mentioned facilities for recovering partial parses. Here we illustrate this process, and indicate what further use might be made of the information thus obtained. In the following example, the chart is inspected to reveal what constituents have been built during a failed parse of the truncated sentence Have you the tree by the brook that...
:</Paragraph> <Paragraph position="2"> Each rule is listed with its identifier ('-1' for lexical rules), the island which it has analysed with beginning and ending positions, its coverage, and the representation that was constructed for it.</Paragraph> <Paragraph position="3"> From this output it can be seen that the grammar manages reasonably well with noun phrases, but is unable to deal with questions (the initial auxiliary have remains unattached).</Paragraph> <Paragraph position="4"> Users will often be more interested in the successful application of rules which represent maximal constituents. These are displayed by lhip_ms_success. Here, two unattached lexical items have been identified, together with two instances of rule 4, which combines a NP with a postmodifying PP.</Paragraph> <Paragraph position="5"> The first of these has analysed the island you the tree by the brook, ignoring the tree, while the second has analysed the tree by the brook, consuming all terminals. There is a second analysis for the tree by the brook, due to rule 5, which has been obtained by ignoring the sequence tree by the. From this information, a user might wish to rank the three results according to their respective span:coverage ratios, probably preferring the second.</Paragraph> </Section> </Paper>