<?xml version="1.0" standalone="yes"?> <Paper uid="P92-1024"> <Title>Development and Evaluation of a Broad-Coverage Probabilistic Grammar of English-Language Computer Manuals</Title> <Section position="5" start_page="185" end_page="186" type="metho"> <SectionTitle> 3 We discovered that the grammar's coverage (to be defined later) </SectionTitle> <Paragraph position="0"> of the training set increased quickly to above 98% as soon as the grammarian identified the problem sentences. So we have been</Paragraph> </Section> <Section position="6" start_page="186" end_page="186" type="metho"> <SectionTitle> IN It_PPH1 N\] </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> caster Treebank.</Paragraph> <Paragraph position="3"> the grammar fails to parse. We currently have about 25,000 sentences for training.</Paragraph> <Paragraph position="4"> The point of the treebank parses is to constitute a &quot;strong filter,&quot; that is, to eliminate incorrect parses, on the set of parses proposed by a grammar for a given sentence. A candidate parse is considered to be &quot;acceptable&quot; or &quot;correct&quot; if it is consistent with the treebank parse. We define two notions of consistency: structure-consistent and label-consistent. The span of a constituent is the string of words which it dominates, denoted by a pair of indices (i, j) where i is the index of the leftmost word and j is the index of the rightmost word. We say that a constituent A with span (i, j) in a candidate parse for a sentence is structure-consistent with the treebank parse for the same sentence in case there is no constituent in the treebank parse having span (i', j') satisfying i' &lt; i &lt;= j' &lt; j or i &lt; i' &lt;= j &lt; j'.</Paragraph> <Paragraph position="6"> In other words, there can be no &quot;crossings&quot; of the span of A with the span of any treebank non-terminal. 
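The no-crossing test is easy to operationalize. The following is a minimal Python sketch (function and variable names are ours, not part of the system described): spans are (i, j) word-index pairs, and a candidate parse is structure-consistent when none of its spans crosses any treebank span.

```python
def crosses(span_a, span_b):
    """True if the two spans overlap but neither contains the other."""
    (i, j), (i2, j2) = span_a, span_b
    overlap = j2 >= i and j >= i2          # the spans share at least one word
    a_inside_b = i >= i2 and j2 >= j       # span_a dominated by span_b
    b_inside_a = i2 >= i and j >= j2       # span_b dominated by span_a
    return overlap and not a_inside_b and not b_inside_a

def structure_consistent(candidate_spans, treebank_spans):
    """A candidate parse passes if no span crosses any treebank span."""
    return all(not crosses(c, t) for c in candidate_spans
               for t in treebank_spans)
```

Label-consistency would additionally compare the (suitably mapped) labels of identically spanned constituents, as described in Section 4.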
A grammar parse is structure-consistent with the treebank parse if all of its constituents are structure-consistent with the treebank parse.</Paragraph> <Paragraph position="7"> continuously increasing the training set as more data is treebanked. The notion of label-consistent requires, in addition to structure-consistency, that the grammar constituent name is equivalent 4 to the treebank non-terminal label.</Paragraph> <Paragraph position="8"> The following example will serve to illustrate our consistency criteria. We compare a &quot;treebank parse&quot;:</Paragraph> <Paragraph position="10"> For the structure-consistent criterion, the first and second candidate parses are correct, even though the first one has a more detailed constituent spanning (4, 5). The third is incorrect since the constituent NT6 is a case of a crossing bracket. For the label-consistent criterion, the first candidate parse is the only correct parse, because it has all of the bracket labels and parts-of-speech of the treebank parse. The second candidate parse is incorrect, since two of its part-of-speech labels and one of its bracket labels differ from those of the treebank parse.</Paragraph> <Paragraph position="11"> Grammar writing and statistical estimation: The task of developing the requisite system is factored into two parts: a linguistic task and a statistical task.</Paragraph> <Paragraph position="12"> The linguistic task is to achieve perfect or near-perfect coverage of the test set. By this we mean that among the n parses provided by the parser for each sentence of the test dataset, there must be at least one which is consistent with the treebank filter. 
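Ambiguity over a test corpus is conveniently summarized as the geometric mean of the number of parses per word; a small sketch of that bookkeeping (the helper names are ours, not from the paper):

```python
import math

def parses_per_word(parse_counts, sentence_lengths):
    """Geometric mean of the number of parses per word over a corpus.

    Equivalent to (product of all parse counts) ** (1 / total words),
    computed in log space to avoid overflow on long corpora.
    """
    total_words = sum(sentence_lengths)
    total_log_parses = sum(math.log(n) for n in parse_counts)
    return math.exp(total_log_parses / total_words)
```

Holding this quantity fixed is the same as holding fixed the total number of parses for the whole corpus.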
To eliminate trivial solutions to this task, the grammarian must hold constant over the course of development the geometric mean of the number of parses per word, or equivalently the total number of parses for the entire test corpus.</Paragraph> <Paragraph position="13"> The statistical task is to supply a stochastic model for probabilistically training the grammar such that the parse selected as the most likely one is a correct parse. 6 4See Section 4 for the definition of a many-to-many mapping between grammar and treebank non-terminals for determining equivalence of non-terminals.</Paragraph> <Paragraph position="14"> 5We propose this sense of the term coverage as a replacement for the sense in current use, viz. simply supplying one or more parses, correct or not, for some portion of a given set of sentences. 6Clearly the grammarian can contribute to this task by, among other things, not just holding the average number of parses constant, but in fact steadily reducing it. The importance of this contribution will ultimately depend on the power of the statistical models developed after a reasonable amount of effort. The above decomposition into two tasks should lead to better broad-coverage grammars. In the first task, the grammarian can increase coverage since he can examine examples of specific uncovered sentences. The second task, that of selecting a parse from the many parses proposed by a grammar, can best be done by maximum likelihood estimation constrained by a large treebank. The use of a large treebank allows the development of sophisticated statistical models that should outperform the traditional approach of using human intuition to develop parse preference strategies. We describe in this paper a model based on probabilistic context-free grammars estimated with a constrained version of the Inside-Outside algorithm (see Section 4) that can be used for picking a parse for a sentence. In \[2\], we describe a more sophisticated stochastic grammar that achieves even higher parsing accuracy.</Paragraph> </Section> <Section position="7" start_page="186" end_page="189" type="metho"> <SectionTitle> 3. Grammar </SectionTitle> <Paragraph position="0"> Our grammar is a feature-based context-free phrase structure grammar employing traditional syntactic categories. Each of its roughly 700 &quot;rules&quot; is actually a rule template, compressing a family of related productions via unification. 7 Boolean conditions on values of variables occurring within these rule templates serve to limit their ambit where necessary. To illustrate, the rule template</Paragraph> <Paragraph position="2"> imposes agreement of the children with reference to feature f2, and percolates this value to the parent. Acceptable values for feature f3 are restricted to three (d,g,h) for the second child (and the parent), and include all possible values for feature f3 except k, for the first child. Note that the variable value is also allowed in all cases mentioned (V1,V2,V3). If the set of licit values for feature f3 is {d,e,f,g,h,i,j,k,l}, and that for feature f2 is {r,s}, then, allowing for the possibility of variables remaining as such, the rule template above represents 3*4*9 = 108 different rules. If the condition were removed, the rule template would stand for 3*10*10 = 300 different rules.</Paragraph> <Paragraph position="3"/> <Paragraph position="4"> 7Unification is to be understood in this paper in a very limited sense, which is precisely stated in Section 4. 
Our grammar is not a unification grammar in the sense which is most often used in the literature.</Paragraph> <Paragraph position="5"> 8where f1, f2, f3 are features; a, b, c are feature values; and V1, V2, V3 are variables over feature values. While a non-terminal in the above grammar is a feature vector, we group multiple non-terminals into one class which we call a mnemonic, and which is represented by the least-specified non-terminal of the class. A sample mnemonic is N2PLACE (Noun Phrase of semantic category Place). This mnemonic comprises all non-terminals that unify with:</Paragraph> <Paragraph position="7"> including, for instance, Noun Phrases of Place with no determiner, Noun Phrases of Place with various sorts of determiner, and coordinate Noun Phrases of Place.</Paragraph> <Paragraph position="8"> Mnemonics are the &quot;working nonterminals&quot; of the grammar; our parse trees are labelled in terms of them. A production specified in terms of mnemonics (a mnemonic production) is actually a family of productions, in just the same way that a mnemonic is a family of non-terminals.</Paragraph> <Paragraph position="9"> Mnemonics and mnemonic productions play key roles in the stochastic modelling of the grammar (see below). A recent version of the grammar has some 13,000 mnemonics, of which about 4000 participated in full parses on a run of this grammar on 3800 sentences of average word length 12. On this run, 440 of the 700 rule templates contributed to full parses, with the result that the 4000 mnemonics utilized combined to form approximately 60,000 different mnemonic productions. The grammar has 21 features, whose numbers of possible values range from 2 to 99, with a median of 8 and an average of 18. Three of these features are listed below, with the function of each: To handle the huge number of linguistic distinctions required for real-world text input, the grammarian uses many of the combinations of the feature set. 
A sample rule (in simplified form) illustrates this:</Paragraph> <Paragraph position="11"> This rule says that a lexical adjective parses up to an adjective phrase. The logically primary use of the feature &quot;details&quot; is to more fully specify conjunctions and phrases involving them. Typical values, for coordinating conjunctions, are &quot;or&quot; and &quot;but&quot;; for subordinating conjunctions and associated adverb phrases, they include e.g. &quot;that&quot; and &quot;so.&quot; But for content words and phrases (more precisely, for nominal, adjectival and adverbial words and phrases), the feature, being otherwise otiose, carries the semantic category of the head.</Paragraph> <Paragraph position="12"> The mnemonic names incorporate &quot;semantic&quot; categories of phrasal heads, in addition to various sorts of syntactic information (e.g. syntactic data concerning the embedded clause, in the case of &quot;that-clauses&quot;). The &quot;semantics&quot; is a subclassification of content words that is designed specifically for the manuals domain. To provide examples of these categories, and also to show a case in which the semantics succeeded in correctly biasing the probabilities of the trained grammar, we contrast (simplified) parses by an identical grammar, trained on the same data (see below), with the one difference that semantics was eliminated from the mnemonics of the grammar that produced the first parse below.</Paragraph> <Paragraph position="13"> \[SC\[V1 Enter \[N2\[N2 the name \[P1 of the system</Paragraph> <Paragraph position="15"> nect \[P1WO to P1\]V2\]V1\]SD\]N2\]P1\]N2\]V1\]SC\].</Paragraph> <Paragraph position="16"> What is interesting here is that the structural parse is different in the two cases. 
The first case, which does not match the treebank parse, 9 parses the sentence in the same way as one would understand the sentence, &quot;Enter the chapter of the manual you want to begin with.&quot; In the second case, the semantics were able to bias the statistical model in favor of the correct parse, i.e. one which does match the treebank parse. As an experiment, the sentence was submitted to the second grammar with a variety of different verbs in place of the original verb &quot;connect&quot;, to make sure that it is actually the semantic class of the verb in question, and not some other factor, that accounts for the improvement. Whenever verbs were substituted that were licit syntactically but not semantically (e.g. adjust, comment, lead), the parse was as in the first case above. Of course other verbs of the class &quot;ORGANIZE&quot; were associated with the correct parse, and verbs that were not even permitted syntactically occasioned the incorrect parse.</Paragraph> <Paragraph position="17"> 9\[V Enter \[N the name \[P of \[N the system \[Fr\[N you \]\[V want \[Wl to connect \[P to \]\]\]\]\]\]\]\].</Paragraph> <Paragraph position="18"> We employ a lexical preprocessor to mark multiword units as well as to license unusual part-of-speech assignments, or even force labellings, given a particular context. For example, in the context: &quot;How to:&quot;, the word &quot;How&quot; can be labelled once and for all as a General Wh-Adverb, rather than a Wh-Adverb of Degree (as in, &quot;How tall he is getting!&quot;). Three sample entries from our lexicon follow: &quot;Full-screen&quot; is labelled as an adjective which usually bears an attributive function, with the semantic class &quot;Screen-Part&quot;. &quot;Hidden&quot; is categorized as a past participle of semantic class &quot;Alter&quot;. &quot;1983&quot; can be a temporal noun (viz. a year) or else a number. 
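A hypothetical encoding of these three entries might look as follows. The JB and VVN labels appear elsewhere in this paper; NNT1 and MC are our assumed tags for temporal noun and number, and the dictionary layout is purely illustrative.

```python
# word -> list of (part-of-speech tag, semantic class) readings
lexicon = {
    "full-screen": [("JB", "Screen-Part")],   # attributive adjective
    "hidden": [("VVN", "Alter")],             # past participle
    "1983": [("NNT1", None), ("MC", None)],   # temporal noun, or plain number
}

def lookup(word):
    # unknown words fall through to the empty list
    return lexicon.get(word.lower(), [])
```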
Note that all of these classifications were made on the basis of the examination of concordances over a several-hundred-thousand-word sample of manuals data. Possible uses not encountered were in general not included in our lexicon.</Paragraph> <Paragraph position="19"> Our approach to grammar development, syntactic as well as lexical, is frequency-based. In the case of syntax, this means that, at any given time, we devote our attention to the most frequently-occurring construction which we fail to handle, and not the most &quot;theoretically interesting&quot; such construction.</Paragraph> </Section> <Section position="8" start_page="189" end_page="190" type="metho"> <SectionTitle> 4. Statistical Training and Evaluation </SectionTitle> <Paragraph position="0"> In this section we will give a brief description of the procedures that we have adopted for parsing and training a probabilistic model for our grammar. In parsing with the above grammar, it is necessary to have an efficient way of determining if, for example, a particular feature bundle A = (A1, A2, ..., AN) can be the parent of a given production, some of whose features are expressed as variables. As mentioned previously, we use the term unification to denote this matching procedure, and it is defined precisely in Figure 2.</Paragraph> <Paragraph position="1"> In practice, the unification operations are carried out very efficiently by representing bundles of features as bitstrings, and realizing unification in terms of logical bit operations in the programming language PL.8, which is similar to C. We have developed our own tools to translate the rule templates and conditions into PL.8 programs.</Paragraph> <Paragraph position="2"> A second operation that is required is to partition the set of nonterminals, which is potentially extremely large, into a set of equivalence classes, or mnemonics, as mentioned earlier. 
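The feature-wise matching behind this bitstring trick can be sketched with Python sets standing in for bit-fields (a sketch under our own naming, not the system's code; in PL.8 each slot test is a single bitwise AND):

```python
def unify_bundles(a, b):
    """Two feature bundles unify if every feature slot has a nonempty meet.

    Each slot is a set of admissible values; a variable is encoded as the
    full value set, so it unifies with anything (all bits set in the
    bitstring realization).
    """
    return all(x.intersection(y) for x, y in zip(a, b))
```

For example, a slot pair {"r", "s"} vs {"r"} unifies, while {"r"} vs {"s"} fails.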
In fact, it is useful to have a tree, which hierarchically organizes the space of possible feature bundles into increasingly detailed levels of semantic and syntactic information. (Figure 2: UNIFY(A, B): do for each feature f, if not FEATURE_UNIFY(Af, Bf) then return FALSE; return TRUE. FEATURE_UNIFY(a, b): if a = b then return TRUE; else if a is variable or b is variable then return TRUE; return FALSE.) Each node of the tree is itself represented by a feature bundle, with the root being the feature bundle all of whose features are variable, and with a decreasing number of variable features occurring as a branch is traced from root to leaf. To find the mnemonic M(A) assigned to an arbitrary feature bundle A, we find the node in the mnemonic tree which corresponds to the smallest mnemonic that contains (subsumes) the feature bundle A, as indicated in Figure 3.</Paragraph> <Paragraph position="3"> M(A): return SEARCH_SUBTREE(root, A).</Paragraph> <Paragraph position="5"> SEARCH_SUBTREE(n, A): do for each child m of n, if Mnemonic(m) contains A then return SEARCH_SUBTREE(m, A); return Mnemonic(n). (Figure 3) Unconstrained training: Since our grammar has an extremely large number of non-terminals, we first describe how we adapt the well-known Inside-Outside algorithm to estimate the parameters of a stochastic context-free grammar that approximates the above context-free grammar. We begin by describing the case, which we call unconstrained training, of maximizing the likelihood of an unbracketed corpus. We will later describe the modifications necessary to train with the constraint of a bracketed corpus.</Paragraph> <Paragraph position="6"> To describe the training procedure we have used, we will assume familiarity with both the CKY algorithm \[?\] and the Inside-Outside algorithm \[?\], which we have adapted to the problem of training our grammar. The main computations of the Inside-Outside algorithm are indexed using the CKY procedure, which is a bottom-up chart parsing algorithm. 
To summarize the main points in our adaptation of these algorithms, let us assume that the grammar is in Chomsky normal form. The general case involves only straightforward modifications. Proceeding in a bottom-up fashion, then, we suppose that we have two nonterminals (bundles of features) B and C, and we find all nonterminals A for which A -> B C is a production in the grammar. This is accomplished by using the unification operation and checking that the relevant Boolean conditions are satisfied for the nonterminals A, B, and C.</Paragraph> <Paragraph position="7"> Having found such a nonterminal, the usual Inside-Outside algorithm requires a recursive update of the inside probabilities IA(i, j) and outside probabilities OA(i, j) that A spans (i, j). These updates involve the probability parameter Pr(A -> B C).</Paragraph> <Paragraph position="8"> In the case of our feature-based grammar, however, the number of such parameters would be extremely large (the grammar can have on the order of a few billion nonterminals). We thus organize productions into the equivalence classes induced by the mnemonic classes on the non-terminals. The update then uses mnemonic productions for the stochastic grammar, using the parameter Pr(M(A) -> M(B) M(C)).</Paragraph> <Paragraph position="10"> Of course, for lexical productions A -> w we use the corresponding probability Pr(M(A) -> w)</Paragraph> <Paragraph position="12"> in the event that we are rewriting not a pair of nonterminals, but a word w.</Paragraph> <Paragraph position="13"> Thus, probabilities are expressed in terms of the set of mnemonics (that is, by the nodes in the mnemonic tree), rather than in terms of the actual nonterminals of the grammar. It is in this manner that we can obtain efficient and reliable estimates of our parameters. Since the grammar is very detailed, the mnemonic map M can be increasingly refined so that a greater number of linguistic phenomena are captured in the probabilities. 
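The pooling of production statistics over mnemonic classes can be sketched as follows (illustrative names only; in the actual system these ratios are computed from expected counts accumulated inside the Inside-Outside iterations, not from raw counts):

```python
from collections import defaultdict

def estimate_mnemonic_probs(production_counts, M):
    """Pool counts of A -> B C productions by mnemonic class.

    production_counts: dict mapping (A, B, C) nonterminal triples to counts.
    M: function mapping a nonterminal to its mnemonic class.
    Returns Pr(M(A) -> M(B) M(C)) as relative frequencies,
    normalized per parent mnemonic.
    """
    counts = defaultdict(float)
    totals = defaultdict(float)
    for (a, b, c), n in production_counts.items():
        key = (M(a), M(b), M(c))
        counts[key] += n
        totals[key[0]] += n
    return {key: n / totals[key[0]] for key, n in counts.items()}
```

Lexical parameters Pr(M(A) -> w) would be pooled in the same way over (nonterminal, word) pairs.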
In principle, this could be carried out automatically to determine the optimum level of detail to be incorporated into the model, and different parameterizations could be smoothed together. To date, however, we have only constructed mnemonic maps by hand, and have thus experimented with only a small number of parameterizations. Constrained training: The Inside-Outside algorithm is a special case of the general EM algorithm, and as such, successive iteration is guaranteed to converge to a set of parameters which locally maximize the likelihood of generating the training corpus. We have found it useful to employ the treebank to supervise the training of these parameters. Intuitively, the idea is to modify the algorithm to locally maximize the likelihood of generating the training corpus using parses which are &quot;similar&quot; to the treebank parses. This is accomplished by only collecting statistics over those parses which are consistent with the treebank parses, in a manner which we will now describe. The notion of label-consistent is defined by a (many-to-many) mapping from the mnemonics of the feature-based grammar to the nonterminal labels of the treebank grammar. For example, our grammar maintains a fairly large number of semantic classes of singular nouns, and it is natural to stipulate that each of them is label-consistent with the nonterminal NN1 denoting a generic singular noun in the treebank. Of course, to exhaustively specify such a mapping would be rather time-consuming. In practice, the mapping is implemented by organizing the nonterminals hierarchically into a tree, and searching for consistency in a recursive fashion.</Paragraph> <Paragraph position="14"> The simple modification of the CKY algorithm which takes into account the treebank parse is, then, the following. 
Given a pair of nonterminals B and C in the CKY chart, if the span of the parent is not structure-consistent, then this occurrence of B C cannot be used in the parse and we continue to the next pair. If, on the other hand, it is structure-consistent, then we find all candidate parents A for which A -> B C is a production of the grammar, but include only those that are label-consistent with the treebank nonterminal (if any) in that position. The probabilities are updated in exactly the same manner as for the standard Inside-Outside algorithm. The procedure that we have described is called constrained training, and it significantly improves the effectiveness of the parser, providing a dramatic reduction in computational requirements for parameter estimation as well as a modest improvement in parsing accuracy.</Paragraph> <Paragraph position="15"> Sample mappings from the terminals and non-terminals of our grammar to those of the Lancaster treebank are provided in Table 5. For ease of understanding, we use the version of our grammar in which the semantics are eliminated from the mnemonics (see above). Category names from our grammar are shown first, and the Lancaster categories to which they map are shown second: The first case above is straightforward: our prepositional-phrase category maps to Lancaster's. In the second case, we break down the category Relative Clause more finely than Lancaster does, by specifying the syntax of the embedded clause (e.g. FRV2: &quot;that opened the adapter&quot;). The third case relates to relative clauses lacking prefatory particles, such as: &quot;the row you are specifying&quot;; we would call &quot;you are specifying&quot; an SD (Declarative Sentence), while Lancaster calls it an Fr (Relative Clause). Our practice of distinguishing constituents which function as interrupters from the same constituents tout court accounts for the fourth case; the category in question is Infinitival Clause. 
Finally, we generate attributive adjectives (JB) directly from past participles (VVN) by rule, whereas Lancaster opts to label as adjectives (JJ) those past participles so functioning.</Paragraph> </Section> </Paper>