<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-2065">
  <Title>PROBABILISTIC TREE-ADJOINING GRAMMAR AS A FRAMEWORK FOR STATISTICAL NATURAL LANGUAGE PROCESSING</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Probabilistic CFGs
</SectionTitle>
    <Paragraph position="0"> Thanks to the increased availability of text corpora, fast computers, and inexpensive off-line storagc, statistical approaches to issues in natural \]mlguage processing are enjoying a surge of interest, across a wide variety of applications. As research iu tbis area progresses, the question of how to combine our existing knowledge of grammatical methods (e.g. generative power, effEcient parsing) with dcvelol)ments in statistical and information-theoretic methods (especially techniques imported from the speech-processing community) takes on increasing significance.</Paragraph>
    <Paragraph position="1"> Perhaps the most straightforward combination of grammatical and statistical techniques can be found in the probabilistic generalization of context-free gram*This reaeardl was supported by the following gremts: ARO DAAL 03~9-C-0031, DARPA N00014-90-J-1863, NSF lRI 9016592, and Ben Franklin 91S.3078C-1, I would like to thank</Paragraph>
    <Paragraph position="3"> we VP (rs = N ~ we) we V NP (r4 = VP ---* V NP) we like NP (r5 = V ---* like) we like N (r6 = NP ---, N) we like Mary (r7 = N ~ Mary)  mars. \[Jelinek et al., 1990\] characterize a probabilistic context-free grammar (PCFG) as a context-free grammar in which each production has been assigned a probability of use. I will refer to the entire set of probabilities as the statistical parameters, or simply parameters, of the probabilistie grammar.</Paragraph>
    <Paragraph position="4"> For exmnple, the productions in Figure 1, together with their associated parameters, might comprise a fragment of a PCFG for English. Note that for each nonterminal symbol, the probabilities of its alternative expansions sum to 1. Tbis captures the fact that in a context-free derivation, each instance of a nontermina\] symbol must be rewritten according to exactly one of its expansions. Also by definition of a context-free derivation, each rewriting is independent of the context within which the nonterminal appears. So, for example, in the (leftmost) derivation of We like Mary (Figure 2), the expansions N ~ we and VP --~ V NP are independent events.</Paragraph>
    <Paragraph position="5"> The probability of n independent events is ACTES DE COLING-92, NANTES, 23-28 Aot)r 1992 4 1 8 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 merely the product of the probabilities of each individual event. Therefbre the probability of a context-frce derivation with rewritings rt, r~, ..., rn is Pr(rl)Pr(r2) ... Pr(rn).</Paragraph>
    <Paragraph position="6"> Jelinek et al. use this fact to develop a set of efficient algorithms (based on the CKY parsing algorithm) to compute the probability of a sentence given a grammar, to find the most probable parse of a given sentence, to compute the probability of an initial substring leading to a sentence generated by the grarmnar, and, following \[Baker, 1979\], to estimate the statistical parameters for a given grammar using a large text corpus.</Paragraph>
    <Paragraph position="7"> Now, the definition of the probability of a derivation that is used for PCFG depends crucially upon the context-freeness of the grammar; without it, the independence assumption that pernfits us to simply multiply probabilities is compromised. Yet experience tells us that rule expmmions are not, in general, context-free. Consider the following simple example. According to one frequency estimate of English usage \[~Y=tmeis and Kucera, 1982\], he is more titan three times as likely as we to appear as a subject pronoun. So if we is replaced with he in Figure 2, the new derivation (of He hke Mary) is accorded far higher probability according to the PCFG, even though He like Mary is not English. The problem extends beyond such obvious cases of agreenmnt: since crushing happens to be more likely than mashing, the probability of deriving lie's crushing potatoes is greater than the probability corresponding derivation for He's mashing potatoes, although we expect the latter to be more likely.</Paragraph>
    <Paragraph position="8"> lu short, lexical context matters. Although probnbilistic context-free grammar captures the fact that not all nonterminal rewritiugs are equally likely, its insensitivity to lexical context makes it less than adequate as a statistical model of natural language, t and weakeus Jelinek et al.'s contention that &amp;quot;in an ambiguous but appropriately chosen probabilistic CFG (PCFG), correct parses are Ifigh probability parses&amp;quot; (p. 2). In practical terms, it also suggests that techniques for incorporatiug lexical context-sensitivity will potentially improve PCFG performance.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Co-occurrence Statistics
</SectionTitle>
    <Paragraph position="0"> A second common approach to the corpus-based investigation of natural language considers instances of words occurring adjacent to each other in the text, or, more generally, words occurring together within a window of fixed size. For example, \[Church and tlanks, 1989\] use an information-theoretic measure, mutual information, to calculate the degree of association of 1 Independent of the statiatical i~ue, of course, it is also generally accepted that CFGs are gcneratively too weak to model natural ltmguage in its full generality \[Shieber, 1985\].</Paragraph>
    <Paragraph position="1"> word pairs based upon their co-occurrence throughout a large corpus. Tile nmtual information between two events is defined as</Paragraph>
    <Paragraph position="3"> where, for tile purposes of corpus analysis, Pr(x) and Pr(y) are the respective probabilities of words x and y appearing within the corpus, and Pr(z, y) is the probability of word x being followed by word y within a window of w words. Intuitively, mutual information relates the actual probability of seeing x and y together (numerator) to the probability of seeing them together under the as~sumption that they were independent events (denominator).</Paragraph>
    <Paragraph position="4"> As defined here, tbe calculation of mutual information takes into account only information about co-occurrence within surface strings. A difficulty with such an al~proash, however, is that cO-occurrences of interest may not lie sufficiently local. Although sentence (1)a can witness the co-oceurence of students and spend within a small window (say, w = 2 or 3), sentence (1)b cannot.</Paragraph>
    <Paragraph position="5"> (1)a. Students spend a lot of money.</Paragraph>
    <Paragraph position="6"> b. Students who attend conferances spend a lot of money.</Paragraph>
    <Paragraph position="7"> Simply increasing the window size will not suffice, for two rc~sons. First, there is no bound on the length of relative clauses such as the one in (1)b, hence no fixed value of w that will solve the problem. Second, the choice of window size depends on the application -- Church and Hanks write that &amp;quot;smaller window sizes will identify fixed expressions (idioms) and other relations that hold over short ranges; larger window sizes will highlight semantic concepts and other relationships that hold over larger scales&amp;quot; (p. 77). Extending the window size in order to capture co-occurrences such as the one found in (1)1) may therefore undermine other goals of the analysis.</Paragraph>
    <Paragraph position="8"> \[Brown et ul., 1991\] encounter a similar problem.</Paragraph>
    <Paragraph position="9"> Their application is statistics-driven machine translation, where the statistics are calculated on the basis of trigranm (observed triples) of words within the source and target corpora. As in (1), sentences (2)a and (3)a illustrate a difficulty encountered when limiting window size. Light verb constructions are an example of a situation in which the correct translation of the verb depends on the identity of its object. The correct word sense of p~ndre, &amp;quot;to make,&amp;quot; is chosen in (2)a, since the dependency signalling that word sense -- between prcndre and its object, decision -- falls within a tri-gram window and thus within the bounds of their language model.</Paragraph>
    <Paragraph position="10"> (2)a. prendre la decision ACrEs DE COLING-92, NANTES, 23-28 ^ofrr 1992 4 1 9 PROC. OF COL1NG-92, NANTES, AUC;, 23-28, 1992 b. make the decision (3)a. prendre une di\]fficile dgcision b. *take a difficult decision In (3)a, however, the dependency is not sufficiently local: the model has no access to lexical relationships spanning a distance of more than three words. Without additional information about the verb's object, the system relies on the fact that the sense of prendre a.~ &amp;quot;to take&amp;quot; occurs with greater frequency, and tire incorrect translation results.</Paragraph>
    <Paragraph position="11"> Brown et al. propose to solve the problem by seeking clues to a word's sense within a larger context. Each possible clue, or informant, is e~ site relative to the word. For prendre, the informant of interest is &amp;quot;the first noun to the right&amp;quot;; other potential informants include &amp;quot;the first verb to the right,&amp;quot; and so on. A set of such potential informants is defined in advance, and statistical techniques are used to identify, for each word, which potential informant contribntes most to determining the sense of the word to use.</Paragraph>
    <Paragraph position="12"> It seems clear that in most cases, Brown et al.'s informants represent approximations to structural relationships such as subject and object. Examples (1) through (3) suggest that attention to structure would have advantages for the analysis of lexical relationships. By using co-occurrence within a structure rather than co-occurrence within a window, it is possible to capture lexical relationships without regard to irrelevant intervening material. One way to express this notion is by saying that structural relationships permit us to use a notion of locality that is more general than simple distance in the surface string. So, for example, (1) demonstrates that the relationship between a verb and its subject is not affected by a subject relative clanse of any length.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Hybrid Approaches
</SectionTitle>
    <Paragraph position="0"> In the previous sections I have observed that probabillstic CFGs, though capable of capturing the relative likelihoods of context-free derivations, require greater sensitivity to lexical context in order to model natural languages. Conversely, lexica\] co-occurrence analyses seem to require a richer notion of locality, something that structural relationships such as subject and object can provide. In this section I briefly discuss two proposals that could be considered &amp;quot;hybrid&amp;quot; approaches, making use of both grammatical structure and lexical co-occurrence, The first of these, \[Hindle, 1990\], continues in the direction suggested at the end of the previous section: rather than using &amp;quot;informants&amp;quot; to approximate structural relationships, lexical eo-oecnrrenee is calculated over structures. Hindle uses co-occurrence statistics, collected over parse trees, in order to classify nouns on the basis of the verb contexts in which they appear. First, a robust parser is used to obtain a parse tree (possibly partial) for each sentence. For example, the following table contains some of the information Hindle's parser retrieved from the parse of the sentence &amp;quot;The clothes we wear, the food we eat, the air we breathe, the watcr we drink, the land that sustains us, and many of the prodncts we use are the result of agricultural research.&amp;quot; I verb \] subject I object I e~t we food breathe we air drink we water sust'ain land us Next, Hindle calculates a similarity measure based on tile mutual information of verbs and their arguments: nouns that tend to appear as subjects and objects of the sanae verbs are judged more similar. According to the criterion that that two nouns be reciprocally most similar to each other, Hindle's analysis derives such plausible pairs of similar nonns as ruling and decision, battle and fight, researcher and scientist. A proposal in \[Magerman and Marcus, 1991\] relates to probabilistie parsing. As in PCFG, Magerman and Marcus associate probabilities with the rules of a context-free grammar, but these probabilities are conditioned on the contexts within which a rule is observed to be used. It is in the formulation of &amp;quot;context&amp;quot; that lexical co-occurrence plays a role, albeit an indirect one. Magerman and Marcus's Pearl parser is essentially a bottom-up chart parser with Earley-style top-down prediction. When a context-free rule A -~ al...a~ is proposed as an entry in ttle chart spanning input symbols al through ai, it is assigned a &amp;quot;score&amp;quot; based  on 1. tile rule that generated this instance of nontermihal symbol A, and 2. tile part-of-speech trigram centered at al.</Paragraph>
    <Paragraph position="1">  For example, given the input My first love was named Pearl, a proposed chart entry VP---+ V NP starting its span at love (i.e., a theory trying to interpret love as a verb) would be scored on the basis of the rule that generated the VP (in this ease, probably S ---* NP VP) together with the part-of-speech trigram &amp;quot;adjective verb verb.&amp;quot; This score would be lower than that of a different chart entry interpreting love as a noun, since the latter would be scored using the more likely part of speech trigram &amp;quot;adjective noun verb&amp;quot;. So, although tire context-free probability favors the interpretation of love as the beginning of a verb phrase, ACRES DI&amp;quot; COLUqG-92, NANTES, 23-28 AOl3&amp;quot;r 1992 4 2 0 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992 information about the lexical context of the word rescues the correct interpretation, in which it is categorized as a noun.</Paragraph>
    <Paragraph position="2"> Although Pearl does not take into account tile relationships between particular words, it does represent a promising combiuation of CFG-based probahilistic parsing mid corpus statistics calculated on the basis of simple (trigram) co-occurrences.</Paragraph>
    <Paragraph position="3"> A difficulty with the hybrid approaches, however, is that they leave unclear exactly what the statistical model of tile language is. In probabilistic context:i~ee grannnar and in surface-string analysis of lexical c(&gt; occurrence, there is a precise deIinition of the event space -- what events go into making up a sentence.</Paragraph>
    <Paragraph position="4"> (As discussed earlier, the events in a PCFG derivation are the rule expansions; the events in a window-baqed analysis of surface strings call be viewed as transitions in a finite Markov chain modelling the language.) The absence of sitch a characterization in the hybrid approaches makes it more difficult to identify what assumptions are being made, and gives such work a decidedly empirical flavor.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Probabilistic TAG
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Lexicalized TAG
</SectionTitle>
      <Paragraph position="0"> The previous sections have demonstrated a need for a well-defned statistical framework in which both syntactic structure and texical co, occurrences are incorporated. In thLs section, I argue that a probabilistic form of lexicalized tree-adjoining grammar provides just such a seamless combination. \[ begin with a brief description of lcxicalized tree-adjoining grammar, and then present its probabilistic generalization and the advantages of the resulting formalism.</Paragraph>
      <Paragraph position="1"> &amp;quot;l~'ec~adjoining grammar \[Joshi ctaL, 1975\], or TAG, is a generalization of context-fl'ee grannnar that has been proposed as a useful forlualism for the study of natural languages. A tree-adjoining grammar eompriscs two sets of elementary structures: initial trees and auxiliary trees. (See l,'igure 3, in which initial and auxiliary trees are labelled using cr and f3, respectively.) An auxiliary tree, by definition, has a uonterminal node on its frontier - tile fool node -- that matches the nonterminal symbol at its root.</Paragraph>
      <Paragraph position="2"> These elementary structures can be combined using two operations, substitution and adjunction. The substitution operation corresponds to the rewriting of a symbol in a context-free derivation: a nonterminal node at the frontier of a tree is replaced with au initial tree having the same nonterminal symbol at its rout.</Paragraph>
      <Paragraph position="3"> So, for example, one could expand either NP in al by rewriting it as tree ccu.</Paragraph>
      <Paragraph position="5"> The adjunction operation is a generalization of substitution that permits internal as well as frontier nodes to be expanded. One adjoins an auxiliary tree fl into a tree c~ at node r I by &amp;quot;splitting&amp;quot; ~ horizontally at r 1 and then inserting ft. In the resulting structure, 7, f)'s root node appears in tfs original position, mid its foot node dominates all the material that r I donfinated previously (see Figure 4). For example, l&amp;quot;igure 5 shows T1, the result of adjoining auxiliary tree ~1 at the N node of e~2.</Paragraph>
      <Paragraph position="6"> }br context-free grmnmar, each derived structure is itself a record of the operations that produced it: one Call simply read from a parse tree tile context-free rules that were used in the parse. In contrm~t, a derived structure in a tree-adjoining grammar is distinct from its derivation history. The final parse tree encodes the slruclure of the sentence according to the grammar, but the events that constitute the derivation history that is, the subsitutions and adjunctions that took place are not directly encoded. The significance of this distinction will become apparent shortly.</Paragraph>
      <Paragraph position="7"> A lcxicalized tree-adjoining gra~nmar is it TAG in which each elemcutary structure (initial or auxiliary tree) has a lexieal item on its frontier, known as its anchor \[Schabes, 1990\]. Another natural way to say this is that each lexical item has associated with it  ACRES 1OE COLING-92, NANTES, 23-28 AOl~l&amp;quot; 1992 4 2 1 PROC. Or; COLING-92, NANTJ!S. AUG. 23-28, 1992 a set of structures that, taken together, characterize the contexts within which that item can appear. 2 For example, tree al in Figure 3 represents one possible structural context for eat; another would be an initial tree in which eat appears as an intransitive verb, and yet another would be a tree containing eat within a passive construction.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Features of Probabilistic TAG
</SectionTitle>
      <Paragraph position="0"> Three features of lexicalized TAG urake it particularly appropriate as a probabilistic framework for natural language processing. First, since every tree is associated with a lexica\] anchor, words and their associated structures are tightly linked. Thus, unlike probabilistic context-free grammar, the probabilities associated with structural operations are sensitive to lexical context. In particular, the probability of an adjunetion step such as the one that produced 71, above, is sensitive to the lexical anchors of the trees. One would expect a similar adjunction of ~1 into aa (resulting in the string roasted people) to have extremely low probability, reflecting the low degree of association between the lexical items involved. Similarly, tree aa is far more likely than a2 to be substituted as the subject NP in tree ~1, since people is more likely than peanuts to appear as the subject of eat.</Paragraph>
      <Paragraph position="1"> Notice that this attention to lexical context is not acquired at the expense of the independence assumption for probabilities. Just as expansion of a nonterminal node in a CFG derivation takes place without regard to that node's history, substitution or adjunction to a node during a TAG derivation takes place without regard to that node's history, a Thus it will be straightforward to express the probability of a probabilistic TAG derivation as a product of probabilities, just as was the case for PCFG, yielding a well-defined statistical model.</Paragraph>
      <Paragraph position="2"> Second, since TAG factors local dependencies from recursion, the problem of intervening material in window-based lexical co-occurrence analyses does not arise. Notice that in tree c~t, the verb, its subject, and its object all appear together within a single elementary structure. The introduction of recursive substructure -- relative clauses, adverbs, strings of adjectival modifiers -- is done entirely by means of the adjunction operation, as illustrated by figures 4 and 5. This fact provides a principled treatment of examples such as (4): the probability of substituting cr 2 as the object NP of ax (capturing an association between eat and peanuts)  tion quite nicely in obscxving that for both CFG derivation trees and TAG derivation histories, the path setA (set of possible paths from root to frontier) are regular.</Paragraph>
      <Paragraph position="3"> is independent of any other operations that might take place, such as the introduction (via adjunction) of an adjectival modifier. Similarly, the probability of substituting aa as the subject NP of al is not affected by the subsequent adjunction of a relative clause (cf. exampie (1)). Thus, in contrast to a co-occnrrence analysis based on strings, the analysis of (4) within probabilistic TAG finds precisely the same relationship between the verb and its arguments in (4)b and (4)c as it does in (4)a.</Paragraph>
      <Paragraph position="4">  (4)a. People eat peanuts.</Paragraph>
      <Paragraph position="5"> b. People eat roasted peanuts.</Paragraph>
      <Paragraph position="6"> c. People who are v~iting the zoo eat roasted  peanuts.</Paragraph>
      <Paragraph position="7"> Third, the notion of lexical anchor in lexicalized TAG has been generalized to account for multi-word lexicat entries \[Abeill6 and Schabes, 1989\]. Thus the formalism appears to satisfy the criteria of \[Smadja and McKeown, 1990\], who write of the need for &amp;quot;a flexible lexicon capable of using single word entries, multiple word entries as well as phrasal templates and a mechanism that would be able to gracefully merge and combine them with other types of constraints&amp;quot; (p. 256). Among &amp;quot;other types of constraints&amp;quot; are linguistic criteria, and here, too, TAG offers the potential for capturing the details of linguistic and statistical facts in a uniform way \[Kroch mad Joshi, 1985\].</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Formalization of Probabilistic TAG
</SectionTitle>
      <Paragraph position="0"> There are a number of ways to express lexicalized tree-adjoining grammar as a probabilistic grammar formalism. 4 Here I propose what appears to me to be the most direct probabilistic generalization of lexiealized TAG; a different treatment can be found in \[Schabes, 1992\].</Paragraph>
      <Paragraph position="1"> Definitions: Let I denote the set of initial trees in the grammar, and A the set of auxiliary trees.</Paragraph>
      <Paragraph position="2"> Each tree in a lexlcalized TAG has a (possibly empty) subset of its frontier nodes marked as nodes at which substitution may take place. Given a tree c~, let that subset be denoted by s(a).</Paragraph>
      <Paragraph position="3"> Adjunction may take place at any node r/ in labelled by a nonterminal symbol, so long as *7 s(~). Denote this set of possible adjunetion nodes n(.). 5 4 The idea of defining probabilities over derivations involving combinations of elementary structur~ was introduced as early as \[J~hi, 1973\].</Paragraph>
      <Paragraph position="4"> ~Note that ~almtitution nodes and adjmmtion nodes ar~ not disting~fished in Figure 3.</Paragraph>
      <Paragraph position="5"> AcrEs DE COLING-92, NANTES, 23-28 nO~&amp;quot; 1992 4 2 2 PROC. OF COLING-92, NA/CrES, AUG. 23-28, 1992 Let S(a, a', r/) denote the event of substituting tree c~ * into tree c~ at node r/.</Paragraph>
      <Paragraph position="6"> Let A(a,~,O) denote the event of adjoining auxiliary tree ~ into tree c~ at node r/, and let A(oC/, none, U) denote the event in which no adjunction is performed at that node.</Paragraph>
      <Paragraph position="7"> Let f~ = the set of all substitution and adjunction events.</Paragraph>
      <Paragraph position="8">  Pa(A(a,~,r/)) denotes the probability of adjoining fl at node 7/of tree a (where PA(A(a, none,t\])) denotes the probability that no adjunetion takes place at node o of a).</Paragraph>
      <Paragraph position="9"> A TAG derivation is described by the initial tree with which it starts, together with the sequence of substitutions and adjunction operations that then take place. Denoting each operation as op(cq,a2,~l),op E {S,A}, and denoting the initial tree with whieb tim derivation starts as ~0, the probability of a TAG deriva-</Paragraph>
      <Paragraph position="11"> This definition is directly analogous to the probability of a context-free derivation r = (rh..., r,),</Paragraph>
      <Paragraph position="13"> though in a CFG every derivation starts with S and so PI(S) -= 1.</Paragraph>
      <Paragraph position="14"> Does probabilistic TAG, thus formalized, behave as desired? Returning to example (4), consider the following derivation history for the sentence People eat roasted peanuts:  mates for the probabilities of substitution and adjunetion, the probability of this derivation would indeed be far greater than for the corresponding derivation of Peanuts eat roasted people. Thus, in contrast to PCFG, probabitistic TAG's sensitivity to lexieal context does provide a more accurate language model. In addition, were this derivation to be used as an observation for the purpose of estimating probabilities, the estimate of Ps(S(at,C/t3, NPs,,bj)) would be unaffected by the event A(a2, /!/1, N). That is, the model would in fact capture a relationship between the verb eat and its object peanuts, mediated by the trees that they anchor, regardless of intervening material in the surface string.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Acquisition of Probabilistic TAG
</SectionTitle>
      <Paragraph position="0"> Despite the attractive theoretical features of probabilistic TAG, many practical issues remain. Foremost among these is the acquisition of the statistical parameters. \[Sehabes, 1992\] has recently adapted the Inside-Outside algorithm, used for estinaating the parameters of a probabilistic CFG \[Baker, 1979\], to probabilistie TAG. The Inside-Outside algorithm is itself a generalization of tile Forward-Backward algorithm used to train hidden Markov models \[Baum, 1972\]. It optimizes by starting with a set of initial pareaneters (chosen randomly, or uniformly, or perhaps initially set on the basis of a priori knowledge), then iteratively collecting statistics over the training data (using the existing parameters) and then re-estimating the parameters on the basis of those statistics. Each re-estimation step guarantees improved or at worst equivalent parameters, according to a maximmn-likelihood criterion.</Paragraph>
      <Paragraph position="1"> The procedure immediately raises two questions, regardless of whether the probabilistic formalism under cousideration is finite-state, context-free, or treeadjoining. First, since the algorithm re-estimates parameters but does not determine the rules in the grammar, the structures underlying the statistics (ares, rules, trees) must bc determined in some other fashion.</Paragraph>
      <Paragraph position="2"> The starting point can be a hand-written grammar, which requires considerable effort and linguistic knowledge; altcrnatively, one can initially include all possible structures and let tim statistics weed out useless rules by assigning them near-zero probability. A third possibility is to add an engine that hypothesizes rules on the basis of observed data, a~ld then let the parameter estimation operate over these hypothesized structures.</Paragraph>
      <Paragraph position="3"> Which possibility is best depends upon the application and the breadth of linguistic coverage needed.</Paragraph>
      <Paragraph position="4"> Second, any statistical approach to natural language nmst consider tim size of the parameter set to be estimated. Additional parameters can enhance the theoretical accuracy of a statistical model (e.g., extending a trigram model to 4-grams) but may also lead to an ACTES DE COL1NG-92. NANqES. 23-28 Ao(rr 1992 4 2 3 PRoc. ov COLING-92. NArCrES, Au6.23-28, 1992 unmanageable number of parameters to store and retrieve, much less estimate. In addition, as the number of parameters grows, more data are required in order to collect accurate statistics. For example, both (5)a and (5)b witness the same relationship between called and son within a trigram model.</Paragraph>
      <Paragraph position="5"> (5)a. Mary called her son.</Paragraph>
      <Paragraph position="6"> b. Mary called her son Junior.</Paragraph>
      <Paragraph position="7"> Within a recent lexicalized TAG for English \[Abeill~ el al., 1990\], however, these two instances of called are associated with two distinct initiM trees, reflecting the different syntactic structures of the two examples.</Paragraph>
      <Paragraph position="8"> Thus observations that would be the same for the tri-gram model can be fragmented in the TAG model. As a result of this fragmentation, each individual event is observed fewer times, and so the model is more vulnerable to statistical inaccuracy resulting from low counts. The acquisition of a probababilistie tree-adjoining grammar &amp;quot;from the ground up,&amp;quot; that is, including hypothesizing grammatical structures as well as estimating statistical parameters, is an intended topic of future work.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Conclusions
</SectionTitle>
    <Paragraph position="0"> 1 have argued that probabilistic 'rAG provides a seamless framework that combines the simplicity and structure of the probabilistie CFG approach with the lexieal sensitivity of string-based co-occurrence analyses.</Paragraph>
    <Paragraph position="1"> Within a wide variety of applications, it appears that various researchers (e.g. \[tIindle, 1990; Magerman and Marcus, 1991; Brown el al., 1991; Smadja and MeKeown, 1990\]) are confronting issues similar to those discussed here. As the resources required for statistical approaches to natural language continue to become more readily available, these issues will take on increasing importance, and the need for a framework that unifies grammatical and statistical techniques will continue to grow. Probabilistie TAG is one step toward such a unifying framework.</Paragraph>
  </Section>
class="xml-element"></Paper>