<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1019">
  <Title>Partial Training for a Lexicalized-Grammar Parser</Title>
  <Section position="3" start_page="144" end_page="145" type="metho">
    <SectionTitle>
2 The CCG Parsing Model
</SectionTitle>
    <Paragraph position="0"> Clark and Curran (2004b) describes two log-linear parsing models for CCG: a normal-form derivation model and a dependency model. In this paper we use the dependency model, which requires sets of predicate-argument dependencies for training.1 1Hockenmaier and Steedman (2002) describe a generative model of normal-form derivations; one possibility for training this model on partial data, which has not been explored, is to use the EM algorithm (Pereira and Schabes, 1992).</Paragraph>
    <Paragraph position="1"> The predicate-argument dependencies are represented as 5-tuples: &lt;hf,f,s,ha,l&gt; , where hf is the lexical item of the lexical category expressing the dependency relation; f is the lexical category; s is the argument slot; ha is the head word of the argument; and l encodes whether the dependency is non-local. For example, the dependency encoding company as the object of bought (as in IBM bought the company) is represented as follows: &lt;bought2, (S\NP1)/NP2, 2, company4,[?]&gt; (1) CCG dependency structures are sets of predicate-argument dependencies. We define the probability  ofadependencystructureasthesumoftheprobabilities of all those derivations leading to that structure (Clark and Curran, 2004b). &amp;quot;Spurious ambiguity&amp;quot; in CCG means that there can be more than one derivation leading to any one dependency structure. Thus, the probability of a dependency structure, pi, given a sentence, S, is defined as follows:</Paragraph>
    <Paragraph position="3"> where [?](pi) is the set of derivations which lead to pi.</Paragraph>
    <Paragraph position="4"> The probability of a &lt;d,pi&gt; pair, o, conditional on a sentence S, is defined using a log-linear form:</Paragraph>
    <Paragraph position="6"> where l.f(o) =summationtexti lifi(o). The function fi is the integer-valued frequency function of the ith feature; li is the weight of the ith feature; and ZS is a normalising constant.</Paragraph>
    <Paragraph position="7"> Clark and Curran (2004b) describes the training procedure for the dependency model, which uses a discriminativeestimationmethodbymaximisingthe conditional likelihood of the model given the data (Riezler et al., 2002). The optimisation of the objective function is performed using the limited-memory BFGS numerical optimisation algorithm (Nocedal and Wright, 1999; Malouf, 2002), which requires calculation of the objective function and the gradient of the objective function at each iteration.</Paragraph>
    <Paragraph position="8"> The objective function is defined below, where L(L) is the likelihood and G(L) is a Gaussian prior term for smoothing.</Paragraph>
    <Paragraph position="9">  He anticipates growth for the auto maker</Paragraph>
  </Section>
  <Section position="4" start_page="145" end_page="145" type="metho">
    <SectionTitle>
2 The CCG Parsing Model (continued)
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> S1,...,Sm are the sentences in the training data; pi1,...,pim are the corresponding gold-standard dependency structures; r(S) is the set of possible &lt;derivation, dependency-structure&gt; pairs for S; s is a smoothing parameter; and n is the number of features. The components of the gradient vector are:</Paragraph>
    <Paragraph position="4"> The first two terms of the gradient are expectations of feature fi: the first expectation is over all derivations leading to each gold-standard dependency structure, and the second is over all derivations for each sentence in the training data. The estimation process attempts to make the expectations in (5) equal (ignoring the Gaussian prior term). Another way to think of the estimation process is that it attempts to put as much mass as possible on the derivations leading to the gold-standard structures (Riezler et al., 2002).</Paragraph>
    <Paragraph position="5"> Calculation of the feature expectations requires summing over all derivations for a sentence, and summing over all derivations leading to a gold-standard dependency structure. Clark and Curran (2003) shows how the sum over the complete derivation space can be performed efficiently using a packed chart and the inside-outside algorithm, and Clark and Curran (2004b) extends this method to sum over all derivations leading to a gold-standard dependency structure.</Paragraph>
  </Section>
  <Section position="5" start_page="145" end_page="147" type="metho">
    <SectionTitle>
3 Partial Training
</SectionTitle>
    <Paragraph position="0"> The partial data we use for training the dependency model is derived from CCG lexical category sequences only. Figure 1 gives an example sentence adapted from CCGbank (Hockenmaier, 2003) together with its lexical category sequence. Note that, although the attachment of the prepositional phrase to the noun phrase is not explicitly represented, it can be inferred in this example because the lexical category assigned to the preposition has to combine with a noun phrase to the left, and in this example there is only one possibility. One of the key insights in this paper is that the significant amount of syntactic information in CCG lexical categories allows us to infer attachment information in many cases.</Paragraph>
    <Paragraph position="1"> Theprocedureweuseforextractingdependencies from a sequence of lexical categories is to return all  thosedependencieswhichoccurink%ofthederivations licenced by the categories. By giving the k parameter a high value, we can extract sets of dependencies with very high precision; in fact, assuming that the correct lexical category sequence licences the correct derivation, setting k to 100 must result in 100% precision, sinceanydependency whichoccurs in every derivation must occur in the correct derivation. Of course the recall is not guaranteed to be high; decreasingk has the effect of increasing recall, but at the cost of decreasing precision.</Paragraph>
    <Paragraph position="2"> The training method described in Section 2 can be adapted to use the (potentially incomplete) sets of dependencies returned by our extraction procedure. In Section 2 a derivation was considered correct if it produced the complete set of gold-standard dependencies. In our partial-data version a derivation is considered correct if it produces dependencies which are consistent with the dependencies returned by our extraction procedure. We define consistency as follows: a set of dependencies D is consistent with a set G if G is a subset of D. We also say that a derivation d is consistent with dependency set G if G is a subset of the dependencies produced by d.</Paragraph>
    <Paragraph position="3">  This definition of &amp;quot;correct derivation&amp;quot; will introduce some noise into the training data. Noise arises from sentences where the recall of the extracted dependencies is less than 100%, since some of the derivations which are consistent with the extracted dependencies for such sentences will be incorrect.</Paragraph>
    <Paragraph position="4"> Noise also arises from sentences where the precisionoftheextracteddependenciesislessthan100%, null since for these sentences every derivation which is consistent with the extracted dependencies will be incorrect. The hope is that, if an incorrect derivation produces mostly correct dependencies, then it can still be useful for training. Section 4 shows how the precision and recall of the extracted dependencies varies with k and how this affects parsing accuracy.</Paragraph>
    <Paragraph position="5"> The definitions of the objective function (4) and the gradient (5) for training remain the same in the partial-data case; the only differences are that [?](pi)  isnowdefinedtobethosederivationswhichareconsistent with the partial dependency structure pi, and the gold-standard dependency structures pij are the partial structures extracted from the gold-standard lexical category sequences.2 Clark and Curran (2004b) gives an algorithm for finding all derivations in a packed chart which produce a particular set of dependencies. This algorithm is required for calculating the value of the objective function (4) and the first feature expectation in (5). We adapt this algorithm for finding all derivations which are consistent with a partial dependency structure. The new algorithm is shown in Figure 2.</Paragraph>
    <Paragraph position="6"> The algorithm relies on the definition of a packed chart, which is an instance of a feature forest (Miyao and Tsujii, 2002). The idea behind a packed chart is that equivalent chart entries of the same type and in the same cell are grouped together, and back pointers to the daughters indicate how an individual entry was created. Equivalent entries form the same structures in any subsequent parsing.</Paragraph>
    <Paragraph position="7"> A feature forest is defined in terms of disjunctive and conjunctive nodes. For a packed chart, the individual entries in acell are conjunctive nodes, and the equivalence classes of entries are disjunctive nodes.</Paragraph>
    <Paragraph position="8"> The definition of a feature forest is as follows: A feature forest Ph is a tuple &lt;C,D,R,g,d&gt; where: 2Note that the procedure does return all the gold-standard dependencies for some sentences.</Paragraph>
    <Paragraph position="9"> &lt;C,D,R,g,d&gt; is a packed chart / feature forest G is a set of dependencies returned by the extraction procedure Let c be a conjunctive node Let d be a disjunctive node deps(c) is the set of dependencies on node c</Paragraph>
    <Paragraph position="11"> mark(d): mark d as a correct node foreach c [?] g(d) if dmax(c) == dmax(d) mark c as a correct node  with a partial dependency structure</Paragraph>
    <Paragraph position="13"> Dependencies are associated with conjunctive nodes in the feature forest. For example, if the disjunctive nodes (equivalence classes of individual entries) representing the categories NP and S\NP combine to produce a conjunctive node S, the resulting S node will have a verb-subject dependency associated with it.</Paragraph>
    <Paragraph position="14"> In Figure 2, cdeps(c) is the number of dependencies on conjunctive node c which appear in partial structure G; dmax(c) is the maximum number of dependencies in G produced by any sub-derivation headed by c; dmax(d) is the same value for disjunctive node d. Recursive definitions for calculating these values are given; the base case occurs when conjunctive nodes have no disjunctive daughters.</Paragraph>
    <Paragraph position="15">  Thealgorithmidentifiesallthoserootnodesheading derivations which are consistent with the partial dependency structure G, and traverses the chart top-down marking the nodes in those derivations. The insight behind the algorithm is that, for two conjunctive nodes in the same equivalence class, if one node heads a sub-derivation producing more dependencies in G than the other node, then the node with  less dependencies inGcannot be part of a derivation consistent with G.</Paragraph>
    <Paragraph position="16"> The conjunctive and disjunctive nodes appearing in derivations consistent with G form a new &amp;quot;goldstandard&amp;quot; feature forest. The gold-standard forest, and the complete forest containing all derivations spanning the sentence, can be used to estimate the likelihood value and feature expectations required by the estimation algorithm. Let EPhLfi be the expected value of fi over the forest Ph for model L; then the values in (5) can be obtained by calculating EPhjL fi for the complete forest Phj for each sentence Sj in the training data (the second sum in (5)), and also EPsjL fi for each forest Psj of derivations consistentwiththepartialgold-standarddependencystruc- null ture for sentence Sj (the first sum in (5)):</Paragraph>
    <Paragraph position="18"> where logZPh is the normalisation constant for Ph.</Paragraph>
  </Section>
class="xml-element"></Paper>