<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1620">
  <Title>Multilingual Deep Lexical Acquisition for HPSGs via Supertagging</Title>
  <Section position="5" start_page="165" end_page="166" type="metho">
    <SectionTitle>
3 Task and Resources
</SectionTitle>
    <Paragraph position="0"> In this section, we outline the resources targeted in this research, namely the English Resource Grammar (ERG: Flickinger (2002), Copestake and Flickinger (2000)) and the JACY grammar of Japanese (Siegel and Bender, 2002). Note that our choice of the ERG and JACY as testbeds for experimentation in this paper is somewhat arbitrary, and that we could equally run experiments over any Grammar Matrix-based grammar for which there is treebank data.</Paragraph>
    <Paragraph position="1"> Both the ERG and JACY are implemented open-source broad-coverage precision Head-driven Phrase Structure Grammars (HPSGs: Pollard and Sag (1994)). A lexical item in each of the grammars consists of a unique identifier, a lexical type (a leaf type of a type hierarchy), an orthography, and a semantic relation. For example, in the English grammar, the lexical item for the noun dog is simply:</Paragraph>
    <Paragraph position="3"> in which the lexical type of n - c le encodes the fact that dog is a noun which does not subcategorise for any other constituents and which is countable, &amp;quot;dog&amp;quot; specifies the lexical stem, and &amp;quot; dog n 1 rel&amp;quot; introduces an ad hoc predicate name for the lexical item to use in constructing a semantic representation. In the context of the ERG and JACY, DLA equates to learning the range of lexical types a given lexeme occurs with, and generating a single lexical item for each.</Paragraph>
    <Paragraph position="4"> Recent development of the ERG and JACY has been tightly coupled with treebank annotation, and all major versions of both grammars are deployed over a common set of dynamically-updateable treebank data to help empirically trace the evolution of the grammar and retrain parse selection models (Oepen et al., 2002a; Bond et al., 2004).</Paragraph>
    <Paragraph position="5"> This serves as a source of training and test data for building our supertaggers, as detailed in Table 1.</Paragraph>
    <Paragraph position="6"> In translating our treebank data into a form that can be understood by a supertagger, multiword expressions (MWEs) pose a slight problem. Both the ERG and JACY include multiword lexical items, which can either be strictly continuous (e.g. hot line) or optionally discontinuous (e.g. transitive English verb particle constructions, such as pick up as in Kim picked the book up).</Paragraph>
    <Paragraph position="7"> Strictly continuous lexical items are described by way of a single whitespace-delimited lexical stem (e.g. STEM &lt; &amp;quot;hot line&amp;quot; &gt;). When faced with instances of this lexical item, the supertagger must perform two roles: (1) predict that the words hot and line combine together to form a single lexeme, and (2) predict the lexical type associated with the lexeme. This is performed in a single step through the introduction of the ditto lexical type, which indicates that the current word combines (possibly recursively) with the left-adjacent word to form a single lexeme, and shares the same lexical type. This tagging convention is based on that used, e.g., in the CLAWS7 part-of-speech tagset.</Paragraph>
    <Paragraph position="8"> Optionally discontinuous lexical items are less of a concern, as selection of each of the discontinuous &amp;quot;components&amp;quot; is done via lexical types. E.g. in the case of pick up, the lexical entry looks as follows:</Paragraph>
    <Paragraph position="10"> in which &amp;quot;pick&amp;quot;selects for the up p sel rel predicate, which in turn is associated with the stem &amp;quot;up&amp;quot; and lexical type p prtcl le. In terms of lexical tag mark-up, we can treat these as separate  tags and leave the supertagger to model the mutual inter-dependence between these lexical types.</Paragraph>
    <Paragraph position="11"> For detailed statistics of the composition of the two grammars, see Table 1.</Paragraph>
    <Paragraph position="12"> For morphological processing (including tokenisation and lemmatisation), we use the pre-existing machinery provided with each of the grammars. In the case of the ERG, this consists of a finite state machine which feeds into lexical rules; in the case of JACY, segmentation and lemmatisation is based on a combination of ChaSen (Matsumoto et al., 2003) and lexical rules. That is, we are able to assume that the Japanese data has been pre-segmented in a form compatible with JACY, as we are able to replicate the automatic pre-processing that it uses.</Paragraph>
  </Section>
  <Section position="6" start_page="166" end_page="167" type="metho">
    <SectionTitle>
4 Supertagging
</SectionTitle>
    <Paragraph position="0"> The DLA strategy we adopt in this research is based on supertagging, which is a simple instance of sequential tagging with a larger, more linguistically-diverse tag set than is conventionally the case, e.g., with part-of-speech tagging. Below, we describe the pseudo-likelihood CRF model we base our supertagger on and outline the feature space for the two grammars.</Paragraph>
    <Section position="1" start_page="166" end_page="167" type="sub_section">
      <SectionTitle>
4.1 Pseudo-likelihood CRF-based
Supertagging
</SectionTitle>
      <Paragraph position="0"> CRFs are undirected graphical models which define a conditional distribution over a label sequence given an observation sequence. Here we use CRFs to model sequences of lexical types, where each input word in a sentence is assigned a single tag.</Paragraph>
      <Paragraph position="1"> The joint probability density of a sequence labelling, a0 (a vector of lexical types), given the input sentence, a1 , is given by:</Paragraph>
      <Paragraph position="3"> where we make a first order Markov assumption over the label sequence. Here a29 ranges over the word indices of the input sentence (a1 ), a41 ranges over the model's features, and a42a43a11a45a44 a25 a23a47a46 are the model parameters (weights for their corresponding features). The feature functions a27a48a23 are pre-defined real-valued functions over the input sentence coupled with the lexical type labels over adjacent &amp;quot;times&amp;quot; (= sentence locations) a29 . These feature functions are unconstrained, and may represent overlapping and non-independent features of the data. The distribution is globally normalised by the partition function, a40 a3a49a5 a1a28a9 , which sums out the numerator in (1) for every possible labelling:  We use a linear chain CRF, which is encoded in the feature functions of (1).</Paragraph>
      <Paragraph position="4"> The parameters of the CRF are usually estimated from a fully observed training sample, by maximising the likelihood of these data. I.e.</Paragraph>
      <Paragraph position="6"> is the complete set of training data.</Paragraph>
      <Paragraph position="7"> However, as calculating a40 a3a49a5 a1a28a9 has complexity quadratic in the number of labels, we need to approximate a2a56a3a6a5 a0a8a7a1a28a9 in order to scale our model to hundreds of lexical types and tens-of-thousands of training sentences. Here we use the pseudo-likelihood approximation a2a48a74 a60a3 (Li, 1994) in which the marginals for a node at time a29 are calculated with its neighbour nodes' labels fixed to those ob-</Paragraph>
    </Section>
    <Paragraph position="4"> is the lexical type label observed in the training data and a78 ranges over the label set. This approximation removes the need to calculate the partition function, thus reducing the complexity to be linear in the number of labels and training instances. null Because maximum likelihood estimators for log-linear models have a tendency to overfit the training sample (Chen and Rosenfeld, 1999), we define a prior distribution over the model parameters and derive a maximum a posteriori (MAP)</Paragraph>
    <Paragraph position="6"> We use a zero-mean Gaussian prior, with the probability density function a2a83a82a47a5 a25 a23 a9a85a84</Paragraph>
    <Paragraph position="8"> In order to train the model, we maximize (4).</Paragraph>
    <Paragraph position="9"> While the log-pseudo-likelihood cannot be maximised for the parameters, a42 , in closed form, it is a convex function, and thus we resort to numerical optimisation to find the globally optimal parameters. We use L-BFGS, an iterative quasi-Newton optimisation method, which performs well for training log-linear models (Malouf, 2002; Sha and Pereira, 2003). Each L-BFGS iteration requires the objective value and its gradient with respect to the model parameters.</Paragraph>
    <Paragraph position="10"> As we cannot observe label values for the test data we must use a2a48a3a49a5 a0a8a7a1a28a9 when decoding. The Viterbi algorithm is used to find the maximum posterior probability alignment for test sentences,</Paragraph>
    <Section position="1" start_page="167" end_page="168" type="sub_section">
      <SectionTitle>
4.2 CRF features
</SectionTitle>
      <Paragraph position="0"> One of the strengths of the CRF model is that it supports the use of a large number of non-independent and overlapping features of the input sentence. Table 2 lists the word context and lexical features used by the CRF model (shared across both grammars).</Paragraph>
      <Paragraph position="1"> Word context features were extracted from the words and lexemes of the sentence to be labelled combined with a proposed label. A clique label pair feature was also used to model sequences of lexical types.</Paragraph>
      <Paragraph position="2"> For the lexical features, we generate a feature for the unigram, bigram and trigram prefixes and suffixes of each word (e.g. for bottles, we would generate the prefixes b, bo and bot, and the suffixes s, es and les); for words in the test data, we generate a feature only if that feature-value is attested in the training data. We additionally test each word for the existence of one or more elements of a range of character sets a108a110a109 . In the case of English, we focus on five character sets: upper case letters, lower case letters, numbers, punctuation and hyphens. For the Japanese data, we employ six character sets: Roman letters, hiragana, katakana, kanji, (Arabic) numerals and punctuation. For example, a111a113a112a115a114a113a116 &amp;quot;mouldy&amp;quot; would be flagged as containing katakana character(s), kanji character(s) and hiragana character(s) only. Note that the only language-dependent component of  the lexical features is the character sets, which requires little or no specialist knowledge of the language. Note also that for languages with infixing, such as Tagalog, we may want to include a10 -gram infixes in addition to a10 -gram prefixes and suffixes. Here again, however, the decision about what range of affixes is appropriate for a given language requires only superficial knowledge of its morphology.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>