<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1048"> <Title>A Maximum Entropy Model for Prepositional Phrase Attachment</Title> <Section position="2" start_page="0" end_page="250" type="metho"> <SectionTitle> 2. Maximum Entropy Modeling </SectionTitle> <Paragraph position="0"> The Maximum Entropy model [1] produces a probability distribution for the PP-attachment decision using only information from the verb phrase in which the attachment occurs.</Paragraph> <Paragraph position="1"> We denote the partially parsed verb phrase, i.e., the verb phrase without the attachment decision, as a history h, and the conditional probability of an attachment as p(d|h), where $d \in \{0, 1\}$ corresponds to a noun or verb attachment, respectively. The probability model depends on certain features of the whole event (h, d), denoted f_i(h, d). An example of a binary-valued feature function is the indicator function that a particular (V, P) bigram occurred along with the attachment decision being V; i.e., f_{print,on}(h, d) is one if and only if the main verb of h is &quot;print&quot;, the preposition is &quot;on&quot;, and d is &quot;V&quot;. As discussed in [6], the ME principle leads to a model for p(d|h) which maximizes the training data log-likelihood,
$$L(p) = \sum_{h,d} \tilde{p}(h,d)\, \log p(d|h),$$
where $\tilde{p}(h,d)$ is the empirical distribution of the training set, and where p(d|h) itself is an exponential model:
$$p(d|h) = \frac{\exp\left(\sum_{i=1}^{k} \lambda_i f_i(h,d)\right)}{\sum_{d'} \exp\left(\sum_{i=1}^{k} \lambda_i f_i(h,d')\right)}.$$
</Paragraph> <Paragraph position="2"> At the maximum of the training data log-likelihood, the model has the property that its k parameters, namely the $\lambda_i$'s, satisfy k constraints on the expected values of the feature functions, where the ith constraint is
$$E_p f_i = E_{\tilde{p}} f_i.$$
The model expected value is
$$E_p f_i = \sum_{h,d} \tilde{p}(h)\, p(d|h)\, f_i(h,d),$$
and the training data expected value, also called the desired value, is
$$E_{\tilde{p}} f_i = \sum_{h,d} \tilde{p}(h,d)\, f_i(h,d).$$
The values of these k parameters can be obtained by one of many iterative algorithms. For example, one can use the Generalized Iterative Scaling algorithm of Darroch and Ratcliff [3]. As one increases the number of features, the achievable maximum of the training data likelihood increases. We describe in Section 3 a method for determining a reliable set of features.</Paragraph> </Section> <Section position="4" start_page="250" end_page="252" type="metho"> <SectionTitle> 3. Features </SectionTitle> <Paragraph position="0"> Feature functions allow us to use informative characteristics of the training set in estimating p(d|h). A feature is defined as follows:
$$f_i(h,d) \stackrel{\mathrm{def}}{=} \begin{cases} 1, & \text{iff } d = 0 \text{ and } \forall q \in Q_i,\ q(h) = 1 \\ 0, & \text{otherwise,} \end{cases}$$
where Q_i is a set of binary-valued questions about h.</Paragraph>
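To make the preceding definitions concrete, here is a minimal sketch of how such a conditional model could be evaluated. It is not the authors' implementation: the dictionary-based history representation, the feature names, and the weight values are invented for illustration, and in practice the weights would be estimated by an algorithm such as GIS.

```python
# Illustrative sketch: a tiny conditional ME model over d in {0, 1}
# (0 = noun attachment, 1 = verb attachment) with hand-picked features.
import math

def f_print_on(h, d):
    """Fires iff the verb is 'print', the preposition is 'on', and d is the V-attachment."""
    return 1 if h["V"] == "print" and h["P"] == "on" and d == 1 else 0

def f_null(h, d):
    """The Null feature: fires on every noun attachment (d == 0)."""
    return 1 if d == 0 else 0

FEATURES = [f_print_on, f_null]
LAMBDAS = [1.2, -0.4]          # lambda_i's; hypothetical values, normally learned by GIS

def p(d, h):
    """p(d|h) = exp(sum_i lambda_i f_i(h,d)) / sum_d' exp(sum_i lambda_i f_i(h,d'))."""
    def score(dd):
        return math.exp(sum(lam * f(h, dd) for lam, f in zip(LAMBDAS, FEATURES)))
    return score(d) / (score(0) + score(1))

history = {"V": "print", "N1": "file", "P": "on", "N2": "paper"}
print(p(1, history), p(0, history))   # probabilities of V- and N-attachment; they sum to 1
```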
<Paragraph position="1"> We restrict the questions in any Q_i to ask only about the following four head words:
1. Head Verb (V)
2. Head Noun (N1)
3. Head Preposition (P)
4. Head Noun of the Object of the Preposition (N2)
For example, questions on the history &quot;imposed a gradual ban on virtually all uses of asbestos&quot; can only ask about the following four words: imposed, ban, on, uses. The notion of a &quot;head&quot; word here corresponds loosely to the notion of a lexical head. We use a small set of rules, called a Tree Head Table, to obtain the head word of a constituent [12].</Paragraph> <Paragraph position="2"> We allow two types of binary-valued questions:
1. Questions about the presence of any n-gram ($n \le 4$) of the four head words, e.g., a bigram may be {V == &quot;is&quot;, P == &quot;of&quot;}. Features comprised solely of questions on words are denoted as &quot;word&quot; features.
2. Questions that involve the class membership of a head word. We use a binary hierarchy of classes derived by mutual information clustering, which we describe below. Given a binary class hierarchy, we can associate a bit string with every word in the vocabulary. Then, by querying the value of certain bit positions, we can construct binary questions. For example, we can ask about a bit position for any of the four head words, e.g., Bit 5 of Preposition == 1. We discuss below a richer set of these questions. Features comprised solely of questions about class bits are denoted as &quot;class&quot; features, and features containing questions about both class bits and words are denoted as &quot;mixed&quot; features.
Before discussing feature selection and construction, we give a brief overview of the mutual information clustering of words.</Paragraph> <Paragraph position="3"> Mutual Information Bits. Mutual information clustering, as described in [10], creates a class &quot;tree&quot; for a given vocabulary. Initially, we take the C most frequent words (usually 1000) and assign each one to its own class. We then take the (C + 1)st word, assign it to its own class, and merge the pair of classes that minimizes the loss of average mutual information. This repeats until all the words in the vocabulary have been exhausted. We then take our C classes, and use the same algorithm to merge classes that minimize the loss of mutual information, until one class remains. If we trace the order in which words and classes are merged, we can form a binary tree whose leaves consist of words and whose root is the class which spans the entire vocabulary. Consequently, we uniquely identify each word by its path from the root, which can be represented by a string of binary digits. If the path length of a word is less than the maximum depth, we pad the bottom of the path with 0's (dummy left branches), so that all words are represented by an equally long bitstring. &quot;Class&quot; features query the value of bits, and hence examine the path of the word in the mutual information tree.</Paragraph> <Paragraph position="4"> Special Features. In addition to the types of features described above, we employ two special features in the ME model: the Complement and the Null feature. The Complement, defined as
$$f_{comp}(h,d) \stackrel{\mathrm{def}}{=} \begin{cases} 1, & \text{iff } f_i(h,d) = 0\ \forall f_i \in \mathcal{M} \\ 0, & \text{otherwise,} \end{cases}$$
will fire on a pair (h, d) when no other f_i in the model applies. The Null feature is simply
$$f_{null}(h,d) \stackrel{\mathrm{def}}{=} \begin{cases} 1, & \text{iff } d = 0 \\ 0, & \text{otherwise,} \end{cases}$$
and causes the ME model to match the a priori probability of seeing an N-attachment.</Paragraph> <Section position="1" start_page="251" end_page="252" type="sub_section"> <SectionTitle> 3.1. Feature Search </SectionTitle> <Paragraph position="0"> The search problem here is to find an optimal set of features $\mathcal{M}$ for use in the ME model. We begin with a search space $\mathcal{P}$ of putative features, and use a feature ranking criterion which incrementally selects the features in $\mathcal{M}$, and also incrementally expands the search space $\mathcal{P}$.</Paragraph> <Paragraph position="1"> Initially $\mathcal{P}$ consists of all 1, 2, 3 and 4-gram word features of the four head words that occur in the training histories (with a certain frequency cut-off, usually 3 to 5), and all possible unigram class features.</Paragraph>
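As a concrete illustration of how this initial putative space could be enumerated, the following sketch builds the word n-gram questions and the unigram class-bit questions for a single history. The tuple encoding of questions and the value of m are assumptions for illustration, not the paper's data structures; the counts it prints match the figures given in the next paragraph (15 word features per history and 2m * 4 unigram class features).

```python
# Illustrative sketch: enumerating the initial putative feature space P
# from one training history. Question encodings and m are hypothetical.
from itertools import combinations

HEADS = ["V", "N1", "P", "N2"]
M_BITS = 3   # assumed number of class bits per word

def word_ngram_features(history):
    """All 1- to 4-gram word questions over the four head words: C(4,1)+C(4,2)+C(4,3)+C(4,4) = 15."""
    feats = []
    for n in range(1, 5):
        for combo in combinations(HEADS, n):
            feats.append(tuple((slot, "==", history[slot]) for slot in combo))
    return feats

def unigram_class_features():
    """For each head word and each bit position, two questions (bit == 0, bit == 1): 2m * 4 in total."""
    return [((slot, "bit", b, "==", v),) for slot in HEADS
            for b in range(1, M_BITS + 1) for v in (0, 1)]

history = {"V": "imposed", "N1": "ban", "P": "on", "N2": "uses"}
P = word_ngram_features(history) + unigram_class_features()
print(len(word_ngram_features(history)), len(unigram_class_features()))  # 15 and 2 * M_BITS * 4
```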
<Paragraph position="2"> We obtain $\sum_{k=1}^{4} \binom{4}{k} = 15$ word features from each training history and, assuming each word is assigned m bits, a total of 2m * 4 unigram class features; i.e., there are 2m features per word: Bit 1 of Verb == 0, Bit 1 of Verb == 1, ..., Bit m of Verb == 0, Bit m of Verb == 1.</Paragraph> <Paragraph position="3"> The feature search then proceeds as follows:
1. Initialize $\mathcal{P}$ as described above; initialize $\mathcal{M}$ to contain the Complement and Null features
2. Select the best feature from $\mathcal{P}$ using the Delta-Likelihood rank
3. Add it to $\mathcal{M}$
4. Train the Maximum Entropy model, using the features in $\mathcal{M}$
5. Grow $\mathcal{P}$ based on the last feature selected
6. Repeat from (2)
If we measure the training entropy and test entropy after the addition of each feature, the training entropy will monotonically decrease while the test entropy will eventually reach a minimum (due to overtraining). Test set performance usually peaks at the test entropy minimum (see Figs. 1 and 2).</Paragraph> <Paragraph position="4"> Delta-Likelihood. At step (2) in the search, we rank all features in $\mathcal{P}$ by estimating their potential contribution to the log-likelihood of the training set. Let q be the conditional probability distribution of the model with the features currently in $\mathcal{M}$. Then for each $f_i \in \mathcal{P}$, we compute, by estimating only the single new parameter $\hat{\lambda}_i$ associated with $f_i$, the probability distribution p that results when $f_i$ is added to the ME model:
$$p(d|h) = \frac{q(d|h)\, e^{\hat{\lambda}_i f_i(h,d)}}{\sum_{d'} q(d'|h)\, e^{\hat{\lambda}_i f_i(h,d')}}.$$
We then compute the increase in (log) likelihood with the new model:
$$\delta L_i = \sum_{h,d} \tilde{p}(h,d)\, \log p(d|h) \;-\; \sum_{h,d} \tilde{p}(h,d)\, \log q(d|h),$$
and choose the feature with the highest $\delta L_i$. Features redundant with, or correlated to, those features already in $\mathcal{M}$ will produce a zero or negligible $\delta L_i$, and will therefore be outranked by genuinely informative features. The chosen feature is added to $\mathcal{M}$ and used in the ME model.</Paragraph> </Section> <Section position="2" start_page="252" end_page="252" type="sub_section"> <SectionTitle> 3.2. Growth of Putative Feature Set </SectionTitle> <Paragraph position="0"> At step (5) in the search we expand the space $\mathcal{P}$ of putative features based on the feature last selected from $\mathcal{P}$ for addition to $\mathcal{M}$. Given an n-gram feature $f_i$ (i.e., of type &quot;word&quot;, &quot;class&quot; or &quot;mixed&quot;) that was last added to $\mathcal{M}$, we create 2m * 4 new (n+1)-gram features which ask questions about class bits in addition to the questions asked in $f_i$. E.g., let $f_i(h,d)$ constrain d = 0 and constrain h with the questions V == &quot;imposed&quot;, P == &quot;on&quot;. Then, given $f_i(h,d)$, the 2m new features generated for just the Head Noun each add one of the questions Bit 1 of Noun == 0, Bit 1 of Noun == 1, ..., Bit m of Noun == 0, Bit m of Noun == 1 to the questions already in $f_i$. We construct the remaining 6m features similarly from the remaining three head words. We skip the construction of features containing questions that are inconsistent or redundant with the word or class questions in $f_i$.</Paragraph> <Paragraph position="1"> The newly created features are then added to $\mathcal{P}$, and compete for selection in the next Delta-Likelihood ranking process. This method allows the introduction of complex features on word classes while keeping the search space manageable; $\mathcal{P}$ grows linearly with $\mathcal{M}$.</Paragraph> </Section> </Section> </Paper>
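The feature search and Delta-Likelihood ranking of Sections 3.1 and 3.2 can be summarized in the following sketch. This is not the authors' implementation: `train_model` stands in for GIS training over the currently selected features (with the Complement and Null features assumed to be handled there), `grow` stands in for the class-bit expansion of Section 3.2, and the coarse one-dimensional grid search over the new weight is only a stand-in for estimating the single parameter exactly.

```python
# Illustrative sketch of the greedy feature search: rank putative features by
# their estimated gain in training log-likelihood, add the best, retrain, grow P.
import math

def delta_likelihood(q, feat, data):
    """Approximate gain in log-likelihood from adding `feat` to the current model q(d, h),
    estimating only the single new weight by a crude 1-D grid search."""
    best_gain = 0.0
    for lam in (x / 4.0 for x in range(-20, 21)):
        gain = 0.0
        for h, d in data:
            num = q(d, h) * math.exp(lam * feat(h, d))
            denom = sum(q(dd, h) * math.exp(lam * feat(h, dd)) for dd in (0, 1))
            gain += math.log(num / denom) - math.log(q(d, h))
        best_gain = max(best_gain, gain)
    return best_gain

def feature_search(putative, data, train_model, grow, n_rounds=10):
    """Greedy selection loop corresponding to steps (1)-(6) of Section 3.1."""
    selected = []
    model = train_model(selected, data)                 # step 1: initial model
    for _ in range(n_rounds):
        if not putative:
            break
        best = max(putative, key=lambda f: delta_likelihood(model, f, data))  # step 2
        selected.append(best)                           # step 3: add it to M
        putative.remove(best)
        model = train_model(selected, data)             # step 4: retrain, e.g. by GIS
        putative.extend(grow(best))                     # step 5: expand P from the last feature
    return selected, model                              # step 6 is the loop itself
```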