<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2214"> <Title>General-to-Specific Model Selection for Subcategorization Preference*</Title>
<Section position="3" start_page="1314" end_page="1315" type="metho"> <SectionTitle> 2 A Model of Generating a Verb-Noun Collocation from Subcategorization Frame(s) </SectionTitle>
<Paragraph position="0"> This section introduces a model of generating a verb-noun collocation from subcategorization frame(s).</Paragraph>
<Section position="1" start_page="1314" end_page="1314" type="sub_section"> <SectionTitle> 2.1 Data Structure </SectionTitle>
<Paragraph position="0"> Verb-Noun Collocation: A verb-noun collocation is a data structure for the collocation of a verb and all of its argument/adjunct nouns. A verb-noun collocation e is represented by a feature structure which consists of the verb v and all the pairs of co-occurring case-markers p and thesaurus classes c of the case-marked nouns:</Paragraph>
<Paragraph position="1"> e = [pred: v, p1: c1, ..., pk: ck]   (1)</Paragraph>
<Paragraph position="2"> We assume that a thesaurus is a tree-structured type hierarchy in which each node represents a semantic class, and that each thesaurus class c1, ..., ck in a verb-noun collocation is a leaf class of the thesaurus. We also introduce ≤c as the superordinate-subordinate relation of classes in the thesaurus: c1 ≤c c2 means that c1 is subordinate to c2.¹</Paragraph>
<Paragraph position="3"> Subcategorization Frame: A subcategorization frame s is represented by a feature structure which consists of a verb v and the pairs of case-markers p and sense restrictions c of the case-marked argument/adjunct nouns:</Paragraph>
<Paragraph position="4"> s = [pred: v, p1: c1, ..., pl: cl]   (2)</Paragraph>
<Paragraph position="5"> The sense restrictions c1, ..., cl of the case-marked argument/adjunct nouns are represented by classes at arbitrary levels of the thesaurus.</Paragraph>
<Paragraph position="6"> Subsumption Relation: We introduce the subsumption relation ≤sf of a verb-noun collocation e and a subcategorization frame s: e ≤sf s iff, for each case-marker pi in s with noun class csi, the same case-marker pi exists in e and its noun class cei is subordinate to csi, i.e., cei ≤c csi. The subsumption relation ≤sf is also applicable as a subsumption relation between two subcategorization frames.</Paragraph>
<Paragraph position="7"> ¹ Although we ignore sense ambiguities of case-marked nouns in the definitions of this section, in the current implementation we deal with them by deciding that a class c is superordinate to an ambiguous leaf class cl if c is superordinate to at least one of the possible unambiguous classes of cl.</Paragraph> </Section>
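To make the data structures above concrete, here is a minimal Python sketch (not the authors' implementation), assuming a thesaurus given as a child-to-parent table. A verb-noun collocation and a (partial) subcategorization frame are both encoded as a verb plus a mapping from case-markers to classes, and subsumed_by implements the relation e ≤sf s. All names (PARENT, is_subordinate, subsumed_by, the c_* classes) are illustrative, not from the paper.

    # Hypothetical thesaurus fragment as a child -> parent table; leaf classes
    # for "kodomo" (child), "kouen" (park), and "juusu" (juice).
    PARENT = {
        "c_kodomo": "c_human", "c_human": "c_mammal", "c_mammal": "c_top",
        "c_kouen": "c_place", "c_place": "c_top",
        "c_juusu": "c_beverage", "c_beverage": "c_liquid", "c_liquid": "c_top",
    }

    def is_subordinate(c1, c2):
        """True iff class c1 is equal or subordinate to class c2 (c1 <=_c c2)."""
        while c1 is not None:
            if c1 == c2:
                return True
            c1 = PARENT.get(c1)
        return False

    # Both a verb-noun collocation e and a subcategorization frame s are
    # modelled as (verb, {case_marker: class}); e carries leaf classes only.
    def subsumed_by(e, s):
        """True iff e <=_sf s: every case of s occurs in e with a subordinate class."""
        verb_e, cases_e = e
        verb_s, cases_s = s
        if verb_e != verb_s:
            return False
        return all(p in cases_e and is_subordinate(cases_e[p], c)
                   for p, c in cases_s.items())

    e = ("nomu", {"ga": "c_kodomo", "de": "c_kouen", "wo": "c_juusu"})
    s = ("nomu", {"ga": "c_human", "wo": "c_beverage"})
    print(subsumed_by(e, s))   # True: e is subsumed by the partial frame s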
<Section position="2" start_page="1314" end_page="1314" type="sub_section"> <SectionTitle> 2.2 Generating a Verb-Noun Collocation from Subcategorization Frame(s) </SectionTitle>
<Paragraph position="0"> Suppose a verb-noun collocation e is given as:</Paragraph>
<Paragraph position="1"> e = [pred: v, p1: ce1, ..., pk: cek]   (3)</Paragraph>
<Paragraph position="2"> Then, let us consider a tuple (s1, ..., sn) of partial subcategorization frames which satisfies the following requirements: i) the unification s1 ∧ ... ∧ sn of all the partial subcategorization frames has exactly the same case-markers as e, as in (4); ii) each semantic class csi of a case-marked noun of the partial subcategorization frames is superordinate to the corresponding leaf semantic class cei of e, as in (5); and iii) no pair si and si' (i ≠ i') has common case-markers, as in (6):</Paragraph>
<Paragraph position="4"/>
<Paragraph position="5"> When a tuple (s1, ..., sn) satisfies the above three requirements, we assume that the tuple (s1, ..., sn) can generate the verb-noun collocation e, and denote this as:</Paragraph>
<Paragraph position="6"> (s1, ..., sn) ⟹ e   (7)</Paragraph>
<Paragraph position="7"> As we will describe in section 3.2, the partial subcategorization frames s1, ..., sn are regarded as events occurring independently of each other, and each of them is assigned an independent parameter.</Paragraph> </Section>
<Section position="3" start_page="1314" end_page="1315" type="sub_section"> <SectionTitle> 2.3 Example </SectionTitle>
<Paragraph position="0"> This section shows how we can incorporate case dependencies and noun class generalization into the model of generating a verb-noun collocation from a tuple of partial subcategorization frames.</Paragraph>
<Paragraph position="1"> The Ambiguity of Case Dependencies: The problem of the ambiguity of case dependencies is caused by the fact that, only by observing each verb-noun collocation in the corpus, it is not decidable which cases are dependent on each other and which cases are optional and independent of the other cases. Consider the following example: Example 1 Kodomo-ga kouen-de juusu-wo nomu. child-NOM park-at juice-ACC drink (A child drinks juice at the park.) The verb-noun collocation is represented as a feature structure e below:</Paragraph>
<Paragraph position="3"> where cc, cp, and cj represent the leaf classes (in the thesaurus) of the nouns "kodomo (child)", "kouen (park)", and "juusu (juice)".</Paragraph>
<Paragraph position="4"> Next, we assume that the concepts "human", "place", and "beverage" are superordinate to "kodomo (child)", "kouen (park)", and "juusu (juice)", respectively, and introduce the corresponding classes chum, cplc, and cbev as sense restrictions in subcategorization frames. Then, according to the dependencies of the cases, we can consider several patterns of subcategorization frames, each of which can generate the verb-noun collocation e.</Paragraph>
<Paragraph position="5"> If the three cases "ga (NOM)", "wo (ACC)", and "de (at)" are dependent on each other and it is not possible to find any division into several independent subcategorization frames, e can be regarded as generated from a single subcategorization frame containing all three cases:</Paragraph>
<Paragraph position="7"> Otherwise, if only the two cases "ga (NOM)" and "wo (ACC)" are dependent on each other and the "de (at)" case is independent of those two cases, e can be regarded as generated from the following two subcategorization frames independently:</Paragraph>
<Paragraph position="8"/>
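The generation relation of Section 2.2 can likewise be sketched as a direct check of requirements (4)-(6). The fragment below is a sketch under the same illustrative representation as before (the thesaurus table and is_subordinate are restated so that it runs on its own); splitting the three cases into a ga/wo frame and a de frame mirrors formula (10).

    # Restated helpers from the previous sketch (illustrative, not from the paper).
    PARENT = {
        "c_kodomo": "c_human", "c_human": "c_mammal", "c_mammal": "c_top",
        "c_kouen": "c_place", "c_place": "c_top",
        "c_juusu": "c_beverage", "c_beverage": "c_liquid", "c_liquid": "c_top",
    }

    def is_subordinate(c1, c2):
        while c1 is not None:
            if c1 == c2:
                return True
            c1 = PARENT.get(c1)
        return False

    def can_generate(frames, e):
        """True iff the tuple of partial frames satisfies requirements (4)-(6) for e."""
        verb_e, cases_e = e
        covered = set()
        for verb_s, cases_s in frames:
            if verb_s != verb_e:
                return False
            if covered & set(cases_s):          # (6) frames share no case-marker
                return False
            covered |= set(cases_s)
            for p, c in cases_s.items():        # (5) restriction is superordinate
                if p not in cases_e or not is_subordinate(cases_e[p], c):
                    return False
        return covered == set(cases_e)          # (4) unification covers exactly e's cases

    e  = ("nomu", {"ga": "c_kodomo", "de": "c_kouen", "wo": "c_juusu"})
    s1 = ("nomu", {"ga": "c_human", "wo": "c_beverage"})   # ga and wo treated as dependent
    s2 = ("nomu", {"de": "c_place"})                        # de treated as independent
    print(can_generate((s1, s2), e))   # True, cf. formula (10)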
<Paragraph position="9"> The Ambiguity of Noun Class Generalization: The problem of the ambiguity of noun class generalization is caused by the fact that, only by observing each verb-noun collocation in the corpus, it is not decidable which superordinate class generates each observed leaf class in the verb-noun collocation. Let us again consider Example 1. We assume that the concepts "mammal" and "liquid" are superordinate to "human" and "beverage", respectively, and introduce the corresponding classes cmam and cliq. If we additionally allow these superordinate classes as sense restrictions in subcategorization frames, we can consider several additional patterns of subcategorization frames which can generate the verb-noun collocation e.</Paragraph>
<Paragraph position="10"> Suppose that only the two cases "ga (NOM)" and "wo (ACC)" are dependent on each other and the "de (at)" case is independent of those two cases, as in formula (10). Since the leaf class cc ("child") can be generated from either chum or cmam, and likewise the leaf class cj ("juice") can be generated from either cbev or cliq, e can be regarded as generated according to any one of the four formulas (10)~(13):</Paragraph>
<Paragraph position="12"/> </Section> </Section>
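A small sketch of the generalization ambiguity just described: for each observed leaf class, enumerate its superordinate classes and pair them across cases. The toy thesaurus table is the illustrative one used in the earlier sketches; restricting the candidates to "human"/"mammal" and "beverage"/"liquid" reproduces the four ga/wo restrictions of formulas (10)-(13).

    from itertools import product

    PARENT = {
        "c_kodomo": "c_human", "c_human": "c_mammal", "c_mammal": "c_top",
        "c_juusu": "c_beverage", "c_beverage": "c_liquid", "c_liquid": "c_top",
    }

    def ancestors(leaf):
        """The leaf class followed by all of its superordinate classes."""
        out, c = [], leaf
        while c is not None:
            out.append(c)
            c = PARENT.get(c)
        return out

    # Candidate sense restrictions for "kodomo" and "juusu": the two classes
    # immediately above each leaf (human/mammal and beverage/liquid).
    ga_candidates = ancestors("c_kodomo")[1:3]
    wo_candidates = ancestors("c_juusu")[1:3]

    # Each pairing yields a distinct candidate ga/wo frame, i.e. the four
    # frames corresponding to formulas (10)-(13).
    for c_ga, c_wo in product(ga_candidates, wo_candidates):
        print({"ga": c_ga, "wo": c_wo})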
<Section position="4" start_page="1315" end_page="1318" type="metho"> <SectionTitle> 3 Maximum Entropy Modeling of Subcategorization Preference </SectionTitle>
<Paragraph position="0"> This section describes how we apply the maximum entropy modeling approach of Della Pietra et al. (1997) and Berger et al. (1996) to model learning of subcategorization preference.</Paragraph>
<Section position="2" start_page="1315" end_page="1316" type="sub_section"> <SectionTitle> 3.1 Maximum Entropy Modeling </SectionTitle>
<Paragraph position="0"> Given the training sample E of events (x, y), our task is to estimate the conditional probability p(y | x) that, given a context x, the process will output y. In order to express certain features of the whole event (x, y), a binary-valued indicator function, called a feature function, is introduced. Usually, we suppose that there exists a large collection F of candidate features, and we include in the model only a subset S of the full set of candidate features F. We call S the set of active features. Now, we assume that S contains n feature functions. For each feature fi (∈ S), the sets Vxi and Vyi indicate the sets of values of x and y for that feature. According to those sets, each feature function fi is defined as follows:</Paragraph>
<Paragraph position="1"> fi(x, y) = 1 if x ∈ Vxi and y ∈ Vyi, and fi(x, y) = 0 otherwise.</Paragraph>
<Paragraph position="2"> Then, in the maximum entropy modeling approach, the model with the maximum entropy is selected among the possible models. Under this constraint, the conditional probability of the output y given the context x is estimated as the following pλ(y | x) of the form of the exponential family, where a parameter λi is introduced for each feature fi:</Paragraph>
<Paragraph position="3"> pλ(y | x) = exp(Σi λi fi(x, y)) / Zλ(x), where Zλ(x) = Σy exp(Σi λi fi(x, y))</Paragraph>
<Paragraph position="4"> The parameter values λi are estimated by an algorithm called the Improved Iterative Scaling (IIS) algorithm.</Paragraph>
<Paragraph position="5"> Feature Selection by One-by-one Feature Adding: The feature selection process presented in Della Pietra et al. (1997) and Berger et al. (1996) is an incremental procedure that builds up S by successively adding features one by one. It starts with S empty and, at each step, selects the candidate feature which, when adjoined to the set of active features S, produces the greatest increase in the log-likelihood of the training sample.</Paragraph> </Section>
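The conditional exponential model above can be sketched as follows. The feature functions, weights, and the generator of candidate outputs are toy placeholders (assumptions of this sketch, not values from the paper), and parameter estimation by IIS is not shown.

    import math

    def p_cond(y, x, features, lambdas, candidates):
        """p_lambda(y | x) = exp(sum_i lambda_i f_i(x, y)) / Z_lambda(x)."""
        def score(y_):
            return math.exp(sum(lam * f(x, y_) for f, lam in zip(features, lambdas)))
        z = sum(score(y_) for y_ in candidates(x))   # normalizer Z_lambda(x)
        return score(y) / z

    # Toy usage: two binary features over (verb, nominal-part) events.
    f1 = lambda v, ep: 1.0 if ("ga" in ep and v == "nomu") else 0.0
    f2 = lambda v, ep: 1.0 if ("wo" in ep and v == "nomu") else 0.0
    outputs = lambda v: [frozenset({"ga"}), frozenset({"wo"}), frozenset({"ga", "wo"})]
    print(p_cond(frozenset({"ga", "wo"}), "nomu", [f1, f2], [0.5, 1.2], outputs))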
<Section position="3" start_page="1316" end_page="1317" type="sub_section"> <SectionTitle> 3.2 Modeling Subcategorization Preference </SectionTitle>
<Paragraph position="0"> Events: In our task of model learning of subcategorization preference, each event (x, y) in the training sample is a verb-noun collocation e, as defined in formula (1). A verb-noun collocation e can be divided into two parts: one is the verbal part ev containing the verb v, while the other is the nominal part ep containing all the pairs of case-markers p and thesaurus leaf classes.</Paragraph>
<Paragraph position="2"> Then, we define the context x of an event (x, y) as the verb v and the output y as the nominal part ep of e, so that each event in the training sample is denoted as (v, ep).</Paragraph>
<Paragraph position="4"> Features: We represent each partial subcategorization frame as a feature in the maximum entropy modeling. According to the possible variations of case dependencies and noun class generalization, we consider every possible pattern of subcategorization frames which can generate a verb-noun collocation, and then construct the full set F of candidate features. Next, for a given verb-noun collocation e, the tuples of partial subcategorization frames which can generate e are collected into the set SF(e). Then, for each partial subcategorization frame s, a binary-valued feature function fs(v, ep) is defined to be true if and only if at least one element of the set SF(e) is a tuple (s1, ..., s, ..., sn) that contains s.</Paragraph>
<Paragraph position="6"> In the maximum entropy modeling approach, each feature is assigned an independent parameter, i.e., each (partial) subcategorization frame is assigned an independent parameter.</Paragraph>
<Paragraph position="7"> Parameter Estimation: Suppose that the set S (⊆ F) of active features has been found by the procedure of the next section. Then, the parameters of the subcategorization frames are estimated according to the IIS algorithm, and the conditional probability distribution pS(ep | v) is given as:</Paragraph>
<Paragraph position="9"/>
<Paragraph position="10"> This section describes the new feature selection algorithm, which utilizes the subsumption relation of subcategorization frames. It starts from the most general model, i.e., a model with no case dependencies and with the most general sense restrictions, which correspond to the highest classes in the thesaurus. This starting model has high coverage of the test data. Then, the algorithm gradually examines more specific models, with case dependencies as well as more specific sense restrictions which correspond to lower classes in the thesaurus. The model search process is guided by a model evaluation criterion.</Paragraph> </Section>
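Before turning to the feature space and its search, the following sketch shows one way to realize the binary feature functions fs(v, ep) of Section 3.2. As a simplification (an assumption of this sketch, not a claim about the paper), membership of s in some generating tuple of SF(e) is approximated by the partial-frame match of Section 2.1: the feature fires when every case mentioned in s occurs in e with a subordinate class. The is_subordinate test is passed in, e.g. the helper from the first sketch.

    def make_feature(s, is_subordinate):
        """Build the binary feature f_s over events (v, e_p) for a partial frame s."""
        verb_s, cases_s = s
        def f_s(v, e_p):
            # e_p maps case-markers to the observed thesaurus leaf classes.
            return 1.0 if (v == verb_s and
                           all(p in e_p and is_subordinate(e_p[p], c)
                               for p, c in cases_s.items())) else 0.0
        return f_s

    # Usage (with the helpers of the first sketch):
    #   f = make_feature(("nomu", {"ga": "c_human"}), is_subordinate)
    #   f("nomu", {"ga": "c_kodomo", "wo": "c_juusu"})   -> 1.0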
<Section position="4" start_page="1317" end_page="1317" type="sub_section"> <SectionTitle> 4.1 Partially-Ordered Feature Space </SectionTitle>
<Paragraph position="0"> In section 2.1, we introduced the subsumption relation ≤sf of two subcategorization frames. All the subcategorization frames are partially ordered according to this subsumption relation, and the elements of the set F of candidate features constitute a partially ordered feature space.</Paragraph>
<Paragraph position="1"> Constraint on Active Feature Set: Throughout the feature selection process, we put the following constraint on the active feature set S. Case Covering Constraint: for each verb-noun collocation e in the training set E, each case p (and the leaf class marked by p) of e has to be covered by at least one feature in S.</Paragraph>
<Paragraph position="2"> Initial Active Feature Set: The initial set S0 of active features is constructed by collecting the features which are not subsumed by any other candidate feature in F: S0 = { fs | ∀ fs' (≠ fs) ∈ F, s ≰sf s' }   (16). This definition of the initial active feature set means that each feature in S0 has only one case, and the sense restriction of that case is (one of) the most general class(es).</Paragraph>
<Paragraph position="3"> Candidate Non-active Features for Replacement: At each step of feature selection, one of the active features is replaced with several non-active features. Let G be the set of non-active features which have never been active until that step. Then, for each active feature fs (∈ S), the set Dfs (⊆ G) of candidate non-active features with which fs is replaced has to satisfy the following two requirements.²,³ 1. Subsumption with s: for each element fs' of Dfs, s' has to be subsumed by s. 2. Upper bound of G: for each element fs' of Dfs and each element ft of G, t does not subsume s', i.e., Dfs is a subset of the upper bound of G with respect to the subsumption relation ≤sf. Among all the possible replacements, the most appropriate one is selected according to a model evaluation criterion.</Paragraph> </Section>
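The replacement sets Dfs of the two requirements above can be sketched as below. Here frame_subsumes(t, s) stands for the ≤sf test between two frames and is passed in rather than defined (it could be built on the subsumed_by helper of the first sketch); the case covering constraint, which may further restrict the subset actually chosen, is not enforced in this fragment.

    def replacement_candidates(f_s, G, frame_subsumes):
        """Candidate non-active features that may replace the active feature f_s.

        Requirement 1: each candidate must be subsumed by f_s (more specific).
        Requirement 2: each candidate must lie in the upper bound of G, i.e. no
        other frame in G strictly subsumes it.
        """
        return [t for t in G
                if frame_subsumes(t, f_s) and t != f_s
                and not any(frame_subsumes(t, u) and not frame_subsumes(u, t)
                            for u in G)]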
<Section position="5" start_page="1317" end_page="1317" type="sub_section"> <SectionTitle> 4.2 Model Evaluation Criterion </SectionTitle>
<Paragraph position="0"> As the model evaluation criterion during feature selection, we consider the following two types.</Paragraph>
<Paragraph position="1"> The MDL (Minimum Description Length) principle (Rissanen, 1984) is a model selection criterion. It is designed so as to "select the model that has as much fit to the given data as possible and that is as simple as possible." The MDL principle selects the model that minimizes the following description length l(M, D) of the probability model M for the data D:</Paragraph>
<Paragraph position="2"> l(M, D) = − log L_M(D) + (N_M / 2) log |D|   (17)</Paragraph>
<Paragraph position="3"> where log L_M(D) is the log-likelihood of the model M on the data D, N_M is the number of parameters in the model M, and |D| is the size of the data D.</Paragraph> </Section>
<Section position="6" start_page="1317" end_page="1317" type="sub_section"> <SectionTitle> Description Length of Subcategorization Preference Model </SectionTitle>
<Paragraph position="0"/>
<Paragraph position="2"> ² The general-to-specific feature selection considers only a small portion of the non-active features as the next candidates for the active features, while the feature selection by one-by-one feature adding considers all the non-active features as the next candidates. Thus, in terms of efficiency, the general-to-specific feature selection has an advantage over the one-by-one feature adding algorithm, especially when the number of candidate features is large.</Paragraph>
<Paragraph position="3"> ³ As long as the case covering constraint is satisfied, the set Dfs of candidate non-active features with which fs is replaced may be the empty set ∅.</Paragraph>
<Paragraph position="4"> ⁴ More precisely, we slightly modify the probability model pS by multiplying in the probability of generating the verb-noun collocation e from the (partial) subcategorization frames that correspond to the active features evaluating to true for e, and then apply the MDL principle to this modified model. The probability of generating a verb-noun collocation from (partial) subcategorization frames is simply estimated as the product of the probabilities of generating each leaf class in the verb-noun collocation from the corresponding superordinate class in the subcategorization frame. With this generation probability, the more general the sense restrictions of the subcategorization frames are, the less fit the model has to the data, and the greater the data description length (the first term of (18)) of the model is. Thus, this modification makes the feature selection process more sensitive to the sense restrictions of the model.</Paragraph>
<Paragraph position="5"> The other type of model evaluation criterion is the performance on the subcategorization preference test presented in Utsuro and Matsumoto (1997), in which the goodness of the model is measured according to how many of the positive examples can be judged as more appropriate than the negative examples. This subcategorization preference test can be regarded as modeling the subcategorization ambiguity of an argument noun in a Japanese sentence with more than one verb, like the one in Example 2. Example 2 TV-de mouketa shounin-wo mita TV-by/on earn-money merchant-ACC see (If the phrase "TV-de" (by/on TV) modifies the verb "mouketa" (earn money), the sentence means that "(somebody) saw a merchant who earned money by (selling) TVs." On the other hand, if the phrase "TV-de" (by/on TV) modifies the verb "mita" (see), the sentence means that "on TV, (somebody) saw a merchant who earned money.") Negative examples are artificially generated from the positive examples by choosing a case element in a positive example of one verb at random and moving it to a positive example of another verb.</Paragraph>
<Paragraph position="6"> Compared with the calculation of the description length l(pS, E) in (18), the calculation of the accuracy of the subcategorization preference test requires the comparison of probability values for a sufficient number of positive and negative data, and its computational cost is much higher than that of calculating the description length. Therefore, at present, we employ the description length l(pS, E) in (18) as the model evaluation criterion during the general-to-specific feature selection procedure, which we describe in detail in the next section. After obtaining a sequence of active feature sets (i.e., subcategorization preference models) which are totally ordered from general to specific, we select an optimal subcategorization preference model according to the accuracy of the subcategorization preference test, as we will describe in section 4.4.</Paragraph> </Section>
<Section position="7" start_page="1317" end_page="1318" type="sub_section"> <SectionTitle> 4.3 Feature Selection Algorithm </SectionTitle>
<Paragraph position="0"> The following gives the details of the general-to-specific feature selection algorithm, where the description length l(pS, E) in (18) is employed as the model evaluation criterion:⁵ General-to-Specific Feature Selection. Input: Training data set E; collection F of candidate features. Output: Set S of active features; model pS incorporating these features. 1. Start with S = S0 of the definition (16) and with G = F − S0. 2. Do for each active feature f ∈ S and every possible replacement Df ⊆ G: compute the model p_{S ∪ Df − {f}} using the IIS algorithm, and compute the decrease in the description length of (18).</Paragraph>
<Paragraph position="3"> 3. Check the termination condition.⁶ 4. Select the feature f^ and its replacement Df^ with the maximum decrease in the description length. 5. S ← S ∪ Df^ − {f^}, G ← G − Df^. 6. Compute pS using the IIS algorithm. 7. Go to step 2.</Paragraph> </Section>
<Section position="8" start_page="1318" end_page="1318" type="sub_section"> <SectionTitle> 4.4 Selecting a Model with Approximately Optimal Subcategorization Preference Accuracy </SectionTitle>
<Paragraph position="0"> Suppose that we are constructing subcategorization preference models for the verbs v1, ..., vm. By the general-to-specific feature selection algorithm of the previous section, for each verb vi a totally ordered sequence of ni active feature sets Si0, ..., Sini (i.e., subcategorization preference models) is obtained from the training sample E. Then, using another training sample E', which is different from E and consists of positive as well as negative data, a model with optimal subcategorization preference accuracy is approximately selected by the following procedure. Let T1, ..., Tm denote the current sets of active features for the verbs v1, ..., vm, respectively: 1. Initially, for each verb vi, set Ti to the most general set Si0 of the sequence Si0, ..., Sini.</Paragraph>
<Paragraph position="1"> 2. For each verb vi, search the sequence Si0, ..., Sini for an active feature set which gives the maximum subcategorization preference accuracy on E', and set Ti to it.</Paragraph>
<Paragraph position="2"> 3. Repeat the same procedure as 2.</Paragraph>
<Paragraph position="3"> 4. Return the current sets T1, ..., Tm as the approximately optimal active feature sets Ŝ1, ..., Ŝm for the verbs v1, ..., vm, respectively.</Paragraph>
<Paragraph position="4"> ⁵ Note that this feature selection algorithm is a hill-climbing one, and the model selected here may have a description length greater than the global minimum. ⁶ In the present implementation, the feature selection process is terminated after the description length of the model stops decreasing and a certain number of active features have then been replaced.</Paragraph> </Section> </Section>
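How the MDL criterion of Section 4.2 drives the search of Section 4.3 can be sketched as a greedy loop. This is a schematic reading, not the authors' code: fit, loglik, and candidates stand for IIS fitting, the model log-likelihood, and the Section 4.1 replacement sets; each active feature is replaced by its whole candidate set rather than by every possible subset Df; and the termination test is simplified to stopping when no replacement decreases the description length (cf. footnote 6). The yielded models form the general-to-specific sequence that the procedure of Section 4.4 then searches for the best preference accuracy on E'.

    import math

    def description_length(log_likelihood, n_params, data_size):
        """Two-part MDL code of (17): -log L_M(D) + (N_M / 2) * log |D|."""
        return -log_likelihood + 0.5 * n_params * math.log(data_size)

    def general_to_specific(S0, G, data, fit, loglik, candidates, max_steps=100):
        """Greedy general-to-specific feature selection (simplified sketch)."""
        S = set(S0)
        model = fit(S, data)
        best_dl = description_length(loglik(model, data), len(S), len(data))
        for _ in range(max_steps):
            best = None
            for f in list(S):
                D = set(candidates(f, G))
                if not D:
                    continue
                new_S = (S - {f}) | D
                new_model = fit(new_S, data)
                dl = description_length(loglik(new_model, data), len(new_S), len(data))
                if dl < best_dl and (best is None or dl < best[0]):
                    best = (dl, f, D, new_S, new_model)
            if best is None:              # no replacement lowers the description length
                break
            best_dl, f, D, S, model = best
            G = set(G) - D
            yield S, model, best_dl       # one model of the sequence S_0, S_1, ...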
<Section position="5" start_page="1318" end_page="1318" type="metho"> <SectionTitle> 5 Experiment and Evaluation </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="1318" end_page="1318" type="sub_section"> <SectionTitle> 5.1 Corpus and Thesaurus </SectionTitle>
<Paragraph position="0"> As the training and test corpus, we used the EDR Japanese bracketed corpus (EDR, 1995), which contains about 210,000 sentences collected from newspaper and magazine articles. We used 'Bunrui Goi Hyou' (BGH) (NLRI, 1993) as the Japanese thesaurus. BGH has a seven-layered abstraction hierarchy; more than 60,000 words are assigned to its leaves, and its nominal part contains about 45,000 words.</Paragraph> </Section>
<Section position="2" start_page="1318" end_page="1318" type="sub_section"> <SectionTitle> 5.2 Training/Test Events and Features </SectionTitle>
<Paragraph position="0"> We conduct the model learning experiment under the following conditions: i) the noun class generalization level of each feature is limited to above level 5 from the root node in the thesaurus; ii) since verbs are independent of each other in our model learning framework, we collect the verb-noun collocations of one verb into a training data set and conduct the model learning procedure for each verb separately.</Paragraph>
<Paragraph position="1"> For the experiment, seven Japanese verbs are selected so that the difficulty of the subcategorization preference test is balanced among the verb pairs. The number of training events for each verb varies from about 300 to 400, while the number of candidate features for each verb varies from 200 to 1,350. From this data, we construct the following three types of data set, no two of which share any element: i) the training data E, which consists of positive data only and is used for selecting a sequence of active feature sets by the general-to-specific feature selection algorithm of section 4.3; ii) the training data E', which consists of positive and negative data and is used in the procedure of section 4.4; and iii) the test data Ets, which consists of positive and negative data and is used for evaluating the selected models in terms of their performance on the subcategorization preference test. The sizes of the data sets E, E', and Ets are 2,333, 2,100, and 2,100, respectively.</Paragraph> </Section> </Section> </Paper>