<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2407"> <Title>Extending corpus-based identification of light verb constructions using a supervised learning framework</Title> <Section position="4" start_page="50" end_page="52" type="metho"> <SectionTitle> 3 Framework and Features </SectionTitle> <Paragraph position="0"> Previous work has shown that different measures based on corpus statistics can assist in LVC detection. However, it is not clear to what degree these different measures overlap and can be used to reinforce each other's results. We address this problem by viewing LVC detection as a supervised classification problem. Such a framework can integrate the various measures and enables us to test their combinations in a generic manner. Specifically, each verb-object pair constitutes an individual classification instance, which possesses a set of features f1,...,fn and is assigned a class label from the binary classification of {LVC, !LVC}.</Paragraph> <Paragraph position="1"> In such a machine learning framework, each of the aforementioned metrics is a separate feature.</Paragraph> <Paragraph position="2"> In our work, we have examined three different sets of features for LVC classification: (1) base, (2) extended and (3) new features. We start by deriving three base features from key LVC detection measures described in previous work - GT95, DJ96 and SFN04. As suggested in the previous section, we can make alternate formulations of the past work, such as discarding a pre-filtering step (i.e. filtering out constructions that do not include the top three most frequent prepositions). These measures make up the extended feature set. The third set consists of new features that have not been used for LVC identification before. These include features that further model the influence of context (e.g. prepositions after the object) in LVC detection. 
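As a concrete illustration of the framework, each verb-object pair can be encoded as a feature vector with a binary label and handed to any supervised learner. The sketch below is an assumption, not the authors' actual pipeline: the feature values and the nearest-centroid classifier are illustrative stand-ins chosen to keep the example dependency-free.

```python
# Minimal sketch of the classification framework: each verb-object pair
# becomes one instance with numeric features and a binary LVC label.
# Feature values and the classifier choice are illustrative assumptions.

def make_instance(gt_score, dj_score, sfn_score, label):
    """One verb-object pair as (feature vector, class label)."""
    return [gt_score, dj_score, sfn_score], label

# Toy training data: label 1 = LVC, 0 = !LVC (values are made up).
data = [
    make_instance(2.1, 0.8, 1.5, 1),   # e.g. "take walk"
    make_instance(0.2, 0.1, 0.3, 0),   # e.g. "buy share"
]
X = [features for features, _ in data]
y = [label for _, label in data]

# Any supervised learner fits this slot; a nearest-centroid rule keeps
# the sketch self-contained.
def train_centroids(X, y):
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in by_class.items()
    }

def predict(centroids, features):
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))

centroids = train_centroids(X, y)
print(predict(centroids, [1.9, 0.7, 1.4]))  # near the LVC centroid: 1
```

The point of the framework is exactly this separation: each corpus-statistic measure contributes one coordinate of the feature vector, and the learner decides how to weight and combine them.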
</Paragraph> <Section position="1" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 3.1 Base Features </SectionTitle> <Paragraph position="0"> These features are based on the original previous work discussed in Section 2, but have been adapted to give a numeric score. We use the initials of the original authors, without the year of publication, to denote our derived base features.</Paragraph> <Paragraph position="1"> Recall that the aim of the original GT95 and DJ96 formulae is to rank the possible support verbs given a deverbal noun. As each of these formulae contains a function that returns a numeric score inside the argmaxv, we use these functions as two of our base features:</Paragraph> <Paragraph position="3"> The SFN04 measure can be used without modification as our third base feature, and it will be referred to as SFN for the remainder of this paper.</Paragraph> </Section> <Section position="2" start_page="50" end_page="51" type="sub_section"> <SectionTitle> 3.2 Extended Features </SectionTitle> <Paragraph position="0"> Since Grefenstette and Teufel indicated that the filtering step might not be necessary, i.e., f(v,n) may be used instead of g(v,n), we also have the following extended feature:</Paragraph> <Paragraph position="2"> In addition, we experiment with the reverse process for the DJ feature, i.e., replacing f(v,n) in the function for DJ with g(v,n), yielding the following extended feature:</Paragraph> <Paragraph position="4"> In Grefenstette and Teufel's experiments, they used the top three prepositions for filtering. We further experiment with using all possible prepositions.</Paragraph> </Section> <Section position="3" start_page="51" end_page="52" type="sub_section"> <SectionTitle> 3.3 New Features </SectionTitle> <Paragraph position="0"> In our new feature set, we introduce features that we feel better model the v and n components as well as their joint occurrences v-n. 
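Before turning to the individual new features, the f(v,n) versus g(v,n) distinction used in the base and extended features above can be made concrete. The sketch below is a hedged illustration: the occurrence representation and the exact filtering mechanics are assumptions, grounded only in the stated idea that g(v,n) restricts counts to constructions involving the top three most frequent prepositions, while f(v,n) is the unfiltered pair count.

```python
from collections import Counter

# f(v, n): raw frequency of the verb-object pair.
# g(v, n): the same count restricted to occurrences whose construction
# uses one of the top-3 most frequent prepositions (the pre-filtering
# step). Passing allowed=None drops the filter, as the extended
# features do. The data layout here is an illustrative assumption.

# Each occurrence: (verb, noun, preposition after the object, or None).
occurrences = [
    ("take", "walk", "in"), ("take", "walk", "in"), ("take", "walk", "to"),
    ("make", "offer", "of"), ("make", "offer", "of"), ("make", "offer", "of"),
    ("give", "answer", "for"), ("give", "answer", "for"),
]

def top_prepositions(occs, k=3):
    counts = Counter(p for _, _, p in occs if p is not None)
    return {p for p, _ in counts.most_common(k)}

def f(occs, v, n):
    return sum(1 for vv, nn, _ in occs if (vv, nn) == (v, n))

def g(occs, v, n, allowed=None):
    if allowed is None:          # unfiltered: reduces to f(v, n)
        return f(occs, v, n)
    return sum(1 for vv, nn, p in occs
               if (vv, nn) == (v, n) and p in allowed)

top3 = top_prepositions(occurrences)   # here: {"of", "in", "for"}
print(f(occurrences, "take", "walk"))          # 3 (all occurrences)
print(g(occurrences, "take", "walk", top3))    # 2 ("to" is filtered out)
```

Swapping f for g (or vice versa) inside a measure, as the extended features do, then amounts to toggling the `allowed` argument.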
We also introduce features that model the v-n pair's context, in terms of deverbal counts, derived from our understanding of LVCs.</Paragraph> <Paragraph position="1"> Most of the new features we propose are not good measures for LVC detection by themselves.</Paragraph> <Paragraph position="2"> However, the additional evidence that they provide can be combined with the base features to create a better composite classification system.</Paragraph> <Paragraph position="3"> Mutual information: We observe that a verb v and a deverbal noun n are more likely to appear in verb-object pairs if they can form a LVC. To capture this evidence, we employ mutual information to measure the co-occurrence of a verb and a noun in verb-object pairs. Formally, the mutual information between a verb v and a deverbal noun n is defined as</Paragraph> <Paragraph position="5"> where P(v,n) denotes the probability of v and n forming a verb-object pair, and P(v) and P(n) denote the probabilities of occurrence of v and n respectively. Let f(v,n) be the frequency of occurrence of the verb-object pair v-n and N be the number of all verb-object pairs in the corpus. We can estimate the above probabilities using their maximum likelihood estimates.</Paragraph> <Paragraph position="7"> Mutual information only captures local information about the co-occurrence of v and n. It does not capture the global frequency of the verb-object pair v-n, which Dras and Johnson (1996) demonstrated to be effective. As such, we need to combine the local mutual information with the global frequency of the verb-object pair. We thus create the following feature, where the log function is used to smooth frequencies:</Paragraph> <Paragraph position="9"> Deverbal counts: Suppose a verb-object pair v-n is a LVC; its object n should then be a deverbal noun. We denote by vprime the verbalized form of n. We thus expect that v-n should express the same semantic meaning as that of vprime. 
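The mutual-information feature defined above can be sketched directly from the maximum likelihood estimates. Note one loud assumption: the MI-LOGFREQ combination shown (MI weighted by the log-smoothed global pair frequency) is a plausible reading of the formula elided from this extraction, not a confirmed reproduction of it.

```python
import math
from collections import Counter

# MI(v, n) = log( P(v, n) / (P(v) * P(n)) ), with probabilities
# estimated by maximum likelihood from verb-object pair counts.
# The toy counts below are made up for illustration.

pairs = [("take", "walk")] * 6 + [("take", "nap")] * 2 + [("buy", "share")] * 4
N = len(pairs)
pair_freq = Counter(pairs)
verb_freq = Counter(v for v, _ in pairs)
noun_freq = Counter(n for _, n in pairs)

def mi(v, n):
    p_vn = pair_freq[(v, n)] / N       # P(v, n) = f(v, n) / N
    p_v = verb_freq[v] / N             # P(v)
    p_n = noun_freq[n] / N             # P(n)
    return math.log(p_vn / (p_v * p_n))

def mi_logfreq(v, n):
    # Assumed combination: local MI weighted by the log-smoothed
    # global pair frequency, as the text suggests.
    return mi(v, n) * math.log(pair_freq[(v, n)])

print(round(mi("take", "walk"), 3))    # log(1.5) with these toy counts
```

Here "take walk" and "take nap" have the same local MI, but MI-LOGFREQ ranks "take walk" higher because its global pair frequency is larger, which is exactly the global evidence the combined feature is meant to add.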
However, verb-object pairs such as &quot;have time&quot; and &quot;have right&quot; in English are scored high by the DJ and MI-LOGFREQ measures, even though the verbalized forms of their objects, i.e., &quot;time&quot; and &quot;right&quot;, do not express the same meaning as the verb-object pairs do. This corroborates Grefenstette and Teufel's claim that if a verb-object pair v-n is a LVC, then n should share similar properties with vprime. Based on our empirical analysis of the corpus using a small subset of LVCs, we believe that: 1. The frequencies of n and vprime should not differ very much, and 2. Both frequencies are high, given that LVCs occur frequently in text.</Paragraph> <Paragraph position="10"> The first observation holds in our corpus, where the light verb and verbalized forms are freely interchangeable in context. Let us denote the frequencies of n and vprime by f(n) and f(vprime) respectively. We devise a novel feature based on these hypotheses:</Paragraph> <Paragraph position="12"> where the two terms correspond to the above two hypotheses respectively. A higher score from this metric indicates a higher likelihood of the compound being a LVC.</Paragraph> <Paragraph position="13"> Light verb classes: Linguistic studies of light verbs have indicated that verbs of specific semantic character are much more likely to participate in LVCs (Wang, 2004; Miyamoto, 2000; Butt, 2003; Bjerre, 1999). Such characteristics have been shown to hold across languages and include verbs that indicate (change of) possession (Danish give, to give), direction (Chinese guan diao, to switch off), aspect and causation, or that are thematically incomplete (Japanese suru, to do). As such, it makes sense to have a list of verbs that are often used lightly. 
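One way to turn the two deverbal-count hypotheses above into a single numeric feature is sketched below. The actual two-term formula is elided from this extraction, so the concrete terms chosen here (a min/max ratio for hypothesis 1 and a log-frequency term for hypothesis 2) are an illustrative assumption only.

```python
import math

# Hypothetical deverbal-counts feature, one term per hypothesis:
#  1. f(n) and f(v') should not differ much  -> min/max ratio in [0, 1]
#  2. both frequencies should be high        -> log of the smaller count
# The paper's exact formula is not reproduced; this is a sketch.

def deverbal_feature(f_n, f_vprime):
    if f_n == 0 or f_vprime == 0:
        return 0.0
    similarity = min(f_n, f_vprime) / max(f_n, f_vprime)   # hypothesis 1
    magnitude = math.log(min(f_n, f_vprime))               # hypothesis 2
    return similarity * magnitude

# A frequent, balanced noun/verbalized-form pair scores higher than an
# unbalanced one (e.g. a noun like "time" whose verbalized form is rare).
print(deverbal_feature(800, 600) > deverbal_feature(900, 15))  # True
```

Any scoring function with this shape preserves the stated property that a higher score indicates a higher likelihood of the pair being a LVC.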
In our work, we have predefined a light verb list for our English experiments, consisting of the following seven verbs: &quot;do&quot;, &quot;get&quot;, &quot;give&quot;, &quot;have&quot;, &quot;make&quot;, &quot;put&quot; and &quot;take&quot;, all of which have been studied as light verbs in the literature.</Paragraph> <Paragraph position="14"> We thus define a feature that considers the verb in the verb-object pair: if the verb is in the predefined light verb list, the feature value is the verb itself; otherwise, the feature value is a default value.</Paragraph> <Paragraph position="15"> One may ask whether this feature is necessary, given the various features used to measure the frequency of the verb. As all of the other metrics are corpus-based, they rely on the corpus being a representative sample of the source language. Since we extract the verb-object pairs from the Wall Street Journal section of the Penn Treebank, terms like &quot;buy&quot;, &quot;sell&quot;, &quot;buy share&quot; and &quot;sell share&quot; occur so frequently in the corpus that verb-object pairs such as &quot;buy share&quot; and &quot;sell share&quot; are ranked high by most of the measures. However, &quot;buy&quot; and &quot;sell&quot; are not considered light verbs. In addition, the various light verbs have different behaviors.</Paragraph> <Paragraph position="16"> Despite their lightness, different light verbs combined with the same noun complement often give different semantics, and hence affect the lightness of the verb-object pair. For example, one may say that &quot;make copy&quot; is lighter than &quot;put copy&quot;. Incorporating this small amount of linguistic knowledge into our corpus-based framework can enhance performance. 
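The light-verb-list feature described above reduces to a simple categorical lookup. The seven verbs come from the text; the name of the default value is an assumed placeholder.

```python
# Categorical feature: the verb itself when it is on the predefined
# light verb list, otherwise a single default value ("OTHER" is an
# assumed placeholder name, not taken from the paper).

LIGHT_VERBS = {"do", "get", "give", "have", "make", "put", "take"}

def light_verb_feature(verb):
    return verb if verb in LIGHT_VERBS else "OTHER"

print(light_verb_feature("make"))  # "make"
print(light_verb_feature("buy"))   # "OTHER": frequent in the WSJ, but
                                   # not a light verb
```

Keeping the verb identity (rather than a boolean) lets the classifier learn that, e.g., "make" and "put" behave differently with the same noun complement, as the "make copy" versus "put copy" example illustrates.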
Other features: In addition to the above features, we also used the following features: the determiner before the object, the adjective before the object, the identity of any preposition immediately following the object, the length of the noun object (if a phrase) and the number of words between the verb and its object. These features did not improve performance significantly, so we omit a detailed description of them.</Paragraph> </Section> </Section> </Paper>