<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2407">
  <Title>Extending corpus-based identification of light verb constructions using a supervised learning framework</Title>
  <Section position="5" start_page="52" end_page="54" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> In this section, we report the details of our experimental settings and results. First, we show how we constructed our labeled LVC corpus, used as the gold standard in both training and testing under cross validation. Second, we describe the evaluation setup and discuss the experimental results obtained based on the labeled data.</Paragraph>
    <Section position="1" start_page="52" end_page="52" type="sub_section">
      <SectionTitle>
4.1 Data Preparation
</SectionTitle>
      <Paragraph position="0"> Some of the features rely on a correct sentence parse. In order to minimize this source of error, we employ the Wall Street Journal section in the Penn Treebank, which has been manually parsed by linguists. We extract verb-object pairs from the Penn Treebank corpus and lemmatize them using WordNet's morphology module. As a filter, we require that a pair's object be a deverbal noun to be considered as a LVC. Specifically, we use Word-Net to check whether a noun has a verb as one of its derivationally-related forms. A total of 24,647 candidate verb-object pairs are extracted, of which 15,707 are unique.</Paragraph>
      <Paragraph position="1"> As the resulting dataset is too large for complete manual annotation given our resources, we sample the verb-object pairs from the extracted set.</Paragraph>
      <Paragraph position="2"> As most verb-object pairs are not LVCs, random sampling would provide very few positive LVC instances, and thus would adversely affect the training of the classifier due to sparse data. Our aim in the sampling is to have balanced numbers of potential positive and negative instances. Based on the 24,647 verb-object pairs, we count the corpus frequencies of each verb v and each object n, denoted as f(v) and f(n). We also calculate the DJ score of the verb-object pair DJ(v,n) by counting the pair frequencies. The data set is divided into 5 bins using f(v) on a linear scale, 5 bins using f(n) on a linear scale and 4 bins using DJ(v,n) on a logarithmic scale.1 We cross-multiply these three factors to generate 5 x 5 x 4 = 100 bins.</Paragraph>
      <Paragraph position="3"> Finally, we uniformly sampled 2,840 verb-object pairs from all the bins to construct the data set for labeling.</Paragraph>
    </Section>
    <Section position="2" start_page="52" end_page="53" type="sub_section">
      <SectionTitle>
4.2 Annotation
</SectionTitle>
      <Paragraph position="0"> As noted by many linguistic studies, the verb in a LVC is often not completely vacuous, as they can serve to emphasize the proposition's aspect, its argument's semantics (cf., th roles) (Miyamoto, 2000), or other function (Butt and Geuder, 2001).</Paragraph>
      <Paragraph position="1"> As such, previous computational research had proposed that the &amp;quot;lightness&amp;quot; of a LVC might be best modeled as a continuum as opposed to a binary class (Stevenson et al., 2004). We have thus annotated for two levels of lightness in our annotation of the verb-object pairs. Since the purpose of the work reported here is to flag all such constructions, we have simplified our task to a binary decision, similar to most other previous corpus-based work.</Paragraph>
      <Paragraph position="2"> A website was set up for the annotation task, so that annotators can participate interactively.</Paragraph>
      <Paragraph position="3"> For each selected verb-object pair, a question is constructed by displaying the sentence where the verb-object pair is extracted, as well as the verb-object pair itself. The annotator is then asked whether the presented verb-object pair is a LVC given the context of the sentence, and he or she will choose from the following options: (1) Yes,  (2) Not sure, (3) No. The following three sentences illustrate the options.</Paragraph>
      <Paragraph position="4"> (1) Yes - A Compaq Computer Corp.</Paragraph>
      <Paragraph position="5"> spokeswoman said that the company hasn't made a decision yet, although &amp;quot;it isn't under active consideration.&amp;quot; (2) Not Sure - Besides money, criminals have also used computers to steal secrets and intelligence, the newspaper said, but it gave no more details.</Paragraph>
      <Paragraph position="6"> (3) No - But most companies are too afraid to  take that chance.</Paragraph>
      <Paragraph position="7"> The three authors, all natural language processing researchers, took part in the annotation task, and we asked all three of them to annotate on the same data. In total, we collected annotations for 741 questions. The average correlation coefficient between the three annotators is r = 0.654, which indicates fairly strong agreement between the annotators. We constructed the gold standard data by considering the median of the three annotations for each question. Two gold standard data sets are created: * Strict - In the strict data set, a verb-object pair is considered to be a LVC if the median annotation is 1.</Paragraph>
      <Paragraph position="8"> * Lenient - In the lenient data set, a verb-object pair is considered to be a LVC if the median annotation is either 1 or 2.</Paragraph>
      <Paragraph position="9"> Each of the strict and lenient data sets have 741 verb-object pairs.</Paragraph>
    </Section>
    <Section position="3" start_page="53" end_page="54" type="sub_section">
      <SectionTitle>
4.3 Experiments
</SectionTitle>
      <Paragraph position="0"> We have two aims for the experiments: (1) to compare between the various base features and the extended features, and (2) to evaluate the effectiveness of our new features.</Paragraph>
      <Paragraph position="1"> Using the Weka data mining toolkit (Witten and Frank, 2000), we have run a series of experiments with different machine learning algorithms. However, since our focus of the experiments is to determine which features are useful and not to evaluate the machine learners, we report the results achieved by the best single classifier without additional tuning, the random forest classifier (Breiman, 2001). Stratified ten-fold cross-validation is performed. The evaluation criteria used is the F1-measure on the LV C class, which is defined as</Paragraph>
      <Paragraph position="3"> where P and R are the precision and recall for the  tended features.</Paragraph>
      <Paragraph position="4"> We first present the results for the base features and the extended features in Table 1. From these results, we make the following observations: * Overall, DJ and DJ-FILTER perform better than GT and FREQ. This is consistent with the results by Dras and Johnson (1996).</Paragraph>
      <Paragraph position="5"> * The results for both GT/FREQ and DJ show that filtering using preposition does not impact performance significantly. We believe that the main reason for this is that the filtering process causes information to be lost. 163 of the 741 verb-object pairs in the corpus do not have a preposition following the object and hence cannot be properly classified using the features with filtering.</Paragraph>
      <Paragraph position="6"> * The SFN metric does not appear to work with our corpus. We suspect that it requires a far larger corpus than our corpus of 24,647 verb-object pairs to work. Stevenson et al. (2004)  have used a corpus whose estimated size is at least 15.7 billion, the number of hits returned in a Google search for the query &amp;quot;the&amp;quot; as of February 2006. The large corpus requirement is thus a main weakness of the SFN metric.</Paragraph>
      <Paragraph position="7">  We now evaluate the effectiveness of our class of new features. Here, we do not report results of classification using only the new features, because these features alone are not intended to constitute a stand-alone measure of the lightness. As such, we evaluate these new features by adding them on top of the base features. We first construct a full feature set by utilizing the base features (GT, DJ and SFN) and all the new features. We chose not to add the extended features to the full feature set because these extended features are not independent to the base features. Next, to show the effectiveness of each new feature individually, we remove it from the full feature set and show the performance of classifier without it.</Paragraph>
      <Paragraph position="8">  binations for our evaluation.</Paragraph>
      <Paragraph position="9"> Table 2 shows the resulting F1-measures when using various sets of features in our experiments.2 We make the following observations: * The combinations of features outperform the individual features. We observe that using individual base features alone can achieve the highest F1-measure of 0.491 on the strict data set and 0.616 on the lenient data set respectively. When applying the combination of all base features, the F1-measures on both 2For the strict data set, the base feature set has a precision and recall of 0.674 and 0.446 respectively, while the full feature set has a precision and recall of 0.642 and 0.523 respectively. For the lenient data set, the base feature set has a precision and recall of 0.778 and 0.598 respectively, while the full feature set has a precision and recall of 0.768 and 0.624 respectively.</Paragraph>
      <Paragraph position="10"> data sets increased to 0.537 and 0.676 respectively. null Previous work has mainly studied individual statistics in identifying LVCs while ignoring the integration of various statistics.</Paragraph>
      <Paragraph position="11"> The results demonstrate that integrating different statistics (i.e. features) boosts the performance of LVC identification. More importantly, we employ an off-the-shelf classifier without special parameter tuning. This shows that generic machine learning methods can be applied to the problem of LVC detection. It provides a sound way to integrate various features to improve the overall performance.</Paragraph>
      <Paragraph position="12"> * Our new features boost the overall performance. Applying the newly proposed features on top of the base feature set, i.e., using the full feature set, gives F1-measures of 0.576 and 0.689 respectively (shown in bold) in our experiments. These yield a significant increase (p &lt; 0.1) over using the base features only. Further, when we remove each of the new features individually from the full feature set, we see a corresponding drop in the F1-measures, of 0.011 (deverbal counts) to 0.044 (light verb classes) for the strict data set, and 0.013 (deverbal counts) to 0.049 (light verb classes) for the lenient data set. It shows that these new features boost the overall performance of the classifier. We think that these new features are more task-specific and examine intrinsic features of LVCs. As such, integrated with the statistical base features, these features can be used to identify LVCs more accurately. It is worth noting that light verb class is a simple but important feature, providing the highest F1-measure improvement compared to other new features. This is in accordance with the observation that different light verbs have different properties (Stevenson et al., 2004).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>