<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1620">
<Title>Multilingual Deep Lexical Acquisition for HPSGs via Supertagging</Title>
<Section position="8" start_page="168" end_page="168" type="evalu">
<SectionTitle> 5 Evaluation </SectionTitle>
<Paragraph position="0"> Evaluation is based on the treebank data associated with each grammar, using a random training-test split of 20,000 training sentences and 1,013 test sentences in the case of the ERG, and 40,000 training sentences and 1,095 test sentences in the case of JACY. This split is fixed for all models tested.</Paragraph>
<Paragraph position="1"> Given that the goal of this research is to acquire novel lexical items, our primary focus is on the performance of the different models at predicting the lexical type of any lexical items which occur only in the test data (these may be either novel lexemes or previously-seen lexemes occurring with a novel lexical type). As such, we identify all unknown lexical items in the test data and evaluate according to: token accuracy (the proportion of unknown lexical items which are correctly tagged: ACC_unk); type precision (the proportion of hypothesised unknown lexical entries which are correct: PREC); type recall (the proportion of gold-standard unknown lexical entries for which we make a correct prediction: REC); and type F-score (the harmonic mean of type precision and type recall: F-SCORE). We also measure the overall token accuracy (ACC) across all words in the test data, irrespective of whether they represent known or unknown lexical items (these measures are illustrated in code at the end of this section).</Paragraph>
<Section position="1" start_page="168" end_page="168" type="sub_section">
<SectionTitle> 5.1 Baseline: Unigram Supertagger </SectionTitle>
<Paragraph position="0"> As a baseline model, we use a simple unigram supertagger trained via maximum likelihood estimation over the relevant training data, i.e. the tag $\hat{t}_w$ for each token instance of a given word $w$ is predicted by:</Paragraph>
<Paragraph position="1"> $\hat{t}_w = \operatorname*{argmax}_{t} \hat{P}(t \mid w) = \operatorname*{argmax}_{t} \frac{f(w,t)}{f(w)}$, where $f(\cdot)$ denotes frequency in the training data.</Paragraph>
<Paragraph position="2"> In the instance that $w$ was not observed in the training data, we back off to the majority lexical type in the training data.</Paragraph>
</Section>
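As a purely illustrative rendering of this baseline, the following minimal Python sketch trains the unigram supertagger by relative frequency and backs off to the majority lexical type for unseen words. The function names and the toy lexical types are our own invention, not taken from the paper.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_tokens):
    """Collect per-word lexical-type counts from (word, lexical_type) training pairs."""
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for word, tag in tagged_tokens:
        word_tag_counts[word][tag] += 1
        tag_counts[tag] += 1
    # Back-off: the majority lexical type over the whole training set.
    majority_tag = tag_counts.most_common(1)[0][0]
    return word_tag_counts, majority_tag

def predict(word, word_tag_counts, majority_tag):
    """argmax_t P(t|w) by relative frequency; back off for unseen words."""
    if word in word_tag_counts:
        return word_tag_counts[word].most_common(1)[0][0]
    return majority_tag

# Toy usage (lexical-type names are invented for illustration):
train = [("dog", "n_-_c_le"), ("dog", "n_-_c_le"), ("runs", "v_np_le")]
counts, backoff = train_unigram_tagger(train)
assert predict("dog", counts, backoff) == "n_-_c_le"
assert predict("cat", counts, backoff) == "n_-_c_le"  # unseen word -> majority type
```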
<Section position="2" start_page="168" end_page="168" type="sub_section">
<SectionTitle> 5.2 Benchmark: fnTBL </SectionTitle>
<Paragraph position="0"> In order to benchmark our results against the CRF models, we reimplemented the supertagger model proposed by Baldwin (2005b), which simply takes FNTBL 1.1 (Ngai and Florian, 2001) off the shelf and trains it over our particular training set. FNTBL is a transformation-based learner that is distributed with pre-optimised POS tagging modules for English and other European languages, which can be redeployed for the task of supertagging.</Paragraph>
<Paragraph position="1"> Following Baldwin (2005b), the only modifications we make to the default English POS tagging methodology are: (1) to set the default lexical types for singular common and proper nouns to n_-_c_le and n_-_pn_le, respectively; and (2) to reduce the threshold score for lexical and context transformation rules to 1. It is important to realise that, unlike our proposed method, the English POS tagger implementation in FNTBL has been fine-tuned to the English POS task, and includes a rich set of lexical templates specific to English.</Paragraph>
<Paragraph position="2"> Note that we were only able to run FNTBL over the English data, as encoding issues with the Japanese data proved insurmountable. We are thus only able to compare results over English, although this is expected to be representative of the relative performance of the methods.</Paragraph>
</Section>
</Section>
<Section position="9" start_page="168" end_page="169" type="evalu">
<SectionTitle> 5.3 Results </SectionTitle>
<Paragraph position="0"> The results for the baseline, the benchmark FNTBL method for English, and our proposed CRF-based supertagger are presented in Table 3, for each of the ERG and JACY. In order to gauge the impact of the lexical features on the performance of our CRF-based supertagger, we ran the supertagger first without lexical features (CRF-LEX) and then with lexical features (CRF+LEX).</Paragraph>
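To make the CRF-LEX/CRF+LEX contrast concrete, here is a minimal sketch of per-token feature extraction in the string-feature style of CRFsuite-like toolkits. The actual feature set of the paper's supertagger is not reproduced here; the character-affix features below merely stand in for lexical features of the kind the full model uses, and all names are illustrative.

```python
def token_features(sent, i, use_lexical=True):
    """Build a feature dict for token i of a sentence (a list of word strings)."""
    w = sent[i]
    feats = {
        "bias": "1",
        "w": w.lower(),
        "prev_w": sent[i - 1].lower() if i > 0 else "<s>",
        "next_w": sent[i + 1].lower() if i < len(sent) - 1 else "</s>",
    }
    if use_lexical:
        # Character-affix features: a stand-in for the lexical features
        # whose contribution the CRF+LEX / CRF-LEX comparison isolates.
        for n in (1, 2, 3):
            feats[f"prefix_{n}"] = w[:n]
            feats[f"suffix_{n}"] = w[-n:]
    return feats

sent = ["the", "dog", "barks"]
X_plus  = [token_features(sent, i, use_lexical=True)  for i in range(len(sent))]  # CRF+LEX
X_minus = [token_features(sent, i, use_lexical=False) for i in range(len(sent))]  # CRF-LEX
```

Training one model on each feature representation then yields the two variants compared in Table 3.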
<Paragraph position="1"> The first finding of note is that the proposed model surpasses both the baseline and FNTBL in all cases. If we look at token accuracy for unknown lexical items, the CRF is far and away the superior method, a result which is somewhat diminished but still marked for type-level precision, recall and F-score. Recall that for the purposes of this paper, our primary interest is in how successfully we are able to learn new lexical items, and in this sense the CRF appears to have a clear edge over the other models. It is also important to recall that our results over both English and Japanese have been achieved with only the bare minimum of lexical feature engineering, whereas those of FNTBL are highly optimised.</Paragraph>
<Paragraph position="2"> Comparing the results for the CRF with and without lexical features (CRF+-LEX), the lexical features appear to have a strong bearing on type precision in particular, for both the ERG and JACY.</Paragraph>
<Paragraph position="3"> Looking at the raw numbers, the type-level performance of all methods is far from flattering. However, it is entirely predictable that the overall token accuracy should be considerably higher than the token accuracy for unknown lexical items.</Paragraph>
<Paragraph position="4"> A breakdown of type precision and recall for unknown words across the major word classes for English suggests that the CRF+LEX supertagger is most adept at learning nominal and adjectival lexical items (with F-scores of 0.671 and 0.628, respectively), and has the greatest difficulty with verbs and adverbs (with F-scores of 0.333 and 0.395, respectively). In the case of Japanese, conjugating adjectives and verbs present the least difficulty (with F-scores of 0.933 and 0.886, respectively), while non-conjugating adjectives and adverbs are considerably harder (with F-scores of 0.396 and 0.474, respectively).</Paragraph>
<Paragraph position="5"> It is encouraging to note that type precision is higher than type recall in all cases (a phenomenon that is especially noticeable for the ERG), as this means that while we are not producing the full inventory of lexical items for a given lexeme, over half of the lexical items that we do produce are genuine (with CRF+LEX). This suggests that it should be possible to present the grammar developer with a relatively low-noise set of automatically learned lexical items for them to manually curate and feed into the lexicon proper.</Paragraph>
<Paragraph position="6"> One final point of interest is the ability of the CRF to identify multiword expressions (MWEs).</Paragraph>
<Paragraph position="7"> There were no unknown multiword expressions in either the English or Japanese data, meaning that we can only evaluate the performance of the supertagger at identifying known MWEs. In the case of English, CRF+LEX identified strictly continuous MWEs with an accuracy of 0.758, and optionally discontinuous MWEs (i.e. verb-particle constructions) with an accuracy of 0.625. For Japanese, the accuracy is considerably lower, at 0.536 for continuous MWEs (recall that there were no optionally discontinuous MWEs in JACY).</Paragraph>
</Section>
</Paper>
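For concreteness, here is a minimal sketch of the evaluation measures used in Section 5 (token accuracy, and type precision/recall/F-score over unknown lexical entries), assuming predicted and gold-standard entries are represented as (lexeme, lexical type) pairs. The function names and the toy entries are illustrative, not from the paper.

```python
def type_prf(predicted_entries, gold_entries):
    """Type precision/recall/F-score over sets of (lexeme, lexical_type) pairs."""
    predicted, gold = set(predicted_entries), set(gold_entries)
    correct = len(predicted & gold)
    prec = correct / len(predicted) if predicted else 0.0
    rec = correct / len(gold) if gold else 0.0
    f = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return prec, rec, f

def token_accuracy(predicted_tags, gold_tags):
    """Proportion of tokens whose predicted lexical type matches the gold tag."""
    assert len(predicted_tags) == len(gold_tags) and gold_tags
    return sum(p == g for p, g in zip(predicted_tags, gold_tags)) / len(gold_tags)

# Toy usage with invented entries:
pred = {("dog", "n_-_c_le"), ("bark", "v_np_le")}
gold = {("dog", "n_-_c_le"), ("bark", "v_-_le")}
print(type_prf(pred, gold))  # (0.5, 0.5, 0.5)
```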