<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2057"> <Title>A FrameNet-based Semantic Role Labeler for Swedish</Title> <Section position="6" start_page="441" end_page="442" type="evalu"> <SectionTitle> 4 Evaluation of the System </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="441" end_page="441" type="sub_section"> <SectionTitle> 4.1 Evaluation Corpus </SectionTitle> <Paragraph position="0"> To evaluate the system, we manually translated 150 sentences from the FrameNet example corpus.</Paragraph> <Paragraph position="1"> These sentences were selected randomly from the English development set. Some sentences were removed, typically because we found the annotation dubious or could not determine the precise meaning of the sentence. The translation was mostly straightforward. Because of the extensive use of compounding in Swedish, some frame elements were merged with target words.</Paragraph> </Section> <Section position="2" start_page="441" end_page="441" type="sub_section"> <SectionTitle> 4.2 Comparison of FE Bracketing Methods </SectionTitle> <Paragraph position="0"> We compared the performance of the two FE bracketing methods on the test set. Because of limited time, we used smaller training sets than for the full evaluation below (100,000 training instances for all classifiers). Table 2 shows the result of this comparison.</Paragraph> <Paragraph position="1"> As Table 2 shows, the globally optimized start-end method increased precision somewhat, but decreased recall and lowered the overall F-measure. We therefore used the greedy start-end method for the final evaluation described in the next section.</Paragraph> </Section> <Section position="3" start_page="441" end_page="442" type="sub_section"> <SectionTitle> 4.3 Final System Performance </SectionTitle> <Paragraph position="0"> We applied the Swedish semantic role labeler to the translated sentences and evaluated the result.</Paragraph> <Paragraph position="1"> We used the conventional experimental setting, where the frame and the target word are given in advance. The results, with approximate 95% confidence intervals, are presented in Table 3. The figures are: precision and recall for the full task, classification accuracy on pre-segmented arguments, precision and recall for the bracketing task, full-task precision and recall under the Senseval-3 scoring metrics, and finally the proportion of full sentences whose FEs were all correctly bracketed and classified. The Senseval-3 method uses a more lenient scoring scheme that counts an FE as correctly identified if it overlaps with the gold-standard FE and has the correct label. Although the strict measures are more interesting, we include these figures for comparison with the systems participating in the Senseval-3 Restricted task (Litkowski, 2004).</Paragraph> <Paragraph position="2"> We include baseline scores for the argument bracketing and classification tasks, respectively.</Paragraph> <Paragraph position="3"> The bracketing baseline considers the non-punctuation subtrees that depend on the target word. When the target word is a verb, the baseline puts FE brackets around the words included in each of these subtrees (this is possible because MALTPARSER produces projective trees, i.e., the words in each subtree form a contiguous substring of the sentence). When the target is a noun, we also bracket the target word token itself, and when it is an adjective, we additionally bracket its parent token.</Paragraph>
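To make the bracketing baseline concrete, the following is a minimal Python sketch. The Node class and its attribute names are hypothetical illustrations, not code or a data format from the paper; the projectivity of the parse trees is what makes each subtree span contiguous.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical dependency-tree interface; attribute names are illustrative,
# not taken from the paper or from MALTPARSER's actual output format.
@dataclass
class Node:
    index: int                        # word position in the sentence
    word: str
    pos: str                          # "verb", "noun", or "adj" for targets
    is_punct: bool = False
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

def subtree_indices(node: Node) -> List[int]:
    """Collect all word positions in the subtree rooted at `node`."""
    positions = [node.index]
    for child in node.children:
        positions.extend(subtree_indices(child))
    return positions

def subtree_span(node: Node) -> Tuple[int, int]:
    """Word span of a subtree; contiguous because the trees are projective."""
    positions = subtree_indices(node)
    return min(positions), max(positions)

def baseline_brackets(target: Node) -> List[Tuple[int, int]]:
    """FE spans proposed by the bracketing baseline for one target word."""
    # One candidate FE per non-punctuation subtree depending on the target.
    spans = [subtree_span(c) for c in target.children if not c.is_punct]
    if target.pos == "noun":
        # Nominal target: also bracket the target token itself.
        spans.append((target.index, target.index))
    elif target.pos == "adj" and target.parent is not None:
        # Adjectival target: additionally bracket the parent token.
        spans.append((target.parent.index, target.parent.index))
    return spans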
<Paragraph position="4"> As a baseline for the argument classification task, every argument is assigned the most frequent semantic role in the frame. As can be seen from the table, all scores except the argument bracketing recall are well above the baselines.</Paragraph> <Paragraph position="5"> Although the performance figures are better than the baselines, they are still lower than those of most English systems (although higher than those of some of the systems at Senseval-3). We believe that the main reason for this level of performance is the quality of the data used to train the system, since the results are consistent with the hypothesis that the quality of the transferred data is roughly equal to the performance of the English system multiplied by the performance of the transfer method (Johansson and Nugues, 2005). In that experiment, the transfer method had a precision of 0.84, a recall of 0.81, and an F-measure of 0.82. If we assume that the transfer performance is similar for Swedish, we arrive at a precision of 0.71 * 0.84 = 0.60, a recall of 0.65 * 0.81 = 0.53, and an F-measure of 0.56. The F-measures match closely: 0.55 for the system and 0.56 for the product. For precision, the system performance (0.67) is significantly higher than the product (0.60), which suggests that the SVM learning method handles the noisy training set rather well for this task. The recall (0.47) is lower than the corresponding product (0.53), but the difference is not statistically significant at the 95% level. These figures suggest that the main effort towards improving the system should be spent on improving the training data.</Paragraph>
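As a quick worked check of this arithmetic (a sketch only: the paper does not spell out the formula, and the balanced F-measure with the $F_1$ notation is the conventional choice, assumed here):

\[
P \approx 0.71 \times 0.84 \approx 0.60, \qquad R \approx 0.65 \times 0.81 \approx 0.53,
\]
\[
F_1 = \frac{2PR}{P + R} = \frac{2 \cdot 0.60 \cdot 0.53}{0.60 + 0.53} \approx 0.56,
\]

in agreement with the projected F-measure of 0.56 quoted above.
</Section> </Section> </Paper>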