<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1622">
  <Title>Semantic Role Labeling via Instance-Based Learning</Title>
  <Section position="9" start_page="183" end_page="186" type="evalu">
    <SectionTitle>
5 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> Experimental results were obtained for part of the Brown corpus (the part provided by CoNLL2005) and for Wall Street Journal (WSJ) Sections 21, 23, and 24 using different training data sets (WSJ 21, WSJ 15 to 18, and WSJ 02 to 21) shown in Table 1. There are two tasks, Role classification with known arguments as input, and Boundary recognition &amp; Role classification with gold (hand-corrected) parses or auto (Charniak's) parses. In addition, execution speed, the learning curve, and some further results for exploration of kNN and PML are also included below.</Paragraph>
    <Section position="1" start_page="183" end_page="183" type="sub_section">
      <SectionTitle>
5.1 WSJ 24 with known arguments
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the results from kNN and PML with known boundaries/arguments (i.e. the systems are given the correct arguments for role classification). All training datasets (WSJ02-21) include Charniak's parse trees. The table shows that PML achieves F1: 2.69 better than kNN.</Paragraph>
    </Section>
    <Section position="2" start_page="183" end_page="183" type="sub_section">
      <SectionTitle>
5.2 Features &amp; Heuristic on WSJ 24 with
known arguments
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the contribution of each feature and of the actor heuristic, measured by excluding one feature or heuristic at a time. It indicates that Head Word, Preposition, and Distance are the three features that contribute most to system accuracy, with the additional Actor heuristic fourth. Path, Phrase type, and Voice are the three features contributing least for both classification algorithms.</Paragraph>
    </Section>
    <Section position="3" start_page="183" end_page="183" type="sub_section">
      <SectionTitle>
5.3 Learning Curve
</SectionTitle>
      <Paragraph position="0"> Table 4 shows that performance improves as more training data is provided; and that PML outperforms kNN by about F1:2.8 on average for WSJ 24 for the three different training sets, mainly because the backoff lattice improves both recall and precision. The table shows that it is not always beneficial to include all features for labeling all roles. While P(r  |hw, pt, pre, pp) is mainly for adjunctive roles (e.g. AM-TMP), P(r | pt, di, vo, pr, pp) is mainly for core roles (e.g. A0).</Paragraph>
    </Section>
    <Section position="4" start_page="183" end_page="184" type="sub_section">
      <SectionTitle>
5.4 Performance of Execution Time
</SectionTitle>
      <Paragraph position="0"> Building (or training) time is about 2.5 minutes for both PML and kNN, whereas it takes anywhere from about 10 hours to 60 hours for other ML-based architectures (according to the data presented by McCracken http://www.lsi.upc.es/ ~srlconll/st05/slides/mccracken.pdf). Table 5 shows average execution time (in seconds) per sentence for the two algorithms. PML runs faster than kNN when all 20 training datasets are used (i.e. WSJ 02 to 21). A graphic illustration of execution speed is shown in Figure 6. The simulation formulas for PML and kNN are &amp;quot;y = 0.1734Ln(x) - 0.9046&amp;quot; and &amp;quot;y = 2.441*10-5 x + 0.0129&amp;quot; respectively. &amp;quot;x&amp;quot; denotes numbers of training sentences, and &amp;quot;y&amp;quot; denotes second per sentence related to &amp;quot;x&amp;quot; training sentences. The execution time for PML is about 8 times longer than kNN for 1.7k training sentences, but PML ultimately runs faster than kNN on all 39.8K training sentences (and, extrapolating from the graph in Figure 6, on any larger datasets). Thus PML seems generally more suitable for large training data.</Paragraph>
      <Paragraph position="1">  5.5 WSJ 24 with Gold parses and PARA Table 6 shows performance for both systems when gold (hand-corrected) parses are supplied and PARA preprocessing is employed. Compared to the results in Table 4, the performance on the combined training sets (WSJ 02 to 21) drops F1:9.24 and Lacc (label accuracy):2.4 for kNN; and drops F1:8.02 and Lacc:0.66 for PML respectively. This may indicate that PML is more error tolerant in labeling accuracy. However, both systems perform worse due largely to an idiosyncratic problem in the PARApreprocessor when dealing with hand-corrected parses--ultimately due to a particular parsing error.</Paragraph>
    </Section>
    <Section position="5" start_page="184" end_page="184" type="sub_section">
      <SectionTitle>
5.6 WSJ 24 with Charniak's parses and
PARA
</SectionTitle>
      <Paragraph position="0"> Table 7 shows the performance of both systems using auto-parsing (i.e. Charniak's parser) and PARA argument recognition. Compared to the results in Table 4, the performance on all training sets (WSJ 02 to 21) drops F1:17.25 and Lacc:0.65 for kNN, and F1:16.78 and Lacc:-0.78 (i.e. increasing Lacc) for PML respectively.</Paragraph>
      <Paragraph position="1"> Both systems drop a lot in F1 due to errors caused by the auto-parser (in particular errors relating to punctuation), whose effects are subsequently exacerbated by PARA. Even so, the label accuracy (Lacc) is more or less similar because the training dataset are parsed by Charniak's parser instead of gold parses.</Paragraph>
    </Section>
    <Section position="6" start_page="184" end_page="184" type="sub_section">
      <SectionTitle>
5.7 WSJ 23 with Charniak's parses and
PARA
</SectionTitle>
      <Paragraph position="0"> Table 8 shows the results for WSJ 23, where the performance of PML exceeds kNN by about F1:3.8. WSJ 23 is used as a comparison dataset in SRL. More comparisons with other systems are shown in Table 12.</Paragraph>
    </Section>
    <Section position="7" start_page="184" end_page="186" type="sub_section">
      <SectionTitle>
5.8 Brown corpus with Charniak's parses
and PARA
</SectionTitle>
      <Paragraph position="0"> Table 9 shows the results when moving to a different language domain--the Brown corpus.</Paragraph>
      <Paragraph position="1"> Both systems drop a lot in F1 . Compared to WSJ 23, MPL drops 10.47 in F1 and kNN, 11.65 in F1.</Paragraph>
      <Paragraph position="2"> These drops are caused partially by PARA, and partially by classifiers. PARA in Lin &amp; Smith (2006) drops about 3.1 in F1 when moving to the Brown Corpus; but more research is required to uncover the cause.</Paragraph>
      <Paragraph position="3"> 5.9 Further results on kNN with all training data Table 10 shows different results for various values of k in kNN. Both systems, GP (gold-parse) &amp; PARA and CP (Charniak's parse) &amp; PARA, perform best (as measured by F1) when K is set as one. But when the system is labeling a known argument, selection of k=5 is better in terms of both F1 and Label accuracy (Lacc).</Paragraph>
      <Paragraph position="4"> 5.10 Further results on PML with all training data Table 11 shows results for PML with different methods of calculating probabilities. &amp;quot;L+G&amp;quot; means the basic probability distribution (from Figure 2). &amp;quot;L only&amp;quot; and &amp;quot;G only&amp;quot; mean all probability is calculated only as either &amp;quot;local&amp;quot; or &amp;quot;global&amp;quot;, respectively. &amp;quot;L&gt;&gt;G&amp;quot; means that probabilities are calculated globally only when the local probability is zero. &amp;quot;L only&amp;quot; is the fastest approach, and &amp;quot;G only&amp;quot; the slowest (about five seconds per sentence). Both are poor in performance. &amp;quot;L+G&amp;quot; has the best result and &amp;quot;L&gt;&gt;G&amp;quot; is rated as intermediate in performance and execution time.</Paragraph>
      <Paragraph position="5"> 5.11 Comparison with other systems Table 12 shows results from other existing systems. In the second row (PARA+PML) is trained on all datasets (WSJ 02 to 21) for the &amp;quot;BR+RL&amp;quot; task (to recognize argument boundaries and label arguments) on the test data WSJ 23, with an improvement of F1:8.28 in comparison to the result of Palmer et al., (2005) given in the  first row. The basic kNN in the fourth row, trained by four datasets (WSJ 15 to 18 in CoNLL 2004) for the RL&amp;quot; task (to label arguments by giving the known arguments) on the test data WSJ 21, increases F1:6.68 compared to the result of Kouchnir (2004) in the third row. Execution time for our own re-implementation of Palmer (2005) is about 3.785 sec per sentence. Instead of calculating each node in a parse tree like the Palmer (2005) model, PARA+PML can only focus on essential nodes from the output of PARA, which helps to reduce the execution time as 0.941 second per sentence. Execution time by Palmer (2005) is about 4 times longer than PARA+PML on the same machine (n.b. execution times are for a computer running Linux on a P4 2.6GHz CPU with 1G MBRAM).</Paragraph>
      <Paragraph position="6"> More details from different systems and combinations of systems are described in the proceed-</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>