<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1019">
<Title>Investigating Loss Functions and Optimization Methods for Discriminative Learning of Label Sequences</Title>
<Section position="7" start_page="0" end_page="0" type="evalu">
<SectionTitle>6 Experimental Results</SectionTitle>
<Paragraph position="0"> The results summarized in Table 3 compare the perceptron and the boosting algorithm with the gradient-based method. The performance of the standard perceptron algorithm fluctuates considerably, whereas the averaged perceptron is more stable; we therefore report the results of the averaged perceptron here. Not surprisingly, it performs slightly worse than CRFs, since it is an approximation of CRFs. The advantage of the perceptron algorithm is its dual formulation: in the dual form, the explicit feature mapping can be avoided by using the kernel trick, so a large number of features can be handled efficiently. As we have seen in the previous sections, the ability to incorporate more features has a large impact on accuracy. Therefore, a dual perceptron algorithm may have a substantial advantage over the other methods.</Paragraph>
<Paragraph position="1"> When only HMM features are used, boosting, as a sequential algorithm, performs worse than the gradient-based method, which optimizes all parameters in parallel. This is because the HMM features carry little information beyond the identity of the word to be labeled, so the boosting algorithm needs to include almost all of the features one by one in the ensemble. When a few more informative features are available, the boosting algorithm makes better use of them. This situation is more pronounced in POS tagging: boosting achieves 89.42% and 94.92% accuracy for the two feature sets, whereas the gradient-based method achieves 94.57% and 95.25%. The gradient-based method uses all of the available features, whereas boosting uses only about 10% of them. Due to the loose upper bound that boosting optimizes, the estimates of the updates are very conservative, so the same features are selected many times. This negatively affects the convergence time, and the other methods outperform boosting in terms of training time.</Paragraph>
</Section>
</Paper>
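The averaged perceptron compared above is the standard mistake-driven perceptron for label sequences, with the weight vector averaged over all updates to damp the fluctuation of the plain perceptron. The following is a minimal Python sketch under stated assumptions, not the paper's implementation: the feature templates (simple word/tag emission and tag-bigram features), the toy data, and the helper names (features, viterbi, train_averaged_perceptron) are illustrative placeholders rather than anything taken from the paper.

# Minimal sketch of an averaged perceptron for label sequences.
# Assumptions: simple emission + transition features and a toy tag set;
# the paper's actual feature templates and data are not reproduced here.
from collections import defaultdict

def features(words, tags):
    """Joint feature map: word/tag emission features plus tag-bigram features."""
    feats = defaultdict(float)
    prev = "<s>"
    for w, t in zip(words, tags):
        feats[("emit", w, t)] += 1.0
        feats[("trans", prev, t)] += 1.0
        prev = t
    return feats

def viterbi(words, weights, tagset):
    """Exact argmax over tag sequences under the current weights."""
    n = len(words)
    score = [{t: float("-inf") for t in tagset} for _ in range(n)]
    back = [{t: None for t in tagset} for _ in range(n)]
    for t in tagset:
        score[0][t] = weights[("emit", words[0], t)] + weights[("trans", "<s>", t)]
    for i in range(1, n):
        for t in tagset:
            for tp in tagset:
                s = (score[i - 1][tp]
                     + weights[("emit", words[i], t)]
                     + weights[("trans", tp, t)])
                if s > score[i][t]:
                    score[i][t], back[i][t] = s, tp
    best = max(tagset, key=lambda t: score[n - 1][t])
    tags = [best]
    for i in range(n - 1, 0, -1):
        best = back[i][best]
        tags.append(best)
    return list(reversed(tags))

def train_averaged_perceptron(data, tagset, epochs=5):
    """data: list of (words, gold_tags) pairs; returns the averaged weights."""
    w = defaultdict(float)       # current weights (standard perceptron)
    w_sum = defaultdict(float)   # running sum of weights, for averaging
    steps = 0
    for _ in range(epochs):
        for words, gold in data:
            guess = viterbi(words, w, tagset)
            if guess != gold:    # mistake-driven additive update
                for f, v in features(words, gold).items():
                    w[f] += v
                for f, v in features(words, guess).items():
                    w[f] -= v
            for f, v in w.items():
                w_sum[f] += v    # accumulate after every example
            steps += 1
    return defaultdict(float, {f: v / steps for f, v in w_sum.items()})

if __name__ == "__main__":
    # Toy usage: two training sentences, then decode a new one.
    data = [(["the", "dog", "barks"], ["DET", "NOUN", "VERB"]),
            (["a", "cat", "sleeps"], ["DET", "NOUN", "VERB"])]
    tagset = ["DET", "NOUN", "VERB"]
    w_avg = train_averaged_perceptron(data, tagset)
    print(viterbi(["the", "cat", "barks"], w_avg, tagset))

Returning the averaged weight vector w_sum / steps, rather than the final w, is what gives the stability noted above: individual updates can swing the current weights sharply, but their running average changes slowly.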