<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0107"> <Title>Latent Features in Automatic Tense Translation between Chinese and English Yang Ye + , Victoria Li Fossum SS</Title> <Section position="8" start_page="51" end_page="53" type="evalu"> <SectionTitle> 5 Experiments and Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 5.1 CRF learning algorithms </SectionTitle> <Paragraph position="0"> Conditional Random Fields (CRFs) are a formalism well-suited for learning and prediction on sequential data in many NLP tasks. It is a probabilistic framework proposed by (Lafferty et al., 2001) for labeling and segmenting structured data, such as sequences, trees and lattices. The conditional nature of CRFs relaxes the independence assumptions required by traditional Hidden Markov Models (HMMs). This is because the conditional model makes it unnecessary to explicitly represent and model the dependencies among the input variables, thus making it feasible to use interacting and global features from the input. CRFs also avoid the label bias problem exhibited by maximum entropy Markov models (MEMMs) and other conditional Markov models based on directed graphical models. CRFs have been shown to perform well on a number of NLP problems such as shallow parsing (Sha and Pereira, 2003), table extraction (Pinto et al., 2003), and named entity recognition (McCallum and Li, 2003). For our experiments, we use the MALLET implementation of CRF's (McCallum, 2002).</Paragraph> </Section> <Section position="2" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 5.2 Experiments </SectionTitle> <Paragraph position="0"> All supervised learning algorithms require a certain amount of training data, and the reliability of the computational solutions is intricately tied to the accuracy of the annotated data. Human annotations typically suffer from errors, subjectivity, and the expertise effect. Therefore, researchers use consistency checking to validate human annotation experiments. The Kappa Statistic (Cohen, 1960) is a standard measurement of inter-annotator agreement for categorical data annotation. The Kappa score is defined by the following formula, where P(A) is the observed agreement rate from multiple annotators and P(E) is the expected rate of agreement due to pure chance:</Paragraph> <Paragraph position="2"> (2) Since tense annotation requires disambiguating grammatical meaning, which is more abstract than lexical meaning, one would expect the challenge posed by human annotators in a tense annotation experiment to be even greater than for word sense disambiguation. Nevertheless, the tense annotation experiment carried as a precursor to our tense classification task showed a kappa Statistic of 0.723 on the full taxonomy, with an observed agreement of 0.798. In those experiments, we asked three bilingual English native speakers who are fluent in Chinese to annotate the English verb tenses for the first 25 Chinese and English parallel news articles from our training data.</Paragraph> <Paragraph position="3"> We could also obtain a measurement of reliability by taking one annotator as the gold standard at one time, then averaging over the precisions of the different annotators across different gold standards. 
We train a tense classifier on our data set in two stages: first on the surface features alone, and then on the combined space of both the surface features (discussed in 4.1) and the latent features (discussed in 4.2 through 4.4). It is conceivable that the granularity of the sequences matters when learning from data with sequential relationships, and in the context of verb tense tagging, sequence granularity naturally maps to discourse granularity. Ye et al. (2005) show that there is no significant difference between sentence-level sequences and paragraph-level sequences, so we experiment with sentence-level sequences only.

5.2.3 Classification Tree Learning Experiments

To verify the stability of the utility of the latent features, we also experiment with classification tree learning on the same feature space discussed above. Classification trees predict the membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. The main idea of a classification tree is to recursively partition the variable space so as to achieve a good separation of the classes in the training data. We use the Recursive Partitioning and Regression Trees (rpart) package of the R statistical computing software as our implementation of classification trees. To avoid over-fitting, we prune the tree by requiring at least 10 objects in a node before attempting a split and at least 3 objects in any terminal node. In the classification tree constructed from all features, both surface and latent, the top split at the root node is based on the telicity feature of the English verb, indicating the importance of that feature among all of the features.
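As an illustration of these pruning settings, here is a minimal sketch using scikit-learn's DecisionTreeClassifier in place of R's rpart; the binary feature matrix and tense labels are randomly generated placeholders, not the paper's feature set.

```python
# Classification-tree sketch mirroring the two rpart pruning settings above.
# scikit-learn stands in for R's rpart here, and the random binary feature
# matrix is a placeholder for the paper's surface and latent features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 8))               # 8 hypothetical binary features
y = rng.choice(["past", "present", "future"], 200)  # hypothetical tense labels

tree = DecisionTreeClassifier(
    min_samples_split=10,  # at least 10 objects in a node to attempt a split
    min_samples_leaf=3,    # at least 3 objects in any terminal node
    random_state=0,
)
tree.fit(X, y)
# The feature tested at the root is the analogue of the paper's top split,
# which fell on the telicity feature of the English verb.
print("root split on feature index:", tree.tree_.feature[0])
```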
5.3 Evaluation Results

All results are obtained by 5-fold cross-validation. The classifier's performance is evaluated against the tenses from the best-ranked human-generated English translation. To evaluate the performance of the CRF tense classifier, we compute precision, recall, general accuracy, and F, defined in terms of the following quantities:

1. N_prediction: total number of predictions;
2. n_prediction: number of correct predictions;
3. N_hit: total number of hits;
4. n_hit: number of correct hits;
5. S: size of the perfect hit list.

From Table 1, we see that the past tense, which occurs most frequently in the training data, has the highest precision, recall, and F. The future tense, which occurs least frequently, has the lowest F. Precision and recall show no clear pattern across the different tense classes.

Table 2 presents the apparent classification accuracies on the training data; there, too, the latent features outperform the surface features. Table 3 summarizes the general accuracies of the tense classification systems for CRFs and classification trees. The CRF classifier and the classification tree classifier show similar scales of improvement as we move from surface features to latent features to the combination of both.

5.4 Baseline Systems

To better evaluate our tense classifiers, we provide two baseline systems. The first baseline system is the tense resolution from the best-ranked machine translation system's output in the MTC corpus mentioned above. When evaluated against the reference tense tags from the best-ranked human translation team, the best MT system yields an accuracy of 47%. The second baseline system is a naive system that assigns the most frequent tense in the training data set, in our case the past tense, to all verbs in the test data set. Given that we are dealing with newswire data, this yields a high baseline accuracy of 69.5%.
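A minimal sketch of this majority-tense baseline follows; the tense counts are invented proportions, chosen only so that the toy accuracy matches the reported 69.5%.

```python
# Sketch of the majority-tense baseline: every verb in the test set is
# tagged with the most frequent training tense. The counts below are
# invented proportions chosen to reproduce the reported 69.5% accuracy.
from collections import Counter

train_tenses = ["past"] * 139 + ["present"] * 41 + ["future"] * 20
test_tenses = ["past"] * 139 + ["present"] * 41 + ["future"] * 20

majority = Counter(train_tenses).most_common(1)[0][0]   # "past"
accuracy = sum(t == majority for t in test_tenses) / len(test_tenses)
print(majority, accuracy)   # past 0.695
```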