<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0113"> <Title>A Re-estimation Method for Stochastic Language Modeling from Ambiguous Observations</Title> <Section position="8" start_page="162" end_page="165" type="evalu"> <SectionTitle> 6 Discussion and Future Work </SectionTitle> <Paragraph position="0"> Merialdo (1994) and Elworthy (1994) have argued, based on their experimental results, that maximum likelihood training using an untagged corpus does not necessarily improve tagging accuracy. However, their likelihood was the probability with all paths weighted equally. Since more than half of the symbols in the observations may be noise, models estimated in this way are not reliable. The credit factor was introduced to redefine the likelihood of the training data: the new likelihood is the probability with each possible path weighted by its credit factor. The extended reestimation algorithm approximately maximizes this modified likelihood and improves model accuracy.</Paragraph> <Paragraph position="1"> The Baum-Welch reestimation algorithm was also extended in two ways. First, because it copes with lattice-based observations as training data, it can be applied to an unsegmented language (e.g., Japanese). Second, it can train the HMM in addition to the N-gram model. Takeuchi and Matsumoto (1995) proposed a bigram estimation method for untagged Japanese corpora. Their algorithm divides a morpheme network into possible sequences that are then fed to the normal Baum-Welch algorithm. It cannot take advantage of the scaling procedure, because scaling requires the synchronous calculation of all possible sequences in the morpheme network. Nagata (1996) recently proposed a generalized forward-backward algorithm, a character-synchronous method for unsegmented languages. 
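As a rough sketch of the credit-weighted likelihood described above, the following toy example enumerates every path through an ambiguous observation lattice and weights each path's HMM probability by a credit factor. All parameter values, the lattice, and the credit function are invented for illustration; a real implementation would use the forward-backward recurrences rather than explicit path enumeration.

```python
import itertools

# Hypothetical toy HMM over two POS tags, N and V.
pi = {"N": 0.6, "V": 0.4}                      # initial tag probabilities
A  = {("N", "N"): 0.3, ("N", "V"): 0.7,
      ("V", "N"): 0.8, ("V", "V"): 0.2}        # tag transition probabilities
B  = {("N", "time"): 0.5, ("V", "time"): 0.1,
      ("N", "flies"): 0.2, ("V", "flies"): 0.6}  # emission probabilities

# Ambiguous observation: each position holds candidate (tag, word) pairs,
# as a morphological analyzer might produce for an unsegmented language.
lattice = [[("N", "time"), ("V", "time")],
           [("N", "flies"), ("V", "flies")]]

def path_prob(path):
    """Joint probability of one tagged path under the toy HMM."""
    t0, w0 = path[0]
    p = pi[t0] * B[(t0, w0)]
    for (prev_t, _), (t, w) in zip(path, path[1:]):
        p *= A[(prev_t, t)] * B[(t, w)]
    return p

def weighted_likelihood(lattice, credit):
    """Likelihood with each possible path weighted by its credit factor."""
    return sum(credit(path) * path_prob(path)
               for path in itertools.product(*lattice))

# A uniform credit factor reduces this to the ordinary all-paths likelihood.
uniform = lambda path: 1.0
print(weighted_likelihood(lattice, uniform))  # 0.018+0.126+0.0064+0.0048 = 0.1552
```

A credit function that favors paths agreeing with a rule-based tagger would down-weight noisy paths, which is the effect the modified likelihood is designed to capture.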
He applied the algorithm to bigram model training from untagged Japanese text for new-word extraction, but not to the estimation of HMM parameters.</Paragraph> <Paragraph position="2"> Two additional experiments are planned. One concerns the limits of estimation from untagged corpora; the other concerns assigning the credit factor without a rule-based tagger.</Paragraph> <Paragraph position="3"> The credit factor improved the upper bound of the estimation accuracy achievable from an untagged corpus. At higher levels of tagging accuracy, however, the reestimation method based on the Baum-Welch algorithm is limited by the noise in untagged corpora. On this point, I agree with Merialdo (1994) and Elworthy (1994). One promising direction for future work is the integration of models estimated from tagged and untagged corpora. Although the model estimated from an untagged corpus is worse overall than one estimated from a tagged corpus, parts of it may be better, because estimation from untagged corpora can draw on far more extensive training material. In the bigram model, we can weight the probability of each tag pair in the two models estimated from the tagged and untagged corpora. A smoothing method such as deleted interpolation (Jelinek, 1985) can be used to set the weights.</Paragraph> <Paragraph position="4"> Another promising avenue for research is the development of improved methods for assigning the credit factor without a rule-based tagger. Any chosen rule-based tagger will impart its own characteristic errors to the credit factors it is used to assign, and such errors can mislead the language modeling. To assign more neutral values to the credit factor, we can use the estimated model itself. In the initial estimation of a model, a uniform credit factor is used. 
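The per-tag-pair weighting of tagged- and untagged-corpus bigram models proposed above can be sketched as a linear interpolation. The tag names, probability tables, and the fixed weight below are hypothetical; under deleted interpolation the weight would instead be estimated from held-out data.

```python
# Hypothetical bigram tables estimated from a tagged and an untagged corpus.
p_tagged   = {("N", "V"): 0.70, ("N", "N"): 0.30,
              ("V", "N"): 0.55, ("V", "V"): 0.45}
p_untagged = {("N", "V"): 0.60, ("N", "N"): 0.40,
              ("V", "N"): 0.80, ("V", "V"): 0.20}

def interpolated(prev_tag, tag, lam):
    """Linearly interpolate the two bigram estimates.

    lam is the weight on the tagged-corpus model; deleted interpolation
    would estimate it from held-out data rather than fixing it by hand,
    and could assign a different lam to each tag pair.
    """
    return (lam * p_tagged[(prev_tag, tag)]
            + (1.0 - lam) * p_untagged[(prev_tag, tag)])

# With lam = 0.75: P(V | N) = 0.75 * 0.70 + 0.25 * 0.60 = 0.675
print(interpolated("N", "V", 0.75))
```

Because both component tables are proper conditional distributions, the interpolated probabilities for a given previous tag still sum to one.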
After several iterations of reestimation, hand-tagged development data is used to evaluate the estimated model. Credit factors can then be assigned from this evaluation and used in a second phase of estimation; following the second phase, new credit factors would be decided by evaluating the new model. Such a global iteration is a special version of error-correcting learning.</Paragraph> </Section></Paper>