<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2148">
<Title>A Stochastic Language Model using Dependency and Its Improvement by Word Clustering</Title>
<Section position="2" start_page="0" end_page="898" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0"> The effectiveness of stochastic language modeling as a methodology of natural language processing has been attested by various applications to recognition systems, such as speech recognition, and to analysis systems, such as part-of-speech (POS) taggers.</Paragraph>
<Paragraph position="1"> In this methodology a stochastic language model with some parameters is built, and the parameters are estimated in order to maximize the model's predictive power (minimize the cross entropy) on an unknown input. Considering a single application, it might be better to estimate the parameters taking account of the expected accuracy of recognition or analysis. This method is, however, heavily dependent on the problem and offers no systematic solution, as far as we know. The methodology of stochastic language modeling, by contrast, allows us to separate, from the various frameworks of natural language processing, the language description model common to them, and enables a systematic improvement of each application.</Paragraph>
<Paragraph position="2"> In this framework a description of a language is represented as a map from a sequence of alphabetic characters to a probability value. The first model is C. E. Shannon's n-gram model (Shannon, 1951).</Paragraph>
<Paragraph position="3"> The parameters of the model are estimated from the frequency of n-character sequences of the alphabet (n-grams) in a corpus containing a large number of sentences of the language. (This work was done while the author was at Kyoto University.)</Paragraph>
<Paragraph position="4"> This is the same model as used in almost all of the recent practical applications in that it describes only relations between sequential elements. Some linguistic phenomena, however, are better described by assuming relations between separated elements, and by modeling this kind of phenomena the accuracy of various applications is generally improved. As for English, there has been research in which a stochastic context-free grammar (SCFG) (Fujisaki et al., 1989) is used for model description. Recently some researchers have pointed out the importance of the lexicon and proposed lexicalized models (Jelinek et al., 1994; Collins, 1997). In these models, every headword is propagated up through the derivation tree such that every parent receives a headword from its head-child. This kind of specialization may, however, be excessive if the criterion is the predictive power of the model.</Paragraph>
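To make the n-gram formulation above concrete, the following is a minimal sketch of a character 2-gram model: the parameters are relative frequencies of character pairs counted on a corpus, and the trained model maps a character sequence to a probability value. All names are illustrative assumptions, and smoothing of unseen events, which a practical model needs, is omitted.

```python
# Minimal character 2-gram model sketch (names are illustrative).
# Parameters are relative frequencies of character pairs in a corpus;
# unseen events get probability zero here (no smoothing).
from collections import defaultdict

def train_bigram(corpus):
    """Count character pairs; return a conditional probability P(c2 | c1)."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for sentence in corpus:
        chars = ["<s>"] + list(sentence) + ["</s>"]
        for c1, c2 in zip(chars, chars[1:]):
            unigram[c1] += 1
            bigram[c1, c2] += 1
    def prob(c1, c2):
        return bigram[c1, c2] / unigram[c1] if unigram[c1] else 0.0
    return prob

def sequence_probability(prob, sentence):
    """The model as a map from a character sequence to a probability value."""
    chars = ["<s>"] + list(sentence) + ["</s>"]
    p = 1.0
    for c1, c2 in zip(chars, chars[1:]):
        p *= prob(c1, c2)
    return p

model = train_bigram(["abab", "abba"])
print(sequence_probability(model, "ab"))  # 2/2 * 3/4 * 1/4 = 0.1875
```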
<Paragraph position="5"> Research aimed at estimating the best specialization level for a 2-gram model (Mori et al., 1997), which compares the cross entropies of a POS-based 2-gram model, a word-based 2-gram model, and a class-based 2-gram model from an information-theoretical point of view, shows that a class-based model is more predictive than a word-based 2-gram model, a completely lexicalized model.</Paragraph>
<Paragraph position="6"> As for a parser based on a class-based SCFG, Charniak (1997) reports better accuracy than the above lexicalized models, but the clustering method is not described clearly enough and, in addition, there is no report on predictive power (cross entropy or perplexity).</Paragraph>
<Paragraph position="7"> Hogenhout and Matsumoto (1997) propose a word-clustering method based on syntactic behavior, but no language model is discussed. As the experiments in the present paper attest, the word-class relation depends on the language model.</Paragraph>
<Paragraph position="8"> In this paper, taking Japanese as the object language, we propose two complete stochastic language models using dependency between bunsetsu, a sequence of one or more content words followed by zero or more function words, and evaluate their predictive power by cross entropy. Since the number of distinct bunsetsu is enormous, treating a bunsetsu as a single symbol to be predicted would surely invoke the data-sparseness problem. To cope with this problem we use the concept of class proposed for the word n-gram model (Brown et al., 1992). Each bunsetsu is represented by the class calculated from the POS of its last content word and that of its last function word.</Paragraph>
<Paragraph position="9"> The relation between bunsetsu, called dependency, is described by a stochastic context-free grammar (Fu, 1974) over the classes. From the class of a bunsetsu, the content word sequence and the function word sequence are independently predicted by word n-gram models equipped with unknown-word models (Mori and Yamaji, 1997).</Paragraph>
<Paragraph position="10"> The above model assumes that the syntactic behavior of each bunsetsu depends only on POS. The POS system invented by grammarians, however, may not always be the best in terms of stochastic language modeling. This is experimentally attested by the paper (Mori et al., 1997) reporting comparisons between a POS-based n-gram model and an automatically induced class-based n-gram model. Based on this report, we now propose a word-clustering method on the model described above that successfully improves its predictive power. In addition, we discuss a parsing method as an application of the model.</Paragraph>
<Paragraph position="11"> We also report the results of experiments conducted on the EDR corpus (Jap, 1993). The corpus is divided into ten parts, and the models estimated from nine of them are tested on the rest in terms of cross entropy. As a result, the cross entropy of the POS-based dependency model is 5.3536 bits and that of the class-based dependency model estimated by our method is 4.9944 bits. This shows that the clustering method we propose notably improves the predictive power of the POS-based model. Additionally, a parsing experiment showed that the parser based on the improved model has higher accuracy than the POS-based one.</Paragraph>
</Section>
</Paper>
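As an illustration of the class assignment described in paragraph 8 above, the following sketch maps a bunsetsu to its class, the pair formed by the POS of its last content word and that of its last function word. The `Bunsetsu` structure, the POS labels, and the probability decomposition in the comment are our own illustrative assumptions, not the paper's actual data format.

```python
# Illustrative sketch: a bunsetsu and its class as described above.
# The class is (POS of last content word, POS of last function word);
# under the proposed model, a bunsetsu is then generated roughly as
#   P(bunsetsu) = P(class by the SCFG over classes)
#                 * P(content word sequence | class)   # word n-gram model
#                 * P(function word sequence | class)  # word n-gram model
from dataclasses import dataclass, field

@dataclass
class Bunsetsu:
    content_words: list = field(default_factory=list)   # [(surface, pos), ...]
    function_words: list = field(default_factory=list)  # [(surface, pos), ...]

def bunsetsu_class(b):
    """Class of a bunsetsu: POS of its last content and function words."""
    content_pos = b.content_words[-1][1]
    # A bunsetsu may end with zero function words.
    function_pos = b.function_words[-1][1] if b.function_words else "NONE"
    return (content_pos, function_pos)

# "hon o" (book + object particle): one content word, one function word.
b = Bunsetsu(content_words=[("hon", "noun")], function_words=[("o", "particle")])
print(bunsetsu_class(b))  # ('noun', 'particle')
```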
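The evaluation protocol from the final paragraph can also be stated compactly. This sketch assumes a hypothetical `log2prob` callable returning the base-2 log probability a trained model assigns to a sentence; dividing the summed log probabilities by the number of predicted symbols gives cross entropy in bits, the figure of merit quoted above (5.3536 vs. 4.9944 bits).

```python
# Sketch of the evaluation: cross entropy in bits on held-out data.
# `log2prob` is a hypothetical placeholder for the paper's model, not a real API.

def cross_entropy_bits(log2prob, test_sentences):
    """H = -(1/N) * sum_s log2 P(s), where N counts predicted symbols."""
    total_log2p = 0.0
    n_symbols = 0
    for s in test_sentences:
        total_log2p += log2prob(s)
        n_symbols += len(s) + 1  # +1 for the end-of-sentence symbol
    return -total_log2p / n_symbols

def nine_one_split(corpus):
    """Nine tenths for parameter estimation, one tenth held out for testing."""
    cut = len(corpus) * 9 // 10
    return corpus[:cut], corpus[cut:]
```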