<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1021">
<Title>Training Connectionist Models for the Structured Language Model</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0"> This work was supported by the National Science Foundation under grants No. IIS-9982329 and No. IIS-0085940.</Paragraph>
<Paragraph position="1"> In many systems dealing with natural speech or language, such as Automatic Speech Recognition and Statistical Machine Translation, a language model is a crucial component for searching the often prohibitively large hypothesis space. Most state-of-the-art systems use n-gram language models, which are simple and effective most of the time. Many smoothing techniques that improve language model probability estimation have been proposed and studied in the n-gram literature (Chen and Goodman, 1998).</Paragraph>
<Paragraph position="2"> Recent efforts have studied various ways of using information from a longer context span than that usually captured by normal n-gram language models, as well as ways of using syntactic information that is not available to word-based n-gram models (Chelba and Jelinek, 2000; Charniak, 2001; Roark, 2001; Uystel et al., 2001). All of these language models are based on stochastic parsing techniques that build up parse trees for the input word sequence and condition the generation of words on syntactic and lexical information available in the parse trees.</Paragraph>
<Paragraph position="3"> Since these language models capture useful hierarchical characteristics of language, they can improve perplexity (PPL) significantly on various tasks. Although further improvement can be achieved by enriching the syntactic dependencies in the structured language model (SLM), a severe data sparseness problem was observed when the number of conditioning features was increased (Xu et al., 2002).</Paragraph>
<Paragraph position="4"> There has been recent promising work on using distributional representations of words and neural networks for language modeling (Bengio et al., 2001) and parsing (Henderson, 2003). One great advantage of this approach is its ability to combat data sparseness: the model size grows only sub-linearly with the number of predicting features used. The method has been shown to improve significantly on regular n-gram models in perplexity (Bengio et al., 2001). Its ability to accommodate longer contexts is especially appealing, since experiments have shown consistent improvements in PPL when the context of one of the SLM components is lengthened (Emami et al., 2003).</Paragraph>
<Paragraph position="5"> Moreover, because the SLM provides an EM training procedure for its components, the connectionist models can also be improved by EM training.</Paragraph>
<Paragraph position="6"> In this paper, we study the impact of neural network modeling on the SLM when all three of its components are modeled with this approach. An EM training procedure is outlined and applied to further train the neural network models.</Paragraph>
</Section>
</Paper>
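<!-- Editor's illustrative note, not part of the original paper: the paragraph on neural
network language modeling above (Bengio et al., 2001) states that model size grows only
sub-linearly with the number of predicting features. Below is a minimal Python/numpy
sketch of a feed-forward neural language model in that general style; the sizes V
(vocabulary), d (embedding dimension), h (hidden units), the three-word context, and the
function name next_word_distribution are all hypothetical and chosen only for
illustration. It is not the model used in the paper, whose predictors are the three SLM
components.

import numpy as np

# Hypothetical sizes, chosen only for illustration.
V, d, h, n_context = 10000, 50, 100, 3

rng = np.random.default_rng(0)
E = rng.normal(scale=0.01, size=(V, d))              # shared word embeddings
W_hidden = rng.normal(scale=0.01, size=(n_context * d, h))
b_hidden = np.zeros(h)
W_out = rng.normal(scale=0.01, size=(h, V))
b_out = np.zeros(V)

def next_word_distribution(context_ids):
    """P(w | context): embed, concatenate, tanh hidden layer, softmax."""
    x = np.concatenate([E[i] for i in context_ids])  # shape (n_context * d,)
    hidden = np.tanh(x @ W_hidden + b_hidden)
    logits = hidden @ W_out + b_out
    exp = np.exp(logits - logits.max())              # numerically stable softmax
    return exp / exp.sum()

# Adding one more context word adds only d * h weights to W_hidden, whereas a full
# n-gram count table would grow by a factor of V.
n_params = E.size + W_hidden.size + b_hidden.size + W_out.size + b_out.size
print(n_params, next_word_distribution([1, 2, 3]).sum())
-->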