Statistical Language Modeling with Performance Benchmarks using Various Levels of Syntactic-Semantic Information

5 Experiments and Discussion

A statistical language model is evaluated by how well it predicts some hitherto unseen text - test data - generated by the source to be modeled. A commonly used quality measure for a given model M is related to the entropy of the underlying source and is known as perplexity (PPL). Given a word sequence w1, w2, ..., wN to be used as a test corpus, the perplexity of a language model M is given by:

\[ \mathrm{PPL}(M) = P_M(w_1, w_2, \ldots, w_N)^{-1/N} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P_M(w_i \mid w_1, \ldots, w_{i-1}) \right) \]

Perplexity can be interpreted as the average branching factor of the language according to the model M, and thus indicates the difficulty of a speech recognition task (Jelinek, 1999). The lower the perplexity, the better the model; usually a reduction in perplexity translates into a reduction in the word error rate of a speech recognition system.

We have implemented both the LSA and SELSA models using the BLLIP corpus, which consists of machine-parsed English news stories from the Wall Street Journal (WSJ) for the years 1987, 1988, and 1989. We used the supertagger of Bangalore and Joshi (1999) to supertag each word in the corpus; its tagging accuracy was 92.2%. The training corpus consisted of about 40 million words from WSJ 1987, 1988, and part of 1989, comprising about 87,000 documents of news stories. The test corpus was a section of WSJ 1989 with around 300,000 words. The baseline tri-gram model had a perplexity of 103.12, and the bi-gram model 161.06. The vocabulary size was 20,106 for words and 449 for supertags.

5.1 Perplexity Results

In the first experiment, we performed SELSA using supertag information for each word. The word-supertag vocabulary was about 60,000, giving a matrix of about 60,000 x 87,000 on which we performed SVD at various dimensions. Similarly, we trained the LSA matrix and performed its SVD. We then used this knowledge to calculate the language model probability, which was integrated with the tri-gram probability using the geometric interpolation method of Coccaro and Jurafsky (1998). In the process, we assumed knowledge of the content/function word type of the next word being predicted. Furthermore, in this experiment we used only content words for the LSA and SELSA representations, while function words were handled by the tri-gram model alone. We also used the supertagged test corpus, so the supertag of the next word being predicted was known. These results thus set benchmarks for the content-word-based SELSA model. Under these assumptions, we obtained the perplexity values shown in Table 1.

[Table 1 (values lost in extraction): perplexity of LSA and SELSA with content/function word type knowledge assumed; for SELSA, these are benchmarks with correct supertag knowledge.]
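To make the evaluation pipeline concrete, the following is a minimal sketch, not the authors' implementation, of geometric interpolation of the tri-gram and LSA/SELSA distributions followed by the perplexity computation defined above. The distribution callables `p_trigram` and `p_selsa` and the fixed interpolation weight are illustrative assumptions; Coccaro and Jurafsky (1998) modulate the weight per word.

```python
import numpy as np

def geometric_interpolate(p_ngram, p_lsa, lam=0.5):
    """P(w|h) proportional to P_ngram(w|h)^lam * P_lsa(w|h)^(1-lam)."""
    p = (p_ngram ** lam) * (p_lsa ** (1.0 - lam))
    return p / p.sum()  # renormalize over the vocabulary

def perplexity(test_ids, p_trigram, p_selsa, lam=0.5):
    """PPL = exp(-(1/N) * sum_i log P(w_i | w_1 .. w_{i-1}))."""
    total_logprob = 0.0
    for i, w in enumerate(test_ids):
        history = tuple(test_ids[max(0, i - 2):i])  # tri-gram context
        # Each stand-in model returns a distribution over the vocabulary.
        dist = geometric_interpolate(p_trigram(history), p_selsa(history), lam)
        total_logprob += np.log(dist[w])
    return float(np.exp(-total_logprob / len(test_ids)))
```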
These benchmark results show that, given the knowledge of the content or function word type as well as the supertag of the word being predicted, the SELSA model performs far better than the LSA model. This improvement in performance is attributed to the finer level of syntactic information now available in the form of the supertag: given the supertag, the choice of the word becomes very limited, and thus perplexity decreases. The decrease in perplexity across SVD dimensions shows that the SVD also plays an important role, so for SELSA it is truly a latent syntactic-semantic analysis. Thus, if we can devise an algorithm that predicts the supertag of the next word with very high accuracy, this model is guaranteed to improve performance over LSA.

Our next experiment assumed no knowledge of the content or function word type of the next word, so the LSA and SELSA matrices contained all the words in the vocabulary. We also fixed the SVD dimension for both SELSA and LSA at 125. The results are shown in Table 2. In this case, we observe that LSA achieves a perplexity of 88.20 compared to the baseline tri-gram value of 103.12. However, this is higher than the LSA perplexity of 68.42 obtained when knowledge of content/function words was assumed; the relative increase is mainly due to the poor modeling of function words in the LSA space. For SELSA, by contrast, the perplexity of 36.37 is lower than the value of 50.39 obtained with content/function word knowledge. This is attributed to the better modeling of syntactically regular function words in SELSA. It can be understood from the observation that the vocabulary contained 305 function words and 19,801 content words, spanning 19.8 and 20.3 million words respectively in the training corpus. Apart from this, there were 152, 145, and 147 supertags anchoring function words only, content words only, and both types of words, respectively. Thus, given a supertag belonging to the function-word-specific supertags, the 'vocabulary' for the target word is reduced by orders of magnitude compared to the case of content-word-specific supertags. It is also worth observing that the 125-dimensional SVD case of SELSA is better than the 0-dimensional (uniform) SELSA case; thus the SVD plays a role in deciphering the syntactically and semantically important dimensions of the information space.

[Table 2 (values lost in extraction): perplexity without content/function word knowledge; for SELSA, these are benchmarks with correct supertag knowledge.]
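The latent analysis itself rests on a truncated SVD of a (word-supertag) x document co-occurrence matrix. The sketch below shows this construction under assumed details: a toy corpus, raw counts rather than the log-entropy weighting typically used for LSA, and illustrative supertag labels; the paper's matrix was about 60,000 x 87,000 with up to a few hundred retained dimensions.

```python
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import svds

# Toy corpus: each token is a (word, supertag) pair, as in SELSA.
docs = [[("buy", "A_nx0Vnx1"), ("stock", "B_Nn")],
        [("sell", "A_nx0Vnx1"), ("stock", "B_Nn")],
        [("stock", "B_Nn"), ("rose", "A_nx0V")]]

rows = sorted({tok for doc in docs for tok in doc})
row_id = {tok: i for i, tok in enumerate(rows)}

# Build the (word-supertag) x document count matrix.
counts = lil_matrix((len(rows), len(docs)))
for j, doc in enumerate(docs):
    for tok in doc:
        counts[row_id[tok], j] += 1.0

# Truncated SVD keeps only the k most significant latent dimensions
# (k = 125 in the paper's second experiment; k = 2 fits this toy matrix).
U, s, Vt = svds(counts.tocsc(), k=2)
latent_rows = U * s  # latent syntactic-semantic vector per row entry
```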
We also performed experiments using phrase-type (NP, VP, other) knowledge incorporated within the SELSA framework. The resultant model was also used to calculate perplexity values; on the content/function type assumption set, the results compare favourably with LSA, improving on its performance. In another experiment we used the part-of-speech tag of the previous word (prevtag) within SELSA, but it could not improve on plain LSA. These results show that phrase-level information is somewhat useful if it can be predicted correctly, whereas previous POS tags are not useful.

[Table 3 (values lost in extraction): perplexity of SELSA with knowledge of content/function word type and the correct phrase/prevtag.]

Finally, the utility of this language model can be tested in a speech recognition experiment. It can be most suitably applied in a second-pass rescoring framework, where the output of the first pass is an N-best list of either joint word-tag sequences (Wang and Harper, 2002) or word sequences that are then passed through a syntax tagger. Both approaches allow a direct application of the results shown in the above experiments; however, there is a possibility of error propagation if a word is incorrectly tagged. The other approach is to predict the tag left-to-right from the word-tag partial prefix, followed by word prediction, and then to repeat the procedure for the next word.
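As an illustration of the rescoring option, here is a minimal sketch under assumed interfaces; the N-best list format, the language model callable, and the weight are hypothetical, not from the paper. Each first-pass hypothesis carries an acoustic score and a word-tag sequence, and is re-ranked after adding the weighted SELSA language model score.

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=10.0):
    """Re-rank N-best hypotheses by acoustic + weighted LM log-score.

    nbest      -- list of (word_tag_seq, acoustic_logprob) pairs
    lm_logprob -- callable returning the SELSA log-probability of a sequence
    """
    scored = [(acoustic + lm_weight * lm_logprob(seq), seq)
              for seq, acoustic in nbest]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seq for _, seq in scored]
```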